Visiting PhD student, Princeton University
Building a real-time 3D reconstruction system by leveraging deep neural networks
RGB-D scanning of indoor environments is important for many applications, including real estate, interior design, and virtual reality. However, it is still challenging to register RGB-D images from a handheld camera over a long video sequence into a globally consistent 3D model. It has been shown that structured registration can significantly alleviate the drift issue when conducted over more reliable geometric proxies such as planes. However, robust plane detection algorithms are mostly confined to large planar structures (e.g., walls, tabletops), which greatly limits the utility of the structured approach, especially in cluttered scenes. Working with small planar patches (screens, box faces, etc.) gains flexibility and generality, but renders their detection and matching infeasible for traditional geometric methods.
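To make the plane-as-proxy idea concrete, a classical geometric baseline for detecting a dominant plane in a point cloud is RANSAC over three-point samples followed by an SVD refit. This is the traditional approach the project moves beyond, not the proposed network; the function names here are illustrative:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through points: returns (unit normal n, offset d)
    for the plane n . x + d = 0, via SVD of the centered points."""
    centroid = points.mean(axis=0)
    # The singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -normal @ centroid

def ransac_plane(points, n_iters=200, tol=0.01, seed=None):
    """Detect the dominant plane: fit candidate planes to random 3-point
    samples and keep the one with the most inliers within distance tol."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal, d = fit_plane(sample)
        inliers = np.abs(points @ normal + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers for the final estimate.
    normal, d = fit_plane(points[best_inliers])
    return normal, d, best_inliers
```

This works well for large planar structures but degrades for small patches, where the inlier count no longer dominates clutter — which is exactly the regime that motivates a learned detector.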
We opt to harness deep neural networks to address these two tasks robustly. To achieve that, we propose two novel deep architectures: one for planar patch detection from an RGB-D frame, and one for measuring patch similarity across two frames. Fast network inference guarantees a real-time frame rate. Based on the extracted planar patches and their correspondences, we aim to realize local frame-to-frame registration as well as global optimization that accounts for temporal coherence.
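Once patch correspondences are available, the frame-to-frame step reduces to estimating a rigid transform from matched 3D points (e.g., patch centroids). A standard closed-form solution is the Kabsch/Procrustes SVD method; this is a generic sketch of that step, not the project's specific solver:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t,
    computed in closed form via the Kabsch/Procrustes SVD solution."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares solution.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst_c - R @ src_c
    return R, t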
Abstract: A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available -- current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. The dataset is freely available at http://www.scan-net.org.
Pub.: 14 Feb '17, Pinned: 30 Jun '17
Abstract: RGB-D scanning of indoor environments (offices, homes, museums, etc.) is important for a variety of applications, including on-line real estate, virtual tourism, and virtual reality. To support these applications, we must register the RGB-D images acquired with an untracked, hand-held camera into a globally consistent and accurate 3D model. Current methods work effectively for small environments with trackable features, but often fail to reproduce large-scale structures (e.g., straight walls along corridors) or long-range relationships (e.g., parallel opposing walls in an office). In this paper, we investigate the idea of integrating a structural model into the global registration process. We introduce a fine-to-coarse algorithm that detects planar structures spanning multiple RGB-D frames and establishes geometric constraints between them as they become aligned. Detection and enforcement of these structural constraints in the inner loop of a global registration algorithm guides the solution towards more accurate global registrations, even without detecting loop closures. During experiments with a newly created benchmark for the SUN3D dataset, we find that this approach produces registration results with greater accuracy and better robustness than previous alternatives.
Pub.: 28 Jul '16, Pinned: 30 Jun '17
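The structural constraints described in the abstract above are typically enforced as point-to-plane terms inside the registration solver. The sketch below is not the paper's fine-to-coarse algorithm, only a minimal illustration of one linearized Gauss-Newton step of a point-to-plane objective, with the small-angle rotation approximation R p ~= p + omega x p:

```python
import numpy as np

def point_to_plane_step(points, normals, targets):
    """One linearized step of point-to-plane alignment: solve for a small
    rotation vector omega and translation t minimizing
    sum_i ( n_i . (p_i + omega x p_i + t - q_i) )^2 ."""
    # Jacobian rows: d(residual)/d(omega) = p_i x n_i, d(residual)/dt = n_i.
    A = np.hstack([np.cross(points, normals), normals])    # (N, 6)
    b = np.einsum('ij,ij->i', normals, targets - points)   # (N,)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]  # omega (rotation vector), t
```

Constraints between detected planar structures (e.g., coplanarity or parallelism across frames) can be stacked into the same least-squares system, which is what lets structure guide the global registration.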
Abstract: Active vision is inherently attention-driven: The agent selects views of observation to best approach the vision task while improving its internal representation of the scene being observed. Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images, we propose to address the multi-view depth-based active object recognition using attention mechanism, through developing an end-to-end recurrent 3D attentional network. The architecture comprises a recurrent neural network (RNN), storing and updating an internal representation, and two levels of spatial transformer units, guiding two-level attentions. Our model, trained with a 3D shape database, is able to iteratively attend to the best views targeting an object of interest for recognizing it, and focus on the object in each view for removing the background clutter. To realize 3D view selection, we derive a 3D spatial transformer network which is differentiable for training with back-propagation, achieving much faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method outperforms state-of-the-art methods in cluttered scenes.
Pub.: 13 Oct '16, Pinned: 30 Jun '17