Our research aims at developing a fully automated system for the
accurate and rapid 3D reconstruction of large scale urban environments from video
streams. Two factors motivate it:
the abundance of video data brought about by recent progress in camcorder technology and falling prices;
the need for compact 3D descriptions of what has been filmed.
For many applications, 3D models are more descriptive than the
frames of the original video. In a model of a city users can see a very
large area at once, realize the spatial arrangement of the buildings at
a single glance and navigate freely to the parts that most interest
them. These tasks are considerably more difficult and time-consuming
using the original video.
Sample screenshots of our ground
and aerial reconstructions
To achieve accurate 3D reconstructions one is faced with many
difficulties. The core problems that need to be addressed are
estimating the motion of the camera and the structure of the scene.
These problems are generally ill-posed since they are
attempts to recover 3D information from 2D image data. Given structure
and motion estimates, dense pixel correspondences can be established
and polygonal models of the scene can be generated using frames of the
videos for texture-mapping.
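One concrete reason the problem is ill-posed can be shown with a small Python sketch: under a pinhole model, scaling the whole scene and the camera position by the same factor leaves every image measurement unchanged, so absolute scale cannot be recovered from the images alone. The function names, the unit focal length and the rotation-free camera are assumptions made for illustration; they are not taken from the system described here.

```python
def project(point, cam_center, focal=1.0):
    """Pinhole projection for a camera at cam_center looking along +z.

    Assumes no rotation and a principal point at the origin (illustrative only).
    """
    x = point[0] - cam_center[0]
    y = point[1] - cam_center[1]
    z = point[2] - cam_center[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z, focal * y / z)


def scale(vec, s):
    """Scale a 3D vector by the factor s."""
    return tuple(s * c for c in vec)


if __name__ == "__main__":
    point = (1.0, 2.0, 4.0)
    camera = (0.5, 0.0, 0.0)
    original = project(point, camera)
    # Scene and camera position both three times larger: same image point.
    enlarged = project(scale(point, 3.0), scale(camera, 3.0))
    print(original, enlarged)
```

External measurements such as GPS are one way to fix the scale that the images leave undetermined.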
While such reconstructions have been possible for years, as for
instance in the previous work of members of our team (see M. Pollefeys
et al.,
"Visual modeling with a hand-held camera", International Journal of Computer
Vision 59(3), 2004), the computational effort to obtain them was a limiting factor
for the development of a practical system that is able to process
massive amounts of video such as the ones needed to model an entire
city. In our current work, the speed of the system is a major
consideration. Our algorithms are fast by nature and amenable to GPU
implementations. We have achieved 30 Hz real-time performance on a single
consumer PC with a standard commodity graphics card (GPU) by leveraging both
the CPU and the GPU.
Data Collection
Portable recording system mounted on a
backpack
Left: A base with four cameras.
Right: The same base on the roof of our van in Chapel Hill.
For data collection at ground-level, we
have constructed two recording systems. The image on the top left shows a low-cost, man-portable camera
capture system, designed for flexibility and mobility. The setup consists of a
Point Grey Research Ladybug2 omnidirectional camera, a Garmin consumer-grade GPS
receiver with Wide Area Augmentation System (WAAS) capability, and a Microstrain
3DM-G inertial sensor. The Ladybug is a multi-camera system consisting of six
cameras, of which five form a ring and the sixth camera points upwards.
Together, these provide video coverage of most of the upper hemisphere about the
camera unit, except for the area directly below the camera. The GPS unit is
accurate to approximately five meters under optimal conditions, that is, with many
visible satellites and little or no multipath error. In practice such conditions are
rare in urban environments, and errors of 10 meters or more are
more typical because of the urban canyon effect, in which surrounding buildings
leave very little of the sky visible to the GPS receiver. Finally, the
3DM-G provides an absolute orientation measurement accurate to +/-5 degrees.
The image on the top right shows the system we use for large-scale data
collection, constructed using multiple synchronized cameras
on a base and mounted on a car. Three of
the four cameras are horizontal, pointing forward, backward and to the
side, while the fourth one is tilted upward to capture the upper parts
of building facades. While additional sensors such as GPS and inertial sensors
can be integrated to give accurate position measurements, our system can
function in the absence of this additional information as well.
Below is a composite video captured by the four-camera recording system in
Chapel Hill.
Composite video captured with the four
camera system
We have also obtained 3D reconstructions from aerial video with a helicopter, using a nose-mounted,
gyro-stabilized HDTV camera system. Using both ground and aerial video allows us
to obtain increased coverage of the facades and rooftops in urban scenes. By
aligning the ground and aerial reconstructions, we are thus able to produce more
complete 3D models. Below is a photo of the aerial recording system, along with
a sample video.
Setup for recording aerial video
Aerial video clip recorded
over Chapel Hill
Processing
The
input to our processing pipeline consists of the video sequence, which may be
optionally augmented by the trajectory of the vehicle, in the form of GPS/INS
measurements. Since the cameras do not overlap, reconstruction
is performed on the frames of a single camera as it moves through the
scene. The main steps of the processing pipeline are the following:
2D feature tracking: Our highly optimized
GPU implementation of the KLT
tracker is used to track features in the image, and is able to achieve
processing rates of more than 200 frames per second on state-of-the-art
consumer GPUs for PAL (720 × 576) resolution data. The features are used
in the structure-from-motion computation and in the sparse scene
analysis step. Our implementation is also capable of estimating a global gain
ratio between successive frames in order to compensate for changes in the camera
exposure. The sources for the KLT tracker can be downloaded from
here.
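To illustrate the gain compensation mentioned in the tracking step, the Python sketch below estimates a single gain ratio between two frames from the intensities sampled at corresponding feature locations. It assumes the simple model curr = gain * prev and solves it in closed form by least squares. This is a simplification for illustration, not the GPU implementation referred to above, and the function names are hypothetical.

```python
def estimate_global_gain(prev_samples, curr_samples):
    """Least-squares gain minimizing sum((curr - gain * prev) ** 2).

    Both arguments are intensities sampled at corresponding feature
    locations in two successive frames (illustrative assumption).
    """
    if len(prev_samples) != len(curr_samples):
        raise ValueError("sample lists must have equal length")
    num = sum(p * c for p, c in zip(prev_samples, curr_samples))
    den = sum(p * p for p in prev_samples)
    if den == 0:
        raise ValueError("previous samples are all zero")
    return num / den


def compensate(samples, gain):
    """Map intensities of the current frame back to the previous exposure."""
    return [s / gain for s in samples]


if __name__ == "__main__":
    prev = [10.0, 20.0, 40.0]
    curr = [12.0, 24.0, 48.0]  # same patches, brighter exposure
    gain = estimate_global_gain(prev, curr)
    print(gain, compensate(curr, gain))
```

Once the gain is known, intensities can be compared across frames as if the exposure had not changed, which keeps feature matching stable when the camera adapts to lighting.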
Pose estimation and refinement: Using the tracked 2D features, we
compute camera poses using various structure-from-motion techniques. The
computation is done in real-time using
robust estimation techniques.
When additional trajectory information is available, it may be fused with visual measurements via an
Extended Kalman Filter to
refine the pose of each camera at each frame.
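The fusion idea in the pose estimation step can be sketched with a scalar Kalman filter in Python. The sketch assumes a one-dimensional position, a motion increment obtained from vision as the prediction, and a GPS reading as the measurement. An Extended Kalman Filter on full camera poses would linearize nonlinear motion and measurement models at each step; that part is omitted here, and all names and noise values are illustrative assumptions.

```python
def kalman_predict(x, p, u, q):
    """Propagate the state x (variance p) by motion u with process noise q."""
    return x + u, p + q


def kalman_update(x_pred, p_pred, z, r):
    """Fuse a prediction with a measurement z that has variance r."""
    k = p_pred / (p_pred + r)          # Kalman gain in [0, 1]
    x = x_pred + k * (z - x_pred)      # move toward the measurement
    p = (1.0 - k) * p_pred             # uncertainty shrinks after fusion
    return x, p


if __name__ == "__main__":
    x, p = 0.0, 1.0
    # (visual motion increment, GPS reading) per frame; made-up numbers
    frames = [(1.0, 1.3), (1.0, 1.8), (1.0, 3.4)]
    for u, z in frames:
        x, p = kalman_predict(x, p, u, q=0.01)
        x, p = kalman_update(x, p, z, r=25.0)  # GPS variance of (5 m)^2
        print(round(x, 3), round(p, 3))
```

With a large GPS variance the filter follows the visual estimate closely while the GPS readings slowly correct accumulated drift.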
Sparse scene analysis: The tracked features can be
reconstructed in 3D, given the camera poses, to provide valuable
information about the scene surfaces and their orientation.
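Reconstructing a tracked feature in 3D from known camera poses amounts to triangulation. The Python sketch below uses the midpoint method: each observation defines a viewing ray from a camera center, and the 3D point is taken halfway between the closest points on the two rays. The midpoint method is chosen here for brevity and is an assumption; the source does not state which triangulation method the system uses.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return (point, gap) for rays c1 + s*d1 and c2 + t*d2.

    point is the midpoint of the shortest segment between the rays,
    gap is the length of that segment (0 for perfectly consistent rays).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    point = [(u + v) / 2.0 for u, v in zip(p1, p2)]
    gap = sum((u - v) ** 2 for u, v in zip(p1, p2)) ** 0.5
    return point, gap


if __name__ == "__main__":
    # Two camera centers two units apart, both observing the point (1, 0, 5).
    point, gap = triangulate_midpoint(
        (0.0, 0.0, 0.0), (1.0, 0.0, 5.0),
        (2.0, 0.0, 0.0), (-1.0, 0.0, 5.0),
    )
    print(point, gap)
```

The returned gap gives a simple consistency check: a large gap suggests a bad track or an inaccurate pose.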
Multi-way plane-sweeping stereo: We use the
plane-sweeping
algorithm for stereo reconstruction. Planes are swept in multiple
directions to account for slanted surfaces and a prior probability
estimated from the sparse data is used to disambiguate textureless
surfaces. Our stereo algorithm is also run on the GPU to take advantage
of its efficiency in rendering operations, which are the most costly
computations in plane-sweeping stereo.
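The core of plane sweeping can be shown on a single scanline in Python. The sketch assumes a rectified image pair and fronto-parallel planes only, so that a plane at depth z maps a reference pixel to the other image by a horizontal shift of focal * baseline / z. Each depth hypothesis is scored by an absolute intensity difference, and the best-scoring depth is kept per pixel. Sweeping in multiple directions, the sparse-data prior, cost aggregation over windows and the GPU rendering mentioned above are all left out; the names and the single-pixel cost are illustrative assumptions.

```python
def plane_sweep_scanline(ref, other, focal, baseline, depths):
    """Winner-take-all depth for every pixel of the reference scanline.

    ref, other: lists of intensities from two rectified views.
    Returns a list with the best depth per pixel, or None where no
    hypothesis projects inside the other image.
    """
    best_depth = [None] * len(ref)
    best_cost = [float("inf")] * len(ref)
    for z in depths:
        # A fronto-parallel plane at depth z induces a pure horizontal shift.
        shift = int(round(focal * baseline / z))
        for x in range(len(ref)):
            xo = x - shift
            if 0 <= xo < len(other):
                cost = abs(ref[x] - other[xo])
                if cost < best_cost[x]:
                    best_cost[x] = cost
                    best_depth[x] = z
    return best_depth


if __name__ == "__main__":
    ref = [0, 10, 20, 30, 40, 50, 60, 70]
    other = [20, 30, 40, 50, 60, 70, 80, 90]  # ref shifted by two pixels
    print(plane_sweep_scanline(ref, other, focal=4.0, baseline=1.0,
                               depths=[4.0, 2.0, 1.0]))
```

In this toy input the two-pixel shift corresponds to the depth hypothesis 2.0, which wins wherever it can be evaluated.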
Depth map fusion: Stereo depth maps are computed for each
frame in the previous step in real time. There are large overlaps
between adjacent depth maps that should be removed to produce a more
economical representation of the scene. The
depth map fusion stage
combines multiple depth estimates for each pixel of a reference view,
enforces visibility constraints and thus improves the accuracy of the
reconstruction. The result is one fused depth map that replaces several
raw depth maps that cover the same part of the scene.
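A much simpler stand-in for the fusion stage is a per-pixel consensus over depth maps that have already been rendered into the reference view. The Python sketch below takes the median of the available estimates, keeps only estimates close to it, and outputs their mean when enough of them agree. It does not enforce the visibility constraints described above; the tolerance, the support threshold and the function names are assumptions made for illustration.

```python
from statistics import median


def fuse_depths(hypotheses, rel_tol=0.05, min_support=2):
    """Fuse depth estimates for one pixel; None marks a missing estimate."""
    valid = [d for d in hypotheses if d is not None and d > 0]
    if not valid:
        return None
    m = median(valid)
    # Keep only the estimates that agree with the median.
    support = [d for d in valid if abs(d - m) <= rel_tol * m]
    if len(support) < min_support:
        return None  # no consensus: leave a hole rather than guess
    return sum(support) / len(support)


def fuse_depth_maps(depth_maps, rel_tol=0.05, min_support=2):
    """Fuse equally sized 2D depth maps given in the same reference view."""
    rows, cols = len(depth_maps[0]), len(depth_maps[0][0])
    return [
        [fuse_depths([dm[r][c] for dm in depth_maps], rel_tol, min_support)
         for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    maps = [
        [[10.0, None]],
        [[10.2, 4.0]],
        [[9.8, 4.1]],
    ]
    print(fuse_depth_maps(maps))
```

Rejecting pixels without consensus trades completeness for accuracy; the holes can be filled later when partial reconstructions are merged.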
Model generation: A multi-resolution mesh is generated for
each fused depth map and video frames are used for texture-mapping. The
camera gain is adjusted within and across video streams so that
transitions in appearance are smoother. In addition, partial
reconstructions are merged and holes in the model are filled in, if
possible.
Model matching: We have also developed a
technique that may be used to efficiently perform 3D scene alignment. By leveraging local shape information, an invariant feature descriptor is extracted which is then used in a
hierarchical matching scheme to perform efficient matching and alignment of 3D
scenes. This allows us to align multiple 3D models, such as those obtained from
ground and aerial video, and also to perform loop completion.
Reconstructions
Below are some screenshots and videos of textured models of Chapel Hill
reconstructed using our approach. Our largest reconstruction comprises several
such sequences and totals 1.3 million frames.
Screenshot:
Overview and details of a ground-based
reconstruction using two cameras
Screenshot: View from above and
details of a ground-based reconstruction of a challenging scene using four
cameras
Videos of the two ground-based reconstructions shown above
Below are screenshots and videos for reconstructions obtained from aerial video
captured over UNC Charlotte and UNC Chapel Hill.
Screenshot: Aerial reconstruction over UNC Charlotte
Screenshot: Aerial reconstruction over UNC Chapel Hill
Videos of the two aerial-based reconstructions shown above
Using our technique for model matching, ground and aerial reconstructions can be
aligned, as described in the video below.
Our 3D models can also be loaded into Google Earth and displayed in a
geo-registered coordinate frame. Click
here to download
the model files (unzip
the folder and drop the .kml file into Google Earth). Below is a video clip that
shows one of our models being navigated within Google Earth.
Rahul Raguram, Jan-Michael Frahm, Marc Pollefeys, "A Comparative Analysis of
RANSAC Techniques Leading to Adaptive Real-Time Random Sample Consensus", ECCV
2008
Changchang Wu, Brian Clipp, Xiaowei Li, Jan-Michael Frahm, Marc Pollefeys,
"3D Model Matching with Viewpoint Invariant Patches (VIPs)", CVPR 2008
David Gallup, Jan-Michael Frahm, Philippos Mordohai, Marc Pollefeys,
"Variable Baseline/Resolution Stereo", CVPR 2008
M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp,
C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L.
Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, "Detailed
Real-Time Urban 3D Reconstruction From Video", IJCV special issue on
"Modeling Large-Scale 3D Scenes"
Christopher Zach, David Gallup and Jan-Michael Frahm, "Fast Gain-Adaptive
KLT Tracking on the GPU", CVGPU'08 workshop in conjunction with CVPR'08
P. Mordohai, J.-M. Frahm, A. Akbarzadeh, B. Clipp, C. Engels, D.
Gallup, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H.
Stewénius, H. Towles, G. Welch, R. Yang, M. Pollefeys and D. Nistér, "Real-Time Video-Based Reconstruction of Urban Environments", 3D-ARCH'2007: 3D Virtual Reconstruction and Visualization of Complex Architectures, Zurich, Switzerland, July, 2007
S.J. Kim, D. Gallup, J.-M. Frahm, A. Akbarzadeh, Q. Yang, R. Yang, D. Nistér and M. Pollefeys, "Gain Adaptive Real-Time Stereo Streaming", International Conference on Computer Vision Systems, 2007
A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D.
Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang,
H. Stewénius, R. Yang, G. Welch, H. Towles, D. Nistér and M. Pollefeys,
"Towards Urban 3D Reconstruction From Video",
Third International Symposium on 3-D Data Processing, Visualization and
Transmission, Chapel Hill, North Carolina, USA, June 2006