A Survey of Urban Reconstruction



This paper provides a comprehensive overview of urban reconstruction. While there exists a considerable body of literature, this topic is still under active research. The work reviewed in this survey stems from the following three research communities: computer graphics, computer vision, and photogrammetry and remote sensing. Our goal is to provide a survey that will help researchers to better position their own work in the context of existing solutions, and to help newcomers and practitioners in computer graphics to quickly gain an overview of this vast field. Further, we would like to encourage even more interdisciplinary work among these research communities, since the reconstruction problem itself is far from solved.

1. Introduction

The reconstruction of cities is a topic of significant intellectual and commercial interest. It is therefore no surprise that this research area has received significant attention over time. Despite the high volume of existing work, there are many unsolved problems, especially when it comes to the development of fully automatic algorithms.

1.1. Applications

Urban reconstruction is an exciting area of research with several applications that benefit from reconstructed three-dimensional (3D) urban models:

  • In the entertainment industry, the storyline of several movies and computer games takes place in real cities. In order to make these cities believable, at least some parts of the models are obtained by urban reconstruction.
  • Digital mapping for mobile devices, cars and desktop computers requires 2D and 3D urban models. Examples of such applications are Google Earth and Microsoft Bing Maps.
  • Urban planning in a broad sense relies on urban reconstruction to obtain the current state of the urban environment. This forms the basis for developing future plans or for judging new plans in the context of the existing environment.
  • Training and simulation applications for emergency management, civil protection, disaster control, driving, flying and security benefit from virtual urban worlds.

1.2. Scope

Urban habitats consist of many objects, such as cars, streets, parks, traffic signs, vegetation and buildings. In this paper, we focus on the reconstruction of 3D geometric models of urban areas, individual buildings and façades.

Most papers mentioned in this survey were published in computer graphics, computer vision, and photogrammetry and remote sensing. There are multiple other fields that contain interesting publications relevant to urban reconstruction, for example, machine learning, computer aided design, geo-sciences, mobile-technology, architecture, civil engineering and electrical engineering. Our emphasis is on geometric reconstruction, and we do not discuss aspects such as the construction of hardware and sensors, details of data acquisition processes and particular applications of urban models.

We also exclude procedural modelling, which has been covered in a recent survey by Vanegas et al. [VAW*10]. Procedural modelling is an elegant and fast way to generate huge, complex and realistically looking urban sites, but due to its generative nature it is not well suited for exact reconstruction of existing architecture. It can also be referred to as forward procedural modelling. Nevertheless, in this survey we do address its counterpart, called inverse procedural modelling (Section 'Inverse procedural modelling'), in addition to other urban reconstruction topics.

We also omit manual modelling, even if it is probably still the most widely applied form of reconstruction in many architectural and engineering bureaus. From a scientific point of view, the manual modelling pipeline is well researched. An interesting overview of methods for the generation of polygonal 3D models from CAD-plans has been presented by Yin et al. [YWR09].

In order to allow inexperienced computer graphics researchers to step into the field of 3D reconstruction, we provide a slightly more detailed description of the fundamentals of stereo vision in Section 'Point Clouds & Cameras'. We omit concepts like the trifocal tensor or details of multi-view vision. Instead, we refer to the referenced papers and textbooks, for example, by Hartley and Zisserman [HZ04], Moons et al. [MvGV09] and recently by Szeliski [Sze11]. Due to the enormous range of the literature, our report is designed to provide a broad overview rather than a tutorial.

1.3. Input data

Various types of input data are suitable as a source for urban reconstruction algorithms. In this survey, we focus on methods which utilise imagery and Light Detection and Ranging (LiDAR) scans.

Imagery is perhaps the most obvious input source. Common images acquired from the ground have the advantage of being very easy to obtain, store and exchange. Nowadays, an estimated tens of billions of photos are taken worldwide each year, which results in hundreds of petabytes of data. Many are uploaded and exchanged over the Internet, and many of them depict urban sites. In various projects this information has been recognised as a valuable source for large-scale urban reconstruction [SSS06, IZB07, ASSS10, FFGG*10]. Aerial and satellite imagery, on the other hand, was for many years restricted to the professional sector of the photogrammetry and remote sensing community. Only in the last decade has this kind of input data become more easily available, especially due to the advances of Web-mapping projects like Google Maps and Bing Maps, and it has been successfully utilised for reconstruction [VAW*10].

Another type of input that is well suited for urban reconstruction is LiDAR data. It typically utilises laser light which is projected onto surfaces; the reflected backscatter is captured, and structure is determined through the time-of-flight principle [CW11]. It delivers semi-dense 3D point clouds which are fairly precise, especially for long-distance acquisition. Although scanning devices are expensive and still not available for mass markets, scanning technology is frequently used by land surveying offices or civil engineering bureaus. Many recent algorithms rely on input from LiDAR, both terrestrial and aerial.

Furthermore, some approaches incorporate both data types in order to combine their complementary strengths: imagery is inherently a 2D source of extremely high resolution and density, but view dependent and lacking depth information. A laser-scan is inherently a 3D source of semi-regular and semi-dense structure, but often incomplete and noisy. Combining both inputs promises to introduce more insights into the reconstruction process [LCOZ*11].

Finally, both types can be acquired from the ground or from the air (cf. Figure 1), providing a source for varying levels of detail (LOD). The photogrammetry community proposes a predefined standard (OpenGIS) for urban reconstruction LODs [GKCN08] for Geographic Information Systems (GIS). According to this scheme, airborne data is more suitable for the reconstruction of coarse building models (LOD1, Section 'Blocks & Cities'), whereas ground-based data is more useful for individual buildings (LOD2, Section 'Buildings & Semantics') and façade details (LOD3, Section 'Façades & Images').

Figure 1.

Input data types. We review interactive and automatic reconstruction methods which use imagery or LiDAR-scans acquired either from the ground or from the air.

1.4. Challenges

1.4.1. Full automation

The goal of most reconstruction approaches is to provide solutions that are as automatic as possible. In practice, full automation turns out to be hard to achieve. The related vision problems quickly result in huge optimisation tasks, where global processes are based on local circumstances, and local processes often depend on global estimates. In other words, the detection of regions of interest is both context dependent (top-down), since we expect a well-defined, underlying object, and context free (bottom-up), since we do not know the underlying object and want to estimate a model from the data. In fact, this is a paradox, and these dependencies can generally be compared to the ‘chicken or egg’ dilemma.

There is no unique solution to this fundamental problem of automatic systems. Most approaches try to find a balance between these constraints, for instance, they try to combine two or more passes over the data, or eventually to incorporate the human user in order to provide some necessary cues.

1.4.2. Quality and scalability

An additional price to pay for automation is often the loss of quality. From the point of view of interactive computer graphics, the quality of solutions of pure computer vision algorithms is quite low, while especially for high-quality productions like the movie industry, the expected standard of the models is very high. In such situations, the remedy is either pure manual modelling or at least manual quality control over the data. The downside of this approach is its poor scalability: human interaction does not scale well with huge amounts of input data.

For these reasons, many recent approaches employ compromise solutions that cast the problem in such a way that both the user and the machine can focus on tasks which are easy to solve for each of them. Simplified user interaction that can be performed even by unskilled users often provides the quantum of knowledge that is needed to break out from the mentioned dilemma.

1.4.3. Acquisition constraints

Other problems that occur in practice are due to the limitations given during the data acquisition process.

For example, it is often difficult to acquire coherent and complete data of urban environments. Buildings are often located in narrow streets surrounded by other buildings and other obstructions, thus photographs, videos or scans from certain positions may be impossible to obtain, either from the ground or from the air. The second common handicap is the problem of unwanted objects in front of the buildings, such as vegetation, street signs, vehicles and pedestrians. Finally, there are obstacles like glass surfaces which are problematic to acquire with laser-scans. Photographs of glass are also difficult to process due to many reflections. Lighting conditions, for example, direct sunshine or shadows, influence the acquisition as well. Thus, recovery of visual information that has been lost through such obstructions is also one of the challenges.

A common remedy is to make multiple overlapping acquisition passes and to combine or to compare them. However, in any case post-processing is required.

1.5. Overview

It is a difficult task to classify all the existing reconstruction approaches, since they can be differentiated by several properties, such as input data type, level of detail, amount of automation or output data. Some methods are bottom–up, some are top–down and some combine both approaches.

In this paper, we propose an output-based ordering of the presented approaches. This ordering helps us to sequentially explain important concepts of the field, building one on top of another; but note that this is not always strictly possible, since many approaches combine multiple methodologies and data types.

Another advantage of this ordering is that we can specify the expected representation of the actual outcome for each section. Figure 2 depicts the main categories that we handle. In this paper, the term modelling is generally used for interactive methods, and the term reconstruction for automatic ones.

  1. Point Clouds & Cameras. Image-based stereo systems have reached a rather mature state and often serve as preprocessing stages for many other methods since they provide quite accurate camera parameters. Many other methods, even the interactive ones which we present in later sections, rely on this module as a starting point for further computations. For this reason we first introduce the Fundamentals of Stereo Vision in Section 'Fundamentals of stereo vision'. Then, in Section 'Structure from motion', we provide the key concepts of image-based automatic Structure from Motion methodology, and in Section 'Multi-view stereo', we discuss Multi-View Stereo approaches.
  2. Buildings & Semantics. In this section, we introduce a number of concepts that aim at the reconstruction of individual buildings. We start in Section 'Image-based modelling' with Image-Based Modelling approaches. Here, we present a variety of concepts based on photogrammetry and adapted for automatic as well as for interactive use. In Section 'LiDAR-based modelling', we introduce concepts of interactive LiDAR-Based Modelling aiming at reconstruction of buildings from laser-scan point clouds. In Section 'Inverse procedural modelling', we describe the concept of Inverse Procedural Modelling.
  3. Façades & Images. We handle the façade topic explicitly because it is of particular importance in our domain of modelling urban areas. In Section 'Façade imagery', we handle generation of panoramas and textures from Façade Imagery. In Section 'Façade decomposition', we introduce various concepts for Façade Decomposition that aim at segmenting façades into elements such as doors, windows, and other domain-specific features, detection of symmetry and repetitive elements, and higher-order model fitting. In Section 'Façade modelling', we introduce concepts which aim at interactive Façade Modelling, such as subdivision into highly detailed sub-elements.
  4. Blocks & Cities. In this section, we discuss automatic reconstruction of models of large areas or whole cities. Such systems often use multiple input data types, like aerial images and LiDAR. We first mention methods performing Ground Reconstruction in Section 'Ground-based reconstruction'. In Section 'Aerial reconstruction', we focus on Aerial Reconstruction from aerial imagery, LiDAR or hybrids, and finally, in Section 'Massive city reconstruction', we discuss methods which aim at automatic Massive City Reconstruction of large urban areas.

In the remainder of this paper we review those categories.

2. Point Clouds & Cameras

Generally speaking, stereo vision allows recovering the third dimension from multiple (at least two) distinct 2D images. The underlying paradigm is called stereopsis, which is also the way humans are able to perceive depth from two images stemming from two close-by locations.

2.1. Fundamentals of stereo vision

In computer vision, the goal is to reconstruct 3D structure which lies in the 3D Euclidean space in front of multiple camera devices, where each of them projects the scene on a 2D plane. For the purpose of simplification and standardisation, the most common model of a camera is the pinhole camera. This model allows expressing the projection by means of a linear matrix equation using homogeneous coordinates.

2.1.1. Camera model

The operation we want to carry out is a linear central projection, thus the camera itself is defined by an optical centre C which is also the origin of the local 3D coordinate frame. Typically, in computer vision, a right-handed coordinate system is used, where the ‘up-direction’ is the Y-axis and the camera ‘looks’ along the positive Z-axis, which is also called the principal axis as shown in Figure 3. The scene in front of the camera is projected onto the image plane, which is perpendicular to the principal axis, and its distance to the optical centre is the actual focal length f of the camera. The principal axis pierces the image plane at the principal point $p$ as depicted in Figure 3.

Figure 2.

Overview of urban reconstruction approaches. We attempt to roughly group the methods according to their outcome. We report about interactive methods using both user input and automatic algorithms as well as about fully automatic methods. Note that this is a schematic illustration, and in practice many solutions cannot be strictly classified into a particular bin.

Figure 3.

Camera geometry: (left-hand side) C denotes the camera centre and p the principal point. In a basic setup the centre of the first camera is centred at the origin; (right-hand side) 2D cross section of the projection.

In practice, lenses of common cameras are quite sophisticated optical devices whose projective properties are not strictly linear. In order to obtain the standardised camera from any arbitrary device, a process called camera calibration is carried out. In this process the internal camera parameters are determined and stored in the camera intrinsic calibration matrix K. The notation of the matrix varies throughout the literature, but a basic version can be described as:

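$$ K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix} \tag{1} $$

where f denotes the focal length, and the point $p = (p_x, p_y)^\top$ is the principal point of the camera plane. This setup allows projecting a point $X = (X_1, X_2, X_3)^\top$ from 3D space onto a point x on the image plane (given in homogeneous coordinates) by a simple equation:

$$ x = K X \tag{2} $$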
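As a minimal sketch of the projection in Equation (2), the following Python snippet builds a basic calibration matrix and projects a point given in camera coordinates. The helper names (`intrinsic_matrix`, `project`) and the numeric values are illustrative assumptions, not part of any reviewed system.

```python
def intrinsic_matrix(f, px, py):
    """Basic calibration matrix K: focal length f, principal point (px, py)."""
    return [[f, 0.0, px],
            [0.0, f, py],
            [0.0, 0.0, 1.0]]


def project(K, X):
    """Project a 3D point X (camera coordinates, Z > 0) to pixel coordinates."""
    # Homogeneous image point x = K X.
    x = [sum(K[r][c] * X[c] for c in range(3)) for r in range(3)]
    # Divide by the third component to obtain inhomogeneous pixel coordinates.
    return (x[0] / x[2], x[1] / x[2])


# Assumed example values: f = 800 pixels, principal point at (320, 240).
K = intrinsic_matrix(f=800.0, px=320.0, py=240.0)
u, v = project(K, [0.5, -0.25, 2.0])
print(u, v)
```

The division by the third component is what turns the linear matrix product into the perspective projection of the pinhole model.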

Another aspect of camera calibration is the pose of the camera in space, which is described by the so-called extrinsic camera parameters. In single-view vision, it is sufficient to define the origin of the global space at the actual camera centre without changing any of the mentioned equations. In multi-view vision, this is not adequate anymore, since each camera requires its own local projective coordinate system. These cameras, as well as the objects in the scene, can be considered as lying in a common 3D space that can be denoted as the world space. The pose of each particular camera can be described by a rotation, expressed by a 3-by-3 matrix R, and the position of its optical centre C, which is a vector in 3D world space. This leads to an extension of Equation (1) to a 3 × 4 matrix:

$$ P = K R \, [\, I \mid -C \,] \tag{3} $$

where P is referred to as the homogeneous camera projection matrix and I denotes the 3 × 3 identity matrix. Note that now the 3D space points have to be expressed in homogeneous coordinates $X = (X_1, X_2, X_3, 1)^\top$. In this way, an arbitrary point X in world space can be easily projected onto the image plane by:

$$ x = P X \tag{4} $$
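The following sketch assembles a 3 × 4 projection matrix in the common form P = K R [I | −C] and applies it to world-space points. The function names and the example pose (identity rotation, camera centre at (1, 0, 0)) are assumptions chosen for illustration; any valid rotation matrix could be substituted for R.

```python
def matmul(A, B):
    """Product of two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def projection_matrix(K, R, C):
    """Build P = K R [I | -C] from intrinsics K, rotation R and centre C."""
    I_minus_C = [[1.0, 0.0, 0.0, -C[0]],
                 [0.0, 1.0, 0.0, -C[1]],
                 [0.0, 0.0, 1.0, -C[2]]]
    return matmul(matmul(K, R), I_minus_C)


def project_world(P, X):
    """Project a world-space point X = (X1, X2, X3) to pixel coordinates."""
    Xh = [X[0], X[1], X[2], 1.0]                      # homogeneous coordinates
    x = [sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    return (x[0] / x[2], x[1] / x[2])


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
C = [1.0, 0.0, 0.0]
P = projection_matrix(K, R, C)

# A point straight ahead of the camera centre lands on the principal point.
centre_pixel = project_world(P, [1.0, 0.0, 4.0])
other_pixel = project_world(P, [2.0, 0.5, 4.0])
print(centre_pixel, other_pixel)
```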

Determining the extrinsic parameters is often referred to as pose estimation or as extrinsic calibration.

For a typical hand-held camera, the mentioned parameter sets are not known a priori. There are several ways to obtain the intrinsic camera calibration [LZ98, WSB05, JTC09], where one of them is to take photos of predefined patterns and to determine the parameters by minimising the error between the known pattern and the obtained projection [MvGV09]. Extrinsic parameters are of more importance in a multi-camera setup; they can be obtained automatically from a set of overlapping images with common corresponding points [MvGV09]. Please note that the described camera model is a simplified version which does not take all aspects into account, such as radial distortion or the aspect ratio of typical image-sensor pixels. We refer the reader to Hartley and Zisserman [HZ04] and to Moons et al. [MvGV09] for exhaustive discussions about calibration and self-calibration in multi-view setups.

2.1.2. Epipolar geometry

For a single camera, we are able to determine only two parameters of an arbitrary 3D point projected to the image plane. In fact, the point X lies on a projecting ray as depicted in Figure 4. Obviously, it is not possible to define the actual position of the point along the ray without further information. An additional image from a different position provides the needed information. Figure 4 depicts this relationship: The projective ray from the first camera through a 2D image point x1 and a 3D point X appears as a line l2 in the second camera, which is referred to as an epipolar line. Consequently, a corresponding point in the second image must lie on the line and is denoted as x2. Note that also the optical centres of each camera project onto the image planes of each other, as shown in Figure 4. These points are denoted as the epipoles e1 and e2, and the line connecting both camera centres is referred to as the baseline. The plane defined by both camera centres and the 3D point X is referred to as epipolar plane.

Figure 4.

Epipolar geometry in a nutshell: points x1 and x2 are corresponding projections of the 3D point X. In image 1 the point x1 lies on the epipolar line l1. The epipoles e1 and e2 indicate the positions where C1 and C2 project, respectively. The point v1 in image 1 is the vanishing point of the projecting ray of x2.

2.1.3. Stereo correspondence and triangulation

In a stereo setup, the relation of two views to each other is expressed in a 3-by-3 rank 2 matrix, referred to as the fundamental matrix F, which satisfies:

$$ x_2^\top F \, x_1 = 0 \tag{5} $$

where x1 and x2 are two corresponding points in both images. There exist well-known algorithms to determine the fundamental matrix from 8 (linear problem) or 7 (non-linear problem) point correspondences [MvGV09]. When working with known intrinsic camera settings, the relation is also often referred to as the essential matrix E, which can be determined even from the correspondences of five points [Nis04].
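To illustrate the epipolar constraint, the sketch below does not estimate F from correspondences; instead it assumes that calibration and relative pose are known, composes F from them via the essential matrix, and checks that a pair of corresponding projections satisfies the constraint. All names and numeric values are hypothetical, and camera 1 is assumed to sit at the world origin with identity rotation.

```python
import math


def matmul(A, B):
    """Product of two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inv_K(K):
    """Inverse of the basic calibration matrix (single focal length, no skew)."""
    f, px, py = K[0][0], K[0][2], K[1][2]
    return [[1 / f, 0.0, -px / f], [0.0, 1 / f, -py / f], [0.0, 0.0, 1.0]]


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) v equals t x v."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]


def fundamental_from_pose(K1, K2, R, C):
    """F for camera 1 at the origin and camera 2 with rotation R, centre C."""
    t = [-v for v in mat_vec(R, C)]            # translation t = -R C
    E = matmul(skew(t), R)                     # essential matrix E = [t]_x R
    return matmul(matmul(transpose(inv_K(K2)), E), inv_K(K1))


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
angle = math.radians(10.0)                     # rotation about the Y-axis
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
C = [1.0, 0.0, 0.0]
X = [0.3, -0.2, 5.0]                           # a 3D point seen by both cameras

x1 = mat_vec(K, X)
x1 = [v / x1[2] for v in x1]
x2 = mat_vec(K, mat_vec(R, [X[i] - C[i] for i in range(3)]))
x2 = [v / x2[2] for v in x2]

F = fundamental_from_pose(K, K, R, C)
l2 = mat_vec(F, x1)                            # epipolar line of x1 in image 2
residual = sum(p * q for p, q in zip(x2, l2))  # x2^T F x1, zero up to rounding
print(residual)
```

The intermediate vector `l2` is the epipolar line in the second image on which the corresponding point x2 must lie.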

Assuming full camera calibration, the problem of 3D structure reconstruction from stereo can be reduced to two sub-problems: (1) the one-to-one correspondence problem across the images and (2) the problem of intersecting the projective rays. The second operation is usually referred to as structure triangulation due to the triangle which is formed by the camera centres C1 and C2, and each respective point X in 3D space. Note that this term has a different meaning than the triangulation of geometric domains, which is often used interchangeably with a tessellation into triangles in the computer graphics literature.
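The ray intersection step can be sketched with the simple midpoint method: each image point is back-projected to a ray, and the 3D point is taken halfway between the two closest points on the rays. This is only one of several triangulation variants and is chosen here for brevity; the function names and the camera tuples `(K, R, C)` are illustrative assumptions.

```python
import math


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def inv_K(K):
    """Inverse of the basic calibration matrix (single focal length, no skew)."""
    f, px, py = K[0][0], K[0][2], K[1][2]
    return [[1 / f, 0.0, -px / f], [0.0, 1 / f, -py / f], [0.0, 0.0, 1.0]]


def project(cam, X):
    """Homogeneous pixel (u, v, 1) of world point X for cam = (K, R, C)."""
    K, R, C = cam
    x = mat_vec(K, mat_vec(R, [X[i] - C[i] for i in range(3)]))
    return [x[0] / x[2], x[1] / x[2], 1.0]


def ray_direction(cam, x):
    """World-space direction of the projecting ray through pixel x."""
    K, R, _ = cam
    return mat_vec(transpose(R), mat_vec(inv_K(K), x))


def triangulate_midpoint(cam1, x1, cam2, x2):
    C1, C2 = cam1[2], cam2[2]
    d1, d2 = ray_direction(cam1, x1), ray_direction(cam2, x2)
    w = [C1[i] - C2[i] for i in range(3)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                     # zero only for parallel rays
    s = (b * e - c * d) / denom               # parameter along ray 1
    t = (a * e - b * d) / denom               # parameter along ray 2
    p1 = [C1[i] + s * d1[i] for i in range(3)]
    p2 = [C2[i] + t * d2[i] for i in range(3)]
    return [(p1[i] + p2[i]) / 2.0 for i in range(3)]


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
angle = math.radians(10.0)
Ry = [[math.cos(angle), 0.0, math.sin(angle)],
      [0.0, 1.0, 0.0],
      [-math.sin(angle), 0.0, math.cos(angle)]]
cam1 = (K, I3, [0.0, 0.0, 0.0])
cam2 = (K, Ry, [1.0, 0.0, 0.0])

X_true = [0.3, -0.2, 5.0]
X_est = triangulate_midpoint(cam1, project(cam1, X_true),
                             cam2, project(cam2, X_true))
print(X_est)
```

With noise-free correspondences the two rays meet exactly and the midpoint coincides with the true point; with noisy measurements the rays are skew and the midpoint is only an approximation.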

Among the key inventions which advanced this research field are robust feature-point detection algorithms, like SIFT [Low04] and SURF [BTvG06, BETvG08]. These image processing methods allow for efficient detection of characteristic feature points which can be matched across multiple images. Both algorithms compute very robust descriptors which are mostly invariant to rotation and scale, at least to a certain degree as shown by Schweiger et al. [SZG*09]. Once the corresponding features have been established, the extrinsic (i.e. pose in 3D space) and, under certain circumstances, also the intrinsic (e.g. focal length) parameters of their cameras, as well as positions of the 3D space points can be determined in an iterative process often called structure from motion (SfM).

2.2. Structure from motion

In practice, the stereo vision procedure described in the previous section can be used to register multiple images, to orient and place their cameras, and to recover 3D structure. It is carried out incrementally in several passes, usually starting from an initial image pair and adding consecutive images to the system one by one. Mutual relations between the images are detected sequentially, new 3D points are extracted and triangulated, and the whole 3D point cloud is updated and optimised.

In a first stage, for each image a sparse set of feature-points is detected, which are then matched in a high-dimensional feature space in order to determine unique pairs of corresponding points across multiple images. This stage is usually approached with high-dimensional nearest-neighbour search algorithms and data structures, like the kd-tree, vp-tree [KZN08] and the vocabulary-tree [NS06].
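The matching stage can be sketched with a brute-force nearest-neighbour search in descriptor space. The mutual-consistency check used here to obtain unique pairs is an assumption made for this illustration, and the exhaustive search merely stands in for the tree-based data structures mentioned above; the toy descriptors are four-dimensional, whereas real descriptors have far more dimensions.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(query, candidates):
    """Index of the candidate descriptor closest to the query descriptor."""
    return min(range(len(candidates)),
               key=lambda k: squared_distance(query, candidates[k]))


def match_mutual(desc_a, desc_b):
    """Keep pair (i, j) only if i and j are each other's nearest neighbour."""
    matches = []
    for i, descriptor in enumerate(desc_a):
        j = nearest(descriptor, desc_b)
        if nearest(desc_b[j], desc_a) == i:
            matches.append((i, j))
    return matches


# Toy descriptors of two images (hypothetical values).
desc_a = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
desc_b = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9], [5.0, 5.0, 5.0, 5.0]]

matches = match_mutual(desc_a, desc_b)
print(matches)
```

The third descriptor of the first image has no mutual partner and is therefore left unmatched, which is the intended behaviour for features visible in only one image.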

In order to improve the stability of the feature matching process, robust estimation algorithms (e.g. RANSAC [FB81, RFP08]) are employed to minimise the number of wrong matches across images. By utilising the already known parameters it is possible to ‘filter out’ outliers which deviate too far from an estimated mapping.
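The hypothesise-and-verify principle behind RANSAC can be sketched on a deliberately simplified problem. Here the estimated mapping between two images is assumed to be a pure 2D translation, so a single match suffices to form a hypothesis; actual SfM pipelines would typically fit a fundamental or essential matrix instead. The function name, threshold and data are hypothetical.

```python
import random


def ransac_translation(matches, threshold=1.0, iterations=100, seed=0):
    """Return indices of matches consistent with the best 2D translation."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Hypothesise a translation from one randomly chosen match.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Verify: count matches whose residual is below the threshold.
        inliers = [k for k, ((ax, ay), (bx, by)) in enumerate(matches)
                   if (ax + dx - bx) ** 2 + (ay + dy - by) ** 2 <= threshold ** 2]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


# Eight matches that roughly follow the shift (10, -5), then three wrong ones.
matches = [
    ((0.0, 0.0), (10.1, -5.0)), ((7.0, 13.0), (17.0, 8.1)),
    ((14.0, 26.0), (23.9, 21.0)), ((21.0, 39.0), (31.0, 33.9)),
    ((28.0, 12.0), (38.1, 7.1)), ((35.0, 25.0), (44.9, 19.9)),
    ((42.0, 38.0), (52.0, 33.0)), ((49.0, 11.0), (59.1, 5.9)),
    ((5.0, 5.0), (40.0, 30.0)), ((20.0, 10.0), (0.0, 0.0)),
    ((30.0, 30.0), (31.0, 80.0)),
]
inliers = ransac_translation(matches)
print(inliers)
```

Matches outside the returned consensus set are the outliers that would be filtered out before further processing.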

Finally, advanced bundle adjustment solvers [TMHF00, LA09, ASSS10, WACS11] are used to compute highly accurate camera parameters and a sparse 3D point cloud. Bundle adjustment is a non-linear least-squares optimisation process which is carried out after the addition of several new images to the system in order to suppress the propagation of an error. In addition, it is always performed at the end, after all images have been added, in order to optimise the whole network of images. In this process both the camera parameters (K, R and C) as well as the positions of the 3D points X are optimised simultaneously, aiming at minimisation of the re-projection error:

$$ \min_{K_j,\, R_j,\, C_j,\, X_i} \; \sum_{i} \sum_{j} v_{ij} \, \left\| P_j(X_i) - x_{ij} \right\|^2 \tag{6} $$

where $v_{ij}$ indicates that the point $X_i$ is visible in image j, $P_j(X_i)$ denotes the projection of the 3D point $X_i$ onto image j, and $x_{ij}$ is the corresponding observed image point. Usually optimisation is carried out using the non-linear Levenberg-Marquardt minimisation algorithm [HZ04].
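The cost that bundle adjustment minimises can be evaluated as in the following sketch, which only computes the summed squared re-projection error for fixed parameters and does not perform the non-linear optimisation itself. Visibility is encoded by the presence of a key in the `observations` dictionary; this data layout and all names are assumptions made for the example.

```python
def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def project(cam, X):
    """Pixel coordinates of world point X for a camera cam = (K, R, C)."""
    K, R, C = cam
    x = mat_vec(K, mat_vec(R, [X[i] - C[i] for i in range(3)]))
    return (x[0] / x[2], x[1] / x[2])


def reprojection_error(cameras, points, observations):
    """Sum of squared distances between projected and observed image points.

    observations maps (i, j) to the observed pixel of point i in image j;
    a missing key means that point i is not visible in image j.
    """
    total = 0.0
    for (i, j), (u_obs, v_obs) in observations.items():
        u, v = project(cameras[j], points[i])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(K, I3, [0.0, 0.0, 0.0]), (K, I3, [1.0, 0.0, 0.0])]
points = [[0.5, 0.2, 4.0], [-0.3, 0.1, 5.0]]

# Start from perfect observations, hide one and perturb another by (3, 4).
observations = {(i, j): project(cam, X)
                for i, X in enumerate(points)
                for j, cam in enumerate(cameras)}
del observations[(1, 1)]                      # point 1 not visible in image 1
u, v = observations[(0, 1)]
observations[(0, 1)] = (u + 3.0, v + 4.0)

error = reprojection_error(cameras, points, observations)
print(error)
```

A bundle adjustment solver would repeatedly evaluate this cost (and its derivatives) while updating camera parameters and point positions.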

The entire process is typically called SfM due to the fact that the 3D structure is recovered from a set of photographs which have been taken by a camera that was in motion. In fact, this methodology applies to video sequences as well [vGZ97], and it can also be performed with line-feature correspondences across images [TK95, SKD06], which is especially suitable to urban models.

The advantage of general SfM is its conceptual simplicity and robustness. Furthermore, since it is a bottom–up approach that makes only few assumptions about the input data, it is quite general.

2.2.1. Sparse reconstruction

There are a number of papers which utilise sparse SfM for exploration and reconstruction of urban environments. All these methods produce sparse 3D point clouds, either as the end-product or as an intermediate step. In a series of publications, Snavely et al. [SSS06, SSS07, SGSS08, SSG*10] developed a system for navigation in urban environments which is mainly based on sparse points and SfM camera networks (cf. Figure 5). In this system, called ‘Photo Tourism’, it is possible to navigate through large collections of registered photographs. The density of photographs combined with sparse point clouds and smooth animations gives the user the impression of spatial coherence. These works contributed significantly to the maturity of the current state-of-the-art of SfM and to the use of unstructured collections of Internet images [LWZ*08].

Figure 5.

A sparse point cloud generated from several thousands of unordered photographs, and one photo taken from nearly the same viewpoint. Figure courtesy of Noah Snavely [SSG*10], ©2010 IEEE.

Further methods introduced semi-dense (quasi-dense) SfM [LL02, LQ05] and aimed at improving performance, scalability and accuracy [ASS*09, FQ10, AFS*10, COSH11] in order to deal with arbitrarily high numbers of input photographs. Recent work of Agarwal et al. demonstrates impressively how to reconstruct architecture from over a hundred thousand images in less than 1 day [AFS*11]. They cast the problem of matching corresponding images as a graph-estimation problem, where each image is a vertex, and edges connect only images which depict the same object. They approach this problem using multi-view clustering of scene objects [FCSS10].

2.3. Multi-view stereo

The described SfM procedure delivers networks of images that are registered to each other, including their camera properties, as well as sparse point clouds of 3D structure. However, the point clouds are usually rather sparse and do not contain any solid geometry. The next step towards denser structure is usually called dense matching. It is used for image-based reconstruction of detailed surfaces, as shown for instance in Figure 6. In this context, dense means trying to capture information from all pixels in the input images, in contrast to sparse methods, where only selected feature points are considered.

Figure 6.

Comparison of 3D models created by different methods. Left panel: Vergauwen and van Gool [VvG06], middle panel: Furukawa and Ponce [FP07], right panel: Micusik and Kosecka [MK10]. Figure courtesy of Branislav Micusik [MK10]. ©2010 Springer.

In this paper, we mention several dense matching methods which have been utilised for urban reconstruction. For a more detailed overview, we refer the reader to Scharstein and Szeliski [SS02a] for two-view stereo methods, and to Seitz et al. [SCD*06] for multi-view stereo methods (MVS).

Furthermore, many MVS methods utilise a concept called ‘plane-sweeping’. This process, originally proposed by Collins [Col96], operates on multiple views that are registered to each other. The main idea is to ‘sweep’ a plane through the 3D space along one of the axes, with rays shot from all pixels of all cameras onto the plane. According to epipolar geometry, intersections of the rays with each other at their hit points on the plane indicate 3D structure points. Collins showed how to utilise a series of homographies in order to efficiently accumulate these points and to generate reconstructions [Col96]. The main advantages of this idea are that (1) it works with an arbitrary number n of images, (2) its complexity scales with $O(n)$ and (3) all images are treated in the same way. For this reason, the author called the method a true multi-image matching approach. Plane sweeping has been successfully utilised for the recovery of dense structure and was subsequently extended to utilise programmable graphics hardware [YP03] and multiple sweeping directions [GFM*07]. Bauer et al. [BZB06] proposed a method based on plane sweeping in order to recover sparse point clouds of buildings.
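The role of homographies in plane sweeping can be illustrated with the following simplified sketch. For each candidate depth d, the plane n·X = d (expressed in the coordinates of camera 1) induces a homography H = K2 (R + t nᵀ / d) K1⁻¹ that maps pixels of view 1 to view 2, where a point X in camera-1 coordinates maps to R X + t in camera 2. Real plane-sweep implementations compare image intensities of the warped views; here, as a stand-in, the depth is selected by how well a single known correspondence is mapped. The sign convention, function names and values are assumptions of this example and are not taken from [Col96].

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def inv_K(K):
    """Inverse of the basic calibration matrix (single focal length, no skew)."""
    f, px, py = K[0][0], K[0][2], K[1][2]
    return [[1 / f, 0.0, -px / f], [0.0, 1 / f, -py / f], [0.0, 0.0, 1.0]]


def plane_homography(K1, K2, R, t, n, d):
    """Homography induced by the plane n.X = d, mapping view 1 to view 2."""
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K2, M), inv_K(K1))


def warp(H, pixel):
    x = mat_vec(H, [pixel[0], pixel[1], 1.0])
    return (x[0] / x[2], x[1] / x[2])


def sweep(K1, K2, R, t, n, depths, x1, x2_observed):
    """Return the candidate depth whose homography best maps x1 onto x2."""
    best_depth, best_err = None, float("inf")
    for d in depths:
        u, v = warp(plane_homography(K1, K2, R, t, n, d), x1)
        err = (u - x2_observed[0]) ** 2 + (v - x2_observed[1]) ** 2
        if err < best_err:
            best_depth, best_err = d, err
    return best_depth


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-1.0, 0.0, 0.0]          # camera 2 centre at (1, 0, 0), so t = -R C
n = [0.0, 0.0, 1.0]           # fronto-parallel sweeping planes Z = d

# Projections of the 3D point (0.5, 0.2, 4.0) into both views.
x1, x2_observed = (420.0, 280.0), (220.0, 280.0)
best = sweep(K, K, R, t, n, [2.0, 3.0, 4.0, 5.0, 6.0], x1, x2_observed)
print(best)
```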

2.3.1. Dense reconstruction

The dense structure of a surface is also computed by an MVS matching algorithm proposed by Pollefeys et al. [PvGV*04]. Vergauwen and Van Gool [VvG06] extended this method from regular sequences of video frames to still images by improved feature matching, additional internal quality checks and methods to estimate internal camera parameters. This approach was introduced as the free, public ARC3D web-service, allowing the public to take or collect images, upload them and get the result as dense 3D data and camera calibration parameters [TvG11]. Images of buildings are among the most often uploaded data. Further extensions to this methodology were presented by Akbarzadeh et al. [AFM*06] and Pollefeys et al. [PNF*08].

Furukawa and Ponce [FP07, FP10] presented a different approach for MVS reconstruction. Their method uses an SfM camera network as a preliminary solution, but further, it is based on matching small patches placed on the surface of the scene object which are back-projected onto the images. First, features like Harris corners [HS88] or DoG spots [Low04] are detected and matched across images, which, projected on the object, define the locations of the patches. These are defined in such a way that their re-projected footprints cover the actual images. They are then optimised such that a photometric discrepancy function across the re-projected patches is minimised. The results are semi-dense clouds of small patches which serve as a basis for denser structure triangulation and, finally, for polygonal surface extraction. To achieve this, they employ the Poisson surface reconstruction algorithm [KBH06], as well as an iteratively refined visual hull method [FP08]. This 3D reconstruction idea is also very generic, and it has since been extended and applied to 3D urban reconstruction as well [FCSS09a, FCSS10].

Another approach for the reconstruction of dense structures is to perform pairwise dense matching [SS02a] of any two registered views and then to combine the computed depth maps with each other. Usually this approach is denoted as depth map fusion. Several ways of performing this have been proposed, such as by Goesele et al. [GCS06, GSC*07], Zach et al. [ZPB07, IZB07] and Merrell et al. [MAW*07]. A practical and robust dense matching approach has also been proposed by Hirschmüller [Hir08]. It uses a pixel-wise Mutual Information-based matching cost for compensating radiometric differences of input images.

A common problem of dense stereo methods is that the models exhibit a relatively high amount of noise along flat surfaces. This is due to the nature of matching nearby points more or less independently from each other. This, in fact, is a major obstacle in urban reconstruction, where most models are composed of groups of planar surfaces. Several methods try to overcome this problem by including hierarchical models [LPK09], Manhattan-world assumptions [FCSS09a, FCSS09b], multi-layer depth maps [GPF10] or piece-wise planar priors [MK09, MK10, SSS09, CLP10, GFP10]. Recently, Lafarge et al. [LKBV12] proposed a hybrid method which combines dense meshes with geometric primitives (cf. Figure 7).

Figure 7.

An initial rough surface (top panel) combined with a geometric primitive model (bottom panel). Figure courtesy of Florent Lafarge [LKBV12], ©2012 IEEE.

Generally, dense multi-view approaches deliver quite impressive results, like the large scale system presented by Frahm et al. [FFGG*10]: it deals with almost three million images and performs image clustering, SfM and dense map fusion in one day on a single PC. On the downside, these systems usually provide dense polygonal meshes without any higher-level knowledge of the underlying scene, even though such information is very useful in complex architectural models. However, there exist other approaches which provide well-defined geometric shapes and often also some semantics. We cover such methods in Section 'Buildings & Semantics'.

3. Buildings & Semantics

In this section, we turn our attention to approaches which aim at reconstructing whole buildings from various input sources, such as a set of photographs or laser-scanned points, typically by fitting some parametrised top–down building model.

3.1. Image-based modelling

In image-based modelling, a static 3D object is modelled with the help of one or more images or videos. Such methods are often also referred to as photogrammetric modelling, especially in the photogrammetry and remote sensing community. In this section we restrict our review to approaches which model single buildings mainly from ground-based or close-range photographs (cf. Figure 8).

Figure 8.

Interactive image-based modelling: (1) input image with user-drawn edges shown in green, (2) shaded 3D solid model, (3) geometric primitives overlaid onto the input image, (4) final view-dependent, texture-mapped 3D model. Figure courtesy of Paul Debevec [DTM96] ©1996 ACM.

Generally, in order to obtain true 3D properties of an object, the input must consist of at least two perspective images of the scene. There are also single-image methods which usually rely on user input or knowledge of the scene objects in order to compensate for the missing information.

Nonetheless, multi-view methods also make a number of assumptions about the underlying object in order to define a top–down architectural model which is successively completed from cues derived from the input imagery. The outcome usually consists of medium-detail geometric building models, in some cases enriched with finer detail, such as windows. Some methods also deliver textures and more detailed façade geometry, but we omit discussion of these features in this section, and instead elaborate on them in Section 'Façades & Images'.

The degree of user interaction varies across the methods as well. Generally, the tradeoff is between quality and scalability. More user interaction leads to more accurate models and semantics, but such approaches do not scale well to huge amounts of data. Using fully automatic methods is an option, but they are more error prone and also depend more on the quality of the input.

3.1.1. Interactive multi-view modelling

A seminal paper in this field was the work of Debevec et al. [DTM96]. Their system, called ‘Façade’, introduced a workflow for interactive multi-view reconstruction.

The actual model is composed of parametrised primitive polyhedral shapes, called blocks, arranged in a hierarchical tree structure (cf. Figure 9). Debevec et al. based their modelling application on a number of observations [DTM96]:

Figure 9.

A geometric model of a simple building (a); the model's hierarchical representation (b). The nodes in the tree represent parametric primitives while the links contain the spatial relationships between the blocks. Figure courtesy of Paul Debevec [DTM96] ©1996 ACM.

  • Most architectural scenes are well modelled by an arrangement of geometric primitives.
  • Blocks implicitly contain common architectural elements such as parallel lines and right angles.
  • Manipulating block primitives is convenient, since they are at a suitably high level of abstraction; individual features such as points and lines are less manageable.
  • A surface model of the scene is readily obtained from the blocks, so there is no need to infer surfaces from discrete features.
  • Modelling in terms of blocks and relationships greatly reduces the number of parameters that the reconstruction algorithm needs to recover.

Composing an architectural model from such blocks turned out to be quite a robust task which provides very good results (cf. Figure 9). During the modelling process, the user interactively selects a number of photographs of the same object and marks corresponding edges in each of them. The correspondences allow establishing epipolar-geometric relations between them, and the parameters of the 3D primitives can be fitted automatically using a non-linear optimisation solver [TK95]. Because the number of views is kept quite low, and because many of the blocks can be constrained to each other, which significantly reduces the parameter space, the optimisation problem can be solved efficiently (e.g. within a few minutes on 1996 hardware).
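The hierarchical block representation described above can be sketched as a small tree data structure. The sketch makes strong simplifying assumptions: blocks are axis-aligned boxes, and the only spatial relation stored on a link is a translation relative to the parent. The class and function names are hypothetical and do not reflect the actual implementation of [DTM96].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Block:
    """An axis-aligned box primitive placed relative to its parent block."""
    name: str
    size: Tuple[float, float, float]                        # width, depth, height
    offset: Tuple[float, float, float] = (0.0, 0.0, 0.0)    # relative to parent origin
    children: List["Block"] = field(default_factory=list)

    def add(self, child: "Block") -> "Block":
        self.children.append(child)
        return child

def world_origins(block, parent_origin=(0.0, 0.0, 0.0)):
    """Return {name: world-space origin} by accumulating offsets down the tree."""
    origin = tuple(p + o for p, o in zip(parent_origin, block.offset))
    result = {block.name: origin}
    for child in block.children:
        result.update(world_origins(child, origin))
    return result

ground = Block("ground", (20.0, 20.0, 0.0))
body = ground.add(Block("body", (10.0, 8.0, 12.0), offset=(5.0, 6.0, 0.0)))
# The tower sits on top of the body: its vertical offset equals the body height,
# so it is a constraint rather than a free parameter.
tower = body.add(Block("tower", (3.0, 3.0, 6.0), offset=(3.5, 2.5, body.size[2])))
origins = world_origins(ground)
```

Tying the tower's vertical offset to the height of the body illustrates how relations between blocks remove parameters that an optimiser would otherwise have to recover.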

The high quality of the obtained results encouraged other researchers to develop interactive systems. For example, another image-based modelling framework called ‘Photobuilder’ was presented by Cipolla and Robertson [CR99, CRB99]. Their work introduced an interactive system for recovering 3D models from a few uncalibrated images of architectural scenes based on vanishing points and the constraints of projective geometry. Such constraints, like parallelism and orthogonality, were also exploited by Liebowitz et al. [LZ98, LCZ99], who presented a set of methods for creating 3D models of scenes from a limited number of images, that is, one or two, for situations where no scene coordinate measurements are available.

Lee et al. introduced an interactive technique for block-model generation from aerial imagery [LHN00]. They extended the method further and introduced automatic integration of ground-based images with 3D models in order to obtain high-resolution façade textures [LJN02a, LJN02b, LJN02c]. They also proposed an interactive system which provides a hierarchical representation of the 3D building models [LN03]. In this system, information for different LOD can be acquired from aerial and ground images. The method requires less user interaction than the ‘Façade’ system, since it uses more automatic image calibration. It also requires at most three clicks for creating a 3D model and two model-to-image point correspondences for the pose estimation. Finally, they also handled more detailed façade and window reconstruction [LN04] (cf. Section 'Façade modelling').

El-Hakim et al. [EhWGG05, EhWG05] also proposed a semi-automatic system for image-based modelling of architecture. Their approach allows the user to model parametrised shapes which are stored in a database and can be reused for further modelling of similar objects.

The next important advance of interactive modelling was the combination of automatic sparse SfM methods with parametrised models and user interaction. SfM provides a network of registered cameras and a sparse point-cloud (cf. Section 'Point Clouds & Cameras'). The goal is to fit a parametrised model to this data.

A series of papers published by van den Hengel and colleagues describe building blocks of an image and video-based reconstruction framework (cf. Figure 10). Their system [vdHDT*06] uses camera parameters and point clouds generated by a SfM process (cf. Section 'Point Clouds & Cameras') as a starting point for developing a higher-level model of the scene. The system relies on the user to provide a small amount of structure information from which more complex geometry is extrapolated. The regularity typically present in man-made environments is used to reduce the interaction required, but also to improve the accuracy of fit. They extend their higher-level model [vdHDT*07a], such that the scene is represented as a hierarchical set of parametrised shapes, as already proposed by others [DTM96, LN03]. Relations between shapes, such as adjacency and alignment, are specified interactively, such that the user is asked to provide only high-level scene information and the remaining detail is provided through geometric analysis of the images. In a follow-up work [vdHDT*07b], they present a video-trace system for interactive generation of 3D models using simple 2D sketches drawn by the user, which are constrained by 3D information already available.

Figure 10.

Interactive modelling of geometry in video. Left-hand side: Replicating the bollard by dragging the mouse. Right-hand side: Replicating a row of bollards. Figure courtesy of Anton van den Hengel [vdHDT*07a] ©2007 ACM.

Sinha et al. [SSS*08] presented an interactive system for generating textured 3D models of architectural structures from unordered sets of photographs (cf. Figure 11). It is also based on SfM as the initial step. This work introduced novel, simplified 2D interactions such as sketching of outlines overlaid on 2D photographs. The 3D structure is automatically computed by combining the 2D interaction with the multi-view geometric information from SfM analysis. This system also utilises vanishing point constraints [RC02], which are relatively easy to detect in architectural scenes. Recently, Larsen and Moeslund [LM11b] proposed an interactive method for modelling buildings from sparse SfM point-clouds. It provides simple block models and textures. The pipeline also includes an approach for automatic segmentation of façades.

Figure 11.

Results of interactive image-based modelling method. Figure courtesy of Sudipta Sinha [SSS*08], ©2008 ACM.

3.1.2. Automatic multi-view modelling

A number of image-based and photogrammetric approaches attempt fully automatic modelling. Buildings are especially suited to such methods because the model can be significantly constrained by cues typically present in architectural scenes, like parallelism and orthogonality. These attributes help to extract line-features and vanishing points from the images, which opens the door for compact algorithms [LZ98, Rot00, RC02, KZ02] that aim at both reliable camera recovery and consecutive reconstruction of 3D structure.
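To make the role of line features and vanishing points more concrete, the following minimal sketch intersects two image line segments in homogeneous coordinates, where both the line through two points and the intersection of two lines are given by a cross product. It is a textbook construction and not the algorithm of any cited paper; practical systems estimate vanishing points robustly from many noisy segments. The function names and the eps threshold are illustrative assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def vanishing_point(seg_a, seg_b, eps=1e-12):
    """Intersect two image segments; None if they are parallel in the image."""
    v = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(v[2]) < eps:
        return None  # the vanishing point lies at infinity
    return (v[0] / v[2], v[1] / v[2])

# Two edges of a receding façade that converge towards (100, 50)
edge_top = ((0.0, 0.0), (50.0, 25.0))
edge_bottom = ((0.0, 100.0), (50.0, 75.0))
vp = vanishing_point(edge_top, edge_bottom)
parallel = vanishing_point(((0.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (1.0, 1.0)))
```

Edges that are parallel in the scene but not in the image meet at a finite vanishing point, while edges that remain parallel in the image are reported as None.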

While the mentioned papers provide well-defined tools for multi-view retrieval of general objects, others proposed model-based systems which aim more specifically at building reconstruction. An early project for reconstructing whole urban blocks was proposed by Teller [Tel98]. Coorg and Teller [CT99] detected vertical building planes using the space-sweep algorithm [Col96] and provided a projective texture for their façade; however, their system did not yet utilise any stronger top–down model of a building.

Werner and Zisserman [WZ06] proposed a fully automatic approach inspired by the work of Debevec et al. [DTM96]. Their method accepts a set of multiple short-range images and it attempts to fit quite generic polyhedral models in the first stage. In the second stage, the coarse model is used to guide the search for fitting more detailed polyhedral shapes, such as windows and doors. The system employs the plane-sweep approach [Col96] for polyhedral shape fitting, which was also used by Schindler and Bauer [BKS*03], who additionally introduced more specific templates for architectural elements.

The work of Dick et al. [DTC00, DTC04] also aims at an automatic acquisition of 3D architectural models from small image sequences (cf. Figure 12). Their model is Bayesian, which means that it needs the formulation of a prior distribution. In other words, the model is composed of parametrised primitives (such as walls, doors and windows), each of which is assigned a probability distribution. The prior of a wall layout and the priors of the parameters of each primitive are partially learned from training data, and partially added manually according to the knowledge of expert architects. The model is reconstructed using Markov Chain Monte Carlo (MCMC) machinery, which generates a range of possible solutions from which the user can select the best one when the structure recovery is ambiguous. In a way, this method is loosely related to the inverse procedural methods described later in Section 'Inverse procedural modelling' because it also delivers semantic descriptions of particular elements of the buildings.
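The general idea of combining a prior over a primitive's parameters with image evidence and sampling the result with MCMC can be sketched on a toy problem. The sketch below estimates a single parameter, a window width, with a plain random-walk Metropolis sampler. It is not the sampler or the model of Dick et al.; the Gaussian prior, the noise level, the measurements and all function names are made-up assumptions for illustration.

```python
import math
import random

def log_posterior(width, measurements, prior_mean=1.2, prior_std=0.3, noise_std=0.1):
    """Unnormalised log posterior: Gaussian prior plus Gaussian measurement likelihood."""
    log_prior = -0.5 * ((width - prior_mean) / prior_std) ** 2
    log_like = sum(-0.5 * ((m - width) / noise_std) ** 2 for m in measurements)
    return log_prior + log_like

def metropolis(measurements, n_samples=20000, step=0.05, start=1.2, seed=0):
    """Random-walk Metropolis sampler over a single primitive parameter."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current, measurements)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, measurements)
        # Accept with probability min(1, p(proposal) / p(current))
        if math.log(rng.random() + 1e-300) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

observed_widths = [1.48, 1.52, 1.50, 1.55, 1.45]   # metres, made-up values
samples = metropolis(observed_widths)[2000:]        # drop burn-in
posterior_mean = sum(samples) / len(samples)
```

The retained samples approximate the posterior distribution, so they describe a range of plausible solutions rather than a single estimate; here the prior pulls the estimate slightly below the measurement average of 1.5 m.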

Figure 12.

Example of fully automatic modelling: A labelled 3D model is generated from several images of an architectural scene. Figure courtesy of Anthony Dick [DTC04], ©2004 Springer.

Xiao et al. [XFZ*09] provided another automatic approach to generate 3D models from images captured along the streets at ground level. Since their method reconstructs a larger urban area than a single building, we discuss it in Section 'Ground-based reconstruction'.

3.1.3. Interactive single-view modelling

Assuming some knowledge about the scene, it is often possible to reconstruct it from a single image. Horry et al. [HAA97] provided an interactive interface for adding perspective to a single photograph, which is subsequently exploited in order to simulate the impression of depth. Shum and Szeliski [SHS98] introduced a system for interactive modelling of building interiors from a single panoramic image. Photogrammetric tools, for example, a linear algorithm which computes plane rectification, plane orientation and camera calibration from a single image [LCZ99], paved the way for further single-image approaches. For example, van den Heuvel [vdH01] introduced an interactive algorithm for extraction of buildings from a single image. Oh et al. [OCDD01] proposed a tool for interactive depth-map painting in a single photo, which is then utilised for rendering.

The most recent paper in this category was presented by Jiang et al. [JTC09], who introduced an algorithm to calibrate the camera from a single image, and proposed an interactive method which allows for recovery of 3D points driven by the symmetry of the scene objects. Its limitation is that it only works for highly symmetric objects because the epipolar constraints are derived from symmetries present in the scene.

3.1.4. Automatic single-view modelling

Some fully automatic methods have been attempted. Hoiem et al. [HEH05] proposed a method for creation of simplified ‘pop-up’ 3D models from a single image, by using image segmentation and depth assignments based on vanishing points [RC02, KZ02]. Kosecka and Zhang [KZ05] introduced an approach for automatic extraction of dominant rectangular structures from a single image using a model with a high-level rectangular hypothesis. Barinova et al. [BKY*08] proposed a method for structure reconstruction using a Conditional Random Field model. Recently, Wan and Li [WL11] proposed an automatic algorithm for façade segmentation, which segments a building into a set of separate façades based on the extraction of vanishing points and lines.

To summarise image-based modelling, we must say that fully automatic modelling still suffers considerable quality loss compared to interactive approaches, and as of today, the best quality is still obtained by interactive multi-view methods. For this reason, due to the current demand for high-quality models, most close-range reconstruction is approached with semi-automatic modelling.

3.2. LiDAR-based modelling

Another group of methods focusing on the reconstruction of buildings utilises laser-scan data, also referred to as LiDAR-data. Generally, there are two main types of this class of data: those acquired by ground-based devices (terrestrial LiDAR), and those captured from the air (aerial LiDAR).

Laser scanning is widely used in the photogrammetry and remote sensing community for measurement and documentation purposes. In this paper, we omit those methods. Only in recent years has the goal of further segmentation and fitting of parametrised high-level polyhedral models emerged, and we will focus on those approaches.

3.2.1. Interactive modelling

Due to advances in laser-scanning technology, LiDAR data has become more accessible in recent times, but the quality demands on the models have also grown due to larger bandwidth and higher resolution displays. While laser scans are in general dense and relatively regular, and thus well suited for architectural reconstruction, the practical process of acquisition is difficult and the resulting data is often corrupted by noise, outliers and incomplete coverage. In order to overcome such problems, several methods propose to process the data with user interaction.

Böhm [Binline image4] published a method for completion of terrestrial laser-scan point clouds, which is done by interactively utilising the repetitive information typically present in urban buildings. Another approach aiming for a similar goal was introduced by Zheng et al. [ZSW*10]. It is also an interactive method for consolidation which completes holes in scans of building façades. This method exploits large-scale repetitions and self-similarities in order to consolidate the imperfect data, denoise it and complete the missing parts.

Another interactive tool for assembling architectural models directly over 3D point clouds acquired from LiDAR data was introduced by Nan et al. [NSZ*10]. In this system, the user defines simple building blocks, so-called SmartBoxes, which snap to common architectural structures, like windows or balconies. They are assembled through a discrete optimisation process which balances between fitting the point-cloud data [SWK07] and their mutual similarity. In combination with user interaction, the system can reconstruct complex buildings and façades from sparse and incomplete 3D point clouds (cf. Figure 13).

Figure 13.

Results of interactive fitting of ‘SmartBoxes’ to incomplete LiDAR data. Figure courtesy of Liangliang Nan [NSZ*10], ©2010 ACM.

Other approaches aim at the enhancement of LiDAR data by fusing it with optical imagery. Some work on registration and pose estimation of ground-images with laser-scan point clouds was done by Liu and Stamos [LS07]. The method aims at robust registration of the camera-parameters of the 2D images with the 3D point cloud. Recently, Li et al. [LZS*11] introduced an interactive system for fusing 3D point-clouds and 2D images in order to generate detailed, layered and textured polygonal building models. The results of this method are very impressive, but again come at the cost of human labour and extended processing time.

Another approach is to fit polygonal models into point clouds, especially using the assumption of piece-wise planar objects. Chen and Chen [CC07] proposed a pipeline that relies on minimal user input. Recently, Arikan et al. [ASF*13] proposed a framework for the generation of polyhedral models from semi-dense point-clouds. Their system automatically extracts planar polygons which are optimised in order to ‘snap’ to each other to form an initial model. The user can refine it with simple interactions, like coarse 2D strokes. The output is an accurate and well-defined polygonal object (cf. Figure 14).

Figure 14.

Interactive modelling: starting from a noisy and incomplete point cloud, the method of Arikan et al. yields a coarse polygonal model that approximates the input. Figure courtesy of Michael Schwärzler [ASF*13].

3.2.2. Automatic modelling

As with image-based modelling, there also exist many approaches that aim at full automation. While such systems scale well with the data, they usually require the user to set up a number of parameters. This kind of parametrisation is very common in fully automatic methods and it turns out to be an often underestimated obstacle, since the search for proper parameters can be very time consuming. The benefit is that once good parameters are found for a data set, it can be processed automatically irrespective of its actual size.

In earlier works, Stamos and Allen developed a system for reconstruction of buildings from sets of range scans combined with sets of unordered photographs [SA00b, SA00a, SA01, SA02]. Their method is based on fitting planar polygons into pre-clustered point-clouds. Bauer et al. [BKS*03] also proposed an approach for the detection and partition of planar structures in dense 3D point clouds of façades, yielding polygonal models with a considerably lower complexity than the original data.

Pu and Vosselman [PV09b] proposed a system for segmenting terrestrial LiDAR data in order to fit detailed polygonal façade models. Their method uses least-squares fitting of outline polygons, convex hulls and concave polygons, and it combines a polyhedral building model with the extracted parts. The reconstruction method is automatic and it aims at detailed façade reconstruction (refer to Section 'Façade decomposition').

Toshev et al. [TMT10] also presented a method for detecting and parsing of buildings from unorganised 3D point clouds. Their top–down model is a simple and generic grammar fitted by a dependency parsing algorithm, which also generates a semantic description. The output is a set of parse trees, such that each tree represents a semantic decomposition of a building. The method is very scalable and is able to parse entire cities.

Zhou and Neumann [ZN08] presented an approach for automatically reconstructing building models from airborne LiDAR data. This method features vegetation detection, boundary extraction and a data-driven algorithm which automatically learns the principal directions of roof boundaries. The output consists of polygonal building models. A further extension [ZN10, ZN11] produces polygonal 2.5D models composed of complex roofs and vertical walls. Their approach generates buildings with arbitrarily shaped roofs with a high level of detail, which is comparable to that of interactively created models (cf. Figure 15). Recently they improved their method using global regularities in the buildings [Neu12].

Figure 15.

Results of the automatic method which uses LiDAR segmentation. Figure courtesy of Qian-Yi Zhou [ZN10], ©2010 Springer.

Vanegas et al. [VAB12] proposed an approach for the reconstruction of buildings from 3D point clouds with the assumption of Manhattan World building geometry. Their system detects and classifies features in the data and organises them into a connected set of clusters from which a volumetric model description is extracted (cf. Figure 16). The Manhattan World assumption has been successfully used by several urban reconstruction approaches [FCSS09a, VAW*10], since it allows the fundamental shapes of most buildings to be identified robustly. Recently, Lafarge and Alliez [LA13] introduced a novel, structure-preserving method for surface reconstruction from unstructured point sets.
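The Manhattan World assumption mentioned above can be illustrated with a minimal sketch that assigns estimated surface normals to one of three mutually orthogonal directions. It assumes that the dominant directions are already known and coincide with the coordinate axes, which real systems have to estimate first; the angular threshold and all names are hypothetical and not taken from [VAB12].

```python
import math

AXES = {"x": (1.0, 0.0, 0.0), "y": (0.0, 1.0, 0.0), "z": (0.0, 0.0, 1.0)}

def manhattan_label(normal, max_angle_deg=15.0):
    """Return the axis name a normal aligns with, or None if it deviates too much."""
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        return None
    best_axis, best_cos = None, 0.0
    for name, axis in AXES.items():
        # Absolute value: a normal and its negation describe the same orientation
        cos_angle = abs(sum(n * a for n, a in zip(normal, axis))) / length
        if cos_angle > best_cos:
            best_axis, best_cos = name, cos_angle
    if best_cos >= math.cos(math.radians(max_angle_deg)):
        return best_axis
    return None

normals = [(0.02, -0.01, 0.99), (-0.97, 0.05, 0.03), (0.6, 0.6, 0.5)]
labels = [manhattan_label(n) for n in normals]
```

Normals that fall within the tolerance of an axis can be grouped into axis-aligned clusters, whereas the oblique third normal is rejected as not conforming to the assumption.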

Figure 16.

Automatic reconstruction of a building with volumetric models. For purposes of visual evaluation, the reconstructed volume is superimposed over the original point set, including noise and obstacles (left-hand side), and textured with photographs of the buildings (right-hand side). Figure courtesy of Carlos Vanegas [VAB12], ©2012 IEEE.

Another important field is point-set segmentation. Korah et al. [KMO11] published a method for segmentation of aerial urban LiDAR scans in order to determine individual buildings, and Shen et al. [SHFH11] proposed a hierarchical façade segmentation method based on repetitions and symmetry detection in terrestrial LiDAR scans (cf. Section 'Façade decomposition'). Wan and Sharf [WS12] proposed an automatic method which uses a simplified predefined grammar that is fitted by probabilistic optimisation. Their method delivers good results but is restricted to a small number of possible input buildings.

Although LiDAR data has been accessible for quite a while, and methods which aim at robust fitting of top–down models into it deliver good results, the full potential of this combination has not yet been exhausted. We may thus expect further interesting papers on this topic in the near future.

3.3. Inverse procedural modelling

A new and growing area is that of inverse procedural modelling (IPM), where the framework of grammar-driven model construction is not only used for synthesis, but also for the reconstruction of existing buildings. Traditional forward procedural urban modelling provides an elegant and fast way to generate huge, complex and realistic looking urban sites. A recent survey [VAW*10] presented this approach for the synthesis of urban environments. An inverse methodology is applicable to many types of procedural models, but such an exploration has been quite prolific with respect to building models. The most general form of the IPM problem is to discover both the parametrised grammar rules and the parameter values that, when applied in a particular sequence, yield a pre-specified output.

Discovering both the rules and the parameter values that result in a particular model effectively implies compressing a 3D model down to an extremely compact and parametrised form. Stava et al. proposed a technique to infer a compact grammar from arbitrary 2D vector content [SBM*10]. Bokeloh et al. [BWS10] exploited partial symmetry in existing 3D models to do IPM. Recently, Talton et al. [TLL*11] used a Metropolis-based approach to steer which rules (from a known large set) and parameter values to apply in order to obtain a 3D output resembling a pre-defined macroscopic shape. Benes et al. [BvMM11] defined guided procedural modelling as a method of spatially dividing the rules (and productions) into small guided procedural models that can communicate by parameter exchange in order to obtain a desired output.

Various methods have specialised the inverse framework to the application of building reconstruction, often by assuming that the rules are known, thus inferring only the parameter values. A very complete, yet manual solution to this problem was presented by Aliaga et al. [ARB07]. They interactively extract a repertoire of grammars from a set of photographs of a building and utilise this information in order to visualise a realistic and textured urban model (cf. Figure 17). This approach allows for quick modifications of the architectural structures, like the number of floors or the number of windows in a floor. The disadvantage of this approach is the quite labour-intensive grammar creation process.

Figure 17.

Example of inverse procedural modelling of a building from a photograph (top panel) and the application of the grammar to generate novel building variations (bottom panel). Figure courtesy of Aliaga et al. [ARB07], ©2007 IEEE.

Another grammar-driven method for automatic building generation from air-borne imagery was proposed by Vanegas et al. [VAB10]. Their method uses a simple grammar for building geometry that approximately follows the Manhattan World assumption. This means that it expects a predominance of the three mutually orthogonal directions. The grammar converts the reconstruction of a building into a sequential process of refining a coarse initial building model (e.g. a box), which they optimise using geometric and photometric matching across images. The system produces complete textured polygonal models of buildings (cf. Figure 18). Recently, Vanegas et al. [VGDA*12] introduced a framework that enables adding intuitive high-level control to an existing large urban procedural model.

Figure 18.

Results of the automatic method which uses aerial imagery registered to maps and an inverse procedural grammar. Figure courtesy of Carlos Vanegas [VAB10], ©2010 IEEE.

Hohmann et al. [HKHF09, HHKF10] presented a modelling system which is a combination of procedural modelling with generative modelling language (GML) shape grammars [Hav05]. Their method is based on interactive modelling in a top–down manner, yet it contains high-level cues and aims at semantic enrichment of geometric models. A more generic approach of generative modelling has been proposed recently by Ullrich and Fellner [UF11].

Mathias et al. [MMWvG11] reconstruct complete buildings as procedural models using template shape grammars. In the reconstruction process, they let the grammar interpreter automatically decide on which step to take next. The process can be seen as instantiating the template by determining the correct grammar parameters. Another approach where a grammar is fitted from laser-scan data was published by Toshev et al. [TMT10].
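The restricted form of the inverse problem, in which the rules are known and only parameter values are sought, can be sketched with a toy template. The forward model below places window centres on a regular grid, and the inverse step searches exhaustively for the numbers of floors and columns that best explain a set of detections. The template, the mismatch measure and the brute-force search are illustrative assumptions; they are not the grammars or optimisation strategies of the cited works.

```python
from itertools import product

def generate_windows(width, height, floors, columns):
    """Forward model: a split rule placing one window centre per grid cell."""
    cell_w, cell_h = width / columns, height / floors
    return [((c + 0.5) * cell_w, (r + 0.5) * cell_h)
            for r in range(floors) for c in range(columns)]

def mismatch(generated, observed):
    """Symmetric sum of squared nearest-neighbour distances between two point sets."""
    def one_way(src, dst):
        return sum(min((sx - dx) ** 2 + (sy - dy) ** 2 for dx, dy in dst)
                   for sx, sy in src)
    return one_way(generated, observed) + one_way(observed, generated)

def fit_parameters(observed, width, height, max_floors=6, max_columns=8):
    """Inverse step: exhaustive search over the template's discrete parameters."""
    candidates = product(range(1, max_floors + 1), range(1, max_columns + 1))
    return min(candidates,
               key=lambda p: mismatch(generate_windows(width, height, *p), observed))

# Noisy window detections on a 12 m x 9 m façade with 3 floors and 4 columns
observed = [(x + dx, y + dy)
            for (x, y), (dx, dy) in zip(generate_windows(12.0, 9.0, 3, 4),
                                        [(0.1, -0.1), (-0.1, 0.05)] * 6)]
best = fit_parameters(observed, 12.0, 9.0)
```

The symmetric mismatch penalises both missing and superfluous windows, so neither an overly coarse nor an overly fine grid can explain the detections better than the true configuration.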

In the photogrammetry community, the idea of IPM has also found wide applicability in papers aiming at the reconstruction of buildings and façades: Ripperda and Brenner introduced a predefined façade grammar which they automatically fit from images [BR06, Rip08] and laser scans [RB07, RB09] using Reversible Jump Markov Chain Monte Carlo (RJMCMC). A similar approach was proposed by Becker and Haala [BH07, BH09, Bec09], but in this system they also propose to automatically derive a façade-grammar from the data in a bottom–up manner.

Other work aims at grammar-driven image segmentation. For example, Han and Zhu [HZ05, HZ09] presented a simple attribute graph grammar as a generative representation for man-made scenes and proposed a top–down/bottom–up inference algorithm for parsing image content. It simplifies the objects which can be detected to square boxes in order to limit the grammar space. Nevertheless, this approach provides a good starting point for inverse procedural image segmentation.

The field of IPM is relatively new and still not very well researched. For this reason, we expect more exciting papers on this topic in the near future.

4. Façades & Images

In this section, we focus on approaches aiming at the reconstruction and representation of façades. In recent years, many different approaches for the extraction of façade texture, structure, façade elements and façade geometry have been proposed.

First, we discuss façade image processing approaches which aim at an image-based representation of façades. Here we include panorama imaging and projective texturing. Second, we continue with façade-parsing methods. These methods aim at automatic subdivision of façades into their structural elements. Third, we address the topic of interactive façade modelling systems which aim at higher quality and level of detail.

4.1. Façade imagery

Imagery is essential in urban reconstruction as both a source of information and a source of realism in the final renderings. Additional advantages of imagery are its generally simple acquisition process and the enormous amount of existing knowledge about its processing. It has been the subject of very active research in the last two decades. In this section we cover urban panorama imaging as well as texture generation approaches.

4.1.1. Panoramas and image stitching

Panoramas are traditionally generated for the purpose of visualising wide landscapes or similar sights, but in the context of urban reconstruction, panoramas might already be seen as final results of virtual models on their own.

In practice, panoramas are composed from several shots taken at approximately the same location [SS02b, Sze06]. For urban environments, the composed image is often generated along a path of camera movement, referred to as a strip panorama. The goal of those methods is to generate views with more than one viewpoint in order to provide an approximation of an orthographic projection. Variants of those are pushbroom images, which are orthographic in the direction of motion and perspective in the orthogonal one [GH97, SK03], and the similar x-slit images presented by Zomet et al. [ZFPW03]. Similar approaches for the generation of strip-panoramic images were also proposed by Zheng [Zhe03] and Roman et al. [RGL04]. Agarwala et al. [AAC*06] aim at the creation of long multi-view strip panoramas of street scenes, where each building is projected approximately orthogonally onto a proxy plane (cf. Figure 19). Optimal source images for particular pixels are chosen using a constrained Markov Random Field (MRF) optimisation process [GG84, KZ04].
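The mixed nature of a pushbroom image, orthographic along the direction of motion and perspective across it, can be written down in a few lines. The sketch assumes an idealised camera that translates along the x-axis and looks down the positive z-axis; the parameter names focal_length and scale_along_path are hypothetical.

```python
def pushbroom_project(point, focal_length=1.0, scale_along_path=1.0):
    """Project a 3D point with a camera translating along x and looking down +z.

    Along the path (x) the projection is orthographic: u depends only on X.
    Across the path (y) it is perspective: v shrinks with depth Z.
    """
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera path (z > 0)")
    u = scale_along_path * x
    v = focal_length * y / z
    return (u, v)

near = pushbroom_project((4.0, 2.0, 5.0))
far = pushbroom_project((4.0, 2.0, 10.0))
```

Doubling the depth halves the vertical image coordinate but leaves the horizontal one unchanged, which is why objects at different depths appear stretched or squashed along the path in such images.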

Figure 19.

A multi-viewpoint panorama of a street in Antwerp composed from 107 photographs taken about 1 m apart with a hand-held camera. Figure courtesy of Aseem Agarwala [AAC*06], ©2006 ACM.

Panoramas are usually generated by stitching image content from several sources, and the results are often also referred to as photomosaics. Stitching two signals of different intensity usually causes a visible junction between them. An early solution to this problem was the use of transition zones and multi-resolution blending [BA83]. Pérez et al. [PGB03] introduced a powerful method for this purpose: image editing in the gradient domain. A number of further papers tackle, improve, accelerate and make use of this idea [PGB03, ADA*04, Aga07, MP08]. Zomet et al. presented an image stitching method for long images [ZLPW06]. The foundations of gradient-domain image editing are described in the aforementioned papers as well as in the ICCV 2007 course notes [AR07].
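To make the notion of a transition zone concrete, the following minimal sketch joins two one-dimensional intensity rows that overlap by a known number of samples. It shows only single-scale linear feathering, not the multi-resolution scheme of [BA83] or gradient-domain editing; the function name `feather_blend` and the sample values are assumptions made for illustration.

```python
def feather_blend(left, right, overlap):
    """Join two 1D intensity rows whose last/first `overlap` samples coincide.

    Inside the overlap the weight of `right` rises linearly, so the
    intensity step is spread over the whole transition zone.
    """
    if not 0 < overlap <= min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter row length")
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand row
        blended.append((1 - w) * left[start + i] + w * right[i])
    return left[:start] + blended + right[overlap:]


# Two exposures of the same wall with different brightness (made-up values).
row = feather_blend([10.0] * 5, [20.0] * 5, overlap=3)
```

A wider overlap hides the seam better but blurs any detail that is misaligned between the two sources, which is one motivation for the multi-resolution and gradient-domain methods cited above.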

4.1.2. Texture generation

Another fundamental use of imagery is texturing. The particular problem of generating textures for the interactive rendering of 3D urban models can be addressed by projective texturing from perspective photographs. Most interactive modelling systems, like ‘Façade' [DTM96], allow sampling projective textures on the reconstructed buildings. Based on input from video [vdHDT*07c] or image collections [ARB07, SSS*08, XFT*08], such systems include projective texture sampling in their modelling pipeline and rely on user interaction in order to improve the quality of the results.

Others also proposed tools for texturing of existing models, like an interactive approach by Georgiadis et al. [GSGA05], or an automatic one by Grzeszczuk et al. [GKVH09]. There are further fully automatic attempts (most of them in the photogrammetry literature) which aim at projective texture generation for existing building models [CT99, WH01, WTT*02, BÖ4, OR05, GKKP07, TL07, TKO08, KZZL10].

More tools dedicated to interactive enhancement and inpainting for architectural imagery were presented by Korah and Rasmussen [KR07b], who detected repetitive building parts to inpaint façades, and by Pavic et al. [PSK06], who proposed an interactive method for completion of building textures. Musialski et al. [MWR*09] used translational and reflective symmetry in façade-images to remove unwanted content (cf. Figure 20), and multi-image stitching to obtain obstacle-free near-orthographic views [MLS*10]. Eisenacher et al. [ELS08] used example-based texture synthesis to generate realistic-looking building walls.

Figure 20.

The input image on the left contains a traffic light and several cables. To the right the unwanted objects have been successfully removed by utilising the symmetry in the façade image [MWR*09].

Recently, some interesting tools for façade imagery processing have exploited the matrix factorisation methodology. Matrix factorisation allows for a good approximation of low-rank matrices with a small number of basis functions [Str05]. Façade images are usually of low rank due to many orthogonal and repetitive patterns. The approach presented by Ali et al. [AYRW09] utilises factorisation for a compression algorithm in order to overcome a memory transfer bottleneck and to render massive urban models directly from a compressed representation. Another method proposed by Liu et al. [LMWY09, LMWY13] aims at inpainting of missing image data. Their algorithm builds on studies of tensor completion using the trace norm and relaxation techniques. Façades are well suited for such algorithms due to their many repetitions (cf. Figure 21).
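The low-rank property can be illustrated with a small, dependency-free sketch. It computes a rank-1 approximation by power iteration on a toy façade whose intensities are the product of a row profile and a column profile. This is an illustration only and not the compression scheme of [AYRW09] or the completion algorithm of [LMWY09]; the function name, the iteration count and the toy data are assumptions, and the uniform start vector presupposes a non-negative image so that it is not orthogonal to the dominant singular vector.

```python
import math


def rank1_approximation(matrix, iterations=50):
    """Approximate `matrix` by sigma * u * v^T using power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # uniform start vector (assumption)
    u = [0.0] * rows
    sigma = 0.0
    for _ in range(iterations):
        # u <- A v, normalised to unit length
        u = [sum(matrix[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        # v <- A^T u; its length is the current singular value estimate
        v = [sum(matrix[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            break
        v = [x / sigma for x in v]
    return [[sigma * u[r] * v[c] for c in range(cols)] for r in range(rows)]


# Toy facade: bright window rows/columns on a darker wall (made-up values).
profile_rows = [1.0, 3.0, 1.0, 3.0]
profile_cols = [2.0, 5.0, 2.0, 5.0, 2.0]
facade = [[r * c for c in profile_cols] for r in profile_rows]
approx = rank1_approximation(facade)
```

For this toy image the factors need 4 + 5 + 1 numbers instead of 20, and the saving grows with image size as long as a few such terms describe the façade well.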

Figure 21.

Façade in-painting. The left image is the original image. Middle image: the lamp and satellite dishes together with a large set of randomly positioned squares have been selected as missing parts (80% of the façade shown in white). The right image is the result of the tensor completion algorithm proposed in [LMWY09], ©2009 IEEE.

While the processing of urban imagery is a well-researched topic, it still poses some challenges. In particular, the segmentation of façades is an active research direction, which we elaborate on in the next section.

4.2. Façade decomposition

Many different approaches for extraction of façade texture, structure, façade elements and façade geometry have been proposed. Most methods interpret façade reconstruction as an image segmentation problem, others define it as a feature detection challenge. Some resort to classical image processing tools which act locally, others treat the problem as a global one and usually propose grammars in order to fit a top–down model of the façade. While recent interactive algorithms, which we review in the next section, deliver very good results, automatic façade segmentation is still an error-prone problem.

In the first step, façade imagery is usually processed with classic image processing methods, like edge [Can86], corner [HS88] and feature [Low04, BETvG08] detection as basic tools to infer low-level structure. We omit low-level processing and for details we refer to textbooks, for example, Gonzalez and Woods [GW08], or Sonka et al. [SHB08].

The next step is to employ the low-level cues in order to infer more sophisticated structure, like floors or windows. Most earlier attempts were based on locally acting filtering and splitting heuristics, but it turned out that such segmentation is not enough to reliably detect structure in complex façades. The need for higher-order structure emerged, and many methods therefore turned to the detection of symmetry, which is widely present in architecture. These approaches often combine the low-level cues with unsupervised clustering [HTF09], with searching and matching algorithms, as well as with Hough transforms. Another trend of current research is towards machine learning [Bis09, HTF09] in order to fit elements in databases, or to infer façade structure with predefined grammars or rules. In this section, we provide an overview of these various approaches.

4.2.1. Heuristic segmentation

Wang and Hanson [WH01] and Wang et al. [WTT*02] proposed a system which aims at the generation of textured models and the detection of windows. They introduced a façade texture based on the weighted average of several source images projected on a (previously registered) block model, which serves both for texturing and for detection of further detail (i.e. windows). They proposed a heuristic oriented region growing algorithm which iteratively enlarges and synchronizes small seed-boxes until they best fit the windows in the texture. Another use of local image segmentation and heuristics is presented by Tsai et al. [TLLH05, TLH06, TCLH06], who calculate a ‘greenness index' to identify and suppress occlusions by vegetation on façade textures extracted from drive-by video sequences. They detect local mirror axes of façade parts in order to cover holes left after removing the occluding vegetation. However, the assumptions used in both methods, for example, that windows are darker than their surrounding façade, or the ‘greenness index', are weak and often lead to erroneous results.

Lee and Nevatia [LN04] proposed a segmentation method that uses only edges. They project the edges horizontally and vertically to get the marginal edge-pixel distributions and assume that these have peaks where window-frames are located. From the thresholded marginal distributions they construct a grid which approximates a subdivision of the façade. While the subdivisions are often quite good, on the downside, this approach depends very strongly on the parameters of the edge detector.
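The projection idea can be sketched in a few lines. The example below assumes a single binary edge map and a hand-picked threshold, both of which are simplifications for illustration; the function name `marginal_peaks` is hypothetical and the code is not the implementation of [LN04].

```python
def marginal_peaks(edge_map, threshold):
    """Return (rows, cols) whose edge-pixel count reaches `threshold`.

    edge_map: list of rows with entries 0 (no edge) or 1 (edge pixel).
    """
    row_hist = [sum(row) for row in edge_map]        # horizontal projection
    col_hist = [sum(col) for col in zip(*edge_map)]  # vertical projection
    rows = [i for i, n in enumerate(row_hist) if n >= threshold]
    cols = [j for j, n in enumerate(col_hist) if n >= threshold]
    return rows, cols


# Toy edge map containing the outline of one window frame.
edge_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
grid_rows, grid_cols = marginal_peaks(edge_map, threshold=3)
```

The returned row and column indices span a candidate grid. The sensitivity noted above is visible even here: a different edge detector setting changes the histograms, and with them the lines that survive the threshold.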

4.2.2. Symmetry and pattern detection

Symmetry abounds in architecture, which is mostly the result of economical manufacturing and aesthetic design.

In image processing, early attempts include [RWY95], who introduced a continuous symmetry transform for images. Later, Schaffalitzky and Zisserman [SZ99] detected groups of repeated elements in perspective images, and Turina et al. [TTvG01, TTMvG01] detected repetitive patterns on planar surfaces, also under perspective skew, using Hough transforms. They demonstrated that their method works well on building façades. Further, a considerable amount of work on this topic has been done by Liu and collaborators [LCT04]. They detected crystallographic groups in repetitive image patterns using a dominant peak extraction method from the autocorrelation surface. Other image processing approaches utilised the detected symmetry of regular [HLEL06] and near-regular patterns [LLH04, LBHL08] in order to model new images.
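As a simplified, one-dimensional illustration of peak extraction from an autocorrelation, the following sketch estimates the repetition period of an intensity profile, for example one image row crossing a series of windows. The cited work operates on two-dimensional autocorrelation surfaces and classifies symmetry groups, which this sketch does not attempt; the function name and the toy signal are assumptions.

```python
def dominant_period(signal, min_lag=1):
    """Return the lag with the highest mean-centred autocorrelation.

    Lags up to half the signal length are tested; ties are resolved in
    favour of the smallest lag, i.e. the fundamental period.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, n // 2 + 1):
        # Divide by the number of overlapping samples so that long lags
        # are not penalised merely for having fewer terms.
        overlap = n - lag
        score = sum(centred[i] * centred[i + lag] for i in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy profile: one bright window pixel every four samples.
profile = [0, 0, 1, 0] * 4
period = dominant_period(profile)
```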

Further approaches specialised in detecting affine symmetry groups in 2D images [LHXS05, LE06] and in 3D point clouds [MGP06, PSG*06]. Follow-ups of those methods introduced data-driven modelling frameworks for symmetrisation [MGP07] and 3D lattice fitting (cf. Figure 22) in laser-scans of architecture [PMW*08, MBB10].

Figure 22.

This example shows automatic symmetry detection results performed on point-clouds of architectural objects. Figure courtesy of Mark Pauly [PMW*08], ©2008 ACM.

This line of work eventually led to the insight that the repetitive nature of façade elements can be exploited to segment them. Berner et al. [BBW*08, BWM*11] and Bokeloh et al. [BBW*09] proposed a set of methods to detect symmetry in ground-based urban laser scans. A heuristic segmentation based on detection of symmetry and repetitions was proposed by Shen et al. [SHFH11]. Their method segments LiDAR scans of façades and detects concatenated grids. It automatically partitions the façade in an adaptive manner, such that a hierarchical representation is generated.

Detection of repeated structures in façade images was approached by Korah and Rasmussen, who introduced a method for automatic detection of grids [KR07a]. Others also approached this task, like Wenzel et al. [WDF08] and Musialski et al. [MRM*10], who proposed methods to detect rectilinear patterns in orthographic-rectified façade images using sparse image features. A similar method was also introduced by Zhao and Quan [ZQ11], who later extended their method to detect per-pixel symmetries [ZYZQ12].

Others detect symmetry directly in perspective images. For example, Wu et al. [WFP10] proposed a method to detect grid-like symmetry in façade images under perspective skew, which they have used to reconstruct dense 3D structure in a follow-up work [WACS11]. Park et al. [PBCL10] introduced a method to detect translational symmetry in order to determine façades.

Another approach has been pursued by Alsisan and Mitra [AM12], who propose a combination of grid detection and an MRF regularisation in order to provide a variation-factored façade representation. Tylecek and Sara [TS10] pursued a similar approach; both systems detect grids of windows in ortho-rectified façade images using a weak prior and MCMC optimisation. A framework for the detection of regularly distributed façade elements has been published by AlHalawani et al. [AYLM13]. In addition, detection of regularly placed structures has also been proposed for the segmentation of LiDAR data of façades [MS12], whose authors also adapt a voting scheme for lattice detection.

Recently, Nianjuan et al. [NTC11] proposed a method for detecting symmetry across multi-view networks of urban imagery. A similar setup was used by Ceylan et al. [CML*12] in order to detect reliable symmetry across multiple registered images, which is utilised to recover missing structure of buildings.

4.2.3. Learning and matching

Another group of methods specialises in the detection of windows and other pre-specified structural elements. Some rely on template matching, others try to detect more general shapes, like simple rectangles. The advantage of template matching is that the results look very realistic. However, the disadvantage is that the windows are in most cases not authentic because there is no template database that contains all possible shapes.

For example, Schindler and Bauer [SB03] matched shape templates against dense point clouds using supervised learning. Mayer and Reznik [MR07] matched template images from a manually constructed window image database against their façades. Müller et al. [MZWvG07] matched the appearance of their geometric 3D window models against façade image-tiles.

Some approaches combine template matching with machine learning, for example, Ali et al. [ASJ*07], who proposed to train a classifier, or Drauschke et al. [DF08], who used AdaBoost [SS99]. These systems identify a high percentage of windows even in images with perspective distortion.

Another approach, which is based on rectangles, is the window-pane detection algorithm by Cech and Sara [CS08], which identifies strictly axis-aligned rectangular pixel configurations in an MRF. Given the fact that the majority of windows and other façade elements are rectangular, a common approach to façade reconstruction is searching for rectangles or assuming that all windows are rectangular. Haugeard et al. [HPFP09] also introduced an algorithm for inexact graph matching, which is able to extract rectangular windows as sub-graphs of the graph of all contours of the façade image. This serves as a basis to retrieve similar windows from a database of images of façades.

An approach for segmentation of registered images captured at ground level into architectural units has been proposed by Zhao et al. [ZFX*10]. Recently, Dai et al. [DPSV12] tackled the problem of façade segmentation and labelling without using any prior knowledge.

Learning of features has also been used for other input data, like geometric models. Sunkel et al. [SJW*11] presented a user-supervised technique that learns line features in such models.

4.2.4. Façade parsing

The term façade parsing denotes methods that aim at knowledge-based object reconstruction, which means that they employ an a priori top–down model that is supposed to be fitted by cues derived from the data (i.e. images or laser scans). In fact, some methods utilise the concept of IPM presented in Section 'Inverse procedural modelling'. In a first step, a formal grammar is either predefined manually [ARB07, Rip08], or automatically determined in a bottom–up manner from the data [Bec09]. In a second step, the grammar is fitted according to the underlying data, which results in very compact representations.

Among the first to propose grammar-based segmentation were Alegre and Dellaert [AD04]. They introduced a set of rules from a stochastic context-free attribute grammar, and an MCMC solution to optimise the parameters. Mayer and Reznik [MR05, MR06, MR07] and Reznik and Mayer [RM07] published a series of papers in which they present a system for façade reconstruction and window detection by fitting an implicit shape model [LLS04], again using MCMC optimisation.

A single-view approach for rule extraction from a segmentation of simple regular façades was published by Müller et al. [MZWvG07]. They cut the façade image into floors and tiles in a synchronized manner in order to reduce it to a so-called irreducible form, and subsequently fit grammar-rules into the detected subdivision. This method is limited to rectilinearly distributed façades (cf. Figure 25). Van Gool et al. [vGZBM07] provided an extension which detects similarity chains in perspective images and a method to fit shape grammars to these.

Brenner and Ripperda [BR06, RB07, Rip08, RB09] developed, in a series of publications, a system for detecting façade elements, especially windows, from images and laser scans. In this work, a context-free grammar for façades is derived from a set of façade images and fitted to new models using the RJMCMC technique. Becker and Haala [BH07, BHF08, Bec09, BH09] presented, in a series of papers, a system which attempts to automatically discover a formal grammar. This system was designed for reconstruction of façades from a combination of LiDAR and image data.

Pu and Vosselman proposed a higher-order knowledge-driven system which automatically reconstructs façade models from ground laser-scan data [PV09b]. In a further approach, they combine information from terrestrial laser point clouds and ground images. The system establishes the general structure of the façade using planar features from laser data in combination with strong lines in images [PV09a, PV09c].

This topic is also of wide interest in the computer vision community. In an automatic approach, Koutsourakis et al. [KST*09] examine a rectified façade image in order to fit a hierarchical tree grammar. This task is formulated as an MRF [GG84], where the tree formulation of the façade image is converted into a shape grammar responsible for generating an IPM (cf. Section 'Inverse procedural modelling'). Teboul et al. [TSKP10] extend this work by combining a bottom–up segmentation through superpixels with top–down consistency checks coming from style rules. The space of possible rules is explored efficiently. In a recent follow-up they improve their method by employing reinforcement learning [TKS*11, TKS*13]. In their recent work [STK*12] they present a multi-view approach which uses depth maps and an evolutionary algorithm to determine the parameters of a predefined grammar. This allows them to segment not only single façades, but also 3D models of buildings (cf. Figure 23).

Figure 23.

Automatic 3D reconstruction of a building with multiple façades visible from the street. Figure courtesy of Olivier Teboul [STK*12].

Riemenschneider et al. [RKT*12] proposed an approach which uses generic grammars as a model and a set of irregular lattices in order to determine the structure of non-trivial façades. Other recent work by Martinovic et al. [MMWvG12] introduced a method to decompose the façade into three basic layers of different granularity and to apply probabilistic optimisation in order to obtain a semantic segmentation of the model. Another interesting approach was proposed by Yang et al. [YHQT12], who fit a binary split grammar by treating the façade as a matrix and decomposing it into rank-1 approximations.

While newer approaches based on IPM provide quite stable results, the quality and the level of detail of these methods are still not good enough for current demands. In practice, the expected quality for production is much higher; therefore, manual or interactive methods still have wide applicability.

4.3. Façade modelling

The previous section presented an overview of automatic façade-subdivision approaches. All these methods share the property that they create models of low or intermediate level of detail and complexity. Interactive approaches, on the other hand, promise better quality and higher level of detail.

An interactive image-based approach to façade modelling was introduced by Xiao et al. [XFT*08]. It uses images captured along streets and also relies on SfM as a source for camera parameters and initial 3D data. It considers façades as flat rectangular planes or simple developable surfaces with an associated texture. Textures are composed from the input images by projective texturing. In the next step, the façades are automatically subdivided using a split heuristic based on local edge detection [LN04]. This subdivision is then followed by an interactive bottom–up merging process. The system also detects reflectional symmetry and repetitive patterns in order to improve the merging task. Nonetheless, the system requires a considerable amount of user interaction in order to correct misinterpretations of the automatic routines.

Hohmann et al. [HKHF09] proposed a system for modelling of façades based on the GML shape grammar [Hav05]. Similar to the work of Aliaga et al. [ARB07], grammar rules are determined manually on the façade imagery and can be used for procedural remodelling of similar buildings.

Another interactive method for the reconstruction of façades from terrestrial LiDAR data was proposed by Nan et al. [NSZ*10], which is based on semi-automatic snapping of small structural assemblies, called SmartBoxes. We mention the method also in Section 'LiDAR-based modelling'.

Recently, Musialski et al. [MWW12] introduced a semi-automatic image-based façade modelling system (cf. Figure 24). Their approach incorporates the notion of coherence, which means that façade elements that exhibit partial symmetries across the image can be grouped and edited in a synchronized manner. They also propose a modelling paradigm where the user is in control of the modelling workflow, but is supported by automatic modelling tools, where they utilise unsupervised clustering in order to robustly detect significant elements in orthographic façade images. Their method allows modelling high detail in competitive time.

Figure 24.

Results of interactive modelling with the method of Musialski et al. [MWW12]. The façade image has been segmented into 1346 elements. ©2012 The Eurographics Association and Blackwell Publishing Ltd.

While interactive methods may seem too slow and not scalable, their high-quality output is of considerable value (refer to Figure 25). For this reason, we believe that, given the plethora of research in automatic computer vision algorithms, it will become equally important to study the efficient integration of automatic processing and user interaction in the future.

Figure 25.

Comparison of the results of the automatic method of [MZWvG07] (left, 409 shapes, excluding windows matched from a template library) to the interactive method of [MWW12] (right, 1878 shapes). Left image courtesy of Pascal Müller [MZWvG07].

5. Blocks & Cities

Measuring and documenting the world is the objective of the photogrammetry and remote sensing community. In the last two decades this problem has also been extended to the automatic reconstruction of large urban areas or even whole urban agglomerations. In addition, the computer vision and computer graphics communities have started contributing to the solutions. In this section, we mention several modern approaches which have been proposed in this vast research field.

The common property of large-scale approaches is the demand for minimal user interaction or, in the best case, no user interaction at all, which leads to the best possible scalability of the algorithms. There is quite a variety of methods, which work with aerial input data, ground-level input data, or both. It is difficult to compare these methods directly to each other since they have been developed in different contexts (types of input data, types of reconstructed buildings, level of interactivity, etc.). For this reason we do not attempt a comparison; we merely review the notable approaches and state their main contributions and ideas.

In large-scale reconstruction, there is a trend towards multiple input data types. Some publications involve aerial and ground-based input, some also combine LiDAR with imagery. Other methods introduce even more data sources, like a digital elevation model (DEM), a digital terrain model (DTM) or a digital surface model (DSM). Finally, some methods incorporate positioning systems, like the global positioning system (GPS), or local inertial navigation systems (INS).

In this survey, we omit a detailed discussion of remote sensing concepts and refer to further literature [CW11]. A number of papers up to the year 2003 have also been reviewed in a survey by Hu et al. [HYN03]. Haala and Kada [HK10] provide a survey of automatic approaches in the photogrammetry community. Another important related work is the International Society for Photogrammetry and Remote Sensing (ISPRS) test project on urban classification and 3D building reconstruction [RSJ*12], which aims at the evaluation of building detection, tree detection and 3D building reconstruction in the photogrammetry and remote-sensing community [RSGW13].

5.1. Ground-based reconstruction

One of the earlier approaches to reconstruct large urban areas was the work of Früh and Zakhor. They published a series of articles that aim at a fully automatic solution which combines imagery with LiDAR. First, they proposed an approach for automated generation of textured 3D city models with both high detail at ground level and complete coverage for the bird's-eye view [FZ03]. A close-range façade model is acquired at the ground level by driving a vehicle equipped with laser scanners and a digital camera under normal traffic conditions on public roads. A far-range DSM, containing complementary roof and terrain shape, is created from airborne laser scans, then triangulated, and finally texture-mapped with aerial imagery. The façade models are first registered with respect to the DSM using Monte Carlo localisation, and then merged with the DSM by removing redundant parts and filling gaps. In further work [FZ04], they improved their method for ground-based acquisition of large-scale 3D city models. Finally, they provided a comprehensive framework which features a set of data-processing algorithms for generating textured façade meshes of cities from a series of vertical 2D surface scans and camera images [FJZ05].

In the realm of image-based methods, Pollefeys et al. [PvGV*04] presented an automatic system to build visual models from images. This work is also one of the papers that pioneered fully automatic SfM of urban environments. The system deals with uncalibrated image sequences acquired with a hand-held camera and is based on features matched across multiple views. From these, both the structure of the scene and the motion of the camera are retrieved (cf. Section 'Structure from motion'). This approach was further extended by Akbarzadeh et al. [AFM*06] as well as Pollefeys et al. [PNF*08].

Another image-based approach is the work of Irschara et al. [IZB07, IZKB12] which provides a combined sparse-dense method for city reconstruction from unstructured photo collections contributed by end users (cf. Figure 26). Hence, the ‘Wiki' principle, well known from textual knowledge databases, is transferred to the goal of incrementally building a virtual representation of a local habitat. Their approach aims at large scale reconstruction, using a vocabulary tree [NS06] to detect mutual correspondences among images, and combines sparse point clouds, camera networks and dense matching in order to provide very detailed buildings.

Figure 26.

Examples of dense reconstruction after depth map fusion. Figure courtesy of Arnold Irschara [IZB07], ©2007 IEEE.

Xiao et al. [XFZ*09] proposed to extend their previous method [XFT*08] in order to provide an automatic approach to generate street-side photo-realistic 3D models from images captured along the streets at ground level. They employ a multi-view segmentation algorithm that recognises and segments each image at pixel level into semantically meaningful classes, such as building, sky, ground, vegetation, etc. With a partitioning scheme the system separates buildings into independent blocks, and for each block, it analyses the façade structure using priors of building regularity. The system produces visually compelling results; however, it clearly loses quality when compared to their previous, interactive approach [XFT*08].

Another system introduced by Grzeszczuk et al. [GKVH09] aims at fully automatic texturing of large urban areas using existing models from GIS databases and unstructured ground-based photographs. It employs SfM to register the images to each other in the first step, and then the iterative closest point (ICP) algorithm [BM92] in order to align the SfM 3D point clouds with the polygonal geometry from GIS databases. In further steps their system automatically selects optimal images in order to provide projective textures to the building models.
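The alternation at the heart of ICP, namely matching each point to its closest counterpart and then solving for the best rigid motion, can be sketched as follows. The sketch is a 2D point-to-point variant with brute-force matching, chosen for brevity; the cited system aligns 3D SfM points with polygonal geometry, and all function names and test points here are hypothetical.

```python
import math


def best_rigid_transform(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        cross += px * qy - py * qx
        dot += px * qx + py * qy
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    # Translation moves the rotated source centroid onto the target centroid.
    return angle, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def apply_transform(points, angle, tx, ty):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp(source, target, iterations=20):
    """Align `source` to `target`; returns the transformed source points."""
    current = list(source)
    for _ in range(iterations):
        # Correspondence step: closest target point for every source point.
        matches = [
            min(target, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            for p in current
        ]
        # Alignment step: closed-form rigid motion for these correspondences.
        angle, tx, ty = best_rigid_transform(current, matches)
        current = apply_transform(current, angle, tx, ty)
    return current


# Toy footprint corners and a slightly rotated and shifted copy.
target = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (0.0, 5.0), (0.0, 10.0), (7.0, 7.0)]
source = apply_transform(target, 0.05, 0.4, -0.3)
aligned = icp(source, target)
```

ICP only converges to the correct pose from a reasonable initial alignment, which is why the toy perturbation is kept small relative to the point spacing.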

A ground-level city modelling framework which integrates reconstruction and object detection was presented by Cornelis et al. [CLCvG08]. It is based on a highly optimised 3D reconstruction pipeline that can run in real-time, hence offering the possibility of online processing while the survey vehicle is recording. A compact textured 3D model of the recorded path is already available when the survey vehicle returns to its home base (cf. Figure 27). The second component is an object detection pipeline, which detects static and moving cars and localises them in the reconstructed world coordinate system.

Figure 27.

A collection of rendered images from the final 3D city model taken from various vantage points. Figure courtesy of Nico Cornelis [CLCvG08], ©2008 Springer.

In general, ground-based systems are usually limited to relatively small areas compared to airborne approaches. However, these methods are the only ones to provide small-scale details; thus, the objective is often the combination of both acquisition methods.

5.2. Aerial reconstruction

Aerial imagery is perhaps the most often used data source for reconstruction of urban environments, and has been explored in the photogrammetry and remote sensing community for many years. There has been a significant number of successful approaches in the past decade, like those of Baillard et al. [BZ99], the group of Nevatia et al. [NN01, NP02, KN04] or Jaynes et al. [JRH03]. Many approaches often combine imagery with other input data. In this section, we review several systems developed in recent years.

Wang et al. [WYN07] combined both aerial and ground-based imagery in a semiautomatic approach. The framework stitches the ground-level images into panoramas in order to obtain a wide camera field of view. It also detects the footprints of buildings in orthographic aerial images automatically, and both sources are combined, where the system incorporates some amount of user interaction in order to correct wrong correspondences.

Another multi-input method was proposed by Zebedin et al. [ZBKB08]. This framework combines aerial imagery with additional information from DEMs. They introduced an algorithm for fully automatic building reconstruction, which combines sparse line features and dense depth data with a global optimisation algorithm based on graph cuts [KZ04]. Their method also allows generating multiple LODs of the geometry. Karantzalos and Paragios [KP10] also proposed a framework for automatic 3D building reconstruction by combining images and DEMs. They developed a generalized variational framework which addresses large-scale reconstruction by utilising hierarchical grammar-based 3D building models as a prior. They use an optimisation algorithm on the GPU to efficiently fit grammar instances to the information extracted from images and the DEM.

Mastin et al. [MKF09] recently introduced a method for the fusion of 3D laser data and aerial imagery. Their work employs mutual information for registration of images with LiDAR point clouds, which exploits the statistical dependency in urban scenes. They utilise the downhill simplex optimisation to infer camera pose parameters and propose three methods for measuring mutual information between LiDAR and optical imagery.
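The registration criterion can be illustrated with a minimal sketch that computes the mutual information of two equally long sequences of quantised values, for instance image intensities and some LiDAR attribute rendered into the same view. Which attributes are compared and how they are quantised are assumptions of this sketch, and the pose search by downhill simplex is omitted; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information in bits between two aligned discrete sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))  # joint histogram of value pairs
    count_a = Counter(a)        # marginal histogram of a
    count_b = Counter(b)        # marginal histogram of b
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (count / n) * math.log2(count * n / (count_a[x] * count_b[y]))
    return mi


# Perfectly dependent labels share one bit; independent ones share none.
dependent = mutual_information([0, 0, 1, 1], [5, 5, 9, 9])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In a registration loop, a score of this kind would be evaluated for each candidate camera pose and the pose with the highest value retained.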

Another source for large-scale reconstruction is a DSM, which can be obtained automatically from aerial and satellite imagery. Lafarge et al. [LDZPD10] proposed to use a DSM in order to extract individual building models. Their method treats each building as an assembly of simple 3D parametric blocks, which are placed on the DSM by 2D matching techniques, and then optimised using an MCMC solver. The method provides individual building models of urban areas (cf. Figure 28).

Figure 28.

Automatic urban area reconstruction results from a DSM (left panel): without (middle panel) and with textures (right panel). Figure courtesy of Florent Lafarge [LDZPD10], ©2010 IEEE.

Others utilise imagery only. Liao et al. [LLM11] proposed a multi-view stereo algorithm for efficient reconstruction. Another recent method by Garcia-Dorado and Aliaga [GDA13] aims at the reconstruction of planar-hinged buildings. Generally, methods that use aerial data provide a number of benefits over ground-based approaches and are thus still an active field.

5.3. Massive city reconstruction

In this section, we mention several methods which employ fully automatic methodologies and also provide reconstructions of entire urban areas. One significant factor which allows for such vast reconstruction is the general technological progress in the data acquisition process, such as the easy access to huge collections of images on the Internet, or the presence of many accurate and large LiDAR data sets. The second, perhaps more important, factor is the development of smart and scalable reconstruction algorithms. No hardware advantage will compensate for exponentially scaling approaches; thus, the development of such algorithms remains a challenge.

In the image-based reconstruction domain, an impressive system was recently presented by Frahm et al. [FFGG*10]. It is capable of delivering dense structure from unstructured Internet images within 1 day on a single PC. Their framework extends to the scale of millions of images, which they achieve by extending state-of-the-art methods for appearance-based clustering, robust estimation, and stereo fusion (cf. Section 'Point Clouds & Cameras'), and by parallelising the tasks which can be efficiently processed on multi-core CPUs and modern graphics hardware.

Poullis and You introduced a method for massive automatic reconstruction from images and LiDAR [PY09a, PY09b, PY09c]. Their system automatically creates lightweight, watertight polygonal 3D models from airborne LiDAR. The technique is based on a statistical analysis of the geometric properties of the data and makes no particular assumptions about the input. It is able to reconstruct areas containing several thousand buildings, as shown in Figure 29. Recently, they extended their method to include texturing [PY11].

Figure 29.

Large-scale reconstruction of Downtown Denver and surrounding areas. The model is a polygonal mesh generated from airborne LiDAR data. Figure courtesy of Charalambos Poullis [PY09a]. ©2009 IEEE.

Zhou and Neumann also proposed a similar approach [ZN09, ZN11]. Generally, while the results of recent methods are very impressive, automatic large-scale reconstruction remains an open problem, and it becomes very difficult when the goal is a highly detailed and dense virtual urban habitat. The challenges lie in the management and processing of huge amounts of data, in the development of algorithms that are robust and automatic as well as fast and scalable, and finally, in the integration of many different types of data.

Lafarge and Mallet [LM11a, LM12] published an approach which aims at even more complete modelling from aerial LiDAR. Its advantage is that it not only reconstructs building models, but also the inherent vegetation and complex grounds. Furthermore, it is also generalised such that it can deal with unspecified urban environments, for example, with business districts as well as with small villages. Geometric 3D-primitives such as planes, cylinders, spheres or cones are used to describe regular roof sections, and are combined with mesh-patches to represent irregular components. The various geometric components interact through a non-convex optimisation solver. Their system provides impressive large-scale results as shown in Figure 30.

Figure 30.

Reconstruction of two large urban environments with closeup crops. Figure courtesy of Florent Lafarge [LM11a]. ©2011 IEEE.

6. Conclusions

Despite the large body of existing work on urban reconstruction, we believe it is still a great time to conduct research in this area. The topic is very important and the many open problems leave room for significant contributions.

Most importantly, the fully automatic algorithms rely on assumptions that are not met in practice, so that the quality of the obtained results is not sufficient for most applications. Contributions in this area might be difficult to obtain, but their impact can be significant.

Further, we believe that investigating the combination of interactive techniques and automatic algorithms is important from a practical standpoint. It is a common misconception that user interaction can be efficiently used as a post-processing step to clean up automatic results. Typically, an efficient framework can only be achieved with a tighter coupling of user interaction and automatic computation throughout the modelling pipeline.

There are several excellent examples of how the analysis of large photo collections can lead to impressive results. We believe that this remains a hot topic and that studying questions related to large-scale data analysis and using the modelling effort of many users will be fruitful.

Further, there is still more room for contributions related to higher-level shape analysis and understanding, such as symmetry detection, the reconstruction of functionality and IPM.

Finally, if high-quality urban models become more easily accessible and more widespread, the investigation of novel applications will become more attractive in a wider range of research fields.


This work has been partially supported by Austrian Science Funds (FWF): P23237-N23.