Omnidirectional Free-viewpoint Rendering Using a Deformable 3-D Mesh Model

Tomokazu Sato, Hiroyuki Koshizawa and Naokazu Yokoya
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
Manuscript received on 1 March 2009. E-mail: tomoka-s@is.naist.jp

This paper proposes a method to render free-viewpoint images from omnidirectional videos using a deformable 3-D mesh model. In the proposed method, a 3-D mesh is placed in front of a virtual viewpoint and deformed by using pre-estimated omnidirectional depth maps that are selected on the basis of the position and posture of the virtual viewpoint. Although our approach is fundamentally based on the model-based rendering approach, which renders a geometrically correct virtualized world, we avoid the hole problem by employing a viewpoint-dependent deformable 3-D model instead of the unified 3-D model that is generally used in model-based rendering. In experiments, free-viewpoint images are generated from an omnidirectional video captured by an omnidirectional multi-camera system to show the feasibility of the proposed method for walk-through applications in the virtualized environment.


I. INTRODUCTION
One of the typical goals of representation and modeling of a large-scale 3-D environment is to create a high-quality virtualized world based on the real environment. In recent years, virtualized worlds based on the real world have been released by Microsoft (Virtual Earth) and Google (Google Earth). These virtualized worlds now enable virtual sightseeing and navigation, and will be used for a wide range of applications such as entertainment, digital archiving, and education. However, details of the real world are omitted from the current versions of the virtualized worlds because of the cost of 3-D modeling. Thus, the realism of the virtualized world has not yet reached a sufficient level for some applications.
Many studies have attempted to automatically construct a virtualized real world to reduce the human cost involved in modeling. Most of these studies can be categorized according to the rendering policy of the virtualized world. One such rendering policy is model-based rendering (MBR) and the other is image-based rendering (IBR). MBR methods render the virtualized world on the basis of explicit 3-D models. The key problem in MBR is the automatic generation of explicit 3-D models of the real world. There are many types of vision-based methods: shape from shading, silhouette, focus and defocus, motion, stereo, etc. In vision-based methods, a combination of shape from motion and stereo is often employed for the 3-D modeling of an outdoor environment [1][2][3][4]. In these methods, the camera parameters of input images are first calibrated using a structure-from-motion algorithm, and a depth map for each input image is then generated. Finally, the estimated depth maps are fused into a unified 3-D model. Goesele et al. [1] developed a method to estimate the geometry of an outdoor environment from community photographs. Merrell et al. [2] proposed a fast algorithm to estimate the geometry of an outdoor environment using images taken by a car-mounted camera system. Car-mounted laser rangefinders have also been used for the 3-D modeling of outdoor environments [5][6][7][8]. Although both vision-based and laser-scanner-based methods realize (semi-)automatic 3-D modeling of large outdoor environments, the obtained 3-D models cannot be directly employed as 3-D CG models placed in the virtual world because these models include many holes resulting from occlusions and depth estimation errors. Although some of the model-based approaches relax the hole problem by assuming that the scene is constructed of planes [4,8], this assumption works only for artificial scenes. Even with state-of-the-art methods, complete 3-D modeling of a large and complex outdoor environment is very difficult because of the invisible and immeasurable parts that depend on the observation positions.
On the other hand, IBR methods render virtual viewpoint images without explicit 3-D geometry [9]. There are several types of IBR approaches. The simplest method for novel view synthesis by the IBR approach is the morphing-based method, which directly warps the images using corresponding points in a pair of images [10,11]. By using this method, we can generate realistic images for a virtual camera placed between the original camera positions. However, the rendered image is easily distorted when the virtual viewpoint is set at a point far from the original viewpoint. Image-ray-based methods [12][13][14] reconstruct an image for the virtual viewpoint by collecting rays from input images. In this approach, the quality and realism of the reproduced image generally depend on the number of collected images. The weakness of the light-field-based methods without geometry is the difficulty of acquiring and handling a considerably large number of images. Several researchers have proposed methods for compressing the light field in order to solve the problem of handling a considerably large number of images [15]. However, the acquisition stage for input images still requires a considerable human cost in the case of a large environment, and this makes the development of the virtualized real world difficult. If a sufficient number of images (rays) cannot be collected, image distortion such as an unnatural aspect-ratio change of objects will appear in the generated images.
In order to avoid the problems of holes and distortion, some methods employ a hybrid approach of IBR and MBR that uses view-dependent geometry and texture [16,17]. Irani et al. [16] determined each pixel value of the generated image by estimating the depth from the virtual viewpoint. In this method, the depth value of a pixel is automatically determined by using photo-consistency. These conventional methods that use view-dependent geometry and texture can generate natural images even if the viewpoint is set at a point far from the original viewpoint. However, on-demand depth estimation is not suitable for interactive applications like a walk-through in the virtualized real world because a considerable computational cost is incurred for each loop of the rendering stage.
In this paper, we propose a novel-view synthesis method that renders free-viewpoint images using a 3-D mesh model that is deformed by considering pre-estimated depth maps for the original viewpoints. In our approach, by selecting and merging appropriate depths and textures from several viewpoints, realistic images without holes can be generated even when the virtual viewpoint is set at a point away from the original viewpoint. The contributions of this paper are summarized as follows: (1) the geometry of the scene for the virtual viewpoint can be immediately recovered by fitting a deformable mesh model to pre-estimated omnidirectional depth maps for the original viewpoints, (2) the 3-D mesh model is deformed into a shape that fits the scene structure so that no holes appear in the generated images, and (3) omnidirectional free-viewpoint rendering is achieved by using omnidirectional video sequences as input.
The rest of this paper is organized as follows: In Section 2, the method for free-viewpoint rendering using a deformable 3-D mesh model is described. Section 3 presents experimental results for walk-through applications in the virtualized world, and finally Section 4 summarizes the present study.

II. FREE-VIEWPOINT RENDERING USING VIEW-DEPENDENT 3-D MESH MODEL
Fig. 1 shows a flow diagram of the proposed method for free-viewpoint rendering. In this research, in order to render the images for arbitrary directions in the virtualized world, input images are captured by an omnidirectional multi-camera system (OMS). By automatically tracking the feature points on the captured images, camera parameters of the OMS are estimated using the structure-from-motion algorithm designed for OMS [18]. The depth map for each input image is then estimated by multi-baseline stereo [19]. At the rendering stage, a 3-D mesh model is placed in front of the virtual viewpoint and is deformed so as to minimize an energy function that expresses consistency between the resulting view-dependent depth map and pre-estimated depth maps for original viewpoints. After deforming the mesh model, an appropriate texture for each polygon is mapped onto the deformed mesh from the original images.

Acquisition of input data
At the first stage, an omnidirectional video is taken as input data by moving an OMS in the target environment. Camera parameters of the OMS are estimated by the structure-from-motion algorithm for OMS [18]. Although the structure-from-motion algorithm estimates not only the camera parameters but also the 3-D positions of feature points, denser 3-D information is necessary in order to synthesize images of the target scene. Therefore, after estimating the camera parameters, the multi-baseline stereo algorithm for OMS [19] is applied to the input images in order to acquire the depth information for every feature point. Finally, omnidirectional dense depth maps are generated using depth interpolation. For depth interpolation, the 2-D feature points in the omnidirectional images are first triangulated by using Delaunay triangulation [20]. The depth value of each pixel is then determined by computing the depth for each triangle in 3-D space. The following processes use the omnidirectional video, dense depth maps, and camera parameters acquired at this stage.
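The per-triangle depth computation can be illustrated as a ray/plane intersection. The following is only a minimal sketch, not the authors' implementation: it assumes that the camera centre is the origin, that the viewing ray of a pixel is given as a unit vector, and that the pixel has already been assigned to a triangle by the 2-D triangulation. The function name `depth_on_triangle` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def depth_on_triangle(ray_dir: Vec3, tri: Sequence[Vec3]) -> Optional[float]:
    """Distance along a unit viewing ray (camera centre at the origin) to the
    plane of a 3-D triangle. Returns None if the ray is parallel to the plane
    or the plane lies behind the camera. (Hypothetical helper.)"""
    normal = cross(sub(tri[1], tri[0]), sub(tri[2], tri[0]))
    denom = dot(normal, ray_dir)
    if abs(denom) < 1e-12:
        return None
    t = dot(normal, tri[0]) / denom
    return t if t > 0 else None
```

Because the sketch intersects the ray with the triangle's plane rather than interpolating depth linearly in the image, it stays valid for the wide fields of view of omnidirectional images; it does not itself test whether the ray falls inside the triangle.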

Generation of view-dependent 3-D model
As shown in Fig. 2, before beginning the rendering stage, a 3-D mesh model is initially placed in front of the virtual viewpoint as a plane model whose distance from the virtual viewpoint is $F$ and which is parallel to the image plane of the virtual camera. After mesh initialization, the 3-D mesh model is deformed by minimizing an energy function that depends on the position and posture of the virtual viewpoint. In the following, the energy function that expresses the consistency of depth information is defined first. The method for selecting appropriate depth maps for computing the energy function and the method for deforming the mesh model are then detailed.
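The mesh initialization described above can be sketched as follows. This is an illustrative sketch only: the grid resolution (`cols`, `rows`), the physical extent of the plane (`width`, `height`), and the function name `make_plane_mesh` are assumptions introduced here and are not specified in the text.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Tri = Tuple[int, int, int]


def make_plane_mesh(cols: int, rows: int, width: float, height: float,
                    F: float) -> Tuple[List[Vec3], List[Tri]]:
    """Regular grid of (cols+1) x (rows+1) vertices on the plane z = F in the
    camera coordinate system, centred on the optical axis, with each grid
    cell split into two triangles. (Hypothetical helper.)"""
    verts: List[Vec3] = []
    for r in range(rows + 1):
        for c in range(cols + 1):
            x = (c / cols - 0.5) * width
            y = (r / rows - 0.5) * height
            verts.append((x, y, F))
    tris: List[Tri] = []
    for r in range(rows):
        for c in range(cols):
            i = r * (cols + 1) + c  # corner (r, c) of the current cell
            tris.append((i, i + 1, i + cols + 1))
            tris.append((i + 1, i + cols + 2, i + cols + 1))
    return verts, tris
```

Every vertex starts at depth $F$; only the depth along each viewing ray changes during deformation, so the triangle connectivity built here can be reused for every frame.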

Definition of energy function
Each vertex on the mesh model is moved so as to minimize the energy function that expresses the consistency with the depth data of the original viewpoints. The energy is minimized when the 3-D mesh is deformed to fit the depth maps of the original viewpoints.
As shown in Fig. 2, in the proposed method, the destination position $\hat{p}_i$ of the $i$-th vertex $p_i$ (whose 3-D position in the camera coordinate system is $(x_i, y_i, F)^{\top}$) in mesh deformation is constrained to the straight line connecting the virtual viewpoint $v$ and the initial position $p_i$. This straight line is expressed by a single parameter $d$ in the camera coordinate system as follows:

$$\hat{p}_i(d) = \frac{d}{F}\,(x_i, y_i, F)^{\top} \qquad (1)$$

The parameter $d$, which expresses the depth of $\hat{p}_i$ in the camera coordinate system, is determined by minimizing the energy $E_i(d)$ for the $i$-th vertex, defined as follows:

[...] (2)

Here, [...] represents the set of frame indexes of the original viewpoints $(c_{i1}, c_{i2}, \ldots, c_{iN})$, and the value 0 is assigned when [...] is occluded by other objects from the $j$-th original viewpoint $c_{ij}$. It should be noted that if most of the original viewpoints satisfy the above-mentioned occluding condition, the energy $E_i$ will be unstable because the number of original viewpoints used for energy determination is very small. Therefore, the computation of the energy is skipped for depth $d$ when the number of original viewpoints that are used for energy determination is $M$ or lower for that depth.
The energy $E_i(d)$ is minimized when the weighted sum of distances between the 3-D position $\hat{p}_i(d)$ and the 3-D models generated from the pre-estimated depth maps is minimized. The 3-D mesh model that is most consistent with the pre-estimated geometry can be automatically generated by minimizing this energy function $E_i(d)$.
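The per-vertex minimization can be sketched as an exhaustive search over candidate depths. The sketch below is not the energy of Eq. (2): it assumes a normalized weighted sum of per-viewpoint distances, which is one form consistent with the description above. Each viewpoint is modelled as a hypothetical pair `(weight, residual)`, where `residual(point)` returns the distance between the candidate point and the surface seen from that viewpoint, or `None` when the point is occluded.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# (weight, residual function); both are assumptions made for illustration.
View = Tuple[float, Callable[[Vec3], Optional[float]]]


def point_on_ray(p_init: Vec3, F: float, d: float) -> Vec3:
    """Scale the initial vertex (on the plane z = F) so that its depth is d."""
    s = d / F
    return (p_init[0] * s, p_init[1] * s, p_init[2] * s)


def energy(point: Vec3, views: Sequence[View], M: int) -> Optional[float]:
    """Assumed energy: weighted mean of distances over non-occluded views.
    Returns None (energy skipped) when M or fewer views remain."""
    num = den = 0.0
    used = 0
    for weight, residual in views:
        r = residual(point)
        if r is None:  # occluded from this viewpoint
            continue
        num += weight * r
        den += weight
        used += 1
    if used <= M or den == 0.0:
        return None
    return num / den


def best_depth(p_init: Vec3, F: float, depths: Iterable[float],
               views: Sequence[View], M: int) -> Optional[Tuple[float, float]]:
    """Exhaustive search: return (depth, energy) with the lowest energy."""
    best = None
    for d in depths:
        e = energy(point_on_ray(p_init, F, d), views, M)
        if e is not None and (best is None or e < best[1]):
            best = (d, e)
    return best
```

Returning `None` from `energy` mirrors the rule that the energy is skipped when too few viewpoints contribute, so such depths can never be chosen as the minimum.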

Selection of depth map
As described above, the energy is computed on the basis of the depth maps of the original viewpoints. In the proposed method, as shown in Fig. 4, the depth maps used for computing the energy $E_i$ are selected based on the distance between the original viewpoint $c_{ij}$ and the ray $r_i$ that connects the virtual viewpoint $v$ and the vertex $p_i$. Concretely, the Euclidean distance $l_{ij}$ from the ray $r_i$ to the original viewpoint $c_{ij}$ is first computed for each $j$. The $N$ original viewpoints $(c_{i1}, c_{i2}, \ldots, c_{iN})$ nearest to the ray $r_i$ and their associated depth maps are then selected based on the distance $l$. These viewpoints are selected on the basis of the idea that there are fewer occluders between these viewpoints and the vertex $p_i$ than there are for the other viewpoints. Images of the viewpoints selected in this process are also used in the texture selection process for the same reason.
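The viewpoint selection can be sketched as a nearest-$N$ query on point-to-ray distances. This is a sketch under assumptions: the ray is taken to start at the virtual viewpoint and extend through the vertex (so viewpoints behind the virtual camera are measured to the ray origin), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def distance_to_ray(c: Vec3, origin: Vec3, through: Vec3) -> float:
    """Euclidean distance from point c to the ray that starts at `origin`
    and passes through `through`."""
    d = tuple(t - o for t, o in zip(through, origin))
    dd = sum(x * x for x in d)
    if dd == 0.0:
        raise ValueError("ray direction is undefined")
    w = tuple(ci - o for ci, o in zip(c, origin))
    # Parameter of the closest point, clamped so it stays on the ray.
    t = max(0.0, sum(a * b for a, b in zip(w, d)) / dd)
    closest = tuple(o + t * x for o, x in zip(origin, d))
    return math.dist(c, closest)


def select_viewpoints(v: Vec3, p: Vec3, centres: Sequence[Vec3],
                      N: int) -> List[Tuple[int, float]]:
    """Indexes and distances of the N camera centres nearest to the ray
    from v through p, ordered from nearest to farthest."""
    ranked = sorted((distance_to_ray(c, v, p), j)
                    for j, c in enumerate(centres))
    return [(j, l) for l, j in ranked[:N]]
```

Sorting all candidates is the simplest choice; for long sequences a spatial index over the camera path would avoid scanning every frame for every vertex.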

Deformation of mesh model
Except for the initial iteration of the rendering stage, the 3-D mesh model generated for the previous virtual viewpoint is used as an initial mesh model. This initial model can also be used to limit the search range for the depth value. This limitation for the depth search decreases the computational cost for a 3-D mesh deformation in each frame. Fig. 5 illustrates the limited range for the depth search when the virtual viewpoint is moved forward.
The surface (A) in Fig. 5 illustrates the mesh model that is generated for the previous camera position. In the proposed method, the search range for the depth value is limited to the region between the surfaces (B) and (C) that are placed around (A). Concretely, as shown in Fig. 6, [...] where $R_d$ is a given ratio that determines the size of the search range. It should be noted that if occluding edges exist in the scene, this scheme does not work well because the true depth will be outside the search range. To avoid this problem, when the minimized energy $E_i$ at the depth $d_{\min E}$ exceeds a given threshold, the depth value $d$ is searched again using Eq. (5) without limiting the search range.
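The limited search with its fallback can be sketched as follows. The exact bounds are not recoverable from the text, so this sketch assumes a symmetric range of $\pm R_d$ times the previous depth; the uniform sampling, the `steps` parameter, the global limits `d_min` and `d_max`, and all function names are likewise assumptions.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Energy = Callable[[float], Optional[float]]  # None means "energy skipped"


def sample_range(lo: float, hi: float, steps: int) -> List[float]:
    """Uniformly spaced candidate depths in [lo, hi]."""
    if steps < 2:
        return [lo]
    return [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]


def exhaustive_min(energy_at: Energy,
                   candidates: Iterable[float]) -> Optional[Tuple[float, float]]:
    best = None
    for d in candidates:
        e = energy_at(d)
        if e is not None and (best is None or e < best[1]):
            best = (d, e)
    return best


def search_depth(energy_at: Energy, d_prev: float, R_d: float,
                 d_min: float, d_max: float, steps: int,
                 threshold: float) -> Optional[Tuple[float, float]]:
    # Assumed bounds: the previous depth scaled by (1 - R_d) and (1 + R_d).
    lo = max(d_min, d_prev * (1.0 - R_d))
    hi = min(d_max, d_prev * (1.0 + R_d))
    best = exhaustive_min(energy_at, sample_range(lo, hi, steps))
    if best is None or best[1] > threshold:
        # Possibly an occluding edge: search again over the full range and
        # keep whichever result has the lower energy (a choice of this sketch).
        full = exhaustive_min(energy_at, sample_range(d_min, d_max, steps))
        if full is not None and (best is None or full[1] < best[1]):
            best = full
    return best
```

The fallback makes the worst case as expensive as an unrestricted search, so the saving depends on how rarely the threshold is exceeded.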

View-dependent texture mapping
After deforming the mesh model, an appropriate texture image for each patch is selected from the images of the original viewpoints and mapped onto the mesh model. Here, for each triangular patch on the 3-D mesh model, the frame number $f$ that maximizes the following function $R_f$ is selected as the texture frame for the patch: [...] where $k$ indicates the index of a vertex that constitutes the patch, and $f_k$ and $w_{kj}$ are the index list and the weighting function, respectively, which are defined in Section 2.2.1. The function $R_f$ increases if the $f$-th frame is used multiple times for determining the depth of the vertices of the patch. The weight $w$ for each viewpoint is also taken into account by this function. For each patch of the mesh model, the image frame that maximizes $R_f$ is selected, and the corresponding region of the selected image is then mapped to the patch as the texture.
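The texture-frame selection can be sketched as a weighted vote. The expression for $R_f$ is not recoverable from the text, so the sketch assumes that $R_f$ is the sum of the weights $w_{kj}$ over all vertices $k$ of the patch whose index list contains frame $f$, which matches the behaviour described above. The data layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# For each vertex k: the (frame index f_kj, weight w_kj) pairs that were used
# to determine its depth. This layout is an assumption for illustration.
VertexFrames = Sequence[Tuple[int, float]]


def select_texture_frame(patch: Sequence[int],
                         frames_of_vertex: Dict[int, VertexFrames]) -> int:
    """Return the frame index with the largest accumulated weight over the
    vertices of the patch (assumed form of R_f). Ties go to the lowest
    frame index. Raises ValueError if no frame is available."""
    score: Dict[int, float] = defaultdict(float)
    for k in patch:
        for f, w in frames_of_vertex[k]:
            score[f] += w
    return max(score, key=lambda f: (score[f], -f))
```

For example, if frame 11 contributed to all three vertices of a patch while frame 10 contributed to two, frame 11 wins unless its weights are much smaller.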

III. EXPERIMENTS
In order to verify the feasibility of the proposed method, we have carried out experiments with free-viewpoint image generation from the omnidirectional image sequences of a real outdoor environment.

Acquisition of input data
In this experiment, an omnidirectional multi-camera system, the Point Grey Research Ladybug, is used in order to acquire real image sequences. The Ladybug consists of six camera units for capturing an omnidirectional view; each camera unit captures a perspective video whose resolution is 1024 × 768 pixels. Fig. 7 shows an example frame of the omnidirectional video captured in the target real environment. In this experiment, a total of 500 frames (3,000 images) are used as input video. We first estimate the extrinsic camera parameters for this video [18]. Fig. 8 shows the estimated camera position and posture for every 20th frame and the camera path for all the frames. Point clouds in this figure indicate the 3-D positions of feature points estimated by the structure-from-motion process.
By using the estimated camera positions and postures, the multi-baseline stereo method for OMS [19] is applied to the input images in order to acquire the omnidirectional depth map for each frame. Fig. 9 (a) shows the panoramic omnidirectional image generated from the images shown in Fig. 7, and (b) is the corresponding depth map acquired by the method described in Section 2.1.

Free-viewpoint rendering for straight routes
First, free-viewpoint images are synthesized for three different routes, as illustrated in Fig. 10, by using the parameters shown in Table 1.
(Route A) straight route in which the virtual camera moves along the original camera path.
[Fig. 9: (a) Panoramic image warped from images in Fig. 7; (b) Depth map estimated for (a)]
Figs. 11 and 12 show the generated images for Routes A and B, respectively. In these figures, the depth maps of the generated mesh model for the direction (b) are also shown. From these results, it can be confirmed that there are no holes in the generated images. For Route A, there is very little geometric distortion in the generated images. However, discontinuous textures appear around the center of the generated images. This is mainly because the frame selected as the texture changes greatly between adjacent meshes. To resolve this problem, a photometric correction of textures for adjacent meshes is necessary. It should be noted that the black object shown in the bottom part of Figs. 11(a)-(c) is a handle of a camera mount; it is unrelated to this problem.
For Route B (Fig. 12), the images are generated without large distortions if the distance from the viewpoint S to the virtual viewpoint is 2 m or shorter. However, when the distance becomes 4 m or more, obvious distortion can be observed around the trees in the scene. One of the reasons for this problem is the insufficient resolution of the mesh model. As shown in the depth maps in Fig. 12, textures on the building are mapped onto the position of the tree because of the sparse depth map. In order to solve this problem without a large additional cost, adaptive mesh division would be effective.
Fig. 13 shows the generated images for Route C. In this route, 300 images are generated for the zigzag camera path moving around the original camera path. As shown in this figure, the images generated for this route are natural for most of the scene. However, distortion appears in some frames around the edges of the ground in the generated video. As shown in this figure, rippling edges that should be straight in the real world are easily perceived as unnatural by human observers. In order to reduce this unnatural impression, constraints that can be extracted from the original images should be introduced into the energy function in future work. For example, lines can easily be detected in the original images, and they can be used as constraints in mesh deformation to reduce distortions. Temporal and spatial smoothness constraints would also relax the distortion problem.
Table 2 lists the average time for each process of the proposed method when a PC (Intel Core2Duo E8600 3.33 GHz, 16 GB memory) is used. The rendering system uses a GPU (NVIDIA GeForce GTX285, 2 GB texture memory) for texture mapping, and all the images are stored in the texture memory in advance.
Except for the initial iteration of the rendering stage, approximately 1.3 seconds are needed to render a single free-viewpoint image, and most of this time is consumed by deforming the 3-D mesh model. This cost is due to the exhaustive search for the minimum energy within the limited range of depth. In order to realize an interactive walk-through system, we must implement a faster method for energy minimization, such as gradient descent, together with multi-threaded programming.

IV. CONCLUSION
In this paper, we have proposed an omnidirectional free-viewpoint rendering method that uses a view-dependent 3-D mesh model. In the proposed method, a 3-D mesh model is deformed by minimizing the energy function that expresses the surface consistency with the depth maps for the original camera positions. For the deformed mesh model, appropriate textures are selected and mapped to synthesize an image of a virtual viewpoint. In experiments, free-viewpoint images were generated for several directions and positions by using omnidirectional images and depth maps. From the generated images, we have confirmed that natural images without holes are generated if the virtual viewpoint is set around the original viewpoints. However, distortion of textures was observed in some parts of the generated images. In future work, in order to improve the quality of the generated images, we will investigate a method for introducing pre-detectable knowledge, such as straight edges in the original images, as constraints on mesh deformation. Adaptive mesh division and temporal and spatial smoothness constraints in mesh deformation would also relax the distortion problem.