Virtual Table-Teleporter: Image Processing and Rendering for Horizontal Stereoscopic Display

We describe a new architecture composed of software and hardware for displaying stereoscopic images over a horizontal surface. It works as a "Virtual Table and Teleporter", in the sense that virtual objects depicted over a table have the appearance of real objects. This system can be used for visualization and interaction. We propose two basic configurations: the Virtual Table, consisting of a single display surface, and the Virtual Teleporter, consisting of a pair of tables for image capture and display. The Virtual Table displays either 3D computer generated images or previously captured stereoscopic video and can be used for interactive applications. The Virtual Teleporter captures and transmits stereoscopic video from one table to the other and can be used for telepresence applications. In both configurations the images are properly deformed and displayed for horizontal 3D stereo. In the Virtual Teleporter two cameras are pointed at the first table, capturing a stereoscopic image pair. These images are shown on the second table, which is, in fact, a stereoscopic display positioned horizontally. Many applications can benefit from this technology, such as virtual reality, games, teleconferencing, and distance learning. We present some interactive applications that we developed using this architecture.


Introduction
Stereoscopic technology is becoming increasingly common and, as a consequence, cheaper and more widely accessible [1], [2].
Most stereoscopic applications use simple adaptations of non-stereoscopic concepts in order to give the observer a sense of depth. This is true, for example, in the case of 3D movies, where two versions are usually released, one to be watched in a stereoscopic movie theater and the other to be watched in a normal theater.
We are exploring the use of stereoscopic technology by changing the usual paradigm, which tries to give the observer a "sense of depth", to a new paradigm that gives the observer a "sense of reality". We call it a "sense of reality" when, in addition to giving a sense of depth to the image, the setting is presented in such a way that it is compatible with real objects in the real world. Normal 3D movies do not implement the sense of reality because of the following:
• The screen is limited; thus, points near the border can be shown without their stereo correspondence. This is not a problem if the whole scene appears "inside" the screen, but it is a problem if the scene extends outside the screen.
• The objects presented in a movie are usually floating in space because the scene is not grounded to the real world floor.
• Many scenes typically present a very large range of depths, which cannot be exhibited by the current stereoscopic technology.
• The zoom parameter of the camera is usually chosen in order to capture the scene in the same way as a regular movie, which magnifies portions of the scene.
The above aspects make it difficult for the observer to believe that the content, although presented in 3D, is actually real. To be physically plausible, the content presented on the screen must make sense when viewed as part of the environment that surrounds it. This goal can be achieved by making four changes to the stereoscopic system:
(1) Presenting the 3D stereo content on a horizontal support, leveling the floor with the screen.
This establishes a link between virtual objects and the screen. This link makes the result appear more real compared to the exhibition of virtual objects flying in front of a vertical screen.
(2) Not presenting a scene whose projected points on the border of the screen are closer to the observer than the screen.
If a 3D point on the left or right border of the screen is closer to the observer than the screen, then one of its corresponding stereoscopic projections will not be exhibited due to the screen limitation. That means that it will generate a stereoscopic pair that does not correspond to a 3D scene. If the stereoscopic projections of an object cross the top border, but do not cross the lateral borders, then the scene will not be well accepted by the observer either, although the stereoscopic pair corresponds to a 3D scene. In this case the problem is that the border limitation corresponds to a 3D cut in the object, causing the top of the projection to be perfectly aligned with the top border of the screen. Besides the fact that the 3D cut makes the scene odd, the alignment between the border and the cut implies that the observer had to be placed in a very specific position in order to be able to see it. Moreover, it means that the stereoscopic projections are images that do not satisfy the generic viewpoint assumption [3], which can cause interpretation problems. Finally, if the stereoscopic projections cross the bottom border, then they will suffer from the same problems as those that cross the top border, plus the fact that they will correspond to floating objects.
(3) Constraining the scale of the scene based on some physical reference.
This can be achieved by changing the cinematography technique. For example, 3D stereo movies adopt the classic film language used for 2D films. As a consequence, they employ different framing techniques, such as close-ups and medium and long shots, that cause the objects in a scene to change size relative to the screen. This practice impairs the sense of reality with the physical world. The problem is avoided by establishing a fixed scaled correspondence between the displayed scene and the real environment.
(4) Restricting the field of view to encompass the objects.
In standard 3D stereo movies, the fact that the cameras are positioned parallel to the ground implies a wide range of depth, including elements far from the center of interest of the scene. Conversely, in stereoscopic images produced for display over a table, the camera is oriented at an oblique angle in relation to the ground. This limits the maximum depth of the scene and favors the use of stereoscopic techniques.
Devices that use horizontal stereo have already appeared in the patent literature, as in [4], [5] and [6], and have also been explored by the computer graphics community, as in [7] and [8]. The method that we use to generate synthetic stereoscopic pairs is similar to the one used in these works, and corresponds to the problem of generating the image on the floor of a CAVE [9]. This problem is explained in Section 2. Our work differs in the method used for exhibiting stereoscopic pairs captured by cameras. For example, in [7] the authors reconstruct a 3D model of the object from its silhouette and then use it to render the stereoscopic pair. The problem with this technique is the low quality of the result. In our system, we solve the problem by using image processing; more precisely, we apply a homography previously estimated by a computer vision process. The details of this process are explained in Section 3.
The main contribution of this paper is to introduce a new architecture, composed of software and hardware, that works as a "Virtual Table-Teleporter". The system displays a 3D stereoscopic scene over a horizontal viewing table. It can also capture the stereoscopic appearance of a set of objects distributed over a table and transmit the content in real time to the stereoscopic viewing table. This kind of setting can be very useful in applications such as a teleconference, allowing a group of people to share virtual representations of objects positioned on a table. The technology presented can also be used to capture and display (in a scaled-down fashion) a theater play, a sports match (such as tennis, basketball, etc.), or any other event that takes place on a horizontal field.

Rendering Horizontal Stereoscopic Images
There are two issues that should be taken into account when rendering stereoscopic pairs that will be presented as objects over horizontal displays:
• The cameras do not have to point to the object to be captured. Instead, their view directions must be orthogonal to the planar surface that supports the object.
• The intrinsic parameters of each camera must be chosen in such a way that the projection plane coincides with the horizontal display and the view frustum encompasses the object, although the camera is not pointed at it.
This camera setup makes the virtual object stand over the display for a user whose eyes are in the same positions as the optical centers of the cameras. This is because each ray emitted by the object and passing through an eye has the same color as the corresponding pixel in the horizontal display; thus each eye sees the same image whether it comes from the real object or from the display. Figure 1(a) illustrates this setup.
The fact that the camera is not pointed at the object can be non-intuitive. This is because the view direction, which is always orthogonal to the projection plane, is usually close to the direction of the object. This is so common that the OpenGL utility library (GLU) uses the expression "look at" as part of the name of a function used for defining the view direction as well as the other extrinsic parameters of a camera. As a consequence, anyone who intends to use the gluLookAt function for rendering virtual objects over horizontal displays must keep in mind that the function loses its "look at" sense, since the camera will not point to the object.
Besides the view frustum skewing due to the non-coincidence of the view and object directions, presented in Figure 1(a), there is another skewing in the orthogonal direction, presented in Figure 1(b). This skewing occurs as a consequence of using the same rectangular region over the projection plane as the border for the images captured by both cameras. This skewing is not exclusive to horizontal stereo; it exists whenever stereoscopic images have to be superimposed on the same screen.
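The two skews described in this section can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions, not the code used in the prototypes: the display is taken to lie in the plane z = 0, the eyes are assumed to be separated along the x axis, and the function names (off_axis_frustum, stereo_frusta) are hypothetical. The returned values are the left, right, bottom and top bounds at the near plane, in the form expected by an asymmetric-frustum call such as glFrustum.

```python
def off_axis_frustum(eye, display, near):
    """Return (left, right, bottom, top) at the near plane.

    eye     : (x, y, z) of the optical center; z > 0 is its height above the table
    display : (x_min, x_max, y_min, y_max) of the screen rectangle in the plane z = 0
    near    : distance from the optical center to the near clipping plane

    The view direction is assumed to point straight down (orthogonal to the
    display), so the frustum is obtained by similar triangles.
    """
    ex, ey, ez = eye
    x_min, x_max, y_min, y_max = display
    if ez <= 0 or near <= 0:
        raise ValueError("eye must be above the display and near must be positive")
    s = near / ez  # scale from the display plane to the near plane
    return ((x_min - ex) * s, (x_max - ex) * s,
            (y_min - ey) * s, (y_max - ey) * s)


def stereo_frusta(head, eye_distance, display, near):
    """Frusta for both eyes, assuming the eyes are offset along the x axis."""
    hx, hy, hz = head
    half = eye_distance / 2.0
    left = off_axis_frustum((hx - half, hy, hz), display, near)
    right = off_axis_frustum((hx + half, hy, hz), display, near)
    return left, right


if __name__ == "__main__":
    # Head 50 cm above the table and 40 cm in front of the display center.
    left, right = stereo_frusta((0.0, -40.0, 50.0), 6.5,
                                (-20.0, 20.0, -15.0, 15.0), 1.0)
    print("left eye :", left)
    print("right eye:", right)
```

In this configuration the bottom bound is positive for both eyes, which means the view axis does not pass through the display at all (the skew of Figure 1(a)), while the left and right bounds differ between the eyes because both share the same screen rectangle (the skew of Figure 1(b)).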

Building Horizontal Stereoscopic Images by Deformation
Section 2 presented how to define camera models for rendering stereoscopic pairs prepared for being exhibited horizontally. This section explains how to build the stereoscopic pair by using real-world cameras.
It is possible to adapt the approach presented in Section 2 to the case of real-world cameras, as illustrated in Figure 2. The cameras are pointed toward the ground, making their view directions orthogonal to the planar support. We would then need a very large field of view in order to encompass the object to be captured. It would be necessary to enlarge the frustum because ordinary real-world cameras do not have skewing control. The result is an image that differs from the image to be presented over the display by a scale factor.
A problem with this approach is that most of the pixels captured by the camera pair are far from the projection of the object, so they would not be used. More precisely, just the portion containing the information to be exhibited by the horizontal display would be used. Another problem is that we would need a camera with a very large field of view.
A better approach consists in pointing the cameras toward the object and deforming the captured images in order to make them the same as the images that would be captured by cameras defined such as in Section 3. Figure 3 illustrates. We shall use projective geometry to show that this deformation is a homography, and then we shall explain how it can be calculated by a well-known computer vision process.

PROJECTIVE GEOMETRY BASICS
In order to solve the problem at hand we need some mathematical notions from Projective Geometry. We list these concepts here [10]:

THE SOLUTION BY USING HOMOGRAPHIES
It is easy to see, by examining Figure 4, that if a set of points in a scene is projected by a camera over a set of collinear projections, then the projections remain collinear if we maintain the optical center in the same place and change the position of the projection plane. This happens because the rays whose intersections with the projection plane generate these projections must be coplanar, and, if the optical center is unchanged, the same rays define the projections over the plane in its new position. Since the rays are coplanar, their intersections with any plane must be collinear.
This result implies that there is a homography relating the coordinates of the projections measured over the images captured by the cameras pointed at the object, and the coordinates of the projections made by using the same optical center as center of projection and the planar support as projection plane. This fact explains why the projections p1 and p2, presented in Figure 3, are related by a homography.
From Theorem 1 it follows that a homography can be represented by a matrix that acts as a projective mapping of the plane onto itself, and from the fundamental theorem of projective geometry it follows that these mappings are completely defined by a set of four correspondences between elements in the domain and in the range, provided that no three of them are collinear. Thus, if we establish the correspondence between the coordinates of four known markers over the planar support and their respective coordinates over the images captured by the camera pair, then the homographies will be defined. Since the coordinates over the planar support are measured in spatial units, such as centimeters, the homographies cannot be used for finding the deformed images directly, because the images are measured in pixels. This problem can be easily solved by rescaling the deformed image by the pixels-per-unit-of-length ratio, which represents how many pixels of the horizontal display correspond to each unit of length used for defining the markers' coordinates.
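The four-marker procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the system: the homography is parameterized with its last entry fixed to 1 (which assumes that entry is not zero), the linear system is solved by plain Gauss-Jordan elimination, and all function names and the sample coordinates are hypothetical. The sketch maps image pixels to marker coordinates in centimeters and then rescales the result to display pixels.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate marker configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4(src, dst):
    """Homography H (3x3, H[2][2] = 1) with H * src[i] ~ dst[i] for four pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(H, p):
    """Apply a homography to a 2D point using homogeneous coordinates."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def to_display_pixels(H, pixels_per_cm):
    """Rescale a pixel-to-centimeter homography so that it outputs display pixels."""
    return [[pixels_per_cm * v for v in H[0]],
            [pixels_per_cm * v for v in H[1]],
            H[2][:]]


if __name__ == "__main__":
    markers_px = [(100.0, 400.0), (540.0, 400.0), (420.0, 120.0), (220.0, 120.0)]
    markers_cm = [(0.0, 0.0), (30.0, 0.0), (30.0, 20.0), (0.0, 20.0)]
    H = homography_from_4(markers_px, markers_cm)
    H_px = to_display_pixels(H, 40.0)  # assumed display density: 40 pixels per cm
    print(apply_h(H_px, markers_px[2]))
```

Fixing the last entry to 1 keeps the sketch short; a homogeneous formulation avoids that restriction at the cost of a null-space computation.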
As a consequence of this scale ambiguity, it follows that, whatever the distance between the cameras used for capturing the stereoscopic pair, there is always a position for the user's head, as well as a scale for reproducing the images over the display, that allows the user to observe a correct version of the captured object at some scale. This happens because there is always a rescaled version of the scene that makes the distance between the cameras equal to the distance between the user's eyes.

Homography Estimation
The estimation of a homography from correspondences is a very well-known problem in computer vision. It is usually solved by using many more than four correspondences, since the measurements are in general corrupted by noise and the additional correspondences improve the accuracy of the estimation.
The set of corresponding points can be defined with the help of a checkerboard whose square corners can easily be detected by image processing, allowing the correspondence process to be automatic. In this case we use a coordinate system over the checkerboard for defining the position of each corner. We describe here the solution presented in [10] for estimating homographies from point correspondences.

THE DIRECT LINEAR TRANSFORMATION ALGORITHM

LEAST SQUARES SOLUTION
The optimization problem defined by equation (4) does not have a direct geometric interpretation. A better solution consists in finding the homography H that minimizes a geometrically meaningful error over the correspondences. This problem can be solved by using the Levenberg-Marquardt algorithm. Since it is an iterative algorithm, it demands an initial estimate for H near the optimum solution. The homography calculated by the direct linear transformation algorithm, explained in the previous section, can be used for this purpose.
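The iterative refinement can be sketched as below. This is a hedged illustration, not the procedure of [10] verbatim: the error minimized here is the one-sided transfer error (the squared distance between each measured point and the point predicted by H), which is an assumption; the homography is stored as eight parameters with the last entry fixed to 1; and the damping schedule is a simple textbook variant of Levenberg-Marquardt. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def project(h, p):
    """Map point p with the homography [[h0,h1,h2],[h3,h4,h5],[h6,h7,1]]."""
    x, y = p
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


def residuals(h, src, dst):
    """Stacked differences between predicted and measured points."""
    r = []
    for p, (u, v) in zip(src, dst):
        pu, pv = project(h, p)
        r += [pu - u, pv - v]
    return r


def refine_homography(h0, src, dst, iterations=50, lam=1e-3):
    """Levenberg-Marquardt refinement of h0; returns (parameters, final cost)."""
    h = list(h0)
    r = residuals(h, src, dst)
    cost = sum(e * e for e in r)
    for _ in range(iterations):
        jac = []  # analytic Jacobian of the residuals with respect to h
        for x, y in src:
            w = h[6] * x + h[7] * y + 1.0
            pu, pv = project(h, (x, y))
            jac.append([x / w, y / w, 1 / w, 0, 0, 0, -pu * x / w, -pu * y / w])
            jac.append([0, 0, 0, x / w, y / w, 1 / w, -pv * x / w, -pv * y / w])
        jtj = [[sum(row[i] * row[j] for row in jac) for j in range(8)]
               for i in range(8)]
        grad = [sum(row[i] * e for row, e in zip(jac, r)) for i in range(8)]
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(8)] for i in range(8)]
        delta = solve_linear(damped, [-g for g in grad])
        trial = [a + d for a, d in zip(h, delta)]
        r_trial = residuals(trial, src, dst)
        c_trial = sum(e * e for e in r_trial)
        if c_trial < cost:   # accept the step and trust the model more
            h, r, cost = trial, r_trial, c_trial
            lam = max(lam * 0.1, 1e-9)
        else:                # reject the step and increase the damping
            lam *= 10.0
    return h, cost
```

In practice the starting point h0 would come from the linear estimate, and a library routine would normally replace this hand-written loop.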

System Architecture
We built prototypes for capturing and for presenting stereoscopic images. The ones that present images are the virtual tables, and when combined with capture devices, they comprise what we call virtual teleporters.
The capture devices are planar surfaces at which we point a pair of stereo cameras. We use an ordinary table as the plane for small objects and the floor for large ones.
Before the capture devices are used, the homographies related to each camera are estimated by software that establishes correspondences between the square corners of a checkerboard and their respective projections over the images captured by the pair of cameras. This is done as described in Section 4.
While the system runs, different software is used for applying the previously estimated homographies and for rescaling the images according to the distance between the cameras, as described in Section 5. The result is a pair of images that are prepared to be shown horizontally.
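The run-time deformation step can be illustrated with a small inverse-mapping warp. This is a didactic sketch only, assuming nearest-neighbor sampling and images stored as nested lists; the function names are hypothetical, and a real system would rely on an optimized library routine such as OpenCV's warpPerspective instead of Python loops.

```python
import math


def invert3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-15:
        raise ValueError("homography is not invertible")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def warp(image, H, out_w, out_h, fill=0):
    """Deform `image` by H, where H maps source pixels to output pixels.

    Each output pixel is traced back through the inverse homography and
    filled with the nearest source pixel, so the output has no holes.
    """
    inv = invert3(H)
    in_h, in_w = len(image), len(image[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            w = inv[2][0] * u + inv[2][1] * v + inv[2][2]
            if abs(w) < 1e-12:
                continue  # this output pixel maps to infinity
            x = (inv[0][0] * u + inv[0][1] * v + inv[0][2]) / w
            y = (inv[1][0] * u + inv[1][1] * v + inv[1][2]) / w
            xi, yi = math.floor(x + 0.5), math.floor(y + 0.5)
            if 0 <= xi < in_w and 0 <= yi < in_h:
                out[v][u] = image[yi][xi]
    return out


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    # Scale by 2 and shift one pixel to the right.
    H = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
    for row in warp(tiny, H, 5, 4):
        print(row)
```

Tracing output pixels back to the source, rather than pushing source pixels forward, is what guarantees that every display pixel receives a value.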
We built two capture devices. The first one, shown in Figure 5(a), is used for capturing small objects. It is composed of a pair of small cameras whose separation is the same as that between the eyes of a human being. These cameras are pointed at an ordinary table. The second one, presented in Figure 5(b), is used for capturing large objects and people. It is composed of two HD cameras fixed to a structure placed near the ceiling. Since the distance between the cameras in the second device is greater than in the first one, the result is a stereoscopic pair that corresponds to a reduced-size version of the captured objects.
For instance, considering that the distance between the eyes is about 6.5 cm, if the distance between the two optical centers of the cameras used for capturing the stereoscopic pair is 65 cm, and their distance to the captured object is four times that, 260 cm, then the user must observe the display from a distance of four times 6.5 cm, or 26 cm, and the object displayed will be one-tenth the size of the real one.
The virtual table is the device that shows the stereoscopic image pair. It has a stereoscopic screen positioned horizontally that is connected to the computer by an NVIDIA Quadro card. That card allows us to use quad-buffering in OpenGL.
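The relation between camera separation, viewing distance and displayed size in the example with a 65 cm camera separation can be checked with a few lines of arithmetic. The function name and its parameters are hypothetical; the only assumption is the similarity argument stated in the text, namely that the scene is rescaled so that the camera separation matches the eye separation.

```python
def viewing_setup(camera_baseline_cm, camera_distance_cm, eye_distance_cm=6.5):
    """Return (scale, viewing_distance_cm) for a given capture geometry.

    scale is the size of the displayed object relative to the real one;
    viewing_distance_cm is how far from the display the eyes must be so that
    the rescaled scene is seen from the rescaled camera positions.
    """
    if camera_baseline_cm <= 0:
        raise ValueError("camera separation must be positive")
    scale = eye_distance_cm / camera_baseline_cm
    return scale, camera_distance_cm * scale


if __name__ == "__main__":
    scale, distance = viewing_setup(65.0, 260.0)
    print(f"scale = {scale:.2f}, viewing distance = {distance:.1f} cm")
```

With cameras separated by exactly the eye distance, as in the first capture device, the scale is 1 and the viewing distance equals the capture distance.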

We built three virtual tables. The first one, presented in Figure 6(a), consists of a CRT monitor positioned horizontally over an iron and wood structure, and stereoscopic glasses.
The second one, presented in Figure 6(b), consists of an LCD monitor that supports a 120 Hz refresh rate, positioned horizontally. That refresh rate allows it to display high-quality stereoscopic images for shutter glasses.
The third one, presented in Figure 7, has not been designed to be part of a virtual teleporter, since it has a lot of peripheral equipment that is useful only when presenting rendered content. It is currently installed at the VISGRAF Laboratory, and is composed of the following:
• A stereoscopic projector, fixed to the ceiling of the laboratory, and its respective 3D glasses.
• A table that receives the stereoscopic projection.
• A camera, also fixed to the ceiling, that is used for capturing fiducials in interactions performed with the AR Toolkit.
• A Wii video game controller, which can be used as a head tracking system, by tracking an infrared LED on a cap worn by the user, and also as a controller, depending on the application.
• A wireless mouse and keyboard, used for conventional interaction.

Results
A virtual teleporter prototype has been built. Figure 8 illustrates users interacting with a 3D scene using both the capture table, for acquiring images of real objects, and the stereo viewing table, for showing them in real time.
Figure 9(a) shows the image pair captured by the cameras, and Figure 9(b) shows the respective deformed versions.
Note that, even when the floor is aligned with the screen, the limited frustum of each camera can produce points on the planar surface that are present in one image but do not appear in the other, as Figure 10 illustrates. Such points should not be displayed; otherwise they may cause an interpretation problem, compromising the sense of reality. There are also many cases in which points do not have both stereoscopic projections visible due to occlusion, but this is not a problem because the human visual system is used to dealing with it.
One way to deal with the limitation of the camera frustum consists in excluding the portion of the support of one image that does not appear in the other. This reduces the supports of both images to their intersection. A better solution is to choose the distance between the cameras in such a way that the region exhibited by the display lies inside the frustum of both cameras. This means that the quadrilateral border of both stereoscopic images does not appear, as shown in Figure 11.
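Whether a chosen display region lies inside the frustum of both cameras can be tested directly from the homographies. The following sketch assumes that each homography maps table coordinates to pixel coordinates of the corresponding camera image (the inverse of the deformation applied at run time); the function names are hypothetical. Because a homography maps a convex region to a convex region as long as the line sent to infinity does not cross it, checking the four corners is sufficient.

```python
def region_visible(H, corners, width, height):
    """True if every corner of the region projects inside a width x height image.

    H maps table coordinates to image pixels. The homogeneous coordinate w
    must be nonzero and keep the same sign at all corners; otherwise the
    region crosses the line that the camera sends to infinity.
    """
    sign = 0
    for x, y in corners:
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if w == 0 or (sign != 0 and (w > 0) != (sign > 0)):
            return False
        sign = 1 if w > 0 else -1
        u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
        v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
        if not (0 <= u <= width - 1 and 0 <= v <= height - 1):
            return False
    return True


def visible_in_both(H_left, H_right, corners, width, height):
    """True if the region is fully captured by the left and the right camera."""
    return (region_visible(H_left, corners, width, height)
            and region_visible(H_right, corners, width, height))


def rectangle(x_max, y_max):
    """Corners of the table region [0, x_max] x [0, y_max]."""
    return [(0.0, 0.0), (x_max, 0.0), (x_max, y_max), (0.0, y_max)]


if __name__ == "__main__":
    # Illustrative homographies: 20 pixels per cm, right view shifted by 40 pixels.
    H_left = [[20.0, 0.0, 20.0], [0.0, 20.0, 40.0], [0.0, 0.0, 1.0]]
    H_right = [[20.0, 0.0, 60.0], [0.0, 20.0, 40.0], [0.0, 0.0, 1.0]]
    print(visible_in_both(H_left, H_right, rectangle(30.0, 20.0), 640, 480))
    print(visible_in_both(H_left, H_right, rectangle(28.0, 20.0), 640, 480))
```

A check of this kind could be run once after calibration to decide whether the display scale or the camera separation needs to be adjusted.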
We also tested the system for the capture and exhibition of sports events. In this case, the homography estimation can usually be made without using a planar pattern, since the markings on the floor used for enforcing the rules of the game can be used instead. We performed a test by capturing the appearance of an athlete positioned on a volleyball court (Figure 12). The court markings were put in correspondence with their respective projections (Figure 13). The deformed images were then projected over the horizontal display (Figures 14 and 15).
We used the OpenCV library for all the image processing. As a consequence, the algorithm runs on the CPU, in main memory, where it achieves real-time performance. We tested it on an Intel Core i7 computer, where we achieved a very good interactive response for images captured by ordinary NTSC cameras. It is possible that for Full HD images a GPU implementation may become necessary, but we have not evaluated this yet. All the HD tests were offline.
Besides the development of the virtual teleporter architecture, which uses image processing for generating the stereoscopic pair, we developed applications that generate the pair synthetically by the use of computer graphics.
We adapted ordinary 3D applications for running over the stereo table.
More specifically, we adapted open source games, such as Warzone2100 and Cannon Smash. The choice of these games was not arbitrary. Warzone2100 is a 3D real-time strategy game that presents various combats over hills. When this scenario is presented in a stereoscopic horizontal way, the user has the impression that the combats are taking place over a miniature set, which is more natural than the sensation given by the original version (Figure 16).
Although the modified version is interesting because of the mountain reliefs, it presents two inconveniences:
• There are problems of absence of stereo correspondence at the edges of the image because the mountain relief is not at the same level as the display at the edges.
• The game needs the set to scroll because the scene is much bigger than the area exhibited within the field of view. The scrolling of 3D objects over the screen does not correspond to any natural process in the real world.
In order to test the stereoscopic effect without these problems we selected the game Cannon Smash (Figure 17), which represents a table tennis game, to be adapted. The above problems are eliminated because the tennis table can be kept static, without scrolling, and the floor can be adjusted to match the screen level.
Additionally, we have developed our own interactive applications that can generate synthetic objects and present them over the stereo viewing table. We developed solutions in C using OpenGL, and in Python using the Panda 3D library. Examples of those applications are shown in Figure 18.
We have also modeled a scene in the Autodesk Maya software, adjusting the intrinsic and extrinsic parameters as explained in Section 2. There we rendered the animation using global illumination and we displayed the videos over the stereo table as shown in Figure 19.
The people who tested the device reported that they were impressed, although most of them had previous experience with stereoscopic exhibitions such as 3D movies. They also reported that they became bothered and had a reduction in the sense of reality whenever elements of the scene crossed the border of the display, for example, in situations like the one in Figure 20. We emphasize that this example corresponds to a scene whose problem was mentioned in Section 1. The scene in Figure 20 does not present any problem in the stereoscopic correspondence, since the clipping on the top corresponds to a 3D cut in the model, but its perfect alignment with the top of the screen makes the scene appear odd. This example supports the idea that we should be careful while setting the view frustum in order to avoid all kinds of crossing that could reduce the sense of reality.

Conclusion and Future Works
We presented two different processes to generate horizontal stereoscopic images. We used both to build some prototypes, among which we emphasize the application that we call the virtual teleporter, because it can transmit the appearance of an object over one surface to another one, thereby displaying a realistic virtual version of it in real time.
In Section 2 we began by presenting the process based on synthetic images generated by computer graphics. There we used head tracking to deform the image in order to allow the user to move his or her head, and we experimented with many kinds of interactive mechanisms. These experiments reproduced most of the visual results achieved by the horizontal stereoscopic devices described in the literature.
After that, we proposed a new architecture based on captured images deformed by homographies. This approach, compared to other systems based on synthetic images, presents advantages and disadvantages. The main advantage is that we easily achieved a very high visual quality because the result is generated by image processing. The main disadvantage is the fact that head tracking is useless in this architecture, since the absence of a 3D model makes the adjustment of the stereoscopic pair to the user's head movement impossible.
An important issue about this architecture is that the computer vision approach used to estimate the homographies gave us some freedom to set the pose of the cameras used to capture the non-deformed stereoscopic pair. This is important because it simplifies the task of fixing the cameras, since we do not need to be concerned about carefully adjusting their rotation.
When we compare both architectures, we also have to consider that their applications are different. The approach based on image processing is very useful when we intend to reproduce real-world scenes, such as in the case of the virtual teleporter, and the approach based on computer graphics is useful if interaction is required.
Because we could not use head tracking in the case of captured images, the users noted that the objects were deformed if they changed their head position. They also reported that the worst case occurred when they moved their head laterally, which generated a weird skewing of the displayed objects. In the future we are considering constraining the user's head position by fixing the stereoscopic glasses to the display in the correct position (Figure 21).
We have used OpenCV to apply the homography, which makes the process heavy since it is done in software. We intend to implement all the processing with CUDA and OpenGL using texture mapping resources. This will move the processing to the GPU for increased performance.
We intend to adapt the process presented here to the case of theater plays and to the capture of large sports events. This will demand the development of techniques to fix cameras in positions compatible with the user's eyes. The problem is that the large scale ratio between the real and the virtual scene demands, as explained in Section 5, the use of cameras that are fixed in a very high position, one very close to the other.
A conceptual solution for the case of capturing soccer matches is presented in Figure 22, where we propose using balloons anchored to the ground in order to suspend the cameras. That is useful when installing poles for mounting cameras on the stage is not feasible. Since the wind can move the balloons, it would be necessary to use techniques to stabilize the capturing process. One possibility is to estimate the homographies related to both cameras continuously. That will perfectly stabilize the 3 degrees of freedom that correspond to rotation, and it will stabilize the location of the contact point of each object presented in the scene with the ground.
We also intend to develop a bi-directional version of the system.The idea is to combine the capture device and the horizontal display into a single table, setting up a configuration where the display supports the objects to be transmitted.A collaborative environment can be created by connecting two tables since the real objects over one table are presented as virtual objects over the other, allowing the users to see all the objects together (Figure 23).It will be necessary to block the images presented by the display while capturing the objects over the table, which can be done by using a display that emits polarized light and placing polarizing filters in front of the cameras.
There are many real-world situations that can benefit from this configuration. An example is its use by designers, architects, and engineers as a complement to an ordinary teleconference system, allowing the participants to share ideas about 3D prototypes. Another interesting application consists in using the device as a platform for playing ordinary board games. In this case one of the players must position the board over the table and capture the image printed on it. After that, the image is projected over both tables. Then both players use their own pieces to play. Since the image presented by the table is not captured by the corresponding camera, it will not appear in the image transmitted to the other table. The constraint on this system is that it can only be used in the case where the pieces of one player are never occluded by the virtual piece of an opponent. Otherwise the portion of the screen that exhibits the virtual piece will be covered by a real piece, resulting in a weird scene.

Figure 1. (a) Lateral view of the frustum used for rendering an image to be presented horizontally. (b) Frontal view of the setup.

Figure 2. The projection p2 is a scaled version of the projection p1. Notice that the camera must have a very large field of view, and most of the pixels in the image plane are not used.

Figure 4. This example shows a curve whose projection over a projection plane is linear. As a consequence, it is also linear if we change the projection plane and keep the optical center unchanged.

Figure 5. (a) Capture device used for small objects. (b) Capture device used for big objects.

Figure 6. Virtual Tables. (a) The CRT version and (b) the LCD version.

Figure 7. (a) The camera and the stereoscopic projector fixed to the ceiling. (b) The table, 3D glasses, keyboard, mouse and the Wii controller.

Figure 11. The Virtual Table is showing an image whose scale was chosen in such a way that the quadrilateral border generated after applying the homography is outside the display.

Figure 12. Stereoscopic pair of an athlete on a volleyball court.

Figure 13. Correspondence between the volleyball court markers and their respective projections.

Figure 15. Deformed image projected over the horizontal display.

Figure 17. (a) Original version of the game Cannon Smash. (b) The modified version of the game being exhibited over the Stereo Table.

Figure 19. 3D animation of a chorus line.
40 THE INTERNATIONAL JOURNAL OF VIRTUAL REALITY

Figure 20. The display presents an image where the top of the head of a person is clipped by the border of the screen.

Figure 21. Concept of a device that has the stereoscopic glasses fixed to the display in order to avoid distortions due to the user's head motion.

Figure 22. (a) Balloons are used for suspending the cameras that capture a soccer game, which is exhibited in (b).

Figure 23. A collaborative environment. In (a) the ball is real and the cube is virtual; in (b) the ball is virtual and the cube is real.