Estimation of Distances in Virtual Environments Using Size Constancy

— It is reported in the literature that distances from the observer are underestimated more in virtual environments (VEs) than in physical-world conditions. On the other hand, estimation of size in VEs is quite accurate and follows a size-constancy law when rich cues are present. This study investigates how estimation of distance in a CAVE™ environment is affected by poor and rich cue conditions, subject experience, and environmental learning when the position of objects is estimated using an experimental paradigm that exploits size constancy. A group of 18 healthy participants was asked to move a virtual sphere, controlled using the wand joystick, to the position where they thought a previously displayed virtual cube (the stimulus) had appeared. Real-size physical models of the virtual objects were also presented to the participants as a reference of real physical distance during the trials. An accurate estimation of distance implied that the participants assessed the relative size of sphere and cube correctly. The cube appeared at depths between 0.6 m and 3 m, measured along the depth direction of the CAVE. The task was carried out in two environments: a poor cue one with limited background cues, and a rich cue one with textured background surfaces. It was found that distances were underestimated in both poor and rich cue conditions, with greater underestimation in the poor cue environment. The analysis also indicated that factors such as subject experience and environmental learning were not influential. However, least-squares fitting of Stevens' power law indicated a high degree of accuracy during the estimation of object locations. This accuracy was higher than in other studies that were not based on a size-estimation paradigm. Thus, as an indirect result, this study appears to show that accuracy when estimating egocentric distances may be increased using an experimental method that provides information on the relative size of the objects used.


1.1 Introduction

Perception of space is an important index of the ability to interact with virtual environments (VEs). Potential causes of the different perception of distances between the real world and VEs include the realism/quality of the graphics [1], binocular vs. stereoscopic vision [2], field of view (FOV) [3,4], differences in luminance across surfaces, contrast consistency [5], object and environmental texture [6,7], alterations of convergence and accommodation [8,9], and pictorial cues available to the subject such as perspective cues [2,10,11]. The effect of these last attributes on size constancy and distance estimation has been investigated in more detail because of the key role they play in the interaction between subject and environment.
Distance perception is also influenced by the kind of virtual environment used and by the relation between virtual object and observer, i.e. whether the perceived distance is egocentric (relative to the observer) or exocentric (between objects). Egocentric distances have been found to be more underestimated in VEs than in the physical world. The majority of works in this area have been conducted in non-immersive or semi-immersive environments. There is a lack of studies on the effect of perspective cues in completely immersive environments along the depth and height directions of space for short distances from the observer. These distances are involved when close interactions between participants and avatars, or virtual objects, take place. In the current study, estimation of an object's location was investigated in a completely immersive CAVE™ environment for objects appearing in front of the observer. The effects of pictorial cues, environmental learning and subject experience were evaluated. The high accuracy found in previous studies while subjects performed size estimation in VEs [12] led us to investigate whether an approach in which participants were required to estimate the relative size of the virtual objects could be used to increase accuracy during egocentric distance estimation.

Related Work
In the physical world, egocentric and exocentric distance perception has been widely investigated. Perceived distance can be expressed as a function of the n-th power of the actual distance [13], the value of n depending on the nature of the estimation (egocentric or exocentric), the magnitude of the distance to be estimated, and the methodology used [14]. In general, egocentric distance has been found to be underestimated compared to the actual distance [15]. However, different protocols may lead to different values of n, as well as lower or higher accuracy. In their extensive review [14], Wiest and Bell reported average values of n equal to 1.08 for direct viewing, 0.91 for memory and 0.75 for inference methods during estimation of environmental distances. Specific studies found values of n rather different from these averages; for example, during estimation of distances between 6 m and 23 km [16], values of n equal to 0.8 were found using direct perception, and equal to 0.59 when estimating distance from memory on the day after the observation. When accuracy during egocentric distance estimation was measured [17], it appeared to improve when pointing towards a previously seen target while blind walking along a direction oblique to it. The relation between perceived and actual distance was nearly linear up to 15 m, while exponents of n = 0.66 were found beyond 15 m. The same study also found high accuracy when subjects were asked to blindly face a previously seen target after blind walking along a direction oblique to it; exponents n higher than 0.9 were found. An almost linear relation between target distance and walked distance was also found when blind walking to previously seen targets [18].

Outside of real-world situations, exocentric distances have been found to be overestimated on desktop displays [4], with larger errors encountered when a background grid that appeared behind the stimulus was removed. In contrast, egocentric distance underestimation was found when judging target distances in videotaped outdoor scenes [19]. Subjects judging distances in 3D video-recorded and virtual reality spaces achieved exponents n between 0.53 and 0.8 when indoor and outdoor situations were projected, compared to powers ranging between 0.95 and 1.01 in the physical world [20]. However, when subjects were asked to imagine walking to a target in real-world and recreated screen-projected virtual reality settings, imagined time to walk appeared not to differ between real and virtual settings for distances between 6 m and 36 m [21].
In more immersive virtual reality devices such as head-mounted displays (HMDs), egocentric distance underestimation was also found [22]. Distance covered by blind walking to a target seen through an HMD was found to be 77% of the actual distance, while if the target was seen in the physical world through a restricted field of view and unrestricted monocular view, the walked distance was 96% of the actual one [23]. Distance compression of about 30% was also found during blind throwing while wearing HMDs [24]. When subjects were asked to judge the distance of cylinders placed in front of them at different depths, distance estimates in VEs based on verbal report were 47% of actual distances, compared to 72% in the real world; traversing a distance rather than simply estimating it verbally was accompanied by increased performance [7]. In contrast, Interrante and Anderson [25], measuring blind walking in physical environments and in VEs that replicated the real-world ones occupied by the participant, did not find any statistically significant differences between physical and virtual conditions. Finally, in completely immersive environments such as the CAVE, several studies have attempted to quantify egocentric distance estimation. Ryu et al. [26] described distance estimation for subjects moving between 2 m and 20 m in both physical and virtual environments (using joystick simulation). Ratios between estimated and physical distance ranged in this study between 131% and 205% in the virtual world, and between 108% and 160% in the real world; in this case, a larger virtual distance travelled indicated an underestimation by the participant. Underestimation in the CAVE was also found by Klein et al. [27] using blind walking, estimated walking time, and verbal reporting.
However, although depth has been found to be generally underestimated in VEs, this appears to be influenced by the cues available, which range from graphics resolution, to stereoscopic vision, to FOV, as outlined at the beginning of section 1.1. Witmer and Kline, displaying a virtual scene on a BOOM2C device, found that different floor textures did not affect depth estimation [7]. Image quality also had no effect on egocentric distance perception of targets using blind walking and triangulation [1]. Performance during exocentric distance estimation increased if stereo or additional cues such as shadows and interreflections were added in HMDs [10]. Creem-Regehr et al. [3] suggested that FOV had limited effect on distance underestimation in HMDs, although the restricted viewing conditions used were only tested in real-world settings. Similar conclusions on FOV, again from real-world situations, were drawn by Knapp et al. [28]. More research is needed to clarify whether restricting the FOV has any influence when these tests are conducted in VEs rather than real-world conditions. The convergence-accommodation conflict has also been pointed out as one of the factors that influence distance perception in VEs [8]. For a virtual image perceived to be before the front screen, the focal point is in fact further away than the vergence point, while the opposite happens for an object perceived to be beyond the front screen, since the eyes will focus on the screen and yet converge on the object. In addition, retinal blur is consistent not with the scene depth but with the distance of the projection surface. Hence, if information deriving from the accommodation stimulus affects depth estimation, the depth of an object in front of the screen would be overestimated, while beyond the screen it would be underestimated [9].
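The geometry of this conflict can be sketched numerically. The following minimal Python sketch (illustrative only, not part of the original study) assumes a 0.063 m interpupillary distance and a 1.5 m front-screen (accommodation) distance, and shows how the sign of the vergence-accommodation mismatch flips on either side of the screen:

```python
import math

def vergence_angle_deg(distance_m: float, ipd_m: float = 0.063) -> float:
    """Vergence angle for a point straight ahead at `distance_m`,
    assuming symmetric geometry and interpupillary distance `ipd_m`."""
    return math.degrees(2 * math.atan((ipd_m / 2) / distance_m))

SCREEN = 1.5                      # assumed accommodation distance (m)
for virtual in (0.8, 1.5, 2.5):   # virtual object depths (m)
    # Positive conflict: eyes converge nearer than they focus
    # (object before the screen); negative beyond the screen.
    conflict = vergence_angle_deg(virtual) - vergence_angle_deg(SCREEN)
    print(f"object at {virtual} m: conflict {conflict:+.2f} deg")
```

The mismatch is positive before the screen and negative beyond it, which matches the qualitative description above of over- and underestimation on the two sides of the projection surface.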
Contrary to egocentric distance estimation, size estimation in VEs appears to be quite accurate. Correct judgment of size is established independently of the shrinkage or enlargement that occurs on the retina when an object moves closer to or farther from the subject [15,29]. This ability decreases as more cues are removed, and size judgment switches from a size-constancy law to a judgment based on retinal dimensions, also known as the visual-angle law [29]. In CAVE-like environments, Kenyon et al. [12] demonstrated that size constancy is preserved provided that sufficient environmental cues are given to estimate depth. If these cues are removed, size is estimated based on the visual angle and size constancy is not maintained. Kenyon et al. used a physical object (a bottle of liquid) to provide a reference of absolute distance and asked the participants to scale a virtual bottle up or down so that its size matched that of the real bottle at a specific distance. The study showed that the ratio between perceived and real bottle size oscillated around unity with the environment present, while it increased from about 0.8 to about 1.4 in a poor cue environment, for depth increasing from approximately 0.6 m to 2.4 m.
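The distinction between the two laws can be illustrated with the standard visual-angle formula. The sketch below (illustrative, not from the original study) computes the angle subtended by the 0.16 m cube used later in this study at increasing depths: a visual-angle judgment would track the shrinking angle, whereas a size-constancy judgment must discount it.

```python
import math

def visual_angle_deg(size_m: float, distance_m: float) -> float:
    """Visual angle subtended by an object of linear size `size_m`
    viewed frontally at `distance_m`."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

# The 0.16 m cube at three depths spanning the range used in the study:
# its retinal (visual-angle) size shrinks even though the physical size
# is fixed, which is what size constancy has to compensate for.
for d in (0.6, 1.5, 3.0):
    print(f"{d} m -> {visual_angle_deg(0.16, d):.2f} deg")
```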
The objective of the current study was to investigate the effect of selected pictorial cues, environmental learning and subject experience on the accuracy of distance estimation when a paradigm that exploited size constancy was used to evaluate distances. An experimental method was used in which the real-world size of the virtual objects, and thus their size relative to each other, was presented to the participants as an aid to estimate distances. Participants were asked to position a virtual object where a previously observed virtual stimulus had appeared. Thus, a second, indirect objective was to observe whether near-linear space compression in a CAVE-like environment could be achieved using an experimental paradigm where accurate estimation of object location depended on the correct evaluation of the size ratio between virtual and real objects. The analysis was extended to the three directions of space to evaluate any selective under- or overestimation along a specific direction. Moreover, the study covered the regions before and beyond the front projection screen of the CAVE, where cues related to accommodation and convergence play a part in this estimation, although here they were not analysed separately from other cues.

Participants
A group of 18 volunteers was recruited, with CAVE experience ranging from none to expert, aged between 22 and 35 years (mean 26.4 years, SD 4.2 years), with normal or corrected-to-normal vision, no medical conditions, and not currently taking prescription drugs. At the beginning of each trial, height, age and whether eyesight was corrected were recorded. Participants were also asked to rate their level of experience with the virtual environment on a scale from 0 to 5, 0 indicating no experience and 5 high experience. Participants were told they could withdraw from the experiment at any time without giving any reason. The investigators then gave a brief introduction to immersive virtual reality systems and explained the procedure to follow. The authors/investigators did not take part in the experiments. Ethical approval was obtained for the study.
Participants were asked to stand in the centre of the CAVE, which was also the origin of the coordinate system, and were permitted to rotate their head and torso and to flex their legs, but not to walk. Parallax cues were still available to the participants through the movements of head and torso.

Environment and Stimulus
The CAVE-like environment was a room 3 m deep × 3 m long × 2.2 m high, with the front, left and right screens as well as the floor used as projection surfaces [30]. The screens were rendered by an SGI Prism with ATI FireGL graphics cards, 16 GB of RAM and four graphics pipes. An Intersense IS-900 provided head as well as wand tracking.
Screen resolution was 1024 × 768 pixels, and the stereo image refresh rate was 45 Hz. The interpupillary distance of each participant was measured at the beginning of the trials and the CAVE configuration was adjusted accordingly.
Two conditions were tested: poor cue and rich cue, each including three steps. These conditions differed in the type of background cues used, but were otherwise similar, as both involved placing a yellow sphere where a previously seen stimulus (a blue cube) had appeared. In the poor cue condition, a green sphere was loaded in the centre of the front screen as the only available background; it was set to be 1.1 m high and twice as far from the subject as the real front screen, i.e. 3 m from the origin. No other background was displayed. In the rich cue condition, a front wall, two side walls and a floor were displayed with the same texture: pale green lines crossing at 90° over a black background. The side walls and floor coincided with the screens, while the front wall was placed 3 m from the origin. Fig. 1 shows these settings.
In both conditions, participants were asked to study a real cube with 0.16 m sides that matched the size of the virtual cube, or stimulus, displayed later, and a real yellow sphere, 0.12 m in diameter, that matched the size of the virtual yellow sphere that moved with the wand (see Procedure). In addition, for the poor cue condition, subjects were asked to study a real green sphere, 0.236 m in diameter, which matched the size of the virtual green one displayed in the background. All these real objects were subsequently left in the CAVE for the duration of the trials so as to be always visible to the participant. This was done to provide an absolute reference of physical distance not otherwise available.

Procedure
During the experiments, a cube was loaded in a random position in front of the participant. The cube appeared within boundaries defined by a frontal plane parallel to the subject's frontal plane at 0.6 m from the subject, the virtual front wall (3 m from the subject), and the two side screens of the CAVE, as shown in Fig. 1-Bottom.
Participants were allowed to study the cube and were instructed to press button 2 on the wand when ready to move to the next step. The wand is an instrumented device capable of mapping its position relative to the CAVE space, equipped with buttons and a thumb-controlled joystick. Two different arrays containing randomly pre-generated positions of the stimulus were used for the poor and rich cue conditions. Figs. 2 and 3 show where the cube appeared in the virtual space. After button 2 was pressed, the cube disappeared and a yellow sphere appeared in front of the wand. The background walls (or the background sphere in the poor cue condition) were kept on. The yellow sphere moved with the wand and was also programmed to slide towards or away from the wand along the wand's front-back axis, controlled by the backward/forward movement of the joystick. Each participant was asked to position the yellow sphere where he/she thought the cube had appeared, and to press button 1 on the wand when confident of having done so. After button 1 was pressed, the positions of the yellow sphere and the cube, taken relative to the CAVE's origin, were saved to file; the yellow sphere then disappeared and the cube was loaded again in a new position, beginning a new trial.
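The sphere's joystick-controlled slide along the wand axis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the 0.5 m/s slide gain and the frame-time handling are all assumptions for the sake of the example.

```python
def sphere_position(wand_pos, wand_forward, offset, joystick_y, dt, speed=0.5):
    """Slide the sphere along the wand's front-back axis.

    wand_pos     -- wand position in CAVE coordinates (m)
    wand_forward -- unit vector along the wand's front-back axis
    offset       -- current distance of the sphere from the wand (m)
    joystick_y   -- backward/forward joystick deflection in [-1, 1]
    dt           -- frame time (s); `speed` is an assumed gain in m/s
    """
    # Integrate the joystick input, keeping the sphere in front of the wand.
    offset = max(0.0, offset + joystick_y * speed * dt)
    pos = tuple(p + offset * f for p, f in zip(wand_pos, wand_forward))
    return pos, offset

# One simulated 20 ms frame: wand at the origin pointing straight ahead,
# joystick pushed fully forward, sphere currently 0.3 m from the wand.
pos, off = sphere_position((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.3, 1.0, 0.02)
print(pos, off)   # sphere slides 0.01 m further out, to 0.31 m
```

Re-evaluating the sphere's world position every frame from the tracked wand pose keeps the sphere rigidly attached to the wand while the joystick only changes the scalar offset.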
Subjects were allowed to practice no more than 8 times at the beginning of the experiment to check that they had understood the procedure. To eliminate the effect of environment learning, the sequence of exposure (SoE) was assigned randomly to each participant, with 2 sequences being used: poor followed by rich cue, and vice versa. If, for example, the sequence poor-rich was assigned, the participant performed all 40 trials in the poor cue environment and subsequently completed another 40 trials in the rich cue environment, while the order was inverted for the rich-poor sequence.

III. ANALYSIS OF RESULTS
Stevens' law was used to express the relation between actual target distance and perceived distance [4,13]:

p = k · d^n

where d is the actual distance and p the perceived distance, k and n being constants. The coefficients k and n were calculated to minimize the residuals of the above expression in a least-squares sense, given the experimental data of actual and perceived distance. The Euclidean distances of sphere and cube from the origin were used in this study. In both poor and rich cue conditions, Stevens' law indicated a tendency to underestimate distances, albeit slightly: the fitted exponents were n = 0.830 in the poor cue condition and n = 0.975 in the rich cue one (Fig. 5). The ANOVA showed that absolute normalized errors (NE) were significantly different between poor and rich conditions (p < .001), while SoE, i.e. poor followed by rich or vice versa, had no effect (p = 0.261); similarly, age (p = 0.166), height (p = 0.547) and level of experience (p = 0.067) were not statistically influential. Individual t-tests on the NE distributions revealed that the stimulus position in the poor cue condition was not underestimated when the stimulus appeared before the front projection screen (t(17) = -0.06; p = 0.476), but it was when the stimulus appeared beyond it (t(17) = -3.77; p < .001). In the rich cue condition, underestimation was present when the stimulus appeared both before (t(17) = -3.52; p < .01) and beyond the projection screen (t(17) = -2.25; p = 0.02).
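One common way to perform this least-squares fit is to linearize the power law in log-log space, since log p = log k + n log d. The sketch below is illustrative only (the paper does not specify whether the fit was done in log space or directly on the nonlinear form), and the data are synthetic:

```python
import math

def fit_stevens(actual, perceived):
    """Least-squares fit of p = k * d**n via the linearization
    log(p) = log(k) + n*log(d); returns (k, n)."""
    xs = [math.log(d) for d in actual]
    ys = [math.log(p) for p in perceived]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    # Ordinary least-squares slope and intercept in log space.
    n = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - n * mx)
    return k, n

# Synthetic data generated from k = 1.05, n = 0.83 (illustrative values
# only), over the 0.6 m to 3.0 m depth range used in the study.
d = [0.6 + 0.06 * i for i in range(41)]
p = [1.05 * x ** 0.83 for x in d]
k, n = fit_stevens(d, p)
print(f"k = {k:.3f}, n = {n:.3f}")   # recovers k = 1.050, n = 0.830
```

With noise-free synthetic data the fit recovers the generating parameters exactly; on experimental data the residuals quantify how far estimation departs from the linear case n = 1.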
To evaluate estimation along the depth and height directions, the depth ratio was defined as the ratio between the depth location of the sphere and that of the cube; similarly, the height ratio was defined as the ratio between the height locations of sphere and cube. Depth ratios lower than unity indicated that the stimulus location was underestimated; this condition occurred when the sphere was placed before the cube location. Depth ratios less than unity were significant in the poor cue condition (t(17) = -1.82; p = 0.04) but not in the rich cue one (t(17) = -1.66; p = 0.06). Similar to the normalized error, the depth ratio was not significantly less than unity in the poor cue condition when the stimulus appeared in front of the projection screen (t(17) = 0.32; p = 0.623), while it was when the stimulus appeared beyond the screen (t(17) = -3.48; p < .01). In the rich cue environment, underestimation was not significant for a stimulus appearing either in front of (t(17) = -1.65; p = 0.06) or beyond the front screen (t(17) = -1.58; p = 0.07).
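The tests above are one-sample t-tests of the mean ratio against unity with 17 degrees of freedom (18 participants). A minimal sketch of the statistic, using made-up per-participant depth ratios rather than the study's data:

```python
import math
import statistics

def t_one_sample(xs, mu0=1.0):
    """One-sample t statistic and degrees of freedom for
    H0: mean(xs) == mu0."""
    n = len(xs)
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)              # sample standard deviation
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1

# Hypothetical per-participant mean depth ratios (18 participants);
# values below unity mean the sphere was placed short of the cube.
ratios = [0.93, 0.97, 1.01, 0.88, 0.95, 0.99, 0.91, 0.96, 1.02,
          0.94, 0.90, 0.98, 0.93, 1.00, 0.92, 0.96, 0.89, 0.97]
t, df = t_one_sample(ratios)
print(f"t({df}) = {t:.2f}")   # negative t -> mean ratio below unity
```

The one-sided p-value is then read from the t distribution with df = 17, matching the t(17) values reported in the analysis.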
Height ratios less than unity indicated that the sphere's height location was lower than the cube's. Overestimation of height was significant in the poor cue condition (t(17) = 2.50; p = 0.01), while underestimation was significant in the rich cue one (t(17) = -3.35; p < .01). Looking specifically at the height ratio when the stimulus appeared before and beyond the front screen, the ratio in the poor cue condition could not be shown to differ from unity before the front screen (t(17) = -0.65; p = 0.261), but it was greater than unity beyond it (t(17) = 4.55; p < .001). On the contrary, in the rich cue condition, underestimation occurred when the stimulus appeared both before (t(17) = -2.45; p = 0.01) and beyond (t(17) = -3.82; p < .001) the front screen.

IV. DISCUSSION
Statistical analysis and Stevens' law indicated that perceived distance was underestimated on average in both poor cue and rich cue environments. However, accuracy was higher than that reported by studies on perceived egocentric distances in both the physical [14] and the virtual world [7,22,26], with the exception of a few cases where walking to a previously seen target was used as a way to estimate perceived distance [18,25]. Distances in the rich cue condition were perceived closer to linearity (n = 0.975) than those in the poor cue condition (n = 0.830). Distance underestimation was also limited because of the high values of the k coefficients. A direct comparison of the n coefficients between this study and the studies cited above is limited by the different experimental methods used. In this study, as a result of the experimental paradigm used, correct estimation of distance was a likely consequence of using a size-constancy law: first when the distance of the virtual cube was evaluated, and later when the virtual yellow sphere was placed where the cube had appeared. During both processes the participant needed to use information on the physical size of sphere and cube to perform measurements in the virtual space. Research in the physical world has also shown that familiar size can affect the perception of distance, to the extent that the perceived size determines the perceived distance [15,31]. Further research is needed to discriminate the effect of presenting real physical objects on the estimation of depth, and to evaluate whether changing the size of sphere and cube relative to each other, for example by making the cube several times bigger than the sphere, affects the results. A correct estimate of the size ratio in our experiments would result in an NE close to 0, which is what we found in the rich cue environment (Fig. 5). These results are supported by the findings of Kenyon et al. [12], who found a ratio between the sizes of virtual and real objects close to unity, and thus the presence of a size-constancy law in rich cue environments. However, from the analysis at different depths, we also found that the NE was close to 0 in the poor cue condition when the stimulus appeared in front of the projection screen, while it was significantly less than 0 when the stimulus appeared beyond it. This last result may have been induced by the combined effect of lacking perspective cues and conflicting focus cues, although further work is needed to discriminate exactly the role played by each.
In this study, participants were put in control of the encoding time by deciding when the cube disappeared, pressing a button on the wand to trigger this event. In this way we guaranteed that the length of time during which the stimulus was present, and thus the encoding time, could not be perceived as too short by some subjects. Concerns may be raised about the rapid deterioration of the stimulus location in memory during the second phase, i.e. while the yellow sphere was being placed in its final position. The authors are not aware of any research that investigates this decay using the experimental paradigm proposed in this study; however, research has been carried out on its influence on blind walking to a previously seen target. Steenhuis and Goodale [32] found no evidence that short-term memory deteriorates before 30 s have elapsed from target disappearance; in that study, subjects were instructed to walk to the target without visual feedback. In the current study, the time that elapsed before participants placed the yellow sphere where they thought the stimulus had appeared was no longer than 30 s.
Results also showed that distance along the height direction was overestimated in the poor cue condition only when the stimulus appeared beyond the front screen, whereas in the rich cue condition it was underestimated. Further investigation is needed to establish whether the absence of a ceiling in the CAVE influenced the estimation of height when few perspective cues were given, so that the stimulus was perceived to be located higher than it actually was. Sequence of exposure (SoE) and level of experience also did not play a part in the magnitude of the absolute error, although the latter was very close to the rejection threshold. These results indicate that accuracy in the second cue environment did not seem to improve after exposure to the first environment, be this a rich cue or a poor cue one. It also follows that participants failed to make use of previous experience, and that perspective cues appeared to have a stronger influence than level of practice with the environment, for the number of trials used in this study.
To gain insight into the effect of the accommodation-convergence discrepancy, the analysis was extended to results before (up to 1.5 m) and beyond the front screen (between 1.5 m and 3 m from the origin). The analysis revealed different averages before and beyond the projection screen boundary in the poor cue condition only. This is consistent with the fact that the influence of focus cues is felt more strongly when fewer depth cues are present in the scene [9]. Further research would be needed to discriminate the combined effect of conflicting focus cues and poor cues, as occurred in this study, and specifically to disambiguate the effects of accommodation and retinal blur for CAVE displays.

V. CONCLUSION
Underestimating distance in virtual space leads to possible artifacts if the environment is used as a training platform for real-life situations. This study showed that, when estimating a stimulus location at distances up to 3 m from the observer in a CAVE-like environment, perspective cues are more effective than any level of expertise or previous practice with the environment. Participants also showed a high degree of accuracy while estimating object location along the depth direction in the presence of perspective cues, while in the absence of perspective cues estimation was accurate only for distances up to the front projection screen. The accuracy of perceived distances appeared to be higher than in other studies where the participants were not provided with information on the relative size of the virtual objects used. However, further research is needed to evaluate the contribution of the presence of real physical objects and their relative sizes as an aid to evaluating distance, and the role played by conflicting focus cues when perspective cues are absent.