ARE: AN AUDIO REALITY ENGINE FOR VIRTUAL ENVIRONMENTS

A general-purpose virtual audio system, the Audio Reality Engine (ARE), is proposed in
this paper. Motivated by the investigation of adding spatially localized sound to virtual
environments, we discuss the design and implementation techniques in detail, including an HRTF
model based on a neural network, the simulation of moving sound sources, the simulation of the
Doppler effect, and room acoustic simulation based on dynamic mechanisms. Experimental results
illustrate the efficiency of our system.


Introduction
The abundance of research on Virtual Reality (VR) in the 1990s boosted investigation in many related research areas. The visual and auditory channels are regarded as the two most significant sensory modalities through which human beings sense the external world. In order to build a virtual environment that satisfies the three 'I's, namely interaction, immersion, and imagination (Burdea, 93), we need to provide multi-modal interaction channels and enhance the realism of human-computer interaction. Computer graphics and image processing technology can support visual interaction in a VE. The auditory sensory channel can be supported by 3-D audio rendering technology (He, 98; Gardner, 97; Zhang, 98a), which integrates computer technology and digital signal processing methods into acoustical simulation. 3D audio rendering is the process of calculating the sound at the listener's location according to the characteristics of the sound sources, the environment, and the listener(s). The objective is for listeners to experience audio as they do in daily life.
Depending on the means of playback, there are three primary kinds of 3D audio rendering techniques. The first, called binaural playback, employs high-fidelity headphones as the final replay facility (Begault, 94) and is characterized by low cost and convenience. The second, called transaural playback (Gardner, 97), is a variation of the first that replaces the headphones with two loudspeakers. Although this method may overcome some intrinsic shortcomings of the binaural one, such as in-head localization and front-back confusion, it introduces crosstalk between the left and right channels, so an additional crosstalk-elimination procedure is required. Moreover, this method imposes rather strict requirements on the playback environment and listening position, which limit its application in interactive and multi-user modes. The third, called the surround array system, uses three or more loudspeakers (Tabata, 93) and is rather expensive.
Considering the significance of interaction in virtual environments, we adopt the binaural method as the playback means. Two principles are observed in the design of ARE: (1) A reasonable compromise between spatial reality and temporal reality: 3D audio in a VE involves both (Zhang, 96). Spatial reality denotes simulating the whole signal transmission process from the source to the listener's eardrum in as much detail as possible, whereas temporal reality requires reducing the computational cost of the simulation. Clearly we need to provide some tradeoff between these two requirements.
(2) Multiple algorithms or models with different accuracy or computational complexity to suit a variety of application situations: for example, owing to the expensive computation involved in room acoustic simulation, it is worthwhile to provide several models of different complexity and accuracy.
Figure 1 shows the basic flowchart of 3D audio rendering in ARE:
Step 1: Definition of the sound source and the environment. The definition of the environment includes the environment type, i.e., whether it is free-field or room-field. For room-field, the relevant geometrical and acoustic parameters should also be specified.
Step 2: Calculation of early reflections: computing the early reflections in a room on the basis of geometrical acoustics.
Step 3: Evaluation of late reflections (or the whole reflection procedure): some primary results from statistical acoustics are used to model late reverberation (or the whole reflection procedure).
Step 4: Calculation of the BRIR (binaural room impulse response) according to the incident directions of the direct and early-reflected sound.
Step 5: Generation of y1(n) by convolving the BRIR with the input signal x(n).
Step 6: Generation of y2(n) by passing the input signal, multiplied by a specified gain, through the reverberator.
Step 7: Addition of y1(n) and y2(n) to complete the 3D audio rendering.
It should be noted that the above flowchart does not cover the simulation scheme for moving sources or the realization of a distance cue. For sources in the free field, steps 2, 3, 4 and 6 can be bypassed and the BRIR in step 5 replaced by the corresponding HRTF data. It is also possible to accelerate the computation by omitting steps 2 and 4.
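The steps above can be sketched as follows. This is a minimal illustration assuming steps 1-4 have already produced the BRIR, reverberator gain, and reverberator; the function and argument names are illustrative, not the actual ARE API:

```python
import numpy as np

def render_3d_audio(x, brir, reverb_gain, reverberator):
    """Sketch of rendering steps 5-7 (illustrative names, not the ARE API).

    x            : input signal x(n)
    brir         : binaural room impulse response from steps 2-4
    reverb_gain  : gain applied before the reverberator (step 6)
    reverberator : callable modeling late reverberation
    """
    y1 = np.convolve(x, brir)               # step 5: direct + early reflections
    y2 = reverberator(reverb_gain * x)      # step 6: late reverberation
    n = max(len(y1), len(y2))               # step 7: sum both contributions
    y = np.zeros(n)
    y[:len(y1)] += y1
    y[:len(y2)] += y2
    return y
```

For a free-field source, `brir` degenerates to the HRTF pair and the reverberator path is skipped, matching the shortcut described above.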

The Simulation of Moving Sound Source
The simulation of moving sound sources is one of the most important features of ARE. Imagine a listener situated in a real, non-simulated sound field. The corresponding HRTF is then altered continuously due to the constant change of the positional relationship between the moving sound source and the listener. In a computer simulation, however, it is not practical to pursue a very high position update rate. The required position update rate depends on the relative speed between the source and the listener. Experiments show that a 25 Hz position update rate is adequate for a sound source rotating at speeds up to 360 degrees per second.
Since only limited HRTF data can be acquired by measurement at a few locations, one problem is how to generate intermediate HRTF data from the existing data in real time. Another problem is the smoothing of adjacent sections convolved with different HRTF impulses. Furthermore, we need to take the Doppler effect into account for rapidly moving sources, such as a dive-bombing plane or a galloping train.

THE GENERATION OF INTERMEDIATE HRTF DATA
The HRTF can be represented as a function H(ϕ, θ, f) of the incident azimuth ϕ, elevation θ, and frequency f. A multilayer feedforward network (Blum, 91) is employed in ARE as a universal HRTF approximator. A particular advantage of this kind of neural network is its very fast execution speed.
Figure 2 shows the input-output relationship of the neural network model. Two input nodes are used to encode azimuth in order to overcome the discontinuity problem (Zhang, 98a).

Figure 2. The input/output relationship of the HRTF model based on neural network
It is clearly desirable to select the training set according to human spatial auditory properties. As Oldfield et al. (1984) suggested, the JNDs in the horizontal plane (i.e., elevation = 0°) are the smallest. As source elevation increases, localization ability and accuracy weaken to some extent. In addition, listeners can localize frontal sources more accurately than sources behind them. Therefore, high-density HRTF samples are used in the regions where people possess fine resolution, while only low-density data are picked in other regions. This strategy subjects the neural-network-based HRTF model to human localization characteristics.
After model initialization, the conjugate gradient algorithm is used to minimize the mean-squared output error. When a minimum is found, a simulated annealing algorithm attempts to break out of what might be a local minimum. The HRTF data at any specified position can then be obtained using the trained network.
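A minimal sketch of the two-node azimuth encoding and a stand-in feedforward pass is given below. The layer sizes, the tanh activation, and the elevation normalization are illustrative assumptions, not the trained ARE model; the key point is that (sin ϕ, cos ϕ) makes 0° and 360° identical inputs, removing the wrap-around discontinuity:

```python
import numpy as np

def encode_direction(azimuth_deg, elevation_deg):
    """Encode azimuth with two nodes (sin, cos) so that 0 and 360 degrees
    map to the same input vector; elevation scaling is an assumption."""
    az = np.radians(azimuth_deg)
    return np.array([np.sin(az), np.cos(az), elevation_deg / 90.0])

def mlp_forward(x, weights):
    """Forward pass of a small feedforward network (tanh hidden layers,
    linear output) standing in for the trained HRTF approximator."""
    for W, b in weights[:-1]:
        x = np.tanh(W @ x + b)
    W, b = weights[-1]
    return W @ x + b
```

Because execution is just a few matrix products, evaluating the trained network for an arbitrary direction is far cheaper than interpolating raw measured HRTF tables at run time.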

THE SMOOTHING OF ADJACENT SECTIONS
If we simply treat moving sources in the same manner as static ones, we get little clicks in the output, which may confuse the perception of motion. Fade-in/fade-out (terms borrowed from movies and music) can be used to realize the effect of the later HRTF's contribution increasing while the earlier HRTF's contribution decreases. In fact, fade-in/fade-out is implemented in the process of interpolating gradually from one HRTF to another. If this is not done smoothly, the listener will still hear clicks.
In ARE, a pair of sine/cosine functions is used to smooth contiguous sections:

y(n) = cos(πn/2N) y_old(n) + sin(πn/2N) y_new(n),  0 ≤ n < N,

where N is the length of the smoothing buffer, and y_old(n) and y_new(n) denote the overlapping samples filtered with the previous and the next HRTF, respectively.
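The sine/cosine smoothing described above can be sketched as an equal-power crossfade over an N-sample buffer; the function below is an illustration, not ARE's exact implementation:

```python
import numpy as np

def crossfade(old_section, new_section):
    """Smooth the seam between sections filtered with consecutive HRTFs
    using a sine/cosine (equal-power) fade: the squared fade weights sum
    to 1 at every sample, keeping perceived loudness steady."""
    N = len(old_section)
    n = np.arange(N)
    fade_out = np.cos(np.pi * n / (2 * N))   # previous HRTF output decreases
    fade_in = np.sin(np.pi * n / (2 * N))    # next HRTF output increases
    return fade_out * old_section + fade_in * new_section
```

An equal-power (rather than linear) fade is the usual choice here because the two sections are nearly uncorrelated after filtering with different HRTFs, and summed power, not amplitude, is what the ear tracks.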

THE SIMULATION OF DOPPLER EFFECT
The acoustic Doppler effect refers to the observed frequency shift when there is relative movement between the sound source and the listener. Assume that s denotes a sound source and p a listener, and let the sound frequency, wavelength, and transmission speed in the medium be ν, λ, and V respectively. If v_s and v_p represent the speeds of the source and the listener relative to the medium, taken positive when they move toward each other, then the frequency observed by the listener is

ν' = ν (V + v_p) / (V − v_s).

Two methods are used in ARE to simulate the Doppler effect. One is implemented in the frequency domain by determining the frequency shift directly according to the relative speed of the source and the observer. The other is implemented in the time domain, dynamically adjusting the sound transmission delay to simulate the physical process that causes the Doppler effect. The first simulates the result, while the second simulates the process.
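The frequency-domain method amounts to evaluating the Doppler formula directly; a sketch, with the sign convention that speeds are positive when source and listener approach each other:

```python
def doppler_shifted_frequency(f_source, V, v_listener, v_source):
    """Observed frequency f' = f (V + v_p) / (V - v_s).

    f_source   : emitted frequency (Hz)
    V          : speed of sound in the medium (e.g. ~340 m/s in air)
    v_listener : listener speed toward the source (v_p)
    v_source   : source speed toward the listener (v_s)
    """
    return f_source * (V + v_listener) / (V - v_source)
```

The time-domain method instead resamples the signal through a dynamically varying transmission delay, which produces the same shift as a by-product of the simulated propagation.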

Room Acoustic Simulation Based on Dynamic Mechanisms
Room acoustic simulation aims to simulate the transmission process of acoustic signals in a room (Bradley, 97). In general, the schemes involved fall into two categories. One is based on geometrical or theoretical acoustics, such as the ray-tracing algorithm, the image-source model (Zhang, 96), the finite element model, and the boundary element model (Wright, 95). These models are fairly elaborate and can duplicate most acoustic phenomena in a room; on the other hand, they usually take a rather long time to run due to their high complexity (Zhang, 96). The other category considers only a few significant parameters of the room sound field, such as the reverberation time, and can be implemented with FIR or IIR filters to speed up the computation (Zhang, 98b).
In a sophisticated VR system there are quite a few different positional relations between virtual sound sources and users: a moving source and a static user, a static source and a moving user, and, in the most complicated case, both a moving source and a moving user. Furthermore, the movement of sources and users can also be classified as 'determined' or 'undetermined'.
In order to clarify the situations in a virtual environment, we suppose: (1) the virtual sound sources are static; (2) users are moving, and their movement trail in a plane isn't determined in advance, but rather controlled by realtime interaction.
Owing to the continuity of the user's movement, we can deduce that the adjacent positions in virtual space have some correlation, that is, the next movement position may be predicted based on several previous positions.
In addition, based on the geometrical acoustics, some correlation also exists in the room impulse response (RIR) among adjacent positions in the acoustic field (Zhang,96).
The following subsections present some dynamic mechanisms that can be embedded in an acoustic simulation. Among them, the interpolation procedure is based on the correlation of RIRs, while the correlation of the user's movement serves as the basis of the predictive mechanism. Moreover, the user's movement is represented by a kind of hierarchical data structure that is crucial to the whole dynamic mechanism.

INTERPOLATION PROCEDURE
The primary objective of acoustic simulation in VR is to provide realistic 3-D sound that embodies the properties of a specific sound field and the positional relations between virtual sound source(s) and user(s). An impulse response is a useful time-domain representation of an acoustic system, including rooms and HRTFs. It can be sampled and represented by a time-domain function P(T), which denotes the intensity of the impulse response at time T. Therefore the most common method of generating realistic 3-D sound is to convolve the raw audio signal with the impulse response. Since the HRTF data are in general obtained beforehand, the reduction of computational cost in acoustic simulation depends mainly on the speed at which the room impulse response can be generated at an arbitrary position.
An interpolation procedure based on the correlation of adjacent RIRs is used to accelerate the computation of RIRs when the user's location changes frequently. Linear interpolation in the time domain does not actually yield an average response in the frequency domain (Wightman, 92). We adopt interpolation in the frequency domain for the following reasons: (1) to overcome this shortcoming of time-domain interpolation;
(2) since audio spatialization with the RIR is usually performed by a convolution operation, representing the RIR in the frequency domain in advance makes it convenient to perform fast convolution with the FFT algorithm.
As shown in Figure 3, assume each side of the square has unit length and that the RIRs at P0, P1, P2, and P3 are known in advance. Then the corresponding RIRs in the frequency domain can be obtained by the FFT and interpolated according to the position inside the square.
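A sketch of this frequency-domain interpolation follows. It assumes bilinear weighting of the four corner RIRs and a particular corner ordering (h00 at the origin, h11 at the opposite corner); the exact weighting and ordering used in ARE are not specified here:

```python
import numpy as np

def interpolate_rir(h00, h10, h01, h11, x, y):
    """Frequency-domain interpolation of four known corner RIRs.

    (x, y) is the position inside the unit square; weights are bilinear
    (an assumption). The RIRs are combined after the FFT, then brought
    back to the time domain."""
    H = [np.fft.rfft(h) for h in (h00, h10, h01, h11)]
    w = [(1 - x) * (1 - y), x * (1 - y), (1 - x) * y, x * y]
    H_interp = sum(wi * Hi for wi, Hi in zip(w, H))
    return np.fft.irfft(H_interp, n=len(h00))
```

Keeping the corner responses in the frequency domain also lets the subsequent spatialization reuse them directly for fast FFT-based convolution, which is reason (2) above.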

REPRESENTATION OF OBJECTS AND USER MOVEMENT
We can represent a 2-D object with a quadtree data structure. It begins with the definition of a square that contains the whole object to be represented in coordinate form; two neighboring sides of this square are parallel to the x and y axes, respectively. If the area of this square is fully occupied by the object, the object is evidently equivalent to the square. Otherwise, we divide the original square into four smaller squares whose sides are half the length of the parent's. The four child squares are indexed 0, 1, 2, and 3 in the order shown. The labels "partial", "empty", and "full" are abbreviated with the single characters P, E, and F, respectively. Squares marked "Full" or "Empty" are regarded as terminal nodes, while non-terminal nodes carry the "Partial" mark. All "Partial" child squares are then further divided in the same manner until every square has either been marked as a terminal node or the sides of the child square have been reduced to unit length. With this method we can represent any 2-D object.
Any position in virtual space can be seen as a 2-D object occupying only a single point P(x, y). The quadtree may therefore be employed to represent the user's movement, in a way similar to the representation of a 2-D object. Some definitions are given in order: Definition 1. Current movement position (CMP): the immediate user position at a given moment in virtual space, which can be represented by a terminal node in a quadtree corresponding to a point P(x, y) in 2-D coordinates. Definition 3. Encoding precision (EP): the encoding length of the CMA, i.e., the depth of the corresponding node in the quadtree. The higher the EP, the higher the precision of the user's movement. For example, the EPs corresponding to nodes 0, 0.2, and 0.22 in Figure 4 are 1, 2, and 3, respectively.
Assume that the CMP is P(x, y); x and y can be expressed as n-bit binary numbers,

x = (x1 x2 ... xn)2,  y = (y1 y2 ... yn)2,

where n is the least number of binary bits required to express both the maximum x and the maximum y inside the movement area.
If the code of the CMA containing point P, i.e., q1 q2 ... qd, is known, then each digit qk is determined by the k-th bits of x and y (for instance, qk = 2yk + xk when the children are indexed row by row). It is obvious that the EP of the CMA satisfies 1 ≤ d ≤ n. Thus, with the representation of the CMA at different EPs, the quadtree data structure can also be adapted to represent the movement of the user in the virtual environment. A quadtree that varies frequently with the user's movement may be called an active quadtree, which is also convenient for representing multi-user movement in a distributed VR environment.
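The encoding can be sketched by interleaving the bits of x and y; the child indexing qk = 2yk + xk is an assumed convention (the exact ordering of Figure 4 is not reproduced here):

```python
def quadtree_code(x, y, n, depth):
    """Encode position (x, y), non-negative integers below 2**n, as a
    quadtree path q1..q_depth by interleaving the bits of x and y from
    the most significant bit down. The child index qk = 2*yk + xk is an
    assumed convention."""
    code = []
    for k in range(n - 1, n - 1 - depth, -1):
        xk = (x >> k) & 1           # k-th bit of x (MSB first)
        yk = (y >> k) & 1           # k-th bit of y (MSB first)
        code.append(2 * yk + xk)    # one quadtree child index per level
    return code
```

Truncating the path at a smaller `depth` yields the code of a coarser CMA containing the same CMP, which is exactly how different EPs arise from one position.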
Since the precision of the interpolation procedure also has a strong influence on computational cost, combining the CMA with the square employed in the interpolation procedure means that different EPs of the CMA incur different computational costs. Therefore, by controlling the EP of the CMA, we can adjust the computational cost of the RIR dynamically, achieving in acoustic simulation almost the same effect as multiple levels of detail (LoDs) in visual simulation (Pan, 98).

PREDICTIVE MECHANISM
Figure 5. The prediction of the user's movement
Assume that the RIRs at points P0, P1, P2, and P3 in Figure 5 are known in advance; the boundaries of the square P0-P1-P2-P3 are determined by the CMA, which in turn is determined by the CMP. The RIR corresponding to any CMP inside this square can be calculated by the interpolation procedure. To further accelerate the computation of RIRs, we can predict future CMPs outside the current CMA based on direction priority, which depends on the direction and velocity of the user's movement. The RIRs at the vertices of the neighboring CMA with the highest priority are then computed first.
Assume that the CMP P(x, y) in the current CMA is normalized to the length of the sides, i.e., x = 0 at the left boundary of the CMA and x = 1 at the right boundary; y = 0 at the bottom boundary and y = 1 at the top. Four child processes, upPid, downPid, rightPid, and leftPid, are forked to compute the RIRs at the vertices of the four adjacent CMAs, respectively. The initial running priorities of these processes are set to upPriority = y, downPriority = (1 − y), rightPriority = x, and leftPriority = (1 − x), respectively. In this way the RIR computing procedures can be implemented as preemptive processes.
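The priority assignment can be sketched directly from the normalized CMP; the process names follow the text, but the function itself is illustrative:

```python
def prefetch_priorities(x, y):
    """Initial priorities for the four child processes that precompute
    RIRs at the vertices of the neighboring CMAs. (x, y) is the CMP
    normalized to [0, 1] within the current CMA; the nearer the user is
    to a boundary, the higher the priority of the CMA beyond it."""
    return {
        "upPriority": y,
        "downPriority": 1 - y,
        "rightPriority": x,
        "leftPriority": 1 - x,
    }
```

A preemptive scheduler can then run the four precomputations in priority order, so that by the time the user crosses a boundary, the corner RIRs of the most likely next CMA are already available for interpolation.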

Implementation and Examples
In environments without sound-reflecting surfaces, it is straightforward to convolve the source sound signal with the corresponding HRTF to implement directional filtering. Block convolution with the FFT is employed in ARE to accelerate this time-consuming process.
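Block convolution with the FFT can be sketched as an overlap-add scheme: the input is filtered in fixed-size blocks, and the tails of consecutive block convolutions are summed. The block size and FFT sizing below are illustrative choices, not ARE's exact parameters:

```python
import numpy as np

def block_convolve(signal, hrtf, block_size=256):
    """Overlap-add block convolution with the FFT.

    Each block of the input is convolved with the HRTF via an FFT large
    enough to hold the full linear convolution (block + filter - 1),
    and the overlapping tails are accumulated into the output."""
    M = len(hrtf)
    nfft = 1
    while nfft < block_size + M - 1:    # next power of two, no aliasing
        nfft *= 2
    H = np.fft.rfft(hrtf, nfft)         # filter spectrum, computed once
    out = np.zeros(len(signal) + M - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        y = np.fft.irfft(np.fft.rfft(block, nfft) * H, nfft)
        out[start:start + len(block) + M - 1] += y[:len(block) + M - 1]
    return out
```

Because the filter spectrum H is computed once per HRTF, the per-sample cost drops from O(M) for direct convolution to roughly O(log nfft), which is what makes real-time directional filtering feasible.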
In environments with sound-reflecting surfaces, it is necessary to consider the effect of reflected sound, especially early reflections, in directional filtering. Assume that the early RIRs have been calculated through room acoustic simulation; each RIR then corresponds to a transmission path and can be expressed with three factors ϕ_i, θ_i, and a_i(t), where ϕ_i and θ_i denote the incident azimuth and elevation of the sound with respect to the listener, a_i(t) denotes the impulse response of the transmission path, and i = 0 is the path of the direct sound. In fact, we need only consider the early n-order reflections, or the first m milliseconds, since the late reverberation is almost independent of direction.
If h_left(t) and h_right(t) denote the left and right HRTF impulse responses, respectively, then the contribution of the sound signal in path i is a_i(t) * h_left(t) for the left ear and a_i(t) * h_right(t) for the right. The total BRIR is obtained by summing over all paths:

BRIR_left(t) = Σ_i a_i(t) * h_left(t),  BRIR_right(t) = Σ_i a_i(t) * h_right(t),

where h_left(t) and h_right(t) vary with the incident angles ϕ_i and θ_i.
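The per-path accumulation can be sketched as follows; the `hrtf_lookup(phi, theta)` interface returning a left/right HRTF pair is an assumed abstraction, not the ARE API:

```python
import numpy as np

def total_brir(paths, hrtf_lookup, length):
    """Accumulate the BRIR over transmission paths.

    paths       : iterable of (phi, theta, a) with a = path impulse response
    hrtf_lookup : callable (phi, theta) -> (h_left, h_right); an assumed
                  interface (e.g. the trained HRTF network)
    length      : number of output samples to keep per ear
    """
    brir_l = np.zeros(length)
    brir_r = np.zeros(length)
    for phi, theta, a in paths:
        h_l, h_r = hrtf_lookup(phi, theta)
        for out, h in ((brir_l, h_l), (brir_r, h_r)):
            c = np.convolve(a, h)[:length]   # a_i(t) * h(t) for this ear
            out[:len(c)] += c                # sum over paths
    return brir_l, brir_r
```

Only the direct path and early reflections need this direction-dependent treatment; the late reverberation is added separately through the reverberator, as in the flowchart's step 6.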
The main purpose of adding distance simulation is to maintain or improve the sense of realism and increase immersion in the 3D audio. The distance cue is important in everyday listening: an accurate judgment of the distance of a sound can be very helpful in determining one's actions.
The perception of distance can be classified into two categories: absolute and relative distance evaluation. It is widely believed that distance characteristics can be rendered by an intensity scaling scheme. Many experiments show that intensity scaling is more suitable for relative than for absolute distance perception, because it is hard to estimate the absolute distance of a sound source, especially when the sound pressure level of that source at the ears is unknown. Hence the distance characteristics described here mainly contribute to relative distance perception.
The intensity gain can be generated according to the inverse square law for relative distance increments. The desired distance increment can be calibrated on a relative dB scale so that it satisfies the following equation (Begault, 94):

d_gain = 20 log10(1 / rd),

where d_gain is the relative intensity in dB and rd is the relative distance increment.
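The inverse-square-law gain is a one-line computation; a sketch:

```python
import math

def distance_gain_db(rd):
    """Relative intensity in dB for a relative distance increment rd.

    Follows the inverse square law for a point source: doubling the
    distance (rd = 2) lowers the level by about 6 dB, and halving it
    (rd = 0.5) raises the level by the same amount."""
    return 20.0 * math.log10(1.0 / rd)
```

In practice this gain is applied to the direct-path signal; the reverberant portion is left roughly constant, which also produces the R/D distance cue discussed next.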
The reverberant-to-direct sound ratio (R/D) of the sound field in a room correlates strongly with distance perception. This ratio increases as the distance between sound source and receiver increases, owing to the effect of the inverse square law on the direct portion of the sound field, whereas the energy in the late-arriving reflected portion remains relatively constant as source distance varies (Zahorik, 96).
In ARE we integrate comprehensive algorithms for 3D audio rendering, including a variety of room acoustic models, the neural-network-based HRTF model, the simulation of moving sound sources, the simulation of the Doppler effect, and many others. ARE is an API library implemented in C/C++ on the Windows 95/98/NT platform. A demo/test system based on this library was developed in our lab; the following examples were all rendered with this demo system. You are invited to visit http://cad.zju.edu.cn/home/jyshi/ to experience some sample sounds rendered with ARE in .WAV format.

Conclusion
In this paper, design and implementation techniques for a general-purpose virtual audio system were presented. The system can be used to support the development of various applications, such as spatial audio, auralization, sonification, auditory display, and HCI. A possible future extension of this research is a transaural and multi-speaker implementation of ARE.

Acknowledgement
This project was partially supported by National Defense Preliminary Research Fund of China.

Figure 1. Basic flowchart of ARE
Figure 2. The input/output relationship of the HRTF model based on neural network

Figure 4. The representation of the CMA
Definition 2. Current movement area (CMA): a square used to represent the adjacent area of the CMP. This square can be considered equivalent to a non-terminal node, or even a terminal node (in which case the CMA converges to a single point), in the quadtree. Assume that the CMP lies in the smallest black square in Figure 4; the CMA can then be encoded as 0, 0.2, or 0.22, corresponding to three different nodes, respectively.

Figure 6. An example of the definition of sound source and environment with ARE
Figure 6 illustrates an example of the definition of a sound source and environment with ARE. The bottom waveform represents the source. The top right figure shows the configuration of the transmission environment, and the top left figure shows the positional relationship between the source and the listener.

Figure 7. An example of ARE computation
Figure 7 illustrates an example of ARE computation. The top waveform is the input audio signal (16-bit, 44100 Hz sampling rate). The bottom waveform is the resulting 3D audio in stereo format; the two channels correspond to the left and right ears, respectively.

Figure 9. Waveform of a rotating sound source (half circle) received by the left ear (top waveform) and right ear (bottom waveform)