Volumetric Amplitude Panning and Diffusion for Spatial Audio Production
Audio Engineering Society
Presented at the 153rd Convention
2022 October, Online
This convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Dr. Kaushik Sunder - Director of Engineering; Embody
Saarish Kareer - Acoustics & Machine Learning Engineer; Embody
Spatial audio production for multichannel and Dolby Atmos is becoming increasingly common in music, cinematic, gaming, and XR applications. With spatial audio creation and consumption on the rise, it’s important to rethink the tools and workflows used to translate immersive inspiration into perceptual reality. 3D panning is fundamental to any such workflow, as it provides the means by which audio engineers and sound designers position sound sources or objects within their 3D soundscapes. Current state-of-the-art 3D panning techniques such as Vector-Based Amplitude Panning (VBAP) have inherent limitations, including the assumption of a “sweet spot” listening position surrounded by loudspeakers on either a 2D ring or a 3D sphere, which can degrade the accuracy of spatial production and the quality of the end user experience. In this paper, we propose a novel algorithm for Volumetric Amplitude Panning which does not assume a sweet spot and is compatible with any symmetric or asymmetric loudspeaker layout. We also present geometric and distance-based diffusion techniques that ensure smooth spread for any amount of diffusion across all directions and distances.
The Need for Better 3D Panning in Spatial Audio Production
In the effort to provide the industry with better and more efficient tools for producing spatial content, it's vital to understand the limitations and caveats of those tools which are commonly used today. 3D panners allow audio engineers and sound designers to position sound objects in 3D space, making them indispensable in the creation of any immersive audio experience.
In this paper, we first present state-of-the-art 3D panning techniques such as Vector-Based Amplitude Panning (VBAP), Multiple-Direction Amplitude Panning (MDAP), and Distance-Based Amplitude Panning (DBAP) in detail. Although these techniques are widely used, they each possess key limitations that can degrade the immersive experience. One such limitation found in VBAP and MDAP is the assumption that the listener be placed in a “sweet spot” surrounded by equidistantly positioned loudspeakers, rendering both techniques incompatible with irregular loudspeaker layouts. VBAP, which renders sound sources between adjacent loudspeaker triplets in 3D layouts, can incur intense spatial jumps across the loudspeakers' common edge. When representing moving sound sources, VBAP causes unintended fluctuations in both perceived spatial spread and spectral coloration. MDAP, an extension of VBAP, was developed based on a uniform spatial spread that simultaneously pans the sound signal to multiple directions close to one another. This approach reduces perceived fluctuations in source width by making the spatial spread independent of the panning direction. Since VBAP and MDAP are also distance agnostic, neither is ideal for applications like gaming and XR where the perceived depth of the object in 3D space is critical. Finally, both VBAP and MDAP depend heavily on triangulation, which results in inaccurate panning when there aren’t enough speakers to triangulate.
Later, we propose our novel algorithm for Volumetric Amplitude Panning, a more accurate and versatile approach to 3D panning that doesn’t suffer from these critical limitations. Consumer demand for spatial audio in music, gaming, cinema, AR, VR, and XR will continue to increase, placing greater emphasis on the need for tools which enable speedier production and a higher quality standard. This proposal represents a critical step forward in creating a more reliable spatial production workflow that empowers audio engineers, producers, and sound designers to make accurate mix decisions in 3D space and thereby enhance the overall quality of their immersive audio experiences.
II. STATE-OF-THE-ART PANNING ALGORITHMS
Vector-Based Amplitude Panning
Ville Pulkki originally proposed Vector-Based Amplitude Panning (VBAP) as the most reliable and universal amplitude panning technique compatible with almost any surrounding loudspeaker configuration. VBAP provides directionally reliable auditory event localization by activating the fewest number of loudspeakers possible. However, this can also result in fluctuations in width and color for moving sound sources.
A virtual source, defined as an intended auditory event at a panning direction θ, can in principle be controlled by the following criterion, which assumes Gerzon’s energy and velocity vector (rE and rV) model to predict the perceived direction:
Here, the loudspeakers’ direction vectors are denoted by θl, and the amplitude weights g̃l must be normalized to maintain constant loudness.
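For reference, in the standard VBAP formulation (as presented in the Zotter and Frank chapter listed in the references), the criterion and the loudness normalization are commonly written as shown here. L denotes the number of loudspeakers, and the symbol g_l for the normalized weight is a notational assumption, not taken from this paper.

```latex
\boldsymbol{\theta} = \sum_{l=1}^{L} \tilde{g}_l \, \boldsymbol{\theta}_l ,
\qquad
g_l = \frac{\tilde{g}_l}{\sqrt{\sum_{l'=1}^{L} \tilde{g}_{l'}^{2}}}
```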
Additionally, the weights g̃l should always be non-negative to prevent in-head localization and other distortions. Loudspeaker rings around the horizon will always use 1 or 2 loudspeakers, while arrangements on a surrounding sphere will always use between 1 and 3 loudspeakers. The selected loudspeakers' directions must enclose the intended direction of the desired auditory event or virtual source. To ensure directional stability of the auditory event, the angle between loudspeakers should remain less than 90 degrees. To choose which loudspeakers to activate for playback, all loudspeakers on the convex hull are grouped into triplets, and the triplet with all non-negative weights (g1 ≥ 0, g2 ≥ 0, and g3 ≥ 0) is selected for activation.
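The triplet selection can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`det3`, `triplet_gains`, `vbap_3d`), the tolerance `EPS`, and the assumption that the triplet list comes from a precomputed convex-hull triangulation are all hypothetical. Gains are found by solving θ = g1 θ1 + g2 θ2 + g3 θ3 with Cramer's rule, and the first triplet with all non-negative weights is normalized to unit energy.

```python
import math

EPS = 1e-9  # numerical tolerance (assumed value)

def det3(a, b, c):
    """Determinant of the 3x3 matrix with columns a, b, c (scalar triple product)."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def triplet_gains(target, l1, l2, l3):
    """Solve target = g1*l1 + g2*l2 + g3*l3 by Cramer's rule; None if degenerate."""
    d = det3(l1, l2, l3)
    if abs(d) < EPS:
        return None  # the three directions do not span 3D space
    return [det3(target, l2, l3) / d,
            det3(l1, target, l3) / d,
            det3(l1, l2, target) / d]

def vbap_3d(target, speakers, triplets):
    """Return (triplet, normalized gains) for the first enclosing triplet, else None."""
    for triplet in triplets:
        gains = triplet_gains(target, *(speakers[i] for i in triplet))
        if gains is None or min(gains) < -EPS:
            continue  # degenerate, or target lies outside this triplet
        gains = [max(g, 0.0) for g in gains]
        norm = math.sqrt(sum(g * g for g in gains))
        if norm < EPS:
            continue
        return triplet, [g / norm for g in gains]
    return None
```

A direction that no triplet encloses returns `None`, which mirrors the situation where a layout has too few loudspeakers to triangulate.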
Since convex hull triangulation is a key component of vector-based panning techniques, irregular loudspeaker arrangements that result in degenerate vector bases can cause problems. For off-center placements, the nearest loudspeaker dominates localization in various panning directions. A likely outcome is that the apparent direction stays inside the loudspeaker pair and the directional mapping appears monotonic with the panning angle. As a result of this behavior, the perceived width and coloration vary audibly. Virtual source motions that cross the common edge of neighboring loudspeaker triplets frequently produce strong, abrupt jumps that are clearly audible.
VBAP can also be modified to use the squares of the weights to improve the perceptual mapping. The effect of this is that the auditory event aligns more closely with the direction of Gerzon’s energy vector (rE).
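Gerzon's two predictors can be computed directly from a set of gains and loudspeaker directions, which makes it easy to see how linear and squared weights differ. The sketch below uses the standard definitions rV = Σ g_l θ_l / Σ g_l and rE = Σ g_l² θ_l / Σ g_l²; the function names and the helper `azimuth_deg` are hypothetical and chosen only for illustration.

```python
import math

def weighted_direction(weights, directions):
    """Weighted mean of direction vectors, normalized by the sum of weights."""
    total = sum(weights)
    dim = len(directions[0])
    return tuple(
        sum(w * d[k] for w, d in zip(weights, directions)) / total
        for k in range(dim)
    )

def velocity_vector(gains, directions):
    """rV: direction vectors weighted by the gains themselves."""
    return weighted_direction(gains, directions)

def energy_vector(gains, directions):
    """rE: direction vectors weighted by the squared gains."""
    return weighted_direction([g * g for g in gains], directions)

def azimuth_deg(vector):
    """Azimuth of a vector in the horizontal plane, in degrees."""
    return math.degrees(math.atan2(vector[1], vector[0]))
```

For a loudspeaker pair driven with unequal gains, the squared weighting pulls rE further toward the louder loudspeaker than rV, so the two predictors agree only when the gains are equal or a single loudspeaker is active.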
Multiple-Direction Amplitude Panning
Pulkki expanded VBAP to Multiple-Direction Amplitude Panning (MDAP) in order to regulate the number of active loudspeakers for moving sound objects by adjusting the rE or rV vectors in both length and direction. By doing this, it’s possible to maintain consistency in both the perceived width and coloration of sound sources over time.
As a directional spreading scheme, MDAP uses multiple virtual sources dispersed around the panning direction. For horizontal loudspeaker rings, MDAP can consist of a pair of virtual VBAP sources at an angle ±θ to the panning direction. The angle θ = 90% · 180°/L produces an appropriately flat spread in a ring of L loudspeakers with uniform angular spacing. Additionally, MDAP appears to balance the direction of the rE measure with that of the rV measure, the latter being the one controlled by VBAP and MDAP. With MDAP, the cases in which VBAP activates only a single loudspeaker can be avoided, giving more consistent panning.
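The horizontal case can be sketched as two pairwise VBAP solutions summed and renormalized. This is an illustrative sketch under stated assumptions: the function names `vbap_2d` and `mdap_2d`, the tolerance `EPS`, and the `spread_fraction` parameter (set to 0.9 to match the 90% factor above) are hypothetical, and loudspeaker azimuths are given in radians.

```python
import math

EPS = 1e-9  # numerical tolerance (assumed value)

def vbap_2d(phi, speaker_angles):
    """Unnormalized pairwise VBAP gains for azimuth phi (radians) on a ring."""
    n = len(speaker_angles)
    order = sorted(range(n), key=lambda i: speaker_angles[i])
    tx, ty = math.cos(phi), math.sin(phi)
    gains = [0.0] * n
    for k in range(n):
        i, j = order[k], order[(k + 1) % n]  # adjacent loudspeaker pair
        ax, ay = math.cos(speaker_angles[i]), math.sin(speaker_angles[i])
        bx, by = math.cos(speaker_angles[j]), math.sin(speaker_angles[j])
        det = ax * by - ay * bx
        if abs(det) < EPS:
            continue  # pair is collinear, cannot form a basis
        gi = (tx * by - ty * bx) / det
        gj = (ax * ty - ay * tx) / det
        if gi >= -EPS and gj >= -EPS:  # pair encloses the target direction
            gains[i], gains[j] = max(gi, 0.0), max(gj, 0.0)
            return gains
    return gains  # all zeros if no pair encloses the direction

def mdap_2d(phi, speaker_angles, spread_fraction=0.9):
    """Sum two VBAP sources at phi +/- theta, then normalize to unit energy."""
    theta = spread_fraction * math.pi / len(speaker_angles)
    total = [0.0] * len(speaker_angles)
    for direction in (phi - theta, phi + theta):
        for i, g in enumerate(vbap_2d(direction, speaker_angles)):
            total[i] += g
    norm = math.sqrt(sum(g * g for g in total))
    return [g / norm for g in total] if norm > EPS else total
```

On a ring of four loudspeakers at 0°, 90°, 180°, and 270°, panning to 0° activates a single loudspeaker with `vbap_2d` but three with `mdap_2d`, which is the single-loudspeaker case the text says MDAP avoids.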
Distance-Based Amplitude Panning
The majority of standard spatialization techniques require the listener to be situated in a “sweet spot” surrounded by equidistantly positioned loudspeakers. Such speaker layouts might not be ideal for actual concert, stage, or installation purposes. With no preconceived notions regarding the placements of the speakers in space or their relationships to one another, Distance-Based Amplitude Panning (DBAP) extends the equal intensity panning principle from a pair of speakers to a loudspeaker array of unlimited size.
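A compact sketch of the DBAP gain law, following the formulation published by Lossius et al. (listed in the references): each gain is inversely proportional to a power of the source-to-loudspeaker distance, the exponent is set by a rolloff in dB per doubling of distance, and a small spatial blur term keeps the distance nonzero. The function name and the default values (6 dB rolloff, blur of 0.1) are illustrative assumptions, and per-loudspeaker weights are omitted for brevity.

```python
import math

def dbap_gains(source, speakers, rolloff_db=6.0, blur=0.1):
    """Unit-energy DBAP gains for a source and arbitrary loudspeaker positions."""
    # Exponent derived from the rolloff per doubling of distance.
    a = rolloff_db / (20.0 * math.log10(2.0))
    # Distance to each loudspeaker, offset by the spatial blur term.
    dists = [
        math.sqrt(sum((s - p) ** 2 for s, p in zip(source, spk)) + blur ** 2)
        for spk in speakers
    ]
    # Normalization constant so that the squared gains sum to one.
    k = 1.0 / math.sqrt(sum(1.0 / d ** (2.0 * a) for d in dists))
    return [k / d ** a for d in dists]
```

Because only distances enter the computation, the same function works for regular and irregular layouts alike; no triangulation or sweet spot is involved.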
III. VOLUMETRIC-BASED AMPLITUDE PANNING
We propose a volumetric 3D panner which augments VBAP to allow sound objects to be panned to any point in 3D space, and which works for any asymmetric or non-uniform loudspeaker grid. This panner is based on the principle that the total energy being distributed amongst multiple loudspeakers should be constant across all panning directions.
As previously discussed, VBAP and MDAP panning techniques are limited in that they require the listener to be positioned in the sweet spot and don’t take distance into account. In a 3D audio environment where both the listener and the pan puck (which determines the position of the panned sound source) can be anywhere within 3D space, distance-agnostic techniques like these fail to provide accurate phantom source perception. One of the main advantages of the proposed technique is that it allows accurate panning for any non-uniform loudspeaker layout. In order to explain volumetric-based panning, we will consider 3D panning as two distinct scenarios: 1) On-the-Wall Panning and 2) Inside-the-Wall Panning.
When a sound object is panned on the wall in a 3D loudspeaker grid, it’s important to maintain both the source direction and overall loudness. This can be visualized as the speakers being on the circumference of a 2D circle or 3D sphere with a radius equal to the distance between each speaker and the listener (Figure 1). To avoid any audible jumps, it is critical to ensure smooth transitions from one position to another. If the transitions and boundary conditions are handled well, a normalized VBAP or MDAP can be used for panning anywhere on the walls. Assuming g1, g2, and g3 are the triplet weights obtained from VBAP/MDAP calculation, the scaled weights can be defined as sc ∗ g1, sc ∗ g2, and sc ∗ g3, where sc can be computed as:
These weights are then multiplied by the total input signal energy such that it is distributed to the speakers corresponding to g1, g2, and g3.
As the panning position moves inward towards the listener from the circle or sphere containing the speakers, the traditional VBAP loses perceptual accuracy. In order to accurately pan a sound source inside the walls while maintaining the overall loudness and source direction, one must first assume that the total energy available for distribution is constant. In most panning applications, this energy is assumed to be the mono/stereo input source signal energy which needs to be distributed across the speaker grid. The contribution of speakers for a source inside the walls can be considered as a sum of two parts.
The first part is obtained via VBAP as if the source position is on the walls without distance consideration. The source position on the walls is obtained by extending the vector from the center of the room (O) and the source position all the way to the walls. The VBAP contribution here helps maintain the direction of the panned source inside the walls. The second part is the contribution of the volumetric factor. Assuming that Emax is the summation of the maximum VBAP weights that can be applied to a speaker, the volumetric factor for a particular source position si can be defined as:
The weights g1, g2, and g3 are calculated for the source position si. Lastly,
The weights g1max, g2max, and g3max are the maximum VBAP weights that can be applied to a source position. In most cases, this position will coincide with one of the speaker locations.
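The VBAP part described above depends on projecting the interior source position onto the wall along the line from the room center O. For the circular or spherical case of Figure 1, that projection can be sketched as follows. The function name `project_to_wall` and the `radius` parameter are hypothetical, and the sketch assumes the wall is a circle or sphere centered on O. A source exactly at the center has no defined direction, so the sketch returns `None` in that case.

```python
import math

def project_to_wall(source, center, radius=1.0):
    """Extend the ray from center through source until it meets the wall."""
    offset = [s - c for s, c in zip(source, center)]
    dist = math.sqrt(sum(o * o for o in offset))
    if dist < 1e-12:
        return None  # source at the center: direction is undefined
    scale = radius / dist
    return tuple(c + o * scale for c, o in zip(center, offset))
```

The projected point can then be passed to an on-the-wall panner to obtain the directional weights g1, g2, and g3 for the interior source.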
Fig. 1: Example illustrating On-the-Wall and Inside-the-Wall panning for a circular/spherical layout.
To account for distance-based effects, the relative Euclidean distances from each speaker to the source are first computed to calculate the contribution of each speaker to the total energy. Let the center of the 3D space be represented as O, the source position as si, and the loudspeaker positions as p1, p2, p3, ..., pi. EVol is then distributed to each of the loudspeakers based on the distance of the source to the speakers. The Euclidean distance between the source and any loudspeaker is given by:
The Euclidean distance between the center and any loudspeaker is given by:
The volumetric contribution of any loudspeaker for a source inside the walls is then given by:
where D(O, pi) is:
The total contribution of any loudspeaker for a source inside the walls is thus the sum of the volumetric (EVol pi) and VBAP (EVBAP) weights.
Fig. 2: Compression of Weights to smoothen transitions from on the wall to inside the wall.
Additionally, maintaining smooth transitions between on the wall and inside the wall requires certain boundary conditions to be met. As the weights increase linearly from 0 to a perceivable value gx around the wall, disturbing audible jumps often occur in localization and loudness. To suppress these sudden jumps, compression is applied to the weights as they ramp up from 0 to gx. Here, gx is the minimum audible loudspeaker weight (Figure 2).
The volumetric weights ensure that the perception of the panned source is accurate and in the correct direction of the rE and rV vectors. When the panned source is at the center of the room there must be contributions from all speakers to achieve the perception of the phantom source also being at the center of the room. Volumetric panning achieves that seamlessly.
Diffusion is a fundamental part of any spatial audio production workflow. Diffusion or Decorrelation is defined in mathematical terms as a process whereby an audio source signal is transformed into multiple output signals with waveforms that appear different from each other, but which sound the same as the source. Diffusion significantly affects the perception of spatial imagery primarily by increasing the externalization and image width. Here we introduce two approaches to diffusion.
Fig. 3: Geometric Based Approach for Diffusion.
A novel geometric-based approach is proposed, where the loudspeakers are considered in a unit spherical (3D) or a unit circular (2D) grid. To facilitate the diffusion, a diffusion sphere or circle is created (as shown in Figure 3) at the location of the source, the radius of which is controlled by the amount of diffusion. Figure 3 shows the diffusion sphere or circle for a 5.1 loudspeaker setup, with the loudspeakers drawn as squares around the circle with center O. The circle S is the source position where the sound is panned to. Let us define the radius of diffusion as r, which corresponds to the minimum and maximum range of diffusion energy that is desired in the application. The diffusion energy is computed as a fraction of the total energy based on the application and maximum diffusion desired. The diffusion energy is distributed to the loudspeakers by measuring the intersection of the diffusion sphere with the unit spherical grid. In Figure 3, the diffusion sphere intersects the unit sphere at points N and M, the lengths of which correspond to the two roots α, β of the quadratic equation.
The length NM (d) can be computed as α−β or β−α depending on which root is larger. It is important to note that as the diffusion sphere gets larger, the intersection with other speakers comes into play, and the corresponding roots and d are computed for each speaker. The total diffusion energy (ETotalDiff), which is user defined, decides the maximum diffusion energy that each speaker (EMaxDiffpi) can have:
where n is the number of loudspeakers in the grid. The diffusion contribution for each speaker (EDiffpi) is a fraction of (EMaxDiffpi), that is determined by the length d computed in the equation:
In this approach, the total energy is then assumed to be a sum of the volumetric panned energy explained previously (with no diffusion) and the diffusion energy EDiffpi. Since the total diffusion energy is user defined, it can be tuned for any specific application, speaker layout, or listener position based on perceptual analysis.
Another approach to diffusion is to capture the source positions inside the room, calculate the Euclidean distances of the speakers to the source, and distribute the diffusion energy accordingly. This approach caters to the source position moving inwards from the circumference of the speaker grid while maintaining the overall loudness and source direction. As the sound object is moved towards the listener (or the center of the 3D space), the energy will be spread equally amongst all speakers. For every speaker, the contribution of the diffusion energy and the volumetric 3D panned energy is then combined to ensure smooth functionality across all directions, distances, and diffusion sizes.
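One way to picture the distance-based approach is as an inverse-distance weighting of a user-defined diffusion energy. The text does not give a formula for this step, so the weighting law, the `floor` term that avoids division by zero, and the function name below are all assumptions made purely for illustration. With this choice, a source at the center of an equidistant layout yields equal shares on every loudspeaker, and the shares always sum to the requested total.

```python
import math

def distance_diffusion(source, speakers, total_diffusion_energy, floor=1e-3):
    """Split a diffusion energy budget across speakers by inverse distance.

    The inverse-distance law is an illustrative assumption, not the
    paper's formula.
    """
    inverse = []
    for spk in speakers:
        d = math.sqrt(sum((s - p) ** 2 for s, p in zip(source, spk)))
        inverse.append(1.0 / max(d, floor))  # floor guards against d == 0
    norm = sum(inverse)
    return [total_diffusion_energy * w / norm for w in inverse]
```

The per-speaker diffusion share returned here would then be added to that speaker's panned energy, as described in the paragraph above.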
Fig. 4: 3D Panner implementation based on the principles of Volumetric Amplitude Panning.
This paper describes a practical volumetric-based panning mechanism with a focus on Spatial Audio Production tools (Figure 4). Though VBAP and MDAP algorithms are state-of-the-art, they seldom consider panning distance and often rely heavily on triangulation. Moreover, these techniques are agnostic to listener and loudspeaker positions. Such limitations render both VBAP and MDAP incapable of providing the level of granularity in perceived sound source positions required to meet the needs of today’s spatial music production, gaming, and AR/VR applications.
Volumetric amplitude panning addresses these drawbacks by effectively combining traditional VBAP with listener-to-speaker distance considerations, thereby reducing the fluctuation and coloration of sound during panning. Since it can also be used with custom speaker layouts, this technique offers greater flexibility than VBAP in designing novel soundscapes for modern-day spatial audio applications.
We also discuss methods to eliminate transitional jumps as we move from the speaker grid wall towards the listener position such that the overall panning energy remains constant. In order to further reduce spectral fluctuations and enhance the spatial width, we introduce geometric-based and distance-based approaches to acoustic diffusion. These enable the user to specify and control the total diffusion energy needed for an application and help smoothen the volumetric amplitude panning across all directions and distances. Figure 4 shows an example implementation of volumetric-based amplitude panning where the user has the flexibility to work with any loudspeaker grid in a 3D space. Future work involves a detailed subjective experiment and objective analysis of the panner performance using Gerzon’s rE or rV vectors.
 Zotter, Franz, and Matthias Frank. "Amplitude Panning Using Vector Bases." Ambisonics. Springer, Cham, 2019. 41-52.
 Sunder, K. (2021). Binaural audio engineering. In 3D Audio (pp. 130-159). Routledge.
 V. Pulkki, Spatial sound generation and perception by amplitude panning techniques, Ph.D. dissertation, Helsinki University of Technology (2001)
 V. Pulkki, Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc. 45(6), 456–466 (1997)
 Gerzon, Michael A. "General metatheory of auditory localisation." Audio Engineering Society Convention 92. Audio Engineering Society, 1992.
 M. Frank, Phantom sources using multiple loudspeakers in the horizontal plane, Ph.D. Thesis, Kunstuni Graz (2013)
 V. Pulkki, Uniform spreading of amplitude panned virtual sources, in Proceedings of the Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE (New Paltz, NY, 1999)
 Lossius, Trond, Pascal Baltazar, and Théo de la Hogue. "DBAP–distance-based amplitude panning." ICMC. 2009.
 Kendall, Gary S. "The decorrelation of audio signals and its impact on spatial imagery." Computer Music Journal 19.4 (1995): 71-87.
 Begault, Durand R., and Leonard J. Trejo. 3D sound for virtual reality and multimedia. No. NASA/TM-2000-209606. 2000.
 Frank, Matthias. Localization using different amplitude-panning methods in the frontal horizontal plane. 2014.