Audio Engineering Society
Presented at the 153rd Convention
2022 October, Online
This convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Nikhil Javeri - Manager, Machine Learning R&D; Embody
Prabal Bijoy Dutta - Computer Vision Engineer; Embody
Dr. Kaushik Sunder - Director of Engineering; Embody
Kapil Jain - Chief Executive Officer; Embody
Predicting Personalized Head Related Transfer Functions using Acoustic Scattering Neural Networks
Recently increased demand for immersive sound experiences across all media has made it more critical than ever to ensure a high quality standard for spatial audio production and consumption. Binaural audio, a headphone-centric format, is one of the most convenient methods for both virtual production and front-end distribution of spatial audio. Personalized Head-Related Transfer Functions (HRTFs) are integral to determining the quality and directional accuracy of the spatial audio experience. In this paper, we describe a novel technique to predict personalized HRTFs based on 2D images or video capture. Existing state-of-the-art 3D reconstruction techniques were developed for generic objects and thus do not perform well with complex structures such as the human ear. We propose a novel 3D reconstruction algorithm that is modeled taking into account the geometry of the ear. The 3D output is then fed to an Acoustic Scattering Neural Network (ASNN) designed on the principles of Boundary Element Method (BEM) which outputs personalized HRTFs. The predicted personalized HRTFs are then compared objectively with measured HRTFs. We discuss the results, limitations, and the caveats necessary for an accurate modeling of personalized HRTFs.
The Rise of Immersive Content and the Scalability of Production
There’s been a recent surge of interest in creating interactive immersive content for music, movies, games, and extended reality (XR) on a wide variety of platforms. The rising popularity of hearables as a medium for delivering and experiencing that content necessitates accurate headphone-based rendering methods. Such methods depend on the use of personalized HRTFs to achieve the best spatial experience and good externalization.
Many processes for measuring personalized HRTFs are tedious and may take several hours depending on the spatial grid being measured. Today’s most promising techniques utilize 2D images or 3D scans of the head to synthesize personalized HRTFs using techniques such as the Boundary/Finite Element Method. Here we detail the architecture for estimating personalized HRTFs for Virtual, Augmented, and Mixed Reality on a mass scale with machine learning techniques and an input database of 2D images and video capture. We first present a novel approach for 3D ear reconstruction, followed by an ASNN model capable of predicting personalized HRTFs on a mass scale over the cloud.
Predictive 3D Reconstruction
Capturing the intricacies of the human ear can be difficult. 3D ear scanning methods require expensive specialized equipment and require subjects to sit still for extended durations. The resulting scans then need to be stitched, registered, cleaned of imperfections and noise, and isotropically remeshed before being fed into the simulator or engine being used for HRTF prediction. By using prediction as the basis for 3D reconstruction, we’re able to generate personalized HRTFs using 2D images or video capture submitted by users with their own smartphones and tablets.
Beyond the Limitations of Vision-based 3D Reconstruction
The main disadvantage of the state-of-the-art vision-based 3D reconstruction techniques available in the literature is that they’re developed to solve shape estimation problems for generic objects like tables, chairs, cars, etc. Ear geometries involve a high number of concave and convex sub-structures that such models don’t reconstruct well. Custom backbone networks are required in order to properly understand and utilize ear structures for prediction.
In this paper, we propose a neural network model that explores ear reconstruction as a mesh and occupancy grid. We discuss novel loss functions that help back-propagation over ear sub-structures to produce accurate spatial representations. This technique rapidly outputs a 3D model of the ear, with desired accuracy that is compatible with Boundary Element Method (BEM) simulators and HRTF prediction engines alike. Accurate spatial sampling is also discussed here, as it’s both a prerequisite for 3D reconstruction and essential for forming a good HRTF without any artifacts.
Acoustic Scattering Neural Network Solves BEM Problems
Because the accuracy of a BEM solution improves with the number of points on the mesh, accurate simulations need dense meshes, and the cloud infrastructure required to run them can be expensive and slow. BEM simulations also suffer if the mesh has inaccuracies like holes or floating particles. Adding to the list of inherent problems, BEM solvers do not resolve sufficient detail unless configured properly.
By using a neural network to solve the acoustic scattering problem posed by the 3D reconstructed ear data, HRTFs can be directly predicted in a cost effective and computationally efficient manner. This paper describes one such custom Acoustic Scattering Neural Network (ASNN) for HRTF prediction. The predicted personalized HRTFs are then objectively compared with the measured HRTFs.
II. 3D RECONSTRUCTION
With numerous uses in robotics, augmented reality, and autonomous vehicles, 3D reconstruction has emerged as a fundamental area of study in computer vision. Here, we outline the state-of-the-art 3D reconstruction techniques currently in use and contrast them with the proposed inception-based network. The reconstruction is performed in the voxel domain, where the leading techniques at the moment include Attsets, 3D-RETR, and Pix2Vox. Despite achieving notable results when recreating common household objects, these techniques do not excel at recreating the ear surface from images.
We tested these techniques with both multiple-view and single-view images. Various loss functions including Dice, Tversky, Binary Cross Entropy, and Focal were tested and found to work well in a number of instances of generic 3D reconstruction.
Attsets uses an aggregation method that enables feature concatenation from an R2N2 encoder to aid the reconstruction. Attsets assigns an attention score to each of the learned visual features. For any arbitrary number of visual features, the aggregation procedure produces a fixed-size representation. This is accomplished through an attention activation mechanism, which first computes a score for each feature in the set and then aggregates the representation from the features weighted by the learned scores. Prior to learning the visual feature scores for multi-view, the average of all representations was calculated.
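The aggregation step described above can be sketched in a few lines. This is a minimal illustration and not the Attsets implementation: the single dense layer (`W`, `b`), the feature size `D`, and the function name `attentional_aggregate` are assumptions made for the example, and the weights are random rather than learned.

```python
import numpy as np


def attentional_aggregate(features, weights, bias):
    """Fuse an (N, D) set of per-view features into a single (D,) vector."""
    scores = features @ weights + bias                    # (N, D) attention activations
    scores = scores - scores.max(axis=0, keepdims=True)   # stabilise the softmax
    attention = np.exp(scores)
    attention /= attention.sum(axis=0, keepdims=True)     # softmax across the N views
    return (attention * features).sum(axis=0)             # weighted sum, size independent of N


rng = np.random.default_rng(0)
D = 8
W = rng.normal(size=(D, D))   # hypothetical stand-in for learned weights
b = np.zeros(D)

one_view = attentional_aggregate(rng.normal(size=(1, D)), W, b)
five_views = attentional_aggregate(rng.normal(size=(5, D)), W, b)
print(one_view.shape, five_views.shape)   # both have shape (D,)
```

Because the softmax is taken across the set dimension, the output size does not depend on how many views are supplied, and the result does not depend on the order of the views.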
3D-RETR uses a transformer-based architecture to encode the 2D image, followed by a Transformer-CNN decoder that maps the encoded context to voxel space. The encoder is a Data-efficient Image Transformer (DeiT), which takes the input image and a positional encoding. For multi-view reconstruction, the encoder network produces a dense vector that is averaged over all views.
The positional embeddings of these dense vectors are then sent through the transformer decoder network. This decoder outputs voxel features, which are reshaped to feed the CNN decoder. The CNN decoder is made up of residual blocks followed by a series of transposed-convolution blocks, which up-sample the latent representation into a cubic representation that encodes whether or not each voxel is occupied. While the original work is based on a resolution of 32x32x32, the representation used in this experiment has a resolution of 96x96x96 (Filters-Width-Height).
Pix2Vox uses a context-aware encoding to enable the decoder to perform better while reconstructing the extracted features. This method first creates a number of coarse volumetric reconstructions from the input images. The higher-quality parts of the coarse reconstructions are then chosen and fused using a context-aware selection mechanism. From the fused volume, a refined final volumetric reconstruction is created.
An object that self-occludes (i.e., has a portion of itself covered) during the reconstruction phase is frequently recreated with imperfections in the portion that was visible; Pix2Vox was primarily intended to address this problem. Each coarse volume is assigned a context-based score that guides the fusion of its well-reconstructed parts. A refiner network then improves the fused reconstruction by fixing the incorrectly recovered portions of the volume, which has been a problem in nearly all of the described networks.
In this study, segmented ear images serve as the 3D reconstructor’s input. The reconstruction network receives input from a Segmentation Network (SegNet) based detector that has been trained to separate the RGB ear images from the original image. By applying techniques like the Multi-Scale Structural Similarity Index Measure (MS-SSIM), we find structural-similarity thresholds that select images which are maximally variant with respect to each other. Using this method, we chose 5 such images from the video. For 3D reconstructions with higher resolution, more images can be used. Additionally, we were able to obtain 3D scans of our ears in binary voxel format.
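One way to picture the frame selection is as a greedy search for frames that are least similar to those already chosen. The sketch below is only an illustration under stated assumptions: `global_ssim` is a crude single-window SSIM used as a stand-in for MS-SSIM, the greedy rule replaces the threshold search described above, and both function names are hypothetical.

```python
import numpy as np


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over whole images; a crude stand-in for MS-SSIM."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))


def select_diverse_frames(frames, k=5, similarity=global_ssim):
    """Greedily pick k frame indices that are least similar to the frames already picked."""
    n = len(frames)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = similarity(frames[i], frames[j])
    chosen = [0]                                # seed with the first frame
    while len(chosen) < min(k, n):
        worst = sim[:, chosen].max(axis=1)      # similarity to the closest chosen frame
        worst[chosen] = np.inf                  # never re-pick a frame
        chosen.append(int(worst.argmin()))
    return sorted(chosen)


# Toy usage with random grayscale "frames" in [0, 1]; real input would be segmented ear crops.
video = np.random.default_rng(0).random((20, 32, 32))
print(select_diverse_frames(video, k=5))
```

Swapping in a true MS-SSIM implementation only requires passing a different `similarity` callable.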
Proposed Network Based on Inception
We propose an inception-based network that extracts feature maps using convolutions with kernel sizes (1, 3, 5, 7) and stacks the resulting maps. The same procedure is repeated to produce 3D feature maps with the desired resolution of 96x96x96. With this method, a decoder is not necessary: the encoder learns both the representation and the reconstruction from the processed feature maps.
Fig. 1: Proposed network based on Inception.
Figure 1 illustrates the proposed network. This method aims to identify representative features at various kernel levels that are suitable for the reconstruction task. A large number of features are then extracted from the output feature maps. The 2D resolution of the feature maps is halved at each stage of stacking the filter output, so the 3-channel 384x384 input images are finally reduced spatially and expanded in depth to a 96x96x96 representation. The Dice loss function outperformed the others when used to train the Inception Net, and the results shown were obtained with this loss function.
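The idea of filtering at several kernel sizes, stacking the outputs, and halving the 2D resolution at each stage can be sketched as follows. This is a toy illustration, not the network in Figure 1: the kernels are random and untrained, each stage emits only one map per kernel size, and `conv2d_same` and `inception_stage` are hypothetical helper names. The actual network would use enough filters for the stacked maps to supply the 96 depth slices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that preserves the (H, W) size."""
    k = kernel.shape[0]                               # odd, square kernel assumed
    padded = np.pad(image, k // 2)
    windows = sliding_window_view(padded, (k, k))     # (H, W, k, k)
    return np.einsum("hwij,ij->hw", windows, kernel)


def inception_stage(x, kernel_sizes=(1, 3, 5, 7), seed=0):
    """Filter a (C, H, W) input at several kernel sizes, stack, then halve H and W."""
    rng = np.random.default_rng(seed)
    maps = []
    for k in kernel_sizes:
        kernels = rng.normal(size=(x.shape[0], k, k)) / (k * k)   # untrained stand-ins
        maps.append(sum(conv2d_same(ch, kern) for ch, kern in zip(x, kernels)))
    return np.stack(maps)[:, ::2, ::2]


image = np.random.default_rng(1).random((3, 384, 384))
stage1 = inception_stage(image)     # spatial size 384 -> 192
stage2 = inception_stage(stage1)    # spatial size 192 -> 96
print(stage1.shape, stage2.shape)
```

Two halving stages are enough to take a 384x384 image to the 96x96 footprint of the target grid.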
We experimented with the Dice, Tversky, Binary Cross Entropy, and Focal loss functions. We present only the results using the Dice loss function because it performed significantly better than the other three. The assessment measure employed for this work is the Intersection over Union (IoU) between the voxel grid reconstruction and the ground truth voxels, which specifies the voxel-domain overlap between two structures. This metric is practical for our purpose because we configured all the models to have a consistent alignment during the preprocessing procedures. The mean IoU was examined for three subjects (six samples) that the network had never seen during training.
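For reference, the two quantities used here can be written compactly for voxel grids. The sketch below uses the standard definitions of IoU and soft Dice loss; the 0.5 occupancy threshold, the smoothing constant `eps`, and the toy 4x4x4 grids are assumptions for the example.

```python
import numpy as np


def voxel_iou(pred, truth, threshold=0.5):
    """Intersection over Union of two occupancy grids after thresholding."""
    p, t = pred >= threshold, truth >= threshold
    union = np.logical_or(p, t).sum()
    return 1.0 if union == 0 else np.logical_and(p, t).sum() / union


def dice_loss(pred, truth, eps=1e-7):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), usable on probabilities."""
    intersection = (pred * truth).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)


truth = np.zeros((4, 4, 4))
truth[1:3, 1:3, 1:3] = 1.0        # 8 occupied voxels
pred = np.zeros((4, 4, 4))
pred[1:3, 1:3, 1:4] = 1.0         # 12 occupied voxels, 8 of them shared with truth

print(voxel_iou(pred, truth), dice_loss(pred, truth))
```

IoU is only meaningful when both grids share the same alignment, which is why the consistent preprocessing alignment matters for this metric.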
Fig. 2: 3D reconstruction performance of different techniques based on IoU.
Figure 2 displays how well the previously mentioned networks performed while recreating the human ear in voxel grid space. For coarse reconstruction, Pix2Vox creates score maps that are later combined. The scoring system of such networks was inadequate to rebuild the ear in voxel space because of the intricate structure of the ear. The same is true for Attsets; these networks aggregate features using a scoring method which has also been unable to produce an adequate reconstruction. In comparison to the other approaches, 3D-RETR has the greatest mean IoU, and 1D-positional embeddings may be the cause of this improvement. Testing other positional embeddings, such as 2D, sinusoidal, etc., would be worth exploring.
The inception net model was created to use various kernel levels that were afterwards merged in order to encode the ear image. The reconstruction was created using the fused feature maps. According to the results, the network fared well when learning the width by fusing distinct kernel maps, but struggled while learning the voxel-depth in various ear areas. In the following sections, we explain the acoustic scattering neural network (ASNN) model that outputs personalized HRTFs using the 3D voxelized data as input.
III. ACOUSTIC SCATTERING NEURAL NETWORK
The 3D ear data is contained in a 96x96x96 voxel occupancy grid that serves as the neural network’s input. The resolution of 96 was determined from the largest ear in the database, which measures 88.98 mm in height. The next step is to meet the simulation requirement of at least 6 elements per wavelength of interest.
The highest frequency that can be predicted in this situation is 16 kHz. The wavelength at 16 kHz is (343,000 mm/s)/(16,000 Hz) ≈ 21.44 mm. If we assume 12 elements per wavelength, double the initial requirement of 6 to be safe, the element length is 21.44 mm/12 ≈ 1.79 mm. We round this requirement down to 1.5 mm for an additional margin of safety. Dividing the largest ear in our database (88.98 mm) into 96 bins on each axis gives a voxel edge length of 88.98/96 ≈ 0.93 mm. Because this is shorter than both the 1.79 mm element length and the 1.5 mm target, we can efficiently resolve frequencies up to 16 kHz.
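The resolution argument above is simple arithmetic and can be checked directly. The sketch below reproduces it with the values stated in the text (speed of sound 343,000 mm/s, 16 kHz, 12 elements per wavelength, 88.98 mm over 96 bins); the function and variable names are chosen for the example.

```python
SPEED_OF_SOUND_MM_S = 343_000.0


def max_element_length_mm(f_max_hz, elements_per_wavelength):
    """Largest element edge that still gives the requested sampling of the wavelength."""
    return SPEED_OF_SOUND_MM_S / f_max_hz / elements_per_wavelength


def voxel_edge_mm(extent_mm, bins):
    """Edge length of one voxel when an extent is split into equal bins."""
    return extent_mm / bins


wavelength_mm = SPEED_OF_SOUND_MM_S / 16_000        # 21.4375 mm
limit_mm = max_element_length_mm(16_000, 12)        # about 1.786 mm
edge_mm = voxel_edge_mm(88.98, 96)                  # about 0.927 mm

print(f"wavelength {wavelength_mm:.2f} mm, limit {limit_mm:.2f} mm, voxel {edge_mm:.2f} mm")
print("resolution sufficient:", edge_mm <= 1.5 <= limit_mm)
```

The same two functions can be reused to test other grid sizes or frequency limits before committing to a voxel resolution.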
How Sound Interacts with Ear Structures
To understand the acoustic scattering neural network, we first need to understand how sound interacts with complex ear structures. The human ear can be approximately described as a composition of strongly and weakly coupled Helmholtz resonators. Ear regions may couple strongly or weakly depending on the wavelength being considered. For shorter wavelengths, nearby regions can become strongly coupled, while distant regions can do the same for integral multiples of the same wavelengths. Dissonances can also become strongly coupled for integral multiples of half wavelengths. Resonances and dissonances can thus be identified based on these wavelengths.
Training the Acoustic Scattering Neural Network
Our ASNN works by representing these wavelengths as its excitational components. Small kernels that interact with the 3D reconstructed mesh teach our neural networks about these patterns and lead to abstract acoustic potentials at each mesh location. The final HRTF can then be formed by combining these abstract acoustic potentials at various frequencies and places in space.
Fig. 3: Basic overview of the Acoustic Scattering NN with the intermediary outputs.
The fundamental layout of the ASNN is shown in Figure 3; the convolutional kernel serves as its foundation. The ASNN trains a collection of convolutional kernels of various sizes in order to recognize patterns in the ear region and then learn how the acoustic potentials of the distinct concave ear substructures form. When convolved with any 3D ear, these kernels produce abstract acoustic potential feature maps that can be reduced to an HRTF. This process also requires the selection of a capable loss function that can effectively support information back-propagation.
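To make the kernel-over-voxel-grid idea concrete, the sketch below convolves a small occupancy grid with 3D kernels of two sizes and pools each response map to a single value. It is a toy under loud assumptions: the kernels are random instead of trained, the 12x12x12 grid and kernel sizes (3, 5) are arbitrary, mean pooling stands in for whatever reduction the ASNN uses, and `conv3d_valid` and `toy_potential_features` are hypothetical names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_valid(grid, kernel):
    """Valid-mode 3D cross-correlation (the 'convolution' used in deep learning)."""
    windows = sliding_window_view(grid, kernel.shape)       # (X, Y, Z, k, k, k)
    return np.einsum("xyzijk,ijk->xyz", windows, kernel)


def toy_potential_features(grid, kernel_sizes=(3, 5), seed=0):
    """Pool the response of each kernel size into one scalar feature."""
    rng = np.random.default_rng(seed)
    features = []
    for k in kernel_sizes:
        kernel = rng.normal(size=(k, k, k))                 # stand-in for a trained kernel
        features.append(conv3d_valid(grid, kernel).mean())  # reduce the map to one value
    return np.array(features)


ear = np.zeros((12, 12, 12))
ear[3:9, 4:8, 2:10] = 1.0            # crude occupied block standing in for an ear
print(toy_potential_features(ear))
```

In a trained network the pooled features would feed further layers that map them to HRTF magnitudes per frequency and direction; that mapping is not shown here.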
Effectiveness of Divergence Loss Functions
Initial studies explored common loss functions like Mean Absolute Error (MAE), Mean Square Error (MSE), and Huber Loss, but poor performance rendered them unsuitable for this experiment. Divergence loss functions such as Kullback-Leibler (KL) and Itakura-Saito (IS) [10, 11] were considered suitable for this study. We use IS distance as the metric for all our results since it effectively captures perceptual discrepancies between the ground truth and the prediction.
While IS divergence effectively captures perceptual dissimilarity, the overall shape of the HRTF is ultimately the key to personalization. Equation 1 represents IS divergence wrapped in the absolute value function so that it is always non-negative.
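For readers who want a concrete form, the textbook IS divergence between a reference power spectrum P and an estimate Q averages r - ln(r) - 1 over frequency bins, where r = P/Q. The sketch below implements that standard definition with an outer absolute value, as the text describes; it is not a reproduction of the authors' Equation 1, and the `eps` guard and the choice of linear power spectra as inputs are assumptions.

```python
import numpy as np


def is_divergence(truth_power, pred_power, eps=1e-12):
    """Mean Itakura-Saito divergence IS(truth | pred) over frequency bins."""
    truth = np.asarray(truth_power, dtype=float) + eps
    pred = np.asarray(pred_power, dtype=float) + eps
    ratio = truth / pred
    # Each term r - ln(r) - 1 is already >= 0 for r > 0; the absolute value
    # only guards against tiny negative values caused by rounding.
    return float(np.abs(np.mean(ratio - np.log(ratio) - 1.0)))


truth = np.array([1.0, 2.0, 4.0])
pred = np.array([1.0, 1.0, 1.0])
print(is_divergence(truth, pred), is_divergence(pred, truth))
```

The two printed values differ because the divergence is asymmetric, so the order of the arguments in IS(A|B) matters.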
Correlation of HRTF Notches to Cochlear Response
Dissonant time signals interact with one another to cancel out at particular frequencies, resulting in notches. These notches are among the most crucial monaural cues in an HRTF because they correspond to a person’s cochlear response [13, 14]. Accurate perception of sound elevation depends on the notches’ distinctive position and depth. We have chosen to incorporate measures that capture notch distances into the loss function, thereby training the ASNN to learn locations in the ear that generate them.
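A simple way to see what a notch-based measure involves is to locate local minima in a magnitude response and compare their frequencies between two responses. The sketch below is illustrative only: the 6 dB depth threshold, the crude depth definition, the pairing of notches in ascending frequency order, and the names `find_notches` and `notch_frequency_distance` are all assumptions, not the measure used in this work.

```python
import numpy as np


def find_notches(freqs_hz, mag_db, min_depth_db=6.0):
    """Return (frequency, depth) for local minima deeper than min_depth_db."""
    mag = np.asarray(mag_db, dtype=float)
    notches = []
    for i in range(1, len(mag) - 1):
        if mag[i] < mag[i - 1] and mag[i] <= mag[i + 1]:
            # Depth relative to the lower of the highest levels on either side.
            depth = min(mag[:i].max(), mag[i + 1:].max()) - mag[i]
            if depth >= min_depth_db:
                notches.append((float(freqs_hz[i]), float(depth)))
    return notches


def notch_frequency_distance(notches_a, notches_b):
    """Mean absolute frequency gap between notches paired in ascending order."""
    n = min(len(notches_a), len(notches_b))
    if n == 0:
        return float("nan")
    return float(np.mean([abs(a[0] - b[0]) for a, b in zip(notches_a[:n], notches_b[:n])]))


def toy_response(freqs_hz, centers_hz):
    """Synthetic magnitude response (dB) with a 20 dB and a 15 dB notch."""
    return (-20.0 * np.exp(-((freqs_hz - centers_hz[0]) / 400.0) ** 2)
            - 15.0 * np.exp(-((freqs_hz - centers_hz[1]) / 500.0) ** 2))


freqs = np.linspace(4000.0, 16000.0, 70)
measured = find_notches(freqs, toy_response(freqs, (freqs[23], freqs[46])))
predicted = find_notches(freqs, toy_response(freqs, (freqs[24], freqs[47])))
print(measured, notch_frequency_distance(measured, predicted))
```

Note that the minimum search relies on comparisons and a hard threshold, so a measure built this way is not differentiable with respect to the magnitude response.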
Limiting the Predictive Frequency Response Range
We chose to predict the HRTF response exclusively between 4 kHz and 16 kHz with our ASNN, because this high-frequency region is where the pinna’s contribution is most distinctive. Features below 4 kHz are mainly caused by head and torso effects and ear-canal resonance, for which well-established modeling methods exist. Responses above 16 kHz become smaller and have less impact on how humans perceive their surroundings; these can also be modeled independently.
Modeling HRIR, ILD, and ITD
We must stitch the low-frequency (< 4 kHz) and high-frequency (> 16 kHz) regions onto the predicted HRTFs in order to transform them into auralizable Head-Related Impulse Responses (HRIRs). The HRTF magnitude responses capture the distinct interaural level differences (ILD). To model the interaural time differences (ITD), we first estimate the head size using camera-based approaches. For this, we employ the publicly available iris depth and face landmark estimators from MediaPipe [15, 16]. We estimate the head size and ITDs from the cheek-to-cheek distance between the landmarks and the depth of the face within the same image frame of our video input. The head size can also be used to cross-validate the ILD baked into the HRTF prediction.
Because there are various metrics for comparing HRTFs, comparisons can easily become complicated and confusing. We identify objective characteristics such as notch depths and distances as major determinants of an HRTF’s accuracy. In light of this, we found that IS divergence best exemplifies a measurement that is objective and resembles perceptual dissimilarity. It may be inferred that the perceptual responses will be comparable if the IS divergence is small. As a check on our findings, we computed IS divergences between HRTFs at the same angle across multiple participants.
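As an illustration of how a head-size estimate can be turned into an ITD, the sketch below uses the classic Woodworth spherical-head formula, ITD = (a/c)(θ + sin θ). The text does not state which ITD model is used, so this choice is an assumption, as is treating half the cheek-to-cheek width as the head radius; the function name is hypothetical.

```python
import math


def woodworth_itd_s(head_width_m, azimuth_deg, speed_of_sound_m_s=343.0):
    """ITD in seconds for a far-field source, spherical head, azimuth in [0, 90] degrees."""
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("this form of the formula covers 0 to 90 degrees only")
    radius_m = head_width_m / 2.0          # assumption: radius is half the measured width
    theta = math.radians(azimuth_deg)
    return (radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))


for azimuth in (0, 30, 60, 90):
    itd_us = woodworth_itd_s(0.15, azimuth) * 1e6
    print(f"azimuth {azimuth:2d} deg: ITD {itd_us:6.1f} microseconds")
```

A wider estimated head gives proportionally larger ITDs, which is why the camera-based head-size estimate matters for personalization.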
Fig. 4: IS distance showing the variation of HRTFs among subjects
Figure 4 shows the distribution of IS distance between HRTFs across the dataset. This distribution is created by computing IS(A|B) for every HRTF A against every other HRTF B in the dataset, skipping the case where B equals A. Figure 5 shows the magnitude response comparison between the HRTFs predicted by our ASNN and the measured HRTFs. We observe that several of the distinctive notch features are indeed very well captured by the predicted HRTFs.
Overall, the predicted HRTFs have an average IS distance of about 0.271 to their ground truth. For 81% of all HRTFs, the prediction is closer to the ground truth than the nearest neighbor is. The predicted spectra would therefore sound reasonably similar to their corresponding ground truths, even though they don’t look the same, and this is better than nearest neighbors for the majority of HRTFs. The nearest neighbor method is computed in the 3D voxel domain and selects the HRTF corresponding to the nearest neighbor 3D mesh. Here, we note that the prediction based on the Acoustic Scattering NN, with a mean IS of 0.271, outperforms the nearest neighbor method (mean IS of 0.5) for HRTF personalization.
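The nearest neighbor baseline can be sketched as a lookup in the voxel domain. The text does not specify the distance used to find the nearest mesh, so the sketch below assumes voxel IoU as the similarity; the function names and the toy database are hypothetical.

```python
import numpy as np


def voxel_iou(a, b):
    """Intersection over Union of two binary occupancy grids."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union


def nearest_neighbor_hrtf(query_grid, database_grids, database_hrtfs):
    """Return the index and HRTF of the database ear most similar to the query."""
    scores = [voxel_iou(query_grid, grid) for grid in database_grids]
    best = int(np.argmax(scores))
    return best, database_hrtfs[best]


# Toy database: three 4x4x4 "ears" with placeholder 70-bin HRTF magnitudes.
grids = [np.zeros((4, 4, 4), dtype=bool) for _ in range(3)]
grids[0][:2] = True
grids[1][1:3] = True
grids[2][2:] = True
hrtfs = [np.full(70, float(i)) for i in range(3)]

query = np.zeros((4, 4, 4), dtype=bool)
query[1:3] = True
query[0, 0, 0] = True                 # almost identical to ear 1

index, hrtf = nearest_neighbor_hrtf(query, grids, hrtfs)
print(index, hrtf[:3])
```

The baseline can only return an HRTF that already exists in the database, whereas the ASNN predicts a new response for each ear.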
IV. DISCUSSION
Improving the ASNN’s Prediction Outcomes
Although IS is a useful tool for capturing perceptual differences, it does not specifically highlight notches as a distinguishing characteristic. Because IS was created for the entire spectrum, we must build better measurements that focus on notch similarities in order to potentially improve the findings. Figure 5 illustrates how effectively our network can predict a general form that resembles the ground truth. The prediction has more mismatch in some locations than others. However, IS doesn’t take that into account, which makes it an incomplete measure for training an NN by backpropagation.
Fig. 5: Comparison between Ground Truth and Predicted HRTFs with IS scores.
Red: Ground Truth, Blue: Predicted HRTFs
Such an NN would be best trained using a custom loss function that considers regional similarity, notches, and the overall prediction shape. Ideally, such a loss function should be continuous and differentiable. Because the function to find notches is not continuous, and hence not differentiable, the notch distance cannot be easily factored into the loss. The ASNN’s prediction outcomes can be considerably enhanced by combining the NFD and the IS as a metric.
Validating the ASNN Model
An Acoustic Scattering NN is able to predict HRTFs much faster than a BEM simulation, which would take several hours. In our tests, the NN required on average 1.8 seconds to predict 5x7x70 HRTFs (5 participants, 7 channels, and 70 frequency bins) on the following machine: NVIDIA RTX 3080 Max-Q with 16 GB GDDR6 VRAM, i7 11800H processor, and 64 GB DDR4 memory. To further validate the NN model, future research will also involve subjective comparison of the predicted HRTFs to the measured HRTFs.
The Future of Spatial Audio Rendering
We can infer from the discussion of 3D reconstruction in Section II that current state-of-the-art techniques don’t handle complex ear structures well. We propose a prototype network that works well for images of the ears. Even though it’s still a work in progress, the inception network shows promising results by accurately capturing some of the most crucial ear properties. With more data and a better loss function that takes geometry into account, the network performance is expected to improve. Future studies include determining the degree of 3D reconstruction accuracy required for a reliable, accurate HRTF prediction. With the methods described in this research, we now have the means to rapidly and accurately produce the precisely tailored HRTFs for every subject that are essential for spatial audio rendering in interactive media platforms like VR and AR.
[1] Sunder, Kaushik. "Binaural audio engineering." 3D Audio. Routledge, 2021. 130-159.
[2] Liu, Y. J., et al. "The fast multipole boundary element method for potential problems: a tutorial." Engineering Analysis with Boundary Elements 30.5 (2006).
[3] Yang, et al. "Robust attentional aggregation of deep feature sets for multi-view 3D reconstruction." International Journal of Computer Vision 128.1 (2020).
[4] Shi, Zai, et al. "3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers." arXiv preprint arXiv:2110.08861 (2021).
[5] Xie, et al. "Pix2vox: Context-aware 3d reconstruction from single and multi-view images." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[6] Wang, Zhou, Eero P. Simoncelli, and Alan C. Bovik. "Multiscale structural similarity for image quality assessment." The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. Vol. 2. IEEE, 2003.
[7] Sudre, Carole H., et al. "Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations." Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, Cham, 2017. 240-248.
[8] Salehi, Seyed Sadegh Mohseni, Deniz Erdogmus, and Ali Gholipour. "Tversky loss function for image segmentation using 3D fully convolutional deep networks." International workshop on machine learning in medical imaging. Springer, Cham, 2017.
[9] Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE international conference on computer vision. 2017.
[10] Févotte, Cédric, Nancy Bertin, and Jean-Louis Durrieu. "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis." Neural computation 21.3 (2009): 793-830.
[11] Miller, Ryan J. "A Perceptual Evaluation of Short-Time Fourier Transform Window Duration and Divergence Cost Function on Audio Source Separation using Non-negative Matrix Factorization." (2020).
[12] Loizou, Philipos C. "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum." IEEE Transactions on Speech and Audio Processing 13.5 (2005): 857-869.
[13] Davis, Kevin A., Ramnarayan Ramachandran, and Bradford J. May. "Auditory processing of spectral cues for sound localization in the inferior colliculus." Journal of the Association for Research in Otolaryngology 4.2 (2003): 148-163.
[14] Iida, Kazuhiro, Yohji Ishii, and Shinsuke Nishioka. "Personalization of head-related transfer functions in the median plane based on the anthropometry of the listener’s pinnae." The Journal of the Acoustical Society of America 136.1 (2014): 317-333.
[15] Kartynnik, Yury, et al. "Real-time facial surface geometry from monocular video on mobile GPUs." arXiv preprint arXiv:1907.06724 (2019).
[16] Ablavatski, Artsiom, et al. "Real-time Pupil Tracking from Monocular Video for Digital Puppetry." arXiv preprint arXiv:2006.11341 (2020).