See with your ears:
Combining object detection and binaural sound

Humans have the amazing ability to localize sound sources around them. In this project, I am trying to find out if this ability can be combined with recent object detection and localization algorithms to enable a visually impaired person to hear their environment. The idea here is to give each object a specific sound.

The human brain is able to localize a sound source by utilizing auditory cues such as differences in timing and intensity between the two ears. These cues depend on many parameters such as the shape and size of the ear and the head. One way of capturing sound without losing these cues is binaural recording. Although it is now more than 15 years old, this demo is the best binaural sound recording I have come across. Essentially, it is an immersive auditory virtual environment simulating a barbershop (make sure you use stereo headphones).

If you are still not convinced that, at least in theory, visual information can be converted into auditory information by exploiting the human sound localization ability, watch this video by Be Smart, which shows that many visually impaired people learn to use auditory cues to navigate their environment.

Demo and Results

A small demo with one object. Make sure to listen to the video using a headset with stereo audio enabled. The binaural audio generation runs in real time. The video shown below was captured by an ESP32-CAM set up as a web server. Kapwing was used to merge the recorded frames and the synthesized audio for the sake of this demo.

1. Mathematical Modeling of Binaural Sound

The question now is: how do we make a "normal" sound recording sound as if it were coming from a certain location in space?
We are lucky, since there are many datasets containing the Head-Related Impulse Response (HRIR) or the Head-Related Transfer Function (HRTF) measured in various settings. The HRTF is the Fourier transform of the HRIR. These functions describe how properties such as the shape and size of the ear and head modify a sound before it reaches the eardrums.
So given a (mono) sound signal $s(t)$ coming from a direction with elevation $\theta$ and azimuth $\phi$ relative to the head, the left and right channels of the perceived sound can be calculated as follows: $$s^{left}_{\text{perceived}}(t) = hrir^{left}_{\theta, \phi}(t) * s(t) $$ $$s^{right}_{\text{perceived}}(t) = hrir^{right}_{\theta, \phi}(t) * s(t) $$

where $hrir^{left/right}_{\theta, \phi}(t)$ is the impulse response measured for an impulse arriving from the direction defined by $\theta, \phi$, and $*$ denotes convolution.
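A minimal sketch of these two convolutions, assuming the mono signal and both HRIRs are one-dimensional NumPy arrays at the same sample rate (function and variable names here are illustrative, not from any particular library):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(s: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with the left/right HRIRs and return a stereo array."""
    left = fftconvolve(s, hrir_left, mode="full")
    right = fftconvolve(s, hrir_right, mode="full")
    # Stack into an (n_samples, 2) stereo buffer and normalize to avoid clipping.
    stereo = np.stack([left, right], axis=1)
    peak = np.max(np.abs(stereo))
    return stereo / peak if peak > 0 else stereo
```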

In this project I used the HRIR dataset "HRTF Measurements of a KEMAR Dummy-Head Microphone" provided by the MIT Media Lab. This dataset contains the HRIR sampled at $5^{\circ}$ azimuth and $10^{\circ}$ elevation increments.
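A possible way to load such a dataset into a lookup table is sketched below. The directory layout and file naming assumed here (stereo WAV files named like H-10e090a.wav, grouped into elev* folders) are assumptions about the downloaded archive and may need to be adapted:

```python
from pathlib import Path
import numpy as np
from scipy.io import wavfile

def load_kemar(root: str) -> dict[tuple[int, int], np.ndarray]:
    """Return a dict mapping (elevation, azimuth) in degrees to an (n, 2) HRIR array."""
    hrirs = {}
    for wav_path in Path(root).glob("elev*/H*.wav"):
        # Assumed file name pattern: "H-10e090a.wav" -> elevation -10°, azimuth 90°.
        name = wav_path.stem                      # e.g. "H-10e090a"
        elev_str, azim_str = name[1:].rstrip("a").split("e")
        _, data = wavfile.read(wav_path)          # 16-bit stereo samples
        hrirs[(int(elev_str), int(azim_str))] = data.astype(np.float32) / 32768.0
    return hrirs
```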

The process of spatializing a sound boils down to localizing the sound source, determining its elevation and azimuth angles, and getting the corresponding HRIR signal from the dataset. Since the dataset only contains HRIR signals sampled at certain directions, an interpolation is necessary.
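One simple option is a bilinear interpolation over the measurement grid, as sketched below for a lookup table like the one above; the actual interpolation scheme used in the project may differ:

```python
import numpy as np

ELEV_STEP, AZIM_STEP = 10, 5  # grid spacing in degrees (assumed regular grid)

def interpolate_hrir(hrirs: dict, elev: float, azim: float) -> np.ndarray:
    """Bilinearly interpolate the (n, 2) HRIR for a direction between grid points."""
    e0 = int(np.floor(elev / ELEV_STEP)) * ELEV_STEP
    a0 = int(np.floor(azim / AZIM_STEP)) * AZIM_STEP
    we = (elev - e0) / ELEV_STEP  # fractional position between elevation rings
    wa = (azim - a0) / AZIM_STEP  # fractional position between azimuth samples
    result = np.zeros_like(next(iter(hrirs.values())), dtype=np.float64)
    for de, w_e in ((0, 1 - we), (ELEV_STEP, we)):
        for da, w_a in ((0, 1 - wa), (AZIM_STEP, wa)):
            key = (e0 + de, (a0 + da) % 360)        # wrap azimuth at 360°
            result += w_e * w_a * hrirs.get(key, 0.0)  # missing edge points contribute nothing
    return result
```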

2. Object Detection and Tracking

For detecting objects and tracking them I used a combination of YOLOv8 and StrongSORT as described in this repository. The tracker is used with little modification (it was only wrapped in a class for usability and stripped of the parts that are not needed here). The center of the detected bounding box of each object is converted into azimuth and elevation angles. The detections (object ID, corresponding class ID, azimuth, and elevation) are then passed to the sound synthesizer.
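A simple way to map a bounding-box center to these angles is a pinhole-camera approximation, as sketched below; the field-of-view values are placeholders, not the actual ESP32-CAM lens parameters:

```python
import math

def bbox_center_to_angles(cx: float, cy: float, img_w: int, img_h: int,
                          hfov_deg: float = 60.0, vfov_deg: float = 45.0) -> tuple[float, float]:
    """Return (azimuth, elevation) in degrees for the pixel (cx, cy)."""
    # Normalize the pixel offset from the image center to [-0.5, 0.5].
    x_norm = cx / img_w - 0.5
    y_norm = 0.5 - cy / img_h  # image y grows downward, elevation grows upward
    # Map the normalized offset through the field of view (pinhole approximation).
    azimuth = math.degrees(math.atan(2.0 * x_norm * math.tan(math.radians(hfov_deg / 2.0))))
    elevation = math.degrees(math.atan(2.0 * y_norm * math.tan(math.radians(vfov_deg / 2.0))))
    return azimuth, elevation
```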

3. Combining MOT and Binaural Sound Synthesis

TODO: explain how the sound synthesizer and the object tracker were combined.
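As a rough, illustrative outline only (not the project's actual implementation): for every video frame, each tracked object's sound is spatialized with the HRIR for its direction and the results are mixed into one stereo buffer. `track_objects` and `class_sound` are hypothetical placeholders; `spatialize` and `interpolate_hrir` refer to the sketches in the previous sections.

```python
import numpy as np

def render_frame_audio(frame, hrirs, track_objects, class_sound,
                       frame_duration_s: float = 0.1, sample_rate: int = 44100) -> np.ndarray:
    """Mix one short binaural chunk for all objects tracked in a single video frame."""
    n = int(frame_duration_s * sample_rate)
    mix = np.zeros((n, 2), dtype=np.float32)
    for det in track_objects(frame):  # each det carries track_id, class_id, azimuth, elevation
        sound = class_sound(det.class_id)[:n]                   # mono snippet assigned to the class
        hrir = interpolate_hrir(hrirs, det.elevation, det.azimuth)
        chunk = spatialize(sound, hrir[:, 0], hrir[:, 1])[:n]   # convolve with left/right HRIR
        mix[:len(chunk)] += chunk
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix
```

The resulting chunks can then be streamed to the audio output back to back while new frames arrive from the camera.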

Sources:

Gardner, B. and Martin, K. "HRTF Measurements of a KEMAR Dummy-Head Microphone." MIT Media Lab, 1994.