Humans have unique sensory abilities, among them binaural hearing: we can identify a type of sound, tell what direction it is coming from and how far away it is, and distinguish multiple sources of sound occurring at once. While large language models (LLMs) are impressive at audio question answering and at speech recognition, translation, and synthesis, they have yet to handle such "in-the-wild" spatial audio input.

A group of researchers is starting to crack that code with BAT, which they call the first spatial-audio-based LLM that can reason about sounds in a 3-D environment. The model shows impressive precision in classifying types of audio (such as laughter, heartbeat, and splashing water), sound direction (right, left, below), and sound distance (anywhere from 1 to 10 feet). It also performs strongly at spatial reasoning in scenarios where two different sounds overlap.

Spatial audio, sometimes referred to as "virtual surround sound," creates the illusion of sound sources in a 3-D space. It is used in applications including virtual reality (VR) and advanced theater systems, as well as other emerging areas such as the metaverse. But spatial audio is challenging for AI and machine learning (ML), because intelligent agents in 3-D spaces struggle to localize and interpret sound sources. Scientists have attempted to mitigate this by developing acoustic simulation techniques and datasets that incorporate spatial audio information, such as YouTube-360 and STARSS23.
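The article does not detail how BAT itself localizes sound, but the binaural cue it builds on is well understood: a sound from one side reaches the nearer ear slightly earlier, and that interaural time difference (ITD) reveals direction. As an illustrative sketch only (the function name and signals below are hypothetical, not from the research), a classical two-channel estimate of the ITD via cross-correlation looks like this:

```python
import numpy as np

def estimate_itd(left, right, sample_rate):
    """Estimate the interaural time difference in seconds.

    Cross-correlates the two ear signals; the lag of the correlation
    peak tells us how many samples the right channel trails the left.
    A positive result means the sound reached the left ear first,
    i.e. the source is on the listener's left.
    """
    corr = np.correlate(left, right, mode="full")
    # Convert the peak index into a signed lag in samples.
    lag = (len(right) - 1) - int(np.argmax(corr))
    return lag / sample_rate

# Synthetic example: the same noise burst arrives 20 samples
# later at the right ear than at the left.
rng = np.random.default_rng(0)
burst = rng.standard_normal(200)
left = np.concatenate([burst, np.zeros(40)])
right = np.concatenate([np.zeros(20), burst, np.zeros(20)])

itd = estimate_itd(left, right, sample_rate=16_000)
# itd is positive (20 samples / 16 kHz = 1.25 ms): source on the left.
```

Human listeners resolve ITDs on the order of tens of microseconds; the difficulty for machine learning is doing this robustly with multiple overlapping sources and reverberation, which is the gap spatial-audio models like BAT aim to close.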
Full research: Researchers develop a way to make large language models differentiate spatial sounds.