As Meta CEO Mark Zuckerberg envisions it, the Metaverse will be a fully immersive virtual experience that rivals reality, at least from the waist up. But the visuals are only part of the overall Metaverse experience. “Getting the right spatial sound is key to delivering a realistic sense of presence in the metaverse,” Zuckerberg wrote in a blog post Friday. “When you’re at a concert, or just chatting with friends at a virtual table, having a realistic sense of where the sound is coming from makes you feel like you’re there.”
That concert, the blog post notes, will sound very different performed in a large concert hall than it would in a high school auditorium because of the differences in their physical spaces and acoustics. As such, Meta’s AI and Reality Lab (MAIR, formerly FAIR) is working with researchers from UT Austin to develop a trio of open-source audio “understanding tasks” to help developers build more immersive AR and VR experiences with more lifelike audio.
The first is MAIR’s Visual Acoustic Matching model, which can adapt a sample audio clip to a particular environment using just an image of the room. Previous simulation models could mimic a room’s acoustics based on its layout — but only if the precise geometry and material properties were already known — or from audio sampled in the space, neither of which produced particularly accurate results. Want to hear what the NY Philharmonic would sound like in San Francisco’s Boom Boom Room? Now you can.
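In classical signal-processing terms, “matching” a clip to a room means convolving the dry recording with that room’s impulse response (RIR) — the quantity AViTAR effectively has to infer from a single photo rather than from measured geometry. The sketch below is purely illustrative of that convolution idea, not Meta’s method; the function name and the toy three-reflection RIR are invented:

```python
import numpy as np

def apply_room_acoustics(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a 'dry' recording with a room impulse response (RIR)
    to simulate how it would sound in that space."""
    wet = np.convolve(dry_audio, rir)
    # Normalize so the result never clips
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy example: a single click played in a "room" with two reflections.
click = np.array([1.0, 0.0, 0.0, 0.0])
rir = np.array([1.0, 0.0, 0.5, 0.25])  # direct path + two echoes
wet = apply_room_acoustics(click, rir)
print(wet.shape)  # (7,) — convolution length is len(a) + len(b) - 1
```

Swap in a different RIR and the same clip “moves” to a different room — which is exactly the knob the learned model turns from an image alone.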
“One future use we’re interested in is reliving memories,” Zuckerberg wrote, betting on nostalgia. MAIR’s answer is AViTAR, its Visual Acoustic Matching model, which “learns acoustic matching from in-the-wild web videos, despite their lack of acoustically mismatched audio and unlabeled data,” the post said. “Imagine being able to put on AR glasses and see an object with the ability to play back a memory associated with it, such as picking up a tutu and seeing a hologram of your child’s ballet recital. The audio is stripped of reverberation and makes the memory sound like the time you lived it, sitting in your exact seat in the audience.”
MAIR’s Visually-Informed Dereverberation (VIDA) model, on the other hand, will remove the echo effect of playing an instrument in a large, open space such as a subway station or cathedral. You’d hear just the violin, not its reverberation bouncing off distant surfaces. Specifically, “it learns to remove reverberation based on both the observed sounds and the visual stream, which reveals clues about room geometry, materials, and speaker locations,” the post explained. This technology could be used to more effectively isolate vocals and spoken commands, making them easier to understand for both humans and machines.
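For intuition, the traditional counterpart of dereverberation is deconvolution with a known room impulse response; VIDA’s contribution is doing without that RIR by inferring room cues from imagery. The sketch below shows only the classical, RIR-given case — an assumption, not VIDA’s approach — with an invented helper name:

```python
import numpy as np

def dereverberate(wet: np.ndarray, rir: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Recover the dry signal by frequency-domain deconvolution,
    assuming the room impulse response is known. (VIDA instead
    learns this mapping from audio plus imagery, with no RIR given.)"""
    n = len(wet)
    W = np.fft.rfft(wet, n)
    H = np.fft.rfft(rir, n)
    # Small eps regularizes division where the RIR spectrum is weak
    return np.fft.irfft(W / (H + eps), n)

rir = np.array([1.0, 0.0, 0.5, 0.25])        # toy room: two reflections
dry = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
wet = np.convolve(dry, rir)                  # reverberant observation
recovered = dereverberate(wet, rir)[:len(dry)]
print(np.allclose(recovered, dry, atol=1e-3))  # True
```

Real rooms make this inverse problem ill-conditioned (the RIR spectrum has near-zeros), which is part of why a learned model that leverages visual cues is attractive.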
VisualVoice does the same as VIDA but for voices. It uses both visual and audio cues to learn to separate voices from background noise during its self-supervised training sessions. Meta expects this model to see heavy use in machine-understanding applications and in improving accessibility. Think more accurate captions, Siri understanding your request even if the room isn’t quiet, or the acoustics of a virtual chat room shifting as the speakers move around the digital space. Again, ignore the lack of legs.
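A common framing for this kind of separation task is time-frequency masking: estimate a mask that keeps the target voice’s energy and suppresses everything else. The sketch below applies a hand-crafted frequency mask to synthetic tones purely to illustrate the masking idea — VisualVoice predicts its masks from lip movements and voice characteristics rather than from a fixed cutoff:

```python
import numpy as np

def separate_by_mask(mixture: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Isolate one source from a mixture by masking its spectrum."""
    spec = np.fft.rfft(mixture)
    return np.fft.irfft(spec * mask, len(mixture))

fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 200 * t)          # stand-in "voice" at 200 Hz
noise = 0.5 * np.sin(2 * np.pi * 2000 * t)   # stand-in "noise" at 2 kHz
mixture = voice + noise

freqs = np.fft.rfftfreq(len(mixture), 1 / fs)
mask = (freqs < 1000).astype(float)          # keep only the low band
estimate = separate_by_mask(mixture, mask)
print(np.allclose(estimate, voice, atol=1e-6))  # True
```

Real speech and noise overlap in frequency, so a static band-pass fails; the learned model’s job is to produce a mask that tracks one specific speaker over time.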
“We envision a future where people can put on AR glasses and relive a holographic memory that looks and sounds like they experienced it from their point of view, or feel immersed by not only the graphics but also the sounds as they play games in a virtual world,” Zuckerberg wrote, noting that AViTAR and VIDA can so far only apply their tasks to the single image they were trained on and will need a lot more development before being released publicly. “These models are bringing us even closer to the multimodal, immersive experiences we want to build in the future.”