Microsoft improves conference room audio with AI that taps multiple mics

On Sep 14, 2019

It’s always frustrating when conference room audio doesn’t reliably reach parties who’ve dialed in remotely. Poor acoustics and interference invariably contribute to reduced clarity and crispness on the other end of the line, which is why scientists at Microsoft’s Speech and Dialog Research Group recently proposed a system that bolsters audio quality by tapping the mics built into smartphones, laptops, and tablets.

Their work is part of Project Denmark, Microsoft’s endeavor to move beyond traditional microphone arrays to capture meeting conversations. They describe it in a paper, “Meeting Transcription Using Asynchronous Distant Microphones,” scheduled to be presented at the Interspeech 2019 conference in Graz, Austria, next week.

“The central idea behind our approach is to leverage any internet-connected devices, such as the laptops and smartphones that attendees typically bring to meetings, and virtually form an ad hoc microphone array in the cloud,” wrote principal researcher Takuya Yoshioka in a blog post accompanying the paper. “With our approach, teams would be able to choose to use the cell phones, laptops, and tablets they already bring to meetings to enable high-accuracy transcription without needing special-purpose hardware.”

It’s simpler in theory than in execution. Yoshioka points out that audio fidelity varies quite a bit device-to-device and that speech signals captured by different microphones aren’t aligned with each other. Exacerbating the challenge, both the number of devices and their relative positions are inconsistent meeting-to-meeting.
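To make the alignment problem concrete, here is a minimal sketch of one generic way to estimate the offset between two recordings of the same sound: slide one signal against the other and keep the shift with the highest cross-correlation. This is an illustration only, not the method from the paper; the function names `estimate_lag` and `align` are hypothetical, and real devices also suffer from clock drift, which a single fixed offset does not capture.

```python
def estimate_lag(reference, other, max_lag):
    """Return the shift (in samples) of `other` relative to `reference`
    that maximizes their cross-correlation, searching +/- max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(other, lag):
    """Advance `other` by `lag` samples, zero-padding to keep its length."""
    n = len(other)
    return [other[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


# Toy example: the second device hears the same pulse three samples later.
reference = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
other = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_lag(reference, other, max_lag=5)
aligned = align(other, lag)
```

Once every device's stream has been shifted onto a common timeline this way, the streams can be treated more like the channels of a conventional microphone array.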

The Microsoft team’s solution is an end-to-end system that begins by collecting acoustic signals from different microphones and performing beamforming (a technique that effectively makes mic arrays more sensitive to sound coming from a specific direction), orchestrated by a model that identifies relationships among the signals. In the course of beamforming, the signals are fed downstream to speech recognition and speaker diarization (identification) modules before they’re consolidated, annotated, and sent back to the meeting attendees.
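For readers unfamiliar with beamforming, the simplest textbook variant is delay-and-sum: advance each channel by the delay at which the target sound reached that microphone, then average. Sound from the target adds up coherently, while sound arriving with other delays tends to average out. The sketch below assumes the per-channel delays are already known and uses a hypothetical function name, `delay_and_sum`; it is not necessarily the beamformer the Microsoft team used.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each by its delay (in samples).

    channels: list of equal-rate sample lists, one per microphone.
    delays:   delay (in samples) of the target sound on each channel.
    """
    n = min(len(ch) for ch in channels)
    out = []
    for t in range(n):
        total, count = 0.0, 0
        for ch, d in zip(channels, delays):
            if 0 <= t + d < len(ch):
                total += ch[t + d]
                count += 1
        # Average only over channels that have a sample at this instant.
        out.append(total / count if count else 0.0)
    return out


def delayed(signal, delay, length):
    """Helper for the toy example: pad `signal` so it starts `delay` samples late."""
    return [0.0] * delay + signal + [0.0] * (length - delay - len(signal))


# Toy example: three microphones hear the same source 0, 1, and 2 samples late.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
channels = [delayed(source, d, 8) for d in (0, 1, 2)]
output = delay_and_sum(channels, [0, 1, 2])
```

In this idealized, noise-free case the first six output samples reproduce the source exactly; with real recordings, the benefit comes from uncorrelated noise on each device being partially averaged away.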

The researchers report that in their experiments, their AI system outperformed a single-device system by 14.8% and 22.4% with three and seven microphones, respectively, with a 13.6% diarization error rate when 10% of the recorded speech contained more than one speaker. They note that their system isn’t perfect (it was occasionally tripped up by overlapping speech), but they say it’s an encouraging step toward crystal-clear conference audio that doesn’t require specialized equipment.

“In summary, our study shows the effectiveness of multiple asynchronous microphones for meeting transcription in real-world scenarios,” wrote Yoshioka and colleagues in the paper. “[W]e gain potentially better spatial coverage since … devices will tend to be distributed around the room and relatively near the speakers. Also, in many use cases, it will be natural for meeting participants to bring and then repurpose their personal devices, in the service of better transcription quality.”

Microsoft’s transcription research surfaced last summer in Microsoft 365, which gained an automatic speech-to-text feature that lets meeting participants search video transcripts. Months later, Microsoft rolled out automated transcriptions for audio and video files in OneDrive and SharePoint.