Audio-video fusion strategies for active speaker detection in meetings


Active speaker detection is the task of detecting which person(s) are speaking at a given time. In this context, communication takes place not only through voice but also through non-verbal signals; therefore, audio-only methods may not be sufficient.

A microphone. Image credit: Pxhere, CC0 Public Domain


A recent paper on arXiv.org proposes a method that relies on audio information combined with video information.

The researchers merge visual and audio features to obtain a robust final detection. Two possible approaches for analyzing the audio are considered: a supervised approach using a neural network and an unsupervised approach using speaker segmentation and clustering. A purely visual speaker classifier, based on 3D CNNs, is used for the visual modalities.
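As a rough illustration of the visual branch, the sketch below (in PyTorch) shows how a small 3D CNN can classify a short sequence of face crops as speaking or not speaking. The layer sizes, clip length, and input resolution are illustrative assumptions, not the architecture described in the paper.

```python
# Minimal sketch (not the authors' exact architecture) of a 3D CNN that
# classifies a short clip of face crops as "speaking" vs "not speaking".
import torch
import torch.nn as nn

class VisualSpeakerClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # 3D convolutions encode appearance and motion jointly over
        # (time, height, width) of the face-crop sequence.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global pooling over time and space
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels=3, frames, height, width)
        x = self.features(clips).flatten(1)
        return self.classifier(x)

# Example: a batch of 4 face-crop clips, 16 frames of 64x64 RGB each.
logits = VisualSpeakerClassifier()(torch.randn(4, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```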

The researchers compare two fusion strategies: a naive fusion and a fusion based on attention modules. They show that merging the visual and audio modalities yields higher performance than the video-only system.
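To make the distinction concrete, the sketch below contrasts a naive fusion (simple concatenation of per-modality feature vectors) with an attention-based fusion that learns a weight per modality. Both are assumed interpretations of the two strategies, not the paper's exact modules, and the feature dimensions are made up for the example.

```python
# Illustrative sketch of two fusion styles (assumed interpretations):
# naive fusion concatenates per-modality features; attention fusion
# scores each modality and takes a weighted sum before classifying.
import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    def __init__(self, dims: list[int], num_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(sum(dims), num_classes)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        return self.head(torch.cat(feats, dim=-1))  # simple concatenation

class AttentionFusion(nn.Module):
    def __init__(self, dims: list[int], hidden: int = 64, num_classes: int = 2):
        super().__init__()
        # Project every modality to a common size, then score each one.
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.score = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        projected = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        weights = torch.softmax(self.score(projected), dim=1)  # (batch, modalities, 1)
        fused = (weights * projected).sum(dim=1)               # weighted sum
        return self.head(fused)

# Example: visual (RGB), motion (optical flow) and audio feature vectors.
rgb, flow, audio = torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 24)
print(NaiveFusion([32, 32, 24])([rgb, flow, audio]).shape)      # torch.Size([4, 2])
print(AttentionFusion([32, 32, 24])([rgb, flow, audio]).shape)  # torch.Size([4, 2])
```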

Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights for modeling interaction between meeting participants. Motivated by our application context related to an advanced meeting assistant, we want to combine audio and visual information to achieve the best possible performance. In this paper, we propose two different types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks. For comparison purposes, classical unsupervised approaches for audio feature extraction are also used. We expect visual data centered on the face of each participant to be well suited for detecting voice activity, based on the detection of lip and facial gestures. Thus, our baseline system uses visual data, and we chose a 3D Convolutional Neural Network architecture, which is effective for simultaneously encoding appearance and movement. To improve this system, we supplemented the visual information by processing the audio stream with a CNN or an unsupervised speaker diarization system. We further improved this system by adding motion information from the visual modality through optical flow. We evaluated our proposal with a public, state-of-the-art benchmark: the AMI corpus. We analyzed the contribution of each system to the fusion performed in order to determine whether a given participant is currently speaking, and we discussed the results obtained. Moreover, we have shown that, for our application context, adding motion information greatly improves performance. Finally, we have shown that attention-based fusion improves performance while reducing the standard deviation.
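The motion modality mentioned above is derived from optical flow. The sketch below, assuming OpenCV's Farnebäck method, shows one way such a flow stream could be computed from consecutive face-crop frames before being fed to a motion branch; the paper does not prescribe this particular routine, and the frame size and clip length are placeholder values.

```python
# Minimal sketch, assuming OpenCV, of deriving a dense optical-flow stream
# from consecutive face-crop frames to serve as an extra motion modality.
import cv2
import numpy as np

def flow_clip(frames: list) -> np.ndarray:
    """Return per-frame-pair 2-channel Farneback flow for a list of BGR frames."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        # Positional args: flow, pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)  # shape (H, W, 2): horizontal and vertical motion
    return np.stack(flows)  # (frames - 1, H, W, 2), ready for a 3D CNN branch

# Example: 16 random 64x64 "face crops" stand in for real video frames.
clip = [np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(16)]
print(flow_clip(clip).shape)  # (15, 64, 64, 2)
```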

Research article: Pibre, L., "Audio-video fusion strategies for active speaker detection in meetings", 2022. Link: https://arxiv.org/abs/2206.10411





