Effects of Lombard Reflex on Deep-Learning-Based Audio-Visual Speech Enhancement Systems
Daniel Michelsanti1, Zheng-Hua Tan1, Sigurdur Sigurdsson2 and Jesper Jensen1,2
1Centre for Acoustic Signal Processing Research (CASPR), Aalborg University
2Oticon A/S
{danmi,zt,jje}@es.aau.dk {ssig,jesj}@oticon.com
About Us
Daniel Michelsanti is a PhD Fellow at the Centre for Acoustic Signal Processing Research (CASPR), Aalborg University, Denmark, under the supervision of Zheng-Hua Tan, Sigurdur Sigurdsson and Jesper Jensen. His research interests include speech processing, computer vision and deep learning.
Zheng-Hua Tan is a Professor in the Department of Electronic Systems at Aalborg University, Denmark. He is also a co-founder of CASPR. His research interests include machine learning, deep learning, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics.
Sigurdur Sigurdsson is a Senior Specialist with Oticon A/S, Copenhagen, Denmark. His research interests include speech enhancement in noisy environments, machine learning and signal processing for hearing aid applications.
Jesper Jensen is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, and a Professor in the Department of Electronic Systems at Aalborg University. He is also a co-founder of CASPR. His main interests include signal retrieval from noisy observations.
Instructions
This is a demonstration of the impact of the Lombard effect on speech enhancement.
To navigate the demo you can:
• Click on the blue bar on the right to go to the next page.
• Click on the blue bar on the left to go to the previous page.
• Click on the media to play the content. The media in this demonstration are playable if they have a red square on the bottom left corner.
Speech Enhancement
Speech enhancement is the task of estimating the clean speech of a target speaker immersed in an acoustically noisy environment, where different sources of disturbance are present, e.g. competing speakers, background music, and reflections from the walls. Usually, this estimation is performed by manipulating a time-frequency representation of the signal.
Figure: a signal in the time domain and in the time-frequency domain (icons designed by Freepik).
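As a minimal illustration of time-frequency manipulation, the sketch below enhances a noisy signal by applying a gain mask to its STFT magnitude and resynthesizing with the inverse STFT. This is a generic sketch, not the demo's actual system: the mask here is an oracle mask computed from the clean signal, and a pure tone stands in for speech.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s tone as a stand-in for speech
noisy = clean + rng.normal(scale=0.5, size=fs)

# Time-frequency representations of the noisy and clean signals.
_, _, X = stft(noisy, fs=fs, nperseg=512)
_, _, S = stft(clean, fs=fs, nperseg=512)

# Oracle amplitude mask, clipped to [0, 1]; a real system must estimate this.
mask = np.minimum(np.abs(S) / (np.abs(X) + 1e-8), 1.0)

# Apply the mask to the noisy magnitude, keep the noisy phase, resynthesize.
enhanced_tf = mask * np.abs(X) * np.exp(1j * np.angle(X))
_, enhanced = istft(enhanced_tf, fs=fs, nperseg=512)

def snr_db(ref, est):
    """SNR of `est` against `ref`, over their common length."""
    n = min(len(ref), len(est))
    err = ref[:n] - est[:n]
    return 10 * np.log10(np.sum(ref[:n] ** 2) / np.sum(err ** 2))

print(round(snr_db(clean, noisy), 1), round(snr_db(clean, enhanced), 1))
```

With an oracle mask the output SNR is substantially higher than the input SNR, which is exactly the headroom that a learned mask estimator tries to approach.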
Lombard Effect
In the presence of background noise, speakers instinctively change their speaking style to keep their speech intelligible. This reflex, known as the Lombard effect [1], is characterized by:
• an increase in speech sound level [2].
• a longer word duration [3].
• modifications of the speech spectrum [2].
• speech hyper-articulation [4].
It has been shown that the mismatch between the neutral and the Lombard speaking styles can lead to sub-optimal performance of speaker [5] and speech recognition [2] systems.
Deep-Learning-Based Framework
Figure: system overview. The video stream goes through face detection, face alignment and mouth region extraction, and is then processed by a video encoder (stacked Conv + Leaky-ReLU + Batch Norm + Max Pooling + Dropout blocks). The audio stream is transformed with the STFT and decomposed into magnitude and phase; the magnitude is processed by an audio encoder (stacked Conv + Leaky-ReLU + Batch Norm blocks). A fusion sub-network (Fully Connected + Leaky-ReLU layers) combines the two encodings, and an audio decoder (stacked Deconv + Leaky-ReLU + Batch Norm blocks) outputs an estimated ideal amplitude mask, which is applied to the noisy magnitude and recombined with the noisy phase before the ISTFT.
Architecture
We use a neural network architecture inspired by [6] and identical to [7]. For the single-modality systems, one of the encoders is discarded.
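For the single-modality (audio-only) case, the encoder–decoder layer pattern described above can be sketched as follows. This is a sketch under assumptions: the channel counts, kernel sizes and input dimensions are illustrative placeholders, not the values used in [7].

```python
import torch
import torch.nn as nn

# Audio-only sketch of the layer pattern; all sizes are illustrative.
encoder = nn.Sequential(                 # Conv + Leaky-ReLU + Batch Norm blocks
    nn.Conv2d(1, 16, 3, stride=2, padding=1),
    nn.LeakyReLU(0.02),
    nn.BatchNorm2d(16),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
    nn.LeakyReLU(0.02),
    nn.BatchNorm2d(32),
)
decoder = nn.Sequential(                 # Deconv + Leaky-ReLU + Batch Norm blocks
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
    nn.LeakyReLU(0.02),
    nn.BatchNorm2d(16),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
    nn.Sigmoid(),                        # mask values constrained to [0, 1]
)

noisy_magnitude = torch.rand(1, 1, 128, 8)   # (batch, channel, freq bins, frames)
mask = decoder(encoder(noisy_magnitude))     # same shape as the input magnitude
enhanced_magnitude = mask * noisy_magnitude
print(mask.shape)
```

The strided convolutions and matching transposed convolutions keep the mask the same size as the input magnitude, so it can be applied element-wise; in the audio-visual case a fusion sub-network would sit between encoder and decoder.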
Goal
The purpose of this demo is two-fold:
1. Showing the benefit of using visual information of speakers to enhance their speech.
2. Comparing systems trained on non-Lombard (NL) speech with systems trained on Lombard (L) speech.
We trained six deep-learning-based systems:
• AO-L – Audio-only, trained on Lombard speech.
• VO-L – Video-only, trained on Lombard speech.
• AV-L – Audio-visual, trained on Lombard speech.
• AO-NL – Audio-only, trained on non-Lombard speech.
• VO-NL – Video-only, trained on non-Lombard speech.
• AV-NL – Audio-visual, trained on non-Lombard speech.
The systems were trained on the utterances from the Lombard GRID corpus [8], to which speech-shaped noise is added at several signal-to-noise ratios (SNRs).
The following videos are from speakers observed during training (seen speakers).
For more details, refer to [9].
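Mixing noise into clean speech at a prescribed SNR, as done for the training material, can be sketched like this. For brevity, white noise stands in for the speech-shaped noise, and a random signal stands in for the speech:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + scaled noise has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)          # placeholder for a speech signal
noise = rng.normal(size=16000)           # white noise stands in for speech-shaped noise
for snr in (-20, -10, 0):                # the SNRs shown in this demo
    mixture, scaled = mix_at_snr(speech, noise, snr)
    achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled ** 2))
    print(snr, round(achieved, 1))
```

Speech-shaped noise itself is typically obtained by filtering white noise with the long-term average spectrum of speech, which this sketch omits.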
Speech Enhancement (-20 dB SNR)
UNPROCESSED AO-L VO-L AV-L
Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
Speech Enhancement (-10 dB SNR)
UNPROCESSED AO-L VO-L AV-L
Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
Speech Enhancement (0 dB SNR)
UNPROCESSED AO-L VO-L AV-L
Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
Estimated Speech Quality and Intelligibility
The performance of the models is evaluated in terms of PESQ and ESTOI, because they are good estimators of speech quality and intelligibility, respectively. PESQ ranges from -0.5 to 4.5, where high values correspond to high speech quality. For ESTOI, whose range is practically between 0 and 1, higher scores correspond to higher speech intelligibility.
Speech Enhancement (-20 dB SNR)
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
AO-L AO-NL VO-L VO-NL AV-L AV-NL
Comparison between non-Lombard (NL) and Lombard (L) systems.
Speech Enhancement (-10 dB SNR)
AO-L AO-NL VO-L VO-NL AV-L AV-NL
Comparison between non-Lombard (NL) and Lombard (L) systems.
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
Speech Enhancement (0 dB SNR)
AO-L AO-NL VO-L VO-NL AV-L AV-NL
Comparison between non-Lombard (NL) and Lombard (L) systems.
“Lay blue by G zero soon”
“Bin green by Q zero again”
“Bin blue in Z seven please”
References
[1] H. Brumm and S. A. Zollinger, “The evolution of the Lombard effect: 100 years of psychoacoustic research,” Behaviour, vol. 148, no. 11-13, pp. 1173–1198, 2011.
[2] J.-C. Junqua, “The Lombard reflex and its role on human listeners and automatic speech recognizers,” The Journal of the Acoustical Society of America, vol. 93, no. 1, pp. 510–524, 1993.
[3] A. L. Pittman and T. L. Wiley, “Recognition of speech produced in noise,” Journal of Speech, Language, and Hearing Research, vol. 44, no. 3, pp. 487–496, 2001.
[4] M. Garnier, L. Ménard, and B. Alexandre, “Hyper-articulation in Lombard speech: An active communicative strategy to enhance visible speech cues?,” The Journal of the Acoustical Society of America, vol. 144, no. 2, pp. 1059–1074, 2018.
[5] J. H. L. Hansen and V. Varadarajan, “Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366–378, 2009.
[6] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc. of Interspeech, 2018.
[7] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “On training targets and objective functions for deep-learning-based audio-visual speech enhancement,” arXiv preprint: https://arxiv.org/abs/1811.06234.
[8] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown, “A corpus of audio-visual Lombard speech with frontal and profile views,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. EL523–EL529, 2018.
[9] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems.”