
(1)

Effects of Lombard Reflex on Deep-Learning-Based Audio-Visual Speech Enhancement Systems

Daniel Michelsanti1, Zheng-Hua Tan1, Sigurdur Sigurdsson2 and Jesper Jensen1,2

1Centre for Acoustic Signal Processing Research (CASPR), Aalborg University

2Oticon A/S

{danmi,zt,jje}@es.aau.dk {ssig,jesj}@oticon.com

(2)

About Us

Daniel Michelsanti is a PhD Fellow at the Centre for Acoustic Signal Processing Research (CASPR), Aalborg University, Denmark, under the supervision of Zheng-Hua Tan, Sigurdur Sigurdsson and Jesper Jensen. His research interests include speech processing, computer vision and deep learning.

Zheng-Hua Tan is a Professor in the Department of Electronic Systems at Aalborg University, Denmark. He is also a co-founder of CASPR. His research interests include machine learning, deep learning, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics.

Sigurdur Sigurdsson is a Senior Specialist with Oticon A/S, Copenhagen, Denmark. His research interests include speech enhancement in noisy environments, machine learning and signal processing for hearing aid applications.

Jesper Jensen is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, and a Professor in the Department of Electronic Systems at Aalborg University. He is also a co-founder of CASPR. His main interests include signal retrieval from noisy observations.

(3)

Instructions

This is a demonstration of the impact of the Lombard effect on speech enhancement.

To navigate the demo you can:

• Click on the blue bar on the right to go to the next page.

• Click on the blue bar on the left to go to the previous page.

• Click on the media to play the content. The media in this demonstration are playable if they have a red square in the bottom-left corner.

(4)

Speech Enhancement

Speech enhancement is the task of estimating the clean speech of a target speaker immersed in an acoustically noisy environment, where different sources of disturbance are present, e.g. competing speakers, background music, and reflections from the walls. Usually, this estimation is performed by manipulating a time-frequency representation of the signal.

[Figure: a time-domain waveform and its time-frequency representation. Icons designed by Freepik.]
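To make this concrete, here is a minimal sketch of time-frequency masking, assuming SciPy. The oracle ideal amplitude mask below is computed from the clean signal purely for illustration; the systems in this demo instead estimate such a mask from the noisy audio (and video) with a neural network.

```python
# Minimal sketch of speech enhancement by time-frequency masking (SciPy).
# For illustration, an oracle ideal amplitude mask is computed from the
# clean signal; a real system must estimate the mask from noisy inputs.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, clean, fs=16000, nperseg=512):
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)   # noisy time-frequency repr.
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)   # clean time-frequency repr.
    # Ideal amplitude mask: clean-to-noisy magnitude ratio, clipped to [0, 1].
    mask = np.clip(np.abs(S) / (np.abs(Y) + 1e-8), 0.0, 1.0)
    # Apply the mask to the noisy magnitude and keep the noisy phase.
    S_hat = mask * np.abs(Y) * np.exp(1j * np.angle(Y))
    _, enhanced = istft(S_hat, fs=fs, nperseg=nperseg)
    return enhanced
```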

(5)

Lombard Effect

In the presence of background noise, speakers instinctively change their speaking style to keep their speech intelligible. This reflex is known as the Lombard effect [1], and it is characterized by:

• an increase in speech sound level [2].

• a longer word duration [3].

• modifications of the speech spectrum [2].

• speech hyper-articulation [4].

It has been shown that the mismatch between the neutral and the Lombard speaking styles can lead to sub-optimal performance of speaker recognition [5] and speech recognition [2] systems.

(6)

Deep-Learning-Based Framework

[Figure: system architecture. Video pipeline: face detection → face alignment → mouth region extraction → video encoder (6 × Conv + Leaky-ReLU + Batch Norm + Max Pooling + Dropout). Audio pipeline: STFT → magnitude/phase decomposition → audio encoder (6 × Conv + Leaky-ReLU + Batch Norm). A fusion sub-network (3 × Fully Connected + Leaky-ReLU) combines the two streams, and an audio decoder (6 × Deconv + Leaky-ReLU + Batch Norm) outputs an estimated ideal amplitude mask, which is applied to the noisy magnitude before the ISTFT reconstructs the enhanced signal.]

Architecture

We use a neural network architecture inspired by [6] and identical to the one in [7]. For the single-modality systems, one of the encoders is discarded.
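For concreteness, below is a minimal PyTorch sketch of this encoder-fusion-decoder layout. Block counts follow the diagram, but channel widths, kernel sizes, strides, activation slopes, dropout rate, and input shapes (64×64 mouth crops and 64×64 spectrogram patches) are assumptions for illustration, not the exact configuration of [7].

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1, pool=False):
    # Conv + Leaky-ReLU + Batch Norm, optionally followed by pooling + dropout.
    layers = [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
              nn.LeakyReLU(0.02), nn.BatchNorm2d(c_out)]
    if pool:
        layers += [nn.MaxPool2d(2), nn.Dropout(0.25)]
    return layers

class AVEnhancer(nn.Module):
    def __init__(self):
        super().__init__()
        # Video encoder: 6 blocks with max pooling and dropout.
        layers, c = [], 1
        for _ in range(6):
            layers += conv_block(c, 32, pool=True)
            c = 32
        self.video_enc = nn.Sequential(*layers)    # (B,1,64,64) -> (B,32,1,1)
        # Audio encoder: 6 blocks without pooling (strides are assumptions).
        layers, c = [], 1
        for s in (1, 2, 1, 2, 1, 2):
            layers += conv_block(c, 32, stride=s)
            c = 32
        self.audio_enc = nn.Sequential(*layers)    # (B,1,64,64) -> (B,32,8,8)
        # Fusion sub-network: 3 fully connected layers with Leaky-ReLU.
        self.fusion = nn.Sequential(
            nn.Linear(32 + 32 * 8 * 8, 2048), nn.LeakyReLU(0.02),
            nn.Linear(2048, 2048), nn.LeakyReLU(0.02),
            nn.Linear(2048, 64 * 8 * 8), nn.LeakyReLU(0.02))
        # Audio decoder: 6 deconv blocks, then a sigmoid mask head.
        layers, c = [], 64
        for s in (2, 1, 2, 1, 2, 1):
            layers += [nn.ConvTranspose2d(c, 32, 3, stride=s, padding=1,
                                          output_padding=s - 1),
                       nn.LeakyReLU(0.02), nn.BatchNorm2d(32)]
            c = 32
        layers += [nn.Conv2d(32, 1, 1), nn.Sigmoid()]
        self.audio_dec = nn.Sequential(*layers)    # (B,64,8,8) -> (B,1,64,64)

    def forward(self, mouth_frames, noisy_mag):
        v = self.video_enc(mouth_frames).flatten(1)   # (B, 32)
        a = self.audio_enc(noisy_mag).flatten(1)      # (B, 2048)
        z = self.fusion(torch.cat([v, a], dim=1))
        mask = self.audio_dec(z.view(-1, 64, 8, 8))   # estimated amplitude mask
        return mask * noisy_mag                       # enhanced magnitude
```

For example, `AVEnhancer()(torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64))` returns a (4, 1, 64, 64) enhanced magnitude; dropping one encoder (and the corresponding fusion input) yields the single-modality variants.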

(7)

Goal

The purpose of this demo is two-fold:

1. Showing the benefit of using visual information of speakers to enhance their speech.

2. Comparing systems trained on non-Lombard (NL) speech with systems trained on Lombard (L) speech.

We trained six deep-learning-based systems:

• AO-L – Audio-only, trained on Lombard speech.

• VO-L – Video-only, trained on Lombard speech.

• AV-L – Audio-visual, trained on Lombard speech.

• AO-NL – Audio-only, trained on non-Lombard speech.

• VO-NL – Video-only, trained on non-Lombard speech.

• AV-NL – Audio-visual, trained on non-Lombard speech.

The systems were trained on the utterances from the Lombard GRID corpus [8], to which speech-shaped noise is added at several signal-to-noise ratios (SNRs); a sketch of this mixing step follows below.

The following videos are from speakers observed during training (seen speakers).

For more details, refer to [9].
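The following is a minimal sketch of mixing noise into clean speech at a target SNR. It is an illustration, not the authors' data-preparation code; the noise signal (e.g. speech-shaped noise) is assumed to be given.

```python
# Minimal sketch: scale a noise signal so that mixing it with clean speech
# yields a target SNR. `noise` is assumed to be at least as long as `clean`.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)   # speech power
    p_noise = np.mean(noise ** 2)   # noise power
    # Gain that makes 10*log10(p_clean / (gain^2 * p_noise)) equal snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```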

(8)

Speech Enhancement (-20 dB SNR)

UNPROCESSED AO-L VO-L AV-L

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

(9)

Speech Enhancement (-10 dB SNR)

UNPROCESSED AO-L VO-L AV-L

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

(10)

Speech Enhancement (0 dB SNR)

UNPROCESSED AO-L VO-L AV-L

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

(11)

Estimated Speech Quality and Intelligibility

The performance of the models is evaluated in terms of PESQ and ESTOI, because they are good estimators of speech quality and intelligibility, respectively. PESQ ranges from -0.5 to 4.5, where higher values correspond to higher speech quality. For ESTOI, whose range is practically between 0 and 1, higher scores correspond to higher speech intelligibility.
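As an aside, these metrics can be computed in Python with the third-party `pesq` and `pystoi` packages; the sketch below rests on that assumption and is not necessarily the implementation used for the results shown here.

```python
# Sketch of computing PESQ and ESTOI, assuming the third-party packages
# `pesq` (pip install pesq) and `pystoi` (pip install pystoi).
# Inputs are 1-D float arrays sampled at 16 kHz.
from pesq import pesq
from pystoi import stoi

def evaluate(clean, enhanced, fs=16000):
    # Narrowband PESQ (ITU-T P.862), range roughly -0.5 .. 4.5.
    quality = pesq(fs, clean, enhanced, 'nb')
    # Extended STOI (ESTOI), range practically 0 .. 1.
    intelligibility = stoi(clean, enhanced, fs, extended=True)
    return quality, intelligibility
```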

(12)

Speech Enhancement (-20 dB SNR)

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

AO-NL AO-L VO-NL VO-L AV-NL AV-L

Comparison between non-Lombard (NL) and Lombard (L) systems.

(13)

Speech Enhancement (-10 dB SNR)

AO-NL AO-L VO-NL VO-L AV-NL AV-L

Comparison between non-Lombard (NL) and Lombard (L) systems.

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

(14)

Speech Enhancement (0 dB SNR)

AO-NL AO-L VO-NL VO-L AV-NL AV-L

Comparison between non-Lombard (NL) and Lombard (L) systems.

“Lay blue by G zero soon”

“Bin green by Q zero again”

“Bin blue in Z seven please”

(15)

Estimated Speech Quality and Intelligibility

The performance of the models is evaluated in terms of PESQ and ESTOI, because they are good estimators of speech quality and intelligibility, respectively. PESQ ranges from -0.5 to 4.5, where higher values correspond to higher speech quality. For ESTOI, whose range is practically between 0 and 1, higher scores correspond to higher speech intelligibility.

(16)

References

[1] H. Brumm and S. A. Zollinger, “The evolution of the Lombard effect: 100 years of psychoacoustic research,” Behaviour, vol. 148, no. 11-13, pp. 1173–1198, 2011.

[2] J.-C. Junqua, “The Lombard reflex and its role on human listeners and automatic speech recognizers,” The Journal of the Acoustical Society of America, vol. 93, no. 1, pp. 510–524, 1993.

[3] A. L. Pittman and T. L. Wiley, “Recognition of speech produced in noise,” Journal of Speech, Language, and Hearing Research, vol. 44, no. 3, pp. 487–496, 2001.

[4] M. Garnier, L. Ménard, and B. Alexandre, “Hyper-articulation in Lombard speech: An active communicative strategy to enhance visible speech cues?,” The Journal of the Acoustical Society of America, vol. 144, no. 2, pp. 1059–1074, 2018.

[5] J. H. L. Hansen and V. Varadarajan, “Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366–378, 2009.

[6] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc. of Interspeech, 2018.

[7] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “On training targets and objective functions for deep-learning-based audio-visual speech enhancement,” arXiv preprint: https://arxiv.org/abs/1811.06234.

[8] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown, “A corpus of audio-visual Lombard speech with frontal and profile views,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. EL523–EL529, 2018.

[9] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems.”
