**Aalborg Universitet**

**Single-Microphone Speech Enhancement and Separation Using Deep Learning**

### Kolbæk, Morten

*Publication date:* 2018

*Document Version:* Other version

### Link to publication from Aalborg University

*Citation for published version (APA):*

Kolbæk, M. (2018). *Single-Microphone Speech Enhancement and Separation Using Deep Learning*. Aalborg Universitetsforlag. Ph.d.-serien for Det Tekniske Fakultet for IT og Design, Aalborg Universitet.

### https://www.youtube.com/watch?v=cGPWFYaG3C4

**General rights**

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

- You may not further distribute the material or use it for any profit-making activity or commercial gain.

- You may freely distribute the URL identifying the publication in the public portal.

**Take down policy**

If you believe that this document breaches copyright please contact us at vbn@aub.aau.dk providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from vbn.aau.dk on: July 14, 2022

### Single-Microphone Speech Enhancement and Separation Using Deep Learning

### Morten Kolbæk

November 30, 2018

PhD Fellow

Department of Electronic Systems, Aalborg University

Denmark

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning


### Supervisors: Prof. Jesper Jensen (AAU / Oticon) and Prof. Zheng-Hua Tan (AAU)

### Stay Abroad: Dr. Dong Yu, Tencent AI Lab / Microsoft Research


### Agenda

**Introduction:**

- Cocktail Party Problem
- Speech Enhancement and Separation
- Deep Learning

**Scientific Contributions:**

- Generalization of Deep Learning based Speech Enhancement
  - Human Receivers: Speech Intelligibility
  - Machine Receivers: Speaker Verification
- On STOI Optimal Deep Learning based Speech Enhancement
- Permutation Invariant Training for Deep Learning based Speech Separation


### Part I

### Introduction



### The Cocktail Party Problem

> How do we recognize what one person is saying when others are speaking at the same time (the "cocktail party problem")? On what logical basis could one **design a machine** ("filter") for carrying out such an operation?
>
> – Colin Cherry, 1953


### The Cocktail Party Problem

The Vision: Solve the Problem



### Single-Microphone Speech Enhancement

First Task of the Thesis

[Diagram: a noisy mixture (Speaker 1 + Noise) is processed by a speech enhancement algorithm to recover Speaker 1.]

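A common deep-learning formulation of this task is time-frequency masking: a network looks at the noisy spectrogram and estimates a mask that scales each time-frequency bin of the mixture. The sketch below is illustrative only; it uses an oracle ideal ratio mask in place of a network's estimate, and all names and shapes are made up for the example.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    # Oracle IRM: fraction of energy belonging to speech in each T-F bin.
    # A trained network would estimate this from the mixture alone.
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)

def enhance(mixture_mag, mask):
    # Apply the mask to the mixture magnitude spectrogram.
    return mask * mixture_mag

# Toy T-F magnitudes (10 frames x 129 frequency bins).
rng = np.random.default_rng(0)
clean = rng.random((10, 129))
noise = 0.5 * rng.random((10, 129))
mix = clean + noise  # crude magnitude-domain approximation of mixing

m = ideal_ratio_mask(clean, noise)
est = enhance(mix, m)
assert est.shape == mix.shape
assert np.all(m >= 0) and np.all(m <= 1)  # IRM is bounded in [0, 1]
```

The enhanced magnitudes would then be combined with the mixture phase and inverted back to a waveform.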

### Single-Microphone Speech Separation

Second Task of the Thesis

[Diagram: a mixture of Speaker 1, Speaker 2, and Noise is processed by a speech separation algorithm to recover Speaker 1 and Speaker 2 separately.]

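Separation has a label-permutation ambiguity that enhancement does not: the network's output channels carry no inherent speaker order. Permutation invariant training, listed among the contributions in the agenda, trains with the loss of the best output-to-speaker pairing. A minimal numpy sketch of such a loss (illustrative only, not the thesis implementation):

```python
import itertools
import numpy as np

def pit_mse(est, ref):
    """Permutation-invariant MSE for S estimated vs. S reference sources.

    est, ref: arrays of shape (S, T). The loss is evaluated under every
    possible pairing of outputs to speakers, and the smallest is returned.
    """
    S = est.shape[0]
    losses = []
    for perm in itertools.permutations(range(S)):
        losses.append(np.mean((est[list(perm)] - ref) ** 2))
    return min(losses)

# Two toy "sources"; the estimates come out in swapped order.
ref = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
est = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])  # channels swapped
assert pit_mse(est, ref) == 0.0  # the best permutation undoes the swap
```

Evaluating all S! permutations is cheap for the two- or three-speaker case considered here.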

### Speech Enhancement and Separation

Two Motivating Applications

### Why Is Solving the Cocktail Party Problem Important?

### Human Receivers

- **Potential:** Hundreds of millions of people worldwide have a hearing loss.
- **Challenge:** The hearing impaired often struggle in "cocktail party" situations.
- **Solution:** Algorithms that can enhance the speech signal of interest.
- **Application:** Hearing assistive devices, e.g. hearing aids or cochlear implants.

### Machine Receivers

- **Potential:** Millions of people interact vocally with smartphones.
- **Challenge:** These devices operate in complex acoustic environments.
- **Solution:** A noise-robust human-machine interface.
- **Application:** Social robots or digital assistants, e.g. Google Assistant or Siri.



### Speech Enhancement and Separation

Old Problem: What's New?

### What's new? – A paradigm shift!

**Classical Paradigm**

- **Derive** the solution using **specific** mathematical models that **approximate** speech and noise.
- Simplifying assumptions for mathematical tractability.
- Generally not data-driven.
- Good performance when the assumptions are valid (sometimes they are not).

**Deep Learning Paradigm**

- **Learn** the solution using **general** mathematical models that have "observed" speech and noise.
- No explicit assumptions.
- Data-driven.
- State-of-the-art performance given enough data and computational resources.



### Deep Learning

What is it?

[Diagram: an unknown function f(x) maps input x to output y; a "learned" function f̂(x) maps x to an estimate ŷ, with ŷ ≈ y.]

- **Deep Learning:** A subfield of Machine Learning.
- **Machine Learning:** Using data to "learn", i.e. approximate, unknown functions f(x) that can be used to make predictions.



### Deep Learning

What is it? – A Classical Regression Example

- **Task:** Estimate happiness from income.
- **Hypothesis:** Happiness is associated with income.
- **Data:** Perceived happiness and income reported by people.
- **Candidate models:**

7-params. (Big Capacity)
f̂1(x) = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g

4-params. (Small Capacity)
f̂2(x) = ax^3 + bx^2 + cx + d

**Goal:** Find the parameters of f̂1(x) and f̂2(x) that best explain the observations.

[Plot: subjective happiness scale vs. income $.]



### Deep Learning

What is it? – A Classical Regression Example

Example fitted parameters:

7-params. (Big Capacity)
f̂1(x) = −0.2x^6 + 2.5x^5 − 8.1x^4 + 10.3x^3 − 5.4x^2 + 1.2x + 0.3

4-params. (Small Capacity)
f̂2(x) = −22.2x^3 + 2.6x^2 + 3.8x − 0.6

[Plot: both fits against the observations (income $ vs. subjective happiness scale).]

**Overfitting!**


### Deep Learning

What is it? – A Classical Regression Example

Example fitted parameters:

7-params. (Big Capacity)
f̂1(x) = −0.1x^6 + 2.0x^5 − 7.7x^4 + 13.3x^3 − 11.7x^2 + 5.3x − 0.5

4-params. (Small Capacity)
f̂2(x) = 10.9x^3 − 10.4x^2 + 5.1x − 0.5

[Plot: both fits against the observations (income $ vs. subjective happiness scale).]

**Good Generalization!**
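The capacity trade-off on these slides can be reproduced in a few lines with least-squares polynomial fitting. This is an illustrative sketch with made-up data (the cubic coefficients loosely mirror the slide), not the actual experiment:

```python
import numpy as np

# Noisy samples of an underlying cubic relation: a toy stand-in for the
# income-vs-happiness data. All coefficients are invented for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)
y_true = 10.9 * x**3 - 10.4 * x**2 + 5.1 * x - 0.5
y = y_true + 0.05 * rng.standard_normal(x.size)

# Big capacity (7 params) vs. small capacity (4 params), least squares.
f1 = np.polynomial.Polynomial.fit(x, y, deg=6)
f2 = np.polynomial.Polynomial.fit(x, y, deg=3)

# On the training points the bigger model always fits at least as well,
# because the degree-3 polynomials are a subset of the degree-6 ones...
err1 = np.mean((f1(x) - y) ** 2)
err2 = np.mean((f2(x) - y) ** 2)
assert err1 <= err2 + 1e-12
# ...but on held-out points the degree-6 fit typically does worse: it has
# spent its extra capacity chasing the noise (overfitting).
```

Plotting both fits over a denser grid makes the oscillations of the degree-6 fit visible between the training points.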

### Deep Learning

What is it? – Essentially Regression with Deep Neural Networks

- **Deep Learning:** "regression" using Deep Neural Networks.
- **Deep Neural Network:**
  - A non-linear function with potentially MANY (millions of) parameters.
  - If big enough, it can approximate any function.
  - With enough data, it can learn complex mappings.

[Diagram: a deep neural network mapping inputs x1–x10 to outputs ŷ1–ŷ10.]

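Such a network is just a stack of affine maps and nonlinearities. A minimal forward pass, with made-up layer sizes matching the 10-in/10-out diagram (the weights here are random; in practice they are learned from data by backpropagation, which is not shown):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    """Forward pass of a small fully connected network: the 'deep'
    non-linear function f_hat(x) whose parameters are learned from data."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)      # hidden layers: affine map + nonlinearity
    W, b = params[-1]
    return h @ W + b             # linear output layer

rng = np.random.default_rng(0)
sizes = [10, 32, 32, 10]         # 10 inputs -> two hidden layers -> 10 outputs
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((4, 10))  # a batch of 4 input vectors
y_hat = mlp_forward(x, params)
assert y_hat.shape == (4, 10)
```

Even this tiny network has 10·32 + 32 + 32·32 + 32 + 32·10 + 10 = 1,738 parameters; the networks used for speech enhancement are far larger.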


### Deep Learning

What Can It Do?

### Game changer

From PwC's report *Sizing the prize* (www.pwc.com/AI): "What's the real value of AI for your business and how can you capitalise?"

- AI could contribute up to **$15.7 trillion** to the global economy in 2030, more than the current output of China and India combined.
- PwC research shows global GDP could be up to **14% higher** in 2030 as a result of AI, making it the biggest commercial opportunity in today's fast-changing economy.
- The greatest gains from AI are likely to be in China (boost of up to **26%** GDP in 2030) and North America (potential **14%** boost).
- The biggest sector gains will be in retail, financial services, and healthcare as AI increases productivity, product quality, and consumption.

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

AI could contribute up to $15.7 trillion to the global economy in 2030, more than the current output of China and India combined.

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

**Artificial intelligence (AI) is a source of both huge excitement ****and apprehension. What are the real opportunities and threats *** for your business? Drawing on a detailed analysis of the business *
impact of AI, we identify the most valuable commercial opening in
your market and how to take advantage of them.

*www.pwc.com/AI*
PwC research shows

global GDP could be up to 14% higher in 2030 as a result of AI – the equivalent of an additional $15.7 trillion – making it the biggest commercial opportunity in today’s fast changing economy.

The greatest gains from AI are likely to be in China (boost of up to 26% GDP in 2030) and North America (potential 14% boost). The biggest sector gains will be in retail, f nancial services and healthcare as AI increases productivity, product quality and consumption.

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

AI could contribute up to $15.7 trillion to the global economy in 2030, more than the current output of China and India combined.

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

**Artificial intelligence (AI) is a source of both huge excitement ****and apprehension. What are the real opportunities and threats *** for your business? Drawing on a detailed analysis of the business *
impact of AI, we identify the most valuable commercial opening in
your market and how to take advantage of them.

*www.pwc.com/AI*
PwC research shows

global GDP could be up to 14% higher in 2030 as a result of AI – the equivalent of an additional $15.7 trillion – making it the biggest commercial opportunity in today’s fast changing economy.

The greatest gains from AI are likely to be in China (boost of up to 26% GDP in 2030) and North America (potential 14% boost). The biggest sector gains will be in retail, f nancial services and healthcare as AI increases productivity, product quality and consumption.

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

**Artificial intelligence (AI) is a source of both huge excitement ****and apprehension. What are the real opportunities and threats *** for your business? Drawing on a detailed analysis of the business *
impact of AI, we identify the most valuable commercial opening in
your market and how to take advantage of them.

*www.pwc.com/AI*
PwC research shows

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

**Artificial intelligence (AI) is a source of both huge excitement ****and apprehension. What are the real opportunities and threats *** for your business? Drawing on a detailed analysis of the business *
impact of AI, we identify the most valuable commercial opening in
your market and how to take advantage of them.

*www.pwc.com/AI*
PwC research shows

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

**Artificial intelligence (AI) is a source of both huge excitement ****and apprehension. What are the real opportunities and threats *** for your business? Drawing on a detailed analysis of the business *
impact of AI, we identify the most valuable commercial opening in
your market and how to take advantage of them.

*www.pwc.com/AI*
PwC research shows

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?

**+26%****+14%**

### Deep Learning

What Can It Do?

### Game changer

## $15.7 trillion

**Sizing the prize**

What’s the real value of AI for your business and how can you capitalise?


**Artificial intelligence (AI) is a source of both huge excitement and apprehension. What are the real opportunities and threats for your business? Drawing on a detailed analysis of the business impact of AI, we identify the most valuable commercial openings in your market and how to take advantage of them.**

*www.pwc.com/AI*

PwC research shows global GDP could be up to 14% higher in 2030 as a result of AI – the equivalent of an additional $15.7 trillion – making it the biggest commercial opportunity in today’s fast changing economy.

The greatest gains from AI are likely to be in China (boost of up to 26% of GDP in 2030) and North America (potential 14% boost).

The biggest sector gains will be in retail, financial services and healthcare as AI increases productivity, product quality and consumption.

Morten Kolbæk | Single-Microphone Speech Enhancement and Separation Using Deep Learning

32 12

### Part II

### Scientific Contributions


### Generalization of Deep Learning based Speech Enhancement

Human Receivers - Speech Intelligibility

Machine Receivers - Speaker Verification

### On STOI Optimal Deep Learning based Speech Enhancement

### Permutation Invariant Training for Deep Learning based Speech Separation

### Summary and Conclusion


### Generalization of DNN based Speech Enhancement

Human Receivers - Motivation and Research Gap

### Promising Results

- Recent studies show that speech enhancement algorithms based on deep learning outperform classical techniques.

- However, these DNNs are typically trained and tested in **"narrow"** conditions.

### Research Gap

- It is **unknown** how these algorithms perform in more general, **"broader"** conditions and in conditions with a mismatch between training and test.

Block diagram: y[n] → Framing/Transform → r(k, m) → Gain Estimator (e.g., a DNN, guided by assumptions/priors) → ĝ(k, m) → â(k, m) = ĝ(k, m) · r(k, m) → Synthesis/Overlap-add → x̂[n]

- y[n] : noisy speech (time domain)
- r(k, m) : noisy speech (transform domain)
- ĝ(k, m) : estimated gain
- â(k, m) : enhanced speech (transform domain)
- x̂[n] : enhanced speech (time domain)
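The gain-based pipeline above (framing and transform, gain estimation, gain application, overlap-add synthesis) can be sketched in a few lines of NumPy. This is a minimal illustration, not the system studied in the thesis: `toy_gain` is a hypothetical stand-in for the trained DNN, and the frame length, hop size, and test signal are illustrative choices.

```python
import numpy as np

def stft(y, frame_len=256, hop=128):
    """Frame the signal, apply a Hann window, and take the FFT of each frame."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[m * hop:m * hop + frame_len] * win for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)              # r(k, m), shape (frames, bins)

def istft(spec, frame_len=256, hop=128):
    """Inverse FFT per frame followed by windowed overlap-add synthesis."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * win
    x = np.zeros((len(frames) - 1) * hop + frame_len)
    wsum = np.zeros_like(x)
    for m, f in enumerate(frames):
        x[m * hop:m * hop + frame_len] += f
        wsum[m * hop:m * hop + frame_len] += win ** 2
    return x / np.maximum(wsum, 1e-8)               # normalize the window overlap

def enhance(y, gain_fn):
    """Apply a time-frequency gain, â(k, m) = ĝ(k, m) · r(k, m), and resynthesize."""
    r = stft(y)
    g = gain_fn(np.abs(r))                          # ĝ(k, m) in [0, 1]
    return istft(g * r)

def toy_gain(mag, floor=0.1):
    """Hypothetical stand-in for the DNN: keep high-energy bins, attenuate the rest."""
    g = mag / (mag + np.median(mag))
    return np.clip(g, floor, 1.0)

rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 440 * np.arange(4096) / 8000) + 0.3 * rng.standard_normal(4096)
x_hat = enhance(y, toy_gain)                        # enhanced time-domain signal x̂[n]
```

In the thesis setting, `toy_gain` would be replaced by a DNN that maps noisy spectral features to the gain ĝ(k, m); the framing, transform, and overlap-add stages stay the same.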




### Generalization of DNN based Speech Enhancement

Human Receivers - Contribution

### Contribution

- We studied the generalization capability of deep neural network-based speech enhancement algorithms for additive-noise-corrupted speech [1].

- Specifically, our goal was to study the generalization error w.r.t. three dimensions:
  - Speaker identity
  - Signal-to-noise ratio (SNR)
  - Noise type

- We trained multiple DNNs with various priors.

- Generalization was evaluated using PESQ and STOI, which are estimators of speech quality and speech intelligibility, respectively.

Block diagram: y[n] → Framing/Analysis → r(k, m) → Deep Neural Network → ĝ(k, m) → â(k, m) → Synthesis/Overlap-add → x̂[n]; the priors "Speaker ID", "SNR", and "Noise type" define the training conditions of the network.

### y[n] = x[n] + αv[n]

[1] M. Kolbæk et al., IEEE TASLP, 2017
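In the mixing model y[n] = x[n] + αv[n], the scaling factor α sets the SNR of the noisy mixture. A minimal sketch of how α can be computed for a desired SNR follows; the function name `mix_at_snr` and the sinusoidal stand-in for clean speech are illustrative, not taken from the paper.

```python
import numpy as np

def mix_at_snr(x, v, snr_db):
    """Scale noise v so that y[n] = x[n] + α·v[n] has the requested SNR.

    From SNR(dB) = 10·log10( Σx² / Σ(αv)² ) it follows that
    α = sqrt( Σx² / (Σv² · 10^(SNR/10)) ).
    """
    alpha = np.sqrt(np.sum(x ** 2) / (np.sum(v ** 2) * 10 ** (snr_db / 10)))
    return x + alpha * v, alpha

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)   # stand-in for clean speech x[n]
v = rng.standard_normal(8000)                           # noise v[n]
y, alpha = mix_at_snr(x, v, snr_db=0.0)                 # noisy mixture at 0 dB SNR
```

Sweeping `snr_db` over a range of values is how matched and mismatched SNR training/test conditions can be generated from the same clean-speech and noise material.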

