
Algorithm Evaluation & Comparison

Many of the described algorithms are developed from a machine learning outset, where the goal is to maximize the signal-to-error ratio (SER) on the test set: the higher the better.

However, in audio applications, the evaluation should take into account how the output of the algorithm would sound. Thus, a source separation algorithm should be evaluated according to the degree to which the sounds are perceived as separated. A related issue is audio coding such as MP3,9 where a decreased SER is acceptable, so long as the deteriorations are inaudible to a human listener.

9 Short for MPEG-1 Audio Layer 3.

Conversely, serious artifacts in the processed audio caused by some algorithms may result in only a relatively small decline in SER.

Ideally, the output of all the mentioned algorithms for single-channel separation of speech should be exposed to human subjective evaluation. In the case of speech, the second-best solution may be to expose the algorithms to a standard automatic speech recognizer (ASR). This was done in the 2007 Speech Separation Challenge.10 However, this approach has its own inherent weakness in that the ASR may exhibit an undesired pattern of sensitivities. Ellis (2004) discusses the evaluation of speech separation algorithms.

One might speculate that a purist Bayesian machine learner would dislike the idea of using different cost functions for learning parameters and for evaluating them. A more fundamentally sound approach would consist in optimizing a distance measure founded on proper psychoacoustic principles.

10 See http://www.dcs.shef.ac.uk/martin/SpeechSeparationChallenge.htm.

Chapter 3

Multi-channel Separation

Multiple sensors are exploited in naval surveillance, where hydrophone arrays are used to map the positions of vessels. In electroencephalography (EEG), electrodes are placed on the scalp to monitor brain activity. Similarly, modern hearing aids are equipped with multiple microphones. Common to these examples is that the intensity of the interfering signals is significant relative to the target signals. Multiple sensors are used to amplify signals originating from a given direction in space and to suppress signals from other directions, thus increasing the target-to-interferer ratio. In its basic form, this is known as beamforming, a term which usually refers to linear array processing and can be regarded as a spatial generalization of classical filtering techniques (Krim and Viberg, 1996). More generally, signal separation algorithms, linear as well as non-linear, may benefit from the added discriminative power provided by multiple sensors, and this is indeed the topic of this chapter.

The content is organized as follows: the convolutive model for multi-channel mixtures is defined in section 3.1. The major part of the coverage focuses on methods that are based on second-order statistics, or Gaussian signal assumptions (section 3.2). Other methods, e.g., those based on higher-order statistics and non-Gaussian distributions, are reviewed briefly in section 3.3. Comments on published work co-authored by me are placed in the vicinity of their relatives in the literature.

3.1 Scope and Problem Formulation

In the context of separation of audio signals, multiple microphones have been employed with some level of success. Weinstein et al. (1993) and Yellin and Weinstein (1996) provide the earliest evidence that speech signals could be separated from their mixtures, recorded in a real room. Interest in the field has since surged, so much so that Pedersen et al. (2007) can cite 299 articles on the subject.

The count is much higher if the more general problem of multi-channel separation is considered: at the 2006 conference on Independent Component Analysis (ICA) and Blind Source Separation in Charleston, 120 papers were presented.1 This was the sixth meeting on the topic since 1999. The major part of the research is concerned with blind separation of instantaneous linear mixtures, that is, given the observation model x(t) = As(t), estimate A and infer the sources s(t). Under assumptions of independence and non-Gaussian sources, this problem can sometimes be solved using ICA; see chapter 4.

The coverage here, on the other hand, is exclusively devoted to the set of problems that are best described by a convolutive model,

y(t) = \sum_{\tau=0}^{L-1} A(\tau) s(t - \tau) + v(t)    (3.1)

where the observed y(t) is a vector of mixture signals at time t, and s(t) and v(t) are the source and noise vectors, respectively. The mapping is governed by A(τ), a set of mixing matrices at L different lags. Assuming that the sources, s(t), are mutually statistically independent and that the channel, A(τ), is unknown, the overall goal is to estimate A and infer s(t).
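To make the model concrete, the following is a minimal NumPy sketch of the mixing process in (3.1); the dimensions, filter taps, and source signals are arbitrary stand-ins, not data from this work.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, L, T = 2, 2, 8, 1000                # sensors, sources, filter taps, samples
A = 0.5 * rng.standard_normal((L, P, Q))  # mixing filters A(tau), one P-by-Q matrix per lag
s = rng.standard_normal((Q, T))           # independent source signals s(t)
v = 0.01 * rng.standard_normal((P, T))    # additive sensor noise v(t)

# y(t) = sum_{tau=0}^{L-1} A(tau) s(t - tau) + v(t)
y = v.copy()
for tau in range(L):
    y[:, tau:] += A[tau] @ s[:, :T - tau]

print(y.shape)  # (2, 1000): one convolutive mixture signal per sensor
```

At t = 0 only the lag-zero term contributes, so y(0) = A(0)s(0) + v(0), which gives a quick sanity check on the loop.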

The convolutive model arises when the mixture is not instantaneous, that is, when the sources mix into the sensors as filtered versions. One instance occurs when there are different time delays between a given source and the sensors.

This naturally occurs in acoustic scenarios, e.g., rooms, where the sounds travel different distances between the sources and the sensors and, additionally, multiple echoes of an emitted sound are observed at a sensor (see figure 3.1). In acoustic

1 The conference web site is located at http://www.cnel.ufl.edu/ica2006/papers accepted.php.


Figure 3.1: The convolutive mixing model exemplified: the sounds are reflected by the walls of the room and arrive at the microphones with various delays and attenuations. The corresponding observation model is a convolution sum of the source signals and the impulse responses.

mixtures, we can thus regard (A)ij(τ) as describing the room impulse response between source j and sensor i. In general, the model cannot be inverted, and the sources cannot be retrieved, but a solution exists in many special cases, which are described in the following sections.

Nothing entirely general can be said about the identifiability of the sources and the channel, since it naturally depends on the assumptions built into the separation algorithm. However, for the set of methods that assume little, e.g., that the sources are independent or uncorrelated, the source signals, s(t), can be determined only up to an arbitrary filtering. This is because filtered versions of the room impulse responses in (A)ij(τ) may be cancelled by applying the inverse filter to (s)j(t). However, if the source separation algorithm has been informed of, e.g., the scale or the coloring of s(t), the ambiguity is reduced accordingly.

Sometimes the arbitrary filtering of the inferred sources is undesirable, and we may choose to project back to the sensor space, in which case the ambiguities in (A)ij(τ) and (s)j(t) cancel out. Practically speaking, this means that we infer the audio sources as they sound at the microphones.

Furthermore, the source index may be permuted arbitrarily, in that the model is invariant to a permutation of the elements of s(t) and the columns of A(τ). In the case of an equal number of sources and sensors (Q = P), we can only hope to estimate Ps(t) and A(τ)P^{-1}, where P is a permutation matrix.
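The permutation ambiguity is easy to verify numerically; the sketch below uses a single-lag (instantaneous) mixture for brevity, but the same cancellation holds per lag in the convolutive case. All matrices are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))         # a mixing matrix (one lag, for brevity)
s = rng.standard_normal((2, 5))         # two source signals, five samples
P = np.array([[0.0, 1.0], [1.0, 0.0]])  # permutation matrix: swap the two sources

# Permuting the sources while the channel absorbs the inverse
# permutation leaves the observed mixture unchanged:
x = A @ s
x_permuted = (A @ np.linalg.inv(P)) @ (P @ s)
print(np.allclose(x, x_permuted))  # True
```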

An important simplification occurs when the convolutive mixing model (3.1) reduces to a pure attenuate-and-delay model, where only a single filter tap is non-zero. In this case, the i,j'th element of A(τ) is redefined as

\tilde{A}_{ij}(\tau) = \delta(\tau - \Delta_{ij})    (3.2)

where δ(τ) is the Kronecker delta function and ∆ij is the delay between the j'th source and the i'th sensor. Acoustic mixing in an anechoic room is appropriately represented by (3.2).
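The attenuate-and-delay special case (3.2) can be sketched as follows; the integer delays ∆ij below are hypothetical. With a single unit tap per source/sensor pair, mixing reduces to delaying each source on its way to each sensor.

```python
import numpy as np

P, Q, L, T = 2, 2, 16, 200
delays = np.array([[0, 3], [5, 1]])  # hypothetical Delta_ij in samples

# A_ij(tau) = delta(tau - Delta_ij): one unit tap per source/sensor pair
A = np.zeros((L, P, Q))
for i in range(P):
    for j in range(Q):
        A[delays[i, j], i, j] = 1.0

s = np.random.default_rng(2).standard_normal((Q, T))
y = np.zeros((P, T))
for tau in range(L):
    y[:, tau:] += A[tau] @ s[:, :T - tau]

# Sensor 0 receives source 0 undelayed plus source 1 delayed by 3 samples
print(np.allclose(y[0, 3:], s[0, 3:] + s[1, :T - 3]))  # True
```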

3.1.1 Frequency Domain Formulation

Many algorithms work in the (Fourier) frequency domain, where multiplication approximately replaces convolution. Therefore, I redefine (3.1) by applying the discrete Fourier transform (DFT) to windowed frames of y(t), obtaining,

y_k^{(n)} = A_k s_k^{(n)} + e_k^{(n)}    (3.3)

where y_k^{(n)}, s_k^{(n)} and A_k are the frequency-domain versions of the corresponding time-domain signals at discrete frequency k. The window (time) index is n. There is a residual term, e_k^{(n)}, which is partly due to additive noise, v(t), and partly due to the fact that equation (3.1) is a linear convolution rather than a circular one. When the window length is much larger than L, the latter mismatch vanishes, that is, ⟨|e_k|/|x_k|⟩ → 0. The notation indicates that the channel, A_k, is assumed constant on the time-scale of the estimation, which may sometimes be a rather strict constraint, e.g., excluding a cocktail party situation with overly mobile participants.
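The shrinking of the linear-versus-circular mismatch for long windows can be checked numerically. The sketch below uses a single source and sensor and an arbitrary random filter (stand-ins, not data from this work) and compares the windowed linear convolution against the per-bin product A_k s_k.

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 8, 1024                    # filter length and window length, N >> L
a = 0.5 * rng.standard_normal(L)  # one mixing filter (single source and sensor)
s = rng.standard_normal(N)

y = np.convolve(a, s)[:N]         # time domain: linear convolution, windowed

A_k = np.fft.rfft(a, N)           # channel at discrete frequencies k
s_k = np.fft.rfft(s)

# Circular convolution implied by per-bin multiplication A_k * s_k
y_circ = np.fft.irfft(A_k * s_k, N)

# The relative residual is small because the window is much longer
# than the filter: only the last L-1 wrapped-around samples differ.
resid = np.linalg.norm(y - y_circ) / np.linalg.norm(y)
print(resid < 0.2)  # True
```

Increasing N (or shortening the filter) drives the residual further toward zero, which is the sense in which multiplication replaces convolution per frequency bin.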

3.1.2 Frequency Permutation Problem

The transformation to the frequency domain is particularly useful because it allows efficient ICA methods to be applied independently to each bin, k, in equation (3.3). However, there is a serious challenge associated with following such an
