
decoupled across frequencies. As a consequence, the inversion to the time domain is difficult unless the permutation can be harmonized so that it is the same for all bins. Assumptions regarding the channel and the sources can be exploited for this purpose. Consider for example a pure delay-and-attenuate mixing system (3.2), which can be regarded as a model of an anechoic room. Then the estimate Â(τ) should be permutation-corrected so that the amplitude is constant across frequency and the phase is linear in frequency.

Alternatively, the frequency permutation problem can be fixed by using the structure of the sources. One possibility is to optimize the correcting permutation so that it maximizes the correlation of the amplitudes across frequencies. In fact, Anemüller and Kollmeier (2000) turned this criterion into a full separation algorithm.
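As a sketch of the amplitude-correlation heuristic, the following simplified routine aligns the permutations of a two-source frequency-domain separator by comparing per-bin amplitude envelopes against a running reference. The function name and the greedy pass over bins are illustrative choices, not the actual algorithm of Anemüller and Kollmeier (2000):

```python
import numpy as np

def align_permutations(amps):
    """Greedy permutation correction for a two-source separator.
    amps: array (K, 2, T) of per-bin source amplitude envelopes.
    Each bin is permuted to maximize the correlation of its envelopes
    with a running reference (hypothetical, simplified scheme).
    """
    K, N, T = amps.shape
    assert N == 2
    aligned = amps.copy()
    ref = aligned[0]                     # reference envelopes from the first bin
    for k in range(1, K):
        a = aligned[k]
        def corr(x, y):
            x = x - x.mean(); y = y - y.mean()
            return (x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
        keep = corr(a[0], ref[0]) + corr(a[1], ref[1])
        swap = corr(a[0], ref[1]) + corr(a[1], ref[0])
        if swap > keep:                  # swapping the sources matches better
            aligned[k] = a[::-1]
        ref = 0.5 * (ref + aligned[k])   # smooth the reference across bins
    return aligned
```

A full separation algorithm would of course also have to estimate the per-bin unmixing matrices; here only the alignment step is shown.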

3.2 Decorrelation

In signal processing, it is a common theme to base a solution on the second-order statistics of the signals. Ignoring the means, which can be pre-subtracted and post-added, this means that the relevant information is contained in the auto- and cross-correlation functions. In the context of multi-channel separation, this translates to ensuring that the cross-correlation between the sources is zero at all lags. The time-lagged covariance of the source estimate ŝ(t) is defined as

Λ(τ) = ⟨ ŝ(t) ŝ^⊤(t − τ) ⟩ (3.4)

where τ is the lag time. The goal is to diagonalize Λ(τ). Molgedey and Schuster (1994) showed that for instantaneous mixtures (those that are constrained to L = 1 in equation 3.1), diagonalization in fact retrieves the actual sources, except for a scaling and permutation uncertainty. In fact, they showed that Λ(τ) is only required to be diagonal at τ = 0 and additionally at one lag different from zero, τ = τ0. The solution for A is obtained by solving an eigenvalue problem.² It is a condition that the ratio between the auto-correlation coefficients at these

² It is assumed that A is invertible.

lags is different across sources in order for the problem to be solvable using this technique. Parra and Sajda (2003) generalized the eigenvalue solution to other statistics than lagged covariance matrices, providing a quick-and-dirty method in many instances.
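The eigenvalue solution can be sketched for the instantaneous (L = 1), real-valued case: the eigenvectors of Λ(τ0)Λ(0)^{-1} computed from the observations recover the columns of the mixing matrix, up to scale and permutation. The function name and implementation details below are illustrative, not the authors' code:

```python
import numpy as np

def molgedey_schuster(x, tau):
    """Estimate the mixing matrix of an instantaneous mixture x = A s from
    second-order statistics at lags 0 and tau (in the spirit of Molgedey and
    Schuster, 1994). x: array (n_channels, T). Returns (A_hat, s_hat).
    Real-valued sketch; the lagged covariance is symmetrized for stability.
    """
    x = x - x.mean(axis=1, keepdims=True)
    T = x.shape[1]
    C0 = x[:, :T - tau] @ x[:, :T - tau].T / (T - tau)   # Λ(0)
    Ct = x[:, tau:] @ x[:, :T - tau].T / (T - tau)       # Λ(tau)
    Ct = 0.5 * (Ct + Ct.T)
    M = Ct @ np.linalg.inv(C0)
    # M = A diag(ratios) A^{-1}: its eigenvectors are the columns of A
    _, A_hat = np.linalg.eig(M)
    s_hat = np.linalg.solve(A_hat, x)
    return np.real(A_hat), np.real(s_hat)
```

The solvability condition from the text appears here as the requirement that the eigenvalues of M, the per-source auto-correlation ratios, are distinct.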

In the case of the full convolutive model (3.1), decorrelation of stationary sources does not achieve identification of the mixing system or inference of the sources, as noted by, e.g., Gerven and Compernolle (1995). This can be realized by considering the decorrelation criterion (3.4) in the frequency domain. The auto- and cross-power spectra of x_t, C_k, depend on the spectra of s_t as follows,

C_k = A_k D_k A_k^H + E_k (3.5)

where D_k is a diagonal matrix with the powers of the sources as its elements. The power spectrum residual E_k vanishes when e_k is small. Now it can be seen that the channel and the source spectra are ill-determined, because {A_k, D_k} and {A_k D_k^{1/2} U, I} are solutions that produce identical statistics, Λ(τ), and hence are indistinguishable. The orthogonal matrix U obeys UU^⊤ = I. Hence, additional discriminative properties of the sources need to be present in order to overcome this limitation.
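The indeterminacy can be verified in one line by substituting the alternative pair {A_k D_k^{1/2} U, I} into the model term of (3.5), using UU^⊤ = I and the fact that D_k is diagonal (real orthogonal U is assumed here for simplicity):

```latex
\left(\mathbf{A}_k \mathbf{D}_k^{1/2}\,\mathbf{U}\right)\mathbf{I}
\left(\mathbf{A}_k \mathbf{D}_k^{1/2}\,\mathbf{U}\right)^{H}
= \mathbf{A}_k \mathbf{D}_k^{1/2}\,\mathbf{U}\mathbf{U}^{\top}\,\mathbf{D}_k^{1/2}\,\mathbf{A}_k^{H}
= \mathbf{A}_k \mathbf{D}_k \mathbf{A}_k^{H}
```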

In order to identify the model, Weinstein et al. (1993) suggested taking advantage of a fairly common quality of real-world signals, namely that their statistics vary in time. For example, speech signals can be considered non-stationary if measured across windows that are sufficiently short (but still long enough to obtain a reliable estimate). Thus, we extend (3.5) to account for the non-stationarity,

C_k^(m) ≈ A_k D_k^(m) A_k^H (3.6)

where m is the window index, not to be confused with the index in (3.3). The key point is that, if different auto/cross power spectra are measured at multiple times (with A_k fixed), then the number of constraints increases at a higher rate than the number of unknowns. Parra and Spence (2000) turned (3.6) into a practical algorithm, employing gradient descent as the vehicle of optimization. The problem of different permutations across frequency was approached by constraining the filter length, L, to be sufficiently smaller than the window length of the DFT, effectively ensuring smooth frequency responses.
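The non-stationarity criterion can be sketched numerically. The toy routine below makes several simplifying assumptions (real-valued covariances, a single frequency bin, plain gradient descent with row normalization in place of the constraints used in the published algorithm) and minimizes the sum of squared off-diagonal elements of W C^(m) W^⊤ over windows m:

```python
import numpy as np

def offdiag(M):
    """Zero the diagonal of a square matrix."""
    return M - np.diag(np.diag(M))

def joint_diagonalize(Cs, n_iter=5000, lr=0.01):
    """Find a separating matrix W for one frequency bin by gradient descent on
    a non-stationarity cost in the spirit of eq. (3.6),
        J(W) = sum_m || off(W C^(m) W^T) ||_F^2 ,
    where Cs holds covariance estimates from different analysis windows.
    Illustrative real-valued sketch; Parra and Spence (2000) work on complex
    cross-power spectra and handle permutation/scaling differently.
    """
    Cs = [C / np.linalg.norm(C) for C in Cs]  # equalize the scale of the terms
    n = Cs[0].shape[0]
    W = np.eye(n)
    for _ in range(n_iter):
        # gradient of ||off(W C W^T)||_F^2 w.r.t. W is 4 off(W C W^T) W C (C symmetric)
        grad = sum(4.0 * offdiag(W @ C @ W.T) @ W @ C for C in Cs)
        W = W - lr * grad
        W = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows: excludes W = 0
    return W
```

With covariances generated by a common mixing matrix and differing source powers, W converges toward a scaled, permuted inverse of the mixing matrix.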

Rahbar and Reilly (2005) and Olsson and Hansen (2006a) note that the non-stationary observation model (3.6) fits into the framework of multi-way analysis (Smilde et al., 2004). This can be seen by comparing to the symmetric version of the parallel factor (PARAFAC) model, which is defined as

x_ijk = Σ_{f=1}^{F} a_if b_jf a_kf

where a_if and b_jf are the elements of the loading matrices and F is the number of factors. The loading matrices have been shown to be identifiable for quite a high number of factors, lower-bounded by a theorem of Kruskal (1977). The treatment of (3.6) may still stand to gain further from the body of analysis and algorithms accumulated in the field of multi-way analysis.

3.2.1 Contributions IV-VI

Cost functions which depend only on second-order statistics often result from placing Gaussian assumptions on the variables of a linear generative model. In my work on time-domain algorithms, I indeed assumed Gaussianity and was able to derive maximum a posteriori (MAP) inference for the sources and maximum-likelihood estimators for the parameters. A linear state-space model which allows time-varying parameters was employed, including an autoregressive (AR) process with Gaussian innovation noise as a source model.³ Olsson and Hansen (2004b) applied maximum-likelihood learning to the parameters of the model, using an expectation-maximization (EM) algorithm to do so (Dempster et al., 1977). On the E-step, the sources are inferred using the Kalman smoother; the parameters are re-estimated on the M-step. The E and M steps were invoked alternately until convergence. We successfully separated speech signals that were mixed in a convolutive model and showed that the method is resilient to additive Gaussian noise. As an integral part of the Kalman filter implementation, the likelihood of the model parameters given the observed data is computed in the process of inferring the sources. This can be used in a model control framework, where the objective is to estimate the number of active sources in each time window.
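The way the likelihood falls out of Kalman filtering can be illustrated in a stripped-down, single-source setting. The scalar AR(1) state-space model below is a simplification of the models in the papers, and the function name is mine; the log-likelihood is accumulated from the innovations while the filter infers the source:

```python
import numpy as np

def kalman_loglik(x, a, q, r):
    """Log-likelihood of observations x under a scalar linear state-space model
        s_t = a * s_{t-1} + v_t,  v_t ~ N(0, q)   (AR(1) source)
        x_t = s_t + w_t,          w_t ~ N(0, r)   (noisy observation).
    Simplified single-source sketch; the thesis models are multivariate.
    """
    s_pred = 0.0
    P_pred = q / max(1e-12, 1.0 - a * a)   # stationary prior variance
    ll = 0.0
    for xt in x:
        e = xt - s_pred                    # innovation
        S = P_pred + r                     # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + e * e / S)
        K = P_pred / S                     # Kalman gain
        s_filt = s_pred + K * e
        P_filt = (1.0 - K) * P_pred
        s_pred, P_pred = a * s_filt, a * a * P_filt + q
    return ll
```

Because the likelihood is a by-product of filtering, comparing it across candidate models costs little extra, which is what a model control framework exploits.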

³ See the papers for details.

For this purpose, Olsson and Hansen (2004a) employed the Bayesian Information Criterion (BIC; Schwarz, 1978), which is an approximation of the Bayes factor/marginal likelihood of the model. The main computational component in BIC is the likelihood computation.
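For illustration, BIC reduces to a one-line penalty on the log-likelihood. The candidate log-likelihoods and parameter counts below are hypothetical numbers, not results from the papers:

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion (lower is better): -2 log L plus a
    complexity penalty; an asymptotic approximation to minus twice the log
    marginal likelihood (Schwarz, 1978)."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# Hypothetical model selection for one time window: n_sources -> log-likelihood
candidates = {1: -1520.3, 2: -1395.1, 3: -1391.8}
n_params = {1: 4, 2: 10, 3: 18}         # illustrative parameter counts
best = min(candidates, key=lambda n: bic(candidates[n], n_params[n], n_samples=500))
# with these numbers, two sources are selected: the small likelihood gain of a
# third source does not pay for its extra parameters
```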

An effort was made to tailor the algorithm to a specific domain, namely the separation of speech signals. For that purpose, a native part of linear state-space models, known as the control signal, can be used to shift the mean of the innovation noise process that drives the sources. Olsson and Hansen (2005) used a parameterized speech model as a control signal, effectively attracting the solution toward agreement with the speech model. We used the model of McAulay and Quatieri (1986), who coded fragments of speech signals in terms of a sum of a periodic signal and colored noise. As a necessary addition to the algorithm, the time-varying fundamental frequencies and the harmonic amplitudes and phases are estimated.

Zhang et al. (2006) extended our algorithm to account for a non-linear distortion of the observed mixtures and showed that the new method performs better than ours on synthetic data. Särelä (2004), Chiappa and Barber (2005), and Pedersen et al. (2007) referred to our work on this topic.

3.2.2 Contribution VII

Having formulated our favorite generative model of the data, it is often a major obstacle to choose the parameters of that model. In this case, as in many others, a number of unobserved sources or missing data influence the model. This precludes direct maximum-likelihood (ML) learning, as the complete likelihood function depends on data which are unavailable. Rather, the marginal likelihood should be optimized, requiring the formulation of a prior probability distribution for the sources. However, the resulting marginalization integral may not be easily optimized with respect to the parameters. The EM algorithm is an attractive iterative approach to obtaining the ML estimate, both in terms of simplicity of analysis and ease of implementation.

Slow convergence is a major caveat associated with the EM algorithm, but also with, e.g., steepest gradient descent. We (Olsson et al., 2007) discuss the possibility of extracting the gradient information from the EM algorithm and
