5.2 Gaussian Mixture Models

The choice of a stochastic model for speaker identification has to be made with certain criteria in mind. Density models are used to describe the distribution of a data set, meaning that the chosen model must be able to fit the training data. The model must also be able to recognize test data whose distribution is similar to that of the training data. This ability is referred to as the generalization ability of the model. It deteriorates if the model is too finely tuned to the training data, as test data can then no longer be recognized if it deviates even slightly from the training data.

There are two main subsets of density models: parametric and non-parametric. The non-parametric method does not have a pre-specified form and depends entirely on the data itself, with no prior assumptions made. This makes it possible to estimate the true probability density very closely, though for data sets of large dimensionality the problems of inadequate storage space and lengthy computation time may arise. In cases where data points are missing, the non-parametric model does not provide a good representation of the data.

The parametric methods, on the other hand, have a pre-specified functional form that depends on a number of adjustable parameters. These adjustments are made when the parametric model is fitted to the data set during the training, or enrollment, phase. When data is sparse, the model retains to a certain level its ability to represent the input data. The disadvantage of these methods is that the density model may be unable to provide a good representation of the true input density, as the latter may deviate substantially from the model's basic form. A third alternative to these methods is the semi-parametric methods [15].

The advantage of using semi-parametric methods is that they allow many degrees of freedom, making them more flexible and sensitive to the true density function of the input data than the parametric density models. The structure and parameters within the semi-parametric model, however, ensure that the density function behaves in a known way, making it more robust when dealing with sparse data than the non-parametric methods, though it remains subject to the curse of dimensionality explained in Section 4.2.

Semi-parametric distributions can be realized as mixture distributions [15]. The density model implemented here is the Mixture of Gaussians (MoG) model [41]. MoG models are chosen because they are known to be able to approximate any density with arbitrary precision, and because they have proven to be very well suited for speech modelling tasks and subsequent text-independent speaker identification [59].

A MoG model consists of, as the name implies, a mixture of Gaussian distributions. A Gaussian distribution is defined by two parameters: µ, the mean, and σ², the variance. In d-dimensional space these parameters become the mean vector µ of dimension d×1 and the covariance matrix Σ of dimension d×d. The Gaussian density model, N(x; µ, Σ), is defined in Eq. (5.5). The dimensionality is determined by the dimension of the feature sets that are modelled. The frame of input data x is used without its frame index n here, to simplify the initial derivations.

N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \qquad (5.5)
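
As a concrete illustration, Eq. (5.5) translates directly into a few lines of NumPy. This is a minimal sketch, not code from the thesis; the function name gaussian_density and the use of a linear solve in place of an explicit matrix inverse are illustrative choices.

import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the d-dimensional Gaussian density N(x; mu, Sigma) of Eq. (5.5)."""
    d = mu.shape[0]
    diff = x - mu
    # Normalization constant (2*pi)^(d/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    mahal = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * mahal) / norm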

A third parameter defines the Mixture of Gaussians model. This is the mixing weight vector of dimensionality M×1, where M is the number of Gaussian components in the model. The MoG model is thus defined as a weighted sum of Gaussian density functions that depends on the M Gaussian components and their corresponding mixing weights, denoted P(j), j = 1, …, M. The mixing weights are all positive and sum to unity. The MoG is defined in Eq. (5.6).

p(x) = \sum_{j=1}^{M} P(j)\, N(x; \mu_j, \Sigma_j) \qquad (5.6)

where M is the number of mixture components, N(x; µ_j, Σ_j) is the jth Gaussian component density function, P(j) is the probability of the jth component and p(x) is the MoG model for the feature vector of an observation sequence. The constraints that apply to the probabilities that contribute to the mixture model are listed below:

\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1, \qquad \int p(x \mid j)\, dx = 1
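
Eq. (5.6) and the weight constraints can likewise be sketched in a few lines of Python, reusing the gaussian_density sketch above. The array shapes in the docstring are assumptions made for the illustration; this is not the thesis's implementation.

import numpy as np

def mog_density(x, weights, mus, Sigmas):
    """Evaluate the MoG density p(x) of Eq. (5.6).

    weights : (M,) mixing weights P(j)
    mus     : (M, d) component mean vectors
    Sigmas  : (M, d, d) component covariance matrices
    """
    # The mixing weights must be non-negative and sum to unity
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(P_j * gaussian_density(x, mu_j, Sigma_j)
               for P_j, mu_j, Sigma_j in zip(weights, mus, Sigmas))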

The number, M, of Gaussian components in the model has to be pre-specified. Initially, a common value for M will be determined, after which separate values for each speaker model, M_i, will be implemented to determine whether this leads to an increase in overall classification performance. Apart from the number of components, the mixture model is flexible and does not depend on any prior knowledge of the distribution of data points in the feature vectors used as input for training or testing. More specifically, this means that the MoG model is suitable for the text-independent task, as no predefined sequence of words has to be used as input to the model.
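
The thesis selects M empirically, as described above. Purely as an illustration of one standard alternative, a per-speaker order can be chosen by fitting candidate model orders and keeping the one with the lowest Bayesian information criterion (BIC); the function name and candidate values below are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_components(features, candidate_Ms=(2, 4, 8, 16)):
    """Pick the number of components with the lowest BIC.

    features : (n_frames, d) feature vectors for one speaker.
    """
    fits = [GaussianMixture(n_components=M).fit(features) for M in candidate_Ms]
    bics = [gm.bic(features) for gm in fits]
    return candidate_Ms[int(np.argmin(bics))]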

Each speaker is represented by a MoG model defined by a parameter set θ_i, so that p(x) of Eq. (5.6) can be denoted p(x; θ_i). The speaker-specific parameter set consists of the parameters P_i(j), µ_{i,j} and Σ_{i,j}, for 1 ≤ i ≤ S and 1 ≤ j ≤ M.
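
To make the role of the per-speaker parameter sets concrete, the following sketch scores test frames against all S models and returns the best-scoring one. The decision rule used here, the maximum summed log-likelihood over frames, is a standard choice stated as an assumption, not a quotation of the thesis's classification rule; it reuses the mog_density sketch above.

import numpy as np

def identify_speaker(frames, speaker_models):
    """frames : (n, d) test feature vectors
    speaker_models : list of (weights, mus, Sigmas) tuples, one theta_i per speaker."""
    log_liks = [sum(np.log(mog_density(x, *theta_i)) for x in frames)
                for theta_i in speaker_models]
    return int(np.argmax(log_liks))  # index i of the highest-scoring speaker model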

To illustrate a basic MoG, a simple 1-dimensional MoG is estimated. The data used to estimate the model consists of 100 frames from training sentence a for Speaker 1, and the feature used is the 5th MFCC. The distribution of these data points is shown in Figure 5.1.

[Figure: plot titled "Data points for 5th MFCC, Sp1, sentence a"; x-axis: frame index (0 to 120), y-axis: 5th MFCC value (−30 to 5).]

Figure 5.1: The values of the 5th MFCC for 100 frames of Sp1, sentence a

The MoG model is implemented with M = 3 components. In Figure 5.2, the three Gaussian components are shown and the resulting overall model is drawn, based on the weights of each of the mixture components.
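
For reference, the setup behind Figure 5.2 can be reproduced with an off-the-shelf EM fit. The sketch below is an assumption in two respects: the thesis's own estimation procedure is not scikit-learn, and the actual MFCC values are not reproduced here, so mfcc5 stands in for them with placeholder data.

import numpy as np
from sklearn.mixture import GaussianMixture

mfcc5 = np.random.default_rng(0).normal(-10, 6, size=100)  # stand-in for the real 5th-MFCC values
gmm = GaussianMixture(n_components=3).fit(mfcc5.reshape(-1, 1))  # M = 3 components

# Evaluate the fitted mixture density p(x) on a grid, as plotted in Figure 5.2
x = np.linspace(-35, 15, 400).reshape(-1, 1)
p_x = np.exp(gmm.score_samples(x))  # score_samples returns log p(x)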

The means of the three Gaussians in the MoG model vary from −19 to 0, which roughly corresponds to the region that the data points in Figure 5.1 occupy, as can be seen from the values of these points along the y-axis.

[Figure: plot titled "MoG for M=3, 100 frames of training data from Speaker 1"; x-axis: x (−35 to 15), y-axis: p(x) (0 to 0.7); curves: the mixture model and the components j=1, j=2, j=3.]

Figure 5.2: The 1-dimensional Mixture of Gaussians model for M = 3, the 5th MFCC for 100 frames from Sp1, sentence a
