
4.4 The linear Gaussian case

Before comparing the EM algorithm and the gradient alternative, parameter estimation in the linear Gaussian case, i.e. the Kalman model, will be discussed.

In the following, iterative methods for system identification will be considered.

However, there are other alternatives, e.g. one-shot algorithms like the subspace method described in Van Overschee and de Moor (1996); Viberg (1995).

4.4.1 Identifiability

The unknown parameters in the Kalman filter (equation (3.4)) are all entries in the transition matrix F, all entries in the observation matrix H, and the two covariance matrices Q and R. Unfortunately, these parameters are not uniquely defined by the observations. The simplest way to see this is to insert an invertible matrix U into the equations

$$
\begin{aligned}
U x_k &= U F U^{-1}\, U x_{k-1} + U v_{k-1}, \qquad v \sim N(0, Q) && (4.6a) \\
z_k &= H U^{-1}\, U x_k + w_k, \qquad w \sim N(0, R) && (4.6b)
\end{aligned}
$$

With the new variables $\hat{x}_k = U x_k$, $\hat{F} = U F U^{-1}$ and so on, the output of the system (z) remains the same. For a general system identification problem, where the hidden space has a physical interpretation, this is of course undesired. Even when one does not care about the hidden space, the non-uniqueness can cause problems for the optimization.

In the machine learning community the un-identifiability of the parameters has in general not been considered a problem; for example, Welling (2000) derives the EM algorithm for parameter estimation without even mentioning the problem.

In Ghahramani and Hinton (1996); Roweis and Ghahramani (1999) it is noted that, without loss of generality, the transition noise Q can be set to the identity.

However, by remembering that $\mathrm{cov}(Uv) = U\,\mathrm{cov}(v)\,U^{\mathsf{T}}$, it is obvious that if Q is the identity, U can be any rotation matrix. Hence, restrictions on Q remove scale ambiguities but leave rotations.
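To make this concrete, here is a minimal numerical sketch (the toy parameters and the helper function are assumptions for illustration, not code from this work): the exact log-likelihood of a simulated observation sequence is evaluated under a parameter set (F, H, Q, R) and under the rotated set $(UFU^{-1}, HU^{-1}, UQU^{\mathsf{T}}, R)$. The two values agree up to numerical round-off, so no likelihood based method can tell the two systems apart.

```python
import numpy as np

def kalman_loglik(z, F, H, Q, R, x0, P0):
    """Log-likelihood log p(z_{1:T}) of x_k = F x_{k-1} + v_k, z_k = H x_k + w_k."""
    x, P, ll = x0, P0, 0.0
    for zk in z:
        x, P = F @ x, F @ P @ F.T + Q              # time update
        S = H @ P @ H.T + R                        # innovation covariance
        e = zk - H @ x                             # innovation
        ll -= 0.5 * (len(zk) * np.log(2 * np.pi)
                     + np.linalg.slogdet(S)[1]
                     + e @ np.linalg.solve(S, e))
        K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
        x, P = x + K @ e, P - K @ H @ P            # measurement update
    return ll

rng = np.random.default_rng(0)
dx, dz, T = 2, 2, 200
F = np.array([[0.9, 0.1], [0.0, 0.8]])             # toy parameters, chosen only for illustration
H = rng.standard_normal((dz, dx))
Q, R = np.eye(dx), 0.1 * np.eye(dz)
x0, P0 = np.zeros(dx), np.eye(dx)

x, z = np.zeros(dx), []                             # simulate one observation sequence
for _ in range(T):
    x = F @ x + rng.multivariate_normal(np.zeros(dx), Q)
    z.append(H @ x + rng.multivariate_normal(np.zeros(dz), R))
z = np.array(z)

a = 0.7                                             # an arbitrary rotation of the hidden space
U = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
Ui = np.linalg.inv(U)

print(kalman_loglik(z, F, H, Q, R, x0, P0))
print(kalman_loglik(z, U @ F @ Ui, H @ Ui, U @ Q @ U.T, R, U @ x0, U @ P0 @ U.T))
# the two log-likelihoods coincide: the rotated system cannot be distinguished from the data
```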

Kevin Murphy's Bayes Net Toolbox for Matlab (http://www.ai.mit.edu/~murphyk/Software/) also implements parameter estimation in Kalman filters along these lines, without taking the un-identifiability into account.

In the control community different canonical forms for the state-space system have been investigated, and for single input single output (SISO) systems e.g. the observer and the controller canonical forms solve the problem. However, for multiple input multiple output (MIMO) systems, the canonical forms require information about how many complex eigenvalues the system contains. The problem arises because the hidden states can have two interpretations: either a state can describe a variable of the system, or it can describe a delay. The problem is treated in e.g. Gevers and Wertz (1984); Overbeek and Ljung (1982); in these papers non-unique overlapping parameterizations are suggested, as well as unique representations that require the estimation of so-called 'structure indices'.

In McKelvey et al. (2004) a data-driven local parametrization is used.

The choice is now whether to keep the full system with all its ambiguities, fix Q and get rid of the scaling problems, choose F to be diagonal and introduce complex states, try to estimate the structure indices, or under-parameterize the system by choosing F to be diagonal and real.

Since this work stems from the machine learning community, the initial choice was over-parametrization. However, as the work progressed this approach became increasingly unattractive, mainly due to the non-uniqueness of the solution and the resulting non-reproducible results. At later stages Q was fixed, which still leaves the rotation ambiguities.

4.4.2 Kalman derivatives

To find the optimal parameters in the Kalman filter, the derivatives of the lower bound $B(\lambda, \theta)$ (equation (4.2)) must be found with respect to the parameters $\theta$. Here $\lambda$ is the smoothing solution for the hidden states $x_1, \ldots, x_T$, thus $q(x|\lambda) = p(x_{1:T}|z_{1:T})$. The parameters $\theta$ contain F, H, Q, R and the initial distribution given by $x_0$ and $\Sigma_0$.

By looking at the lower bound

$$
\begin{aligned}
B &= \int q(x|\lambda) \log \frac{p(x, z|\theta)}{q(x|\lambda)} \, dx && (4.7) \\
  &= \int q(x|\lambda) \log p(x, z|\theta) \, dx - \int q(x|\lambda) \log q(x|\lambda) \, dx && (4.8)
\end{aligned}
$$

it is seen that when $\lambda$ is fixed, only the term $\int q(x|\lambda) \log p(x, z|\theta) \, dx$ can be optimized.

Following e.g. Welling (2000), the joint probability of the hidden variables and the observations given the parameters can be factorized by using the Markov property of the model

$$
p(x_{1:T}, z_{1:T}|\theta) = p(x_1|\theta) \prod_{k=2}^{T} p(x_k|x_{k-1}, \theta) \prod_{k=1}^{T} p(z_k|x_k, \theta). \tag{4.9}
$$

Since all probabilities depend on $\theta$, the dependence is left out in the following to simplify the notation

$$
\begin{aligned}
B &= \int p(x_{1:T}|z_{1:T}) \log[p(x_{1:T}, z_{1:T})] \, dx_{1:T} \\
  &= \int p(x_{1:T}|z_{1:T}) \Big[ \log[p(x_1)] + \sum_{k=2}^{T} \log[p(x_k|x_{k-1})] + \sum_{k=1}^{T} \log[p(z_k|x_k)] \Big] \, dx_{1:T}
\end{aligned}
$$


So far everything holds for general distributions and for both linear and non-linear smoothers. Now the Gaussian distributions for the KF (equations (3.5)) can be inserted

$$
\begin{aligned}
B = -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \Big[ & \, d_x \log[2\pi] + \log[|\Sigma_0|] + (x_1 - x_0)^{\mathsf{T}} \Sigma_0^{-1} (x_1 - x_0) && (4.10) \\
& + \sum_{k=2}^{T} \big( d_x \log[2\pi] + \log[|Q|] + (x_k - F x_{k-1})^{\mathsf{T}} Q^{-1} (x_k - F x_{k-1}) \big) \\
& + \sum_{k=1}^{T} \big( d_z \log[2\pi] + \log[|R|] + (z_k - H x_k)^{\mathsf{T}} R^{-1} (z_k - H x_k) \big) \Big] \, dx_{1:T}
\end{aligned}
$$
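To make explicit how the expectations enter, the following minimal sketch (the array layout Ex, Exx, Exx1 is an assumed convention, not code from this work) evaluates (4.10) from the smoothed means $\langle x_k \rangle$, second moments $\langle x_k x_k^{\mathsf{T}} \rangle$ and cross moments $\langle x_k x_{k-1}^{\mathsf{T}} \rangle$ delivered by the smoother of section 3.1.3; every expected quadratic form reduces to a trace.

```python
import numpy as np

def bound(z, Ex, Exx, Exx1, F, H, Q, R, x0, S0):
    """Evaluate B in (4.10) from smoothed moments (hypothetical array layout).

    z    : (T, dz) observations
    Ex   : (T, dx)     smoothed means   <x_k>
    Exx  : (T, dx, dx) second moments   <x_k x_k^T>
    Exx1 : (T, dx, dx) cross moments    <x_k x_{k-1}^T>  (entry 0 unused)
    S0   : initial covariance Sigma_0; x0 is the initial mean
    """
    T, dz = z.shape
    dx = Ex.shape[1]
    Qi, Ri, S0i = np.linalg.inv(Q), np.linalg.inv(R), np.linalg.inv(S0)
    logdet = lambda A: np.linalg.slogdet(A)[1]

    # E[(x_1 - x_0)^T S0^{-1} (x_1 - x_0)] = tr(S0^{-1} M0)
    M0 = Exx[0] - np.outer(Ex[0], x0) - np.outer(x0, Ex[0]) + np.outer(x0, x0)
    total = dx * np.log(2 * np.pi) + logdet(S0) + np.trace(S0i @ M0)

    for k in range(1, T):        # transition terms, k = 2..T
        Mk = Exx[k] - F @ Exx1[k].T - Exx1[k] @ F.T + F @ Exx[k - 1] @ F.T
        total += dx * np.log(2 * np.pi) + logdet(Q) + np.trace(Qi @ Mk)

    for k in range(T):           # observation terms, k = 1..T
        Nk = (np.outer(z[k], z[k]) - H @ np.outer(Ex[k], z[k])
              - np.outer(z[k], Ex[k]) @ H.T + H @ Exx[k] @ H.T)
        total += dz * np.log(2 * np.pi) + logdet(R) + np.trace(Ri @ Nk)

    return -0.5 * total
```

Since EM never decreases this bound, the quantity is also a convenient convergence check.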

Beginning with the transition matrix F, the bound B can be differentiated with respect to all the components in $\theta$. Note that a capital $T$ denotes the length of the time series, while the superscript $^{\mathsf{T}}$ denotes the transpose

$$
\begin{aligned}
\frac{\partial B}{\partial F} &= \frac{\partial}{\partial F} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=2}^{T} (x_k - F x_{k-1})^{\mathsf{T}} Q^{-1} (x_k - F x_{k-1}) \, dx_{1:T} \Big] \\
&= -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=2}^{T} \big( 2 Q^{-1} F x_{k-1} x_{k-1}^{\mathsf{T}} - 2 Q^{-1} x_k x_{k-1}^{\mathsf{T}} \big) \, dx_{1:T} \\
&= -Q^{-1} F \sum_{k=2}^{T} \langle x_{k-1} x_{k-1}^{\mathsf{T}} \rangle + Q^{-1} \sum_{k=2}^{T} \langle x_k x_{k-1}^{\mathsf{T}} \rangle && (4.11)
\end{aligned}
$$

where the expectation of the transition $\langle x_k x_{k-1}^{\mathsf{T}} \rangle = \langle x_k x_{k-1}^{\mathsf{T}} \rangle_{p(x_{1:T}|z_{1:T})}$ is needed. Fortunately, this quantity was derived in section 3.1.3. A similar derivation holds for the observation matrix H.
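As a concrete reading of (4.11), the gradient can be assembled from the smoothed moments as in the following sketch (same assumed array layout as above):

```python
import numpy as np

def dB_dF(Exx, Exx1, F, Q):
    """Gradient (4.11): -Q^{-1} F sum_{k=2}^T <x_{k-1}x_{k-1}^T> + Q^{-1} sum_{k=2}^T <x_k x_{k-1}^T>."""
    T = Exx.shape[0]
    S_prev = sum(Exx[k - 1] for k in range(1, T))   # sum_{k=2}^T <x_{k-1} x_{k-1}^T>
    S_cross = sum(Exx1[k] for k in range(1, T))     # sum_{k=2}^T <x_k x_{k-1}^T>
    Qi = np.linalg.inv(Q)
    return -Qi @ F @ S_prev + Qi @ S_cross
```

Setting this expression to zero gives $F = \big(\sum_{k=2}^{T} \langle x_k x_{k-1}^{\mathsf{T}} \rangle\big)\big(\sum_{k=2}^{T} \langle x_{k-1} x_{k-1}^{\mathsf{T}} \rangle\big)^{-1}$, which is the EM update for F; the derivation for H that follows has the same structure.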

$$
\begin{aligned}
\frac{\partial B}{\partial H} &= \frac{\partial}{\partial H} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=1}^{T} (z_k - H x_k)^{\mathsf{T}} R^{-1} (z_k - H x_k) \, dx_{1:T} \Big] \\
&= -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=1}^{T} \big( 2 R^{-1} H x_k x_k^{\mathsf{T}} - 2 R^{-1} z_k x_k^{\mathsf{T}} \big) \, dx_{1:T} \\
&= -R^{-1} H \sum_{k=1}^{T} \langle x_k x_k^{\mathsf{T}} \rangle + R^{-1} \sum_{k=1}^{T} z_k \langle x_k \rangle^{\mathsf{T}} && (4.12)
\end{aligned}
$$

For the covariances R and Q a few extra calculations are needed, since these matrices are symmetric and positive definite. Differentiating with respect to a symmetric matrix is different from differentiating with respect to an unstructured matrix. When solving $\frac{\partial B}{\partial R} = 0$, as is done in the EM algorithm, there is no need to take the symmetry into account; however, if a gradient based minimization is used it is important to use the correct gradient.

The derivative with respect to a symmetric matrix A can be found by differentiating as usual, pretending that A has no structure, and then transforming the result by

$$
\frac{\partial B}{\partial A}\bigg|_{\mathrm{sym}} = \frac{\partial B}{\partial A} + \Big(\frac{\partial B}{\partial A}\Big)^{\mathsf{T}} - \mathrm{diag}\Big(\frac{\partial B}{\partial A}\Big)
$$

(Petersen, 2004).

However, this does not ensure that the matrix is symmetric and positive definite after a gradient step. Two different approaches handle this. The first is to keep A diagonal and optimize the standard deviation $\sigma$ instead of the variance $\sigma^2$. Another approach, which keeps the full covariance structure, is to write $A = A_0 A_0^{\mathsf{T}}$ and optimize $A_0$; this unfortunately introduces new degrees of freedom, since another factorization of A, $A_1 = A_0 U$ where U can be any rotation matrix, could just as well have been chosen. With this new choice of parameter the derivative changes, but fortunately not by much; it can be found by using the chain rule,

$$
\frac{\partial B}{\partial A_0} = \Big( \frac{\partial B}{\partial A} + \Big(\frac{\partial B}{\partial A}\Big)^{\mathsf{T}} \Big) A_0.
$$

Using the second method, the 'standard' derivative is needed

$$
\begin{aligned}
\frac{\partial B}{\partial R} &= \frac{\partial}{\partial R} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=1}^{T} \big( \log[|R|] + (z_k - H x_k)^{\mathsf{T}} R^{-1} (z_k - H x_k) \big) \, dx_{1:T} \Big] && (4.13) \\
&= -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=1}^{T} \big( R^{-\mathsf{T}} - R^{-\mathsf{T}} (z_k - H x_k)(z_k - H x_k)^{\mathsf{T}} R^{-\mathsf{T}} \big) \, dx_{1:T} \\
&= -\tfrac{1}{2} \Big( T R^{-\mathsf{T}} - R^{-\mathsf{T}} \Big( H \sum_{k=1}^{T} \langle x_k x_k^{\mathsf{T}} \rangle H^{\mathsf{T}} + \sum_{k=1}^{T} z_k z_k^{\mathsf{T}} - H \sum_{k=1}^{T} \langle x_k \rangle z_k^{\mathsf{T}} - \sum_{k=1}^{T} z_k \langle x_k \rangle^{\mathsf{T}} H^{\mathsf{T}} \Big) R^{-\mathsf{T}} \Big)
\end{aligned}
$$

In this work the second method, with a change of parameters, is employed.
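A sketch of what this change of parameters amounts to for R (same assumed array layout as above): the 'standard' gradient (4.13) is computed first while ignoring the symmetry, and the chain rule then maps it onto the factor $R_0$ of $R = R_0 R_0^{\mathsf{T}}$, so that any gradient step in $R_0$ keeps R symmetric and positive semidefinite.

```python
import numpy as np

def dB_dR(z, Ex, Exx, H, R):
    """'Standard' gradient (4.13), treating R as unstructured."""
    T = z.shape[0]
    Ri = np.linalg.inv(R)                      # R is symmetric, so R^{-T} = R^{-1}
    S = sum(np.outer(z[k], z[k]) - H @ np.outer(Ex[k], z[k])
            - np.outer(z[k], Ex[k]) @ H.T + H @ Exx[k] @ H.T
            for k in range(T))                 # sum_k E[(z_k - H x_k)(z_k - H x_k)^T]
    return -0.5 * (T * Ri - Ri @ S @ Ri)

def dB_dR0(z, Ex, Exx, H, R0):
    """Chain rule for R = R0 R0^T:  dB/dR0 = (dB/dR + (dB/dR)^T) R0."""
    G = dB_dR(z, Ex, Exx, H, R0 @ R0.T)
    return (G + G.T) @ R0
```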

For the derivative of B with respect to Q the same considerations as for R apply, since Q is also symmetric

$$
\begin{aligned}
\frac{\partial B}{\partial Q} &= \frac{\partial}{\partial Q} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=2}^{T} \big( \log[|Q|] + (x_k - F x_{k-1})^{\mathsf{T}} Q^{-1} (x_k - F x_{k-1}) \big) \, dx_{1:T} \Big] && (4.14) \\
&= -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \sum_{k=2}^{T} \big( Q^{-1} - Q^{-1} (x_k - F x_{k-1})(x_k - F x_{k-1})^{\mathsf{T}} Q^{-1} \big) \, dx_{1:T} \\
&= -\tfrac{1}{2} \Big( (T-1) Q^{-1} - Q^{-1} \Big( F \sum_{k=2}^{T} \langle x_{k-1} x_{k-1}^{\mathsf{T}} \rangle F^{\mathsf{T}} + \sum_{k=2}^{T} \langle x_k x_k^{\mathsf{T}} \rangle - F \sum_{k=2}^{T} \langle x_{k-1} x_k^{\mathsf{T}} \rangle - \sum_{k=2}^{T} \langle x_k x_{k-1}^{\mathsf{T}} \rangle F^{\mathsf{T}} \Big) Q^{-1} \Big) && (4.15)
\end{aligned}
$$

where the sum $\sum_{k=2}^{T} \langle x_{k-1} x_{k-1}^{\mathsf{T}} \rangle = \sum_{k=1}^{T-1} \langle x_k x_k^{\mathsf{T}} \rangle \neq \sum_{k=2}^{T} \langle x_k x_k^{\mathsf{T}} \rangle$ for finite length sequences.
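The corresponding sketch for (4.15) (same assumed array layout) makes this index bookkeeping explicit; the sum over $\langle x_{k-1} x_{k-1}^{\mathsf{T}} \rangle$ is shifted by one step relative to the sum over $\langle x_k x_k^{\mathsf{T}} \rangle$:

```python
import numpy as np

def dB_dQ(Exx, Exx1, F, Q):
    """'Standard' gradient (4.15), treating Q as unstructured."""
    T = Exx.shape[0]
    Qi = np.linalg.inv(Q)
    S_prev  = sum(Exx[k - 1] for k in range(1, T))  # sum_{k=2}^T <x_{k-1} x_{k-1}^T>
    S_curr  = sum(Exx[k]     for k in range(1, T))  # sum_{k=2}^T <x_k x_k^T>  (note the shift)
    S_cross = sum(Exx1[k]    for k in range(1, T))  # sum_{k=2}^T <x_k x_{k-1}^T>
    M = F @ S_prev @ F.T + S_curr - F @ S_cross.T - S_cross @ F.T
    return -0.5 * ((T - 1) * Qi - Qi @ M @ Qi)
```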

The mean and the covariance of the initial condition can also be found; their estimates are, however, more uncertain since they rely only on the beginning of the sequence

$$
\begin{aligned}
\frac{\partial B}{\partial x_0} &= \frac{\partial}{\partial x_0} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \, (x_1 - x_0)^{\mathsf{T}} \Sigma_0^{-1} (x_1 - x_0) \, dx_{1:T} \Big] \\
&= \tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \, 2\, \Sigma_0^{-1} (x_1 - x_0) \, dx_{1:T} \\
&= \Sigma_0^{-1} \big( \langle x_1 \rangle - x_0 \big) && (4.16)
\end{aligned}
$$

where $\langle x_1 \rangle = \langle x_1 \rangle_{p(x_{1:T}|z_{1:T})}$ is the expectation of the first hidden variable over the posterior distribution.

$$
\begin{aligned}
\frac{\partial B}{\partial \Sigma_0} &= \frac{\partial}{\partial \Sigma_0} \Big[ -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \big( d_x \log[2\pi] + \log[|\Sigma_0|] + (x_1 - x_0)^{\mathsf{T}} \Sigma_0^{-1} (x_1 - x_0) \big) \, dx_{1:T} \Big] && (4.17) \\
&= -\tfrac{1}{2} \int p(x_{1:T}|z_{1:T}) \big( \Sigma_0^{-1} - \Sigma_0^{-1} (x_1 - x_0)(x_1 - x_0)^{\mathsf{T}} \Sigma_0^{-1} \big) \, dx_{1:T} \\
&= -\tfrac{1}{2} \Big( \Sigma_0^{-1} - \Sigma_0^{-1} \big( \langle x_1 x_1^{\mathsf{T}} \rangle - \langle x_1 \rangle x_0^{\mathsf{T}} - x_0 \langle x_1 \rangle^{\mathsf{T}} + x_0 x_0^{\mathsf{T}} \big) \Sigma_0^{-1} \Big)
\end{aligned}
$$

with $\langle x_1 x_1^{\mathsf{T}} \rangle = \langle x_1 x_1^{\mathsf{T}} \rangle_{p(x_{1:T}|z_{1:T})}$.

As mentioned earlier, the gradients can be used either in the EM algorithm, where the exact updates are obtained by setting the gradients equal to zero, or in a gradient based optimizer. In the following section, the two methods are compared on three different problems.
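For reference, setting the gradients (4.11) to (4.17) to zero yields the closed-form M-step of the EM algorithm; a minimal sketch of these updates (same assumed array layout as above, and without the identifiability restrictions discussed in section 4.4.1) could look as follows:

```python
import numpy as np

def m_step(z, Ex, Exx, Exx1):
    """Closed-form parameter updates obtained by setting the gradients to zero."""
    T = z.shape[0]
    S_prev  = sum(Exx[k - 1] for k in range(1, T))       # sum_{k=2}^T <x_{k-1} x_{k-1}^T>
    S_curr  = sum(Exx[k]     for k in range(1, T))       # sum_{k=2}^T <x_k x_k^T>
    S_cross = sum(Exx1[k]    for k in range(1, T))       # sum_{k=2}^T <x_k x_{k-1}^T>
    Sxx = sum(Exx[k]                for k in range(T))   # sum_{k=1}^T <x_k x_k^T>
    Szx = sum(np.outer(z[k], Ex[k]) for k in range(T))   # sum_{k=1}^T z_k <x_k>^T
    Szz = sum(np.outer(z[k], z[k])  for k in range(T))   # sum_{k=1}^T z_k z_k^T

    F = S_cross @ np.linalg.inv(S_prev)                  # from (4.11) = 0
    H = Szx @ np.linalg.inv(Sxx)                         # from (4.12) = 0
    Q = (S_curr - F @ S_cross.T) / (T - 1)               # from (4.15) = 0
    R = (Szz - H @ Szx.T) / T                            # from (4.13) = 0
    x0 = Ex[0]                                           # from (4.16) = 0
    S0 = Exx[0] - np.outer(Ex[0], Ex[0])                 # from (4.17) = 0
    return F, H, Q, R, x0, S0
```

In the gradient based alternative the same gradient expressions are instead handed to a general purpose optimizer, together with the reparametrizations of Q and R described above.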