
3.2 Variational Bayesian Factor Analysis

Figure 3.2: Graphical model for variational Bayesian factor analysis.

The model proposed for variational Bayesian factor analysis (VBFA) can be seen in figure 3.2. Compared to MLFA we have now placed distributions on the factor loading matrix and the noise covariance. The noise covariance matrix Ψ in MLFA is replaced by its inverse, the noise precision matrix ϕ = Ψ^{-1}. In order to do automatic latent dimensionality determination, a hierarchical prior is used on the factor loading matrix A. The α node serves as a regularization parameter that can effectively shut off unneeded factors.

The priors in the model are defined as

$$
p(A \mid \alpha) = \prod_{k=1}^{K} \mathcal{N}\!\left(a_k \mid 0,\, \alpha_k^{-1} I\right) \tag{3.15}
$$

$$
p(\alpha \mid a_\alpha, b_\alpha) = \prod_{k=1}^{K} \mathcal{G}\!\left(\alpha_k \mid a_{\alpha_k}, b_{\alpha_k}\right) \tag{3.16}
$$

$$
p(\phi \mid a_\phi, b_\phi) = \prod_{j=1}^{p} \mathcal{G}\!\left(\phi_j \mid a_{\phi_j}, b_{\phi_j}\right) \tag{3.17}
$$

where a_k denotes a column vector that corresponds to the kth column of A. We have an isotropic Gaussian on each column of A, where the hyperparameter α_k controls the precision. Since each column of A is given zero mean, if α_k goes to infinity the variance of the kth column goes to zero, and we effectively shut down the kth factor. One can think of this as an ARD prior that can switch off unneeded factors, where the factors are the inputs to the system. Both α and ϕ are precisions, so they are given factorized Gamma priors. Remember that ϕ is a diagonal matrix, and the prior is therefore only specified for the diagonal elements.
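As a concrete illustration, the following NumPy sketch draws one set of parameters from the priors (3.15)–(3.17). The dimensions p, K and the hyperparameter values a0 = b0 = 1 are arbitrary choices for the illustration, not the settings used elsewhere in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, a0, b0 = 10, 6, 1.0, 1.0   # illustrative sizes and Gamma hyperparameters

alpha = rng.gamma(shape=a0, scale=1.0 / b0, size=K)   # column precisions, cf. (3.16)
phi = rng.gamma(shape=a0, scale=1.0 / b0, size=p)     # noise precisions, cf. (3.17)

# Each column a_k ~ N(0, alpha_k^{-1} I); a large alpha_k forces the whole
# column towards zero, which is how the ARD prior shuts off unneeded factors.
A = rng.normal(size=(p, K)) / np.sqrt(alpha)          # cf. (3.15); broadcasts over columns
```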

The log marginal likelihood for the model is lower bounded as

$$
\mathcal{L} = \ln p(X) \;\geq\; \int q(S, A, \alpha, \phi)\, \ln \frac{p(X, S, A, \alpha, \phi)}{q(S, A, \alpha, \phi)}\, dS\, dA\, d\alpha\, d\phi \;=\; F_m \tag{3.18}
$$

where q(S, A, α, ϕ) is our approximation to the true posterior p(S, A, α, ϕ | X), which is not analytically tractable. To derive a VBEM algorithm we need to assume that the hidden variables and the parameters are independent, q(S, A, α, ϕ) ≈ q(S) q(A, α, ϕ). Furthermore we approximate q(A, α, ϕ) ≈ q(A) q(α) q(ϕ), so the resulting variational posterior is fully factorized. This further approximation significantly eases the derivation, since all variational posteriors then take the same form as the conjugate priors.

The complete log likelihood for the model is given by

$$
\mathcal{L}(\theta, S) = \frac{N}{2}\ln|\phi| - \frac{1}{2}\sum_{i=1}^{N}(x_i - A s_i)^{\top}\phi\,(x_i - A s_i) - \frac{1}{2}\sum_{i=1}^{N} s_i^{\top} s_i + \ln p(A \mid \alpha) + \ln p(\alpha) + \ln p(\phi) + \text{const} \tag{3.19}
$$

To derive the VBE-step we proceed by writing out the expected complete log likelihood ⟨L(θ,S)⟩_{q(θ)} with respect to the parameter posteriors and neglect terms not depending on the hidden variables,

$$
\langle \mathcal{L}(\theta, S)\rangle_{q(\theta)} = -\frac{1}{2}\sum_{i=1}^{N}\left[\, s_i^{\top}\!\left(I + \langle A^{\top}\phi A\rangle\right) s_i - 2\, s_i^{\top}\langle A\rangle^{\top}\langle\phi\rangle\, x_i \,\right] + \text{const} \tag{3.20}
$$

from which we can infer that q(S) is of the form

$$
q(S) = \prod_{i=1}^{N}\mathcal{N}\!\left(s_i \mid m_s^{(i)}, \Sigma_s\right) \tag{3.21}
$$

where

$$
\Sigma_s = \left(I + \langle A^{\top}\phi A\rangle\right)^{-1} \tag{3.22}
$$

$$
m_s^{(i)} = \Sigma_s\, \langle A\rangle^{\top}\langle\phi\rangle\, x_i \tag{3.23}
$$

To compute the expectation ⟨A^⊤ϕA⟩ we use the fact that ϕ is diagonal,

$$
\langle A^{\top}\phi A\rangle = \sum_{j=1}^{p}\langle\phi_j\rangle\,\langle a_j a_j^{\top}\rangle \tag{3.24}
$$

where a_j is a column vector corresponding to the jth row of A. Note that the VBE-step is equivalent to the E-step in MLFA, but with expected values for the parameters. This concludes the VBE-step.
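A minimal sketch of how the VBE-step could be implemented, assuming the moments of the parameter posteriors are already available. The function and argument names (vbe_step, EA for ⟨A⟩, Cov_A for the row covariances Σ_a^(j), Ephi for ⟨ϕ⟩) are hypothetical.

```python
import numpy as np

def vbe_step(X, EA, Cov_A, Ephi):
    """One VBE-step. X: (N, p) data, EA: (p, K) rows <a_j>,
    Cov_A: (p, K, K) row covariances Sigma_a^(j), Ephi: (p,) <phi_j>."""
    p, K = EA.shape
    # <A' phi A> = sum_j <phi_j> <a_j a_j'>, with <a_j a_j'> = Sigma_a^(j) + m_a^(j) m_a^(j)'  (3.24)
    EAtphiA = sum(Ephi[j] * (Cov_A[j] + np.outer(EA[j], EA[j])) for j in range(p))
    Sigma_s = np.linalg.inv(np.eye(K) + EAtphiA)   # shared posterior covariance of every s_i (3.22)
    Ms = X @ (Ephi[:, None] * EA) @ Sigma_s        # row i is m_s^(i) = Sigma_s <A>' <phi> x_i (3.23)
    return Ms, Sigma_s
```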

The VBM-step

To derive the VBM-step we need to write the expected complete log likelihood ⟨L(θ,S)⟩_{q(θ_{j≠i}) q(S)} with respect to all other variational posteriors. The derivations follow below.

The q(A) distribution

In order to estimate q(A) we start from equation (3.19) and retain only the terms that contain A. To infer the posterior we furthermore have to write the expected complete likelihood as a sum over the rows of A. Let a_j denote a column vector that corresponds to the jth row of A and let α̃ = diag[α], then

$$
\langle \mathcal{L}(\theta, S)\rangle_{q(\alpha)q(\phi)q(S)} = -\frac{1}{2}\sum_{j=1}^{p}\left[\, a_j^{\top}\!\left(\langle\tilde{\alpha}\rangle + \langle\phi_j\rangle\sum_{i=1}^{N}\langle s_i s_i^{\top}\rangle\right) a_j - 2\,\langle\phi_j\rangle\, a_j^{\top}\sum_{i=1}^{N}\langle s_i\rangle\, x_{ji} \,\right] + \text{const}
$$

from which we can infer that q(A) is of the form

$$
q(A) = \prod_{j=1}^{p}\mathcal{N}\!\left(a_j \mid m_a^{(j)}, \Sigma_a^{(j)}\right)
$$

where

$$
\Sigma_a^{(j)} = \left(\langle\tilde{\alpha}\rangle + \langle\phi_j\rangle\sum_{i=1}^{N}\langle s_i s_i^{\top}\rangle\right)^{-1}, \qquad
m_a^{(j)} = \langle\phi_j\rangle\, \Sigma_a^{(j)} \sum_{i=1}^{N}\langle s_i\rangle\, x_{ji}
$$
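A corresponding sketch of the q(A) update, assuming the VBE-step above has produced Ms (rows ⟨s_i⟩) and Ss = Σ_i ⟨s_i s_i^⊤⟩ = Ms^⊤Ms + N Σ_s; again the names are illustrative.

```python
import numpy as np

def update_qA(X, Ms, Ss, Ephi, Ealpha):
    """q(A) update. Ms: (N, K) rows <s_i>, Ss: (K, K) sum_i <s_i s_i'>,
    Ephi: (p,) <phi_j>, Ealpha: (K,) <alpha_k>."""
    p = X.shape[1]
    K = Ms.shape[1]
    MA = np.zeros((p, K))
    Cov_A = np.zeros((p, K, K))
    for j in range(p):
        # Sigma_a^(j) = ( diag(<alpha>) + <phi_j> sum_i <s_i s_i'> )^{-1}
        Cov_A[j] = np.linalg.inv(np.diag(Ealpha) + Ephi[j] * Ss)
        # m_a^(j) = <phi_j> Sigma_a^(j) sum_i <s_i> x_ji
        MA[j] = Ephi[j] * (Cov_A[j] @ (Ms.T @ X[:, j]))
    return MA, Cov_A
```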


The q(α) distribution

In order to estimate q(α) we start again from equation (3.19) and retain only the terms that contain α. The kth column of A is denoted by a_k, giving

$$
\langle \mathcal{L}(\theta, S)\rangle_{q(A)q(\phi)q(S)} = \sum_{k=1}^{K}\left[\, \frac{p}{2}\ln\alpha_k - \frac{\alpha_k}{2}\langle a_k^{\top} a_k\rangle + (a_{\alpha_k} - 1)\ln\alpha_k - b_{\alpha_k}\alpha_k \,\right] + \text{const}
$$

from which we can infer that q(α) is of the form

$$
q(\alpha) = \prod_{k=1}^{K}\mathcal{G}\!\left(\alpha_k \mid \hat{a}_{\alpha_k}, \hat{b}_{\alpha_k}\right), \qquad
\hat{a}_{\alpha_k} = a_{\alpha_k} + \frac{p}{2}, \qquad
\hat{b}_{\alpha_k} = b_{\alpha_k} + \frac{1}{2}\langle a_k^{\top} a_k\rangle
$$

We can compute the expectation ⟨a_k^⊤ a_k⟩ by

$$
\langle a_k^{\top} a_k\rangle = \sum_{j=1}^{p}\left(\left[\Sigma_a^{(j)}\right]_{kk} + \left[m_a^{(j)}\right]_k^2\right)
$$
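A sketch of the q(α) update from the q(A) moments; a_alpha and b_alpha stand for the prior parameters a_{α_k}, b_{α_k} (taken here to be shared across k for simplicity), and the helper name is hypothetical.

```python
import numpy as np

def update_qalpha(MA, Cov_A, a_alpha, b_alpha):
    """q(alpha) update from the q(A) moments MA (p, K) and Cov_A (p, K, K)."""
    p, K = MA.shape
    # <a_k' a_k> = sum_j ( [Sigma_a^(j)]_kk + [m_a^(j)]_k^2 )
    E_aa = Cov_A.diagonal(axis1=1, axis2=2).sum(axis=0) + (MA ** 2).sum(axis=0)
    a_hat = a_alpha + 0.5 * p           # same shape parameter for every column k
    b_hat = b_alpha + 0.5 * E_aa        # one rate per column k
    return a_hat, b_hat, a_hat / b_hat  # <alpha_k> = a_hat / b_hat
```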

The q(ϕ) distribution

In order to estimate q(ϕ) we start again from equation (3.19) and retain only the terms that contain ϕ,

$$
\langle \mathcal{L}(\theta, S)\rangle_{q(A)q(\alpha)q(S)} = \sum_{j=1}^{p}\left[\, \frac{N}{2}\ln\phi_j - \frac{\phi_j}{2}\sum_{i=1}^{N}\left\langle (x_{ji} - a_j^{\top} s_i)^2\right\rangle + (a_{\phi_j} - 1)\ln\phi_j - b_{\phi_j}\phi_j \,\right] + \text{const}
$$

from which we can infer that q(ϕ) is of the form

$$
q(\phi) = \prod_{j=1}^{p}\mathcal{G}\!\left(\phi_j \mid \hat{a}_{\phi_j}, \hat{b}_{\phi_j}\right), \qquad
\hat{a}_{\phi_j} = a_{\phi_j} + \frac{N}{2}, \qquad
\hat{b}_{\phi_j} = b_{\phi_j} + \frac{1}{2}\sum_{i=1}^{N}\left\langle (x_{ji} - a_j^{\top} s_i)^2\right\rangle
$$

We can compute the expectation ⟨(x_{ji} − a_j^⊤ s_i)²⟩ by

$$
\left\langle (x_{ji} - a_j^{\top} s_i)^2\right\rangle = x_{ji}^2 - 2\, x_{ji}\, \langle a_j\rangle^{\top}\langle s_i\rangle + \operatorname{tr}\!\left[\langle a_j a_j^{\top}\rangle\langle s_i s_i^{\top}\rangle\right]
$$

This concludes the VBM-step.
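A sketch of the q(ϕ) update using the expectation above; as before, the helper name and argument names are illustrative.

```python
import numpy as np

def update_qphi(X, Ms, Sigma_s, MA, Cov_A, a_phi, b_phi):
    """q(phi) update from the q(S) and q(A) moments defined in the sketches above."""
    N, p = X.shape
    Ss = Ms.T @ Ms + N * Sigma_s                   # sum_i <s_i s_i'>
    resid = np.empty(p)
    for j in range(p):
        Eaa = Cov_A[j] + np.outer(MA[j], MA[j])    # <a_j a_j'>
        # sum_i <(x_ji - a_j' s_i)^2> = sum_i x_ji^2 - 2 sum_i x_ji <a_j>'<s_i> + tr(<a_j a_j'> Ss)
        resid[j] = (X[:, j] ** 2).sum() - 2.0 * X[:, j] @ (Ms @ MA[j]) + np.trace(Eaa @ Ss)
    a_hat = a_phi + 0.5 * N
    b_hat = b_phi + 0.5 * resid
    return a_hat, b_hat, a_hat / b_hat             # <phi_j> = a_hat / b_hat
```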

Hyperparameter Optimization

The model has hyperparameters {a_{ϕ_j}, b_{ϕ_j}} and hyper-hyperparameters {a_{α_k}, b_{α_k}}. We can optimize these by writing the expected complete log likelihood as a function of only these parameters, differentiating and setting to zero. Writing the expected complete log likelihood as a function of {a_{ϕ_j}, b_{ϕ_j}} and neglecting constant terms we get

$$
\langle \mathcal{L}(a_{\phi_j}, b_{\phi_j})\rangle_{q(\theta)q(S)} = a_{\phi_j}\ln b_{\phi_j} - \ln\Gamma(a_{\phi_j}) + (a_{\phi_j} - 1)\langle\ln\phi_j\rangle - b_{\phi_j}\langle\phi_j\rangle
$$

Differentiating with respect to b_{ϕ_j} and a_{ϕ_j} and setting to zero yields the fixed point equations

$$
b_{\phi_j} = \frac{a_{\phi_j}}{\langle\phi_j\rangle} \tag{3.41}
$$

$$
\ln b_{\phi_j} - \psi(a_{\phi_j}) + \langle\ln\phi_j\rangle = 0
$$

where ψ(x) = ∂/∂x ln Γ(x) is the digamma function and ⟨ln ϕ_j⟩ = ψ(â_{ϕ_j}) − ln b̂_{ϕ_j}. Since the priors for α and ϕ are of the same type, the fixed point equations for {a_{α_k}, b_{α_k}} are identical.

We can solve the fixed point equations by Newton-Raphson, but we must assure that the Newton-Raphson iterations yield a solution where both a_{ϕ_j} and b_{ϕ_j} remain positive. We can overcome this by performing the iterations in the exponential domain; in this way subtraction becomes multiplication and we are assured that we find a solution with positive values. Substituting (3.41) into the condition for a_{ϕ_j}, the iterations are given by

$$
a_{\phi_j,\text{new}} = a_{\phi_j}\exp\!\left(-\,\frac{\ln a_{\phi_j} - \ln\langle\phi_j\rangle - \psi(a_{\phi_j}) + \langle\ln\phi_j\rangle}{1 - a_{\phi_j}\,\psi'(a_{\phi_j})}\right) \tag{3.42}
$$

where ψ′(·) is the derivative of the digamma function. When a_{ϕ_j,new} is estimated we can simply insert it in equation (3.41) to find the update for b_{ϕ_j,new}. If we propose that the model should have equivalent priors, meaning that all a_{ϕ_j} = a_{ϕ_{j′}} and b_{ϕ_j} = b_{ϕ_{j′}}, we simply replace ⟨ln ϕ_j⟩ by (1/p) Σ_{j=1}^{p} ⟨ln ϕ_j⟩ and ⟨ϕ_j⟩ by (1/p) Σ_{j=1}^{p} ⟨ϕ_j⟩ in equation (3.42). In chapter 4 the effect of hyperparameter optimization is investigated for both types of priors.
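One way to implement the exponential-domain Newton iteration sketched above, assuming the moments ⟨ϕ_j⟩ and ⟨ln ϕ_j⟩ are available from q(ϕ); for equivalent priors the averaged moments can be passed instead. The function name and the fixed iteration count are illustrative choices.

```python
import numpy as np
from scipy.special import digamma, polygamma

def update_hyper(E_phi, E_lnphi, a_init=1.0, n_iter=20):
    """Solve the fixed point equations for (a, b) of one Gamma prior by
    Newton-Raphson on ln(a), so that a and b = a / <phi> stay positive."""
    a = a_init
    for _ in range(n_iter):
        # stationarity condition in a after substituting b = a / <phi>
        g = np.log(a) - np.log(E_phi) - digamma(a) + E_lnphi
        # derivative of g with respect to ln(a): 1 - a * psi'(a) (always negative)
        h = 1.0 - a * polygamma(1, a)
        a = a * np.exp(-g / h)        # Newton step in the exponential domain, cf. (3.42)
    b = a / E_phi                     # fixed point for b, cf. (3.41)
    return a, b
```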

Calculation of Fm

We need to calculate the lower bound Fm if we want to use it as a guide for model selection and to monitor convergence. The functional can be written

$$
F_m = \left\langle \ln p(X \mid S, A, \alpha, \phi)\right\rangle_{q(S)q(A)q(\alpha)q(\phi)}
- \operatorname{KL}\!\left(q(S)\,\|\,p(S)\right)
- \left\langle \operatorname{KL}\!\left(q(A)\,\|\,p(A \mid \alpha)\right)\right\rangle_{q(\alpha)}
- \operatorname{KL}\!\left(q(\alpha)\,\|\,p(\alpha \mid a_\alpha, b_\alpha)\right)
- \operatorname{KL}\!\left(q(\phi)\,\|\,p(\phi \mid a_\phi, b_\phi)\right) \tag{3.43}
$$

where the individual terms can be calculated by

$$
\left\langle \ln p(X \mid S, A, \alpha, \phi)\right\rangle_{q(S)q(A)q(\alpha)q(\phi)}
= -\frac{Np}{2}\ln 2\pi + \frac{N}{2}\sum_{j=1}^{p}\left(\psi(\hat{a}_{\phi_j}) - \ln\hat{b}_{\phi_j}\right)
- \frac{1}{2}\sum_{j=1}^{p}\langle\phi_j\rangle\sum_{i=1}^{N}\left\langle (x_{ji} - a_j^{\top} s_i)^2\right\rangle
$$

$$
\left\langle \operatorname{KL}\!\left(q(A)\,\|\,p(A \mid \alpha)\right)\right\rangle_{q(\alpha)}
= -\frac{p}{2}\sum_{k=1}^{K}\left(\psi(\hat{a}_{\alpha_k}) - \ln\hat{b}_{\alpha_k}\right)
- \frac{1}{2}\sum_{j=1}^{p}\left(\ln\left|\Sigma_a^{(j)}\right| + \operatorname{tr}\!\left[\, I - \left(\Sigma_a^{(j)} + m_a^{(j)} m_a^{(j)\top}\right)\langle\tilde{\alpha}\rangle\right]\right) \tag{3.46}
$$

$$
\operatorname{KL}\!\left(q(\alpha)\,\|\,p(\alpha \mid a_\alpha, b_\alpha)\right)
= \sum_{k=1}^{K}\left(\hat{a}_{\alpha_k}\ln\hat{b}_{\alpha_k} - a_{\alpha_k}\ln b_{\alpha_k} - \ln\frac{\Gamma(\hat{a}_{\alpha_k})}{\Gamma(a_{\alpha_k})}
+ b_{\alpha_k}\langle\alpha_k\rangle - \hat{a}_{\alpha_k} + (\hat{a}_{\alpha_k} - a_{\alpha_k})\left(\psi(\hat{a}_{\alpha_k}) - \ln\hat{b}_{\alpha_k}\right)\right) \tag{3.47}
$$

$$
\operatorname{KL}\!\left(q(\phi)\,\|\,p(\phi \mid a_\phi, b_\phi)\right)
= \sum_{j=1}^{p}\left(\hat{a}_{\phi_j}\ln\hat{b}_{\phi_j} - a_{\phi_j}\ln b_{\phi_j} - \ln\frac{\Gamma(\hat{a}_{\phi_j})}{\Gamma(a_{\phi_j})}
+ b_{\phi_j}\langle\phi_j\rangle - \hat{a}_{\phi_j} + (\hat{a}_{\phi_j} - a_{\phi_j})\left(\psi(\hat{a}_{\phi_j}) - \ln\hat{b}_{\phi_j}\right)\right) \tag{3.48}
$$
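Since (3.47) and (3.48) share the same form, a single helper can evaluate both KL terms. The sketch below assumes rate-parametrized Gamma distributions with vector-valued parameters; the function name is illustrative.

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_kl(a_hat, b_hat, a0, b0):
    """KL( Gamma(a_hat, b_hat) || Gamma(a0, b0) ), rate parametrization,
    summed over components; this is the summand of (3.47) / (3.48)."""
    kl = (a_hat * np.log(b_hat) - a0 * np.log(b0)
          - (gammaln(a_hat) - gammaln(a0))
          + b0 * (a_hat / b_hat)      # b <x> with <x> = a_hat / b_hat
          - a_hat
          + (a_hat - a0) * (digamma(a_hat) - np.log(b_hat)))
    return float(np.sum(kl))
```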


Figure 3.3: Upper left shows the reference factor loading matrix A and noise precisions Φ. Bottom left shows the learnt factor loading matrix A_vb and noise precisions Φ_vb. Upper right shows the log of the change in F_m, log[∂F/∂t], during 10³ iterations; plotting F_m in this way makes it clearer when the bound increases. Bottom right shows the inverse precisions (variances) 1/α on the columns of A. Notice that F_m increases faster when a column is on its way to being shut down.

VBFA in Action

To demonstrate the model I created a reference factor analysis model with precisions drawn from Gamma distributions and a factor loading matrix drawn from a Gaussian distribution controlled by the precisions. From this model I generated a p = 10 dimensional training set of size N = 10³. The reference latent dimensionality was K = 3. No hyperparameter optimization was invoked. The hyperparameters were set as a_{α_k} = b_{α_k} = a_{ϕ_j} = b_{ϕ_j} = 10⁻³, corresponding to non-informative priors for all practical purposes. The algorithm was run for 10³ iterations with a maximum dimensionality of k_max = 6 and a random initialization. Figure 3.3 summarizes the results. The learnt latent dimensionality is K = 3, since only three columns remain. Notice also that the columns in the learnt factor loading matrix resemble the reference model (no rotation indeterminacy). In chapter 4 we will discuss this further.
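A minimal sketch of how such a reference data set could be generated; the exact Gamma and Gaussian parameters and the random seed are not specified in the text, so the values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, N = 10, 3, 1000                                   # dimensions from the text

phi = rng.gamma(shape=2.0, scale=1.0, size=p)           # reference noise precisions
A = rng.normal(size=(p, K))                             # reference factor loadings
S = rng.normal(size=(N, K))                             # latent factors, s_i ~ N(0, I)
X = S @ A.T + rng.normal(size=(N, p)) / np.sqrt(phi)    # x_i = A s_i + noise, noise_j ~ N(0, phi_j^{-1})
```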

3.3 Maximum Likelihood Extended Factor