PCA, ICA and beyond
Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven
Ole Winther
Technical University of Denmark (DTU) & University of Copenhagen (KU)
August 18, 2009
• Motivation – multivariate data
• Principal component analysis (PCA) classic
• Model based approach – probabilistic PCA (pPCA)
• Identifiability – independent component analysis (ICA)
• InfoMax, smoothness and beyond.
Motivation – multivariate data
• Data is often (but not always) represented as a matrix of d features and N samples:
size(X) = [d N]
• In statistics d = p, N = n, and the data matrix is transposed, X → X^T.
• Collaborative filtering:
X = item-user matrix
• Gene expression:
X = gene-tissue matrix
• Text analysis:
X = term-document matrix
Collaborative filtering
• Netflix - online movie rental (DVDs).
• Collaborative filtering – predict user rating from past behavior of user.
• Improve Netflix's own system by 10% to win.
• training.txt – R = 10^8 ratings, scale 1 to 5, for d = 17,770 movies and N = 480,189 users.
• qualifying.txt – 2,817,131 movie-user pairs; (continuous) predictions submitted to Netflix, which returns an RMSE.
• Rating matrix X is mostly missing values, 98.5%.
Collaborative filtering task
• Relatively large data set – 10^8 data points
• Very heterogeneous – viewers and movies with few ratings
• Ratings ∈ {1, 2, 3, 4, 5} are noisy (subjective use of scale, non-stationary, ...)
• Complex model needed to capture latent structure
• Regularization!
Collaborative filtering
• Netflix prize - some key performance numbers
Method RMSE % Improv.
Cinematch 0.9514 0%
“PCA” ∼0.89-0.92 ∼5-6%
Grand prize 0.8563 10%
• RMSE = root mean squared error
• Two teams (Ensemble and BellKor’s Pragmatic Chaos) are above 10%
• but the prize has not been handed out yet (as of Aug 2009).
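The competition metric is easy to compute; a minimal sketch (function and variable names are illustrative, not Netflix's scoring code):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Continuous predictions may be clipped to the 1-5 scale before scoring.
print(rmse([3.7, 2.1, 4.9], [4, 2, 5]))
```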
Gene Expression
DNA micro-array: d ∼ 6000 genes and N ∼ 60 cancer tissues.
Gene Expression
Protein signalling network (textbook example) – Sachs et al., Science, 2005.
Gene Expression
• Single cell flow cytometry measurements of 11 phosphorylated proteins and phospholipids.
• Data was generated from a series of stimulatory cues and inhibitory interventions.
• Observational data: 1755 general stimulatory conditions,
• Experimental data, ∼80% of the data set, not used in our approach.
• Not “small n, large p”!
Latent semantic analysis (LSA)
Bag of words representation – term-document matrix
Principal component analysis
• Principal components (PCs): the orthogonal directions with the most variance.
• Empirical covariance of (centered) data:
S = \frac{1}{N} X X^T
• size(S) = [ d d ]
• PCs: eigenvectors of S:
S u_i = \lambda_i u_i
[Figure: 2D data scatter with principal axes \sqrt{\lambda_i}\, u_i.]
Principal component analysis
[Figure: typical steps in PCA.]
PCA – maximum variance formulation
• Project data {x_n}_{n=1,...,N} onto directions {u_i}_{i=1,...,M}.
• We find the directions sequentially, u_1 first.
• Mean of projected data: u_1^T \bar{x}, with
\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n .
• Variance of projected data:
\frac{1}{N} \sum_{n=1}^{N} \left\{ u_1^T (x_n - \bar{x}) \right\}^2 = u_1^T S u_1 .
PCA – maximum variance formulation
• Maximize the variance u_1^T S u_1 with respect to u_1.
• But we need a constraint to avoid u_1 → ∞:
u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)
• \lambda_1 is a Lagrange multiplier.
• Solution – an eigenvalue problem:
S u_1 = \lambda_1 u_1 .
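As a concrete illustration, a minimal numpy sketch of this recipe (center, form S, take the leading eigenvectors); all names are illustrative:

```python
import numpy as np

def pca_eig(X, M):
    """PCA on a d x N data matrix X: return the top-M eigenvectors and eigenvalues of S."""
    d, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
    S = Xc @ Xc.T / N                        # empirical covariance, d x d
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]            # sort descending
    return U[:, order[:M]], lam[order[:M]]

# u_1 maximizes u^T S u subject to ||u|| = 1, i.e. the first principal direction.
X = np.random.randn(5, 200)
U, lam = pca_eig(X, M=2)
```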
PCA – minimum error reconstruction
[Figure: data point x_n and its projection \tilde{x}_n onto the direction u_1 in the (x_1, x_2) plane.]
Find the best reconstructing orthonormal directions {u_i}.
PCA – minimum error reconstruction
• Orthonormal directions {u_i}: u_i^T u_j = \delta_{ij}.
• Lower-dimensional subspace:
\tilde{x}_n = \sum_{i=1}^{M} \alpha_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i
• Minimize
J(\{u_i\}) = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \ldots = \sum_{i=M+1}^{D} u_i^T S u_i = \sum_{i=M+1}^{D} \lambda_i .
PCA – minimum error reconstruction
• Database of N images, d = 28 × 28 = 784 pixel values
• Mean and first four PCs
[Figure: the mean image and the first four principal components.]
• Reconstruction
\tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i = \bar{x} + \sum_{i=1}^{M} \left[ (x_n - \bar{x})^T u_i \right] u_i
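A minimal numpy sketch of this reconstruction (the 784-pixel size is just a stand-in for the digit data):

```python
import numpy as np

def pca_reconstruct(X, M):
    """Reconstruct a d x N data matrix X from its mean plus the top-M principal components."""
    xbar = X.mean(axis=1, keepdims=True)
    Xc = X - xbar
    lam, U = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
    U = U[:, np.argsort(lam)[::-1][:M]]          # top-M eigenvectors of S
    return xbar + U @ (U.T @ Xc)                 # xbar + sum_i [(x_n - xbar)^T u_i] u_i

X = np.random.randn(784, 100)                    # stand-in for N images of 784 pixels
X_tilde = pca_reconstruct(X, M=50)
```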
PCA – minimum error reconstruction
Where is the signal?
[Figure: (a) eigenvalue spectrum and (b) cumulative sum of discarded eigenvalues versus index, for the digit data; original image shown for comparison.]
Singular Value Decomposition (SVD)
• SVD – a simpler way to do PCA: [U, D, V] = SVD(X), X = U D V^T
• U and V are d×d and N×N orthonormal matrices:
U^T U = I_d , \quad V^T V = I_N
• D is a diagonal d×N matrix of singular values (≥ 0), sorted:
S = \frac{1}{N} X X^T = \frac{1}{N} U D V^T V D^T U^T = \frac{1}{N} U D D^T U^T
• Columns of U are the eigenvectors of S, the PCs:
S u_i = \frac{D_{ii}^2}{N} u_i \quad and \quad \lambda_i = \frac{D_{ii}^2}{N} .
• Projection of all data on the i-th PC: u_i^T X = u_i^T U D V^T = D_{ii} v_i^T .
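A small numerical check of this correspondence between the SVD and the eigen-decomposition of S (a sketch with synthetic data):

```python
import numpy as np

X = np.random.randn(6, 300)
Xc = X - X.mean(axis=1, keepdims=True)
N = Xc.shape[1]

# SVD route: columns of U are the PCs, singular values give the eigenvalues.
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = D ** 2 / N

# Eigen-decomposition route on the covariance, for comparison.
lam_eig = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / N))[::-1]
print(np.allclose(lam_svd, lam_eig))             # True (up to numerical precision)
```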
Singular Value Decomposition (SVD)
• Project onto the PCs, with U_M of size d×M:
\tilde{X}_M \equiv U_M^T X = U_M^T U D V^T = D_M V_M^T
• D_M is M×M and V_M is N×M.
• Covariance of projected data:
\tilde{S} = \frac{1}{N} \tilde{X}_M \tilde{X}_M^T = \frac{1}{N} D_M V_M^T V_M D_M = \frac{1}{N} D_M^2
• Whitening:
\hat{X}_M \equiv \sqrt{N} D_M^{-1} U_M^T X = \sqrt{N} V_M^T \;\rightarrow\; \hat{S} = I_M .
• Lossy projection back to d-dim. space: U_M \tilde{X}_M = U_M U_M^T X = U_M D_M V_M^T .
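A minimal sketch of the projection and whitening steps on synthetic data (M and the variance scales are illustrative):

```python
import numpy as np

X = np.random.randn(5, 1000) * np.array([[3.0], [1.0], [0.5], [0.2], [0.1]])
Xc = X - X.mean(axis=1, keepdims=True)
N = Xc.shape[1]
M = 3

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
UM, DM = U[:, :M], D[:M]

X_proj  = UM.T @ Xc                              # D_M V_M^T, covariance D_M^2 / N
X_white = np.sqrt(N) * (X_proj / DM[:, None])    # whitened: covariance = I_M
print(np.allclose(X_white @ X_white.T / N, np.eye(M), atol=1e-8))
```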
Continuous Latent Variable Models
• Latent variable z is unobserved but can be learned from data.
• Translations (2) and rotation (1) are the latent variables in the example.
• The mapping (W, z_n) → x_n is in general non-linear.
• We often consider simpler linear models:
x_n = W z_n + \epsilon_n .
Continuous Latent Variable Models
• Explain data by latent variables + noise:
x = W z + \epsilon , \quad X = W Z + E
• x, d-dimensional data vector.
• W, d×M-dimensional ‘mixing matrix’.
• z, M-dimensional latent variable or source vector.
• \epsilon, d-dimensional noise or residual vector.
[Graphical model (plate notation): nodes x_in, W_im, z_mn, ε_in with noise variances σ_i^2; plates i = 1, ..., d; n = 1, ..., N; m = 1, ..., M.]
Continuous Latent Variable Models
Linear latent variable model for collaborative filtering
• v_n: M-dimensional “taste” vector of viewer n.
• u_m: M-dimensional “profile” vector of movie m.
• Latent variable h_{mn}:
h_{mn} = u_m \cdot v_n + \epsilon_{mn} , \quad \epsilon_{mn} \sim N(0, \sigma^2)
• \sigma^2 is the noise level.
• Learn U and V from training data and predict r_{m'n'} \approx u_{m'} \cdot v_{n'} .
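A minimal sketch of learning such profile/taste vectors by stochastic gradient descent on the observed ratings; the learning rate, regularization and initialization are illustrative choices, not the Netflix-prize recipe:

```python
import numpy as np

def factorize(ratings, d, N, M=10, lr=0.01, reg=0.05, epochs=20, seed=0):
    """ratings: list of (movie m, viewer n, r_mn). Learn U (d x M) and V (N x M)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d, M))
    V = 0.1 * rng.standard_normal((N, M))
    for _ in range(epochs):
        for m, n, r in ratings:
            err = r - U[m] @ V[n]                  # residual on h_mn = u_m . v_n
            u_old = U[m].copy()
            U[m] += lr * (err * V[n] - reg * U[m])
            V[n] += lr * (err * u_old - reg * V[n])
    return U, V

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 1, 1)]
U, V = factorize(ratings, d=3, N=2)
print(U[2] @ V[0])          # predicted rating for movie 2, viewer 0
```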
Probabilistic PCA
• Let us try to understand PCA as a latent variable model:
x = W z + \epsilon
• SVD gives a hint:
X = W Z + E = U D V^T = U_M D_M V_M^T + E
\Rightarrow W = U_M D_M \quad and \quad Z = V_M^T
Probabilistic PCA
• Tipping and Bishop considered a specific assumption:
P(z) = N(z; 0, I) , \quad P(\epsilon; \sigma^2) = N(\epsilon; 0, \sigma^2 I)
• Under this model x is Gaussian with
\langle x \rangle = \langle W z + \epsilon \rangle = 0
\langle x x^T \rangle = W \langle z z^T \rangle W^T + \langle \epsilon \epsilon^T \rangle = W W^T + \sigma^2 I
• Distribution of a datum:
P(x; W, \sigma^2) = N(x; 0, W W^T + \sigma^2 I)
Probabilistic PCA
• The log-likelihood for W and \sigma^2 is the joint distribution of all data:
\log L(\theta; X) = \sum_n \log P(x_n | W, \sigma^2) = -\frac{N}{2} \left\{ \log \det 2\pi\Sigma + \mathrm{Tr}\left[ \Sigma^{-1} S \right] \right\}
• Model covariance: \Sigma = W W^T + \sigma^2 I
• Empirical covariance: S = \frac{1}{N} X X^T
• The solution: W spans the M PCs with the largest eigenvalues.
• The remaining variance is explained by \sigma^2 I (\sigma^2 is the average of the discarded eigenvalues).
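A minimal sketch that evaluates this log-likelihood for given W and sigma^2 on synthetic data drawn from the model (names and sizes are illustrative):

```python
import numpy as np

def ppca_loglik(X, W, sigma2):
    """Evaluate log L(W, sigma^2) = -N/2 [ log det(2 pi Sigma) + Tr(Sigma^{-1} S) ]."""
    d, N = X.shape
    S = X @ X.T / N                               # empirical covariance (data assumed centered)
    Sigma = W @ W.T + sigma2 * np.eye(d)          # model covariance
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return -0.5 * N * (logdet + np.trace(np.linalg.solve(Sigma, S)))

# Toy check: generate data from the model and evaluate the likelihood.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((5, 2))
Z = rng.standard_normal((2, 1000))
X = W_true @ Z + 0.1 * rng.standard_normal((5, 1000))
print(ppca_loglik(X, W_true, 0.01))
```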
Probabilistic PCA
• Let us try to solve the cocktail party problem:
Recordings = Mixing × Speakers, or
x = W z
• Use pPCA to estimate W and z.
• Ignore complications of room acoustics.
Probabilistic PCA
• Stop sign! Non-uniqueness of the solution!
• The likelihood only depends upon W through \Sigma = W W^T + \sigma^2 I.
• Rotate W:
W \leftarrow W \tilde{U}
• This leaves the covariance unchanged:
\tilde{W} \tilde{W}^T = W \tilde{U} \tilde{U}^T W^T = W W^T .
• This can also be seen directly from the model:
W z = W \tilde{U} \tilde{U}^T z = \tilde{W} \tilde{z} , \quad \tilde{z} = \tilde{U}^T z
• The distribution of z is invariant under rotation.
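A small numerical illustration of this invariance, using an orthogonal matrix from a QR decomposition as the rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, sigma2 = 4, 2, 0.1
W = rng.standard_normal((d, M))

# Any orthogonal M x M rotation Q.
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
W_rot = W @ Q

Sigma     = W @ W.T + sigma2 * np.eye(d)
Sigma_rot = W_rot @ W_rot.T + sigma2 * np.eye(d)
print(np.allclose(Sigma, Sigma_rot))   # True: the Gaussian likelihood cannot tell W from W Q
```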
Independent component analysis (ICA)
• Prior knowledge to the rescue!
• Real signals are not Gaussian
• Example: x = w_1 z_1 + w_2 z_2
• with z_1 and z_2 independent and heavy-tailed.
• We exploit this information by putting it into our model!
Independent component analysis (ICA)
• Allow for a more general z-distribution.
• Still assume independent, identically distributed (iid) sources:
P(Z) = \prod_{mn} P(z_{mn})
• Many choices possible: heavy-tailed (positive kurtosis), uniform, discrete (think of wireless communication), positive (used to decompose spectra, images, etc.) and negative kurtosis.
• Extension to temporal/spatial correlations (time series, …)
Independent component analysis (ICA)
• Summary of linear generative models: x = W z + \epsilon.
• Probabilistic PCA:
p(z, \epsilon) = N(z; 0, I)\, N(\epsilon; 0, \sigma^2 I)
• Factor analysis:
p(z, \epsilon) = N(z; 0, I)\, N(\epsilon; 0, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2))
• Independent component analysis:
p(z, \epsilon) = \prod_{m=1}^{M} p(z_m)\, p(\epsilon)
• Encode a priori knowledge in p(z_m), e.g. heavy tails.
Independent component analysis (ICA)
• Bell and Sejnowski algorithm, aka InfoMax.
• Assumption: square mixing and no noise, x = W z with W: d×d.
• Likelihood – one sample:
p(x|W) = \int dz\, P(x|W, z)\, P(z) = \int dz\, \delta(x - W z)\, P(z)
• Make the change of variables y = W z, dy = |\det W|\, dz:
p(x|W) = \frac{1}{|\det W|} \int dy\, \delta(x - y)\, P(W^{-1} y) = \frac{1}{|\det W|} P(W^{-1} x)
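A minimal maximum-likelihood / InfoMax-style sketch, assuming a super-Gaussian source prior whose score function is tanh and using the standard natural-gradient update; step size and iteration count are illustrative:

```python
import numpy as np

def infomax_ica(X, lr=0.01, iters=2000, seed=0):
    """Estimate an unmixing matrix B so that y = B x has (approximately) independent components.

    Assumes square, noise-free mixing x = W z and a super-Gaussian source prior
    p(z) proportional to 1/cosh(z), whose score function is tanh."""
    d, N = X.shape
    rng = np.random.default_rng(seed)
    B = np.eye(d) + 0.01 * rng.standard_normal((d, d))
    for _ in range(iters):
        Y = B @ X                                        # current source estimates
        grad = (np.eye(d) - np.tanh(Y) @ Y.T / N) @ B    # natural gradient of the log-likelihood
        B += lr * grad
    return B

# Toy demo: two heavy-tailed sources mixed by a random W.
rng = np.random.default_rng(1)
Z = rng.laplace(size=(2, 5000))
W = rng.standard_normal((2, 2))
X = W @ Z
B = infomax_ica(X)
print(B @ W)        # should be close to a scaled permutation matrix
```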
Independent component analysis (ICA)
• Non-iid data – temporal/spatial correlations:
\langle z_{mn} z_{m'n'} \rangle = \delta_{mm'} K_{m, nn'}
• It is easy to prove that a rotation of z, U z, will no longer leave the statistics of z unchanged,
• if the kernels are different, K_m \neq K_{m'}, for different variables m \neq m'!
• Second-order statistics alone are therefore enough for identifiability!
• The Molgedey and Schuster algorithm (aka MAF) is one example using second-order statistics.
• The use of Gaussian processes (GPs) is another.
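A minimal second-order sketch in this spirit (whiten, then diagonalize a symmetrized time-lagged covariance); the lag and the AR(1) toy sources are illustrative, and this is a close relative of, not exactly, the Molgedey–Schuster procedure:

```python
import numpy as np

def second_order_ica(X, tau=1):
    """Second-order separation of a d x N time series X (sources need distinct autocorrelations)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, N = Xc.shape
    U, D, _ = np.linalg.svd(Xc, full_matrices=False)
    Wwhite = np.sqrt(N) * (U / D).T                  # whitening matrix: cov(Wwhite @ Xc) = I
    Xw = Wwhite @ Xc
    C_tau = Xw[:, :-tau] @ Xw[:, tau:].T / (N - tau)
    C_tau = 0.5 * (C_tau + C_tau.T)                  # symmetrize the lagged covariance
    _, V = np.linalg.eigh(C_tau)                     # rotation that diagonalizes it
    return V.T @ Wwhite                              # unmixing matrix B: sources = B @ X

# Toy demo: two AR(1) sources with different time constants.
rng = np.random.default_rng(0)
N = 5000
z = np.zeros((2, N))
for t in range(1, N):
    z[:, t] = np.array([0.95, 0.3]) * z[:, t - 1] + rng.standard_normal(2)
A = rng.standard_normal((2, 2))
B = second_order_ica(A @ z)
print(B @ A)        # approximately a scaled permutation matrix
```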
Beyond PCA and ICA
• Kernel PCA – component analysis in a feature space x → \Phi(x) (a minimal sketch follows below).
• Nonlinear latent variable model: x = W f(z) + \epsilon
• Fully probabilistic (and Bayesian) rather than one hammer (SVD) for all data
• Sparsity, Bayesian networks and latent variables models
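A minimal kernel PCA sketch with an RBF kernel; the kernel choice and bandwidth are illustrative:

```python
import numpy as np

def kernel_pca(X, M=2, gamma=1.0):
    """Kernel PCA on a d x N data matrix X with RBF kernel k(x, x') = exp(-gamma ||x - x'||^2)."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))   # N x N Gram matrix
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                                   # center in feature space
    lam, A = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:M]
    A = A[:, idx] / np.sqrt(np.maximum(lam[idx], 1e-12))             # unit-norm feature-space directions
    return Kc @ A                                                    # N x M nonlinear component scores

scores = kernel_pca(np.random.randn(3, 100))
```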
Summary and reading
• PCA and SVD
• Generative model – continuous latent variables
• Probabilistic PCA and ICA
• Non-linear and Bayesian extensions
• Books: C. M. Bishop, Pattern Recognition and Machine Learning, Springer; D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.