PCA, ICA and beyond
Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven
Ole Winther
Technical University of Denmark (DTU) & University of Copenhagen (KU)
August 18, 2009
• Motivation – multivariate data
• Principal component analysis (PCA) classic
• Model based approach – probabilistic PCA (pPCA)
• Identifiability – independent component analysis (ICA)
• InfoMax, smoothness and beyond.
Motivation – multivariate data
• Data is often (but not always) represented as a matrix of d features and N samples:
size(X) = [d N]
• In statistics d = p, N = n, and the data matrix is transposed, X → X^T.
• Collaborative filtering:
X = item-user matrix
• Gene expression:
X = gene-tissue matrix
• Text analysis:
X = term-document matrix
Collaborative filtering
• Netflix - online movie rental (DVDs).
• Collaborative filtering – predict user rating from past behavior of user.
• Improve Netflix's own system by 10% to win.
• training.txt – R = 10^8 ratings, scale 1 to 5, for d = 17,770 movies and N = 480,189 users.
• qualifying.txt – 2,817,131 movie-user pairs; (continuous) predictions submitted to Netflix, which returns an RMSE.
• Rating matrix X is mostly missing values, 98.5%.
Collaborative filtering task
• Relatively large data set – 10^8 data points
• Very heterogeneous – viewers and movies with few ratings
• Ratings ∈ {1, 2, 3, 4, 5} are noisy (subjective use of scale, non-stationary, ...)
• Complex model needed to capture latent structure
• Regularization!
Collaborative filtering
• Netflix prize - some key performance numbers
Method RMSE % Improv.
Cinematch 0.9514 0%
“PCA” ∼0.89-0.92 ∼5-6%
Grand prize 0.8563 10%
• RMSE = root mean squared error
• Two teams (Ensemble and BellKor’s Pragmatic Chaos) are above 10%
• but the prize has not been handed out yet (as of Aug 2009).
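The competition metric is easy to compute; a minimal sketch (function and variable names are illustrative, not Netflix's scoring code):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Continuous predictions may be clipped to the 1-5 scale before scoring.
print(rmse([3.7, 2.1, 4.9], [4, 2, 5]))
```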
Gene Expression
DNA micro-array: d ∼ 6000 genes and N ∼ 60 cancer tissues.
Gene Expression
Protein signalling network (textbook example) – Sachs et al., Science, 2005.
Gene Expression
• Single cell flow cytometry measurements of 11 phosphorylated proteins and phospholipids.
• Data was generated from a series of stimulatory cues and inhibitory interventions.
• Observational data: 1755 general stimulatory conditions,
• Experimental data, ∼80% of the data set, not used in our approach.
• Not “small n, large p”!
Latent semantic analysis (LSA)
Bag of words representation – term-document matrix
Principal component analysis
• Principal components (PCs): the orthogonal directions with the most variance.
• Empirical covariance of (centered) data:
S = \frac{1}{N} X X^T
• size(S) = [ d d ]
• PCs: eigenvectors of S:
S u_i = \lambda_i u_i
[Figure: 2D data scatter with principal axes \sqrt{\lambda_i}\, u_i.]
Principal component analysis
[Figure: typical steps in PCA.]
PCA – maximum variance formulation
• Project data {x_n}_{n=1,...,N} onto directions {u_i}_{i=1,...,M}.
• We find the directions sequentially, u_1 first.
• Mean of projected data: u_1^T \bar{x}, with
\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n .
• Variance of projected data:
\frac{1}{N} \sum_{n=1}^{N} \left\{ u_1^T (x_n - \bar{x}) \right\}^2 = u_1^T S u_1 .
PCA – maximum variance formulation
• Maximize the variance u_1^T S u_1 with respect to u_1.
• But we need a constraint to avoid u_1 → ∞:
u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)
• \lambda_1 is a Lagrange multiplier.
• Solution – an eigenvalue problem:
S u_1 = \lambda_1 u_1 .
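As a concrete illustration, a minimal numpy sketch of this recipe (center, form S, take the leading eigenvectors); all names are illustrative:

```python
import numpy as np

def pca_eig(X, M):
    """PCA on a d x N data matrix X: return the top-M eigenvectors and eigenvalues of S."""
    d, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
    S = Xc @ Xc.T / N                        # empirical covariance, d x d
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]            # sort descending
    return U[:, order[:M]], lam[order[:M]]

# u_1 maximizes u^T S u subject to ||u|| = 1, i.e. the first principal direction.
X = np.random.randn(5, 200)
U, lam = pca_eig(X, M=2)
```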
PCA – minimum error reconstruction
[Figure: data point x_n and its projection \tilde{x}_n onto the direction u_1 in the (x_1, x_2) plane.]
Find the best reconstructing orthonormal directions {u_i}.
PCA – minimum error reconstruction
• Orthonormal directions {u_i}: u_i^T u_j = \delta_{ij}.
• Lower-dimensional subspace:
\tilde{x}_n = \sum_{i=1}^{M} \alpha_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i
• Minimize
J(\{u_i\}) = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \ldots = \sum_{i=M+1}^{D} u_i^T S u_i = \sum_{i=M+1}^{D} \lambda_i .
PCA – minimum error reconstruction
• Database of N images, d = 28 × 28 = 784 pixel values
• Mean and first four PCs
[Figure: the mean image and the first four principal components.]
• Reconstruction
\tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i = \bar{x} + \sum_{i=1}^{M} \left[ (x_n - \bar{x})^T u_i \right] u_i
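A minimal numpy sketch of this reconstruction (the 784-pixel size is just a stand-in for the digit data):

```python
import numpy as np

def pca_reconstruct(X, M):
    """Reconstruct a d x N data matrix X from its mean plus the top-M principal components."""
    xbar = X.mean(axis=1, keepdims=True)
    Xc = X - xbar
    lam, U = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
    U = U[:, np.argsort(lam)[::-1][:M]]          # top-M eigenvectors of S
    return xbar + U @ (U.T @ Xc)                 # xbar + sum_i [(x_n - xbar)^T u_i] u_i

X = np.random.randn(784, 100)                    # stand-in for N images of 784 pixels
X_tilde = pca_reconstruct(X, M=50)
```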
PCA – minimum error reconstruction
Where is the signal?
[Figure: (a) eigenvalue spectrum and (b) cumulative sum of discarded eigenvalues versus index, for the digit data; original image shown for comparison.]
Singular Value Decomposition (SVD)
• SVD – a simpler way to do PCA: [U, D, V] = SVD(X), X = U D V^T
• U and V are d×d and N×N orthonormal matrices:
U^T U = I_d , \quad V^T V = I_N
• D is a diagonal d×N matrix of singular values (≥ 0), sorted:
S = \frac{1}{N} X X^T = \frac{1}{N} U D V^T V D^T U^T = \frac{1}{N} U D D^T U^T
• Columns of U are the eigenvectors of S, the PCs:
S u_i = \frac{D_{ii}^2}{N} u_i \quad and \quad \lambda_i = \frac{D_{ii}^2}{N} .
• Projection of all data on the i-th PC: u_i^T X = u_i^T U D V^T = D_{ii} v_i^T .
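A small numerical check of this correspondence between the SVD and the eigen-decomposition of S (a sketch with synthetic data):

```python
import numpy as np

X = np.random.randn(6, 300)
Xc = X - X.mean(axis=1, keepdims=True)
N = Xc.shape[1]

# SVD route: columns of U are the PCs, singular values give the eigenvalues.
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = D ** 2 / N

# Eigen-decomposition route on the covariance, for comparison.
lam_eig = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / N))[::-1]
print(np.allclose(lam_svd, lam_eig))             # True (up to numerical precision)
```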
Singular Value Decomposition (SVD)
• Project onto the PCs, with U_M of size d×M:
\tilde{X}_M \equiv U_M^T X = U_M^T U D V^T = D_M V_M^T
• D_M is M×M and V_M is N×M.
• Covariance of projected data:
\tilde{S} = \frac{1}{N} \tilde{X}_M \tilde{X}_M^T = \frac{1}{N} D_M V_M^T V_M D_M = \frac{1}{N} D_M^2
• Whitening:
\hat{X}_M \equiv \sqrt{N} D_M^{-1} U_M^T X = \sqrt{N} V_M^T \;\rightarrow\; \hat{S} = I_M .
• Lossy projection back to d-dim. space: U_M \tilde{X}_M = U_M U_M^T X = U_M D_M V_M^T .
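A minimal sketch of the projection and whitening steps on synthetic data (M and the variance scales are illustrative):

```python
import numpy as np

X = np.random.randn(5, 1000) * np.array([[3.0], [1.0], [0.5], [0.2], [0.1]])
Xc = X - X.mean(axis=1, keepdims=True)
N = Xc.shape[1]
M = 3

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
UM, DM = U[:, :M], D[:M]

X_proj  = UM.T @ Xc                              # D_M V_M^T, covariance D_M^2 / N
X_white = np.sqrt(N) * (X_proj / DM[:, None])    # whitened: covariance = I_M
print(np.allclose(X_white @ X_white.T / N, np.eye(M), atol=1e-8))
```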
Continuous Latent Variable Models
• Latent variable z is unobserved but can be learned from data.
• Translations (2) and rotation (1) are the latent variables in the example.
• The mapping (W, z_n) → x_n is in general non-linear.
• We often consider simpler linear models:
x_n = W z_n + \epsilon_n .
Continuous Latent Variable Models
• Explain data by latent variables + noise:
x = W z + \epsilon , \quad X = W Z + E
• x, d-dimensional data vector.
• W, d×M-dimensional ‘mixing matrix’.
• z, M-dimensional latent variable or source vector.
• \epsilon, d-dimensional noise or residual vector.
[Graphical model (plate notation): nodes x_in, W_im, z_mn, ε_in with noise variances σ_i^2; plates i = 1, ..., d; n = 1, ..., N; m = 1, ..., M.]
Continuous Latent Variable Models
Linear latent variable model for collaborative filtering
• v_n: M-dimensional “taste” vector of viewer n.
• u_m: M-dimensional “profile” vector of movie m.
• Latent variable h_{mn}:
h_{mn} = u_m \cdot v_n + \epsilon_{mn} , \quad \epsilon_{mn} \sim N(0, \sigma^2)
• \sigma^2 is the noise level.
• Learn U and V from training data and predict r_{m'n'} \approx u_{m'} \cdot v_{n'} .
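A minimal sketch of learning such profile/taste vectors by stochastic gradient descent on the observed ratings; the learning rate, regularization and initialization are illustrative choices, not the Netflix-prize recipe:

```python
import numpy as np

def factorize(ratings, d, N, M=10, lr=0.01, reg=0.05, epochs=20, seed=0):
    """ratings: list of (movie m, viewer n, r_mn). Learn U (d x M) and V (N x M)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d, M))
    V = 0.1 * rng.standard_normal((N, M))
    for _ in range(epochs):
        for m, n, r in ratings:
            err = r - U[m] @ V[n]                  # residual on h_mn = u_m . v_n
            u_old = U[m].copy()
            U[m] += lr * (err * V[n] - reg * U[m])
            V[n] += lr * (err * u_old - reg * V[n])
    return U, V

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 1, 1)]
U, V = factorize(ratings, d=3, N=2)
print(U[2] @ V[0])          # predicted rating for movie 2, viewer 0
```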
Probabilistic PCA
• Let us try to understand PCA as a latent variable model:
x = W z + \epsilon
• SVD gives a hint:
X = W Z + E = U D V^T = U_M D_M V_M^T + E
\Rightarrow W = U_M D_M \quad and \quad Z = V_M^T
Probabilistic PCA
• Tipping and Bishop considered a specific assumption:
P(z) = N(z; 0, I) , \quad P(\epsilon; \sigma^2) = N(\epsilon; 0, \sigma^2 I)
• Under this model x is Gaussian with
\langle x \rangle = \langle W z + \epsilon \rangle = 0
\langle x x^T \rangle = W \langle z z^T \rangle W^T + \langle \epsilon \epsilon^T \rangle = W W^T + \sigma^2 I
• Distribution of a datum:
P(x; W, \sigma^2) = N(x; 0, W W^T + \sigma^2 I)
Probabilistic PCA
• The log-likelihood for W and \sigma^2 is the joint distribution of all data:
\log L(\theta; X) = \sum_n \log P(x_n | W, \sigma^2) = -\frac{N}{2} \left\{ \log \det 2\pi\Sigma + \mathrm{Tr}\left[ \Sigma^{-1} S \right] \right\}
• Model covariance: \Sigma = W W^T + \sigma^2 I
• Empirical covariance: S = \frac{1}{N} X X^T
• The solution: W spans the M PCs with the largest eigenvalues.
• The remaining variance is explained by \sigma^2 I (\sigma^2 is the average of the discarded eigenvalues).
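A minimal sketch that evaluates this log-likelihood for given W and sigma^2 on synthetic data drawn from the model (names and sizes are illustrative):

```python
import numpy as np

def ppca_loglik(X, W, sigma2):
    """Evaluate log L(W, sigma^2) = -N/2 [ log det(2 pi Sigma) + Tr(Sigma^{-1} S) ]."""
    d, N = X.shape
    S = X @ X.T / N                               # empirical covariance (data assumed centered)
    Sigma = W @ W.T + sigma2 * np.eye(d)          # model covariance
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return -0.5 * N * (logdet + np.trace(np.linalg.solve(Sigma, S)))

# Toy check: generate data from the model and evaluate the likelihood.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((5, 2))
Z = rng.standard_normal((2, 1000))
X = W_true @ Z + 0.1 * rng.standard_normal((5, 1000))
print(ppca_loglik(X, W_true, 0.01))
```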
Probabilistic PCA
• Let us try to solve the cocktail party problem:
Recordings = Mixing × Speakers, or
x = W z
• Use pPCA to estimate W and z.
• Ignore complications of room acoustics.
Probabilistic PCA
• Stop sign! Non-uniqueness of the solution!
• The likelihood only depends upon W through \Sigma = W W^T + \sigma^2 I.
• Rotate W:
W \leftarrow W \tilde{U}
• This leaves the covariance unchanged:
\tilde{W} \tilde{W}^T = W \tilde{U} \tilde{U}^T W^T = W W^T .
• This can also be seen directly from the model:
W z = W \tilde{U} \tilde{U}^T z = \tilde{W} \tilde{z} , \quad \tilde{z} = \tilde{U}^T z
• The distribution of z is invariant under rotation.
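A small numerical illustration of this invariance, using an orthogonal matrix from a QR decomposition as the rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, sigma2 = 4, 2, 0.1
W = rng.standard_normal((d, M))

# Any orthogonal M x M rotation Q.
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
W_rot = W @ Q

Sigma     = W @ W.T + sigma2 * np.eye(d)
Sigma_rot = W_rot @ W_rot.T + sigma2 * np.eye(d)
print(np.allclose(Sigma, Sigma_rot))   # True: the Gaussian likelihood cannot tell W from W Q
```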
Independent component analysis (ICA)
• Prior knowledge to the rescue!
• Real signals are not Gaussian
• Example: x = w_1 z_1 + w_2 z_2
• with z_1 and z_2 independent and heavy-tailed.
• We exploit this information by putting it into our model!
Independent component analysis (ICA)
• Allow for a more general z-distribution.
• Still assume independent, identically distributed (iid) sources:
P(Z) = \prod_{mn} P(z_{mn})
• Many choices possible: heavy-tailed (positive kurtosis), uniform, discrete (think of wireless communication), positive (used to decompose spectra, images, etc.) and negative kurtosis.
• Extension to temporal/spatial correlations (time series, …)
Independent component analysis (ICA)
• Summary of linear generative models: x = W z + \epsilon.
• Probabilistic PCA:
p(z, \epsilon) = N(z; 0, I)\, N(\epsilon; 0, \sigma^2 I)
• Factor analysis:
p(z, \epsilon) = N(z; 0, I)\, N(\epsilon; 0, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2))
• Independent component analysis:
p(z, \epsilon) = \prod_{m=1}^{M} p(z_m)\, p(\epsilon)
• Encode a priori knowledge in p(z_m), e.g. heavy tails.
Independent component analysis (ICA)
• Bell and Sejnowski algorithm, aka InfoMax.
• Assumption: square mixing and no noise, x = W z with W: d×d.
• Likelihood – one sample:
p(x|W) = \int dz\, P(x|W, z)\, P(z) = \int dz\, \delta(x - W z)\, P(z)
• Make the change of variables y = W z, dy = |\det W|\, dz:
p(x|W) = \frac{1}{|\det W|} \int dy\, \delta(x - y)\, P(W^{-1} y) = \frac{1}{|\det W|} P(W^{-1} x)
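A minimal maximum-likelihood / InfoMax-style sketch, assuming a super-Gaussian source prior whose score function is tanh and using the standard natural-gradient update; step size and iteration count are illustrative:

```python
import numpy as np

def infomax_ica(X, lr=0.01, iters=2000, seed=0):
    """Estimate an unmixing matrix B so that y = B x has (approximately) independent components.

    Assumes square, noise-free mixing x = W z and a super-Gaussian source prior
    p(z) proportional to 1/cosh(z), whose score function is tanh."""
    d, N = X.shape
    rng = np.random.default_rng(seed)
    B = np.eye(d) + 0.01 * rng.standard_normal((d, d))
    for _ in range(iters):
        Y = B @ X                                        # current source estimates
        grad = (np.eye(d) - np.tanh(Y) @ Y.T / N) @ B    # natural gradient of the log-likelihood
        B += lr * grad
    return B

# Toy demo: two heavy-tailed sources mixed by a random W.
rng = np.random.default_rng(1)
Z = rng.laplace(size=(2, 5000))
W = rng.standard_normal((2, 2))
X = W @ Z
B = infomax_ica(X)
print(B @ W)        # should be close to a scaled permutation matrix
```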
Independent component analysis (ICA)
• Non-iid data – temporal/spatial correlations:
\langle z_{mn} z_{m'n'} \rangle = \delta_{mm'} K_{m, nn'}
• It is easy to prove that a rotation of z, U z, will no longer leave the statistics of z unchanged,
• if the kernels are different, K_m \neq K_{m'}, for different variables m \neq m'!
• Second-order statistics alone are therefore enough for identifiability!
• The Molgedey and Schuster algorithm (aka MAF) is one example using second-order statistics.
• The use of Gaussian processes (GPs) is another.
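A minimal second-order sketch in this spirit (whiten, then diagonalize a symmetrized time-lagged covariance); the lag and the AR(1) toy sources are illustrative, and this is a close relative of, not exactly, the Molgedey–Schuster procedure:

```python
import numpy as np

def second_order_ica(X, tau=1):
    """Second-order separation of a d x N time series X (sources need distinct autocorrelations)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, N = Xc.shape
    U, D, _ = np.linalg.svd(Xc, full_matrices=False)
    Wwhite = np.sqrt(N) * (U / D).T                  # whitening matrix: cov(Wwhite @ Xc) = I
    Xw = Wwhite @ Xc
    C_tau = Xw[:, :-tau] @ Xw[:, tau:].T / (N - tau)
    C_tau = 0.5 * (C_tau + C_tau.T)                  # symmetrize the lagged covariance
    _, V = np.linalg.eigh(C_tau)                     # rotation that diagonalizes it
    return V.T @ Wwhite                              # unmixing matrix B: sources = B @ X

# Toy demo: two AR(1) sources with different time constants.
rng = np.random.default_rng(0)
N = 5000
z = np.zeros((2, N))
for t in range(1, N):
    z[:, t] = np.array([0.95, 0.3]) * z[:, t - 1] + rng.standard_normal(2)
A = rng.standard_normal((2, 2))
B = second_order_ica(A @ z)
print(B @ A)        # approximately a scaled permutation matrix
```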
Beyond PCA and ICA
• Kernel PCA – component analysis in a feature space x → \Phi(x) (a minimal sketch follows below).
• Nonlinear latent variable model: x = W f(z) + \epsilon
• Fully probabilistic (and Bayesian) rather than one hammer (SVD) for all data
• Sparsity, Bayesian networks and latent variables models
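A minimal kernel PCA sketch with an RBF kernel; the kernel choice and bandwidth are illustrative:

```python
import numpy as np

def kernel_pca(X, M=2, gamma=1.0):
    """Kernel PCA on a d x N data matrix X with RBF kernel k(x, x') = exp(-gamma ||x - x'||^2)."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))   # N x N Gram matrix
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                                   # center in feature space
    lam, A = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:M]
    A = A[:, idx] / np.sqrt(np.maximum(lam[idx], 1e-12))             # unit-norm feature-space directions
    return Kc @ A                                                    # N x M nonlinear component scores

scores = kernel_pca(np.random.randn(3, 100))
```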
Summary and reading
• PCA and SVD
• Generative model – continuous latent variables
• Probabilistic PCA and ICA
• Non-linear and Bayesian extensions
• Books: C. M. Bishop, Pattern Recognition and Machine Learning, Springer; D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.