NIPS TKML 2010
Applications of tensor decomposition in data mining and machine learning
Morten Mørup
Section for Cognitive Systems DTU Informatics
Technical University of Denmark
Goal of unsupervised Learning
(Ghahramani & Roweis, 1999)
Perform dimensionality reduction
Build topographic maps
Find the hidden causes or sources of the data
Model the data density
Cluster data
Purpose of unsupervised learning
(Hinton and Sejnowski, 1999)
Extract an efficient internal representation of the statistical structure implicit in the inputs
Tensor decomposition and unsupervised learning
Analysis of massive amounts of data will be the main
driving force of all sciences in the future!!
What is a Tensor/Multi-way array?
x X X
Vector
Order 1 tensor
Matrix
Order 2 tensor
3-way array
Order 3 tensor
Additional Definitions: multidimensional array, N-way array, hypermatrix
Unfortunately, multi-way structure has been widely ignored in many fields of research!
Tensors are everywhere
Chemistry
Psychology
Computer Vision
Bio-informatics
Neuroscience
Multi-relational modeling in Complex networks
Why bother about tensor decomposition?
Tensor decompositions admit uniqueness of the decomposition without additional constraints such as orthogonality or independence.
Tensor decomposition methods can identify components even when facing very poor signal-to-noise ratios (SNR) and when only a relatively small fraction of all the data is observed.
Tensor decomposition can explicitly take into account the multi-way structure of the data that would otherwise be lost when analyzing the data by collapsing some of the modes to form a matrix.
History of Tensor decomposition
Hitchcock 1927 (not the filmmaker!)
generalized 2-way rank to n-way (i.e. first proposed the CP and Tucker model) as well as introduced the notion of n-mode rank
Cattell 1944
Parallel Proportional Profiles
(to resolve rotational indeterminacy in factor analysis)
Tucker 1966
Proposed the Tucker model
Harshman, and Carroll & Chang, 1970
Independently proposed the PARAFAC and CANDECOMP models (the CP model)
Matricizing/unfolding admits (familiar) matrix computation when dealing with tensors.
The N-mode product
(a generalization of the matrix product for tensors)
The SVD in n-mode product notation: X = Σ ×₁ U ×₂ V = U Σ Vᵀ
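The n-mode product can be sketched in a few lines of numpy (a minimal illustration, not from the slides; the function name `nmode_product` is my own):

```python
import numpy as np

def nmode_product(X, U, n):
    """n-mode product X x_n U: multiply matrix U onto mode n of tensor X.

    Contracts dimension n of X with the columns of U, so the result has
    U.shape[0] along mode n. (Indices here are 0-based.)
    """
    # Move mode n to the front, unfold, multiply, and fold back.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)   # mode-n unfolding
    Yn = U @ Xn                                          # ordinary matrix product
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

# The SVD written with n-mode products: A = S x_1 U x_2 V
A = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
A_rec = nmode_product(nmode_product(S, U, 0), Vt.T, 1)
assert np.allclose(A, A_rec)
```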
Outer product
U Σ Vᵀ = Σ_d σ_d u_d ∘ v_d

Kronecker and Khatri-Rao product
Kronecker product
Khatri-Rao product
Inner product and Frobenius norm of a tensor
Summary of operators
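The operators above can be summarized with a small numpy sketch (my own illustration; `khatri_rao` is a hand-rolled helper, not a library routine):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x D) and B (J x D) -> (I*J x D)."""
    I, D = A.shape
    J, D2 = B.shape
    assert D == D2, "A and B must have the same number of columns"
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, D)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(4.0).reshape(2, 2)

K = np.kron(A, B)            # Kronecker product, shape (6, 4)
KR = khatri_rao(A, B)        # Khatri-Rao product, shape (6, 2)

# Each Khatri-Rao column is the Kronecker product of the matching columns:
assert np.allclose(KR[:, 0], np.kron(A[:, 0], B[:, 0]))

# Inner product and Frobenius norm of a tensor: elementwise sums.
X = np.random.rand(3, 4, 5)
Y = np.random.rand(3, 4, 5)
inner = np.sum(X * Y)             # <X, Y>
fro = np.sqrt(np.sum(X ** 2))     # ||X||_F = sqrt(<X, X>)
assert np.isclose(fro, np.linalg.norm(X))
```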
The Tucker model (Tucker 1966)
(also proposed by Hitchcock in 1927)
The Tucker representation is not unique: G ×₁ A ×₂ B ×₃ C = (G ×₁ Q_A⁻¹ ×₂ Q_B⁻¹ ×₃ Q_C⁻¹) ×₁ (A Q_A) ×₂ (B Q_B) ×₃ (C Q_C) for any invertible Q_A, Q_B, Q_C.
The Tucker model is particularly useful for compressing a tensor
X ≈ G ×₁ A ×₂ B ×₃ C
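A minimal numpy sketch of the Tucker reconstruction and of why the model compresses a tensor (the dimensions are arbitrary illustration values, not from the slides):

```python
import numpy as np

# Tucker: X ~ G x_1 A x_2 B x_3 C, written compactly with einsum.
# G is a small (P x Q x R) core; A, B, C map it up to the data dimensions.
I, J, K = 10, 12, 8     # data size
P, Q, R = 3, 4, 2       # core size -- the multilinear rank of the model

G = np.random.rand(P, Q, R)
A = np.random.rand(I, P)
B = np.random.rand(J, Q)
C = np.random.rand(K, R)

X = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# Compression: store (G, A, B, C) instead of the full tensor X.
full = I * J * K
compressed = P * Q * R + I * P + J * Q + K * R
print(full, compressed)  # 960 vs 118 entries
```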
Canonical Polyadic form (Hitchcock, 1927)
Canonical Decomposition (Carroll & Chang, 1970)
Parallel Factor Analysis (Harshman, 1970)
(CP)
Representation unique in general (up to trivial permutation and scaling of the components).
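The CP model can be sketched in numpy both elementwise and in matricized form (my own illustration; note numpy's row-major unfolding, so the mode-1 unfolding reads A(B ⊙ C)ᵀ rather than the column-major A(C ⊙ B)ᵀ seen in the literature):

```python
import numpy as np

# CP: x_ijk ~ sum_d A[i,d] B[j,d] C[k,d], a sum of D rank-one terms.
D = 3
A = np.random.rand(10, D)
B = np.random.rand(12, D)
C = np.random.rand(8, D)

X = np.einsum('id,jd,kd->ijk', A, B, C)

# Equivalent matricized form using the Khatri-Rao product (row-major):
KR = (B[:, None, :] * C[None, :, :]).reshape(-1, D)   # (J*K, D)
X1 = A @ KR.T                                          # mode-1 unfolding of X
assert np.allclose(X.reshape(10, -1), X1)
```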
Kruskal’s Uniqueness condition for the CP model
A CP decomposition with factor matrices A, B, C and D components is essentially unique (up to permutation and scaling) if k_A + k_B + k_C ≥ 2D + 2, where k_A denotes the Kruskal rank (k-rank) of A.
The criterion has been generalized to arbitrary order N arrays
Lim, Comon, Comptes Rendus de l'Académie des sciences, 2009
The k-rank is closely related to the notion of spark in compressive sensing, and an operational criterion for uniqueness is given by measures of coherence.
Kruskal, Psychometrika, 1976
Kruskal, Linear Algebra and its Applications, 1977
Sidiropoulos, Bro, Journal of Chemometrics, 2000
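The k-rank and Kruskal's condition can be checked directly for small factor matrices (a brute-force sketch of my own; `k_rank` is not a library function):

```python
import numpy as np
from itertools import combinations

def k_rank(A, tol=1e-10):
    """Kruskal rank: the largest k such that EVERY set of k columns of A
    is linearly independent (brute force -- fine for small matrices)."""
    n_cols = A.shape[1]
    k = 0
    for r in range(1, n_cols + 1):
        if all(np.linalg.matrix_rank(A[:, list(c)], tol=tol) == r
               for c in combinations(range(n_cols), r)):
            k = r
        else:
            break
    return k

# Kruskal's condition for a D-component CP model:
# k_A + k_B + k_C >= 2 D + 2  ensures essential uniqueness.
rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((5, D))
B = rng.standard_normal((6, D))
C = rng.standard_normal((4, D))
print(k_rank(A) + k_rank(B) + k_rank(C) >= 2 * D + 2)  # True for generic random factors
```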
CP and Tucker Rank and Multilinear rank
Model estimation by Alternating Least squares (ALS)
CP model estimation (ALS: update each factor in turn):
A ← X_(1) (C ⊙ B) [(CᵀC) ∗ (BᵀB)]⁺
Tucker model estimation: Higher Order Orthogonal Iteration (HOOI), with an approximate solution or initialization given by the Higher Order SVD (HOSVD):
A ← leading left singular vectors of X_(1) (C ⊗ B)
For further details on model estimation see for instance:
Tomasi, PhD Thesis 2006
Comon, Luciani, de Almeida, Journal of Chemometrics 2009
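As a concrete (hedged) illustration of the alternating least squares scheme, a bare-bones 3-way CP-ALS in numpy; it uses the row-major unfolding convention, so each update reads A ← X_(1)(B ⊙ C)[(BᵀB) ∗ (CᵀC)]⁺, with ∗ the Hadamard product:

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding of X (row-major convention)."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, D, n_iter=100, seed=0):
    """Alternating least squares for a 3-way CP model (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, D)) for s in X.shape)
    for _ in range(n_iter):
        # Each update is a linear least-squares problem in one factor.
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover a noiseless rank-2 synthetic tensor.
rng = np.random.default_rng(1)
At, Bt, Ct = (rng.standard_normal((s, 2)) for s in (6, 5, 4))
X = np.einsum('id,jd,kd->ijk', At, Bt, Ct)
A, B, C = cp_als(X, D=2)
X_hat = np.einsum('id,jd,kd->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small after convergence
```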
Many other tensor factorization approaches:
Block Term Decomposition
DEDICOM
PARAFAC2
Many more ...
Application in Psychology
"Which group of subjects behave differently on which variables under which conditions?"
Kiers & Van Mechelen, Psychological Methods 2001
Kroonenberg, Applied Multiway Analysis 2008
Murakami & Kroonenberg, Multivariate Behavioral Research 2003
Explained variation: 44% (randomly permuted data: 2%)
Application in Chemistry
“Second-order advantage: making it possible to quantify chemical analytes even in the presence of un-calibrated interferents”
Appellof & Davidson, Analytical Chemistry 1981
Bro, Critical Reviews in Analytical Chemistry 2006
Beer-Lambert’s law states a linear relation between the absorbance of light and the concentration of a compound.
Independent Component Analysis
Wireless Communication
X = A S
(microphones x time) = (microphones x people)(people x time)
”The cocktail party problem”
Sidiropoulos, Giannakis, Bro, IEEE Transactions on Signal Processing 2000
Cumulants obey multi-linearity; CP modeling admits exploiting different types of diversity for unique recovery.
S statistically independent: Comon, Signal Processing, 1994
Application in Bio-informatics
Omberg, Meyerson, Kobayashi, Drury, Difley, Alter, Molecular Systems Biology, 2009
In micro-array analysis, tensor decomposition is useful for removing experimental artifacts and detecting distinct stimulus-dependent sets of functionally related genes.
Vasilescu, Terzopoulos, ECCV 2002
Yan, Xu, Yang, Zhang, Tang, Zhang, IEEE Transactions on Image Processing, 2007
Extraction of patterns that generalize well across modes of variation.
”TensorFaces” demonstrated to be significantly more accurate than PCA, as multiple interrelated subspaces can collaborate to discriminate different classes.
Application in Web-mining
Sun, Tao, Faloutsos, KDD 2006
Acar, Camtepe, Krishnamoorthy, Yener, Intelligence and Security Informatics 2005
Kolda, Bader, Kenny, ICDM 2005
Captured the underlying group structure well
Decomposition automatically identified topics along with associated authoritative web pages
”CubeSVD” significantly improved Web search performance over LSI and Collaborative Filtering approaches. Sun, Zeng, Liu, Lu, Chen, WWW 2005
Bader, Harshman, Kolda, ICDM2007
Decomposition had strong correspondence with known job classifications while revealing the patterns of communication between these roles and the changes in communication pattern over time
”Dynamic/Streaming Tensor Analysis” useful for anomaly detection and multi-way LSI
Assumption: data is an instantaneous mixture of temporal signatures (PCA/ICA/NMF).
Flaw: X = A S = (A Q⁻¹)(Q S) = Â Ŝ, so the representation is not unique!
Application in Neuroscience
The EEG/MEG/fMRI Party problem
From 2-way to multi-way analysis
3 common ways of avoiding tensors:
Preaverage (average the data over subjects/trials)
Separate Analysis (analyze each subject separately)
Concatenation (identical spatial map, varying time series; or identical time series, varying spatial maps)
Bilinear Model: x_it = Σ_d a_id s_dt
Assumption: data is an instantaneous mixture of temporal signatures (PCA/ICA/NMF), i.e., weighted averages over the trials.
Trilinear Model: x_itj = Σ_d a_id s_dt c_jd
Assumption: data is an instantaneous mixture of temporal signatures that are expressed to varying degree over the subjects/trials (Canonical Decomposition, Parallel Factor (CP)).
Multilinear modelling
Multilinear modelling admits non-ambiguous extraction of consistent patterns of activation.
Common fixes: impose orthogonality, regularization or non-negativity constraints, e.g. by analyzing data transformed to a time-frequency domain representation.
Wavelet transformed data: channel x time → channel x time-frequency
Mørup, Hansen, Herman, Parnas, Arnfred, NeuroImage 2006
Mørup, Hansen, Arnfred, Journal of Neuroscience Methods, 2007
www.ERPWAVELAB.org
Degeneracy is often a result of multi-linear models being too restrictive.
The trilinear model can encompass:
Variability in strength over repeats
However, other common causes of variation are:
Delay Variability (→ ShiftCP)
Shape Variability (→ ConvCP, PARAFAC2)
Modelling Delay Variability
Shifted CP: x_itj ≈ Σ_d a_id c_jd s_d,(t−τ_jd), i.e., each component’s time course may be delayed by a trial- and component-specific shift τ_jd.
Mørup, Hansen, Arnfred, Lim, Madsen, NeuroImage 2008
Multilinear modeling explicitly extracts consistent patterns of activation across trials/subjects.
retinotopic mapping paradigm
(Analysis by Kristoffer Hougaard Madsen)
Mørup, Hansen, Arnfred, Lim, Madsen, NeuroImage 2008
Modeling Shape (and delay) Variability
Convolutive CP: x_itj ≈ Σ_d Σ_τ a_id c_jdτ s_d,(t−τ), i.e., each component’s time course is convolved (*) with trial-specific filters c_jdτ.
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
ConvCP can model an arbitrary number of component delays within the trials and account for shape variation within the convolutional model representation. Redundancy between what is coded in C and B is resolved by imposing sparsity on C.
CP, ShiftCP and ConvCP
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
Analysis of fMRI data
(Voxels x Time x Trials)
Each trial consists of a visual stimulus delivered as an annular full-field checkerboard reversing at 8 Hz.
L1 sparsity regularization is imposed on the third mode.
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
Determining the number of components is in general a challenging problem in tensor decomposition.
Contrary to the SVD, CP and Tucker are not nested; thus each model order needs to be estimated separately:
D_max CP models, but L_max · M_max · N_max different Tucker models for a 3rd-order tensor!
Common approaches to model order selection
CP: Core Consistency Diagnostic:
compare the CP model’s unit superdiagonal core to the corresponding Tucker model core estimated with the same loadings.
CP and Tucker: Cross-validation based on a test set formed by treating entries in the tensor as missing.
The CP model can be uniquely identifiable even when 99% of the data is treated as missing!
Bro, Kjeldahl, Smilde, Kiers, Analytical and Bioanalytical Chemistry, 2008
Bro, Kiers, Journal of Chemometrics, 2003
Acar, Dunlavy, Kolda, Mørup, Chemometrics and Intelligent Laboratory Systems, 2010
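The Core Consistency Diagnostic can be sketched in a few lines of numpy (my own simplified implementation, assuming factor columns scaled so the implied superdiagonal is one):

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core Consistency Diagnostic (CORCONDIA), a minimal sketch.

    Fits the least-squares Tucker core given the CP loadings and measures
    how close it is to the superdiagonal core the CP model implies.
    100 = perfectly trilinear; low or negative values suggest too many
    components (or an inappropriate CP model).
    """
    D = A.shape[1]
    # Least-squares core: G = X x_1 A^+ x_2 B^+ x_3 C^+
    G = np.einsum('ijk,pi,qj,rk->pqr',
                  X, np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    T = np.zeros((D, D, D))
    T[np.arange(D), np.arange(D), np.arange(D)] = 1.0   # superdiagonal target
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / np.sum(T ** 2))

# A perfectly trilinear tensor yields a core consistency of 100.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((s, 2)) for s in (6, 5, 4))
X = np.einsum('id,jd,kd->ijk', A, B, C)
print(round(core_consistency(X, A, B, C)))  # 100
```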
Automatic Relevance Determination for model order estimation
Use regularization to simplify the Tucker core forming a unique representation as well as enable interpolation between the Tucker (full core) and CP (diagonal core) model.
Use regularization to turn off excess components in the CP and Tucker model and thereby select the model order at the cost of fitting one conventional model.
Tune the regularization strength from data by Automatic Relevance Determination (ARD) based on Bayesian learning.
Mørup and Hansen, Journal of Chemometrics 2009
Automatic Relevance Determination (ARD)
Automatic Relevance Determination (ARD) is a hierarchical Bayesian approach widely used for model selection.
In ARD, hyper-parameters explicitly represent the relevance of different features by defining their range of variation
(i.e., range of variation → 0 ⇒ feature removed).
Sparse Tucker decomposition by ARD
Maximum a posteriori (MAP) estimation breaks into standard Lasso/BPD sub-problems; the regularization parameters are updated by ARD.
Mørup and Hansen, Journal of Chemometrics 2009
Tucker(10,10,10) models were fitted to the data; the extracted cores are given below.
Synthetic: Tucker(3,4,5), Tucker(3,6,4), Tucker(3,3,3), Tucker(?,4,4), Tucker(4,4,4)
Mørup and Hansen, Journal of Chemometrics 2009
Current research: Reversible jump Markov Chain
Monte Carlo - a fully Bayesian approach to estimate parameter uncertainty and model order.
Schmidt, Mørup, Eusipco 2010
Summary
Multiway analysis is a rapidly growing research field fueled by the growth in storage capabilities and
computational power of modern computers.
Tensor decomposition approaches admit explicitly taking into account multi-way structure that is lost when collapsing the data into a matrix.
Tensor decomposition can admit unique recovery even when most of the data is missing and the SNR is low.
Extensions of the basic models such as CP and Tucker can improve the identification of underlying structure and
alleviate CP degeneracy.
Many extensions of the basic CP and Tucker models exist, and the future will likely bring many new and interesting tensor decomposition models and applications.
Some tensor decompositions at the NIPS 2010 conference
Peter Hoff’s Tutorial talk Monday
CP decomposition of countries’ cold-war relations
(modes: country x country x time)
Tucker decomposition of trade
(modes: export x import x commodity x years)
Memisevic et al. ”Gated Softmax Classification”
CP decomposition
modes: Input variable index x hidden variable index x class index
Mørup et al. ”Infinite Relational modeling of Functional Connectivity in Resting State fMRI”
Tucker2 Decomposition
modes: Voxel x Voxel x Subject
Mørup, Madsen, Dogonowski, Siebner, Hansen, NIPS 2010
72 subjects: 42 with multiple sclerosis and 30 normal subjects
Multilinear modeling explicitly extracts consistent functional groups (Z) of voxels across subjects, while the communication patterns between these groups are subject specific.
Connectivity in Resting State fMRI
Extracted functional groups and their median communication pattern.
Between-group communication patterns that differ significantly between MS and normal subjects.
Lars Kai Hansen
Sidse Marie Arnfred
Lek-Heng Lim
Tammy Kolda
Kristoffer Hougaard Madsen
Evrim Acar
Mikkel N. Schmidt
Daniel M. Dunlavy
Hartwig Siebner
Anne-Marie Dogonowski
Suggested introductory reading resources
Acar E, Yener B, Unsupervised multiway data analysis: A literature survey, IEEE Transactions on Knowledge and Data Engineering, 2009, 21(1):6-20
Kolda TG, Bader BW, Tensor decompositions and applications, SIAM Review, 2009, 51(3):455-500
Kroonenberg PM, Applied multiway data analysis, John Wiley & Sons, 2008
Mørup M, Applications of tensor (multi-way array) factorizations and decompositions in data mining, WIREs Data Mining and Knowledge Discovery (to appear)
(preprints available upon request, including more detailed information on referenced material)
Smilde A, Bro R, Geladi P, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, 2004