NIPS TKML 2010
Applications of tensor decomposition in data mining and machine learning
Morten Mørup
Section for Cognitive Systems DTU Informatics
Technical University of Denmark
Goal of unsupervised Learning
(Ghahramani & Roweis, 1999)
Perform dimensionality reduction
Build topographic maps
Find the hidden causes or sources of the data
Model the data density
Cluster data
Purpose of unsupervised learning
(Hinton and Sejnowski, 1999)
Extract an efficient internal representation of the statistical structure implicit in the inputs
Tensor decomposition and unsupervised learning
Analysis of massive amounts of data will be the main
driving force of all sciences in the future!!
What is a Tensor/Multi-way array?
x X X
Vector
Order 1 tensor
Matrix
Order 2 tensor
3-way array
Order 3 tensor
Additional Definitions: multidimensional array, N-way array, hypermatrix
Unfortunately, multi-way structure has been widely ignored in many fields of research!
Tensors are everywhere
Chemistry
Psychology
Computer Vision
Bio-informatics
Neuroscience
Multi-relational modeling in Complex networks
Why bother about tensor decomposition?
Tensor decompositions admit uniqueness of the decomposition without additional constraints such as orthogonality or independence.
Tensor decomposition methods can identify components even when facing very poor signal-to-noise ratios (SNR) and when only a relatively small fraction of all the data is observed.
Tensor decomposition can explicitly take into account the multi-way structure of the data that would otherwise be lost when analyzing the data by collapsing some of the modes to form a matrix.
History of Tensor decomposition
Hitchcock 1927 (not the filmmaker!)
generalized 2-way rank to n-way (i.e. first proposed the CP and Tucker model) as well as introduced the notion of n-mode rank
Cattell 1944
Parallel Proportional Profiles
(to resolve rotational indeterminacy in factor analysis)
Tucker 1966
Proposed the Tucker model
Harshman, and Carroll & Chang, 1970
Independently proposed the PARAFAC and CANDECOMP models (the CP model)
Matricizing/unfolding admits (familiar) matrix computation when dealing with tensors.
The N-mode product
(a generalization of the matrix product for tensors)
The SVD in n-mode product notation: X = Σ ×₁ U ×₂ V = U Σ Vᵀ
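The n-mode product can be sketched in a few lines of numpy (a minimal illustration, not from the slides; the function name `nmode_product` is my own):

```python
import numpy as np

def nmode_product(X, U, n):
    """n-mode product X x_n U: multiply matrix U onto mode n of tensor X.

    Contracts dimension n of X with the columns of U, so the result has
    U.shape[0] along mode n. (Indices here are 0-based.)
    """
    # Move mode n to the front, unfold, multiply, and fold back.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)   # mode-n unfolding
    Yn = U @ Xn                                          # ordinary matrix product
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, n))
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

# The SVD written with n-mode products: A = S x_1 U x_2 V
A = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
A_rec = nmode_product(nmode_product(S, U, 0), Vt.T, 1)
assert np.allclose(A, A_rec)
```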
Outer product
U Σ Vᵀ = Σ_d σ_d u_d ∘ v_d

Kronecker and Khatri-Rao product
Kronecker product
Khatri-Rao product
Inner product and Frobenius norm of a tensor
Summary of operators
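The operators above can be summarized with a small numpy sketch (my own illustration; `khatri_rao` is a hand-rolled helper, not a library routine):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x D) and B (J x D) -> (I*J x D)."""
    I, D = A.shape
    J, D2 = B.shape
    assert D == D2, "A and B must have the same number of columns"
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, D)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(4.0).reshape(2, 2)

K = np.kron(A, B)            # Kronecker product, shape (6, 4)
KR = khatri_rao(A, B)        # Khatri-Rao product, shape (6, 2)

# Each Khatri-Rao column is the Kronecker product of the matching columns:
assert np.allclose(KR[:, 0], np.kron(A[:, 0], B[:, 0]))

# Inner product and Frobenius norm of a tensor: elementwise sums.
X = np.random.rand(3, 4, 5)
Y = np.random.rand(3, 4, 5)
inner = np.sum(X * Y)             # <X, Y>
fro = np.sqrt(np.sum(X ** 2))     # ||X||_F = sqrt(<X, X>)
assert np.isclose(fro, np.linalg.norm(X))
```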
The Tucker model (Tucker 1966)
(also proposed by Hitchcock in 1927)
The Tucker representation is not unique: G ×₁ A ×₂ B ×₃ C = (G ×₁ Q_A⁻¹ ×₂ Q_B⁻¹ ×₃ Q_C⁻¹) ×₁ (A Q_A) ×₂ (B Q_B) ×₃ (C Q_C) for any invertible Q_A, Q_B, Q_C.
The Tucker model is particularly useful for compressing a tensor
X ≈ G ×₁ A ×₂ B ×₃ C
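A minimal numpy sketch of the Tucker reconstruction and of why the model compresses a tensor (the dimensions are arbitrary illustration values, not from the slides):

```python
import numpy as np

# Tucker: X ~ G x_1 A x_2 B x_3 C, written compactly with einsum.
# G is a small (P x Q x R) core; A, B, C map it up to the data dimensions.
I, J, K = 10, 12, 8     # data size
P, Q, R = 3, 4, 2       # core size -- the multilinear rank of the model

G = np.random.rand(P, Q, R)
A = np.random.rand(I, P)
B = np.random.rand(J, Q)
C = np.random.rand(K, R)

X = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# Compression: store (G, A, B, C) instead of the full tensor X.
full = I * J * K
compressed = P * Q * R + I * P + J * Q + K * R
print(full, compressed)  # 960 vs 118 entries
```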
Canonical Polyadic form (Hitchcock, 1927)
Canonical Decomposition (Carroll & Chang, 1970)
Parallel Factor Analysis (Harshman, 1970)
(CP)
Representation unique in general (up to trivial permutation and scaling of the components).
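The CP model can be sketched in numpy both elementwise and in matricized form (my own illustration; note numpy's row-major unfolding, so the mode-1 unfolding reads A(B ⊙ C)ᵀ rather than the column-major A(C ⊙ B)ᵀ seen in the literature):

```python
import numpy as np

# CP: x_ijk ~ sum_d A[i,d] B[j,d] C[k,d], a sum of D rank-one terms.
D = 3
A = np.random.rand(10, D)
B = np.random.rand(12, D)
C = np.random.rand(8, D)

X = np.einsum('id,jd,kd->ijk', A, B, C)

# Equivalent matricized form using the Khatri-Rao product (row-major):
KR = (B[:, None, :] * C[None, :, :]).reshape(-1, D)   # (J*K, D)
X1 = A @ KR.T                                          # mode-1 unfolding of X
assert np.allclose(X.reshape(10, -1), X1)
```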
Kruskal’s Uniqueness condition for the CP model
A CP decomposition with factor matrices A, B, C and D components is essentially unique (up to permutation and scaling) if k_A + k_B + k_C ≥ 2D + 2, where k_A denotes the Kruskal rank (k-rank) of A.
The criterion has been generalized to arbitrary order N arrays
Lim, Comon, Comptes Rendus de l'Académie des sciences, 2009
The k-rank is closely related to the notion of spark in compressive sensing, and an operational criterion for uniqueness is given by measures of coherence.
Kruskal, Psychometrika, 1976
Kruskal, Linear Algebra and its Applications, 1977
Sidiropoulos, Bro, Journal of Chemometrics, 2000
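The k-rank and Kruskal's condition can be checked directly for small factor matrices (a brute-force sketch of my own; `k_rank` is not a library function):

```python
import numpy as np
from itertools import combinations

def k_rank(A, tol=1e-10):
    """Kruskal rank: the largest k such that EVERY set of k columns of A
    is linearly independent (brute force -- fine for small matrices)."""
    n_cols = A.shape[1]
    k = 0
    for r in range(1, n_cols + 1):
        if all(np.linalg.matrix_rank(A[:, list(c)], tol=tol) == r
               for c in combinations(range(n_cols), r)):
            k = r
        else:
            break
    return k

# Kruskal's condition for a D-component CP model:
# k_A + k_B + k_C >= 2 D + 2  ensures essential uniqueness.
rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((5, D))
B = rng.standard_normal((6, D))
C = rng.standard_normal((4, D))
print(k_rank(A) + k_rank(B) + k_rank(C) >= 2 * D + 2)  # True for generic random factors
```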
CP and Tucker Rank and Multilinear rank
Model estimation by Alternating Least squares (ALS)
CP model estimation (ALS: update each factor in turn):
A ← X_(1) (C ⊙ B) [(CᵀC) ∗ (BᵀB)]⁺
Tucker model estimation: Higher Order Orthogonal Iteration (HOOI), with an approximate solution or initialization given by the Higher Order SVD (HOSVD):
A ← leading left singular vectors of X_(1) (C ⊗ B)
For further details on model estimation see for instance:
Tomasi, PhD Thesis 2006
Comon, Luciani, de Almeida, Journal of Chemometrics 2009
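As a concrete (hedged) illustration of the alternating least squares scheme, a bare-bones 3-way CP-ALS in numpy; it uses the row-major unfolding convention, so each update reads A ← X_(1)(B ⊙ C)[(BᵀB) ∗ (CᵀC)]⁺, with ∗ the Hadamard product:

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding of X (row-major convention)."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, D, n_iter=100, seed=0):
    """Alternating least squares for a 3-way CP model (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, D)) for s in X.shape)
    for _ in range(n_iter):
        # Each update is a linear least-squares problem in one factor.
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover a noiseless rank-2 synthetic tensor.
rng = np.random.default_rng(1)
At, Bt, Ct = (rng.standard_normal((s, 2)) for s in (6, 5, 4))
X = np.einsum('id,jd,kd->ijk', At, Bt, Ct)
A, B, C = cp_als(X, D=2)
X_hat = np.einsum('id,jd,kd->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small after convergence
```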
Many other tensor factorization approaches:
Block Term Decomposition
DEDICOM
PARAFAC2
Many more ...
Application in Psychology
"Which group of subjects behave differently on which variables under which conditions?"
Kiers & Van Mechelen, Psychological Methods 2001
Kroonenberg, Applied Multiway Analysis 2008
Murakami & Kroonenberg, Multivariate Behavioral Research 2003
Explained variation: 44% (randomly permuted data: 2%)
Application in Chemistry
“Second-order advantage: making it possible to quantify chemical analytes even in the presence of un-calibrated interferents”
Appellof & Davidson, Analytical Chemistry 1981
Bro, Critical Reviews in Analytical Chemistry 2006
Beer-Lambert’s law states a linear relation between the absorbance of light and the concentration of a compound.
Independent Component Analysis
Wireless Communication
X = A S
(microphones x time) = (microphones x people)(people x time)
”The cocktail party problem”
Sidiropoulos, Giannakis, Bro, IEEE Transactions on Signal Processing 2000
Cumulants obey multi-linearity; CP modeling admits exploiting different types of diversity for unique recovery.
S statistically independent: Comon, Signal Processing, 1994
Application in Bio-informatics
Omberg, Meyerson, Kobayashi, Drury, Difley, Alter, Molecular Systems Biology, 2009
In micro-array analysis, tensor decomposition is useful for removing experimental artifacts and detecting distinct stimulus-dependent sets of functionally related genes.
Vasilescu, Terzopoulos, ECCV 2002
Yan, Xu, Yang, Zhang, Tang, Zhang, IEEE Transactions on Image Processing, 2007
Extraction of patterns that generalize well across modes of variation.
”TensorFaces” demonstrated to be significantly more accurate than PCA, as multiple interrelated subspaces can collaborate to discriminate different classes.
Application in Web-mining
Sun, Tao, Faloutsos, KDD 2006
Acar, Camtepe, Krishnamoorthy, Yener, Intelligence and Security Informatics 2005
Kolda, Bader, Kenny, ICDM 2005
Captured the underlying group structure well
Decomposition automatically identified topics along with associated authoritative web pages
”CubeSVD” significantly improved Web search performance over LSI and Collaborative Filtering approaches. Sun, Zeng, Liu, Lu, Chen, WWW 2005
Bader, Harshman, Kolda, ICDM2007
Decomposition had strong correspondence with known job classifications while revealing the patterns of communication between these roles and the changes in communication pattern over time
”Dynamic/Streaming Tensor Analysis” useful for anomaly detection and multi-way LSI
Assumption: data is an instantaneous mixture of temporal signatures (PCA/ICA/NMF).
Flaw: X = A S = (A Q⁻¹)(Q S) = Â Ŝ, so the representation is not unique!
Application in Neuroscience
The EEG/MEG/fMRI Party problem
From 2-way to multi-way analysis
3 common ways of avoiding tensors:
Preaverage (average the data over subjects/trials)
Separate Analysis (analyze each subject separately)
Concatenation (identical spatial map, varying time series; or identical time series, varying spatial maps)
Bilinear Model: x_it = Σ_d a_id s_dt
Assumption: data is an instantaneous mixture of temporal signatures (PCA/ICA/NMF), i.e., weighted averages over the trials.
Trilinear Model: x_itj = Σ_d a_id s_dt c_jd
Assumption: data is an instantaneous mixture of temporal signatures that are expressed to varying degree over the subjects/trials (Canonical Decomposition, Parallel Factor (CP)).
Multilinear modelling
Multilinear modelling admits non-ambiguous extraction of consistent patterns of activation.
Common fixes: impose orthogonality, regularization or non-negativity constraints, e.g. by analyzing data transformed to a time-frequency domain representation.
Wavelet transformed data: channel x time → channel x time-frequency
Mørup, Hansen, Herman, Parnas, Arnfred, NeuroImage 2006
Mørup, Hansen, Arnfred, Journal of Neuroscience Methods, 2007
www.ERPWAVELAB.org
Degeneracy is often a result of multi-linear models being too restrictive.
The trilinear model can encompass:
Variability in strength over repeats
However, other common causes of variation are:
Delay Variability (→ ShiftCP)
Shape Variability (→ ConvCP, PARAFAC2)
Modelling Delay Variability
Shifted CP: x_itj ≈ Σ_d a_id c_jd s_d,(t−τ_jd), i.e., each component’s time course may be delayed by a trial- and component-specific shift τ_jd.
Mørup, Hansen, Arnfred, Lim, Madsen, NeuroImage 2008
Multilinear modeling explicitly extracts consistent patterns of activation across trials/subjects.
retinotopic mapping paradigm
(Analysis by Kristoffer Hougaard Madsen)
Mørup, Hansen, Arnfred, Lim, Madsen, NeuroImage 2008
Modeling Shape (and delay) Variability
Convolutive CP: x_itj ≈ Σ_d Σ_τ a_id c_jdτ s_d,(t−τ), i.e., each component’s time course is convolved (*) with trial-specific filters c_jdτ.
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
ConvCP can model an arbitrary number of component delays within the trials and account for shape variation within the convolutional model representation. Redundancy between what is coded in C and B is resolved by imposing sparsity on C.
CP, ShiftCP and ConvCP
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
Analysis of fMRI data
(Voxels x Time x Trials)
Each trial consists of a visual stimulus delivered as an annular full-field checkerboard reversing at 8 Hz.
L1 sparsity regularization is imposed on the third mode.
Mørup, Hougard, Hansen, NIPS workshop on New Directions in Statistical Learning for Meaningful and Reproducible fMRI Analysis 2008
Determining the number of components is in general a challenging problem in tensor decomposition.
Contrary to the SVD, CP and Tucker are not nested; thus each model order needs to be estimated separately:
D_max CP models, but L_max · M_max · N_max different Tucker models for a 3rd-order tensor!
Common approaches to model order selection
CP: Core Consistency Diagnostic:
compare the CP model’s unit superdiagonal core to the corresponding Tucker model core estimated with the same loadings.
CP and Tucker: Cross-validation based on a test set formed by treating entries in the tensor as missing.
The CP model can be uniquely identifiable even when 99% of the data is treated as missing!
Bro, Kjeldahl, Smilde, Kiers, Analytical and Bioanalytical Chemistry, 2008
Bro, Kiers, Journal of Chemometrics, 2003
Acar, Dunlavy, Kolda, Mørup, Chemometrics and Intelligent Laboratory Systems, 2010
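The Core Consistency Diagnostic can be sketched in a few lines of numpy (my own simplified implementation, assuming factor columns scaled so the implied superdiagonal is one):

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core Consistency Diagnostic (CORCONDIA), a minimal sketch.

    Fits the least-squares Tucker core given the CP loadings and measures
    how close it is to the superdiagonal core the CP model implies.
    100 = perfectly trilinear; low or negative values suggest too many
    components (or an inappropriate CP model).
    """
    D = A.shape[1]
    # Least-squares core: G = X x_1 A^+ x_2 B^+ x_3 C^+
    G = np.einsum('ijk,pi,qj,rk->pqr',
                  X, np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    T = np.zeros((D, D, D))
    T[np.arange(D), np.arange(D), np.arange(D)] = 1.0   # superdiagonal target
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / np.sum(T ** 2))

# A perfectly trilinear tensor yields a core consistency of 100.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((s, 2)) for s in (6, 5, 4))
X = np.einsum('id,jd,kd->ijk', A, B, C)
print(round(core_consistency(X, A, B, C)))  # 100
```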
Automatic Relevance Determination for model order estimation
Use regularization to simplify the Tucker core forming a unique representation as well as enable interpolation between the Tucker (full core) and CP (diagonal core) model.
Use regularization to turn off excess components in the CP and Tucker model and thereby select the model order at the cost of fitting one conventional model.
Tune the regularization strength from data by Automatic Relevance Determination (ARD) based on Bayesian learning.
Mørup and Hansen, Journal of Chemometrics 2009
Automatic Relevance Determination (ARD)
Automatic Relevance Determination (ARD) is a hierarchical Bayesian approach widely used for model selection.
In ARD, hyper-parameters explicitly represent the relevance of different features by defining their range of variation
(i.e., range of variation → 0 ⇒ feature removed).
Sparse Tucker decomposition by ARD
Maximum a posteriori (MAP) estimation breaks into standard Lasso/BPD sub-problems; the regularization parameters are updated by ARD.
Mørup and Hansen, Journal of Chemometrics 2009
Tucker(10,10,10) models were fitted to the data; the extracted cores are given below.
Synthetic: Tucker(3,4,5), Tucker(3,6,4), Tucker(3,3,3), Tucker(?,4,4), Tucker(4,4,4)
Mørup and Hansen, Journal of Chemometrics 2009
Current research: Reversible jump Markov Chain
Monte Carlo - a fully Bayesian approach to estimate parameter uncertainty and model order.
Schmidt, Mørup, Eusipco 2010
Summary
Multiway analysis is a rapidly growing research field fueled by the growth in storage capabilities and
computational power of modern computers.
Tensor decomposition approaches admit explicitly taking into account multi-way structure that is lost when collapsing the data into a matrix.
Tensor decomposition can admit unique recovery even when most of the data is missing and the SNR is low.
Extensions of the basic models such as CP and Tucker can improve the identification of underlying structure and
alleviate CP degeneracy.
Many extensions of the basic CP and Tucker models exist, and the future will likely bring many new and interesting tensor decomposition models and applications.
Some tensor decompositions at the NIPS 2010 conference
Peter Hoff’s Tutorial talk Monday
CP decomposition of countries’ cold-war relations
(modes: country x country x time)
Tucker decomposition of trade
(modes: export x import x commodity x years)
Memisevic et al. ”Gated Softmax Classification”
CP decomposition
modes: Input variable index x hidden variable index x class index
Mørup et al. ”Infinite Relational modeling of Functional Connectivity in Resting State fMRI”
Tucker2 Decomposition
modes: Voxel x Voxel x Subject
Mørup, Madsen, Dogonowski, Siebner, Hansen, NIPS 2010
72 subjects: 42 with multiple sclerosis and 30 normal subjects
Multilinear modeling explicitly extracts consistent functional groups (Z) of voxels across subjects, while the communication patterns between these groups are subject specific.
Connectivity in Resting State fMRI
Extracted functional groups and their median communication pattern.
Between-group communication patterns that differ significantly between MS and normal subjects.
Lars Kai Hansen
Sidse Marie Arnfred
Lek-Heng Lim
Tammy Kolda
Kristoffer Hougaard Madsen
Evrim Acar
Mikkel N. Schmidt
Daniel M. Dunlavy
Hartwig Siebner
Anne-Marie Dogonowski
Suggested introductory reading resources
Acar E, Yener B, Unsupervised multiway data analysis: A literature survey, IEEE Transactions on Knowledge and Data Engineering, 2009, 21(1):6-20
Kolda TG, Bader BW, Tensor decompositions and applications, SIAM Review, 2009, 51(3):455-500
Kroonenberg PM, Applied multiway data analysis, John Wiley & Sons, 2008
Mørup M, Applications of tensor (multi-way array) factorizations and decompositions in data mining, WIREs Data Mining and Knowledge Discovery (to appear)
(preprints available upon request, including more detailed information on referenced material)
Smilde A, Bro R, Geladi P, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, 2004