(1)
Geometric Methods and Manifold Learning

Mikhail Belkin, Ohio State University,

prepared jointly with Partha Niyogi

(2)

High Dimensional Data

When can we avoid the curse of dimensionality?

Smoothness

rate ≈ (1/n)^{s/d}

splines, kernel methods, L2 regularization...

Sparsity

wavelets, L1 regularization, LASSO, compressed sensing...

Geometry

graphs, simplicial complexes, laplacians, diffusions

(3)

Geometry and Data: The Central Dogma

Distribution of natural data is non-uniform and

concentrates around low-dimensional structures.

The shape (geometry) of the distribution can be

exploited for efficient learning.

(4)

Manifold Learning

Learning when data ∼ M ⊂ R^N

Clustering: M → {1, . . . , k}

connected components, min cut

Classification: M → {−1, +1}

P on M × {−1, +1}

Dimensionality Reduction: f : M → R^n, n ≪ N

M unknown: what can you learn about M from data?

e.g. dimensionality, connected components, holes, handles, homology, curvature, geodesics

(5)

Formal Justification

Speech

speech ∈ l², generated by the vocal tract. Jansen and Niyogi (2005)

Vision

group actions on objects leading to different images. Donoho and Grimes (2004)

Robotics

configuration spaces in joint movements

Graphics

Manifold + Noise may be a generic model in high dimensions.

(6)

Take Home Message

Geometrically motivated approach to learning

nonlinear, nonparametric, high dimensions

Emphasize the role of the Laplacian and Heat Kernel

Semi-supervised regression and classification

Clustering and Homology

Randomized Algorithms and Numerical Analysis

(7)

Principal Components Analysis

Given x 1 , . . . , x n ∈ R D

Find y 1 , . . . , y n ∈ R such that

y i = w · x i

and

max_w Variance({y_i}) = Σ_i y_i² = w^T ( Σ_i x_i x_i^T ) w
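For concreteness, the maximization above can be sketched in a few lines of NumPy: center the data, take the leading eigenvector of Σ_i x_i x_i^T, and project. This is an illustrative sketch, not code from the talk; the function name and toy data are our own.

```python
import numpy as np

def pca_1d(X):
    """Project rows of X (n x D) onto the top principal direction w."""
    Xc = X - X.mean(axis=0)                # center the data
    C = Xc.T @ Xc                          # sum_i x_i x_i^T  (D x D)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    w = eigvecs[:, -1]                     # direction of maximal variance
    return Xc @ w                          # y_i = w . x_i

y = pca_1d(np.random.randn(100, 5))        # toy usage
```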

(8)

Manifold Model

Suppose data does not lie on a linear subspace.

Yet data has inherently one degree of freedom.

(9)

An Acoustic Example

[Figure: uniform tube of length l with input u(t) and output s(t)]

(10)

An Acoustic Example

[Figure: uniform tube of length l with input u(t) and output s(t)]

One Dimensional Air Flow

(i) ∂V/∂x = −(A/ρc²) ∂P/∂t

(ii) ∂P/∂x = −(ρ/A) ∂V/∂t

V(x, t) = volume velocity

P(x, t) = pressure

(11)

Solutions

[Figure: distributions of the solution coefficients β₁, β₃, β₇]

u(t) = Σ_n α_n sin(nω₀ t) ∈ l²

s(t) = Σ_n β_n sin(nω₀ t) ∈ l²

(12)

Acoustic Phonetics

[Figure: two-tube vocal tract model with cross-sectional areas A₁, A₂ and lengths l₁, l₂]

Vocal Tract modeled as a sequence of tubes.

(e.g. Stevens, 1998)

(13)

Vision Example

f : R 2 → [0, 1]

F = { f | f (x, y) = v(x − t, y − r ) }

(14)

Robotics

g : S 2 × S 2 × S 2 → R 3

⟨(θ₁, φ₁), (θ₂, φ₂), (θ₃, φ₃)⟩ → (x, y, z)

(15)

Manifold Learning

Learning when data ∼ M ⊂ R^N

Clustering: M → {1, . . . , k}

connected components, min cut

Classification/Regression: M → {−1, +1} or M → R

P on M × {−1, +1} or P on M × R

Dimensionality Reduction: f : M → R^n, n ≪ N

M unknown: what can you learn about M from data?

e.g. dimensionality, connected components, holes, handles, homology, curvature, geodesics

(16)

Differential Geometry

All you wanted to know about

differential geometry but were

afraid to ask, in 10 easy slides!

(17)

Embedded manifolds

M k ⊂ R N

Locally (not globally) looks like Euclidean space.


S 2 ⊂ R 3

(18)

Tangent space


T p M k ⊂ R N

k -dimensional affine subspace of R N .

(19)

Tangent vectors and curves


(20)

Tangent vectors and curves

[Figure: tangent vector v as the velocity of a curve φ(t) through a point of M]

φ(t) : R → M^k

dφ(t)/dt |_{t=0} = v

Tangent vectors <———> curves.

(21)

Tangent vectors as derivatives

[Figure: curve φ(t) on M with tangent vector v]

f : M k → R

(22)

Tangent vectors as derivatives

[Figure: curve φ(t) on M with tangent vector v]

f : M^k → R

φ(t) : R → M^k

f(φ(t)) : R → R

df/dv = d f(φ(t))/dt |_{t=0}

Tangent vectors <———> Directional derivatives.

(23)

Riemannian geometry

Norms and angles in tangent space.

[Figure: tangent vectors v and w at a point of M]

⟨v, w⟩, ‖v‖, ‖w‖

(24)

Length of curves and geodesics

[Figure: curve φ(t) on M]

φ(t) : [0, 1] → M^k

l(φ) = ∫₀¹ ‖dφ/dt‖ dt

Can measure length using norm in tangent space.

Geodesic — shortest curve between two points.

(25)

Gradient

[Figure: gradient of f along the manifold]

f : M^k → R

⟨∇f, v⟩ ≡ df/dv

Tangent vectors <———> Directional derivatives.

(26)

Exponential map

[Figure: exponential map at p sending tangent vectors v, w to points r, q]

exp_p : T_p M^k → M^k,   exp_p(v) = r,   exp_p(w) = q

Geodesic φ(t):

φ(0) = p,   φ(‖v‖) = q,   dφ(t)/dt |_{t=0} = v

(27)

Laplace-Beltrami operator

[Figure: normal coordinates x₁, x₂ around p]

f : M^k → R

exp_p : T_p M^k → M^k

∆_M f(p) ≡ Σ_i ∂² f(exp_p(x)) / ∂x_i²

Orthonormal coordinate system.

(28)

Intrinsic Curvature

cannot flatten —— can flatten

nonzero curvature —— zero curvature

No accurate map of Earth exists – Gauss’s theorem.

(29)

Dimensionality Reduction

Given x 1 , . . . , x n ∈ M ⊂ R N ,

Find y 1 , . . . , y n ∈ R d where d << N

ISOMAP (Tenenbaum, et al, 00) LLE (Roweis, Saul, 00)

Laplacian Eigenmaps (Belkin, Niyogi, 01)

Local Tangent Space Alignment (Zhang, Zha, 02) Hessian Eigenmaps (Donoho, Grimes, 02)

Diffusion Maps (Coifman, Lafon, et al, 04)

Related: Kernel PCA (Schoelkopf, et al, 98)

(30)

Algorithmic framework

(31)

Algorithmic framework

(32)

Algorithmic framework

Neighborhood graph common to all methods.

(33)

Isomap

1. Construct Neighborhood Graph.

2. Find shortest path (geodesic) distances.

D ij is n × n

3. Embed using Multidimensional Scaling.
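A minimal sketch of steps 1–2 with NumPy/SciPy (the choice of a k-NN graph, the parameter k, and all names are ours; step 3 is sketched after the MDS slides below):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap_distances(X, k=10):
    """Steps 1-2: k-NN neighborhood graph, then all-pairs shortest-path distances."""
    D = squareform(pdist(X))                     # Euclidean distances, n x n
    n = len(D)
    G = np.full((n, n), np.inf)                  # inf = no edge
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]     # k nearest neighbors of each point
    for i in range(n):
        G[i, nbrs[i]] = D[i, nbrs[i]]
    G = np.minimum(G, G.T)                       # symmetrize the graph
    return shortest_path(G, method='D', directed=False)   # geodesic estimates D_ij
```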

(34)

Multidimensional Scaling

Idea: Distances → Inner products → Embedding

1. Inner products from distances:

⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩ = ‖x − y‖²

A_ii + A_jj − 2A_ij = D_ij

Answer:

A = −(1/2) H D H,   where   H = I − (1/n) 1 1^T

In general only an approximation.

(35)

Multidimensional Scaling

2. Embedding from inner products (same as PCA!).

Consider a positive definite matrix A. Then A_ij corresponds to inner products.

A = Σ_{i=1}^{n} λ_i φ_i φ_i^T

Then for any x ∈ {1, . . . , n}

ψ(x) = ( √λ₁ φ₁(x), . . . , √λ_k φ_k(x) ) ∈ R^k
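Both MDS steps can be sketched directly (assuming D_sq is the n × n matrix of squared distances, so that A_ii + A_jj − 2A_ij = D_ij; the code and names are our illustration):

```python
import numpy as np

def classical_mds(D_sq, k=2):
    """Embed from squared distances: double centering, then top-k eigenpairs."""
    n = len(D_sq)
    H = np.eye(n) - np.ones((n, n)) / n      # H = I - (1/n) 1 1^T
    A = -0.5 * H @ D_sq @ H                  # recovered inner products
    eigvals, eigvecs = np.linalg.eigh(A)     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]      # keep the k largest
    lam = np.clip(eigvals[idx], 0, None)     # clip small negatives (approximation)
    return eigvecs[:, idx] * np.sqrt(lam)    # psi(x) = (sqrt(lam_i) phi_i(x))
```

Feeding the squared geodesic distances from the Isomap sketch above into classical_mds gives an Isomap-style embedding.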

(36)

Isomap

From Tenenbaum, et al. 00

(37)

Unfolding flat manifolds

Isomap:

“unfolds” a flat manifold isometric to a convex domain in R n .

Hessian Eigenmaps:

“unfolds” a flat manifold isometric to an arbitrary domain in R n .

LTSA can also find an unfolding.

(38)

Locally Linear Embedding

1. Construct Neighborhood Graph.

2. Let x 1 , . . . , x n be neighbors of x . Project x to the span of

x 1 , . . . , x n .

3. Find barycentric coordinates of x̄.

[Figure: point x̄ and its neighbors x₁, x₂, x₃]

x̄ = w₁ x₁ + w₂ x₂ + w₃ x₃

w₁ + w₂ + w₃ = 1

Weights w₁, w₂, w₃ are chosen so that x̄ is the center of mass.

(39)

Locally Linear Embedding

4. Construct the sparse matrix W. The i-th row contains the barycentric coordinates of x̄_i in the basis of its nearest neighbors.

5. Use the lowest eigenvectors of (I − W)^T (I − W) to embed.
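A compact sketch of steps 1–5 (the regularization of the local Gram matrix is a standard practical detail we add; it is not on the slide, and all names are ours):

```python
import numpy as np
from scipy.linalg import eigh

def lle(X, k=10, d=2, reg=1e-3):
    """LLE sketch: barycentric weights, then bottom eigenvectors of (I-W)^T (I-W)."""
    n = len(X)
    W = np.zeros((n, n))
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]          # k nearest neighbors
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                         # neighbors centered at x_i
        G = Z @ Z.T                                   # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)            # regularize (our addition)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()                   # barycentric: weights sum to 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = eigh(M)                              # ascending eigenvalues
    return vecs[:, 1:d + 1]                           # drop the constant eigenvector
```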

(40)

Laplacian and LLE

[Figure: point 0 with neighbors x₁, x₂, x₃]

Σ_i w_i x_i = 0,   Σ_i w_i = 1

Hessian H. Taylor expansion:

f(x_i) = f(0) + x_i^t ∇f + (1/2) x_i^t H x_i + o(‖x_i‖²)

(I − W)f(0) = f(0) − Σ_i w_i f(x_i)
            ≈ f(0) − Σ_i w_i f(0) − Σ_i w_i x_i^t ∇f − (1/2) Σ_i w_i x_i^t H x_i
            = −(1/2) Σ_i w_i x_i^t H x_i ≈ −tr H = ∆f

(41)

Laplacian Eigenmaps

Step 1 [Constructing the Graph]

e_ij = 1 ⇔ x_i "close to" x_j

1. ε-neighborhoods. [parameter ε ∈ R] Nodes i and j are connected by an edge if ‖x_i − x_j‖² < ε.

2. n nearest neighbors. [parameter n ∈ N] Nodes i and j are connected by an edge if i is among the n nearest neighbors of j or j is among the n nearest neighbors of i.

(42)

Laplacian Eigenmaps

Step 2 [Choosing the Weights]

1. Heat kernel. [parameter t ∈ R] If nodes i and j are connected, put

W_ij = e^{−‖x_i − x_j‖² / t}

2. Simple-minded. [no parameters] W_ij = 1 if and only if vertices i and j are connected by an edge.

(43)

Laplacian Eigenmaps

Step 3 [Eigenmaps]

Compute eigenvalues and eigenvectors for the generalized eigenvector problem:

Lf = λDf

D is the diagonal matrix with D_ii = Σ_j W_ij, and L = D − W.

Let f_0, . . . , f_{k−1} be the eigenvectors. Leave out the eigenvector f_0 and use the next m lowest eigenvectors for embedding in an m-dimensional space.
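Steps 1–3 fit in a short sketch (ε-graph with heat-kernel weights and the generalized problem Lf = λDf; parameter values and names are arbitrary illustrations, not the original implementation):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmaps(X, eps=1.0, t=1.0, m=2):
    """Laplacian Eigenmaps sketch: epsilon-graph, heat-kernel weights, L f = lambda D f."""
    D_sq = squareform(pdist(X, 'sqeuclidean'))
    W = np.exp(-D_sq / t) * (D_sq < eps)       # Steps 1-2
    np.fill_diagonal(W, 0)
    D = np.diag(W.sum(axis=1))                 # degree matrix (assumed nonsingular)
    L = D - W
    vals, vecs = eigh(L, D)                    # generalized eigenproblem
    return vecs[:, 1:m + 1]                    # drop f_0, keep the next m eigenvectors
```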

(44)

Diffusion Distance

Heat diffusion operator H t .

δ x and δ y initial heat distributions.

Diffusion distance between x and y:

‖H^t δ_x − H^t δ_y‖_{L²}

Difference between heat distributions after time t .

(45)

Diffusion Maps

Embed using weighted eigenfunctions of the Laplacian:

x → ( e^{−λ₁ t} f₁(x), e^{−λ₂ t} f₂(x), . . . )

Diffusion distance is (approximated by) the distance between the embedded points.

Closely related to random walks on graphs.
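A sketch of this embedding using the random-walk normalization of the graph Laplacian (the normalization, kernel width, and t are our own choices for illustration):

```python
import numpy as np
from scipy.linalg import eig
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, t=1.0, m=2, width=1.0):
    """Diffusion map sketch: x -> (exp(-lam_1 t) f_1(x), ..., exp(-lam_m t) f_m(x))."""
    W = np.exp(-squareform(pdist(X, 'sqeuclidean')) / width)
    d = W.sum(axis=1)
    L_rw = np.eye(len(X)) - W / d[:, None]     # random-walk Laplacian I - D^{-1} W
    vals, vecs = eig(L_rw)
    idx = np.argsort(vals.real)[1:m + 1]       # smallest nontrivial eigenvalues
    lam, f = vals.real[idx], vecs[:, idx].real
    return f * np.exp(-lam * t)                # weight eigenfunctions by exp(-lam t)
```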

(46)

Justification

Find y 1 , . . . , y n ∈ R

min Σ_{i,j} (y_i − y_j)² W_ij

Tries to preserve locality

(47)

A Fundamental Identity

But

(1/2) Σ_{i,j} (y_i − y_j)² W_ij = y^T L y

Σ_{i,j} (y_i − y_j)² W_ij = Σ_{i,j} (y_i² + y_j² − 2 y_i y_j) W_ij
                          = Σ_i y_i² D_ii + Σ_j y_j² D_jj − 2 Σ_{i,j} y_i y_j W_ij
                          = 2 y^T L y
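The identity is easy to check numerically on random symmetric weights (a toy check of our own, not from the slides):

```python
import numpy as np

n = 6
W = np.random.rand(n, n)
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
y = np.random.randn(n)

lhs = ((y[:, None] - y[None, :]) ** 2 * W).sum()   # sum_{i,j} (y_i - y_j)^2 W_ij
print(np.isclose(lhs, 2 * y @ L @ y))              # True: equals 2 y^T L y
```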

(48)

Embedding

λ = 0 → y = 1

min_{y^T 1 = 0}  y^T L y

Let Y = [y₁ y₂ . . . y_m]

Σ_{i,j} ‖Y_i − Y_j‖² W_ij = trace(Y^T L Y)

subject to Y^T Y = I.

Use eigenvectors of L to embed.

(49)

PCA versus Laplacian Eigenmaps

[Figure: PCA projection vs. Laplacian Eigenmaps embedding of the same data]

(50)

On the Manifold

smooth map f : M → R

∫_M ‖∇_M f‖² ≈ Σ_{i∼j} W_ij (f_i − f_j)²

Recall the standard gradient in R^k of f(z₁, . . . , z_k):

∇f = ( ∂f/∂z₁, ∂f/∂z₂, . . . , ∂f/∂z_k )^T

(51)

Curves on Manifolds

Consider a curve on M

c(t) ∈ M,   t ∈ (−1, 1),   p = c(0),   q = c(τ)

f(c(t)) : (−1, 1) → R

|f(p) − f(q)| ≲ d_G(p, q) ‖∇_M f(p)‖

(52)

Stokes Theorem

A Basic Fact

∫_M ‖∇_M f‖² = ∫_M f · ∆_M f

This is like

Σ_{i,j} W_ij (f_i − f_j)² = f^T L f

where ∆_M is the manifold Laplacian.

(53)

Manifold Laplacian

Recall the ordinary Laplacian in R^k. This maps

f(x₁, . . . , x_k) → − Σ_{i=1}^{k} ∂²f / ∂x_i²

Manifold Laplacian is the same on the tangent space.

(54)

Properties of Laplacian

Eigensystem:

∆_M φ_i = λ_i φ_i

λ_i ≥ 0 and λ_i → ∞

{φ_i} form an orthonormal basis for L²(M)

∫_M ‖∇_M φ_i‖² = λ_i

(55)

The Circle: An Example

−d²u/dt² = λu,   where u(0) = u(2π)

Eigenvalues are

λ n = n 2

Eigenfunctions are

sin(nt), cos(nt)

(56)

From graphs to manifolds

f : M → R,   x ∈ M,   x₁, . . . , x_n ∈ M

Graph Laplacian:

L^t_n(f)(x) = f(x) Σ_j e^{−‖x − x_j‖²/t} − Σ_j f(x_j) e^{−‖x − x_j‖²/t}

Theorem [pointwise convergence]: for t_n = n^{−1/(k+2+α)},

lim_{n→∞} (4πt_n)^{−(k+2)/2} (1/n) L^{t_n}_n f(x) = ∆_M f(x)
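The operator L^t_n can be applied directly to sampled data; a small sketch of our own (without the (4πt_n)^{−(k+2)/2}/n normalization of the theorem; sample size and t are arbitrary):

```python
import numpy as np

def L_t_n(f, x, X, t=0.05):
    """L^t_n(f)(x) = f(x) * sum_j e^{-||x-x_j||^2/t} - sum_j f(x_j) e^{-||x-x_j||^2/t}."""
    w = np.exp(-((X - x) ** 2).sum(axis=1) / t)   # heat-kernel weights to the samples
    return f(x) * w.sum() - (f(X) * w).sum()

# toy usage: samples from the unit circle, f = first coordinate
theta = np.random.rand(2000) * 2 * np.pi
X = np.c_[np.cos(theta), np.sin(theta)]
f = lambda P: np.asarray(P)[..., 0]
print(L_t_n(f, np.array([1.0, 0.0]), X))
```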

(57)

From graphs to manifolds

Theorem [convergence of eigenfunctions]

lim_{t_n → 0, n → ∞}  Eig[L^{t_n}_n] → Eig[∆_M]

Belkin Niyogi 06

(58)

Estimating Dimension from Laplacian

λ₁ ≤ λ₂ ≤ . . . ≤ λ_j ≤ . . .

Then

A + (2/d) log(j) ≤ log(λ_j) ≤ B + (2/d) log(j + 1)

Example: on S¹

λ_j = j²  ⟹  log(λ_j) = (2/1) log(j)

(Li and Yau; Weyl’s asymptotics)
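This suggests reading off d from the slope of log λ_j against log j. A rough sketch (the fitting range and the recipe are our own illustration, not a procedure from the slides):

```python
import numpy as np

def estimate_dimension(eigvals, j_min=2, j_max=50):
    """Fit log(lambda_j) ~ a + (2/d) log(j) and return the implied dimension d."""
    j = np.arange(j_min, j_max)
    slope, _ = np.polyfit(np.log(j), np.log(eigvals[j]), 1)
    return 2.0 / slope

# circle example: lambda_j = j^2 gives slope 2, hence d = 1
print(estimate_dimension(np.arange(60) ** 2.0))
```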

(59)

Visualization

Data representation, dimensionality reduction, visualization

Visualizing spaces of digits.

Partiview, Ndaona, Surendran 04

(60)

Motion estimation

Markerless motion estimation: inferring joint angles.

Corazza, Andriacchi, Stanford Biomotion Lab, 05, Partiview, Surendran

Isometrically invariant representation. [link]

Eigenfunctions of the Laplacian are invariant under

isometries.

(61)

Graphics, etc

Laplacian from meshes/non-probabilistic point clouds.

Belkin, Sun, Wang 08, 09

(62)

Recall

Heat equation in R n :

u(x, t) – heat distribution at time t .

u(x, 0) = f (x) – initial distribution. x ∈ R n , t ∈ R .

∆_{R^n} u(x, t) = du/dt (x, t)

Solution – convolution with the heat kernel:

u(x, t) = (4πt)^{−n/2} ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy

(63)

Proof idea (pointwise convergence)

Functional approximation:

Taking limit as t → 0 and writing the derivative:

∆_{R^n} f(x) = d/dt |_{t=0} [ (4πt)^{−n/2} ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy ]

(64)

Proof idea (pointwise convergence)

Functional approximation:

Taking limit as t → 0 and writing the derivative:

∆_{R^n} f(x) = d/dt |_{t=0} [ (4πt)^{−n/2} ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy ]

∆_{R^n} f(x) ≈ −(1/t) (4πt)^{−n/2} ( f(x) − ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy )

(65)

Proof idea (pointwise convergence)

Functional approximation:

Taking limit as t → 0 and writing the derivative:

∆_{R^n} f(x) = d/dt |_{t=0} [ (4πt)^{−n/2} ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy ]

∆_{R^n} f(x) ≈ −(1/t) (4πt)^{−n/2} ( f(x) − ∫_{R^n} f(y) e^{−‖x−y‖²/4t} dy )

Empirical approximation:

The integral can be estimated from empirical data.

∆_{R^n} f(x) ≈ −(1/t) (4πt)^{−n/2} ( f(x) − Σ_i f(x_i) e^{−‖x−x_i‖²/4t} )

(66)

Some difficulties

Some difficulties arise for manifolds:

Do not know distances.

Do not know the heat kernel.

[Figure: ambient distance ‖x − y‖ vs. geodesic distance dist_M(x, y) on M]

(67)

Some difficulties

Some difficulties arise for manifolds:

Do not know distances.

Do not know the heat kernel.

[Figure: ambient distance ‖x − y‖ vs. geodesic distance dist_M(x, y) on M]

Careful analysis needed.

(68)

The Heat Kernel

H^t(x, y) = Σ_i e^{−λ_i t} φ_i(x) φ_i(y)

In R^d, closed form expression:

H^t(x, y) = (1/(4πt)^{d/2}) e^{−‖x−y‖²/4t}

Goodness of approximation depends on the gap

H^t(x, y) − (1/(4πt)^{d/2}) e^{−‖x−y‖²/4t}

H t is a Mercer kernel intrinsically defined on manifold.

(69)

Three Remarks on Noise

1. Arbitrary probability distribution on the manifold:

convergence to weighted Laplacian.

2. Noise off the manifold:

µ = µ_{M^d} + µ_{R^N}

Then

lim_{t→0} L^t f(x) = ∆f(x)

3. Noise off the manifold:

z = x + η,   η ∼ N(0, σ² I)

We have

(70)

NLDR: some references

A global geometric framework for nonlinear dimensionality reduction.

J.B. Tenenbaum, V. de Silva and J. C. Langford, 00.

Nonlinear Dimensionality Reduction by Locally Linear Embedding.

L. K. Saul and S. T. Roweis. 00

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.

M. Belkin and P. Niyogi, 01.

Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data. D. L.

Donoho and C. Grimes, 02.

Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment.

Zhenyue Zhang and Hongyuan Zha. 02.

Charting a manifold. Matthew Brand, 03

Diffusion Maps. R. Coifman and S. Lafon. 04.

Many more: http://www.cse.msu.edu/~lawhiu/manifold/

(71)

Unlabeled data

Reasons to use unlabeled data in inference:

Pragmatic:

Unlabeled data is everywhere. Need a way to use it.

Philosophical:

The brain uses unlabeled data.

(72)

Geometry of classification

How does shape of the data affect classification?

Manifold assumption.

Cluster assumption.

These assumptions reflect our understanding of the structure of natural data.

(73)

Intuition

(74)

Intuition

(75)

Intuition

(76)

Intuition

Geometry of data changes our notion of similarity.

(77)

Manifold assumption

(78)

Manifold assumption

(79)

Manifold assumption

Geometry is important.

(80)

Geodesic Nearest Neighbors

[Figure: error rate (%) vs. number of labeled points for k-NN and geodesic k-NN]

(81)

Cluster assumption

(82)

Cluster assumption


(83)

Unlabeled data

Geometry is important.

(84)

Unlabeled data

Geometry is important.

Unlabeled data to estimate geometry.

(85)

Manifold assumption

Manifold/geometric assumption:

functions of interest are smooth with respect to the

underlying geometry.

(86)

Manifold assumption

Manifold/geometric assumption:

functions of interest are smooth with respect to the underlying geometry.

Probabilistic setting:

Map X → Y . Probability distribution P on X × Y .

Regression/(two class)classification: X → R .

(87)

Manifold assumption

Manifold/geometric assumption:

functions of interest are smooth with respect to the underlying geometry.

Probabilistic setting:

Map X → Y . Probability distribution P on X × Y . Regression/(two class)classification: X → R .

Probabilistic version:

conditional distributions P (y | x) are smooth with respect to

the marginal P (x) .

(88)

What is smooth?

Function f : X → R. Penalty at x ∈ X:

(1/δ^{k+2}) ∫_{small δ} (f(x) − f(x + δ))² p(x) dδ ≈ ‖∇f‖² p(x)

Total penalty – Laplace operator:

∫_X ‖∇f‖² p(x) = ⟨f, ∆_p f⟩_X

(89)

What is smooth?

Function f : X → R. Penalty at x ∈ X:

(1/δ^{k+2}) ∫_{small δ} (f(x) − f(x + δ))² p(x) dδ ≈ ‖∇f‖² p(x)

Total penalty – Laplace operator:

∫_X ‖∇f‖² p(x) = ⟨f, ∆_p f⟩_X

Two-class classification – conditional P (1 | x) .

(90)

Example

[Figure: decision boundaries. Left: SVM, γ_A = 0.03125, γ_I = 0. Middle: Laplacian SVM, γ_A = 0.03125, γ_I = 0.01. Right: Laplacian SVM, γ_A = 0.03125, γ_I = 1.]

(91)

Example

[Figure: decision boundaries. Left: SVM, γ_A = 0.03125, γ_I = 0. Middle: Laplacian SVM, γ_A = 0.03125, γ_I = 0.01. Right: Laplacian SVM, γ_A = 0.03125, γ_I = 1.]

(92)

Regularization

Estimate f : R N → R

Data: ( x 1 , y 1 ), . . . , ( x l , y l )

Regularized least squares (hinge loss for SVM):

f = argmin_{f ∈ H}  (1/l) Σ_i (f(x_i) − y_i)² + λ ‖f‖²_K

fit to data + smoothness penalty

‖f‖_K incorporates our smoothness assumptions.

Choice of ‖·‖_K is important.

(93)

Algorithm: RLS/SVM

Solve:   f = argmin_{f ∈ H}  (1/l) Σ_i (f(x_i) − y_i)² + λ ‖f‖²_K

‖f‖_K is a Reproducing Kernel Hilbert Space norm with kernel K(x, y).

Can solve explicitly (via the Representer theorem):

f(·) = Σ_{i=1}^{l} α_i K(x_i, ·)

[α₁, . . . , α_l]^t = (K + λI)^{−1} [y₁, . . . , y_l]^t
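A sketch of the closed-form RLS solution with a Gaussian kernel (the kernel choice, its width, λ, and all names are our own illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rls_fit(X, y, lam=0.1, sigma=1.0):
    """RLS: alpha = (K + lam I)^{-1} y, and f(x) = sum_i alpha_i K(x_i, x)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    predict = lambda Xnew: np.exp(-cdist(Xnew, X, 'sqeuclidean') / (2 * sigma ** 2)) @ alpha
    return alpha, predict
```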

(94)

Manifold regularization

Estimate f : R N → R

Labeled data: ( x 1 , y 1 ), . . . , ( x l , y l )

Unlabeled data: x_{l+1}, . . . , x_{l+u}

f = argmin_{f ∈ H}  (1/l) Σ_i (f(x_i) − y_i)² + λ_A ‖f‖²_K + λ_I ‖f‖²_I

fit to data + extrinsic smoothness + intrinsic smoothness

Empirical estimate:

‖f‖²_I = (1/(l + u)²) [f(x₁), . . . , f(x_{l+u})] L [f(x₁), . . . , f(x_{l+u})]^t

(95)

Laplacian RLS/SVM

Representer theorem (discrete case):

f(·) = Σ_{i=1}^{l+u} α_i K(x_i, ·)

Explicit solution for quadratic loss:

ᾱ = ( J K + λ_A l I + (λ_I l / (u + l)²) L K )^{−1} [y₁, . . . , y_l, 0, . . . , 0]^t

(K)_ij = K(x_i, x_j),   J = diag(1, . . . , 1, 0, . . . , 0)   (l ones followed by u zeros)
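The LapRLS coefficients follow the same pattern; a sketch with a Gaussian kernel and a fully connected heat-kernel graph (graph construction and all parameter values are our own choices, not the original implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def laprls_fit(X, y_labeled, lam_A=1e-2, lam_I=1e-2, sigma=1.0, t=1.0):
    """alpha = (J K + lam_A l I + lam_I l/(l+u)^2 L K)^{-1} [y_1..y_l, 0..0]^t.
    X holds the l labeled points first, then the u unlabeled ones."""
    n, l = len(X), len(y_labeled)
    D_sq = cdist(X, X, 'sqeuclidean')
    K = np.exp(-D_sq / (2 * sigma ** 2))                    # RKHS kernel
    W = np.exp(-D_sq / t)
    np.fill_diagonal(W, 0)
    L = np.diag(W.sum(axis=1)) - W                          # graph Laplacian
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])         # selects labeled points
    y = np.r_[y_labeled, np.zeros(n - l)]
    A = J @ K + lam_A * l * np.eye(n) + (lam_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, y)
```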

(96)

Experimental results: USPS

[Figure: error rates on 45 binary USPS classification problems. Panels: RLS vs LapRLS, SVM vs LapSVM, TSVM vs LapSVM; out-of-sample extension for LapRLS and LapSVM (unlabeled vs. test); standard deviation of error rates for SVM, TSVM and LapSVM.]

(97)

Experimental comparisons

Dataset →           g50c    Coil20   Uspst   mac-win   WebKB(link)   WebKB(page)   WebKB(page+link)
Algorithm ↓
SVM (full labels)    3.82     0.0     3.35     2.32       6.3           6.5            1.0
SVM (l labels)       8.32    24.64   23.18    18.87      25.6          22.2           15.6
Graph-Reg           17.30     6.20   21.30    11.71      22.0          10.7            6.6
TSVM                 6.87    26.26   26.46     7.44      14.5           8.6            7.8
Graph-density        8.32     6.43   16.92    10.48       -             -              -
∇TSVM                5.80    17.56   17.61     5.71       -             -              -
LDS                  5.62     4.86   15.79     5.13       -             -              -
LapSVM               5.44     3.66   12.67    10.41      18.1          10.5            6.4

(98)

Geometry of clustering

Probability distribution P .

What are clusters? Geometric question.

How does one estimate clusters given finite data?

(99)

Spectral graph clustering

[Figure: 6-node graph (two triangles joined by an edge), nodes labeled with eigenvector values ±0.46, ±0.26]

(100)

Spectral graph clustering

[Figure: 6-node graph (two triangles joined by an edge), nodes labeled with eigenvector values ±0.46, ±0.26]

(101)

Spectral graph clustering

[Figure: 6-node graph (two triangles joined by an edge), nodes labeled with eigenvector values]

L =
[  2  −1  −1   0   0   0 ]
[ −1   2  −1   0   0   0 ]
[ −1  −1   3  −1   0   0 ]
[  0   0  −1   3  −1  −1 ]
[  0   0   0  −1   2  −1 ]
[  0   0   0  −1  −1   2 ]

Unnormalized clustering:

L e₁ = λ₁ e₁,   e₁ = [−0.46, −0.46, −0.26, 0.26, 0.46, 0.46]

(102)

Spectral graph clustering

[Figure: the same 6-node graph, nodes labeled with normalized eigenvector values]

L = (the same 6 × 6 graph Laplacian as above)

Unnormalized clustering:

L e₁ = λ₁ e₁,   e₁ = [−0.46, −0.46, −0.26, 0.26, 0.46, 0.46]

Normalized clustering:

L e₁ = λ₁ D e₁,   e₁ = [−0.31, −0.31, −0.18, 0.18, 0.31, 0.31]
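Both variants can be checked on this 6-node graph (two triangles joined by one edge) in a few lines; this is our own numerical check of the values on the slide (eigenvector signs may flip):

```python
import numpy as np
from scipy.linalg import eigh

W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W

_, vecs = eigh(L)                    # unnormalized: L e = lambda e
print(np.round(vecs[:, 1], 2))       # ~ [-0.46 -0.46 -0.26  0.26  0.46  0.46]

_, vecs_n = eigh(L, D)               # normalized: L e = lambda D e
print(np.round(vecs_n[:, 1], 2))     # ~ [-0.31 -0.31 -0.18  0.18  0.31  0.31]
```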

(103)

Graph Clustering: Mincut

Mincut: minimize the number (total weight) of edges cut.

argmin_S  Σ_{i∈S, j∈V−S} w_ij

(104)

Graph Laplacian

[Figure: 6-node graph with values f₁, . . . , f₆ at the nodes]

L = (the same 6 × 6 graph Laplacian as above)

Basic fact:

Σ_{i∼j} (f_i − f_j)² w_ij = (1/2) f^t L f

(105)

Graph Laplacian

[Figure: 6-node graph with values f₁, . . . , f₆ at the nodes]

L = (the same 6 × 6 graph Laplacian as above)

argmin_S Σ_{i∈S, j∈V−S} w_ij = argmin_{f_i ∈ {−1,1}} Σ_{i∼j} (f_i − f_j)² = (1/8) argmin_{f_i ∈ {−1,1}} f^t L f

Relaxation gives eigenvectors.

(106)

Consistency of spectral clustering

Limit behavior of spectral clustering.

x₁, . . . , x_n,   n → ∞

Sampled from probability distribution P on X .

Theorem 1:

Normalized spectral clustering (bisectioning) is consistent.

Theorem 2:

Unnormalized spectral clustering may not converge depending on the spectrum of L and P .

von Luxburg Belkin Bousquet 04

(107)

Continuous Cheeger clustering

Isoperimetric problem. Cheeger constant.

[Figure: manifold M partitioned into M₁ and M − M₁ by the boundary δM₁]

h = inf  vol_{n−1}(δM₁) / min( vol_n(M₁), vol_n(M − M₁) )

(108)

Continuous spectral clustering

Laplacian eigenfunction as a relaxation of the isoperimetric problem.

[Figure: cut δM₁ separating M₁ from M; "cut to cluster": the eigenfunction e₁ changes sign across the cut]

∆ e₁ = λ₁ e₁

h = inf  vol_{n−1}(δM₁) / min( vol_n(M₁), vol_n(M − M₁) )

0 = λ₀ ≤ λ₁ ≤ λ₂ ≤ . . .

h/2 ≤ √λ₁   [Cheeger]

(109)

Estimating volumes of cuts

[Figure: point cloud with a cut δS separating two groups of points]

Σ_{i∈blue} Σ_{j∈red}  w_ij / √(d_i d_j)

w_ij = e^{−‖x_i − x_j‖²/4t},    d_i = Σ_j w_ij

Theorem:

vol(δS) ≈ (2/N) (4πt)^{−n/2} √(π/t) · 1_S^t L 1_S

where L is the normalized graph Laplacian and 1_S is the indicator vector of points in S. (Narayanan, Belkin, Niyogi, 06)

(110)

Clustering

Clustering is all about geometry of unlabeled data (no labeled data!).

Need to combine probability density with the geometry of

the total space.

(111)

Future Directions

Machine Learning: Scaling Up

Multi-scale

Geometry of Natural Data

Geometry of Structured Data

Algorithmic Nash embedding

Graphics / Non-randomly sampled data

Random Hodge Theory

Partial Differential Equations
