
(1)

Cluster analysis

Jens C. Frisvad BioCentrum-DTU

Biological data analysis and chemometrics

Based on H.C. Romesburg: Cluster analysis for researchers, Lifetime Learning Publications, Belmont, CA, 1984

P.H.A. Sneath and R.R. Sokal: Numerical Taxonomy, Freeman, San Francisco, CA, 1973

(2)

Two primary methods

• Cluster analysis (no projection)

– Hierarchical clustering
– Divisive clustering

– Fuzzy clustering

• Ordination (projection)

– Principal component analysis
– Correspondence analysis

– Multidimensional scaling

(3)

Advantages of cluster analysis

• Good for a quick overview of data

• Good if there are many groups in data

• Good if unusual similarity measures are needed

• Can be added on ordination plots (often as a minimum spanning tree, however)

• Good for nearest neighbours; ordination is better for the deeper relationships

(4)

Different clustering methods

• NCLAS: Agglomerative clustering by distance optimization

• HMCL: Agglomerative clustering by homogeneity optimization

• INFCL: Agglomerative clustering by information theory criteria

• MINGFC: Agglomerative clustering by global optimization

• ASSIN: Divisive monothetic clustering

• PARREL: Partitioning by global optimization

• FCM: Fuzzy c-means clustering

• MINSPAN: Minimum spanning tree

• REBLOCK: Block clustering (k-means clustering)

(5)

SAHN clustering

• Sequential agglomerative hierarchic nonoverlapping clustering

(6)

Single linkage

• Nearest neighbor, minimum method

• Close to minimum spanning tree

• Contracting space

• Chaining possible

• αJ = 0.5, αK = 0.5, β = 0, γ = -0.5

• UJ,K = min Ujk

[Figure: sketches of clusters J, K and L]

• Combinatorial (Lance-Williams) update formula: U(JK),L = αJ UJ,L + αK UK,L + β UJ,K + γ |UJ,L - UK,L|
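As an illustration (not part of the original slides), the combinatorial update above fits in a few lines of Python; the function name lance_williams is mine, and the example distances are taken from the worked farm-plot example later in these slides (objects 3 and 4 merged, distances to object 1 being 20.6 and 22.4).

```python
# Sketch of the combinatorial (Lance-Williams) update formula above.
# u_jl, u_kl, u_jk are the distances U(J,L), U(K,L) and U(J,K) before
# clusters J and K are merged; the result is U((JK),L).
def lance_williams(u_jl, u_kl, u_jk, alpha_j, alpha_k, beta, gamma):
    return (alpha_j * u_jl + alpha_k * u_kl
            + beta * u_jk + gamma * abs(u_jl - u_kl))

# Single linkage (alpha_J = alpha_K = 0.5, beta = 0, gamma = -0.5)
# reduces to min(U(J,L), U(K,L)); gamma = +0.5 gives complete linkage (max).
print(round(lance_williams(20.6, 22.4, 5.0, 0.5, 0.5, 0.0, -0.5), 2))  # 20.6 (min)
print(round(lance_williams(20.6, 22.4, 5.0, 0.5, 0.5, 0.0, +0.5), 2))  # 22.4 (max)
```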

(7)
(8)
(9)

Complete linkage

• Furthest neighbor, maximum method

• Dilating space

• αJ = 0.5, αK = 0.5, β = 0, γ = 0.5

• UJ,K = max Ujk

(10)
(11)
(12)

Average linkage

• Arithmetic average

– Unweighted: UPGMA (group average)
– Weighted: WPGMA

• Centroid

– Unweighted centroid (Centroid)
– Weighted centroid (Median)
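A sketch (not from the slides) of how the linkage variants listed above map onto SciPy's hierarchical clustering; the toy data matrix is the farm-plot example used later in these slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Rows = objects, columns = features (the farm-plot example from later slides).
X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], dtype=float)

# SciPy method names versus the variants listed above:
#   'single'   nearest neighbour      'complete' furthest neighbour
#   'average'  UPGMA (group average)  'weighted' WPGMA
#   'centroid' unweighted centroid    'median'   weighted centroid
for method in ["single", "complete", "average", "weighted", "centroid", "median"]:
    Z = linkage(X, method=method, metric="euclidean")
    print(method, np.round(Z[:, 2], 2))   # fusion distances, step by step
```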

(13)

From Sneath and Sokal, 1973, Numerical taxonomy

(14)

Ordinary clustering

• Obtain the data matrix

• Transform or standardize the data matrix

• Select the best resemblance or distance measure

• Compute the resemblance matrix

• Execute the clustering method (often UPGMA = average linkage)

• Rearrange the data and resemblance matrices

• Compute the cophenetic correlation coefficient
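These steps map almost one-to-one onto a short SciPy sketch; this is only an illustration with made-up numbers, assuming autoscaling and UPGMA as the common default choices mentioned above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cophenet

# 1. Data matrix: rows = objects, columns = features (made-up numbers).
X = np.array([[1.0, 4.2, 0.3],
              [0.9, 4.0, 0.5],
              [3.1, 1.2, 2.8],
              [2.9, 1.0, 3.0]])

# 2. Transform or standardize (autoscaling: zero mean, unit variance).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 3-4. Choose a resemblance measure and compute the resemblance matrix.
d = pdist(Xs, metric="euclidean")
print(np.round(squareform(d), 2))

# 5-6. Execute the clustering (UPGMA); the dendrogram ordering takes care
#      of rearranging the data and resemblance matrices.
Z = linkage(d, method="average")

# 7. Cophenetic correlation coefficient.
r, _ = cophenet(Z, d)
print(round(r, 3))
```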

(15)

Binary similarity coefficients

(between two objects i and j)

              object j
               1    0
object i  1    a    b
          0    c    d

(16)

Matches and mismatches

• m = a + b (number of matches)

• u = c + d (number of mismatches)

• n = m + u = a + b + c + d (total sample size)

• Similarity (often 0 to 1)

• Dissimilarity (distance) (often 0 to 1)

• Correlation (-1 to 1)

(17)

Simple matching coefficient

• SM = (a + d) / (a + b + c + d) = m / n

• Euclidean distance for binary data:

• D = 1 - SM = (b + c) / (a + b + c + d) = u / n

(18)

Avoiding zero-zero comparisons

• Jaccard = J = a / (a + b + c)

• Sørensen or Dice: DICE = 2a / (2a + b + c)
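A small sketch (an illustration, not from the slides) that counts a, b, c, d for two made-up presence/absence vectors and computes SM, its complement, Jaccard and Dice as defined above.

```python
import numpy as np

def binary_counts(x, y):
    """Return a, b, c, d for two 0/1 vectors (presence/absence data)."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)      # 1-1 matches
    b = np.sum(x & ~y)     # 1-0 mismatches
    c = np.sum(~x & y)     # 0-1 mismatches
    d = np.sum(~x & ~y)    # 0-0 matches
    return a, b, c, d

x = [1, 1, 0, 0, 1, 0, 1, 0]          # made-up presence/absence data
y = [1, 0, 0, 1, 1, 0, 1, 1]
a, b, c, d = binary_counts(x, y)
n = a + b + c + d

SM   = (a + d) / n                    # simple matching coefficient
D    = (b + c) / n                    # its complement, 1 - SM
J    = a / (a + b + c)                # Jaccard (ignores 0-0 matches)
DICE = 2 * a / (2 * a + b + c)        # Sørensen / Dice
print(SM, D, J, DICE)
```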

(19)

Correlation coefficients

Yule: (ad – bc) / (ad + bc)

PHI = (ad - bc) / √( (a + b)(c + d)(a + c)(b + d) )
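PHI is the Pearson product-moment correlation applied to 0/1 data, so the reconstructed formula can be checked against numpy's correlation; the vectors below are the same made-up example as before.

```python
import numpy as np

x = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # made-up 0/1 data
y = np.array([1, 0, 0, 1, 1, 0, 1, 1])
a = np.sum((x == 1) & (y == 1)); b = np.sum((x == 1) & (y == 0))
c = np.sum((x == 0) & (y == 1)); d = np.sum((x == 0) & (y == 0))

yule = (a * d - b * c) / (a * d + b * c)
phi  = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# PHI equals the ordinary Pearson correlation of the two 0/1 vectors.
print(round(yule, 3), round(phi, 3), round(np.corrcoef(x, y)[0, 1], 3))
```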

(20)

Other binary coefficients

• Hamann = H = (a + d - b - c) / (a + b + c + d)

• Rogers and Tanimoto = RT = (a + d) / (a + 2b + 2c + d)

• Russell and Rao = RR = a / (a + b + c + d)

• Kulczynski 1 = K1 = a / (b + c)

• UN1 = (2a + 2d) / (2a + b + c + 2d)

• UN2 = a / (a + 2b + 2c)

• UN3 = (a + d) / (b + c)

(21)

Distances for quantitative (interval) data:
Euclidean and taxonomic distance

• EUCLID = Eij = √( Σk (xki - xkj)² )

• DIST = dij = √( (1/n) Σk (xki - xkj)² )
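A sketch of the two distances, using objects 3 and 4 of the farm-plot example that appears later in these slides; the function names are mine.

```python
import numpy as np

def euclid(xi, xj):
    """EUCLID: Euclidean distance between two feature vectors."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return np.sqrt(np.sum((xi - xj) ** 2))

def dist(xi, xj):
    """DIST: taxonomic distance, Euclidean distance averaged over the n features."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return np.sqrt(np.sum((xi - xj) ** 2) / len(xi))

# Objects 3 and 4 of the farm-plot example (two features each).
print(euclid([30, 10], [30, 15]))   # 5.0, as in the resemblance matrix later
print(dist([30, 10], [30, 15]))     # 5.0 / sqrt(2), about 3.54
```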

(22)

Bray-Curtis and Canberra distance

• BRAYCURT = dij = Σk |xki - xkj| / Σk (xki + xkj)

• CANBERRA = (1/n) Σk ( |xki - xkj| / (xki + xkj) )

(23)

Average Manhattan distance (city block)

• MANHAT = Mij = (1/n) Σk |xki - xkj|
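SciPy offers these measures directly, with one caveat: its canberra and cityblock functions do not divide by the number of features n, so the sketch adds that factor to match the averaged forms above; the two objects are from the farm-plot example.

```python
import numpy as np
from scipy.spatial.distance import braycurtis, canberra, cityblock

x1 = np.array([10.0, 5.0])      # object 1 of the farm-plot example
x2 = np.array([20.0, 20.0])     # object 2
n = len(x1)

print(braycurtis(x1, x2))       # sum|xki - xkj| / sum(xki + xkj)
print(canberra(x1, x2) / n)     # averaged Canberra, as defined above
print(cityblock(x1, x2) / n)    # average Manhattan (city block) distance
```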

(24)

Chi-squared distance

• CHISQ = dij = √( Σk (1/xk.) (xki/x.i - xkj/x.j)² )   (xk. = row total, x.i and x.j = column totals)
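The formula above had to be reconstructed from a garbled slide, so treat this as an assumption: the sketch implements the chi-square distance between two columns of a counts table, with xk. the row totals and x.i, x.j the column totals.

```python
import numpy as np

def chisq_distance(X, i, j):
    """Chi-square distance between columns i and j of a counts table X
    (rows = features, columns = objects); based on the reconstructed
    formula above, so treat the exact form as an assumption."""
    X = np.asarray(X, float)
    row_tot = X.sum(axis=1)      # xk.
    col_tot = X.sum(axis=0)      # x.i, x.j
    diff = X[:, i] / col_tot[i] - X[:, j] / col_tot[j]
    return np.sqrt(np.sum(diff ** 2 / row_tot))

X = np.array([[10, 20, 30, 30, 5],      # farm-plot data, features x objects
              [ 5, 20, 10, 15, 10]], dtype=float)
print(chisq_distance(X, 2, 3))          # distance between objects 3 and 4
```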

(25)

Cosine coefficient

• COSINE = cij = Σk xki·xkj / √( (Σk xki²)(Σk xkj²) )
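A sketch of the cosine coefficient, checked against SciPy, which provides the complementary cosine distance (1 minus the coefficient); the two vectors are objects 1 and 2 of the farm-plot example.

```python
import numpy as np
from scipy.spatial.distance import cosine

x1 = np.array([10.0, 5.0])
x2 = np.array([20.0, 20.0])

# Direct implementation of the coefficient above ...
c12 = np.dot(x1, x2) / np.sqrt(np.dot(x1, x1) * np.dot(x2, x2))
# ... which equals 1 minus SciPy's cosine distance.
print(round(c12, 4), round(1 - cosine(x1, x2), 4))
```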

(26)

Step 1. Obtain the data matrix

             Object
Feature       1    2    3    4    5
   1         10   20   30   30    5
   2          5   20   10   15   10

(27)

Objects and features

• The five objects are plots of farm land

• The features are

– 1. Water-holding capacity (%)
– 2. Weight % soil organic matter

• Objective: find the two most similar plots

(28)
(29)

Resemblance matrix

      1      2      3      4      5
1     -
2     18.0   -
3     20.6   14.1   -
4     22.4   11.2   5.00   -
5     7.07   18.0   25.0   25.5   -
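The values above can be reproduced with SciPy, assuming plain Euclidean distance (which matches the printed numbers, e.g. d(3,4) = 5.00 and d(1,5) = 7.07).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Step 1 data matrix: rows = the five plots, columns = the two features.
X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], dtype=float)

# Euclidean distances reproduce the resemblance matrix above.
R = squareform(pdist(X, metric="euclidean"))
print(np.round(R, 2))
```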

(30)
(31)

Revised resemblance matrix

       1      2      5      (34)
1      -
2      18.0   -
5      7.07   18.0   -
(34)   21.5   12.7   25.3   -

(32)
(33)

Revised resemblance matrix

       2      (34)   (15)
2      -
(34)   12.7   -
(15)   18.0   23.4   -

(34)
(35)

Revised resemblance matrix

        (15)   (234)
(15)    -
(234)   21.6   -
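The three revisions above are the successive UPGMA fusions; a sketch with SciPy reproduces the same merge order and heights (5.00 for (34), 7.07 for (15), about 12.7 for (234), and about 21.6 for the final fusion).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], dtype=float)
Z = linkage(pdist(X), method="average")   # UPGMA

# Each row of Z is one fusion: [cluster id, cluster id, distance, size].
print(np.round(Z, 2))
```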

(36)
(37)
(38)
(39)

Cophenetic correlation coefficient

(Pearson product-moment correlation coefficient)

• A comparison of the similarities according to the similarity matrix and the similarities according to the dendrogram

• rX,Y = ( Σxy - (1/n)(Σx)(Σy) ) / √( [ Σx² - (1/n)(Σx)² ] [ Σy² - (1/n)(Σy)² ] )

where x are the values of the original resemblance matrix and y the corresponding cophenetic values read from the dendrogram.
(40)

NTSYS

• Import matrix

• Transpose matrix if objects are rows (they are supposed to be columns in NTSYS) (transp in transformation / general)

• Consider log1 or autoscaling (standardization)

• Select similarity or distance measure (similarity)

• Produce similarity matrix

(41)

NTSYS (continued)

• Select clustering procedure (often UPGMA) (clustering)

• Calculate cophenetic matrix (clustering)

• Compare the similarity matrix with the cophenetic matrix (made from the dendrogram) and write down the cophenetic correlation (graphics, matrix comparison)

• Write dendrogram (graphics, treeplot)
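For readers without NTSYS, a rough Python equivalent of the same menu steps (an assumption about the intended workflow, not the program itself): pdist stands in for the similarity module, linkage for the clustering module, cophenet for the cophenetic-matrix comparison, and matplotlib's dendrogram for treeplot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet, dendrogram

# Import matrix (rows = objects here; transpose first if they arrive as columns).
X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], dtype=float)

d = pdist(X, metric="euclidean")      # similarity/distance matrix
Z = linkage(d, method="average")      # clustering procedure (UPGMA)
r, _ = cophenet(Z, d)                 # cophenetic matrix and correlation

dendrogram(Z, labels=["1", "2", "3", "4", "5"])   # tree plot
plt.title(f"UPGMA dendrogram, cophenetic r = {r:.2f}")
plt.show()
```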
