• Ingen resultater fundet

Principal Component Analysis, PCA, in R

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "Principal Component Analysis, PCA, in R"

Copied!
33
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

eNote 2

Principal Component Analysis, PCA, in R

(2)

Indhold

2 Principal Component Analysis, PCA, inR 1

2.1 Reading about PCA . . . 2

2.2 Example: Fisher’s Iris Data. . . 6

2.2.1 Data import . . . 7

2.2.2 Basic explorative analysis . . . 9

2.2.3 PCA of Iris data . . . 12

2.3 Spectral data example: yarn data . . . 24

2.4 Exercises . . . 27

2.1 Reading about PCA

You can use the Wehrens book, Chapter 4, pp 43-56:

http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/

page/1

and/or (probably better) the Varmuza-book, chapter 3, sections 3.1 - 3.7:

http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

The two R-packageschemometricsandChemometricswithR, are companions to the two books.

Bro and Smilde (2014): Principal Component Analysis Analytical Methods TUTORIAL REVIEW, 6, 2812.

http://pubs.rsc.org/en/content/articlepdf/2014/ay/c3ay41907j

(3)

Below there will be a number of important plots examplified as part of the iris-example:

1. Variance-plots (”scree-type”plots)

2. Scores, loadings and biplots (main plots for interpretation of structure) 3. Explained variances for each variable

4. Validation/diagnostics plots:

(a) Leverage and residuals (also called ”score distances”and ”orthogonal distances”(cf.

the nice Figure 3.15, page 79 in the Varmuza-book) (b) The ”influence plot”: residuals versus leverage

5. Jacknifing/bootstrapping/Crossvalidating the PCA for various purposes:

(a) Deciding on number of components

(b) Sensitivity/uncertainty investigation of scores and loadings.

What is PCA: Developed by Karl Pearson in 1901:

Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.

(4)

May also be called:

• Singular value decomposition

• Karhunen-Lo´eve expansion

• Eigenvector analysis

• Latent vector analysis

• Characteristic vector analysis PCA is used for many things:

• Projection method

• Exploratory data analysis

• Extract information and remove noise

• Reduce dimensionality / Compression

• (Clustering)

And can be described/expressed in many ways:

(5)

• Produces optimal low-dimensional plots of observations (scores)

• Provides an overview of the variable correlation structure (loadings)

• Finds linear combinations of maximal variance

• Orthogonal distance regression method

• A bilinear model for the data

And can be described/expressed in many ways:

X: The (centered and scaled)n×p−data matrix

X=Observation Scores×Variable Loadings+Error

X= TPT+E

(6)

Xij =

A a=1

tiapaj+eij

Computations/A bit of math:

• PCA findsX-components with maximalY-variance:

||maxα||=1Var(Xα)

• PCA is the least squares fit of the bilinear (non linear regression) model:

mint,p

ij

(xij

A a=1

tiapaj)2

• PCA is the eigen decomposition ofXtX

• PCA is the eigen decomposition ofXXt

• PCA is the outcome of (a version of) the NIPALS algorithm

2.2 Example: Fisher’s Iris Data

Below there will be an exercise based on these data with some questions that PCA can be helpful in answering. Here we examplify a number of visualizations that one could do for such data including PCA-based stuff.

The Fisher Iris data-set is classic, c.f.:

(7)

• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problem. Annals of Eugenics 7: 179-188.

• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Socie- ty 59: 2-5.

There are 150 objects, 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measures by a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width PW), all in all only four variables.

The original hypothesis was that I. versicolor was a hybrid of the two other species i.e. I. setosa x virginica. I. setosa is diploid; I. virginica is a tetraploid; and I. versicolor is hexaploid.

2.2.1 Data import

The iris data can allready be found within R, so no import is needed:

# Loading package related to Varmuza-book

# (First time you need to install the package) library(ChemometricsWithRData)

library(ChemometricsWithR) data(iris)

Or read the IRIS csv-data which is a copy of the file uploaded on CampusNet. Note that the Iris data given in CampusNet is slightly different from the IRIS data available. First save the data set on your computer and set the relevant working direcctory inR, e.g. by clikcing ’Session’ and choosinf ’Set working directory’, or run the following command with the correct chosen folder path:

setwd("C:/myfolderpath")

And then import the data intoRas follows:

JCFiris=read.table("Fisher_JCF.csv",header=T,sep=";",dec=",")

Note that the Iris data given by JCF is slightly different from the IRIS data available inR:

(8)

summary(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3 Median :5.80 Median :3.00 Median :4.35 Median :1.3 Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8 Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5

Species setosa :50 versicolor:50 virginica :50

summary(JCFiris)

X PW PL SW

setosa :50 Min. : 1.0 Min. :10.0 Min. :20.0 versicolor:50 1st Qu.: 3.0 1st Qu.:16.0 1st Qu.:28.0 virginica :50 Median :13.0 Median :44.0 Median :30.0 Mean :11.9 Mean :37.8 Mean :30.6 3rd Qu.:18.0 3rd Qu.:51.0 3rd Qu.:33.0 Max. :25.0 Max. :69.0 Max. :44.0 SL

Min. : 43.0 1st Qu.: 51.0 Median : 58.0 Mean : 62.6 3rd Qu.: 64.0 Max. :699.0

Note the differences: The names, order and scales. AND: an outlier in the JCF-version has been changed in theR-version. Look at the first 6 observations:

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

(9)

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

head(JCFiris)

X PW PL SW SL 1 setosa 2 14 33 50 2 virginica 24 56 31 67 3 virginica 23 51 31 69 4 setosa 2 10 36 46 5 virginica 20 52 30 65 6 virginica 19 51 27 58

The dimensions are the same:

dim(iris)

[1] 150 5

dim(JCFiris)

[1] 150 5

2.2.2 Basic explorative analysis

First we do some classic (univariate) explorative analysis:

# 4 boxplots with color:

par(mar=c(4,2,3,2),mfrow=c(2,2))

for (i in 1:4) boxplot(iris[,i] ~ iris[,5],

col = 1:3, main = names(iris)[i])

(10)

setosa versicolor virginica

4.55.56.57.5

Sepal.Length

setosa versicolor virginica

2.02.53.03.54.0

Sepal.Width

setosa versicolor virginica

1234567

Petal.Length

setosa versicolor virginica

0.51.01.52.02.5

Petal.Width

The par(mar=c(4,2,3,2)) command controls the four margins of each individual plot in the order: bottom, left, top, right. This is helpful to make nice multi-plot pages.

# Pairwise scatters:

pairs(iris,col = iris$Species)

Referencer

RELATEREDE DOKUMENTER

21 7.2 Exercise: Prostate Cancer data and maybe pectin

Merc 280C Lincoln Continental Chrysler ImperialFiat 128 Cadillac Fleetwood Toyota Corolla Merc 450SLC Merc 450SE Merc 450SL Toyota Corona. Dodge Challenger AMC Javelin

Spectral measurements on sugar samples from three different factories (B, E, F): 34 samples and 1023 variables (Lars Nørgaard example)..

The general idea of the developed prototype is: (1) the use of standard and open file-based exchange with flexibility in data input to support use across different design stages;

Hvis det ikke er muligt at administrere intrathekal eller epidural opioid eller i tilfælde af generel anæstesi, kan opioid PCA anvendes til postoperativ smertebehandling.. Forslag

During the 1970s, Danish mass media recurrently portrayed mass housing estates as signifiers of social problems in the otherwise increasingl affluent anish

Data from the Comprehensive Assessment of Leadership for Learning (CALL) survey indicates that the ways in which leaders and staff look at school development vary

In the following two Sections we will describe how to use two methods, maximum autocorrelation factors [2] and minimum noise fractions [4], for decomposing the tangent space