eNote 2
Principal Component Analysis, PCA, in R
Indhold
2 Principal Component Analysis, PCA, inR 1
2.1 Reading about PCA . . . 2
2.2 Example: Fisher’s Iris Data. . . 6
2.2.1 Data import . . . 7
2.2.2 Basic explorative analysis . . . 9
2.2.3 PCA of Iris data . . . 12
2.3 Spectral data example: yarn data . . . 24
2.4 Exercises . . . 27
2.1 Reading about PCA
You can use the Wehrens book, Chapter 4, pp 43-56:
http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/
page/1
and/or (probably better) the Varmuza-book, chapter 3, sections 3.1 - 3.7:
http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2
The two R-packageschemometricsandChemometricswithR, are companions to the two books.
Bro and Smilde (2014): Principal Component Analysis Analytical Methods TUTORIAL REVIEW, 6, 2812.
http://pubs.rsc.org/en/content/articlepdf/2014/ay/c3ay41907j
Below there will be a number of important plots examplified as part of the iris-example:
1. Variance-plots (”scree-type”plots)
2. Scores, loadings and biplots (main plots for interpretation of structure) 3. Explained variances for each variable
4. Validation/diagnostics plots:
(a) Leverage and residuals (also called ”score distances”and ”orthogonal distances”(cf.
the nice Figure 3.15, page 79 in the Varmuza-book) (b) The ”influence plot”: residuals versus leverage
5. Jacknifing/bootstrapping/Crossvalidating the PCA for various purposes:
(a) Deciding on number of components
(b) Sensitivity/uncertainty investigation of scores and loadings.
What is PCA: Developed by Karl Pearson in 1901:
Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.
May also be called:
• Singular value decomposition
• Karhunen-Lo´eve expansion
• Eigenvector analysis
• Latent vector analysis
• Characteristic vector analysis PCA is used for many things:
• Projection method
• Exploratory data analysis
• Extract information and remove noise
• Reduce dimensionality / Compression
• (Clustering)
And can be described/expressed in many ways:
• Produces optimal low-dimensional plots of observations (scores)
• Provides an overview of the variable correlation structure (loadings)
• Finds linear combinations of maximal variance
• Orthogonal distance regression method
• A bilinear model for the data
And can be described/expressed in many ways:
X: The (centered and scaled)n×p−data matrix
X=Observation Scores×Variable Loadings+Error
X= TPT+E
Xij =
∑
A a=1tiapaj+eij
Computations/A bit of math:
• PCA findsX-components with maximalY-variance:
||maxα||=1Var(Xα)
• PCA is the least squares fit of the bilinear (non linear regression) model:
mint,p
∑
ij
(xij−
∑
A a=1tiapaj)2
• PCA is the eigen decomposition ofXtX
• PCA is the eigen decomposition ofXXt
• PCA is the outcome of (a version of) the NIPALS algorithm
2.2 Example: Fisher’s Iris Data
Below there will be an exercise based on these data with some questions that PCA can be helpful in answering. Here we examplify a number of visualizations that one could do for such data including PCA-based stuff.
The Fisher Iris data-set is classic, c.f.:
• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problem. Annals of Eugenics 7: 179-188.
• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Socie- ty 59: 2-5.
There are 150 objects, 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measures by a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width PW), all in all only four variables.
The original hypothesis was that I. versicolor was a hybrid of the two other species i.e. I. setosa x virginica. I. setosa is diploid; I. virginica is a tetraploid; and I. versicolor is hexaploid.
2.2.1 Data import
The iris data can allready be found within R, so no import is needed:
# Loading package related to Varmuza-book
# (First time you need to install the package) library(ChemometricsWithRData)
library(ChemometricsWithR) data(iris)
Or read the IRIS csv-data which is a copy of the file uploaded on CampusNet. Note that the Iris data given in CampusNet is slightly different from the IRIS data available. First save the data set on your computer and set the relevant working direcctory inR, e.g. by clikcing ’Session’ and choosinf ’Set working directory’, or run the following command with the correct chosen folder path:
setwd("C:/myfolderpath")
And then import the data intoRas follows:
JCFiris=read.table("Fisher_JCF.csv",header=T,sep=";",dec=",")
Note that the Iris data given by JCF is slightly different from the IRIS data available inR:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3 Median :5.80 Median :3.00 Median :4.35 Median :1.3 Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8 Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
Species setosa :50 versicolor:50 virginica :50
summary(JCFiris)
X PW PL SW
setosa :50 Min. : 1.0 Min. :10.0 Min. :20.0 versicolor:50 1st Qu.: 3.0 1st Qu.:16.0 1st Qu.:28.0 virginica :50 Median :13.0 Median :44.0 Median :30.0 Mean :11.9 Mean :37.8 Mean :30.6 3rd Qu.:18.0 3rd Qu.:51.0 3rd Qu.:33.0 Max. :25.0 Max. :69.0 Max. :44.0 SL
Min. : 43.0 1st Qu.: 51.0 Median : 58.0 Mean : 62.6 3rd Qu.: 64.0 Max. :699.0
Note the differences: The names, order and scales. AND: an outlier in the JCF-version has been changed in theR-version. Look at the first 6 observations:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
head(JCFiris)
X PW PL SW SL 1 setosa 2 14 33 50 2 virginica 24 56 31 67 3 virginica 23 51 31 69 4 setosa 2 10 36 46 5 virginica 20 52 30 65 6 virginica 19 51 27 58
The dimensions are the same:
dim(iris)
[1] 150 5
dim(JCFiris)
[1] 150 5
2.2.2 Basic explorative analysis
First we do some classic (univariate) explorative analysis:
# 4 boxplots with color:
par(mar=c(4,2,3,2),mfrow=c(2,2))
for (i in 1:4) boxplot(iris[,i] ~ iris[,5],
col = 1:3, main = names(iris)[i])
●
setosa versicolor virginica
4.55.56.57.5
Sepal.Length
●
setosa versicolor virginica
2.02.53.03.54.0
Sepal.Width
●
●
setosa versicolor virginica
1234567
Petal.Length
●
●
setosa versicolor virginica
0.51.01.52.02.5
Petal.Width
The par(mar=c(4,2,3,2)) command controls the four margins of each individual plot in the order: bottom, left, top, right. This is helpful to make nice multi-plot pages.
# Pairwise scatters:
pairs(iris,col = iris$Species)