R in 27411: Biological Data Analysis and Chemometrics

Per Bruun Brockhoff

DTU Compute, DK-2800 Lyngby, 22 March 2014

Contents

1 Introduction to R
1.1 R in 27411
1.2 RStudio
1.3 Getting started
1.3.1 Console and scripts
1.3.2 Assignments and vectors
1.4 Descriptive statistics
1.5 Import of data
1.6 Storage of Text and Graphics
1.7 Scatterplots

2 Introduction day
2.1 Height-weight example from Lattin Chapter 2
2.2 Exercise: EDUC scores data from the Lattin book

3 PCA, Principal Component Analysis
3.1 Example/Exercise: Fisher's Iris Data
3.2 Exercise: Wine Data (To be presented by Team 1 next time)

4 MLR, Multiple Linear Regression (OLS)
4.1 Example: Leslie Salt Data
4.2 Exercise: Prostate Cancer data

5 PCR, Principal Component Regression
5.1 Example: Leslie Salt Data (again)
5.2 Exercise: Prostate Cancer data

6 PLSR, Partial Least Squares Regression
6.1 Motivating example - Leslie Salt again
6.2 NIR Example: yarn data
6.3 Exercise: NIR data on Degree of Esterification

7 RR, Ridge and Lasso Regression
7.1 Example: PAC data from package chemometrics
7.2 Exercise: Prostate Cancer data and maybe pectin data

8 Discriminant Analysis, LDA, QDA, k-NN, Bayes, PLSDA
8.1 Example: Iris data
8.2 Exercise: Wine data

1 Introduction to R

The program R is an open-source statistics program that you can download to your own laptop for free. Go to http://mirrors.dotsrc.org/cran/ and select your platform (Windows, Mac, or Linux).

In the course 02441 Applied Statistics and Statistical Software, which is run as a three-week course in January every year, you will get the chance to work more with R on practical statistical problems - a directly project-oriented way to reach a more advanced level of working with the program (see http://www.imm.dtu.dk/courses/02441, where you can also find references to good (online) textbook material). Also, in the new version of the course 02418 (substitute for 02416) you will further develop your competences in using R for more advanced statistical analysis.

1.1 R in 27411

In Spring 2014, support for R is offered in course 27411 for the first time, and the material will be updated throughout the course. It should be seen in the course as an alternative or supplement to the software Unscrambler. Parts of the course will also use the software NTSYS. Some of the methods in this course can be covered with R-tools found in the base R installation, and some are available in various add-on R-packages. Not uncommonly, the methods are available in more than a single version, as there are by now thousands of R-packages available.

This R-note should be seen more as a help to find a good way through (or into) the tools, help and material already available than as providing lots of specific new tutorial help in itself. So it is mostly "copy-and-paste" from, and links to, other sources! The two main sources of inspiration for us are the two R-packages chemometrics and ChemometricsWithR, which again are companions to the two books:

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg.


Both of these books are available as ebooks (that you can access using your DTU ID):

• http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

• http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1

1.2 RStudio

RStudio is a free and open source integrated development environment (IDE) for R. You can run it on your desktop (Windows, Mac, or Linux) or even over the web using RStudio Server. It works as (an extended) alternative to running R in the basic way. This will be used in the course.

Download it from http://www.rstudio.com/ and follow the installation instructions. To use the software, you only need to open RStudio (not R itself).

1.3 Getting started

1.3.1 Console and scripts

Once you have opened RStudio, you will see a number of different windows. One of them is the console. Here you can write commands and execute them by hitting Enter. For instance:

> 2+3
[1] 5

In the console you cannot go back and change previous commands, nor can you save your work for later. To do this you need to write a script. Go to File → New → R Script. In the script you can write a line and execute it in the console by hitting Ctrl+Enter (Windows) or Cmd+Enter (Mac). You can also mark several lines and execute them all at the same time.

1.3.2 Assignments and vectors

If you want to assign a value to a variable, you either use = or <- . For instance:

> x = 2

> y <- 3

> x+y
[1] 5

It is often useful to assign a set of values to a variable like a vector. This is done with the function c (short for concatenate).

> x = c(1, 4, 6, 2)

> x

[1] 1 4 6 2

Use the colon (:) if you need a sequence, e.g. 1 to 10:

> x =1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10


You can also make a sequence with a specific stepsize different from 1 with seq(from, to, stepsize):

> x = seq( 0, 1, 0.1)

> x

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

If you are in doubt of how to use a certain function, the help page can be opened by typing ? followed by the function, e.g. ?seq.

1.4 Descriptive statistics

Some useful commands are:

• mean(x) - mean value of the vector x

• var(x) - variance

• sd(x) - standard deviation

• median(x) - median

• quantile(x,p) - finds the p'th quantile (e.g. p = 0.25 gives the 25th percentile). p can consist of several different values, e.g. quantile(x,c(0.25,0.75))

• boxplot(x) - draws a boxplot

Remember to use the help function, if you are in doubt about anything.
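For instance, on a small made-up data vector:

x = c(2.1, 3.4, 2.8, 5.0, 3.9, 4.2)   # made-up numbers, just for illustration
mean(x)
var(x)
sd(x)
median(x)
quantile(x, c(0.25, 0.75))   # 25% and 75% quantiles
boxplot(x)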

1.5 Import of data

Data is imported into R by using the "read.table" function. When importing data from a spreadsheet, it can sometimes be better to export the data from the spreadsheet to comma-delimited (CSV) format before importing it into R. (The package RODBC provides a more advanced way of directly importing e.g. Excel files, but we will not go into this here.) Examples will be given below.
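A minimal sketch of such an import (the file name "mydata.csv" is only a placeholder for your own exported file):

# Import a comma-delimited file with a header row:
mydata = read.table("mydata.csv", header = TRUE, sep = ",", dec = ".")
# read.csv("mydata.csv") is a shorthand with these defaults
head(mydata)   # first rows of the imported data frame
str(mydata)    # variable names and types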

1.6 Storage of Text and Graphics

Text from windows in R can be copied into other programs in the usual manner:

• mark the text by holding the left mouse button down and drag it over the desired text.

• open another program (e.g. StarOffice), place the pointer at the desired location and press the middle mouse button.

All text in the 'Commands Window' or the 'Report Window' can be stored in a text-file by activating the window and choosing 'File' → 'Save As ...'.

Graphics can be stored in a graphics-file by activating the graphics window and choosing 'File' → 'Save As ...'. It is possible to choose from a range of graphics formats (JPEG is the default). One may also explicitly define graphics devices, e.g. "pdf", to directly write the graphics to a pdf-file.
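A minimal sketch of the device approach (the file name is arbitrary):

# Open a pdf device, make the plot, and close the device again:
pdf("myplot.pdf", width = 7, height = 5)
plot(1:10, (1:10)^2, type = "b")
dev.off()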


1.7 Scatterplots

Have a look at: http://www.statmethods.net/graphs/scatterplot.html
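As a small appetizer, a basic scatterplot can be made as follows (using the built-in mtcars data, which is not part of the course material):

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "red")   # add a fitted line
pairs(mtcars[, c("mpg", "wt", "hp")])              # scatterplot matrix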

2 Introduction day

2.1 Height-weight example from Lattin Chapter 2

# Data from chapter 2:

X1=c(57,58,60,59,61,60,62,61,62,63,62,64,63,65,64,66,67,66,68,69)
X2=c(93,110,99,111,115,122,110,116,122,128,134,117,123,129,135,128,135,148,142,155)
mean(X1)
mean(X2)
sd(X1)
sd(X2)
X_d1=X1-mean(X1)
X_d2=X2-mean(X2)
X_s1=X_d1/sd(X1)
X_s2=X_d2/sd(X2)

# Table 2.1:
Tab21=cbind(X1,X2,X_d1,X_d2,X_s1,X_s2)
Tab21

# Raw data in matrix X and using the scale function:
X=cbind(X1,X2)
scale(X,scale=F)
scale(X)
X_s=scale(X)

# Means and standard deviations in columns:
apply(Tab21,2,mean)
apply(Tab21,2,sd)
plot(X1,X2)
plot(X_s1,X_s2)
abline(h=0,v=0)
arrows(0,0,1,1)

# How to find this direction from data:
cor(X)
t(X_s)%*%X_s/19
eigen(cor(X))

# Using notation from Chapter 2:
W=eigen(cor(X))$vectors
W # In PCA: loadings

# The projected values: (Fig. 2.8, Fig 2.10)
z1=X_s%*%W[,1]
z2=X_s%*%W[,2]
plot(z1,z2,ylim=c(-3,3),xlim=c(-3,3))
hist(z1,xlim=c(-3,3))
hist(z2,xlim=c(-3,3))
var(z1)
var(z2)


# These variances are also found as the so-called
# singular values or eigenvalues of the correlation matrix:
eigen(cor(X))$values

# And the D-matrix is the diagonal of the square-roots of these:
D=diag(sqrt(eigen(cor(X))$values))
D # In PCA: The explained variances

# And the z1 and z2 can be standardized by these sds:
z_s1=z1/sd(z1)
z_s2=z2/sd(z2)
z_s2

# Or the same could be extracted from the matrices:
Z_s=X_s%*%W%*%solve(D)
Z_s # In PCA: The standardized scores
var(Z_s)

# So we have done the SVD, check:
Z_s%*%D%*%t(W)
X_s

2.2 Exercise: EDUC scores data from the Lattin book

1. Have a look at the R introduction in Appendix 3 of the Varmuza-book:
   http://www.crcnetbase.com.globalproxy.cvt.dk/doi/pdfplus/10.1201/9781420059496.ax3
   (They do not mention RStudio, but this is still recommended.)

2. Complete Lattin exercise 2.2 by:

   (a) If needed: Install R and RStudio
   (b) Start RStudio
   (c) Import the EDUC SCORES data from chapter 2 in Lattin et al. (2003) (Hint: Use the file EDUC_SCORES.txt available under File Sharing in Campusnet)
   (d) Use the following code to solve the different steps:

# import:
educ_scores=read.table("EDUC_SCORES.txt",header=TRUE,sep=",",row.names=1)
educ_scores

# Select the relevant columns:
X=educ_scores[,1:3]
X

# a)
?apply
apply(X,2,mean)

# b)
?scale
Xd=scale(X,scale=F)
Xd

# c)
apply(X,2,sd)
Xs=scale(X)
Xs


# d)
?cov
t(Xd)%*%Xd
7*cov(X)

# e)
(t(Xd)%*%Xd)/7
cov(X)

# f)
(t(Xs)%*%Xs)/7
cor(X)

(e) Some additional matrix scatterplotting:

# Scatterplot Matrices using pairs:

pairs(X)

Gender=factor(educ_scores[,4])

pairs(X,main = "Scatterplots - Gender grouping", col = c("red", "blue")[Gender])

# Scatterplot Matrices from the lattice Package
library(lattice)
splom(X, groups=Gender)

# 3D Scatterplot
library(scatterplot3d)
scatterplot3d(X, main="3D Scatterplot")

# 3D Scatterplot with Coloring and Vertical Drop Lines
scatterplot3d(X, pch=16, highlight.3d=TRUE,
              type="h", main="3D Scatterplot")

# 3D Scatterplot with Coloring and Vertical Lines
# and Regression Plane
s3d <- scatterplot3d(X, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot")
fit <- lm(X[,3] ~ X[,1]+X[,2])
s3d$plane3d(fit)

# Spinning 3d Scatterplot
library(rgl)
plot3d(X, col="red", size=3)

3. Look at the Leslie Salt data (to be used for the MLR day). Import the Leslie Salt data from chapter 3 in Lattin et al. (2003) and do some initial explorative analysis:

# import:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
head(saltdata)
summary(saltdata)
dim(saltdata)
pairs(saltdata)

# Scatterplot Matrices from the gclus Package
library(gclus)
dta <- saltdata               # get data - just put your own data here!
dta.r <- abs(cor(dta))        # get correlations
dta.col <- dmat.color(dta.r)  # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)


cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
       main="Variables Ordered and Colored by Correlation")

# 8 Normal probability plots on the same plot page:
par(mfrow=c(2,4))
for (i in 1:8) qqnorm(saltdata[,i])

# Adding the log-price:
saltdata$logPrice=log(saltdata$Price)

# 9 histograms with color:
par(mfrow=c(3,3))
for (i in 1:9) hist(saltdata[,i],col=i)
attach(saltdata)

# scatterplot for logPrice: with added fit line
par(mfrow=c(1,1))
plot(logPrice~Size)
abline(lm(logPrice~Size), col="red")

# scatterplot for logPrice: identifying the observation numbers
par(mfrow=c(1,1))
plot(logPrice~Size,type="n")
text(Size,logPrice,labels=row.names(saltdata))
abline(lm(logPrice~Size), col="red")

# For all of them:
par(mfrow=c(2,4))
for (i in 2:8) {
  plot(logPrice~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],logPrice,labels=row.names(saltdata))
  abline(lm(logPrice~saltdata[,i]), col="red")
}

# Saving directly to a png-file:
png("logprice_relations.png",width=800,height=600)
par(mfrow=c(2,4))
for (i in 2:8) {
  plot(logPrice~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],logPrice,labels=row.names(saltdata))
  abline(lm(logPrice~saltdata[,i]), col="red")
}
dev.off()

# Correlation (with 2 decimals)
round(cor(saltdata),2)

# Making a table for viewing in a browser - to be included in e.g. Word/Powerpoint
# Go find/use the cortable.html file afterwards
library(xtable)
capture.output(print(xtable(cor(saltdata)),type="html"),file="cortable.html")

3 PCA, Principal Component Analysis

Wehrens book, Chapter 4, pp 43-56, and the Varmuza-book, chapter 3, sections 3.1-3.7. Please note that there are a number of important plots exemplified as part of the iris-example:

1. Variance-plots ("scree-type" plots)

2. Scores, loadings and biplots (main plots for interpretation of structure)


3. Explained variances for each variable

4. Validation/diagnostics plots:

   (a) Leverage and residuals (also called "score distances" and "orthogonal distances", cf. the nice Figure 3.15, page 79 in the Varmuza-book)
   (b) The "influence plot": residuals versus leverage

5. Jackknifing/bootstrapping/cross-validating the PCA for various purposes:

   (a) Deciding on the number of components
   (b) Sensitivity/uncertainty investigation of scores and loadings.

3.1 Example/Exercise: Fisher’s Iris Data

The exercise can be solved in either NTSYS (SVD), Unscrambler or in R. JCF can support in NTSYS/Unscrambler. Sophie can support in R. The Fisher Iris data-set is a classic, cf.:

• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.

• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.

There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), all in all only four variables.

The original hypothesis was that I. versicolor was a hybrid of the two other species i.e. I. setosa x virginica. I. setosa is diploid; I. virginica is a tetraploid; and I. versicolor is hexaploid.

Note that we call the data-set Fisherout.xls (an Excel file in this case). When you look at the data in the plots later, it is a good idea to only use the first two letters (se, ve, vi). Import the data set into your choice of software (in R, see the code below the questions).

First examine the raw data and check whether there are obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a very fine overview of the data and will often show outliers immediately. Perform the PCA with leverage correction and with centering.

Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).

1. How many principal components would you need and what does the first PC (PC1) describe?

2. What percentage of the variation is described by the first two PCs?

3. Can you find an outlier? If so, do you have an idea how this outlier came about (loadings plot or scores plot)? In R: Do you see a problem in the influence plot? If there is an outlier, in which other plot can you see the problem? If you see severe outliers, remove them from the data and run the PCA again (and answer questions 1 and 2 again).


4. Does a standardization (autoscaling) give a better model? (Answer questions 1 and 2 again.)

5. How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

6. How many PCs can you maximally get in this dataset?

7. Compare the score and the loading plot, and make a biplot. Do any of the variables "tell the same story"?

8. Are any variables more discriminative than others? Are any variables dispensable?

9. Can you see the presupposed classes? Any class overlap?

10. Does the original hypothesis seem to be OK?

The iris data can already be found within R, so no import is needed:

# E.g. set the relevant working directory for you - here the one used by PBB locally
# setwd("C:/perbb/DTUkemometri/PCAnew")

# Reading the IRIS csv-data which is a copy of the file uploaded by JCF:
# Note that the Iris data given by JCF is slightly different from the IRIS data available in R:
JCFiris=read.table("Copy of Fisherout.csv",header=T,sep=";",dec=",")
library(ChemometricsWithR)
data(iris)
summary(iris)
summary(JCFiris)
head(iris)
head(JCFiris)

# Basic PCA on covariances (WITHOUT standardization)
irisPC_without=PCA(scale(iris[,1:4],scale = FALSE))
par(mfrow=c(2,2))
scoreplot(irisPC_without, col = iris$Species)
loadingplot(irisPC_without, show.names = TRUE)
biplot(irisPC_without, score.col = iris$Species)
screeplot(irisPC_without,type="percentage")

# Basic PCA on correlations (WITH standardization)
irisPC=PCA(scale(iris[,1:4]))
par(mfrow=c(2,2))
scoreplot(irisPC, col = iris$Species)
loadingplot(irisPC, show.names = TRUE)
biplot(irisPC, score.col = iris$Species)
screeplot(irisPC,type="percentage")

# Plotting more components:
# Scores:
par(mfrow=c(4,4))
for (i in 1:4) for (j in 1:4) scoreplot(irisPC, col = iris$Species,pc=c(i,j))

# Loadings:
par(mfrow=c(4,4))
for (i in 1:4) for (j in 1:4) loadingplot(irisPC, show.names = TRUE,pc=c(i,j))

# Diagnostics using the chemometrics package
library(chemometrics)


irisPCA=princomp(iris[,1:4], cor = TRUE)
summary(irisPCA)
loadings(irisPCA)
#scores(irisPCA)
par(mfrow=c(2,2))
plot(irisPCA)   # shows a screeplot
biplot(irisPCA)

# How to plot residuals versus leverage??
# The score distances res$SDist express the leverage values
# The orthogonal distances express the residuals
res=pcaDiagplot(iris[,1:4],irisPCA)
pcaVarexpl(iris[,1:4],a=2)
plot(res$SDist,res$ODist,type="n")
text(res$SDist,res$ODist,labels=row.names(iris))

# Influence plot: residuals versus leverage for different numbers of components:
par(mfrow=c(2,2))
for (i in 1:4) {
  res=pcaDiagplot(iris[,1:4],a=i,irisPCA,plot=FALSE)
  plot(res$SDist,res$ODist,type="n")
  text(res$SDist,res$ODist,labels=row.names(iris))
}

# Doing something similar to "jackknifing of the loadings":
# Leaving out a certain number of the observations and plotting the loadings for each subset of data
# Random samples of a certain proportion of the original number of observations are left out
par(mfrow=c(2,2))
n=length(iris[,1])
leave_out_size=0.50
for (k in 1:4){
  irisPC=PCA(scale(iris[sample(1:n,round(n*(1-leave_out_size))),1:4]))
  loadingplot(irisPC, show.names = TRUE)
}

3.2 Exercise: Wine Data (To be presented by Team 1 next time)

The exercise can be solved in either NTSYS, Unscrambler or in R. JCF can support in NTSYS/Unscrambler. Sophie can support in R. The second dataset is called VIN2:

• Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25: 189-201.

• Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification, and correlation. (www.parvus.unige.it)

The dataset VIN2.csv is an Excel CSV file. In this dataset there are 178 objects (Italian wines): the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):


1. Alcohol (in %)
2. Malic acid
3. Ash
4. Alkalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Colour intensity
11. Colour hue
12. OD280/OD315 of diluted wines
13. Proline (amino acid)

Questions:

1. Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?

2. Correct wrong data, if any (in the Excel file), and use PCA again. Do the score and loading plots look significantly different now?

3. Try PCA without standardization: Which variables are important here and why?

4. Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

5. Suppose that alcohol % and proanthocyanins were especially healthy; which wine would you recommend?

6. (IF using Unscrambler) Use the Unscrambler jack-knife method to test for significance of the variables (run the PCA again, with jack-knifing). Are all the variables stable? (In R: Think about how to do something similar - see e.g. the example for the iris data where the loadings are plotted for different subsets of data.)

The wine data can already be found within R, so no import is needed:


# Wines data:
# From the JCF uploaded file:
# Also slightly different from the version in the package
JCFwines=read.table("VIN2.csv",header=T,sep=";",dec=",")

# The wines data from the package:
# The wine class information is here stored in the wine.classes object
data(wines, package = "ChemometricsWithRData")
head(wines)
head(JCFwines)
summary(wines)
summary(JCFwines)

# Basic PCA
wines.PC <- PCA(scale(wines))
par(mfrow=c(2,2))
scoreplot(wines.PC, col = wine.classes, pch = wine.classes)
loadingplot(wines.PC, show.names = TRUE)
biplot(wines.PC, score.col = wine.classes)
screeplot(wines.PC,type="percentage")

4 MLR, Multiple Linear Regression (OLS)

• Check Wehrens book, Chapter 8, pp 145-148

• and the Varmuza-book, chapter 4, sections 4.3.1-4.3.2.

• And some basics on Regression in R: http://www.statmethods.net/stats/regression.html

• And Regression diagnostics in R: http://www.statmethods.net/stats/regression.html

4.1 Example: Leslie Salt Data

# Regression part
# import:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
# (the log-price is needed below, cf. the exercise in Section 2.2)
saltdata$logPrice=log(saltdata$Price)

# Simple regression
lm1=lm(logPrice~Distance,data=saltdata)
summary(lm1)

# Full MLR model:
lm2=lm(logPrice~County+Size+Elevation+Sewer+Date+Flood+Distance,data=saltdata)
summary(lm2)
step(lm2,direction="backward")
lm3=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance, data = saltdata)
summary(lm3)

# plot fitted vs. observed - identifying observation nr:
par(mfrow=c(1,1))
plot(saltdata$logPrice,lm3$fitted,type="n")
text(saltdata$logPrice,lm3$fitted,labels=row.names(saltdata))

# Regression diagnostics plots given automatically, either 4:
par(mfrow=c(2,2))


plot(lm3)

# Or 6:
par(mfrow=c(3,2))
plot(lm3,1:6)

# Plot residuals versus individual x's:
par(mfrow=c(2,3))
for (i in 4:8) {
  plot(lm3$residuals~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],lm3$residuals,labels=row.names(saltdata))
  lines(lowess(saltdata[,i],lm3$residuals),col="blue")
}

# Check for potential quadratic effects and/or interactions:
# Here just two examples of such:
saltdata$Distance_square=saltdata$Distance^2
saltdata$County_Elevation=saltdata$County*saltdata$Elevation
lm4=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance + Distance_square + County_Elevation, data = saltdata)
summary(lm4)

# How to remove a potential outlier and re-analyze:
# E.g. let's try without observations no. 2 and 10:
# Remove and copy-paste:
saltdata_red=saltdata[-c(2,10),]
# Check:
dim(saltdata)
dim(saltdata_red)
row.names(saltdata_red)
lm3_red=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance, data = saltdata_red)
summary(lm3_red)
par(mfrow=c(3,2))
plot(lm3_red,1:6)

4.2 Exercise: Prostate Cancer data

Analyze the Prostate data from the following book website: http://statweb.stanford.edu/~tibs/ElemStatLearn/ (check pp. 47-48 of Hastie et al. for a short description of the data). (Use the uploaded version of the data file in Campusnet to avoid import problems.)

prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
head(prostate)
summary(prostate)
dim(prostate)
pairs(prostate)

Try to answer the following questions:

1. What are the pair-wise relations between lpsa and the other variables – any indications of important relations?

2. Are there any clearly non-normal (e.g. skew) distributions among the variables?

3. Run the 8-variable MLR analysis and try to reduce the model by removing the most non-significant variables one by one – what is the final model? (A code sketch is given after this list.)

4. Interpret the parameters of the final model – compare with the investigation in 1.


5. What is the estimate (and interpretation) of the residual standard deviation?

6. Investigate the validity/assumptions of the final model:

(a) Residual checks

(b) Influential/outlying observations

(c) Any additional model structure? (non-linearities, interactions?) (diagnostic plots and/or model extensions)
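A possible starting sketch for questions 3-6 (the predictor names lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45 are those used in the Hastie et al. version of the data - adjust them if your imported file differs):

# 8-variable MLR (variable names assumed from the Hastie et al. data set):
lm_full=lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate)
summary(lm_full)
# Backward reduction (or remove the most non-significant variable by hand, one at a time):
lm_red=step(lm_full,direction="backward")
summary(lm_red)
# Residual and influence diagnostics for the final model:
par(mfrow=c(2,2))
plot(lm_red)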

5 PCR, Principal Component Regression

PCR and the other biased regression methods presented in this course (PLS, Ridge and Lasso) are, together with even more methods (e.g. MLR = OLS), introduced in each of the following three books:

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, February 2009, Trevor Hastie, Robert Tibshirani, Jerome Friedman

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg.(Chapter 8 and 9)

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press. (Chapter 4)

The latter two are directly linked with R-packages, and here we will mostly use the last of them (the Varmuza-book). These things are NOT covered by the Lattin book, so we give here a reading list of the most relevant parts of its chapter 4, when it comes to syllabus content for course 27411:

• Section 4.1 (4p) (Concepts - ALL models)

• Section 4.2.1-4.2.3 (6p)(Errors in (ALL) models )

• Section 4.2.5-4.2.6 (3.5p)(CV+bootstrap - ALL models)

• [Section 4.3.1-4.3.2.1 (9.5p)(Simple regression and MLR (=OLS))]

• [Section 4.5.3 (1p) (Stepwise Variable Selection in MLR)]

• Section 4.6 (2p) (PCR)

• Section 4.7.1 (3.5p) (PLS)

• Section 4.8.2 (2.5p) (Ridge and Lasso)

• Section 4.9-4.9.1.2 (5p) (An example using PCR and PLS)

• Section 4.9.1.4-4.9.1.5 (5p) (An example using Ridge and Lasso)

• Section 4.10 (2p) (Summary)


5.1 Example: Leslie Salt Data (again)

Below I have included R-code primarily based on the pls package. It shows how one can work through all the steps of the PCR analysis of a data set.

# E.g.: Set your own working directory - on the perbb computer, it is:
setwd("C:/perbb/DTUkemometri/R")

# Example: using Salt data:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:
saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE
saltdata$train[sample(1:length(saltdata$train),4)]=FALSE
saltdata_TEST=saltdata[saltdata$train==FALSE,]
saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set
# First: 1) Explore the data - we already did this previously, so no more of that here
# Next: 2) Model the data

# Run the PCR with a maximal/large number of components using the pls package
library(pls)
mod <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod,labels=rownames(saltdata_TRAIN))

# Choice of components:
# what would segmented CV give:
mod_segCV <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, scale = TRUE, validation = "CV", segments = 5, segment.type = c("random"), jackknife = TRUE)

# Initial set of plots:
plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:
# Scores:
scoreplot(mod, comps = 1:4,labels=rownames(saltdata_TRAIN))
# Loadings:
loadingplot(mod,comps = 1:4,scatter=TRUE,labels=names(saltdata_TRAIN))

# We choose 4 components
mod4 <- pcr(logPrice ~ X, ncomp = 4, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Then 3) Validate:
# Let's validate some more: using 4 components


# We take the predicted values, and hence the residuals, from the predplot function
# Hence these are the (CV) VALIDATED versions!
par(mfrow=c(2,2))
k=4
obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]
plot(obsfit[,2],Residuals,type="n",main=k,xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(saltdata_TRAIN))
qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:
# AND then find the leverage-values as diagonals of the Hat-matrix:
# Based on fitted X-values:
Xf=fitted(mod)[,,k]
H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)
plot(leverage,Residuals,type="n",main=k)
text(leverage,Residuals,labels=rownames(saltdata_TRAIN))

# This set of four plots could be reproduced for other values of k
# One could also redo some of it leaving out certain observations

# Let's also plot the residuals versus each input X:
par(mfrow=c(3,3))
for (i in 2:8){
  plot(Residuals~saltdata_TRAIN[,i],type="n",xlab=names(saltdata_TRAIN)[i])
  text(saltdata_TRAIN[,i],Residuals,labels=row.names(saltdata_TRAIN))
  lines(lowess(saltdata_TRAIN[,i],Residuals),col="blue")
}

# Now let's look at the results - 4) "interpret/conclude"
par(mfrow=c(2,2))
# Plot coefficients with uncertainty from Jackknife:
obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))
plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)
biplot(mod4)

# And then finally some output numbers:
jack.test(mod4,ncomp=4)

# And now let's try to 5) predict the 4 data points from the TEST set:
preds=predict(mod4,newdata=saltdata_TEST,ncomp=4)
plot(saltdata_TEST$logPrice,preds)
rmsep=sqrt(mean((saltdata_TEST$logPrice-preds)^2))
rmsep

5.2 Exercise: Prostate Cancer data

Use the Prostate data also used for the MLR-exercises. Make sure to support all the analysis by a lot of plotting - try to "play around" with plotting results with/without extreme observations. Also try to look at results for different choices of the number of components (to get a feeling for the consequences of that). The R-code (including comments) for the Leslie Salt data above can be used as a template for a good approach.

1. Define test sets and training sets according to the last column in the data set.

2. Do PCR on the training set – use cross-validation.

3. Go through ALL the relevant steps (a code sketch is given after this list):

   (a) model selection
   (b) validation
   (c) interpretation
   (d) etc.

4. Predict the test set lpsa values.

5. What is the average prediction error for the test set?

   (a) Compare with the cross-validation error in the training set - try both "LOO" and a segmented version of CV

6. Compare with an MLR prediction using ALL predictors
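A possible starting sketch, assuming the data frame contains the 8 predictors in its first 8 columns, the response lpsa and a train column (as in the Hastie et al. version of the data); the Leslie Salt template above gives the full set of plots and validation steps:

library(pls)
prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
prostate$X=as.matrix(prostate[,1:8])    # the 8 predictors (column order assumed)
prostate_TRAIN=prostate[prostate$train==TRUE,]
prostate_TEST=prostate[prostate$train==FALSE,]
mod_pcr <- pcr(lpsa ~ X, ncomp = 8, data = prostate_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)
plot(mod_pcr,"validation",estimate=c("train","CV"),legendpos = "topright")
# ... model selection, validation and interpretation as in the template above ...
# Predict the test set (the number of components, here 3, is only an example and must be chosen by CV):
preds=predict(mod_pcr,newdata=prostate_TEST,ncomp=3)
sqrt(mean((prostate_TEST$lpsa-preds)^2))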

6 PLSR, Partial Least Squares Regression

6.1 Motivating example - Leslie Salt again

# E.g.: Set your own working directory - on the perbb computer, it is:
setwd("C:/perbb/DTUkemometri/R")

# Example: using Salt data:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:
saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE
saltdata$train[sample(1:length(saltdata$train),4)]=FALSE
saltdata_TEST=saltdata[saltdata$train==FALSE,]
saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set
# First: 1) Explore the data - we already did this previously, so no more of that here
# Next: 2) Model the data

# Run the PCR with a maximal/large number of components using the pls package
library(pls)
mod_pcr <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:


par(mfrow=c(2,2))
plot(mod_pcr,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod_pcr,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pcr,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pcr,labels=rownames(saltdata_TRAIN))

# Now with pls:
mod_pls <- plsr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=rownames(saltdata_TRAIN))

# Let us look at the coefficients of the final models:
# Plot coefficients with uncertainty from Jackknife:
par(mfrow=c(2,2))
coefplot(mod_pcr,ncomp=1,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 1")
coefplot(mod_pcr,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 2")
coefplot(mod_pcr,ncomp=4,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 4")
coefplot(mod_pls,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pls),cex.axis=0.5,main="logPrice, PLS with 2")

6.2 NIR Example: yarn data

library(chemometrics)
data(yarn)
#summary(yarn)
?yarn
str(yarn)

# Plotting of the 21 individual NIR spectra:
max_X=max(yarn$NIR)
min_X=min(yarn$NIR)
par(mfrow=c(1,1))
plot(yarn$NIR[1,],type="n",ylim=c(min_X,max_X))
for (i in 1:21) lines(yarn$NIR[i,],col=i)

# Plot of y:
plot(yarn$density,type="n")
lines(yarn$density)

yarn_TEST=yarn[yarn$train==FALSE,]
yarn_TRAIN=yarn[yarn$train==TRUE,]

# Pls:
mod_pls <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(yarn_TRAIN),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=rownames(yarn_TRAIN))

# Choice of components:
# what would segmented CV give:
mod_segCV <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, scale = TRUE, validation = "CV", segments = 5, segment.type = c("random"), jackknife = TRUE)


# Initial set of plots:
par(mfrow=c(2,1))
plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:
# Scores:
scoreplot(mod_pls, comps = 1:4,labels=rownames(yarn_TRAIN))
# Loadings:
loadingplot(mod_pls,comps = 1:4,scatter=TRUE,labels=prednames(mod_pls))

# Loadings, one by one - the spectral view:
par(mfrow=c(2,2))
loadingplot(mod_pls,comps = 1,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 2,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 3,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 4,scatter=FALSE,labels=prednames(mod_pls))

# We choose 4 components
mod4 <- plsr(density ~ NIR, ncomp = 4, data = yarn_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Then 3) Validate:
# Let's validate some more: using 4 components
# We take the predicted values, and hence the residuals, from the predplot function
# Hence these are the (CV) VALIDATED versions!
par(mfrow=c(2,2))
obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]
plot(obsfit[,2],Residuals,type="n",xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(yarn_TRAIN))
qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:
# AND then find the leverage-values as diagonals of the Hat-matrix:
# Based on the fitted X-scores:
Xf=scores(mod4)
H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)
plot(leverage,abs(Residuals),type="n")
text(leverage,abs(Residuals),labels=rownames(yarn_TRAIN))

# This set of four plots could be reproduced for other numbers of components
# One could also redo some of it leaving out certain observations

# Now let's look at the results - 4) "interpret/conclude"
par(mfrow=c(2,2))
# Plot coefficients with uncertainty from Jackknife:
obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)
biplot(mod4)

# And now let's try to 5) predict the data points from the TEST set:
preds=predict(mod4,newdata=yarn_TEST,ncomp=4)


plot(yarn_TEST$density,preds)
rmsep=sqrt(mean((yarn_TEST$density-preds)^2))
rmsep

6.3 Exercise: NIR data on Degree of Esterification

Use the Pectin data (to be downloaded from File Sharing in CN) on relating the degree of esterification (DE) for pectins to their NIR spectra as obtained by the diffuse reflectance technique. The datafile contains data for 70 different granulated pectin samples with 1 Y-variable (DE) and 1039 regressor X-variables in the NIR region (3996 cm^-1 - 12004 cm^-1). Set up and validate a PLS-model following the standard steps as shown in the PCR approach. Compare with PCR results also! Remember to check for outliers. Are any spectral ranges more important than others for predicting DE? How precisely can DE be predicted by the models? Define your own test set by a random selection of 23 samples:

# E.g.: Set your own working directory - on the perbb computer, it is:
setwd("C:/perbb/DTUkemometri/Niels")

# Exercise data: Pectin data:
pectin=read.table("Pectin_and_DE.txt",header=TRUE,sep=";",dec=",")
pectin$NIR=as.matrix(pectin[,-1])
length(pectin$DE)

# First of all we consider a random selection of 23 observations as a TEST set
pectin$train=TRUE
pectin$train[sample(1:length(pectin$DE),23)]=FALSE
pectin_TEST=pectin[pectin$train==FALSE,]
pectin_TRAIN=pectin[pectin$train==TRUE,]

# The training X-data has 47 rows and 1039 columns:
dim(pectin_TRAIN$NIR)
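From here, a possible continuation could look as follows (a sketch only - the numbers of components used below are arbitrary and must be chosen by proper validation):

library(pls)
mod_pls <- plsr(DE ~ NIR, ncomp = 20, data = pectin_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)
par(mfrow=c(2,2))
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=rownames(pectin_TRAIN))
loadingplot(mod_pls,comps = 1,scatter=FALSE)   # spectral view of loading no. 1
# Predict the test set (e.g. with 5 components - to be chosen by CV):
preds=predict(mod_pls,newdata=pectin_TEST,ncomp=5)
sqrt(mean((pectin_TEST$DE-preds)^2))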

7 RR, Ridge and Lasso Regression

7.1 Example: PAC data from package chemometrics

library(chemometrics)
data(PAC)
?PAC

# First find the optimal lambda:
# By GCV (formula approximation of real CV), sec 4.3.2.2 in the Varmuza-book:
ridge_res=plotRidge(y~X,data=PAC,lambda=seq(1,20,by=0.1))
ridge_res$lambdaopt

# By real CV, e.g. LOO:
# Note that the ridgeCV function shown in the book does NOT do this for us
# It only looks into an already chosen optimal lambda
# So we use the ridge.cv function from the parcor package:
library(parcor)

# First full LOO
res_LOO=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=209,plot.it=TRUE)
res_LOO$lambda.opt

# Then 9 (random) versions of 10-fold CV:
par(mfrow=c(3,3))


for (i in 1:9){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=10,plot.it=TRUE)
  title(main=res_CV10$lambda.opt)
}

# Let's do our own "repeated double 10-fold CV":
K=10
lambda.opts=rep(0,K)
for (i in 1:K){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,30,by=0.5),k=10,plot.it=FALSE)
  lambda.opts[i]=res_CV10$lambda.opt
}
median(lambda.opts)
par(mfrow=c(1,1))
boxplot(lambda.opts)

# Now let's try the ridgeCV function to investigate the choice(s) more carefully:
# As suggested by the book
res=ridgeCV(y~X,data=PAC,lambdaopt=4.3,repl=20)

# Look at the coefficients:
ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)
# Coefficients on raw X-scale:
coef(ridge_lm)
# Coefficients on standardized X-scale:
ridge_lm$coef

# Now let's try to fit the lasso:
# We could use the lars package:
lasso_result=lars(PAC$X,PAC$y)
cv.lars(PAC$X,PAC$y)

# Or from the chemometrics package:
res_lasso=lassoCV(y~X,data=PAC,fraction=seq(0,1,by=0.05),K=10)
res_lasso$sopt
res_coef=lassocoef(y~X,data=PAC,sopt=res_lasso$sopt)

# In an auto-scaled version:
#res_lasso_scaled=lassoCV(y~scale(X),data=PAC,fraction=seq(0,1,by=0.05),K=10)
#res_lasso_scaled$sopt
#res_coef_scaled=lassocoef(y~scale(X),data=PAC,sopt=res_lasso_scaled$sopt)

# The actual coefficients can be looked at by: (on the raw X-scale - I believe :-) ...)
res_coef=res_coef$coefficients
res_coef
beta_lasso=res_coef

# Let's try to predict a NEW data set
# here we take the original data - just for illustration:
PAC_test=PAC

# First by ridge regression:
ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)
# Coefficients on raw X-scale: (WITH intercept)
beta_ridge=coef(ridge_lm)
yhat_ridge=beta_ridge[1]+PAC_test$X%*%beta_ridge[-1]


residuals=PAC$y-yhat_ridge
MSEP=mean(residuals^2)
MSEP
sqrt(MSEP)
plot(PAC_test$y,yhat_ridge,ylim=c(180,520))

# Then by lasso regression:
yhat_lasso=predict(lasso_result,newx=PAC_test$X,s=0.3,mode="fraction")$fit
residuals=PAC_test$y-yhat_lasso
MSEP=mean(residuals^2)
MSEP
sqrt(MSEP)
plot(PAC_test$y,yhat_lasso,ylim=c(180,520))

7.2 Exercise: Prostate Cancer data and maybe pectin data

1. Try both ridge and lasso regression on the Prostate cancer data - a scaled version (a starting sketch is given after this list).

(a) Work on the training data
(b) Find the optimal penalty
(c) Will lasso completely leave out some of the variables?
(d) Interpret/Look at the coefficients
(e) Predict the test data - compare results with other methods, e.g. MLR/PCR/(PLS)
(f) For the ridge-model: Compare the "size" of the variable coefficients with the coefficients from MLR and PCR - what can you conclude?

2. Try lasso on the pectin NIR data - how many wavelengths will be completely left out?
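A possible starting sketch for question 1, using the same functions as in the PAC example above (the column order and the response name lpsa are assumed from the Hastie et al. version of the prostate data):

library(chemometrics)
library(parcor)
library(lars)
prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
prostate=prostate[prostate$train==TRUE,]   # work on the training data
X=scale(as.matrix(prostate[,1:8]))         # the 8 predictors, auto-scaled
y=prostate$lpsa
# Ridge: choose the penalty by (repeated) 10-fold CV:
res_CV10=ridge.cv(X,y,lambda=seq(0.1,50,by=0.1),k=10,plot.it=TRUE)
res_CV10$lambda.opt
# Lasso via the lars package:
lasso_result=lars(X,y)
cv.lars(X,y)                                  # pick the fraction s from this plot
coef(lasso_result,s=0.5,mode="fraction")      # some coefficients may be exactly zero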

8 Discriminant Analysis, LDA, QDA, k-NN, Bayes, PLSDA

Apart from the Lattin chapter, also read in the Varmuza book:

• Section 5.1, Intro, 2.5 pages

• Section 5.2, Linear Methods, 12 pages

• Section 5.3.3, Nearest Neighbour (k-NN), 3 pages

• Section 5.7, Evaluation of classification, 3 pages


8.1 Example: Iris data

# we use the iris data:
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]), Sp = rep(c("s","c","v"), rep(50,3)))

# We make test and training data:
train <- sample(1:150, 75)
test = (1:150)[-train]
Iris_train=Iris[train,]
Iris_test=Iris[test,]

# Distribution of the three classes in the training data:
table(Iris_train$Sp)

# PART 1: LDA
library(MASS)   # lda and qda are found in the MASS package
lda_train_LOO <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction
# percent correct for each category:
ct <- table(Iris_train$Sp, lda_train_LOO$class)
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# Plotting only works without the CV-stuff:
lda_train <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)
plot(lda_train)   # CVA components score plot no 1 and 2
plot(lda_train, dimen=1, type="both")   # fit from lda

# Or REALLY nice plots in another package:
library(klaR)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(1,1,1)/3)

# See the effect of a different prior: (using an extreme one for illustration)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(0.02,.49,0.49))

# And now predicting NEW data:
predict(lda_train, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(lda_train, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# PART 2: QDA
# Most stuff from LDA can be reused:
qda_train_LOO <- qda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction
# percent correct for each category:
ct <- table(Iris_train$Sp, qda_train_LOO$class)
ct
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# Plotting only works without the CV-stuff:
qda_train <- qda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)

# Or REALLY nice plots in another package:


library(klaR)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="qda",prior=c(1,1,1)/3)

# And now predicting NEW data:
predict(qda_train, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(qda_train, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# PART 3: (Naive) Bayes method
# With "usekernel=TRUE" it fits a density to the observed data and uses that
# instead of the normal distribution
# Let's explore the results first:
# (Can take a while due to the fitting of densities)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="naiveBayes",prior=c(1,1,1)/3,usekernel=TRUE)
bayes_fit=NaiveBayes(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="naiveBayes",prior=c(1,1,1)/3,usekernel=TRUE)
plot(bayes_fit)

# And now predicting NEW data:
predict(bayes_fit, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(bayes_fit, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# PART 4: k-nearest neighbour:
# Explorative plot. WARNING: THIS takes some time to produce!!
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris[train,],method="sknn",kn=3)
knn_fit_5=sknn(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="sknn",kn=5)

# And now predicting NEW data:
predict(knn_fit_5, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(knn_fit_5, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# total percent correct
sum(diag(prop.table(ct)))

# PART 5: PLS-DA
# We have to use the "usual" PLS-functions
# Define the response vector (2 classes) OR matrix (>2 classes):
# Let's try with K=2: Group 1: s, Group -1: c and v
Iris_train$Y=-1
Iris_train$Y[Iris_train$Sp=="s"]=1
table(Iris_train$Y)
Iris_train$X=as.matrix(Iris_train[,1:4])

# From here use standard PLS1 predictions, e.g.:
library(pls)
# Pls:


mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(Iris_train),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=Iris_train$Sp)

# Be careful about interpretations due to the binary setting
# You should really do a CV-based confusion table for each component:
preds=array(dim=c(length(Iris_train[,1]),4))
for (i in 1:4) preds[,i]=predict(mod_pls,ncomp=i)
preds[preds<0]=-1
preds[preds>0]=1

# Look at the results from each of the components:
for (i in 1:4) {
  ct <- table(Iris_train$Y, preds[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}

# The prediction of new data would be handled similarly

# WHAT if in this case K=3 classes:
# Look at Varmuza, Sec. 5.2.2.3
K=3
Iris_train$Y=matrix(rep(1,length(Iris_train[,1])*K),ncol=K)
Iris_train$Y[Iris_train$Sp!="c",1]=-1
Iris_train$Y[Iris_train$Sp!="s",2]=-1
Iris_train$Y[Iris_train$Sp!="v",3]=-1
mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(1,1))
scoreplot(mod_pls,labels=Iris_train$Sp)
plot(mod_pls,labels=rownames(Iris_train),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Predictions from PLS need to be transformed to actual classifications:
# Select the largest one:
preds=array(dim=c(length(Iris_train[,1]),4))
for (i in 1:4) preds[,i]=apply(predict(mod_pls,ncomp=i),1,which.max)
preds2=array(dim=c(length(Iris_train[,1]),4))
preds2[preds==1]="c"
preds2[preds==2]="s"
preds2[preds==3]="v"

# Look at the results from each of the components:
for (i in 1:4) {
  ct <- table(Iris_train$Sp, preds2[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}


8.2 Exercise: Wine data

Use the wine data previously used. Predict the wine-class using the different methods suggested. Try also a PCA-based version of some of the methods. What are the results? Make your own test and training data. Apply CV on the training data and predict the test data. By looking at various plots (in addition to prediction ability), consider which models seem to be most appropriate for the data. A possible starting sketch (for LDA) is given below.
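The sketch assumes that data(wines, ...) also makes the wine.classes object available, as in the PCA section above; the other methods can be applied analogously to the iris example:

library(MASS)
data(wines, package = "ChemometricsWithRData")
winedf=data.frame(scale(wines), Class=factor(wine.classes))
# Own test/training split (here roughly half of the 178 wines):
train=sample(1:nrow(winedf),89)
wine_TRAIN=winedf[train,]
wine_TEST=winedf[-train,]
# LDA with LOO-CV on the training data:
lda_LOO=lda(Class~., data=wine_TRAIN, CV=TRUE)
ct=table(wine_TRAIN$Class, lda_LOO$class)
sum(diag(prop.table(ct)))    # total fraction correct (CV)
# Refit on the training data and predict the test data:
lda_fit=lda(Class~., data=wine_TRAIN)
ct_test=table(wine_TEST$Class, predict(lda_fit, wine_TEST)$class)
sum(diag(prop.table(ct_test)))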
