

# Check for potential quadratic effects and/or interactions:

# Here are just two examples of such terms:

saltdata$Distance_square=saltdata$Distance^2

saltdata$County_Elevation=saltdata$County*saltdata$Elevation

lm4=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance + Distance_square + County_Elevation, data = saltdata)

summary(lm4)

# How to remove a potential outlier and re-analyze:

# E.g., let's try without observations no. 2 and 10:

# Remove them and keep the reduced data in a copy:

saltdata_red=saltdata[-c(2,10),]

# Check:

dim(saltdata)
dim(saltdata_red)
row.names(saltdata_red)

lm3_red=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance, data = saltdata_red)
summary(lm3_red)

par(mfrow=c(3,2))
plot(lm3_red,1:6)

4.2 Exercise: Prostate Cancer data

Analyze the Prostate data from the book website http://statweb.stanford.edu/~tibs/ElemStatLearn/ (see pp. 47-48 of Hastie et al. for a short description of the data). (Use the uploaded version of the data file in CampusNet to avoid import problems.)

prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
head(prostate)

summary(prostate)
dim(prostate)
pairs(prostate)

Try to answer the following questions:

1. What are the pair-wise relations between lpsa and the other variables – any indications of important relations?

2. Are there any clearly non-normal (e.g. skew) distributions among the variables?

3. Run the 8-variable MLR analysis and try to reduce the model by removing the least significant variables one by one – what is the final model? (A sketch of one possible approach is given after this list.)

4. Interpret the parameters of the final model – compare with the investigation in 1.

5. What is the estimate (and interpretation) of the residual standard deviation?

6. Investigate the validity/assumptions of the final model:

(a) Residual checks

(b) Influential/outlying observations

(c) Any additional model structure? (non-linearities, interactions?) (diagnostic plots and/or model extensions)
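For question 3, a minimal sketch of one possible backward-elimination loop, assuming the prostate data frame from above with lpsa as the response (the variable dropped in the example line is only an illustration):

# Sketch: manual backward elimination (assumes prostate is loaded as above;
# if the file contains a train indicator column, exclude it from the formula first)
mod <- lm(lpsa ~ ., data = prostate)
summary(mod)
# Drop the currently least significant term and refit, e.g.:
mod <- update(mod, . ~ . - gleason)   # 'gleason' is only an illustration
summary(mod)
# Repeat until all remaining terms are significant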

5 PCR, Principal Component Regression

PCR and the other biased regression methods presented in this course (PLS, Ridge and Lasso) are, together with further methods (e.g. MLR=OLS), introduced in each of the following three books:

• Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg. (Chapters 8 and 9)

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press. (Chapter 4)

The latter two are directly linked with R packages, and here we will mainly use the last one. These topics are NOT covered by the Lattin book, so we give here a reading list of the most relevant parts of chapter 4 (Varmuza and Filzmoser) with respect to the syllabus of course 27411:

• Section 4.1 (4p) (Concepts - ALL models)

• Section 4.2.1-4.2.3 (6p)(Errors in (ALL) models )

• Section 4.2.5-4.2.6 (3.5p)(CV+bootstrap - ALL models)

• [Section 4.3.1-4.3.2.1 (9.5p)(Simple regression and MLR (=OLS))]

• [Section 4.5.3 (1p) (Stepwise Variable Selection in MLR)]

• Section 4.6 (2p) (PCR)

• Section 4.7.1 (3.5p) (PLS)

• Section 4.8.2 (2.5p) (Ridge and Lasso)

• Section 4.9-4.9.1.2 (5p) (An example using PCR and PLS)

• Section 4.9.1.4-4.9.1.5 (5p) (An example using Ridge and Lasso)

• Section 4.10 (2p) (Summary)

5.1 Example: Leslie Salt Data (again)

Below I have included R-code primarily based on the pls package. It shows how one can work through all the steps of the PCR analysis of a data set.

# E.g.: Set your own working directory - on perbb's computer it is:

setwd("C:/perbb/DTUkemometri/R")

# Example: using Salt data:

saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:

saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE

saltdata$train[sample(1:length(saltdata$train),4)]=FALSE
saltdata_TEST=saltdata[saltdata$train==FALSE,]

saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set

# First: 1) Explore the data - we already did this previously, so no more of that here

# Next: 2) Model the data

# Run the PCR with a maximal/large number of components using the pls package:
library(pls)

mod <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(2,2))

plot(mod,labels=rownames(saltdata_TRAIN),which="validation")

plot(mod,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") scoreplot(mod,labels=rownames(saltdata_TRAIN))

# Choice of components:

# what would segmented CV give:

mod_segCV <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, scale = TRUE, validation = "CV", segments = 5, segment.type = c("random"), jackknife = TRUE)

# Initial set of plots:

plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:

# Scores:

scoreplot(mod, comps = 1:4,labels=rownames(saltdata_TRAIN))

#Loadings:

loadingplot(mod,comps = 1:4,scatter=TRUE,labels=prednames(mod))

# We choose 4 components

mod4 <- pcr(logPrice ~ X, ncomp = 4, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Then 3) Validate:

# Let's validate some more, using 4 components

# We take the predicted and hence the residuals from the predplot function

# Hence these are the (CV) VALIDATED versions!

par(mfrow=c(2,2))
k=4

obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]

plot(obsfit[,2],Residuals,type="n",main=k,xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(saltdata_TRAIN))

qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:

# AND then find the leverage-values as diagonals of the Hat-matrix:

# Based on the scores of the first k components:
Xf=scores(mod)[,1:k]

H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)

plot(leverage,Residuals,type="n",main=k)

text(leverage,Residuals,labels=rownames(saltdata_TRAIN))

# This set of four plots could be reproduced for other values of k

# One could also redo some leaving out certain observations

# Let’s also plot the residuals versus each input X:

par(mfrow=c(3,3))
for (i in 2:8){
  plot(Residuals~saltdata_TRAIN[,i],type="n",xlab=names(saltdata_TRAIN)[i])
  text(saltdata_TRAIN[,i],Residuals,labels=row.names(saltdata_TRAIN))
  lines(lowess(saltdata_TRAIN[,i],Residuals),col="blue")
}

# Now let’s look at the results - 4) "interpret/conclude"

par(mfrow=c(2,2))

# Plot coefficients with uncertainty from the jackknife:

obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))

plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)

biplot(mod4)

# And then finally some output numbers:

jack.test(mod4,ncomp=4)

# And now let’s try to 5) predict the 4 data points from the TEST set:

preds=predict(mod4,newdata=saltdata_TEST,ncomp=4)
plot(saltdata_TEST$logPrice,preds)

rmsep=sqrt(mean((saltdata_TEST$logPrice-preds)^2))
rmsep

5.2 Exercise: Prostate Cancer data

Use the Prostate data also used for the MLR exercises. Make sure to support all the analysis with plenty of plotting - try to "play around" with plotting results with/without extreme observations. Also try to look at results for different choices of the number of components (to get a feeling for the consequences of that). The R code (including comments) for the Leslie Salt data above can be used as a template for a good approach.

1. Define test sets and training sets according to the last column in the data set (see the sketch after this list).

2. Do PCR on the training set – use cross-validation.

3. Go through ALL the relevant steps:

(a) model selection

(b) validation

(c) interpretation

(d) etc.

4. Predict the test set lpsa values.

5. What is the average prediction error for the test set?

(a) Compare with the cross-validation error in the training set - try both "LOO" and a segmented version of CV

6. Compare with an MLR prediction using ALL predictors
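For step 1, a minimal sketch, assuming the last column of the uploaded prostate file is a train indicator called train and that the 8 predictors sit in columns 1-8 (both are assumptions about the file layout):

# Sketch: split according to the (assumed) train column and build X-matrices
prostate_TRAIN=prostate[prostate$train==TRUE,]
prostate_TEST=prostate[prostate$train==FALSE,]
prostate_TRAIN$X=as.matrix(prostate_TRAIN[,1:8])   # columns 1:8 assumed to hold the predictors
prostate_TEST$X=as.matrix(prostate_TEST[,1:8])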

6 PLSR, Partial Least Squares Regression

6.1 Motivating example - Leslie Salt again

# E.g.: Set your own working directory - on perbb's computer it is:

setwd("C:/perbb/DTUkemometri/R")

# Example: using Salt data:

saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:

saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE

saltdata$train[sample(1:length(saltdata$train),4)]=FALSE
saltdata_TEST=saltdata[saltdata$train==FALSE,]

saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set

# First: 1) Explore the data - we already did this previously, so no more of that here

# Next: 2) Model the data

# Run the PCR with a maximal/large number of components using the pls package:
library(pls)

mod_pcr <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)


mod_pls <- plsr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(2,2))

plot(mod_pls,labels=rownames(saltdata_TRAIN),which="validation")

plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") scoreplot(mod_pls,labels=rownames(saltdata_TRAIN))

# Let us look at the coefficients of the final models:

# Plot coefficients with uncertainty from the jackknife:

par(mfrow=c(2,2))

coefplot(mod_pcr,ncomp=1,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 1 comp.")
coefplot(mod_pcr,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 2 comps.")
coefplot(mod_pcr,ncomp=4,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 4 comps.")
coefplot(mod_pls,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pls),cex.axis=0.5,main="logPrice, PLS with 2 comps.")

6.2 NIR Example: yarn data

library(chemometrics)
library(pls)   # the yarn data set comes with the pls package
data(yarn)

#summary(yarn)

?yarn
str(yarn)

# Plotting of the 21 individual NIR spectra:

max_X=max(yarn$NIR)
min_X=min(yarn$NIR)
par(mfrow=c(1,1))

plot(yarn$NIR[1,],type="n",ylim=c(min_X,max_X))
for (i in 1:21) lines(yarn$NIR[i,],col=i)

# The yarn data include a logical train column; we assume it defines the intended TRAIN/TEST split:
yarn_TRAIN=yarn[yarn$train==TRUE,]
yarn_TEST=yarn[yarn$train==FALSE,]

# Plot of y:
plot(yarn_TRAIN$density,ylab="density")

mod_pls <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(2,2))

plot(mod_pls,labels=rownames(yarn_TRAIN),which="validation")

plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") scoreplot(mod_pls,labels=rownames(yarn_TRAIN))

# Choice of components:

# what would segmented CV give:

mod_segCV <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, scale = TRUE, validation = "CV", segments = 5, segment.type = c("random"), jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(2,1))

plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:

# Scores:

scoreplot(mod_pls, comps = 1:4,labels=rownames(yarn_TRAIN))

#Loadings:

loadingplot(mod_pls,comps = 1:4,scatter=TRUE,labels=prednames(mod_pls))

#Loadings, one by one - the spectral view:

par(mfrow=c(2,2))

loadingplot(mod_pls,comps = 1,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 2,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 3,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 4,scatter=FALSE,labels=prednames(mod_pls))

# We choose 4 components

mod4 <- plsr(density ~ NIR, ncomp = 4, data = yarn_TRAIN, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Then 3) Validate:

# Let's validate some more, using 4 components

# We take the predicted and hence the residuals from the predplot function

# Hence these are the (CV) VALIDATED versions!

par(mfrow=c(2,2))
k=4   # number of components, used for the plot title below

obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]

plot(obsfit[,2],Residuals,type="n",xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(yarn_TRAIN))

qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:

# AND then find the leverage-values as diagonals of the Hat-matrix:

# Based on fitted X-values:

Xf=scores(mod4)

H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)

plot(leverage,abs(Residuals),type="n",main=k)

text(leverage,abs(Residuals),labels=rownames(yarn_TRAIN))

# This set of four plots could be reproduced for other values of k

# One could also redo some leaving out certain observations

# Now let’s look at the results - 4) "interpret/conclude"

par(mfrow=c(2,2))

# Plot coefficients with uncertainty from the jackknife:

obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))

plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)

biplot(mod4)

# And now let’s try to 5) predict the 4 data points from the TEST set:

preds=predict(mod4,newdata=yarn_TEST,ncomp=4)

plot(yarn_TEST$density,preds)

rmsep=sqrt(mean((yarn_TEST$density-preds)^2))
rmsep

6.3 Exercise: NIR data on Degree of Esterification

Use the Pectin data (to be downloaded from File Sharing in CampusNet) relating the degree of esterification (DE) of pectins to their NIR spectra, obtained by the diffuse reflectance technique. The data file contains data for 70 different granulated pectin samples with 1 Y-variable (DE) and 1039 regressor X-variables in the NIR region (3996 cm^-1 to 12004 cm^-1). Set up and validate a PLS model following the standard steps as shown in the PCR approach. Compare with PCR results also! Remember to check for outliers. Are any spectral ranges more important than others for predicting DE? How precisely can DE be predicted by the models? Define your own test set by a random selection of 23 samples:

# E.g.: Set your own working directory - on perbb computer, it is:

setwd("C:/perbb/DTUkemometri/Niels")

# Exercise data: Pectin data:

pectin=read.table("Pectin_and_DE.txt",header=TRUE,sep=";",dec=",")
pectin$NIR=as.matrix(pectin[,-1])

length(pectin$DE)

# First of all we consider a random selection of 23 observations as a TEST set
pectin$train=TRUE

pectin$train[sample(1:length(pectin$DE),23)]=FALSE
pectin_TEST=pectin[pectin$train==FALSE,]

pectin_TRAIN=pectin[pectin$train==TRUE,]

# The training X-data has 47 rows and 1039 columns:

dim(pectin_TRAIN$NIR)
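As a starting point for the PLS model, a minimal sketch following the yarn template above (20 components is just an arbitrary starting choice, not a recommendation):

library(pls)
# Sketch: PLS with segmented CV on the pectin training set
mod_pls <- plsr(DE ~ NIR, ncomp = 20, data = pectin_TRAIN, validation = "CV", segments = 5, scale = TRUE, jackknife = TRUE)
plot(mod_pls, "validation", estimate = c("train", "CV"), legendpos = "topright")
# The coefficient profile hints at which spectral ranges matter for DE:
coefplot(mod_pls, ncomp = 10)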

7 RR, Ridge and Lasso Regression

7.1 Example: PAC data from package chemometrics

library(chemometrics)
data(PAC)

?PAC

# First find optimal lambda:

# By GCV (a formula approximation of real CV), sec. 4.3.2.2 in the Varmuza book:
ridge_res=plotRidge(y~X,data=PAC,lambda=seq(1,20,by=0.1))

ridge_res$lambdaopt

# By real CV, e.g. LOO:

# Note that the ridgeCV function shown in the book does NOT do this for us

# It only looks into an already chosen optimal lambda

# So we use the ridge.CV function from the parcor package:

library(parcor)

# First Full LOO

res_LOO=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=209,plot.it=TRUE)
res_LOO$lambda.opt

# Then 9 (random) versions of 10-fold CV:

par(mfrow=c(3,3))
for (i in 1:9){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=10,plot.it=TRUE)
  title(main=res_CV10$lambda.opt)
}

# Let’s do our own "repeated double 10-fold CV":

K=10
lambda.opts=rep(0,K)
for (i in 1:K){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,30,by=0.5),k=10,plot.it=FALSE)
  lambda.opts[i]=res_CV10$lambda.opt
}

median(lambda.opts)
par(mfrow=c(1,1))
boxplot(lambda.opts)

# Now let’s try the ridgeCV to investigate the choice(s) more carefully:

# As suggested by the book

res=ridgeCV(y~X,data=PAC,lambdaopt=4.3,repl=20)

# Look at the coefficients:

library(MASS)   # lm.ridge comes from the MASS package
ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)

# Coefficients on raw X-scale:

coef(ridge_lm)

# Coefficients on standardized X-scale:

ridge_lm$coef

# Now let’s try to fit the lasso:

# We could use the lars package:

library(lars)
lasso_result=lars(PAC$X,PAC$y)
cv.lars(PAC$X,PAC$y)

# Or from the chemometrics package:

res_lasso=lassoCV(y~X,data=PAC,fraction=seq(0,1,by=0.05),K=10)
res_lasso$sopt

res_coef=lassocoef(y~X,data=PAC,sopt=res_lasso$sopt)

# In an auto-scaled version:

#res_lasso_scaled=lassoCV(y~scale(X),data=PAC,fraction=seq(0,1,by=0.05),K=10)
#res_lasso_scaled$sopt
#res_coef_scaled=lassocoef(y~scale(X),data=PAC,sopt=res_lasso_scaled$sopt)

# The actual coefficients can be looked at by (on the raw X-scale, I believe):
res_coef=res_coef$coefficients

res_coef

beta_lasso=res_coef

# Let’s try to predict a NEW data set

# here we take the original data - just for illustration:

PAC_test=PAC

# First by ridge regression:

ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)

# Coefficients on raw X-scale (WITH intercept):
beta_ridge=coef(ridge_lm)

yhat_ridge=beta_ridge[1]+PAC_test$X%*%beta_ridge[-1]

residuals=PAC_test$y-yhat_ridge
MSEP=mean(residuals^2)
MSEP

sqrt(MSEP)

plot(PAC_test$y,yhat_ridge,ylim=c(180,520))

# Then by lasso regression:

yhat_lasso=predict(lasso_result,newx=PAC_test$X,s=0.3,mode="fraction")$fit
residuals=PAC_test$y-yhat_lasso

MSEP=mean(residuals^2)
MSEP

sqrt(MSEP)

plot(PAC_test$y,yhat_lasso,ylim=c(180,520))

7.2 Exercise: Prostate Cancer data and maybe pectin data

1. Try both ridge and lasso regression on the Prostate cancer data - a scaled version.

(a) Work on the training data

(b) Find the optimal penalty

(c) Will lasso completely leave out some of the variables?

(d) Interpret/Look at the coefficients

(e) Predict the test data - compare results with other methods, e.g. MLR/PCR/(PLS)

(f) For the ridge model: Compare the "size" of the variable coefficients with the coefficients from MLR and PCR - what can you conclude?

2. Try lasso on the pectin NIR data - how many wavelengths will be completely left out?
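A minimal sketch for getting started on question 1 with the lars package; the object names (prostate_TRAIN with X-matrix X and response lpsa) follow the earlier exercises and are assumptions:

library(lars)
# Sketch: lasso on the scaled prostate training data
Xs=scale(prostate_TRAIN$X)
lasso_res=lars(Xs,prostate_TRAIN$lpsa)
cv_res=cv.lars(Xs,prostate_TRAIN$lpsa,K=10)
s_opt=cv_res$index[which.min(cv_res$cv)]   # fraction minimizing the CV error
beta=coef(lasso_res,s=s_opt,mode="fraction")
sum(beta==0)   # how many variables does the lasso leave out completely?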

8 Discriminant Analysis, LDA, QDA, k-NN, Bayes, PLSDA

Apart from the Lattin chapter, also read in the Varmuza book:

• Section 5.1, Intro, 2.5 pages

• Section 5.2, Linear Methods, 12 pages

• Section 5.3.3, Nearest Neighbour (k-NN), 3 pages

• Section 5.7, Evaluation of classification, 3 pages

8.1 Example: Iris data

# we use the iris data:

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]), Sp = rep(c("s","c","v"), rep(50,3)))

# We make a test and training data:

train <- sample(1:150, 75)
test <- (1:150)[-train]

Iris_train=Iris[train,]

Iris_test=Iris[test,]

# Distribution in three classes in training data:

table(Iris_train$Sp)

# PART 1: LDA

library(MASS)   # lda() and qda() come from the MASS package
lda_train_LOO <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction

# percent correct for each category:

ct <- table(Iris_train$Sp, lda_train_LOO$class)
diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# Plotting only works without the CV-stuff:

lda_train <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)
plot(lda_train) # CVA component score plot no. 1 and 2

plot(lda_train, dimen=1, type="both") # fit from lda

# Or REALLY nice plots in another package:

library(klaR)

partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(1,1,1)/3)

# See the effect of a different prior: (using an extreme one for illustration)

partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(0.02,.49,0.49))

# And now predicting NEW data:

predict(lda_train, Iris_test)$class

# Find confusion table:

ct <- table(Iris_test$Sp, predict(lda_train, Iris_test)$class)
ct

diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# PART 2: QDA

# Most stuff from LDA can be reused:

qda_train_LOO <- qda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction

# percent correct for each category:

ct <- table(Iris_train$Sp, qda_train_LOO$class)
ct

diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# Plotting only works without the CV-stuff:

qda_train <- qda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)

# Or REALLY nice plots in another package:

library(klaR)

partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="qda",prior=c(1,1,1)/3)

# And now predicting NEW data:

predict(qda_train, Iris_test)$class

# Find confusion table:

ct <- table(Iris_test$Sp, predict(qda_train, Iris_test)$class)
ct

diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# Part 3: (Naive) Bayes method

# With "usekernel=TRUE" it fits a density to the observed data and uses that

# instead of the normal distribution

# Let’s explore the results first:

# (Can take a while due to the fitting of densities)

partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="naiveBayes",prior=c(1,1,1)/3,usekernel=TRUE)

bayes_fit=NaiveBayes(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,prior=c(1,1,1)/3,usekernel=TRUE)
plot(bayes_fit)

# And now predicting NEW data:

predict(bayes_fit, Iris_test)$class

# Find confusion table:

ct <- table(Iris_test$Sp, predict(bayes_fit, Iris_test)$class)
ct

diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# PART 4: k-nearest neighbour:

# Explorative plot WARNING: THIS takes some time to produce!!

partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="sknn",kn=3)
knn_fit_5=sknn(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,kn=5)

# And now predicting NEW data:

predict(knn_fit_5, Iris_test)$class

# Find confusion table:

ct <- table(Iris_test$Sp, predict(knn_fit_5, Iris_test)$class)
ct

diag(prop.table(ct, 1))

# total percent correct:
sum(diag(prop.table(ct)))

# PART 5: PLS-DA

# We have to use the "usual" PLS-functions

# Define the response vector (2 classes) OR matrix (>2) classes:

# Let's try with K=2. Group 1: s; Group -1: c and v
Iris_train$Y=-1

Iris_train$Y[Iris_train$Sp=="s"]=1
table(Iris_train$Y)

Iris_train$X=as.matrix(Iris_train[,1:4])

# From here use standard PLS1 predictions, e.g.:

library(pls)

# Pls:

mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(2,2))

plot(mod_pls,labels=rownames(Iris_train),which="validation")

plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright") scoreplot(mod_pls,labels=Iris_train$Sp)

# Be careful about interpretations due to the binary setting

# You should do a CV based confusion table for each component, really:

preds=array(dim=c(length(Iris_train[,1]),4))
# Use the LOO cross-validated predictions (from validation="LOO" above), not the fitted values:
for (i in 1:4) preds[,i]=mod_pls$validation$pred[,1,i]
preds[preds<0]=-1

preds[preds>0]=1

# Look at the results from each of the components:

for (i in 1:4) {
  ct <- table(Iris_train$Y, preds[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}

# The prediction of new data would be handled similarly
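# For example (a sketch; assumes the TEST set gets the same X-matrix construction
# and that 2 components were chosen):
Iris_test$X=as.matrix(Iris_test[,1:4])
preds_test=predict(mod_pls,newdata=Iris_test,ncomp=2)
table(Iris_test$Sp,ifelse(preds_test>0,"s","c/v"))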

# WHAT if we have K=3 classes in this case?
# Look at Varmuza, Sec. 5.2.2.3
K=3
Iris_train$Y=matrix(rep(1,length(Iris_train[,1])*K),ncol=K)
Iris_train$Y[Iris_train$Sp!="c",1]=-1
Iris_train$Y[Iris_train$Sp!="s",2]=-1
Iris_train$Y[Iris_train$Sp!="v",3]=-1

mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation = "LOO", scale = TRUE, jackknife = TRUE)

# Initial set of plots:

par(mfrow=c(1,1))

scoreplot(mod_pls,labels=Iris_train$Sp)

plot(mod_pls,labels=rownames(Iris_train),which="validation")

plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")

plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Predictions from PLS need to be transformed to actual classifications:

# Select the largest one:

preds=array(dim=c(length(Iris_train[,1]),4))
# Again use the LOO cross-validated predictions; select the largest of the K=3 responses:
for (i in 1:4) preds[,i]=apply(mod_pls$validation$pred[,,i],1,which.max)
preds2=array(dim=c(length(Iris_train[,1]),4))

preds2[preds==1]="c"

preds2[preds==2]="s"

preds2[preds==3]="v"

# Look at the results from each of the components:

for (i in 1:4) {
  ct <- table(Iris_train$Sp, preds2[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}

8.2 Exercise: Wine data

Use the wine data previously used. Predict the wine class using the different methods suggested. Try also a PCA-based version of some of the methods. What are the results? Make your own test and training data. Apply CV on the training data and predict the test data. By looking at various plots (in addition to prediction ability), consider which models seem most appropriate for the data.
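As a starting point for the PCA-based variant, a minimal sketch, assuming the wine data are in a data frame wine with the class label in its first column (both names are assumptions):

library(MASS)
# Sketch: LDA on the first few principal component scores
pca=prcomp(wine[,-1],scale.=TRUE)
scores=data.frame(pca$x[,1:3],class=wine[,1])
lda_pca=lda(class~.,data=scores,CV=TRUE)
ct=table(scores$class,lda_pca$class)
1-sum(diag(prop.table(ct)))   # LOO CV error rate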