Introduction Chemometrics

Academic year: 2022

(1)

Chemometrics

Introduction

Course 27411 Biological data analysis and chemometrics Jens C. Frisvad

(2)

Fundamental disciplines in biological sciences

• Classification (theory: taxonomy)

– Discrimination (diagnostics)

– Identification

– Nomenclature

• Cladification (theory: phylogeny)

• Modelling and predictions

• Tests and validations

(3)

Engineering

• The science by which the properties of matter and the sources of energy in nature are made useful to man in structures, machines and products

• Measurement techniques, computers, language, definitions, properties, chemistry, physics, mathematical hard modelling, statistics, chemometrics and many other disciplines are necessary to be a good engineer

(4)

X-metrics

• The use of multivariate statistics in the discipline X

• Psychometrics (used in psychology)

• Taxometrics (used in taxonomy)

• Biometrics (used in biology)

• Technometrics (used in engineering)

• Chemometrics (used in chemistry)

(5)

Chemometrics

• Use of statistics and mathematics in chemical sciences, to measure and interpret chemical data (also used for biological data)

• Biometrics has often been restricted to univariate statistics and taxometrics to biosystematics

(6)

Other definitions of chemometrics:

• Empirical interactive data-driven modelling in chemistry (induction and abduction)

• Exploratory and confirmatory data analysis (hypothesis generating and hypothesis testing) in chemistry

• Predictive multivariate modelling in chemistry

(7)

Important disciplines in chemometrics

• Sampling, selection of objects and variables

• Clustering

• Ordination (projection from N dimensions to a few dimensions, eigenvector-based analyses)

• Multivariate regressions, calibrations and predictions

• Neural networks

• Validation (test set validation, bootstrapping, jackknifing, cross-validation)

• Graphical display and outlier analysis
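
The resampling schemes named under validation differ only in how the objects are divided into training and test sets. A minimal sketch in Python (standard library only) is given here; the function names and the choice of six objects and three folds are illustrative assumptions, not part of the course material.

```python
import random

def k_fold_indices(n, k):
    """Cross-validation: split object indices 0..n-1 into k disjoint test folds."""
    return [list(range(i, n, k)) for i in range(k)]

def jackknife_indices(n):
    """Jackknife (leave-one-out): the i-th training set omits object i."""
    return [[j for j in range(n) if j != i] for i in range(n)]

def bootstrap_indices(n, rng):
    """Bootstrap: draw n object indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

n_objects = 6                                   # hypothetical number of samples
folds = k_fold_indices(n_objects, k=3)          # each object is tested exactly once
loo_sets = jackknife_indices(n_objects)         # n training sets of size n - 1
resample = bootstrap_indices(n_objects, random.Random(0))  # seeded for repeatability
```

In each scheme the model is fitted on the training objects and judged on the objects left out, so the prediction error is not estimated on the data used for fitting.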

(8)

Why use chemometrics?

• Complex systems with many interactions are common in science

• Indirect (and often non-destructive) observation of the world as it is

• An expansion of the human perception (the full electromagnetic spectrum)

• Chromatography and spectroscopy will always yield multivariate data

(9)
(10)

Most new analytical chemical methods give multivariate data

• IR and NIR

• UV-VIS & fluorescence

• NMR and ESR

• MS

• TLC

• HPLC

• GC

• CE

• CCC

(11)

Interpreting chemical data

• Abduction

– Sharp independent signals that can be interpreted: NMR, IR, MS

• Induction

– Soft, overlapping signals (strong interaction, co-linearity): UV, fluorescence, NIR, FIA

– Any chromatographic/spectrometric measurements on mixtures

(12)

[Figure: NIR spectra, ”soft spectra” (from Jens Bo Holm Nielsen)]

(13)

H-NMR spectrum, sharp signals

(14)

No. variables > No. objects

• Classical statistical methods will not work in that case (for example multiple linear regression, linear learning machine etc.)

• Two possible solutions are variable selection, or classical statistics applied to the scores from an eigenvector analysis
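
Why multiple linear regression breaks down here can be checked numerically: with n objects and p > n variables, the p x p matrix X'X has rank at most n, so it cannot be inverted. The sketch below uses a made-up 3 x 4 data matrix and small helper functions written for this illustration (standard library only).

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def matrix_rank(m):
    """Rank by Gaussian elimination, using exact fractions to avoid rounding."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

# Hypothetical data: 3 objects (rows) measured on 4 variables (columns)
X = [[1, 2, 3, 4],
     [2, 1, 0, 5],
     [0, 3, 1, 1]]
XtX = matmul(transpose(X), X)   # 4 x 4 matrix needed by the MLR normal equations
rank_XtX = matrix_rank(XtX)     # at most 3, so XtX is singular
```

Because the rank is lower than the number of variables, the normal equations have no unique solution, which is what motivates reducing the variables first.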

(15)

Important methods in chemometrics (classification)

• Cluster analysis

– Hierarchical clustering

– Divisive clustering

– Block-clustering and fuzzy clustering

• Ordination

– Principal component analysis

– Correspondence analysis

– Multidimensional scaling
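
As an illustration of ordination, the first principal component can be found as the leading eigenvector of the covariance matrix. The sketch below uses plain power iteration on a small invented two-variable data set; in practice a library routine would be used, and the data, function names and iteration count are assumptions made for this example.

```python
import math

def center(data):
    """Subtract the column means (mean-centering)."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, n_iter=200):
    """Leading eigenvector of cov by power iteration (unit length)."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: 5 objects, 2 strongly correlated variables (y is about 2x)
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered = center(data)
loadings = first_component(covariance(centered))
# Scores: projection of each object onto the first principal component
scores = [sum(x * l for x, l in zip(row, loadings)) for row in centered]
```

The loadings describe how the original variables combine into the component, and the scores are the coordinates of the objects in the reduced space.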

(16)

The Celestial Emporium of Benevolent Knowledge (encyclopedia from the 10th century)

An arbitrary and idiosyncratic classification

Classification of plants and animals:

1. Those that belong to the Emperor

2. Embalmed ones

3. Those that are trained

4. Suckling pigs

5. Mermaids

6. Fabulous ones

7. Stray dogs

8. Those that are not included in this classification

9. Those that tremble as if they were mad

10. Innumerable ones

11. Those drawn with a very fine camel's hair brush

12. Others

13. Those that have just broken a flower vase

14. Those that resemble flies from a distance

(17)

Classification is central!

[Diagram with the labels: Classification; Phenotype data (differentiation); Nomenclature; Identification; Cladification; Phylogeny; Chemometrics; Cladistics]

(18)

Examples of chemical data

(19)

[Figure: Aspergillus niger secondary metabolite HPLC profile (sharp signals). Traces: ESI+ TIC (m/z 100-900) and UV/VIS (200-700 nm); x-axis: time (min). Labelled peaks: Aurasperone B, Pyranonigrin A, Ochratoxin A (shoulder), Malformin C, A, B1, B, A1, Rubrofusarin B, Malformin B2, Pyranonigrin B/C, Fumonisin B4, Fumonisin B2, Nigragillin, Funalenone, Nigragillin analogue, Tensidol B, Fonsecin B, Fonsecin, Aurasperone G & E, Demethylkotanin, Tensyuic acid A, Aurasperone C]

(20)

[Figure: HPLC-ESI+ chromatograms (Luna C18 (2) column) of A. niger NRRL 3122 from YES agar (sample Kir24877, TOF MS ES+). Upper trace: extracted ion chromatogram m/z 404; lower trace: total ion chromatogram (m/z 100-900); x-axis: time. Annotations: Ochratoxin A (calculated mass 404.0901, deviation -4.2 ppm), chlorine isotope pattern, Fumonisin B2, Fumonisin B4]

(21)

Clustering of some common Penicillia based on 31 extrolite biosynthetic families

[Dendrogram: UPGMA clustering on the Yule coefficient (axis from -1.00 to 1.00) of isolates ver1-4, nor1-4, vir1-4, aur1-4, cyc1-4, pol1-4, cru1-4, com1-4 and aet1-4. Annotated groups: P. verrucosum sensu lato, Series Verrucosa, Series Viridicata, Series Camemberti]

(22)

Look, daddy, look at the big AACACTGTATCTAATTATT!!!!

Aren’t they cool the new barcodes they’ve given us?

As if that’s something

Biosystematics: Genome or phenome?

Politiken, 16/3 2003

(23)

Methods used in cladistics

• Parsimony

• Maximum likelihood

• Nearest neighbour

• Bayes analysis

• Validation: Often boot-strapping

(24)

Important methods in chemometrics (regression)

• Regression

– MLR (multiple linear regression)

– PCR (principal component regression)

– PLSR (partial least squares regression)

– RR (ridge regression)

• Neural Networks
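
To make the relation between MLR and ridge regression concrete, the sketch below solves the ridge normal equations (X'X + lambda I) b = X'y for a small invented data set with two nearly collinear variables. Setting lambda to zero gives the MLR solution. The data, the function names and the choice lambda = 1 are assumptions for illustration; no intercept is fitted, and PCR and PLSR are not shown.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ridge(X, y, lam):
    """Coefficients b = (X'X + lam*I)^-1 X'y; lam = 0 reduces to MLR."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    p = len(XtX)
    A = [[XtX[i][j] + (lam if i == j else 0.0) for j in range(p)] for i in range(p)]
    Xty = [sum(x * t for x, t in zip(col, y)) for col in Xt]
    return solve(A, Xty)

# Hypothetical data: 4 objects, 2 nearly collinear predictor variables
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
y = [2.0, 4.1, 6.0, 8.1]
b_mlr = ridge(X, y, 0.0)    # ordinary multiple linear regression
b_ridge = ridge(X, y, 1.0)  # ridge penalty shrinks the coefficient vector
```

The penalty term makes the matrix invertible even when the variables are collinear, at the cost of some bias in the coefficients.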

(25)

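The different kinds of research

Classified by quest for fundamental understanding (yes/no) and considerations of use (yes/no):

• Pure basic research (Bohr): fundamental understanding yes, considerations of use no

• Use-inspired basic research (Pasteur): fundamental understanding yes, considerations of use yes

• Pure applied research (Edison): fundamental understanding no, considerations of use yes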

(26)

The scientific method

• Hypothesis

• Prediction

• Test and validation

• Repeat

The scientific method is a recursive system of matching theory with observation

A hypothesis is a tentatively held conjecture for the purposes of developing predictions of empirical observations

(27)

The scientific method (deduction)

• Discovery, observations, ideas, intuition, former results

• Propose a hypothesis, connect it by logical derivation to known theory and propose a mathematical model. Support it with several experiments/observations (tests) and also try to disprove the hypothesis (deduction: from the general to the specific)

(28)

The scientific method (induction)

• Gather many objects and measure by a series of features.

• Classify and find latent features.

• Predict by regression.

• Validate.

• Connect with known theory and set up hypotheses or set up experimental designs to find important features and their dimensionality

• (induction: from the specific to the general)

(29)

Abduction, deduction and induction

• Science often exhibits a subtle interplay between abduction, induction and deduction. Abduction is a common process of creating new generalizations, theories and hypotheses. Deduction takes a hypothesis to make a specific prediction. ”Then” induction is used to fit the evidence to the hypothesis.

(30)

Levels of knowledge

• One (few) example(s): ”Layman's science”

• Neural networks and validation

• X-metrics and validation

• Statistics and distributions

• Mathematical exact modelling

• Essentialism

(31)

Technology and science

”It’s alright in practice, will it ever work in theory?”

• Theory (Plato): Know what (clever people)

• Practice: Know how (skilled people)

(32)

Advantages of technology (applied research)

• Holistic, not reductionistic

• Context driven, not subject driven

• Mission-oriented research, not ”blue skies”

• Team work, not individual scholar

• Divergent, not convergent thinking

• Decisive criterion: does it work?

(33)

Beware of pure reductionism

”We must reject this primitive and almost cannibalistic delusion about knowledge, that an understanding of something requires first that we dismantle it, like a child who pulls a watch to pieces and spreads out the wheels in order to understand the mechanism” (Thom, 1975)

(34)

Models are not reality

[Diagram relating Model to Precision, Realism and Generality]

(35)
(36)

Systematic generalization

(hierarchical reductionism)

”We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”

”Have you used it much?” I enquired.

”It has never been spread out, yet,” said Mein Herr: ”the farmers objected: they said it would cover the whole country, and shut out the sunlight! So now we use the country itself, as its own map, and I assure you it does nearly as well.”

(Lewis Carroll, Sylvie and Bruno Concluded, 1893)

(37)

Chemometrics and science?

• Find an important problem in your field of interest (FOI) to which there is yet no solution (think!). Propose a preliminary hypothesis.

• Observe and measure within the FOI

• Use statistical and multivariate design

• Propose a hypothesis (think!)

• Experiments and/or observations: tests and predictions based on proposed model

• Reject hypothesis or accept it for the time being

(38)

This course

• In this course you will get hands-on experience in how to treat data with many features (variables) measured on several or many objects

(39)

Learning objectives of the course

Give an overview of important chemometric methods

Identify situations where exploratory data-analysis is required

Describe and use different forms of scaling, transformation and normalization

Understand and describe the difference between classification and regression

Understand and describe the difference between clustering and ordination

Apply and interpret principal component analysis (PCA) on multivariate data

Apply and interpret the principles of validation and outlier detection

Use and interpret cluster analysis

Describe, apply and interpret multiple linear regression (MLR) and ridge regression (RR) and where to apply them in two data matrix problems

Describe, apply and interpret principal component regression (PCR) and partial least squares regression (PLSR) and where to apply them in two data matrix problems

Apply and interpret correspondence analysis

Describe the method metric multidimensional scaling
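
For the objective on scaling, transformation and normalization, the three operations can be sketched as follows. The peak-area values and function names are invented for illustration, and autoscaling (mean-centering followed by division by the standard deviation) is only one of several scaling choices.

```python
import math
import statistics

def autoscale(column):
    """Scaling: centre a variable and divide by its sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

def log_transform(column):
    """Transformation: log10 compresses large dynamic ranges (values must be > 0)."""
    return [math.log10(v) for v in column]

def normalize_row(row):
    """Normalization: express one object's values as fractions of their sum."""
    total = sum(row)
    return [v / total for v in row]

# Hypothetical peak areas for one variable measured on four objects
peak_areas = [120.0, 15.0, 980.0, 45.0]
scaled = autoscale(peak_areas)
logged = log_transform(peak_areas)

# Hypothetical profile of one object across three variables
profile = normalize_row([30.0, 50.0, 20.0])
```

Scaling and transformation act on variables (columns), whereas normalization as sketched here acts on objects (rows).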

(40)

Programs used

• R, RStudio (free software)

• NTSYS (Exeter publishing) version 2.2

– A whole package on the methods used most frequently: used earlier in this course

– (http://www.exetersoftware.com)(USA)

• UNSCRAMBLER version 10.3 (CAMO, Norway)

– You can buy your own version, but it is expensive (http://www.camo.com)

(41)

Book used

• Lattin, J., Carroll, J.D., Green, P.E.: Analyzing multivariate data, Thomson, Pacific Grove, CA, USA, 2003, 556 pp.

• Recommended: E-books on R

• + a little extra reading material, especially Romesburg (1984) on cluster analysis
