projection as is suggested in [19]. In order to find the high dimensional basis vectors, Random Projection can be used. The goal is to project the data to a lower dimensional space without causing a significant change in the distances. For the second step, classic methods like Principal Component Analysis, Factor Analysis [40, 44, 45] or Independent Component Analysis [49] can be used so that the important directions are extracted.

**2.3 Selection of number of principal components**

All the projection techniques presented in this chapter estimate the set of projection directions, but leave open the question of how the optimum dimension of the projected space should be chosen. It is important to notice that too small a dimension may result in serious information loss, whilst one that is too large will often introduce noise into the projected data. In both cases the consequences are significant, in the form of large errors in information extraction, clustering, density estimation, prediction, etc.

Thus, several methods have been developed for selecting the optimum dimension. An overview is presented below of some of the techniques for selecting the number of principal components in the case of PCA (described in section 2.1.2).

The simplest technique for finding the effective dimension K is to calculate the eigenvalue curve and base the decision on its shape. If the effective dimensionality of the data is smaller than the actual data dimension D, the eigenvalue curve will have the shape of the letter “L”. Various, mainly ad hoc, rules have been proposed for estimating the turning point of that curve^{6}, which minimizes the projection error [10, 45]. One way is to use the cumulative percentage of the total variation: it is required that the selected components contribute significantly, say 80%-90%, of the total variation. A similar way is to select a threshold value below which components with smaller variances/eigenvalues are discarded, so that only the largest/dominant components contribute to the final result. In this approach, typically those eigenvalues are rejected that are smaller than 75-90% of the maximum.

^{6}The turning point that determines the optimal K is usually ambiguous, due to the fact that the L-curve often has a smooth characteristic.
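The two eigenvalue-based cut-off rules above can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming the eigenvalues of the data covariance matrix have already been computed; the cut-off values (90% cumulative variation, 80% of the maximum eigenvalue) are hypothetical example choices from the ranges quoted above:

```python
import numpy as np

def select_k_cumulative(eigvals, threshold=0.9):
    """Smallest K whose leading components explain `threshold`
    (here a hypothetical 90%) of the total variation."""
    eigvals = np.sort(eigvals)[::-1]              # descending order
    cum = np.cumsum(eigvals) / np.sum(eigvals)    # cumulative fraction
    return int(np.searchsorted(cum, threshold) + 1)

def select_k_dominant(eigvals, cutoff=0.8):
    """Keep only components whose eigenvalue is at least `cutoff`
    times the largest one (a hypothetical 80% cut-off)."""
    eigvals = np.sort(eigvals)[::-1]
    return int(np.sum(eigvals >= cutoff * eigvals[0]))
```

For an eigenvalue spectrum such as [5.0, 4.5, 0.3, 0.2], both rules select K = 2: the first two components carry 95% of the variation, and only they exceed 80% of the maximum eigenvalue.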

Another approach is proposed by Girolami in [32]. It suggests using both the ordering information included in the eigenvalues and the variation of the eigenvectors. Thus, instead of looking at the eigenvalue curve, we investigate the transformed curve of the form $\lambda_{k}\left(\frac{1}{N}\sum_{j} u_{kj}^{2}\right)$. Both the thresholding and the turning point search can be applied. This transformation usually makes the slope of the L-curve steeper, which facilitates the decision.

Cattell in [15] suggests finding the largest eigenvalue beyond the point where the curve becomes a straight line as close to horizontal as possible. A similar technique can be found in [18, 45], but there the results are based on the log-eigenvalue curve. Craddock and Flood in [18] claim that the eigenvalues corresponding to the noise should decay in geometric progression, which produces a straight line in the logarithmic space.
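The Craddock and Flood observation can be turned into a simple numeric rule: fit a straight line to the tail of the log-eigenvalue curve (assumed to be geometrically decaying noise) and keep the components that lie clearly above it. The sketch below is one possible, hypothetical implementation; the tail length and the tolerance are arbitrary illustrative choices, not taken from [18]:

```python
import numpy as np

def select_k_log_eigenvalue(eigvals, tail=4, tol=0.5):
    """Fit a straight line to the last `tail` log-eigenvalues
    (assumed pure noise, decaying geometrically) and return the
    number of components whose log-eigenvalue lies more than `tol`
    above that line.  `tail` and `tol` are hypothetical choices."""
    eigvals = np.sort(eigvals)[::-1]
    log_l = np.log(eigvals)
    idx = np.arange(len(eigvals))
    # straight-line fit to the noise tail in log space
    slope, intercept = np.polyfit(idx[-tail:], log_l[-tail:], 1)
    above = log_l - (slope * idx + intercept) > tol
    return int(np.max(np.nonzero(above)[0]) + 1) if above.any() else 0
```

With three signal eigenvalues [100, 50, 20] followed by a geometric noise tail [1, 0.5, 0.25, 0.125, 0.0625], the rule recovers K = 3.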

The success of the methods mentioned above is subjective and dependent on the choice of cut-off level. Nevertheless, they are often used due to their simplicity and fairly reliable outcomes.

Additionally, objective rules have been developed for choosing the number of principal components [37, 45]. Usually in such cases the decision is based on the cross-validation error.

For example, the cross-validated choice is proposed by [24, 45, 90]. For an increasing number of retained components, the Least Square error is estimated between the original data and the leave-one-out estimate of the data in the reduced space. Thus, the so-called Prediction Sum of Squares is calculated based on the following equation

$$\mathrm{PRESS}(K) = \sum_{d=1}^{D}\sum_{n=1}^{N}\left(\hat{x}_{dn} - x_{dn}\right)^{2}, \qquad (2.5)$$

where $\hat{x}_{dn} = \sum_{k=1}^{K}\hat{u}_{dk}\hat{s}_{k}\hat{v}_{nk}^{T}$ is estimated from the data set diminished by the $x_{dn}$ sample.
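A leave-one-out computation in the spirit of equation 2.5 can be sketched as follows. Note a deliberate simplification: equation 2.5 leaves out single entries $x_{dn}$, whereas this sketch leaves out whole observations (columns of the D x N data matrix), which is cheaper and easier to illustrate:

```python
import numpy as np

def press(X, K):
    """Leave-one-out Prediction Sum of Squares for a rank-K PCA model.
    Simplified sketch: each observation (column of the D x N matrix X)
    is left out in turn, the K leading principal directions are
    re-estimated from the remaining columns, and the held-out column
    is reconstructed in that subspace."""
    D, N = X.shape
    err = 0.0
    for n in range(N):
        Xr = np.delete(X, n, axis=1)               # leave one column out
        mu = Xr.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Xr - mu, full_matrices=False)
        Uk = U[:, :K]                              # K leading directions
        xc = X[:, [n]] - mu
        xhat = Uk @ (Uk.T @ xc)                    # project and reconstruct
        err += float(np.sum((xhat - xc) ** 2))
    return err
```

For data generated exactly in a K-dimensional subspace, PRESS(K) is essentially zero, while smaller ranks leave a substantial residual, so the curve flattens at the effective dimension.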

The next approach is presented in line with Hansen [37] and Minka [61]. The D-dimensional observation vectors x are assumed here to be a composition of the signal and white noise: x = s + v. If we further assume that both the signal and the noise are normally distributed, then the observed N-point data sample x will also have normal characteristics of the form $\mathcal{N}(\mu_{s}, \Sigma_{s}+\Sigma_{v})$. Notice that the noise has zero mean: $\mu_{v} = 0$.


Thus, the density function of the data can be expressed by:

$$p(x|\mu_{s},\Sigma_{s},\Sigma_{v}) = \frac{1}{\sqrt{|2\pi(\Sigma_{s}+\Sigma_{v})|}}\exp\left(-\frac{1}{2}(x-\mu_{s})^{T}(\Sigma_{s}+\Sigma_{v})^{-1}(x-\mu_{s})\right). \qquad (2.6)$$
$\Sigma_{s}$ is further assumed to be singular, i.e. of rank $K \leq D$, and $\Sigma_{v} = \sigma_{v}^{2}I_{D}$, where $I_{D}$ is the $D \times D$ identity matrix. Then $\Sigma = \Sigma_{s} + \sigma_{v}^{2}I_{D}$, and it can be estimated from the data set:

$$\mu_{s} = \frac{1}{N}\sum_{n=1}^{N} x_{n}, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}(x_{n}-\mu_{s})(x_{n}-\mu_{s})^{T}. \qquad (2.7)$$

Further, $\hat{\Sigma}$ can be decomposed by singular value decomposition $\hat{\Sigma} = S\Lambda S^{T}$, where $S = \{s_{d},\ d = 1 \ldots D\}$ collects the eigenvectors and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_{k}$ in descending order. Since the variation of the noise is assumed to be significantly lower than that of the signal, the first eigenvalues represent the signal components while the last ones are responsible for the noise. Thus, by truncation in the eigenvalue space, the noise level can be estimated:

$$\hat{\Sigma}_{K} = S \cdot \mathrm{diag}([\lambda_{1}, \lambda_{2}, \cdots, \lambda_{K}, 0, \cdots, 0]) \cdot S^{T}, \qquad (2.8)$$

$$\hat{\sigma}^{2} = \frac{1}{D-K}\,\mathrm{Trace}(\hat{\Sigma} - \hat{\Sigma}_{K}), \qquad (2.9)$$

$$\hat{\Sigma}_{v} = \hat{\sigma}^{2} I_{D}, \qquad (2.10)$$

and then the signal covariance is described by:

$$\hat{\Sigma}_{s} = S \cdot \mathrm{diag}([\lambda_{1}-\hat{\sigma}^{2}, \lambda_{2}-\hat{\sigma}^{2}, \cdots, \lambda_{K}-\hat{\sigma}^{2}, 0, \cdots, 0]) \cdot S^{T}. \qquad (2.11)$$
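The estimation steps of equations 2.7-2.11 can be sketched directly in NumPy; this is an illustrative sketch, assuming the data arrive as a D x N matrix and that the signal rank K is given:

```python
import numpy as np

def split_signal_noise(X, K):
    """Estimate signal and noise covariances by eigenvalue truncation
    (equations 2.7-2.11).  X is D x N; K is the assumed signal rank."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Sigma = (X - mu) @ (X - mu).T / N            # eq. 2.7
    lam, S = np.linalg.eigh(Sigma)
    lam, S = lam[::-1], S[:, ::-1]               # descending eigenvalues
    sigma2 = lam[K:].sum() / (D - K)             # eq. 2.9: noise level
    lam_s = np.concatenate([lam[:K] - sigma2, np.zeros(D - K)])
    Sigma_s = S @ np.diag(lam_s) @ S.T           # eq. 2.11: signal covariance
    Sigma_v = sigma2 * np.eye(D)                 # eq. 2.10: noise covariance
    return mu, Sigma_s, Sigma_v
```

A useful sanity check follows from the construction: the traces satisfy $\mathrm{Trace}(\hat{\Sigma}_{s} + \hat{\Sigma}_{v}) = \mathrm{Trace}(\hat{\Sigma})$, since the K retained eigenvalues each lose $\hat{\sigma}^{2}$ while the noise term adds $D\hat{\sigma}^{2}$ back.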

If, in addition to the sample x, $N_{z}$ points $z_{n}$, $n = \{1, 2, \ldots, N_{z}\}$, are observed, forming the test set, then the generalization error^{7} can be estimated in a cross-validation scheme as the negative log-likelihood over this test set^{8}

^{7}The generalization error is defined as the expected loss on a future independent sample.
$$\hat{G}_{z} = -\frac{1}{N_{z}}\sum_{i=1}^{N_{z}} \log p(z_{i}|\mu_{s},\hat{\Sigma}_{s},\hat{\Sigma}_{v})$$

$$= -\frac{1}{N_{z}}\sum_{i=1}^{N_{z}} \log \frac{\exp\left(-\frac{1}{2}(z_{i}-\mu_{s})^{T}(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})^{-1}(z_{i}-\mu_{s})\right)}{\sqrt{|2\pi(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})|}}$$

$$= \frac{1}{2}\log|2\pi(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})| + \frac{1}{2}\mathrm{Trace}[(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})^{-1}\hat{\Sigma}_{z}], \qquad (2.12)$$
where $\hat{\Sigma}_{s}$ and $\hat{\Sigma}_{v}$ are the estimated covariance structures of the signal and the noise, respectively, for the K largest eigenvalues, and the covariance of the test set is $\hat{\Sigma}_{z} = \frac{1}{N_{z}}\sum_{n=1}^{N_{z}}(z_{n}-\mu_{s})(z_{n}-\mu_{s})^{T}$.

By minimizing the approximated generalization error the optimum signal space is found.
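The closed form of equation 2.12 is straightforward to evaluate; the sketch below assumes the mean and the signal/noise covariance estimates are already available (e.g. from the truncation step above), and would be called once per candidate K to locate the minimum:

```python
import numpy as np

def gen_error(Z, mu, Sigma_s, Sigma_v):
    """Approximated generalization error of equation 2.12, evaluated
    on a D x Nz test set Z:
    0.5*log|2*pi*(Sigma_s+Sigma_v)|
      + 0.5*Trace((Sigma_s+Sigma_v)^-1 Sigma_z)."""
    D, Nz = Z.shape
    C = Sigma_s + Sigma_v
    Zc = Z - mu
    Sigma_z = Zc @ Zc.T / Nz                    # test-set covariance
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return 0.5 * logdet + 0.5 * np.trace(np.linalg.solve(C, Sigma_z))
```

Scanning K from 1 to D and taking the minimizer of this quantity implements the model selection rule described in the text.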

Some of the presented approaches are illustrated with an example. For this purpose a Gaussian cluster is generated, where the effective dimension is equal to 3, but the signal, due to the added noise, exists in a 10-dimensional space. In order to determine the effective dimension of the data, the approximated generalization error is calculated according to equation 2.12. Results are presented in figure 2.10 (lower right plot). The optimum dimension that gives a minimum generalization error is equal to 3. Additionally, the Prediction Sum of Squares (equation 2.5) is computed. Here, the curve flattens at the third component, suggesting that dimension for the signal space. For comparison the eigenvalue curves are also shown. Both the eigenvalues of the data and the scaled version proposed in [32] indicate 3 components.