projection as is suggested in [19]. In order to find the high dimensional basis vectors, Random Projection can be used. The goal is to project the data to a lower dimensional space without causing a significant change in the distances. For the second step, classic methods like Principal Component Analysis, Factor Analysis [40, 44, 45] or Independent Component Analysis [49] can be used so that the important directions are extracted.

**2.3 Selection of number of principal components**

All the projection techniques presented in this chapter estimate the set of projection directions, but leave open the question of how the optimum dimension of the projected space should be chosen. It is important to notice that too small a dimension may result in serious information loss, whilst one that is too large will often introduce noise into the projected data. In both cases the consequences are significant, in the form of large errors in information extraction, clustering, density estimation, prediction, etc.

Thus, several methods have been developed for selecting the optimum dimension. An overview is presented below of some of the techniques for selecting the number of principal components in the case of PCA (described in section 2.1.2).

The simplest technique for finding the effective dimension K is to calculate the eigenvalue curve and base the decision on its shape. If the effective dimensionality of the data is smaller than the actual data dimension D, the eigenvalue curve will have the shape of the letter “L”. Various, mainly ad hoc, rules have been proposed for estimating the turning point of that curve^{6}, which minimizes the projection error [10, 45]. One way is to use the cumulative percentage of the total variation: it is required that the selected components contribute significantly, say 80%-90%, of the total variation. A similar way is to select a threshold value below which components with smaller variances/eigenvalues are discarded, so that only the largest/dominant components contribute to the final result. In this approach, typically those eigenvalues are rejected that are smaller than 75-90% of the maximum.

^{6}The turning point that determines the optimal K is usually ambiguous, due to the fact that the L-curve often has a smooth characteristic.
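The two eigenvalue-based cut-off rules above can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming the eigenvalues of the data covariance matrix have already been computed; the cut-off values (90% cumulative variation, 80% of the maximum eigenvalue) are hypothetical example choices from the ranges quoted above:

```python
import numpy as np

def select_k_cumulative(eigvals, threshold=0.9):
    """Smallest K whose leading components explain `threshold`
    (here a hypothetical 90%) of the total variation."""
    eigvals = np.sort(eigvals)[::-1]              # descending order
    cum = np.cumsum(eigvals) / np.sum(eigvals)    # cumulative fraction
    return int(np.searchsorted(cum, threshold) + 1)

def select_k_dominant(eigvals, cutoff=0.8):
    """Keep only components whose eigenvalue is at least `cutoff`
    times the largest one (a hypothetical 80% cut-off)."""
    eigvals = np.sort(eigvals)[::-1]
    return int(np.sum(eigvals >= cutoff * eigvals[0]))
```

For an eigenvalue spectrum such as [5.0, 4.5, 0.3, 0.2], both rules select K = 2: the first two components carry 95% of the variation, and only they exceed 80% of the maximum eigenvalue.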

Another approach is proposed by Girolami in [32]. It suggests using both the ordering information included in the eigenvalues and the variation of the eigenvectors. Thus, instead of looking at the eigenvalue curve, we investigate the transformed curve of the form $\lambda_{k}\left(\frac{1}{N}\sum_{j} u_{kj}^{2}\right)$. Both the thresholding and the turning point search can be applied. This transformation usually makes the slope of the L-curve steeper, which facilitates the decision.

Cattell in [15] suggests finding the largest eigenvalue beyond the point where the curve becomes a straight line as close to horizontal as possible. A similar technique can be found in [18, 45], but there the results are based on the log-eigenvalue curve. Craddock and Flood in [18] claim that the eigenvalues corresponding to the noise should decay in geometric progression, which produces a straight line in the logarithmic space.
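The Craddock and Flood observation can be turned into a simple numeric rule: fit a straight line to the tail of the log-eigenvalue curve (assumed to be geometrically decaying noise) and keep the components that lie clearly above it. The sketch below is one possible, hypothetical implementation; the tail length and the tolerance are arbitrary illustrative choices, not taken from [18]:

```python
import numpy as np

def select_k_log_eigenvalue(eigvals, tail=4, tol=0.5):
    """Fit a straight line to the last `tail` log-eigenvalues
    (assumed pure noise, decaying geometrically) and return the
    number of components whose log-eigenvalue lies more than `tol`
    above that line.  `tail` and `tol` are hypothetical choices."""
    eigvals = np.sort(eigvals)[::-1]
    log_l = np.log(eigvals)
    idx = np.arange(len(eigvals))
    # straight-line fit to the noise tail in log space
    slope, intercept = np.polyfit(idx[-tail:], log_l[-tail:], 1)
    above = log_l - (slope * idx + intercept) > tol
    return int(np.max(np.nonzero(above)[0]) + 1) if above.any() else 0
```

With three signal eigenvalues [100, 50, 20] followed by a geometric noise tail [1, 0.5, 0.25, 0.125, 0.0625], the rule recovers K = 3.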

The success of the methods mentioned above is subjective and dependent on the choice of cut-off level. Nevertheless, they are often used due to their simplicity and fairly reliable outcomes.

Additionally, objective rules have been developed for choosing the number of principal components [37, 45]. Usually in such cases the decision is based on the cross-validation error.

For example, the cross-validated choice is proposed by [24, 45, 90]. For an increasing number of retained components, the Least Square error is estimated between the original data and the leave-one-out estimate of the data in the reduced space. Thus, the so-called Prediction Sum of Squares is calculated based on the following equation

$$\mathrm{PRESS}(K) = \sum_{d=1}^{D}\sum_{n=1}^{N}\left(\hat{x}_{dn} - x_{dn}\right)^{2}, \qquad (2.5)$$

where $\hat{x}_{dn} = \sum_{k=1}^{K}\hat{u}_{dk}\hat{s}_{k}\hat{v}_{nk}^{T}$ is estimated from the data set diminished by the $x_{dn}$ sample.
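A leave-one-out computation in the spirit of equation 2.5 can be sketched as follows. Note a deliberate simplification: equation 2.5 leaves out single entries $x_{dn}$, whereas this sketch leaves out whole observations (columns of the D x N data matrix), which is cheaper and easier to illustrate:

```python
import numpy as np

def press(X, K):
    """Leave-one-out Prediction Sum of Squares for a rank-K PCA model.
    Simplified sketch: each observation (column of the D x N matrix X)
    is left out in turn, the K leading principal directions are
    re-estimated from the remaining columns, and the held-out column
    is reconstructed in that subspace."""
    D, N = X.shape
    err = 0.0
    for n in range(N):
        Xr = np.delete(X, n, axis=1)               # leave one column out
        mu = Xr.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Xr - mu, full_matrices=False)
        Uk = U[:, :K]                              # K leading directions
        xc = X[:, [n]] - mu
        xhat = Uk @ (Uk.T @ xc)                    # project and reconstruct
        err += float(np.sum((xhat - xc) ** 2))
    return err
```

For data generated exactly in a K-dimensional subspace, PRESS(K) is essentially zero, while smaller ranks leave a substantial residual, so the curve flattens at the effective dimension.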

The next approach is presented in line with Hansen [37] and Minka [61]. The D-dimensional observation vectors x are assumed here to be a composition of the signal and white noise: x = s + v. If we further assume that both the signal and the noise are normally distributed, then the observed N-point data sample x will also have normal characteristics of the form $\mathcal{N}(\mu_{s}, \Sigma_{s}+\Sigma_{v})$. Notice that the noise has zero mean: $\mu_{v} = 0$.


Thus, the density function of the data can be expressed by:

$$p(x|\mu_{s},\Sigma_{s},\Sigma_{v}) = \frac{1}{\sqrt{|2\pi(\Sigma_{s}+\Sigma_{v})|}}\exp\left(-\frac{1}{2}(x-\mu_{s})^{T}(\Sigma_{s}+\Sigma_{v})^{-1}(x-\mu_{s})\right). \qquad (2.6)$$
$\Sigma_{s}$ is further assumed to be singular, i.e. of rank $K \leq D$, and $\Sigma_{v} = \sigma_{v}^{2}I_{D}$, where $I_{D}$ is the $D \times D$ identity matrix. Then $\Sigma = \Sigma_{s} + \sigma_{v}^{2}I_{D}$, and it can be estimated from the data set:

$$\mu_{s} = \frac{1}{N}\sum_{n=1}^{N} x_{n}, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}(x_{n}-\mu_{s})(x_{n}-\mu_{s})^{T}. \qquad (2.7)$$

Further, $\hat{\Sigma}$ can be decomposed by singular value decomposition $\hat{\Sigma} = S\Lambda S^{T}$, where $S = \{s_{d},\ d = 1 \ldots D\}$ collects the eigenvectors and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_{k}$ in descending order. Since the variation of the noise is assumed to be significantly lower than that of the signal, the first eigenvalues represent the signal components while the last ones are responsible for the noise. Thus, by truncation in the eigenvalue space, the noise level can be estimated:

$$\hat{\Sigma}_{K} = S \cdot \mathrm{diag}([\lambda_{1}, \lambda_{2}, \cdots, \lambda_{K}, 0, \cdots, 0]) \cdot S^{T}, \qquad (2.8)$$

$$\hat{\sigma}^{2} = \frac{1}{D-K}\,\mathrm{Trace}(\hat{\Sigma} - \hat{\Sigma}_{K}), \qquad (2.9)$$

$$\hat{\Sigma}_{v} = \hat{\sigma}^{2} I_{D}, \qquad (2.10)$$

and then the signal covariance is described by:

$$\hat{\Sigma}_{s} = S \cdot \mathrm{diag}([\lambda_{1}-\hat{\sigma}^{2}, \lambda_{2}-\hat{\sigma}^{2}, \cdots, \lambda_{K}-\hat{\sigma}^{2}, 0, \cdots, 0]) \cdot S^{T}. \qquad (2.11)$$
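The estimation steps of equations 2.7-2.11 can be sketched directly in NumPy; this is an illustrative sketch, assuming the data arrive as a D x N matrix and that the signal rank K is given:

```python
import numpy as np

def split_signal_noise(X, K):
    """Estimate signal and noise covariances by eigenvalue truncation
    (equations 2.7-2.11).  X is D x N; K is the assumed signal rank."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Sigma = (X - mu) @ (X - mu).T / N            # eq. 2.7
    lam, S = np.linalg.eigh(Sigma)
    lam, S = lam[::-1], S[:, ::-1]               # descending eigenvalues
    sigma2 = lam[K:].sum() / (D - K)             # eq. 2.9: noise level
    lam_s = np.concatenate([lam[:K] - sigma2, np.zeros(D - K)])
    Sigma_s = S @ np.diag(lam_s) @ S.T           # eq. 2.11: signal covariance
    Sigma_v = sigma2 * np.eye(D)                 # eq. 2.10: noise covariance
    return mu, Sigma_s, Sigma_v
```

A useful sanity check follows from the construction: the traces satisfy $\mathrm{Trace}(\hat{\Sigma}_{s} + \hat{\Sigma}_{v}) = \mathrm{Trace}(\hat{\Sigma})$, since the K retained eigenvalues each lose $\hat{\sigma}^{2}$ while the noise term adds $D\hat{\sigma}^{2}$ back.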

If, in addition to the sample x, $N_{z}$ points $z_{n}$, $n = \{1, 2, \ldots, N_{z}\}$, are observed, forming the test set, then the generalization error^{7} can be estimated in a cross-validation scheme as the negative log-likelihood over this test set^{8}

^{7}The generalization error is defined as the expected loss on a future independent sample.
$$\hat{G}_{z} = -\frac{1}{N_{z}}\sum_{i=1}^{N_{z}} \log p(z_{i}|\mu_{s},\hat{\Sigma}_{s},\hat{\Sigma}_{v})$$

$$= -\frac{1}{N_{z}}\sum_{i=1}^{N_{z}} \log \frac{\exp\left(-\frac{1}{2}(z_{i}-\mu_{s})^{T}(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})^{-1}(z_{i}-\mu_{s})\right)}{\sqrt{|2\pi(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})|}}$$

$$= \frac{1}{2}\log|2\pi(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})| + \frac{1}{2}\mathrm{Trace}[(\hat{\Sigma}_{s}+\hat{\Sigma}_{v})^{-1}\hat{\Sigma}_{z}], \qquad (2.12)$$
where $\hat{\Sigma}_{s}$ and $\hat{\Sigma}_{v}$ are the estimated covariance structures of the signal and the noise, respectively, for the K largest eigenvalues, and the covariance of the test set is $\hat{\Sigma}_{z} = \frac{1}{N_{z}}\sum_{n=1}^{N_{z}}(z_{n}-\mu_{s})(z_{n}-\mu_{s})^{T}$.

By minimizing the approximated generalization error the optimum signal space is found.
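The closed form of equation 2.12 is straightforward to evaluate; the sketch below assumes the mean and the signal/noise covariance estimates are already available (e.g. from the truncation step above), and would be called once per candidate K to locate the minimum:

```python
import numpy as np

def gen_error(Z, mu, Sigma_s, Sigma_v):
    """Approximated generalization error of equation 2.12, evaluated
    on a D x Nz test set Z:
    0.5*log|2*pi*(Sigma_s+Sigma_v)|
      + 0.5*Trace((Sigma_s+Sigma_v)^-1 Sigma_z)."""
    D, Nz = Z.shape
    C = Sigma_s + Sigma_v
    Zc = Z - mu
    Sigma_z = Zc @ Zc.T / Nz                    # test-set covariance
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return 0.5 * logdet + 0.5 * np.trace(np.linalg.solve(C, Sigma_z))
```

Scanning K from 1 to D and taking the minimizer of this quantity implements the model selection rule described in the text.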

Some of the presented approaches are illustrated with an example. For this purpose a Gaussian cluster is generated, where the effective dimension is equal to 3, but the signal, due to the added noise, exists in a 10-dimensional space. In order to determine the effective dimension of the data, the approximated generalization error is calculated according to equation 2.12. Results are presented in figure 2.10 (lower right plot). The optimum dimension that gives a minimum generalization error is equal to 3. Additionally, the Prediction Sum of Squares (equation 2.5) is computed. Here, the curve flattens at the third component, suggesting that dimension for the signal space. For comparison the eigenvalue curves are also shown. Both the eigenvalues of the data and the scaled version proposed in [32] indicate 3 components.