**2.1 Dimensionality reduction methods**

**2.1.1 Random Projection**

A substantial body of research has investigated the Random Projection method. Many details and experiments not included in this work can be found in [4, 9, 19].

The method is simple and does not require access to the data while the
projection matrix is created, which many consider a significant advantage.
Random Projection is based on a set of vectors obtained by orthogonalizing a
normally distributed random matrix. The orthogonal projection vectors^{1}
(R·R^{T} = I) preserve the distances among the data points. In this way it is possible to

^{1}Two vectors u and v are said to be orthogonal if the cosine of the angle between them equals 0, i.e. u·v^{T} = 0, where v·v^{T} = 1 and u·u^{T} = 1.

project the data to a lower-dimensional space while preserving fair
separation of the clusters. Note, however, that R is orthogonal only column-wise,
i.e. the identity outcome of the dot product of the projection vectors holds only
for a square R (projecting onto a space of the same dimension as the
original). When projecting onto fewer dimensions, the distances are no longer
preserved perfectly.

Let the D-dimensional data matrix be defined as X_{D×N}, where N is the
number of samples. The data is projected to K dimensions (K ≤ D) with the help
of the projection matrix R, and the projected data matrix is denoted as X̂_{K×N}.
Figure 2.1 presents the algorithm of the Random Projection method.

**The Random Projection Algorithm**

*1.* Generate a random matrix R_{K×D} = {r_{kd}, k = 1...K, d = 1...D}, K ≤ D,

with entries drawn from a normal distribution with zero mean and spherical covariance structure, r_{kd} ∈ N(0, 1).

*2.* Orthogonalize each of the projection vectors (using, for example, the
Gram-Schmidt algorithm; see [56] or most linear algebra books for reference).

*3.* Project the data onto the new set of directions: X̂_{K×N} = R_{K×D} · X_{D×N}

**Figure 2.1** The Random Projection algorithm.
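The three steps of Figure 2.1 can be sketched in NumPy as follows. `np.linalg.qr` stands in here for the Gram-Schmidt orthogonalization of step 2 (both yield orthonormal projection vectors with R·R^{T} = I); the function names and example sizes are illustrative, not part of the original text:

```python
import numpy as np

def random_projection_matrix(K, D, seed=None):
    """Steps 1-2: draw a Gaussian matrix and orthogonalize its vectors.

    QR decomposition replaces classical Gram-Schmidt; the returned K x D
    matrix R has orthonormal rows, so R @ R.T = I_K.
    """
    assert K <= D
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((D, K))   # entries r_kd ~ N(0, 1)
    Q, _ = np.linalg.qr(G)            # columns of Q are orthonormal
    return Q.T                        # K x D projection matrix

def random_project(X, K, seed=None):
    """Step 3: project the D x N data matrix X onto K dimensions."""
    R = random_projection_matrix(K, X.shape[0], seed)
    return R @ X                      # X_hat, shape K x N

# illustrative usage: D = 500 dimensions, N = 200 samples, reduced to K = 50
X = np.random.default_rng(0).standard_normal((500, 200))
Xb = random_project(X, K=50, seed=0)
```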

In order to assess the success of the Random Projection method, separability
is investigated with the following toy example. In the example, 200 data points
are generated from 2 Gaussian distributions with spherical covariance structure
Σ = *I* and different mode locations, where the distance between the data means
||µ_{1} − µ_{2}||_{2} approximately equals 5. One dimension is enough to
separate the clusters. Originally, the data spans 500 dimensions. The clusters are almost
linearly separable, i.e. the cluster overlap is negligible. Principal
Component Analysis, described later in section 2.1.2 and presented here as a
reference, fully preserves the distance between the clusters.
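Under the same assumptions (two spherical Gaussian clusters in 500 dimensions whose means lie 5 apart along a single coordinate), the behaviour of the projected center distance can be sketched as below. Since only the distance between the cluster means is measured, the individual data points need not be generated; the function name and single-seed setup are illustrative, not the original experiment:

```python
import numpy as np

def projected_center_distance(K, D=500, sep=5.0, seed=0):
    """Distance between the two projected cluster means for one trial.

    The means differ only along the first coordinate, so one dimension
    separates the clusters and ||mu1 - mu2||_2 = sep in the original space.
    """
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(D)
    mu2 = np.zeros(D)
    mu2[0] = sep
    Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
    R = Q.T                                   # K x D, orthonormal rows
    return float(np.linalg.norm(R @ mu1 - R @ mu2))

# The projected distance shrinks (on average) as K decreases:
for K in (5, 50, 250, 500):
    print(K, round(projected_center_distance(K), 2))
```

With orthonormal projection vectors the projected distance can never exceed the original one, and for K = D the orthogonal matrix preserves it exactly.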

Figure 2.2 presents the results of the experiment. The left plot shows the distance between the cluster centers. As expected, the distance decays, and in low-dimensional spaces the data can be expected to be poorly separated.

This artifact is more significant for high-dimensional and more complex data
sets, such as sparse vector space representations of textual information.

**Figure 2.2** Illustration of the Random Projection method with an example of
2 Gaussian distributed clusters. Originally the measured distance between
the clusters is close to 6. The experiments are repeated 20 times and the
average performance is shown. The left plot shows the distance ||µ_{1} − µ_{2}||_{2}
between the cluster centers as a function of the dimensionality of the reduced
space; it decays with the projected dimension. Error bars show the maximum and
minimum distance measured over the trials; they increase with the dimension.
The middle plot shows the success rate of the projection (the percentage of
trials for which the distance is larger than 3; for smaller values the clusters
are assumed not to be separable). The success rate also decays with the reduced
dimension, meaning that for large reductions there is a considerable chance
that the projected data will lose separability. Note that one dimension is
enough to linearly separate the data. The time needed to produce the projection
matrix grows exponentially with the size of the matrix R (right plot).

The smaller the original dimensionality, the longer the separation is kept as the
projection matrix is increased. The success rate^{2} (middle plot) decays
significantly with the dimension. The projection is not satisfactory in all cases:
the separability in some *trials*^{3} is lower due to the random origin of the
projection vectors. The number of "unlucky" projections grows significantly with
the reduced dimension. The time needed to produce the projection matrix grows
significantly with the size of the projection matrix R (right plot).

When the projected dimension is large enough, the distortion of the distances
is not significant, but then the time spent creating and orthogonalizing the
projection matrix is considerable. To summarize, for high-dimensional data sets
it is not always possible to use random projection efficiently and satisfactorily^{4}
without significant separability loss, especially when the dimensionality the data is
projected onto is high.

^{2}The success rate is the percentage of projection runs in which the distance is preserved (in this case the clusters are assumed to lose separability when the distance between them is smaller than 3; originally the distance is equal to 5). The cluster distance is calculated from the projected known cluster centers.

^{3}One particular run of the experiment is understood to be a trial.

^{4}The benefits of random projection may vary with the application.

Random Projection is not applied in further experiments, because the dimensionality of the data used in the experiments allowed us to use Principal Component Analysis (section 2.1.2). However, when facing larger databases it could be advisable to use RP as a preprocessing step before applying PCA. In such a case, creating the random projection matrix is less expensive than PCA and, with the projected space large enough, the distances in the data are not distorted significantly.
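A minimal sketch of this RP-then-PCA pipeline, assuming hypothetical sizes (D = 2000 reduced to 200 by RP, then to 10 by PCA) and a plain SVD-based PCA; the function names and dimensions are illustrative, not from the original text:

```python
import numpy as np

def pca_scores(X, K):
    """PCA via SVD of the centered D x N data matrix; returns K x N scores."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :K].T @ Xc

def rp_then_pca(X, K_rp, K_pca, seed=0):
    """Cheap random projection first, then PCA on the much smaller matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[0], K_rp)))
    return pca_scores(Q.T @ X, K_pca)   # Q.T has orthonormal rows

# hypothetical high-dimensional data: D = 2000 features, N = 300 samples
X = np.random.default_rng(1).standard_normal((2000, 300))
Y = rp_then_pca(X, K_rp=200, K_pca=10)  # 2000 -> 200 (RP) -> 10 (PCA)
```

The SVD in the second stage then runs on a 200×300 matrix rather than a 2000×300 one, which is where the saving comes from.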