2.1 Dimensionality reduction methods
2.1.1 Random Projection
A lot of research studies have focused on an investigation of the Random Pro-jection method. Many of the details and experiments not included in this work, can be found in [4, 9, 19].
The method is simple and it does not require the presence of the data whilst creating the projection matrix. Many find this a significant advantage. Random projection is based on the set of vectors which are orthogonalized from the nor-mal distributed random matrix. The orthogonal projection vectors1(R·RT = I) preserves the distances among the data points. In this way it is possible to
1Two vectors are said to be orthogonal if the cosine of its angle is equal 0, i.e.u·vT = 0, wherev·vT= 1andu·uT= 1.
2.1 Dimensionality reduction methods 11 project the data to the smaller dimensional space while preserving fair sepa-ration of the clusters. Note however, thatR is orthogonal only column-wise, i.e. the identity outcome of the dot product of the projection vectors holds only for quadraticR(projecting onto the space of the same dimension as the origi-nal). In the case of projecting onto smaller dimensions, the distance is no longer preserved in an ideal way.
Let theD-dimensional data matrix be defined asXD×N, whereN is the num-ber of samples. Data is projected with the help of the projection matrix Rto K dimensions (K ≤ D) and the projected data matrix is denoted as XbK×N. Figure 2.1 presents the algorithm of the Random Projection method.
The Random Projection Algorithm
1.Generate random matrix RK×D = {rkd, k = 1. . . K, d = 1. . . D}
drawn form normal distribution of zero mean and spherical covariance structurerkd ∈ N(0,1). K≤D.
2.Orthogonalize each of the projection vectors (use for example Gram-Schmidt algorithm (see [56] or most linear algebra books for reference)).
3.Project data on the new set of directionsXbK×N =RK×D ·XD×N
Figure 2.1 The Random Projection algorithm.
In order to check the success of the random projection method, the separability is investigated with the following toy example. In the example, 200 data points are generated from 2 Gaussian distributions with spherical covariance structure Σ = Iand different mode locations, where distance between the data means
||µ1 −µ2||2 approximately equals 5. Only one dimension is enough to sepa-rate clusters. Originally, data spans 500 dimensions. The clusters are almost linearly separable, i.e. the cluster overlap is negligible. Also by using Princi-pal Component Analysis, described later in section 2.1.2 and presented here as reference, the distance between the clusters is fully preserved.
Figure 2.2 presents the results of the experiment. Left plot shows the distance between the cluster centers. As was suspected, the distance decays, and in the small dimensional space it can be expected that the data is poorly separated.
This artifact is more significant for large dimensional and more complex data sets, such as sparse vector space representation of textual information. The
0 100 200 300 400 500
Dimension in the reduced space distance ||µ1 − µ2||2
Dimension in the reduced space
Success rate in [%]
0 100 200 300 400 500
Dimension in the reduced space
Time [s]
Figure 2.2 Illustration of the Random Projection method with example of 2 Gaussian distributed clusters. Originally the measured distance between the clusters is close to 6. The experiments are repeated 20 times and the average performance is shown. Left plot describes the distance||µ1−µ2||2 between the cluster centers as a function of dimensionality. It decays with the projected dimension. Error-bars show maximum and minimum measured distance in the trials. They increase with the dimension. Middle plot shows success rate of the projection (the trials for which the distance is larger than 3, for smaller numbers clusters are assumed not to be separable). Also the success rate decays with the reduced dimension. That means that for the large reductions there is a considerable chance that the projected data will lose separability. Note, that one dimension is enough to linearly separate the data. The time needed to produce the projection matrix grows exponentially with the size of the matrixR(right figure).
smaller the original dimensionality, the longer the separation is kept, when the projection matrix is increased. The success rate2(middle plot) is decaying sig-nificantly with the dimension. The projection is not in all cases satisfactory, the separability in some trials3 is lower due to the random origin of the pro-jection vectors. The number of “unlucky” propro-jections grows significantly with reduced dimension. The time needed to produce the projection matrix grows significantly with the size of the projection matrixR(right plot).
When the projected dimension is large enough, the distortion of the distance is not significant but then the time spent for creating and diagonalizing projec-tion matrix is considerable. To summarize, for large dimensional data sets it is not necessarily efficient and satisfying enough4to use random projection with-out significant separability loss, especially when the dimensionality the data is
2Success rate is the percentage of the number of projection runs in which the distance is preserved (in this case the clusters are assumed to lose separability when the distance between them is smaller than 3 (originally the distance is equal to 5)). The cluster distance is calculated from projected known cluster centers.
3One particular run of the experiment is understood to be a trial.
4The benefits of the random projection may vary accordingly to the application.
2.1 Dimensionality reduction methods 13
projected onto is high.
Random Projection is not applied in further experiments due to the fact that the dimensionality of the data used in the experiments allowed us to use Principal Component Analysis (section 2.1.2). However, when facing larger databases it could be advisable to use RP as the preprocessing step before applying PCA. In such a case, creating the random projection matrix is less expensive than PCA and with the projected space large enough, the distances in the data is not be distorted significantly.