3.5 Unsupervised/supervised Generalizable Gaussian Mixture model 34
3.6.3 Evaluation of outlier detection methods
A 2-dimensional toy data set was generated for illustration and comparison of the outlier detection techniques presented in sections 3.6.1 and 3.6.2. Results are shown on figure 3.8.
−1 −0.5 0 0.5 1
Figure 3.8 Two Gaussian clusters with the outliers are used as the data set (left panel). The outliers are marked in red. Method 1 (middle plot) is corre-sponding to the outcome of the cumulative distribution. The samples classi-fied as outliers are marked with the green triangle. Here, more samples than expected is detected as outliers. Right panel is presenting the results of the clustering with the outlier cluster (method 2) . The examples that were classi-fied as outliers are marked with the cyan diamond. Precisely the same points that were originally outliers are predicted as outliers.
Two Gaussian distributed spherical clusters were generated. Then, additional 9 outlier samples formed in regular grid were included in the data set. Left plot of figure 3.8 shows the created data. Blue points are generated by Gaussians and red circles are the outliers. In the middle plot the result of the cumulative dis-tribution is presented and in right plot the outcome of outlier cluster technique is shown. The cumulative distribution was used with the threshold of0.5%but
4Value ofc= 10is used in the experiments
still many low probable samples from the Gaussian clusters were classified as outliers. In case of outlier cluster method the result is accurate.
3.7 Discussion
Summarizing, the mixture models are universal and very flexible approxima-tions for the data densities. The significant drawback of that model is the com-putational complexity. The number of parameters in the model grows quadrat-ically with the data feature dimension what next requires a large number of data points needed for good parameter estimation. For high dimensional data it is moreover a time consuming process, since the inversion of theD×D co-variance matrix is needed in each training iteration. Therefore, dimensionality reduction techniques are often used in the preprocessing step.
In many applications, finding outliers, i.e. rare events, is more interesting than finding the common cases. They indicate either some unusual behavior or con-ditions or give information about new introduced cases or simply about errors existing in the system. Therefore, ability of detecting outliers is an useful fea-ture for the model.
The outlier cluster was found superior over the cumulative distribution. It does not detect any outlier samples in the data where in fact there is no outliers. If the density is estimated accurately then the outcome of this method is also precise.
The outlier detection technique is however, based on the mixture model and as such inherit all its drawbacks.
CH A P T E R
4
Cluster visualization &
interpretation
Visualization and the result interpretation, the next step in the KDD process (see figure 1.1), follows the transformation part that is described in Chapter 2 and the modeling part presented in Chapter 3. The following chapter discovers the visualization and interpretation possibilities based on the outcome of the Generalizable Gaussian Mixture model. The modeling results of the realistic data sets together with theirs interpretation are visualized in the Chapter 7.
4.1 Hierarchical clustering
Hierarchical methods for unsupervised and supervised data mining provide a multilevel description of data. It is relevant for many applications related to in-formation extraction, retrieval, navigation and organization, for further reading in this topic refer to, e.g. [13, 28]. Many different approaches have been sug-gested to hierarchical analysis from divisive to agglomerative clustering and recent developments include [11, 27, 58, 69, 87, 89]. In this section, the ag-glomerative probabilistic clustering method is investigated [52, 53, 80] as an
additional level based on the outcome of the Generalizable Gaussian Mixture model.
The agglomerative hierarchical clustering algorithms start with each object as a separate element. These elements are successively combined based ona simi-larity measureuntil only one group remains.
Let us start from the outcome of the unsupervised GGM model (described in section 3.3), the probability density componentsp(x|k)(equation 3.11), where k is an index of the cluster with parameters collected in θk, and treat it as first levelj = 1of the new developed hierarchy. On this level,K clusters are observed that are described by means µk, covariancesΣk and the proportion values P(k), k = 1,2, . . . , K. The goal is to group together the clusters that are similar to each other. This similarity can be understood, for example, as the distance between the clusters in the Euclidean space, or the similar character-istics in the probability space. Therefore, various similarity measures can be applied what often leads to different clustering results. A few of the possible choices for distance measures between the clusters are presented further in this chapter.
The simplest technique for combining the clusters is to merge only two clusters at each level of the hierarchy. Two clusters are merged, when the distance between them is the smallest. The procedure is repeated until one cluster at the top level is reached containing all the elements. That is, at levelj= 1there are K clusters and there is one cluster at the final level,j= 2K−1.
Letpj(x|k) be the density for thek’th cluster at levelj andPj(k)the corre-sponding mixing proportion. Then the density model at level j is defined by:
p(x) =
K−j+1X
k=1
Pj(k)pj(x|k). (4.1) If clusterrandmat leveljare merged into`at levelj+ 1then
pj+1(x|`) = pj(x|r)·Pj(r) +pj(x|m)·Pj(m)
Pj(r) +Pj(m) , (4.2) and
Pj+1(`) =Pj(r) +Pj(m). (4.3)
4.1 Hierarchical clustering 45 The class posterior
Pj(k|xn) = pj(x|k)Pj(k)
p(x) (4.4)
propagates also in the hierarchy so that
Pj+1(`|xn) =Pj(r|xn) +Pj(m|xn). (4.5) Once, the hierarchy is created it is possible to determine the class/level mem-bership for new samples. Thus,xnis assigned to clusterkat leveljif
Pj(k|xn) = p(xn|k)P(k)
p(xn) > ρ, (4.6) whereρis a threshold value, typically close to 1. That ensures high confidence in the assignment. The data sample “climbing up” the tree structure gains con-fidence with each additional level. That means, that if the sample at leveljcan not be assigned unambiguously to any of the clusters it will surely be possible at one of the higher levels of the hierarchy.