
**3.6.3 Evaluation of outlier detection methods**

A 2-dimensional toy data set was generated to illustrate and compare the outlier detection techniques presented in sections 3.6.1 and 3.6.2. Results are shown in figure 3.8.


**Figure 3.8** Two Gaussian clusters with outliers form the data set (left panel); the outliers are marked in red. The middle panel corresponds to the outcome of the cumulative distribution (method 1); samples classified as outliers are marked with green triangles, and more samples than expected are detected as outliers. The right panel presents the results of clustering with an outlier cluster (method 2); samples classified as outliers are marked with cyan diamonds, and precisely the points that were originally outliers are predicted as outliers.

Two Gaussian distributed spherical clusters were generated. Then, 9 additional outlier samples arranged on a regular grid were added to the data set. The left plot of figure 3.8 shows the created data: blue points are generated by the Gaussians and red circles are the outliers. The middle plot presents the result of the cumulative distribution method and the right plot shows the outcome of the outlier cluster technique. The cumulative distribution was used with a threshold of 0.5%, but

⁴A value of c = 10 is used in the experiments.

still many low-probability samples from the Gaussian clusters were classified as outliers. In the case of the outlier cluster method the result is accurate.
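The comparison above can be reproduced in a minimal numerical sketch. The cluster parameters, the Monte-Carlo approximation of the density tail, and all variable names below are illustrative assumptions, not the actual experimental setup of the thesis; only the cumulative-distribution idea (method 1: flag samples in the lowest-density tail of the model) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two spherical Gaussian clusters (hypothetical means/variances for illustration).
cluster_a = rng.normal(loc=[-0.5, 0.0], scale=0.1, size=(200, 2))
cluster_b = rng.normal(loc=[0.5, 0.0], scale=0.1, size=(200, 2))
# Nine outliers on a regular 3x3 grid, as in figure 3.8 (left panel).
gx, gy = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
outliers = np.column_stack([gx.ravel(), gy.ravel()])
data = np.vstack([cluster_a, cluster_b, outliers])

def gauss_pdf(x, mu, var):
    """Spherical Gaussian density in 2-D."""
    d2 = np.sum((x - np.asarray(mu)) ** 2, axis=1)
    return np.exp(-d2 / (2 * var)) / (2 * np.pi * var)

# Mixture density using the (known) generating parameters.
p = 0.5 * gauss_pdf(data, [-0.5, 0.0], 0.01) + 0.5 * gauss_pdf(data, [0.5, 0.0], 0.01)

# Method 1 sketch: flag samples in the lowest 0.5% density tail of the model,
# with the tail quantile approximated by a Monte-Carlo sample from the mixture.
mc = np.vstack([rng.normal([-0.5, 0.0], 0.1, (5000, 2)),
                rng.normal([0.5, 0.0], 0.1, (5000, 2))])
p_mc = 0.5 * gauss_pdf(mc, [-0.5, 0.0], 0.01) + 0.5 * gauss_pdf(mc, [0.5, 0.0], 0.01)
threshold = np.quantile(p_mc, 0.005)   # 0.5% tail, as in the experiment above
flagged = p < threshold

print(flagged.sum(), "samples flagged as outliers")
```

Note that all 9 grid outliers fall far below the threshold, but a few low-probability samples from the Gaussian clusters are typically flagged as well, which is exactly the over-detection behavior reported for method 1.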

**3.7 Discussion**

To summarize, mixture models are universal and very flexible approximations of data densities. The significant drawback of the model is its computational complexity. The number of parameters in the model grows quadratically with the feature dimension, which in turn requires a large number of data points for good parameter estimation. For high-dimensional data, training is moreover a time-consuming process, since the inversion of the D × D covariance matrix is needed in each training iteration. Therefore, dimensionality reduction techniques are often used in the preprocessing step.
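The quadratic growth in the feature dimension D can be made concrete with a small count of the free parameters of a K-component full-covariance Gaussian mixture (the function name and the chosen values of K and D below are illustrative):

```python
def gmm_param_count(K, D):
    """Free parameters of a K-component full-covariance Gaussian mixture:
    K*D mean entries, K*D*(D+1)/2 covariance entries (symmetric D x D
    matrices), and K-1 independent mixing proportions."""
    return K * D + K * D * (D + 1) // 2 + (K - 1)

for D in (2, 10, 100):
    print(D, gmm_param_count(5, D))
```

For K = 5 the count rises from 29 parameters at D = 2 to 25754 at D = 100, which illustrates why dimensionality reduction is commonly applied before fitting the mixture.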

In many applications, finding outliers, i.e., rare events, is more interesting than finding the common cases. Outliers indicate unusual behavior or conditions, give information about newly introduced cases, or simply point to errors existing in the system. Therefore, the ability to detect outliers is a useful feature of the model.

The outlier cluster method was found superior to the cumulative distribution. It does not detect any outlier samples in data that in fact contain no outliers, and if the density is estimated accurately, the outcome of this method is also precise.

The outlier detection technique is, however, based on the mixture model and as such inherits all its drawbacks.

### Chapter 4

**Cluster visualization & interpretation**

Visualization and result interpretation, the next step in the KDD process (see figure 1.1), follow the transformation part described in Chapter 2 and the modeling part presented in Chapter 3. This chapter explores the visualization and interpretation possibilities based on the outcome of the Generalizable Gaussian Mixture model. The modeling results for realistic data sets, together with their interpretation, are visualized in Chapter 7.

**4.1 Hierarchical clustering**

Hierarchical methods for unsupervised and supervised data mining provide a multilevel description of data. They are relevant for many applications related to information extraction, retrieval, navigation and organization; for further reading on this topic refer to, e.g., [13, 28]. Many different approaches to hierarchical analysis have been suggested, from divisive to agglomerative clustering, and recent developments include [11, 27, 58, 69, 87, 89]. In this section, the *agglomerative probabilistic clustering* method [52, 53, 80] is investigated as an additional level based on the outcome of the Generalizable Gaussian Mixture model.

Agglomerative hierarchical clustering algorithms start with each object as a separate element. These elements are successively combined based on a *similarity measure* until only one group remains.

Let us start from the outcome of the unsupervised GGM model (described in section 3.3), the probability density components p(x|k) (equation 3.11), where k is the index of the cluster with parameters collected in θ_{k}, and treat it as the first level j = 1 of the newly developed hierarchy. On this level, K clusters are observed, described by the means µ_{k}, covariances Σ_{k} and proportion values P(k), k = 1, 2, . . . , K. The goal is to group together clusters that are similar to each other. This similarity can be understood, for example, as the distance between the clusters in Euclidean space, or as similar characteristics in the probability space. Therefore, various similarity measures can be applied, which often leads to different clustering results. A few of the possible choices for distance measures between clusters are presented further in this chapter.

The simplest technique for combining the clusters is to merge only two clusters at each level of the hierarchy. The two clusters with the smallest distance between them are merged. The procedure is repeated until one cluster containing all the elements is reached at the top level. That is, at level j = 1 there are K clusters and there is one cluster at the final level, so the hierarchy contains 2K − 1 clusters in total.
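This greedy merge-two-at-a-time procedure can be sketched as follows. The similarity measure used here (Euclidean distance between cluster means, with a proportion-weighted merged mean) is just one of the possible choices discussed above; the function name and example values are illustrative assumptions.

```python
import numpy as np

def merge_levels(means, proportions):
    """Greedy agglomeration sketch: at each level, merge the two clusters
    whose means are closest in Euclidean distance. The merged mean is the
    proportion-weighted average of the pair (an assumed similarity choice)."""
    means = [np.asarray(m, dtype=float) for m in means]
    props = list(proportions)
    merges = []
    while len(means) > 1:
        # Find the closest pair of clusters at the current level.
        best = None
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                d = np.linalg.norm(means[i] - means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair: proportions add, mean is proportion-weighted.
        w = props[i] + props[j]
        merged = (props[i] * means[i] + props[j] * means[j]) / w
        merges.append((i, j))
        means = [m for k, m in enumerate(means) if k not in (i, j)] + [merged]
        props = [p for k, p in enumerate(props) if k not in (i, j)] + [w]
    return merges

# K = 3 clusters: the two nearby ones merge first, then the remaining pair.
print(merge_levels([[0, 0], [0.1, 0], [5, 5]], [0.4, 0.4, 0.2]))
```

With K starting clusters the loop performs K − 1 merges, consistent with a hierarchy of 2K − 1 clusters in total.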

Let p_{j}(x|k) be the density of the k'th cluster at level j and P_{j}(k) the corresponding mixing proportion. Then the density model at level j is defined by:

p(x) = Σ_{k=1}^{K−j+1} P_{j}(k) p_{j}(x|k). (4.1)

If clusters r and m at level j are merged into ℓ at level j + 1, then

p_{j+1}(x|ℓ) = ( p_{j}(x|r) P_{j}(r) + p_{j}(x|m) P_{j}(m) ) / ( P_{j}(r) + P_{j}(m) ), (4.2)

and

P_{j+1}(ℓ) = P_{j}(r) + P_{j}(m). (4.3)
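Equations (4.2) and (4.3) translate directly into code. The following is a minimal sketch in which component densities are represented as callables; the function names and the toy 1-D Gaussian example are illustrative assumptions.

```python
import math

def merged_density(p_r, p_m, P_r, P_m):
    """Merge components r and m into cluster l, following eqs. (4.2)-(4.3).
    p_r, p_m are callables x -> density; returns (p_l, P_l)."""
    P_l = P_r + P_m                                  # eq. (4.3)
    def p_l(x):                                      # eq. (4.2)
        return (p_r(x) * P_r + p_m(x) * P_m) / P_l
    return p_l, P_l

def gauss(mu, var):
    """1-D Gaussian density (toy components for illustration)."""
    return lambda x: math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_l, P_l = merged_density(gauss(-1.0, 1.0), gauss(1.0, 1.0), 0.3, 0.2)
print(P_l)   # mixing proportions simply add: 0.3 + 0.2 = 0.5
```

The merged density p_{j+1}(x|ℓ) is the proportion-weighted average of its two parents, so it integrates to one whenever the parents do.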

The class posterior

P_{j}(k|x_{n}) = p_{j}(x_{n}|k) P_{j}(k) / p(x_{n}) (4.4)

also propagates in the hierarchy, so that

P_{j+1}(ℓ|x_{n}) = P_{j}(r|x_{n}) + P_{j}(m|x_{n}). (4.5)
Once the hierarchy is created, it is possible to determine the class/level membership of new samples. Thus, x_{n} is assigned to cluster k at level j if

P_{j}(k|x_{n}) = p_{j}(x_{n}|k) P_{j}(k) / p(x_{n}) > ρ, (4.6)

where ρ is a threshold value, typically close to 1, which ensures high confidence in the assignment. A data sample "climbing up" the tree structure gains confidence with each additional level. That means that if the sample cannot be assigned unambiguously to any of the clusters at level j, it will surely be possible at one of the higher levels of the hierarchy.
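The thresholded assignment rule (4.6), together with the posterior propagation (4.5), can be sketched as follows; the function name, the choice ρ = 0.9, and the example posterior values are illustrative assumptions.

```python
def assign_with_threshold(posteriors, rho=0.9):
    """Eq. (4.6) sketch: assign the sample to cluster k at this level only
    if its posterior P_j(k|x_n) exceeds rho; otherwise return None, i.e.
    defer the decision to a higher level of the hierarchy."""
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best if posteriors[best] > rho else None

# Level j: the sample is ambiguous, no cluster passes the threshold.
print(assign_with_threshold([0.55, 0.40, 0.05]))   # -> None, climb the tree
# Level j+1, after merging clusters 0 and 1: posteriors add (eq. 4.5),
# 0.55 + 0.40 = 0.95 > rho, so the merged cluster wins the assignment.
print(assign_with_threshold([0.95, 0.05]))         # -> 0
```

Because merged posteriors add up at each level, the winning posterior is non-decreasing up the tree, which is why an ambiguous sample is guaranteed to be assigned at some higher level (at the latest at the root, where the posterior is 1).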