**6.1.2 Examples of kernels**

For continuous D-dimensional data a common choice of kernel, for both kernel PCA and density estimation, is the isotropic Gaussian kernel of the form

K_{h}(x, x_{n}) = (2π)^{-D/2} h^{-D} exp( -||x - x_{n}||^{2} / (2h^{2}) ).

Of course, many other forms of kernel can be employed, though they may not themselves satisfy the requirements of being a density. Consider, for example, the linear cosine inner product for discrete data such as vector space representations of text, where the kernel is defined as

K(x, x_{n}) = x^{T}x_{n} / (||x|| · ||x_{n}||). (6.19)

The decomposition of such a cosine-based matrix directly yields the required probabilities.
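As an illustration of the two kernels discussed here, the following minimal sketch computes both kernel matrices with NumPy. The function names are hypothetical, and the Gaussian exponent is assumed to take the standard isotropic form −||x − x_{n}||²/(2h²); the sketch is not taken from the original implementation.

```python
import numpy as np


def gaussian_kernel(X, h):
    """Isotropic Gaussian kernel matrix for the rows of X (shape N x D).

    Assumes the standard form (2*pi)^(-D/2) * h^(-D) * exp(-||x - x_n||^2 / (2 h^2)).
    """
    _, D = X.shape
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances; clip tiny negatives from round-off.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)
    norm_const = (2.0 * np.pi) ** (-D / 2.0) * h ** (-D)
    return norm_const * np.exp(-sq_dists / (2.0 * h ** 2))


def cosine_kernel(X):
    """Cosine similarity matrix x^T x_n / (||x|| ||x_n||) for the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms = np.where(norms == 0.0, 1.0, norms)  # leave all-zero rows at zero similarity
    X_unit = X / norms
    return X_unit @ X_unit.T
```

For term-count vectors the cosine kernel is non-negative, which is what allows the matrix to be treated as unnormalized probabilities.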

This interpretation provides a means of spectral clustering which, in the case of
continuous data, is linked directly to non-parametric density estimation and
extends easily to discrete data such as text. We should also note that
the aggregate Markov perspective allows us to take the random walk viewpoint
elaborated in [59], and so a K-connected graph may be employed in defining
the kernel similarity K(x, x_{n}). As with the smoothing parameter and the
number of clusters, the number of connected points in the graph can also be
estimated from the generalization error.
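One way to read the K-connected graph is as a K-nearest-neighbour graph imposed on the kernel matrix. The sketch below follows that reading, which is an assumption and not a construction confirmed by the text; `knn_sparsify` is a hypothetical helper name.

```python
import numpy as np


def knn_sparsify(K, k):
    """Keep only similarities between each point and its k most similar others.

    Assumption: the 'K-connected graph' is a symmetrized k-nearest-neighbour
    graph. Self-similarities are dropped (zero diagonal). Requires k <= N - 1.
    """
    N = K.shape[0]
    S = K.astype(float)              # astype returns a copy, K is left untouched
    np.fill_diagonal(S, -np.inf)     # exclude each point from its own neighbour list
    neighbours = np.argsort(-S, axis=1)[:, :k]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), neighbours.ravel()] = True
    mask |= mask.T                   # keep an edge if either endpoint selected it
    return np.where(mask, K, 0.0)
```

The number of neighbours `k` then plays the same role as the kernel width: it is a free parameter that could be chosen by the generalization error.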

**6.2 Discussion**

For the case of a Gaussian kernel, this interpretation of kernel-based
clustering enables the kernel width parameter to be estimated by means of the test
predictive likelihood, and as such cross-validation can be employed. In addition, the
problem of choosing the number of possible classes, a problem common to all
non-parametric clustering methods such as spectral clustering [59, 62], can now
be addressed. This overcomes the lack of an objective means of selecting the
smoothing parameter in most other forms of spectral clustering models. The
proposed method first defines a non-parametric density estimate, and then the
inherent class structure is identified by the basis decomposition of the
normalized kernel in the form of the class-conditional and posterior probabilities
P(x_{n}|c) and P(c|x_{n}). Since the projection coefficients are provided, a new,
previously unobserved point can be allocated in the structure: projecting the
normalized kernel function of the new point onto the class-conditional basis
functions yields the posterior probability of class membership for that point.
This cannot be achieved by partitioning-based methods such as those
found in [59] and [62].
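The idea of choosing the kernel width by test predictive likelihood can be sketched for the density estimate alone. The code below is a simplified illustration under stated assumptions: it scores a plain Gaussian kernel density estimate by its held-out log-likelihood in a k-fold cross-validation, and it does not include the class decomposition or the choice of the number of clusters. All function names and the fold scheme are hypothetical.

```python
import numpy as np


def heldout_log_likelihood(X_train, X_test, h):
    """Mean log-density of X_test under a Gaussian KDE built on X_train."""
    N, D = X_train.shape
    diff = X_test[:, None, :] - X_train[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=2)                      # shape (M, N)
    log_k = (-sq_dists / (2.0 * h ** 2)
             - 0.5 * D * np.log(2.0 * np.pi) - D * np.log(h))
    # log of the average kernel value, computed stably via log-sum-exp
    m = log_k.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1)) - np.log(N)
    return log_p.mean()


def select_width(X, widths, n_folds=5, seed=0):
    """Return the width with the highest cross-validated log-likelihood."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for h in widths:
        fold_scores = []
        for i in range(n_folds):
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != i])
            fold_scores.append(
                heldout_log_likelihood(X[train_idx], X[folds[i]], h))
        scores.append(float(np.mean(fold_scores)))
    return widths[int(np.argmax(scores))], scores
```

A width that is too small gives held-out points almost no density, while a width that is too large flattens the estimate, so the held-out likelihood penalizes both extremes.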

A number of points arise from the presented exposition. Firstly, in the case of continuous data, the quality of the clustering is directly related to the quality of the density estimate. Once a density has been estimated, the proposed clustering method attempts to find modes in the density. If the density is poorly estimated, perhaps because the window smoothing parameter is too large, then the class structure may be over-smoothed and modes may be lost. In other words, essential class structure may not be identified by the clustering. The same argument applies to a smoothing parameter that is too small, which causes non-existent structure to be discovered, and to the connectedness of the underlying graph connecting the points under consideration.

**Chapter 7. Segmentation of textual and medical databases**

In the previous chapters some of the results were presented, and the discussed techniques were compared on simple data sets. Carefully selected auxiliary data sets were used in order to obtain good illustrations. In some cases this made it possible to decide which technique to apply in further investigations. This chapter focuses on applying the previously described models to observational data.

**7.1 Data description**

In this section the applied data sets are presented together with a short description of the accompanying preprocessing procedure. A detailed report of the preprocessing steps is given in sections 7.2.1 and 7.3.1. Four data sets were used in the further investigations.

**Email data:** is a collection^{1} of private emails categorized into three groups, namely *conference*, *job*, and *spam*. The documents are hand-labeled.

Since the emails were collected by a university employee, the categories are university related. The collection was preprocessed (details of the procedure are presented in section 7.2.1), so that the final data matrix consists of 1405 documents, each described by 7798 terms. The clustering of the data was performed in the latent space found within the Latent Semantic Indexing framework [20]. This data set is used due to its simplicity and fairly good separation.

**Newsgroups:** is a collection of 4 selected newsgroups^{2}. The data has been used in many publications, starting with that of the data collector, Ken Lang [51], and also, for example, in [6, 9, 68]. The original collection consists of 20 different newsgroup categories, each containing around 1000 records.

In the experiments only four newsgroups were selected, namely *computer graphics*, *motorcycles*, *baseball* and *Christian religion*, each of 200 records. The preprocessing steps are identical to those performed on the Email collection, which are described in detail in section 7.2.1. In preprocessing, 2 documents were removed^{3}, resulting in a final number of 798 records and 1367 terms. Labels are also available for this data set.

**Sun-exposure study:** The data was collected by the Department of Dermatology, Bispebjerg Hospital, University of Copenhagen, Denmark. It concerns a cancer risk study. The data set used in the experiments covers one year of the study, or in fact 138 days, collected during the spring, summer, and autumn period. The survey was performed on a group of 196 volunteers, resulting in a total of 24212 collected records. The experiments concern only this one-year fraction of the diary database; the full study, however, was carried out as a 3-year survey. The extended data set consists of diary records, detailed UV measurements, a questionnaire on past sun habits, and measurements of skin type. This data is available for future study.

For the purposes of the survey, a special device was constructed for measuring the sun exposure^{4} of the subjects. A picture of the device is shown in figure 7.1. Additionally, subjects were asked to fill out daily diary records about their sun behavior by answering the 10 questions listed in table 7.1. Since it is common knowledge that high sun exposure increases cancer risk, it was expected that a link between the two could be established.

**Figure 7.1** The device measuring the sun radiation to which the skin is exposed.

| No. | Question | Answer |
|---|---|---|
| 1 | Using measuring device | yes/no |
| 2 | Holiday | yes/no |
| 3 | Abroad | yes/no |
| 4 | Sun Bathing | yes/yes-solarium/no |
| 5 | Naked Shoulders | yes/no |
| 6 | On the Beach/Water | yes/no |
| 7 | Using Sun Screen | yes/no |
| 8 | Sun Screen Factor Number | yes-number (26 values)/no |
| 9 | Sunburned | no/red/hurts/blisters |
| 10 | Size of Sunburn Area | no/little/medium/large |

**Table 7.1** Questions concerning the daily sun habits in the sun-exposure study.

In the questionnaire, question number seven contains information that is redundant for the investigation, since it is already included in question eight, and it was therefore removed from the data set. Question number one, which is in fact an indicator of missingness in the data, was also not included in the cluster structure investigation. As a result, the records are described by 8 questions, that is, by 8-dimensional categorical vectors. As expected in a survey, a lot of data is missing. Techniques for imputing the missing values in these records are therefore investigated. For the other experiments, concerning data segmentation, the records with missing values were removed. In the diary collection there are 1073 incomplete records, 4171 records are missing UVA/UVB measurements, and when these two data sets are combined 5041 records have missing values. Therefore, only 19171 complete records from 187 subjects were used in the investigation.

^{1} The Email database can be obtained at the following location: http://isp.imm.dtu.dk/staff/anna

^{2} The full data collection consisting of 20 categories is available at e.g., http://kdd.ics.uci.edu

^{3} See section 7.2.1 for details.

^{4} The UVA and UVB radiation dose was measured every 10 minutes. In the performed experiments only the daily dose was available.

**Dermatological collection:** is a collection^{5} of erythemato-squamous diseases. Six classes (the diagnoses of erythemato-squamous diseases) are observed: *psoriasis*, *seborrheic dermatitis*, *lichen planus*, *pityriasis rosea*, *chronic dermatitis* and *pityriasis rubra pilaris*. This database contains 358 instances described by 34 attributes, 33 of which are nominal (values 0, 1, 2, 3) and one of which (age) is ordinal. The original data set contains a few missing values, which were removed for this investigation. The data set was previously used in [35]. It is segmented by the aggregated Markov model, which is described in Chapter 6.
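The record counts quoted for the sun-exposure study can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 5041 figure counts records incomplete in the diary, in the UVA/UVB measurements, or in both; under that assumption the overlap follows by inclusion-exclusion. The variable names are illustrative only.

```python
# Counts quoted in the sun-exposure study description.
total_records = 24212
incomplete_diary = 1073
missing_uv = 4171
incomplete_either = 5041   # assumed to be the union of the two groups above

# Records that remain after removing every record with a missing value.
complete_records = total_records - incomplete_either

# Inclusion-exclusion: records incomplete in both the diary and the UV data.
incomplete_both = incomplete_diary + missing_uv - incomplete_either

print(complete_records, incomplete_both)
```

The first value reproduces the 19171 complete records stated in the text. The second value is a derived quantity, valid only if the union assumption holds.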