
Examples of kernels

In document IMM Datamininginmedicaldatabases (pages 83-90)


6.1.2 Examples of kernels

For continuous D-dimensional data, a common choice of kernel, for both kernel PCA and density estimation, is the isotropic Gaussian kernel of the form

$$K_h(x, x_n) = (2\pi)^{-D/2} h^{-D} \exp\left(-\frac{\|x - x_n\|^2}{2h^2}\right).$$

Of course, many other forms of kernel can be employed, though they may not themselves satisfy the requirements of being a density. Consider, for example, the linear cosine inner product for discrete data such as vector space representations of text, where the kernel is defined as

$$K(x, x_n) = \frac{x^T x_n}{\|x\| \cdot \|x_n\|}. \tag{6.19}$$

The decomposition of such a cosine-based matrix directly yields the required probabilities.
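As a minimal sketch, the two kernels discussed here could be written in plain Python as follows. The function names `gaussian_kernel` and `cosine_kernel` are illustrative and do not come from the thesis, and the Gaussian kernel assumes the standard normalisation with kernel width `h` so that it integrates to one.

```python
import math

def gaussian_kernel(x, xn, h):
    """Isotropic Gaussian kernel with width h (assumed standard normalisation)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xn))
    norm = (2.0 * math.pi) ** (-d / 2.0) * h ** (-d)
    return norm * math.exp(-sq_dist / (2.0 * h ** 2))

def cosine_kernel(x, xn):
    """Cosine inner product of two non-zero vectors, as in equation (6.19)."""
    dot = sum(a * b for a, b in zip(x, xn))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_xn = math.sqrt(sum(b * b for b in xn))
    return dot / (norm_x * norm_xn)

# Toy usage: two 2-dimensional points and two term-count vectors.
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], h=1.0))
print(cosine_kernel([1.0, 0.0, 2.0], [2.0, 1.0, 0.0]))
```

Note that the cosine kernel is bounded but not normalised as a density, which is the caveat raised in the text.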

This interpretation provides a means of spectral clustering which, in the case of continuous data, is linked directly to non-parametric density estimation and extends easily to discrete data such as text. We should also note that the aggregate Markov perspective allows us to take the random walk viewpoint elaborated in [59], so a K-connected graph may be employed in defining the kernel similarity $K_K(x, x_n)$. As with the smoothing parameter and the number of clusters, the number of connected points in the graph can also be estimated from the generalization error.
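One possible reading of the K-connected graph construction is sketched below. It is an assumption rather than the thesis's definition: Gaussian affinities are kept only between points that are among each other's K nearest neighbours (symmetrised), and the rows are then normalised to give random-walk transition probabilities. The helper names `knn_kernel_matrix` and `random_walk_matrix` are hypothetical.

```python
import math

def knn_kernel_matrix(points, k, h):
    """Gaussian affinities kept only along K-nearest-neighbour edges (symmetrised)."""
    n = len(points)
    sq = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
           for j in range(n)] for i in range(n)]
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        # indices of the k nearest neighbours of point i, excluding i itself
        order = sorted((j for j in range(n) if j != i), key=lambda j: sq[i][j])
        for j in order[:k]:
            keep[i][j] = keep[j][i] = True  # keep the edge in both directions
    return [[math.exp(-sq[i][j] / (2.0 * h ** 2)) if keep[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def random_walk_matrix(kernel):
    """Row-normalise a kernel matrix into random-walk transition probabilities."""
    return [[v / sum(row) for v in row] for row in kernel]

# Toy usage: two well-separated pairs of points on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
transitions = random_walk_matrix(knn_kernel_matrix(points, k=1, h=1.0))
```

With k=1 the two pairs are disconnected, so a random walk started in one pair never reaches the other. The number of neighbours therefore acts as a further smoothing choice alongside the kernel width.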

6.2 Discussion

For the case of a Gaussian kernel, this interpretation of kernel-based clustering enables the kernel width parameter to be estimated by means of the test predictive likelihood, and as such cross-validation can be employed. In addition, the problem of choosing the number of possible classes, a problem common to all non-parametric clustering methods such as spectral clustering [59, 62], can now be addressed. This overcomes the lack of an objective means of selecting the smoothing parameter in most other forms of spectral clustering models. The proposed method first defines a non-parametric density estimate, and then the inherent class structure is identified by the basis decomposition of the normalized kernel in the form of class conditional posterior probabilities $P(x_n|c)$ and $P(c|x_n)$. Since the projection coefficients are provided, a new, previously unobserved point can be allocated in the structure. Thus, projection of the normalized kernel function of a new point onto the class-conditional basis functions yields the posterior probability of class membership for the new point. This cannot be achieved by partitioning-based methods such as those found in [59] and [62].
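The idea of selecting the kernel width by test predictive likelihood can be illustrated with a small sketch. As a stand-in for the thesis's model it scores a plain Gaussian kernel density estimate on held-out points. The function names, the fold scheme, and the candidate grid are all assumptions made for illustration.

```python
import math

def kde_log_likelihood(train, test, h):
    """Mean log-density of test points under a Gaussian kernel density estimate."""
    d = len(train[0])
    log_norm = -0.5 * d * math.log(2.0 * math.pi) - d * math.log(h)
    total = 0.0
    for x in test:
        # log-sum-exp over training points for numerical stability
        exps = [-sum((a - b) ** 2 for a, b in zip(x, xn)) / (2.0 * h ** 2)
                for xn in train]
        m = max(exps)
        log_sum = m + math.log(sum(math.exp(e - m) for e in exps))
        total += log_norm + log_sum - math.log(len(train))
    return total / len(test)

def select_bandwidth(data, candidates, n_folds=5):
    """Return the candidate h with the highest cross-validated log-likelihood."""
    scores = {}
    for h in candidates:
        fold_scores = []
        for f in range(n_folds):
            test = data[f::n_folds]
            train = [x for i, x in enumerate(data) if i % n_folds != f]
            fold_scores.append(kde_log_likelihood(train, test, h))
        scores[h] = sum(fold_scores) / n_folds
    return max(scores, key=scores.get), scores

# Toy usage: two one-dimensional groups centred at 0 and 5.
data = [[i * 0.1] for i in range(-10, 11)] + [[5 + i * 0.1] for i in range(-10, 11)]
best_h, scores = select_bandwidth(data, [0.001, 0.5, 100.0])
```

On this toy sample the intermediate width scores best. A very small width assigns almost no density to held-out points, and a very large one spreads the density far beyond the data.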

A number of points arise from the presented exposition. Firstly, in the case of continuous data, the quality of the clustering is directly related to the quality of the density estimate. Once a density has been estimated, the proposed clustering method attempts to find modes in the density. If the density is poorly estimated, perhaps due to a window smoothing parameter which is too large, then the class structure may be over-smoothed and modes may be lost. In other words, essential class structure may not be identified by the clustering. The same argument applies to a smoothing parameter which is too small, which causes non-existent structure to be discovered, and to the connectedness of the underlying graph connecting the points under consideration.
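The effect of the smoothing parameter on the modes can be made concrete with a small one-dimensional sketch. It counts the local maxima of a Gaussian kernel density estimate on a grid. The sample, the grid size, and the function names are illustrative assumptions rather than anything taken from the thesis.

```python
import math

def kde(x, data, h):
    """One-dimensional Gaussian kernel density estimate evaluated at x."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-(x - xn) ** 2 / (2.0 * h ** 2)) for xn in data)

def count_modes(data, h, grid_size=2000):
    """Count local maxima of the density estimate on a regular grid."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    step = (hi - lo) / (grid_size - 1)
    dens = [kde(lo + i * step, data, h) for i in range(grid_size)]
    # strict rise on the left, non-strict fall on the right, so a flat
    # two-point peak is counted once
    return sum(1 for i in range(1, grid_size - 1)
               if dens[i - 1] < dens[i] >= dens[i + 1])

# Toy sample with two groups, around 0 and around 5.
sample = [-0.5, -0.2, 0.0, 0.1, 0.4, 4.6, 4.9, 5.0, 5.2, 5.5, 5.6]
for h in (0.02, 0.5, 5.0):
    print(h, count_modes(sample, h))
```

For this sample a moderate width recovers the two groups. A very large width merges them into a single mode, and a very small width turns individual points into spurious modes.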

Chapter 7

Segmentation of textual and medical databases

In the previous chapters some results were presented, and the discussed techniques were compared on simple data sets. Carefully selected auxiliary data sets were used in order to obtain good illustrations. In some cases this made it possible to decide which technique to apply in further investigations. This chapter focuses on applying the previously described models to observational data.

7.1 Data description

In this section the applied data sets are presented together with a short description of the accompanying preprocessing procedure. A detailed report of the preprocessing steps is given in sections 7.2.1 and 7.3.1. Four data sets were used in the further investigations.

Email data: a collection¹ of private emails categorized into three groups, namely conference, job, and spam. The documents are hand-labeled. Since the emails were collected by a university employee, the categories are university related. The collection was preprocessed (details of the procedure are presented in section 7.2.1), so the final data matrix consists of 1405 documents, each described by 7798 terms. The clustering of the data was performed in the latent space found within the Latent Semantic Indexing framework [20]. This data set is used due to its simplicity and fairly good separation.

Newsgroups: a collection of 4 selected newsgroups². The data has been used in many publications, starting with the data collector Ken Lang in [51], and also for example in [6, 9, 68]. The original collection consists of 20 different newsgroup categories, each containing around 1000 records. In the experiments only four newsgroups were selected, namely computer graphics, motorcycles, baseball, and Christian religion, each with 200 records. The preprocessing steps are identical to those performed on the Email collection, which are described in detail in section 7.2.1. In preprocessing, 2 documents were removed³, resulting in a final number of 798 records and 1367 terms. Labels are also available for this data set.

Sun-exposure study: The data was collected by the Department of Dermatology, Bispebjerg Hospital, University of Copenhagen, Denmark, as part of a cancer risk study. The data set used in the experiments covers one year of the study, in fact 138 days, collected during the spring, summer, and autumn periods. The survey was performed on a group of 196 volunteers, resulting in a total of 24212 collected records. The experiments concern only this one-year fraction of the diary database, although the full study ran over a 3-year survey. The extended data set consists of diary records, detailed UV measurements, a questionnaire on past sun habits, and measurements of skin type. This data is available for future study.

For the purpose of the survey, a special device was constructed for measuring the sun exposure⁴ of the subjects. A picture of the device is shown in Figure 7.1. Additionally, subjects were asked to fill out daily diary records about their sun behavior by answering the 10 questions listed in Table 7.1. Since it is common knowledge that high sun exposure increases cancer risk, it was expected that a link between the two could be established.

¹ The Email database can be obtained at the following location: http://isp.imm.dtu.dk/staff/anna

² The full data collection consisting of 20 categories is available at e.g. http://kdd.ics.uci.edu

³ See section 7.2.1 for details.

⁴ UVA and UVB radiation dose was measured every 10 minutes. In the performed experiments only the daily dose was available.

Figure 7.1 The device measuring the sun radiation to which the skin is exposed.

No.  Question                   Answer
1.   Using measuring device     yes/no
2.   Holiday                    yes/no
3.   Abroad                     yes/no
4.   Sun Bathing                yes/yes-solarium/no
5.   Naked Shoulders            yes/no
6.   On the Beach/Water         yes/no
7.   Using Sun Screen           yes/no
8.   Sun Screen Factor Number   yes-number (26 values)/no
9.   Sunburned                  no/red/hurts/blisters
10.  Size of Sunburn Area       no/little/medium/large

Table 7.1 Questions concerning the daily sun habits in the sun-exposure study.

In the questionnaire, question number seven contains information that is redundant for the investigation, since it is already included in question eight, and it was therefore removed from the data set. Question number one, which is in fact an indicator of missingness in the data, was also not included in the cluster structure investigation. As a result, the records are described by 8 questions, i.e., 8-dimensional categorical vectors. As expected in a survey, a lot of data is missing. Therefore, techniques for imputing missing values in these records are investigated. For the other experiments, concerning data segmentation, records with missing values were removed. In the diary collection there are 1073 incomplete records, 4171 records are missing UVA/UVB measurements, and when the two data sets are combined 5041 records have missing values. Therefore, only 19171 complete records from 187 subjects were used in the investigation.
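The record counts quoted for the sun-exposure study can be checked with a few lines of arithmetic. The variable names are illustrative. The overlap figure is derived by inclusion-exclusion under the assumption that the 5041 records are exactly those missing in at least one of the two sources; the thesis does not state it directly.

```python
total_records = 24212     # records collected from the 196 volunteers
diary_incomplete = 1073   # records with incomplete diary answers
uv_missing = 4171         # records without UVA/UVB measurements
any_missing = 5041        # records with missing values after combining both

complete_records = total_records - any_missing
# inclusion-exclusion (assumption: any_missing is the union of the two sets)
missing_in_both = diary_incomplete + uv_missing - any_missing
subjects_dropped = 196 - 187

print(complete_records)   # 19171
print(missing_in_both)    # 203
print(subjects_dropped)   # 9
```

The first figure agrees with the 19171 complete records reported in the text.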

Dermatological collection: a collection⁵ of erythemato-squamous diseases. Six classes (the diagnoses of erythemato-squamous diseases) are observed: psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. This database contains 358 instances described by 34 attributes, 33 of which are nominal (values 0, 1, 2, 3) and one of which (age) is ordinal. The original data set contains a few missing values, which were removed for this investigation. The data set was previously used in [35]. The set is segmented by the aggregated Markov model described in Chapter 6.
