
resulting in poor classification performance.

Figure 4.6: Zooming in on the LSI space of the document substrings: (a) full space, (b) zoom. Some areas of the space are very dense with data, making them very attractive for the Gaussian mixtures. When using high-dimensional representations of the LSI substring space, the outer data points therefore tend to be poorly modeled. Since much of the variation in the data lies in the non-dense areas, the lack of modeling there is likely to result in a loss of information.

The HMM model with LSI-estimated emission probabilities was much better at classifying documents correctly than the HLG model. The state transition probabilities did, however, not seem to capture any valuable information about the differences in word sequences between the three classes. It is likely that a true EM estimate of the emission probabilities would have resulted in different transition probabilities that captured more of the word order information, leading to better classification. A true EM estimate of shared latent topic emissions would, however, require that a new HMM be defined and its update rules derived.

4.8 Summary

We have used two state space model approaches to capture the information that is contained in the order in which words appear in documents. The first approach involved a transformation of the document vocabulary into a smaller LSI vocabulary, on which an HMM could be trained. This approach lacked classification ability but was conceptually successful at capturing word order information.

The second approach involved estimating latent topic emission probabilities for an HMM using LSI. This approach had less success at capturing word order information, but was better at the classification task. We hope that the HMM approach will achieve greater success through the development of an HMM with shared latent topic emission probabilities.

Chapter 5

Dirichlet

5.1 Introduction

In the previous chapters we have used discriminative classifiers to categorize text documents. In this chapter and the following chapters the focus is on categorization based on generative probabilistic models. Generative approaches to classification are popular since they are relatively easy to interpret and can be trained quickly. With these generative approaches, the key problem is to develop a probabilistic model that represents the data well. Unfortunately, for text classification too little attention has been devoted to this task. Instead, the generic multinomial model is typically used. Recent work (Rennie et al., 2003) has pointed out a number of deficiencies of the commonly used multinomial model, and suggested heuristics to improve its performance.

The central problem with the multinomial model is that it assumes that documents are created from a static process, i.e. the multinomial model does not change its emission probabilities during the process of generating a document.

This assumption is opposed to the true nature of text documents, where a word at a given place in a document is dependent on the previous words in the document and especially dependent on whether the word itself has appeared previously. If a given word has appeared previously in a document, it is much more likely to appear again later in that document, i.e. the word appears in bursts. The fact that words tend to appear in bursts, as opposed to being emitted independently, has been investigated previously in (Church & Gale, 1995; Katz, 1996).
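To make the burstiness argument concrete, the following toy simulation (a sketch with purely illustrative numbers, not part of the experiments in this thesis) compares how often a single word reappears in a document under a static multinomial-style model and under a bursty model in which the word's probability varies from document to document. The Beta-binomial used for the bursty case is the one-word analogue of the Dirichlet alternative proposed later in this chapter.

import numpy as np

rng = np.random.default_rng(0)

# One word of interest with marginal probability p in documents of length n.
# All numbers are illustrative.
p, n, n_docs = 0.001, 500, 100_000

# Static multinomial-style emissions: the word count in a document is
# Binomial(n, p); every occurrence is independent of the previous ones.
static_counts = rng.binomial(n, p, size=n_docs)

# Bursty alternative: let p itself vary per document (a Beta prior with a
# deliberately small precision s to exaggerate the bursts).  This per-document
# variation is the extra degree of freedom a Dirichlet/Polya-style model adds.
s = 0.05
p_per_doc = rng.beta(s * p, s * (1 - p), size=n_docs)
bursty_counts = rng.binomial(n, p_per_doc)

def repeat_prob(counts):
    # P(word appears at least twice | it appeared at least once)
    appeared = counts[counts >= 1]
    return (appeared >= 2).mean() if appeared.size else 0.0

print("static model P(>=2 | >=1):", repeat_prob(static_counts))
print("bursty model P(>=2 | >=1):", repeat_prob(bursty_counts))

Under the static model a first occurrence carries no information about further occurrences, whereas under the bursty model the conditional repeat probability is far higher, which matches the "once it appears, it keeps appearing" behaviour observed in real text.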

The burstiness issue has previously been addressed a couple of times. Rennie et al. (2003) address this issue by log-normalizing counts, thereby reducing the impact of burstiness on the likelihood of a document. Teevan and Karger (2003) empirically search for a model that fits documents well within an exponential family of models, while Jansche (2003) proposes a zero-inflated mixture model.

In this chapter we go further. We show that the multinomial model is appropriate for common words but not for other words. The distributions of counts produced by multinomial distributions are fundamentally different from the count distributions of natural text. Zipf's law (Zipf, 1949) states that the probability $p_i$ of the occurrence of an event follows a power law $p_i \approx i^{-a}$, where $i$ is the rank of the event and $a$ is a parameter. The most famous example of Zipf's law is that the frequency of an English word, as a function of the word's rank, follows a power law with exponent close to minus one.
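As a small illustration of this rank-frequency behaviour (a sketch only; the token stream below is a made-up placeholder rather than a real corpus), the exponent $a$ can be estimated by a least-squares fit in log-log space:

import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    # Fit log f_i = c - a * log i, where f_i is the frequency of the word
    # with rank i; return the estimated exponent a.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy token stream; for large English corpora the fitted exponent is
# typically close to one, as stated above.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(zipf_exponent(tokens))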

We subsequently present a different probabilistic model that, without any heuristic changes, is far better suited for modeling the burstiness phenomenon. We propose to model a collection of documents using the Dirichlet distribution (Minka, 2003) as an alternative to the multinomial. The Dirichlet distribution has one additional degree of freedom, which, as we shall see, allows it to capture burstiness. The Dirichlet distribution can be thought of as a bag-of-scaled-documents model, whereas the multinomial distribution is a bag-of-words model.

In the following we continue to represent an individual document as a histogram of word counts (Salton et al., 1975). We further assume that word emissions are independent given the document category, i.e. the naive Bayes property holds. This property is not valid (Lewis, 1998), but naive Bayes models remain popular (McCallum & Nigam, 1998; Sebastiani, 2002) because they are fast and easy to implement, they can be fit even with limited training data, and they do yield accurate classification when heuristics are applied (Jones, 1972; Rennie et al., 2003).
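A minimal sketch of how such a generative classifier could look with Dirichlet class models is given below. The class names, the toy counts and the crude moment-matching fit are illustrative assumptions only (the actual parameter estimation is discussed later): each document's count histogram is scaled onto the simplex, a Dirichlet is fitted per class, and a new document is assigned to the class under which its scaled histogram has the highest density.

import numpy as np
from scipy.stats import dirichlet

def to_simplex(counts, eps=1e-3):
    # Turn a word-count histogram into a strictly positive point on the
    # simplex; the small eps smoothing avoids zeros, which the Dirichlet
    # density cannot handle.
    x = counts + eps
    return x / x.sum()

def fit_dirichlet_moments(X):
    # Crude moment-matching estimate of the Dirichlet parameters from the
    # rows of X (each row a document on the simplex).  A stand-in for the
    # maximum likelihood fit (e.g. Minka's fixed-point iteration).
    m, v = X.mean(axis=0), X.var(axis=0) + 1e-12
    s = np.median(m * (1 - m) / v - 1)      # one shared precision estimate
    return np.maximum(m * max(s, 1e-3), 1e-3)

# Toy usage with hypothetical training counts for two classes.
rng = np.random.default_rng(0)
class_counts = {
    "sport":    rng.poisson([9, 1, 1, 1, 1, 1], size=(20, 6)),
    "politics": rng.poisson([1, 1, 9, 1, 1, 1], size=(20, 6)),
}
alphas = {c: fit_dirichlet_moments(np.array([to_simplex(d) for d in counts]))
          for c, counts in class_counts.items()}

new_doc = np.array([7, 0, 1, 0, 2, 0])      # unseen count histogram
scores = {c: dirichlet.logpdf(to_simplex(new_doc), a) for c, a in alphas.items()}
print(max(scores, key=scores.get))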

Dirichlet distributions have been used previously to model text, but our approach is fundamentally different. In the LDA approach (Blei et al., 2003) the Dirichlet is a distribution over topics, while each topic is modeled in the usual way as a multinomial distribution over words. In our approach, each topic, i.e. each class of documents, is modeled in a novel way by a Dirichlet distribution instead of by a multinomial. Our approach is therefore complementary to LDA and related approaches. This will be explained in more depth in one of the following sections.