
1.3 Problem description

1.3.1 Reading guide/Overview

This section gives a brief overview of the structure of the thesis part by part.

Part I - Data Store and Preprocessing

Data store and preprocessing describes the design of the data store used in the Zeeker Search Engine, i.e. what data is available and how we have chosen to limit the data to a manageable size while still getting good results. Furthermore, data processing and the various data processing tests are also discussed.

Part II - Clustering

Clustering introduces the basics behind clustering and the general principles. This part also introduces a few clustering algorithms, such as Spherical k-means, Non-Negative Matrix Factorization (NMF) and Frequent Term-based Clustering (FTC). The differences between the algorithms are discussed, as are the tests performed on the implemented clustering.

Part III - Retrieval

The Retrieval part explains how the Zeeker Search Engine processes queries and how documents are retrieved, ranked and presented to the user. Various test scenarios, the tests performed on the Zeeker Search Engine's retrieval, and the results of these tests are also discussed.

Part IV - Implementation

Implementation of the search engine is described along with the data flow, i.e. from the time data is downloaded from the Internet to how it is presented on the web page.

Part V - Conclusion

The Conclusion part contains a discussion of the future of the Zeeker Search Engine, i.e. what the Zeeker Search Engine is capable of and how we want to improve it. Furthermore, all the efforts put into this thesis are discussed and summed up in the conclusion.

Appendix

The Appendix includes a list of the stop words used, a list of POS tags, a description of the test sets and a User Guide.


Part I

Data Store and Preprocessing

Chapter 2

Data store

Data store refers to what data is available for preprocessing, indexing, testing etc. This chapter discusses and explains what data from the Internet has been made available in the data store. Data is downloaded and only stored if it belongs to the chosen data intersection, which will be described later in this chapter. Wikipedia has been mentioned before, and since data from Wikipedia is used as the core data, it is necessary to take a closer look at how it is structured and what data is publicly available.

The downloading of data and web resources is handled by Zeeker.Spider. The webspider was created as a pre-project and is therefore not discussed in this thesis. Zeeker.Spider can best be described as a standard webspider, distributed and highly configurable, with an underlying database holding all downloaded data. Suffice it to say that Zeeker.Spider meets all the web-crawling needs of this thesis.

2.1 Wikipedia

Wikipedia is a well-known, free online encyclopedia that anyone can edit. It has had tremendous support from day one and is still growing. People all over the world co-author articles on whatever topics they find interesting, and Wikipedia categorizes these articles and makes them publicly available on the Internet.

Wikipedia (wiki) has been chosen as a source of expert pages in the search engine because of its enormous amount of information. Wikipedia includes both explicit information, such as articles, and implicit information, such as edit history, discussion pages etc. In fact, Wikipedia contains so much information that there is no way of utilizing it all¹. Therefore, a choice has to be made as to which information is most suitable for the purpose of this thesis, and the data has to be scaled to a size that can be utilized within the given time frame.

¹ Using the current hardware setup.


Wikipedia structure

One of the powerful features of Wikipedia is the strong context of its articles (one article - one topic). All articles are labeled and categorized by many editors, giving a very good categorization - which even improves over time as more people contribute.

Figure 2.1 illustrates a part of the Wikipedia structure. The category Musical groups is chosen, and the figure shows the outline of the category tree as well as the relationship between categories and articles. Categories can have subcategories and articles associated with them. Articles belong to one or more categories and have several interesting article attributes, as shown in the figure. Wikipedia does not enforce a strict parent-child relationship in its category structure, and cycles and disconnected categories are therefore possible (yet rare) [46]. The categorization of Wikipedia is human-made and as a result very precise - or at least as precise as can be hoped for. This category precision is what can hopefully be exploited to get better results when the core is clustered.
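
One simple way to model this structure is a directed graph with a cycle-tolerant traversal. The sketch below is a minimal Python illustration using the toy categories and article ids from figure 2.1; the data structures are illustrative and not part of the Zeeker implementation.

    from collections import defaultdict

    # Hypothetical slice of the Wikipedia category graph (figure 2.1):
    # category -> subcategories, and category -> associated articles.
    subcategories = defaultdict(set)
    articles = defaultdict(set)

    subcategories["Music genres"].add("Rock")
    articles["Music genres"].update({"a", "b", "c"})
    articles["Rock"].update({"b", "d", "e"})
    # No strict parent-child relation is enforced, so a back edge
    # like this one is possible (yet rare):
    subcategories["Rock"].add("Music genres")

    def collect_articles(root):
        """Collect every article reachable from a category, tolerating cycles."""
        seen, stack, found = set(), [root], set()
        while stack:
            category = stack.pop()
            if category in seen:  # already visited: a cycle or a shared subtree
                continue
            seen.add(category)
            found |= articles[category]
            stack.extend(subcategories[category])
        return found

    print(sorted(collect_articles("Music genres")))  # ['a', 'b', 'c', 'd', 'e']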

Using Wikipedia as a training set to create clusters is called supervised machine learning - which is exactly the intention of this thesis.

A more detailed discussion of the concepts clustering, supervised and unsupervised machine learning, and training sets is given in Part II. For now, suffice it to say that the strong categorization of the Wikipedia articles is a unique opportunity to use Wikipedia as a data source to generate good clusters.

Other features

Besides the great categorization and the strong article contexts, Wikipedia has a lot of other useful attributes, as shown in figure 2.1. All articles have a quality measure, such as stub, normal and featured, that indicates how Wikipedia rates the content of the article.

Every article also has associated discussion pages and logs of what has changed, who has changed it and when. These article attributes give a lot of information about the article quality as well. Long discussion pages, many major and minor edits, the profile of the user doing the editing, the number of links to other Wikipedia articles, the number of external links, and how many and which images are included can all be used as quality indicators for a specific article. These measures can even be used as a filter to remove bad articles when trying to generate accurate clusters.
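
To make the idea concrete, the sketch below shows how such attributes could be combined into a filter. The attribute names and thresholds are invented for illustration; they are not the filtering rules actually used in Zeeker.

    from dataclasses import dataclass

    @dataclass
    class ArticleStats:
        # Per-article attributes; the field names are hypothetical.
        rating: str              # "stub", "normal" or "featured"
        num_edits: int           # major and minor edits combined
        discussion_length: int   # characters on the talk page
        internal_links: int      # links to other Wikipedia articles
        external_links: int
        images: int

    def passes_quality_filter(a: ArticleStats) -> bool:
        """Illustrative filter; the thresholds are invented, not tuned."""
        if a.rating == "featured":
            return True   # featured articles pass unconditionally
        if a.rating == "stub":
            return False  # stubs carry too little text to cluster on
        # "normal" articles must show some community attention
        return a.num_edits >= 10 and a.internal_links >= 5

    print(passes_quality_filter(ArticleStats(
        "normal", num_edits=42, discussion_length=1200,
        internal_links=17, external_links=3, images=2)))  # True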

Yet another advantage of using the Wikipedia articles is that when clustering the articles, the article information might be incorrect but the vocabulary will probably still be correct. That is, the words used to describe a topic inaccurately are presumably the same as the words normally used to describe that topic accurately. Hence, the clusters will still give a very good description of the topic, since the statistical properties of the individual words are used and not their exact meaning.
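
As a toy illustration of this point (the example sentences are made up), a bag-of-words view shows that a factually wrong description of a topic still shares most of its topical vocabulary with an accurate one:

    from collections import Counter
    import re

    def term_frequencies(text: str) -> Counter:
        """Bag-of-words view: only word counts matter, not what is asserted."""
        return Counter(re.findall(r"[a-z]+", text.lower()))

    accurate = "rock is a music genre built around the electric guitar"
    wrong = "rock is a music genre invented in oslo around the electric guitar"

    shared = term_frequencies(accurate).keys() & term_frequencies(wrong).keys()
    print(sorted(shared))  # topical words such as 'genre', 'guitar', 'music', 'rock'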

[Figure 2.1: Wikipedia category and article hierarchy overview. Shown in the figure: the specific categories Music genres and Rock, each with zero or more subcategories; their associated article sets (a, b, c / b, d, e / d, e, f); and the article attributes external links, internal links, edit history, discussions and others, together with the quality ratings featured, normal and stub.]

Furthermore, Wikipedia includes a download service which provides data dumps in text format. These data dumps include many important tables that the Wikipedia database consists of, such as page information, external links, image links, page-to-page links, page abstracts, articles, article titles, redirecting articles and much more.
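
For example, a pages-articles dump can be streamed with a standard XML pull parser. The sketch below is a minimal Python illustration; the file name is a placeholder and the tag handling assumes the MediaWiki export schema.

    import bz2
    import xml.etree.ElementTree as ET

    def iter_pages(dump_path):
        """Stream (title, wikitext) pairs from a pages-articles dump."""
        with bz2.open(dump_path, "rb") as dump:
            title, text = None, None
            for _, elem in ET.iterparse(dump):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text or ""
                elif tag == "page":
                    yield title, text
                    elem.clear()  # release the memory held by this page

    # Print the first five page titles (the file name is a placeholder)
    for i, (title, _) in enumerate(iter_pages("enwiki-pages-articles.xml.bz2")):
        print(title)
        if i >= 4:
            break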

Another interesting source for generating training sets and clusters was considered, namely the Open Directory Project (ODP). The problem with ODP is that one page does not necessarily mean one topic, and it is much easier to build a category from a page when it is known that it only has one topic. Therefore, given all its features and article attributes, Wikipedia seems like the best choice.
