
13.3 Summary

13.3.1 Data flow

The applications created are shown in figure 13.1. The figure shows the overall flow of data, from the download of resources from the Internet by Zeeker.Spider to the point where the data is available in the index and can be retrieved through Zeeker.Website.

Two primary data sources are used: the Internet and http://en.wikipedia.org.

The data from the Internet is automatically downloaded by Zeeker.Spider and stored in compressed form in Zeeker.Base with the help of a web service called Zeeker.DataGateway. Wikipedia data dumps are downloaded manually and imported into Zeeker.Base using the DataManagementWizard application.
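The report does not show the spider's code, but the core of its collection step can be sketched as follows. The function name, the gzip storage scheme and the file naming are assumptions, and the Zeeker.DataGateway web service is reduced to a local file write:

```python
import gzip
import urllib.request

def fetch_and_store(url, store_path):
    """Download one resource and store it gzip-compressed -- roughly the
    spider's collection step; the real system stores through the
    Zeeker.DataGateway web service rather than writing a local file."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read()
    with gzip.open(store_path, "wb") as f:  # compressed form, as in Zeeker.Base
        f.write(html)

if __name__ == "__main__":
    fetch_and_store("http://example.com/", "page-000001.html.gz")
```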

This concludes the data collecting phase of the Zeeker Search Engine.

An MS SQL 2005 database is used because it makes it possible to extract exactly the data needed for indexing and testing. Zeeker.Base and DataManagementWizard can calculate Wikipedia clusters (Wikipedia categories) and dump data in different formats.

All the different data dumps make it easier to build the parsers used in the BuildIndex application. No data is removed when it is dumped from the database; instead, several thousand HTML resources are accumulated and stored in one or more (very large) files. In this new format, usually a variant of TREC5, metadata not available in the original downloaded data can also be added, such as an id pointing back to the database. DataManagementWizard supports over 30 different data dumps in various formats.

5 Text Retrieval Conference (TREC): http://trec.nist.gov/
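As an illustration of such a dump, the sketch below writes a batch of downloaded HTML resources into a single TREC-style file. The <DOC>/<DOCNO>/<TEXT> tags follow the standard TREC layout; the <DBID> tag (the id pointing back to the database) is a hypothetical name, since the report does not give the exact dump format:

```python
def write_trec_dump(resources, path):
    """Write (db_id, url, html) tuples into one TREC-style dump file."""
    with open(path, "w", encoding="utf-8") as out:
        for db_id, url, html in resources:
            out.write("<DOC>\n")
            out.write(f"<DOCNO>{url}</DOCNO>\n")
            out.write(f"<DBID>{db_id}</DBID>\n")  # link back to Zeeker.Base
            out.write("<TEXT>\n")
            out.write(html)
            out.write("\n</TEXT>\n</DOC>\n")

if __name__ == "__main__":
    docs = [(42, "http://example.com/a", "<html><body>A page</body></html>")]
    write_trec_dump(docs, "dump-000.trec")
```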

BuildIndex is the application responsible for indexing the data dumped by DataManagementWizard. BuildIndex is by far the most complex of the applications making up the Zeeker Search Engine: it creates the parsers, the POS tagger, the stopper, the stemmer and the indexer. The BuildIndex data flow can be seen in figure 3.2. When all exported files have been processed by BuildIndex, a complete searchable (but not yet clustered) index is available.

HTML files are added to clusters by the Clustering application. For each HTML document, the application calculates the cosine similarity between the document and each cluster centroid, and the document is added to a cluster if the similarity score is above a preset threshold. The clustering of the Wikipedia articles and the HTML files is stored in a binary file, which is loaded by Zeeker.Website. When the index has been clustered, a searchable index with category filtering options is ready.
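The cosine similarity between a document vector d and a centroid c is cos(d, c) = (d · c) / (||d|| ||c||). A minimal sketch of the assignment step, assuming sparse term-weight vectors and a hand-picked threshold (the report does not state the actual value):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_to_clusters(doc, centroids, threshold=0.2):
    """Return every cluster whose centroid is similar enough; a document
    may end up in several clusters, or in none at all."""
    return [name for name, c in centroids.items() if cosine(doc, c) >= threshold]

if __name__ == "__main__":
    centroids = {
        "rock_bands": {"band": 0.8, "guitar": 0.5, "album": 0.3},
        "classical":  {"symphony": 0.9, "orchestra": 0.4},
    }
    doc = {"band": 1.0, "album": 2.0, "tour": 1.0}
    print(assign_to_clusters(doc, centroids))  # ['rock_bands']
```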

Zeeker.Website is a basic web interface created solely for the purpose of making the index public. The website searches the file-based index and presents the results like any other search engine on the web. A few, but important, search operators are implemented, as described in chapter 10. The website is designed in a simple manner and should be intuitive to use.

The implemented applications and features have extended the Lemur framework. Features such as parsers, a POS tagger, a POS filter, the ability to store more data and metadata, a new clustering method, cluster quality calculation and much more have been added to the framework. Few objects or classes remain unmodified, and several have been completely rewritten.

The Zeeker Search Engine is a very complex array of applications which together comprise more than 2600 files taking up more than 900 MB on disk. This is of course a crude measure, as some files are bigger than others, yet it gives an impression of the complexity of the Zeeker Search Engine.


Part V

Conclusion

Chapter 14

Future work

During the analysis and implementation of the Zeeker Search Engine, many choices have been made as to how data is represented, indexed, filtered and retrieved.

Building a search engine is more about making choices than it is an exact science. Even though the many choices have at times posed dilemmas, they can also be seen as parameters which can later be adjusted in order to improve the results. Some of the more dominant parameters available are: adjusting how the vocabulary is reduced, changing the way documents are clustered and improving the retrieval method.

The choices made have revealed some minor issues and improvements that can be worked on to improve the search engine. This chapter summarizes these issues and improvements and presents proposed solutions.

14.1 Known issues

First of all, even the fairly conservative pruning of the vocabulary has seemingly been too strict. Although the idea was not to remove too many terms from the index, many vital terms are missing regardless. In the literature on the subject (see the introduction), researchers have removed large numbers of terms from indexes without reducing retrieval quality. We believe this is due to the nature of the indexed vocabulary. For example, a vocabulary based on email correspondence will probably not be very diverse, as people tend to use the same terms over and over again in emails and daily conversations. Most of the terms can therefore be removed, leaving only a few (distinct and important) terms for indexing. The Zeeker Search Engine index, on the other hand, is based on resources about musical groups, song titles, band names etc. Removing stop words from such resources can easily remove an entire song title or band name.

Pruning was found to be too aggressive, as a search on The Who returned no results: both terms are stop words. Localizing the stop-word problem revealed another minor, yet serious, issue. Terms were found to be automatically down-cased before they were POS-tagged. This resulted in wrong POS categories for some terms; a band name like The Who became the who, and the tagger consequently failed to identify the terms as nouns.
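The effect is easy to reproduce with a toy tagger. The sketch below is purely illustrative (the actual tagger is part of the extended Lemur framework and is not shown in the report); like many real taggers, it uses mid-sentence capitalization as one cue for proper nouns:

```python
def toy_pos_tag(tokens):
    """A toy tagger using capitalization as a proper-noun (NNP) cue."""
    closed = {"the": "DT", "who": "WP", "in": "IN"}
    tags = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and i > 0:  # mid-sentence capital -> proper noun
            tags.append((tok, "NNP"))
        else:
            tags.append((tok, closed.get(tok.lower(), "NN")))
    return tags

tokens = "The Who played in London".split()
print(toy_pos_tag(tokens))                       # Who -> NNP, kept as a noun
print(toy_pos_tag([t.lower() for t in tokens]))  # who -> WP, wrongly a pronoun
```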

Cluster precision and cluster quality are another known issue. As described earlier, documents are added to the most similar clusters by measuring the similarity between the document and the cluster centroids using the cosine similarity method. This method works to some extent, yet some web pages seem to be placed in too many clusters. The solution to this problem is not entirely clear.

One solution could be to raise the similarity threshold, which would make the assignment of documents to clusters more selective, although some documents might then not be added to any cluster at all. Another solution is to re-evaluate the entire clustering approach. Cosine similarity works well with flat cluster structures but does not seem to work properly when the cluster structure is hierarchical, like the Wikipedia clusters. The cosine similarity measure is more effective with smaller, stricter vocabularies, whereas the larger and more general clusters in the Wikipedia hierarchy contain many documents and thus have a very general and rich vocabulary. Measuring document similarity against the centroids of such general vocabularies will never be very specific. The best solution to this problem is by no means clear, and further analysis and testing are needed in order to improve the cluster precision and quality.
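To see why the general clusters match poorly, consider a centroid averaged over many documents with diverse vocabularies: the weight of any single term is diluted, so the cosine similarity to any specific document stays low. A small sketch with made-up vectors (cosine as in the earlier sketch):

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Average term-weight vector over a list of documents."""
    terms = {t for d in docs for t in d}
    return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

doc = {"guitar": 1.0, "album": 1.0}

# A small, focused cluster: the documents share the query's terms.
focused = centroid([{"guitar": 1, "album": 1}, {"guitar": 1, "band": 1}])

# A large, general cluster: the same terms are buried among many others.
general = centroid([{"guitar": 1}] + [{f"term{i}": 1} for i in range(50)])

print(round(cosine(doc, focused), 3))  # high: the centroid is specific
print(round(cosine(doc, general), 3))  # low: the centroid is diluted
```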
