
to measure the search engines’ performance. These new measurements are described as follows:

• [Quality of result ranking] Described by the correlation between human and search engine ranking

• [Ability to retrieve top-ranked pages] Top results for a query are taken from the search engines and merged into a single set. Human users then rank these results by relevance. The ability of a search engine to retrieve relevant pages is then measured as the percentage of its top results that are relevant.

• [Stability over time (10 weeks)] Stability of the number of pages retrieved and of how many pages remain in the top results over a short period.

Liwen’s research showed that these new measurements can distinguish search engine performance very well.
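To make the first two measurements concrete, the sketch below (Python, written purely for illustration; the study does not prescribe an implementation, and the function names are hypothetical) computes the rank correlation between a human ranking and an engine ranking of the same results, and the share of an engine's top results that the human judges marked relevant.

```python
# Illustrative sketch of two of the proposed measurements (not code from the study).

def spearman_rho(human_rank, engine_rank):
    """Rank correlation between a human ranking and an engine ranking.

    Both arguments map each result URL to its rank position (1 = best).
    Assumes the same URLs appear in both rankings and that there are no ties.
    """
    urls = list(human_rank)
    n = len(urls)
    # Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d_squared = sum((human_rank[u] - engine_rank[u]) ** 2 for u in urls)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

def top_relevance_share(engine_top, relevant):
    """Share of an engine's top results that human judges marked relevant."""
    hits = sum(1 for url in engine_top if url in relevant)
    return hits / len(engine_top)

if __name__ == "__main__":
    human = {"a": 1, "b": 2, "c": 3, "d": 4}
    engine = {"a": 2, "b": 1, "c": 3, "d": 4}
    print(spearman_rho(human, engine))                            # 0.8 -> rankings largely agree
    print(top_relevance_share(["a", "b", "x"], {"a", "b", "c"}))  # 2 of 3 top results relevant
```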

In a new study [2], Ali et al. present an automated framework to measure engine performance. This framework takes advantage of user feedback, the Cosine Similarity Measure, PageRank (as used by Google), a Boolean Similarity Measure¹, as well as rank aggregation techniques to evaluate search engine performance. The framework submits a query to a search engine, presents the results to the user, and also stores the original ranking. User feedback is then gathered through the user’s click data, i.e. which links are followed, printed, bookmarked, saved etc. This click data is then analyzed and four new rankings are calculated based on the user’s actions. These new rankings are calculated using the Cosine Similarity Measure, the Boolean Similarity Measure, PageRank and the user feedback.

The rankings are then aggregated into a new ranked list of results. Correlation is then calculated between the original list returned by the search engine and the new ranked list. The higher the correlation coefficient, the more effective the engine is. The four different measurements are used in order to avoid bias between the different structures of the search engines tested. Seven search engines were tested in the study with good results.
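As an illustration of how the aggregation step fits together, the sketch below merges several re-rankings with a simple Borda count and correlates the merged list with the engine's original ranking. The Borda count is used here only as one possible aggregation technique, and the data and function names are hypothetical; the study applies its own rank aggregation methods.

```python
# Minimal sketch of the aggregation step: several re-rankings of the same results
# are merged into one list, which is then compared against the engine's original
# ranking with a rank correlation coefficient.

def borda_aggregate(rankings):
    """Merge ranked lists (best first) into one list by summed Borda scores."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, url in enumerate(ranking):
            scores[url] = scores.get(url, 0) + (n - position)
    return sorted(scores, key=scores.get, reverse=True)

def rank_correlation(list_a, list_b):
    """Spearman correlation between two orderings of the same items (no ties)."""
    pos_a = {url: i for i, url in enumerate(list_a)}
    pos_b = {url: i for i, url in enumerate(list_b)}
    n = len(list_a)
    d_squared = sum((pos_a[u] - pos_b[u]) ** 2 for u in list_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

if __name__ == "__main__":
    original = ["a", "b", "c", "d"]   # ranking returned by the engine
    cosine   = ["a", "c", "b", "d"]   # re-ranking from cosine similarity
    boolean  = ["b", "a", "c", "d"]   # re-ranking from Boolean similarity
    pagerank = ["a", "b", "d", "c"]   # re-ranking from PageRank
    feedback = ["a", "b", "c", "d"]   # re-ranking from click feedback
    merged = borda_aggregate([cosine, boolean, pagerank, feedback])
    print(merged)
    print(rank_correlation(original, merged))  # higher -> engine judged more effective
```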

11.3 Summary

Searching is without a doubt one of the most popular activities on the Internet, and as search engines grow more complex, the measures of their effectiveness have to keep up. Using the simple F-measure can be useful, but it is very cost-inefficient if standard datasets are not used, since humans have to organize the datasets and evaluate the relevance of each document with regard to the test queries. User feedback has been shown to give good results, but is also cost-inefficient. A more automated approach is therefore desirable, and research within the field of search engine evaluation seems to be heading toward more automated evaluation methods.
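For reference, the cost argument is visible directly in what the F-measure requires as input: a set of human relevance judgements for every test query. The sketch below is illustrative only and not evaluation code from this work.

```python
# Illustrative sketch: the F-measure needs human relevance judgements
# (the `relevant` set) for every test query, which is what makes it
# costly on non-standard datasets.

def f_measure(retrieved, relevant, beta=1.0):
    """Harmonic combination of precision and recall (F1 when beta = 1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

if __name__ == "__main__":
    # Both sets below must come from humans labelling documents per query.
    print(f_measure(retrieved={"d1", "d2", "d3"}, relevant={"d1", "d4"}))  # 0.4
```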

¹ A simplified version of Li and Danzig’s [26] S measure is used to reduce computational effort.


Chapter 12

Testing retrieval

In this chapter, the testing of the implemented search engine (using the retrieval discussed in chapter 10) is described. The considered test strategies are discussed in the following section, followed by the test scenarios and their results. Finally, the testing is summed up and conclusions are drawn regarding the overall performance of the search engine.

12.1 Test strategies

In the previous chapter, two approaches commonly used to evaluate a search engine’s performance and retrieval precision were discussed. Indeed, the F-measure is simple and computationally easy, given datasets and queries that have been analyzed and where the desired outcome of each query is known. However, this is not the case with the datasets used in this work. The approximately 200,000 web documents are not labeled in any way, except for the clusters they are assigned to. The Zeeker Search Engine uses clusters to provide additional filtering of the result set, and therefore measuring recall and precision on the original result set does not make any sense, as the filtering is what makes Zeeker Search Engine different from other search engines. Therefore, the use of recall, precision and F-measure on known datasets was quickly dismissed.

Since the F-measure was of no use in this case, the only measure left was User Feedback. In the previous chapter, manual and automatic user feedback scenarios were described. The automatic scenario described by Ali et al. in [2] seemed a bit excessive, and as the other strategies did not seem to fit the purpose either, it was decided to rely entirely on manual user feedback, i.e. have users test Zeeker Search Engine and give feedback about its performance and retrieval precision.


12.1.1 Selected test methods

Two methods were mainly used to test the engine. First of all, the retrieval part was tested using a trial-and-error approach, where the primary goal was to find errors in the retrieval logic and programming code. Trial-and-error was also used to see how the engine handled various potentially problematic queries¹. These tests revealed some errors, which were fixed before the search engine was put online for others to try out.
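A trial-and-error pass of this kind can be pictured as a small smoke test. The sketch below is purely illustrative: the `search` entry point and the example queries are hypothetical, and the exact problematic queries that were tried are not listed here.

```python
# Hypothetical smoke test in the spirit of the trial-and-error checks described
# above. `search` is a placeholder to be wired up to the engine's query interface.

def search(query):
    """Placeholder for the engine's query interface; returns a list of results."""
    return []

PROBLEMATIC_QUERIES = [
    "",                    # empty query
    "   ",                 # whitespace only
    '"unbalanced quote',   # malformed phrase syntax
    "AND OR NOT",          # operators with no terms
    "zzzzqqqq",            # term unlikely to occur in the index
    "a" * 1000,            # excessively long query
]

def smoke_test():
    for query in PROBLEMATIC_QUERIES:
        try:
            results = search(query)
            print(f"{query[:20]!r}: {len(results)} results")
        except Exception as exc:  # any crash points to a retrieval-logic bug
            print(f"{query[:20]!r}: FAILED with {exc!r}")

if __name__ == "__main__":
    smoke_test()
```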

The second method used was manual user feedback. A questionnaire was constructed and sent out to numerous people, asking them to participate.

Before creating the questionnaire, the information and answers valuable for judging the search engine’s performance had to be defined. Based on general search behavior with search engines, e.g. Google, it was concluded that users mainly use search engines in two ways: either for question answering (who is, what is, etc.) or for research (what has been written about some topic). Therefore, the questionnaire should include questions giving an indication of how well the search engine can be used for question answering and research, respectively. To test the question-answering part, users were asked to find answers to questions known to exist in the index, given minimal clues to go on. The research part was tested by asking the users to submit queries of their own and evaluate the relevance of the results returned by the search engine.

Asking users to evaluate results from predefined queries was also considered. This idea presented a couple of problems. First of all, users might not know anything about the chosen topic of the queries and would therefore be in no position to evaluate the relevance of the retrieved information. Furthermore, predefined queries known to give good results could be selected, thus giving biased results and making the questionnaire unreliable. Finally, this approach does not model the general search behavior mentioned above, and therefore the use of predefined queries was dismissed entirely.

The devised questionnaire can be found in section B.2 of the appendix. The test results are presented and discussed in the next section.
