
Chapter 3: A (Media) Archaeology of Citation

5.1 IR system

As Rieder critically points out, the ‘Google search engine is of course both a link analysis and an IR System’, and I begin by addressing the latter (2012:8). The term Information Retrieval (IR) is attributed to Calvin Mooers in 1950, who defined IR as the ‘human cognitive processes’ in epistemological quests that drew on ‘theories of information categorization’, as well as the creation of automated catalogues and ‘retrieval in the field of computer science and artificial intelligence’ (Van Couvering 2009:95). IR spans different fields; however, its combination with the citation analysis developed for finding scientific literature began in the 1990s with the rise of search engines. This convergence between ‘Information Retrieval’ and ‘searching the web’ occurred at the moment when the web was expanding exponentially and creating millions of documents, which needed to be retrieved and ordered. PageRank was designed to bring the searcher to another destination as a means to solve the ‘lost in hyperspace’ problem, by determining what is ‘relevant’ to the user.

Figure 26: Brin and Page’s Figure 4 (1998)

Brin and Page’s concept of ‘relevancy’ was based on the discovery that ‘[p]eople are still only willing to look at the first few tens of results’ (1998:108).63 Brin and Page also point out that the biggest problems facing users are the quality and efficiency of the returned search results, with the solution being the Google query evaluation process (Figure 26). In order to obtain quality search results based on a query, they defined ‘relevancy’ through hit lists, keeping location information and using ‘proximity’ in search, which helps increase relevance for multiple-word queries.

63 This revelation is crucial because it takes the position of a user who has to be able to search through large indexes of information; it also inspired me to look on the 2nd, 3rd or even the 50th page of Google Search results.

PageRank works in conjunction with automated programs called spiders or crawlers and applies an ‘inverted index’ or ‘reverse index’, which improves search speed by indexing words or terms into a database of text elements. The PageRank algorithm also considered the frequency and location of keywords within a web page and how long the web page has existed (ibid). Additionally, the ‘full, raw HTML of pages is available in a repository’, and the font size and whether the words are in bold, or weighted differently, also matter (ibid:110). Appendix A provides a more detailed explanation of ‘how keyword search works’.
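The mechanics of an inverted index can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not Google’s implementation; the two sample ‘documents’ are invented:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the documents (and word positions) where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = {
    "page1": "citation analysis and information retrieval",
    "page2": "link analysis for web search",
}
index = build_inverted_index(docs)
# A query term becomes a direct lookup rather than a scan of every document;
# stored positions also make 'proximity' scoring for multi-word queries possible.
print(index["analysis"])  # [('page1', 1), ('page2', 1)]
```

Because positions are stored alongside document identifiers, the same structure supports the location and proximity information Brin and Page describe.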

The section of Brin and Page’s text titled Intuitive Justification offers insight into what is behind the workings of PageRank and why specific pages are given a certain ‘weight’ based on terms.

Karen Spärck Jones is credited with the invention of ‘inverse document frequency’ in IR systems (1972), where the

importance of terms is weighted according to the proportion of documents in the corpus in which they occur; the intuition being that terms which occur in many documents are poor index terms. This is the partial basis of all weighting schemes adopted by widely used Internet search engines (Tait 2007).
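Spärck Jones’s intuition can be expressed as a short calculation. The standard logarithmic form is used here and the three sample documents are invented; actual engines use many refinements of this weighting:

```python
import math

def inverse_document_frequency(term, docs):
    """Spärck Jones's intuition: terms occurring in many documents are poor
    discriminators, so their weight falls as their document frequency rises."""
    n_containing = sum(1 for text in docs if term in text.split())
    return math.log(len(docs) / n_containing) if n_containing else 0.0

docs = [
    "the web and the search engine",
    "the link structure of the web",
    "citation analysis of scientific literature",
]
# 'the' appears in most documents -> low weight; 'citation' is rare -> high weight
print(inverse_document_frequency("the", docs))       # log(3/2) ≈ 0.405
print(inverse_document_frequency("citation", docs))  # log(3/1) ≈ 1.099
```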

In regard to IR systems, the information scientist Tefko Saracevic states that ‘there is no such thing as relevance without a context. Relevance is always contextual’ (2013). Rieder concurs, stating that the ‘classic IR idea of relevance’ is always thought about ‘in relation to a specific “informational need”––complemented by the sociometric concepts of status and authority’ (2012:8). According to Anders Koed Madsen, PageRank creates a ‘market of relevance’ that depends on a calculative space similar to the one economic sociologists have detected in a ‘market of goods’ (2012:12). Just as markets assign prices to goods, PageRank assigns visibility and relevance to information in response to specific queries.

We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document (Brin and Page 1998:109).
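The dot product Brin and Page describe can be sketched as follows. The hit types, their weights, the count cap, and the way the IR score is combined with PageRank are all illustrative assumptions here; the paper describes the structure of the calculation but does not publish the actual weights or the combination function:

```python
# Hypothetical type-weights: hits in titles or anchor text count more than plain text
TYPE_WEIGHTS = {"title": 5.0, "anchor": 3.0, "plain": 1.0}

def count_weight(count, cap=5):
    """Counts are weighted but taper off: beyond the cap, extra hits add nothing."""
    return min(count, cap)

def ir_score(hit_counts):
    """Dot product of count-weights with type-weights (after Brin and Page 1998:109)."""
    return sum(count_weight(c) * TYPE_WEIGHTS[t] for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    """One simple way to combine the two scores; the real combination is unspecified."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

# One hit in the title and seven in plain text (capped at five):
print(ir_score({"title": 1, "plain": 7}))  # 1*5.0 + 5*1.0 = 10.0
```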

5.2 Link analysis

To return to the former aspect of Rieder’s statement regarding link analysis, Brin and Page’s self-referential citation method was built on exploiting links, and it divulges part of PageRank’s secret recipe. ‘In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering’ (Brin and Page 1998:108).64 PageRank applies ‘anchor text’––the human-readable text of a hyperlink, which is no longer than 60 characters and provides a clear and accurate description of the web pages themselves, but which also consumes a large amount of computing power to process. Other text-based search engines cannot read documents such as databases, images, programmes, etc. that also have anchor texts, but ‘Google can and does return pages that haven’t even been crawled or didn’t yet exist’ but had ‘hyperlinks pointing to it’ (ibid:109). The algorithm evaluates the number and quality of links to a page to get a rough estimate of the value of the page relative to all pages on the web.

64 Larry Page had already published on the idea of how to rank efficiently under his last name, Page.

The more links pointing to a page, the more valuable that page is, value being measured by the number of times a website is visited on the web.

When Google developed PageRank, factoring in incoming links to a page as evidence of its value, it built in a different logic: a page with many incoming links, from high-quality sites, is seen as ‘ratified’ by other users, and is more likely to be relevant to this user as well (Gillespie 2014:178).

The more important or worthwhile websites were likely to receive more links from other websites, and this ‘linking’ is a direct reflection of society and a ‘rational attribution of importance’ that contributes to the ‘universal understanding of authority’ (Rieder 2012:8).
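This ‘rational attribution of importance’ through incoming links can be illustrated with a minimal version of the PageRank iteration. The damping factor of 0.85 follows Brin and Page’s paper; the three-page web is invented, and real implementations must also handle pages with no outgoing links:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's score among the pages it links to.
    `links` maps each page to the pages it links out to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base score (the 'random surfer' teleporting)
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

# A tiny web: 'a' and 'c' both link to 'b', so 'b' accumulates the most authority
web = {"a": ["b"], "b": ["c"], "c": ["b"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # b
```

The corpus is scored before any query arrives, which is exactly the ‘a priori landscape of authority’ that distinguishes PageRank from query-time approaches discussed below.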

Every link between websites is a kind of vote about worth or “authority” (Kleinberg 1997). Google reads the infrastructure of the web as the interpreter of its content. The metadata used for indexing are not contained in a document but are inferred, kind of like Saussure’s network of language. Google’s crawlers are interested in key words on websites, but are even more interested in network strength and density, reading for inlinks and outlinks (PageRank rarely points to broken links) (Peters 2015:327).

Besides indexing, IR and linking determining relevance, how the ‘corpus was conceptualised’ becomes decisive, because the corpus PageRank works on is not flat (Rieder 2012:8). Instead of being a document repository, it is a ‘social system’, reflected by documents that have been placed in a ‘stratified network/society before the searching even begins’ (ibid). It is therefore helpful to elucidate how PageRank could have been designed otherwise.

Rieder subsequently compares PageRank to Jon M. Kleinberg’s HITS (Hyperlink-Induced Topic Search), which appeared independently around the same time and proposed another ‘ranking’ model to determine the ‘relevance’ of results.

An advantage of HITS with respect to PageRank is that it provides two scores at the price of one. The user is hence provided with two rankings: the most authoritative pages about the research topic, which can be exploited to investigate in depth a research subject, and the most hubby pages, which correspond to portal pages linking to the research topic from which a broad search can be started (Franceschet 2010:5.1).

Furthermore, Hugill et al. explain the important distinction between the two search methods:

PageRank assigns a numerical weight to each document, where each link counts as a vote of support in a sense. It is executed at indexing time, so the ranks are stored with each page directly in the index. HITS basic features are the use of so-called hubs and authority pages. It is executed at query time. Pages that have many incoming links are called authorities and pages with many outgoing links are called hubs (2013:247).

Although both analysed the link structure of the Web, HITS differed in that it used two eigenvector metrics, one for authority and the other for ‘hubs’, and in that it reversed the temporality, the order in which documents were retrieved and ranked:

Rather than calculating a universal or a priori landscape of authority, documents matching a query are retrieved first and authority is calculated second ––thus based on the link structure in the result set only and not the full corpus (Rieder 2012:8).
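Kleinberg’s mutual reinforcement between hubs and authorities can be sketched as follows. The graph is invented, and in real HITS it would be the query-time result set Rieder describes, not the full corpus:

```python
def hits(links, iterations=50):
    """Kleinberg's HITS: good hubs point to good authorities, and good
    authorities are pointed to by good hubs. `links` maps pages to outlinks."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in
        for p in pages:
            auth[p] = sum(hub[q] for q, outs in links.items() if p in outs)
        # Hub score: sum of authority scores of the pages linked out to
        for p in pages:
            hub[p] = sum(auth[t] for t in links.get(p, []))
        # Normalise so the scores converge instead of growing without bound
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# 'portal' links out widely (a hub); 'paper' is widely linked to (an authority)
graph = {"portal": ["paper", "site"], "site": ["paper"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # paper portal
```

The two returned rankings correspond directly to Franceschet’s ‘two scores at the price of one’: authorities for depth, hubs for breadth.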

In this way one could question the implicit ‘bias’ in PageRank, which has already sorted and compiled the corpus before ranking, whereas with HITS, ‘authority is dependent on a domain’ (ibid). The query is thereby ‘freer’ to roam through the documents, and a page with a high score for one query would not necessarily receive a high score for another. ‘Ranking can be done at different stages of the search process. Depending on how the index is formatted and what information can be’ (Hugill et al. 2013:246). However, HITS was not a success because of the ‘higher susceptibility of the method to spamming’ (Franceschet 2010:5.1) and because it demanded much more computational power to carry out search requests.