

Chapter 5: The Personalised Subject

4.2 From PageRank to RankBrain

process was also interrupted. Drawing on the research of Seaver (2013), Rob Kitchin explains that ‘many proprietary systems are aware that many people are seeking to determine and game their algorithm, and thus seek to identify and block bot users’ (2017:24). This is exactly what happened to me when I was carrying out my search queries: Google perceived that I was a bot because I was only clicking on and collecting data from the 1st, 10th, 20th, 30th, 40th, etc. results of the SERPs. At a certain moment I received a Google ‘We’re sorry’ message (Figure 40). Although Google ‘intervened’ and I had to start over, I was eventually able to collect my small data set.
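What gave me away, presumably, was the regularity of the pattern: fixed result positions requested in a fixed order at machine-like intervals. A minimal sketch of how such a collection schedule could be varied to look less bot-like (the function name and parameters are my own invention, not part of the study, which was carried out by manual browsing):

```python
import random

# Hypothetical helper, invented for illustration. Requesting results
# 1, 10, 20, 30, 40 in strict order at fixed intervals produces the
# machine-like pattern that bot detection looks for; shuffling the
# order and randomising the delays makes the traffic less regular.
def humanised_schedule(result_positions, base_delay=2.0, jitter=8.0):
    """Yield (position, delay) pairs in randomised order with
    randomised waiting times between requests."""
    positions = list(result_positions)
    random.shuffle(positions)  # avoid a strictly regular sequence
    for pos in positions:
        delay = base_delay + random.uniform(0, jitter)
        yield pos, delay

for pos, delay in humanised_schedule([1, 10, 20, 30, 40]):
    print(f"wait {delay:.1f}s, then collect the result at position {pos}")
```

This does not defeat bot detection, of course; it only illustrates why a perfectly regular click pattern is easy to flag.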

lower quality, enabling higher-quality pages to rise. In April 2012 Google launched the ‘Penguin’ update, which attempted to catch sites that were ‘spamming’, e.g. buying or obtaining links through networks to boost Google rankings; as of September 30, 2016, its real-time updates are part of the core algorithm. Since June 2013 ‘Payday’ has fought spam such as ‘payday loans’, ‘Pigeon’ improves local results, ‘Top Heavy’ demotes ad-heavy pages, ‘Mobile Friendly’ rewards mobile-friendly pages and ‘Pirate’ fights copyright infringement.

Figure 43: Weltevrede’s timeline of key Google algorithm changes from 2002-2015 (2016)

Analogous to the components of a car engine that has had its parts replaced, where Penguin and Panda might be the oil filter and gas pump respectively, the launch of ‘Hummingbird’ in August 2013 was Google’s largest overhaul since 2001. With the introduction of a brand-new search engine the emphasis shifted to the contextual: it became less about the keyword and more about the intention behind it, the semantic capabilities. Whereas previously certain keywords were the focus, now the other words in the sentence and their meaning gained importance.

Moreover, the complexity level of the queries went up, resulting in an improvement in the indexing of web documents. According to David Amerland, author of Google Semantic Search, the ‘relationality linking search queries and web documents’ comes together with the Knowledge Graph, along with ‘conversational search’ that incorporates voice-activated enquiries (Gesenhues 2013) (Figure 44).

Figure 44: Amy Gesenhues’s Twitter feed from September 27, 2013.

Publicly declared as Google’s third most important ranking ‘signal’, after links and content (words), PageRank infers the use of a keyword by applying synonyms or stemming lists. Users’ queries have also changed: they are not only keywords but also multi-word phrases and sentences that could be deemed ‘long-tail’ queries. To a certain degree these need to be translated from ‘ambiguous to specific’ or from ‘uncommon to common’ in order to be processed and analysed (Sullivan 2016). Hummingbird has a more sophisticated understanding of search queries, even long-tail searches, by applying synonyms and semantics. As mentioned in Appendix D, my small dataset (Re:search – Terms of Art) reflects divergent results in ranking when searching with two browsers, Google and Tor (Disconnect Search), with Google’s results based on ads and locative data, as shown above. Additionally, my keywords were neither trending nor popular: most are eclectic terms and ‘long-tail’ queries. I postulate that when I entered two words or phrases that are not usually phrased together and the search engine separated the words (artistic + research or new + aesthetic, for example), it was less likely that there would be a match in ranking between the two search browsers, resulting in greater variability between the two sets of results and therefore delivering more unique results.110 (Figure 45)

110 I propose this based on my experimentation with other search engines, such as YaCy, which works with two stages of ranking.
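The comparison underlying my notion of ‘unique results’ can be stated quite simply: given the same keyword entered in two search set-ups, which links appear in only one of the two result lists, and how large is the overlap? A minimal sketch (the URLs are invented for illustration and are not data from the actual study):

```python
def unique_results(serp_a, serp_b):
    """Return the links that appear in only one of two result lists:
    the 'unique results' of the comparison."""
    a, b = set(serp_a), set(serp_b)
    return a - b, b - a

def overlap_ratio(serp_a, serp_b):
    """Jaccard overlap between two result lists (1.0 = identical sets)."""
    a, b = set(serp_a), set(serp_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Invented, illustrative URLs standing in for Google and Tor
# (Disconnect Search) results for the same keyword.
google = ["a.org", "b.org", "c.org", "d.org"]
tor    = ["a.org", "e.org", "c.org", "f.org"]

only_google, only_tor = unique_results(google, tor)
print(sorted(only_google))                    # ['b.org', 'd.org']
print(round(overlap_ratio(google, tor), 2))   # 0.33
```

On this view, a low overlap ratio between the two browsers for the same eclectic, long-tail keyword is exactly the ‘greater variability’ postulated above.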

If Hummingbird is the new Google engine from 2013, the latest replacement part is ‘RankBrain’, a ‘machine-learning artificial intelligence system’. Launched around early 2015, RankBrain ostensibly ‘interprets’ what people are searching for, even though they may not have entered the exact keywords. Around October 2015, RankBrain was handling ‘a very large fraction’ of the billions of daily searches that it had never seen before (Metz 2016). Thus, acknowledging that my ‘experiment in living’ was conducted from October 2014 to January 2016, RankBrain might have been used to answer some of my search queries. I entered new queries into Google Search and at that time RankBrain was answering around 15% of new searches. I propose that the reason I obtained ‘unique results’, where certain results only appear with Google, might have been due not only to ads and personalisation but also to the fact that Google was already applying machine-learning algorithms when I was carrying out my study.

Figure 45: Keyword: artistic research. Unique results (white links)
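RankBrain is reported to work by representing queries as numerical vectors, so that a never-before-seen query can be matched to known ones by similarity of meaning rather than exact keywords (Metz 2016). A toy sketch of that idea, with invented three-dimensional word vectors (RankBrain’s actual representations are proprietary and far higher-dimensional):

```python
import math

# Invented toy vectors; purely illustrative.
VECTORS = {
    "artistic":  [0.9, 0.1, 0.0],
    "research":  [0.8, 0.2, 0.1],
    "aesthetic": [0.7, 0.3, 0.0],
    "loans":     [0.0, 0.1, 0.9],
}

def embed(query):
    """Represent a query as the average vector of its known words."""
    vecs = [VECTORS[w] for w in query.split() if w in VECTORS]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)] if vecs else [0.0] * 3

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# An unseen query is matched to the known query whose vector lies
# closest, even though the keywords do not match exactly.
known = ["artistic research", "payday loans"]
new_query = "new aesthetic"
best = max(known, key=lambda q: cosine(embed(q), embed(new_query)))
print(best)  # 'artistic research'
```

The point of the sketch is only the principle: ‘new aesthetic’ shares no keyword with ‘artistic research’, yet lands nearer to it than to ‘payday loans’ in vector space.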

As of June 2016, RankBrain is being implemented for every Google Search request because it is a ‘query interpretation algorithm’ optimised for ‘meaning/parsing’, which enables it to understand the meaning and intent in a specific context and to determine ‘correct retrieval of information from the index’ (Fiorelli 2016).

‘That is the real reason why Semantics, in the sense of structured data, good architecture and topical research, hubs and closeness are so important IMHO, as well as being directly or potentially relevant for the personal search history of the searchers’ (ibid).

Furthermore, the SEO industry speculates that RankBrain is summarising the page’s content. The murmur is that the algorithm is adapting, or ‘learning’, from people’s mistakes and its surroundings. It does this by applying ‘deep neural networks’ that are modelled after the human brain. By combining hardware and software in an attempt to copy the human web of neurons, development is carried out through trial and error: analysing the results, adjusting the math and then repeating the steps with new data.
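The trial-and-error cycle described here can be caricatured in a few lines: perturb a parameter, keep the change if the error on the data shrinks, then repeat with fresh data. This is my own deliberately tiny sketch of the principle; real deep networks use gradient descent over millions of parameters, not random search over one:

```python
import random

def error(w, data):
    """Squared error of the single-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data)

def train(batches, w=0.0, step=0.5, seed=0):
    """Trial-and-error learning: try a perturbed weight, analyse the
    result, keep it only if the error shrinks, repeat with new data."""
    rng = random.Random(seed)
    for data in batches:                     # new data each cycle
        for _ in range(300):                 # trial ...
            candidate = w + rng.uniform(-step, step)
            if error(candidate, data) < error(w, data):  # ... and error
                w = candidate                # adjust; keep the improvement
    return w

# Data generated from y = 3x; the learner should approach w = 3.
batches = [[(x, 3 * x) for x in range(1, 6)] for _ in range(5)]
print(round(train(batches), 2))
```

The ‘learning’ is nothing more than accumulated accepted adjustments; nowhere is a rule about the relation between x and y written down by a programmer, which is precisely what makes such models hard to inspect.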

Previously computers were not fast enough, or the data sets were too small, to carry out this type of testing. Now there is enough computational power at Google’s data centres to handle much more data, which enables the pace of the research to quicken. RankBrain is continuously fed vast amounts of data to train the deep-learning neural networks, splitting computing tasks across machines.111 In this way the algorithms are ‘trained’ and they ‘learn’, but it is ‘difficult to directly tweak a machine learning-based system to boost the importance of certain signals over others’ (Lau 2017). In the past, humans (programmers) wrote the code and then tweaked the results; now, with RankBrain, the models are machine-readable and therefore less human-readable. At the moment, no one is quite sure why neural nets behave the way they do.

Neural networks are really just math—linear algebra—and engineers can certainly trace how the numbers behave inside these multi-layered creations. The trouble is that it’s hard to understand why a neural net classifies a photo or spoken word or snippet of natural language in a certain way (Metz 2016).

These machine-readable models are less human-readable and it is extremely difficult to determine why priority is given to certain results (higher ranking) over other ‘unique results’.

Nowadays, with its ‘learning process’ of deep neural networks replacing written rules and code, ‘one of the benefits of Google is the ability to scale’ (Giannandrea 2017) and measure user interaction.

Figure 46: Search Engine Land’s mockup by Larry Kim (2017)

111 This progress in technology facilitates a constellation, or coming together, of different capabilities from various sources, through models and parameters. According to Google, the algorithm first learns offline, being fed historical batched searches (or photos or spoken commands) from which it makes predictions. Eventually the subject, or learner, in this case the algorithm, is able to make predictions through the constant repetition of this cycle. If the predictions are correct, the latest versions of RankBrain go live (Sullivan 2016).

As of 2017 what seems to matter with the ranking is engagement, with high ranking now based on user interaction, or ‘traffic’: clicking on ads and creating ‘network surplus value’, as elucidated in Chapter 3. Users have been constantly clicking on the links, but now RankBrain is placing greater importance on these user signals, as shown in the SEO mock-up diagram below (Figure 46). RankBrain now ostensibly deranks sites that may have good content if the user doesn’t click on the results (where before the signals measured keywords relative to content). Moreover, RankBrain is combined with the amount of time a user spends on the page, or ‘dwell time’, and only Google can measure this. Once again, clicking is the measurement that determines the value of the web pages returned, constantly reflecting the cycles of user engagement. Traffic, another important factor, diminishes over time if there is no user interaction, and

[m]achine learning then becomes a “layer” on top of this. It becomes the final arbiter of rank––quality control, if you will (Kim 2017).
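Such an engagement ‘layer’ can be sketched as a re-ranking pass over results that have already been retrieved for their content. The scoring formula, weights and field names below are invented for illustration; Google’s actual signals and their weighting are undisclosed:

```python
def engagement_score(ctr, dwell_seconds, max_dwell=180.0):
    """Blend click-through rate with normalised dwell time.
    The 50/50 weighting is an arbitrary illustrative choice."""
    return 0.5 * ctr + 0.5 * min(dwell_seconds / max_dwell, 1.0)

def rerank(results):
    """Re-order retrieved results by engagement: a page with good
    content but no clicks (ctr=0, dwell=0) sinks to the bottom."""
    return sorted(results,
                  key=lambda r: engagement_score(r["ctr"], r["dwell"]),
                  reverse=True)

# Invented example results, already matched on content.
results = [
    {"url": "good-content-no-clicks.org", "ctr": 0.00, "dwell": 0},
    {"url": "popular.org",                "ctr": 0.30, "dwell": 120},
    {"url": "clickbait.org",              "ctr": 0.40, "dwell": 10},
]
print([r["url"] for r in rerank(results)])
# ['popular.org', 'clickbait.org', 'good-content-no-clicks.org']
```

Note how dwell time tempers raw click-through: the clickbait page gets more clicks but holds users for only seconds, so the page users actually stay on ranks first, while the unclicked page is deranked regardless of its content.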

In 2016 Google admitted that ‘ranking systems are made up of not one, but a whole series of algorithms’. Its proprietary algorithm is constantly tweaked: in 2017 there were more than 2,400 changes and in 2018 more than 3,200 (Grind et al. 2019), and there are now reportedly ‘more than the 200 signals that Google uses to rank results’ (Weltevrede 2016:117; Sullivan 2010). Over the past twenty years PageRank cum RankBrain has been mythologised, fetishised and commodified because of the undisclosed ‘signals’ that determine ranking, yet its code still remains a corporate secret (Pasquale 2015).