Data Transformation, Analysis & Visualization

3. RESEARCH METHODOLOGY

3.7 Data Transformation, Analysis & Visualization

Table 5: Page table

Page Table

From the table “page”, we included two columns, as listed below.

Column | Primary or secondary key | Description
id | Primary key for page | Unique identifier for the page
start_url | N/A | Indicates the URL of the page

The column “id” is the unique identifier for the page, and the columns labelled “page_id” in the other tables are based on this column. The column “start_url” indicates the URL of the web page that is loaded in the browser before any possible redirection, such as “http://www.um.dk”.

One such example is the domain “esbjerg.dk”. The domain “esbjerg.dk” received requests from 27 different elements, all of which were requested by one page, namely “http://www.esbjerg.dk”. We therefore excluded elements from the domain “esbjerg.dk”, as it is not a third-party domain.

When processing the WebXray results for cookies, we included all results, regardless of whether the cookies came from third-party domains or not. We made this decision because cookies, compared to elements, are able to collect more data about the user, such as how the user behaves in the browsing session (Libert, 2015). It is our understanding that this decision gives a better picture of the usage of cookies on Danish public web pages and ultimately a better understanding of TPT on Danish public web pages. We did not accept any cookies prior to or during the WebXray analysis, which means that all the cookies found by WebXray were set without consent.

Out of the WebXray output (see section 3.6 Sample Selection for a description of the output), only certain tables and columns were relevant to the analysis. Tables containing relevant columns were therefore combined in the SQL query to allow for further processing. This was possible because some tables share values, e.g. one table contains a foreign key column that is based on a primary key column in another table. A primary key is a column that works as a unique identifier (ID) for the rows of a specific table and originates from that table. When the values of a primary key are included in any table other than the one they originate from, that column is called a foreign key (labelled “secondary key” in the table descriptions in this section). In other words, the primary key is the original column and the foreign key is the copy in another table.

For example, the table “element” could be joined with the table “domain”, because the values in the column “domain_id” (foreign key) in “element” are identical to values in the column “id” (primary key) in “domain”. Thus, the SQLite database will know that rows in “element” with “domain_id” = X should be matched with the row in “domain” with “id” = X. This meant that query results with more characteristics could be achieved, such as retrieving element URLs together with the domain and the nationality of the domain, since the “country” column, which denotes nationality, is only found in the “domain” table. After having used SQL to query the specified data, we used Excel combined with the SQL output to model the data and get a better understanding of which cookies and elements were present on which web pages.
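The join described above can be sketched as follows. This is a minimal illustration and not the query used in the thesis: it builds a small in-memory SQLite database as a stand-in for the WebXray output. The columns “id”, “start_url”, “page_id”, “domain_id” and “country” are taken from the text, while “full_url”, “name” and all inserted values are hypothetical.

```python
import sqlite3

# In-memory stand-in for the WebXray SQLite output.
# "full_url" and "name" are hypothetical column names; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY, start_url TEXT);
    CREATE TABLE domain (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE element (
        id INTEGER PRIMARY KEY,
        page_id INTEGER REFERENCES page(id),      -- foreign key to page.id
        domain_id INTEGER REFERENCES domain(id),  -- foreign key to domain.id
        full_url TEXT
    );
    INSERT INTO page VALUES (1, 'http://www.example.dk');
    INSERT INTO domain VALUES (10, 'tracker.example', 'US');
    INSERT INTO element VALUES (100, 1, 10, 'https://tracker.example/t.js');
""")

# Match each element with its page and its domain through the key columns,
# so that the "country" column of "domain" becomes available per element.
JOIN_QUERY = """
    SELECT page.start_url, element.full_url, domain.name, domain.country
    FROM element
    JOIN page ON element.page_id = page.id
    JOIN domain ON element.domain_id = domain.id
    ORDER BY page.start_url, domain.name
"""

rows = conn.execute(JOIN_QUERY).fetchall()
for row in rows:
    print(row)
```

Each result row combines a page URL, an element URL, the domain that served the element and that domain's country, which is the kind of combined output that could then be exported for further modelling in a spreadsheet.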

This was our initial relation analysis, which allowed us to identify the extent of cookies and elements as well as the relationships between the different trackers and the different web pages. The analysis of these relationships through SQL and Excel resulted in our initial ecosystem analysis.

The relationships between the trackers and pages were, however, difficult to identify and analyze manually in Excel. We therefore used Gephi as a visual analytics tool to visualize our ecosystem.

Gephi is a visualization and exploration software for graphs and networks (Gephi, 2020). By using Gephi, we were able to reveal, through exploratory and link analysis, the underlying structures of associations between the trackers and web pages in our ecosystem, i.e. to identify how the different trackers and pages were interconnected. Gephi relies on input based on nodes, which are the actors, i.e. the trackers and web pages, and edges, which are the relationships between the nodes, i.e. where the trackers are present on specific pages.

Here we used the modelling from our Excel sheet and transformed the data once more by giving each tracker and each page a unique ID. We were thereby able to identify on which unique pages the different trackers were present. For example, ID 293 was present on IDs 5, 98 and 179, i.e. three different web pages. For Gephi to be able to understand the inputs and the relations between them, the trackers and web pages were given IDs within the same list, meaning that the web pages were given IDs from 1-277 and third-party domains that set cookies were given IDs from 278-329. For the element tracking ecosystem, web pages were given IDs ranging from 1-277 (the same as in the cookie ecosystem) and the tracking elements were given IDs from 278-393. This allowed us to visualize trackers and their relation to web pages within the ecosystem, which will be presented in our findings and analysis section. We produced two different ecosystems, one for cookies and one for elements.
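The ID assignment described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the procedure used in the thesis (which was carried out in Excel): the function name, the example URLs and domains, and the extra “Kind” column are hypothetical. The column headers “Id”, “Label”, “Source” and “Target” are the ones Gephi recognises when importing node and edge tables from CSV.

```python
import csv
import io

def build_gephi_tables(presence):
    """Turn (page_url, tracker_domain) pairs into node and edge tables.

    Pages are numbered 1..N and trackers continue at N+1, so both kinds
    of node share a single ID list, as described in the text.
    """
    presence = list(presence)
    pages = sorted({page for page, _ in presence})
    trackers = sorted({tracker for _, tracker in presence})
    page_ids = {page: i for i, page in enumerate(pages, start=1)}
    tracker_ids = {t: i for i, t in enumerate(trackers, start=len(pages) + 1)}

    nodes = [(i, label, "page") for label, i in page_ids.items()]
    nodes += [(i, label, "tracker") for label, i in tracker_ids.items()]
    # One edge per (tracker, page) pair, pointing from the tracker to the page.
    edges = sorted({(tracker_ids[t], page_ids[p]) for p, t in presence})
    return nodes, edges

def to_csv(header, rows):
    """Serialise rows to CSV text with the given header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical example data: two pages and two trackers.
presence = [
    ("http://www.a.dk", "tracker-x.example"),
    ("http://www.b.dk", "tracker-x.example"),
    ("http://www.b.dk", "tracker-y.example"),
]
nodes, edges = build_gephi_tables(presence)
nodes_csv = to_csv(["Id", "Label", "Kind"], nodes)
edges_csv = to_csv(["Source", "Target"], edges)
print(nodes_csv)
print(edges_csv)
```

Pointing the edges from tracker to page is a design choice of this sketch; it makes the out-degree of a tracker equal to the number of pages on which it is present.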

Simply uploading the nodes and edges to Gephi did not, however, show anything relevant, as all of the nodes and edges appeared as one dense cluster of black dots. Within Gephi we therefore transformed the visuals of our ecosystem by setting different parameters. The first step was to change the size of the nodes in relation to their number of outgoing edges, meaning that trackers that were present on many pages were proportionally bigger than trackers present on fewer pages. After changing the size, we applied colours according to the modularity of the trackers, i.e. according to the clustering of the trackers. Trackers with more than one relation to each other, i.e. tracking the same web pages, were therefore part of the same cluster and had the same colour.
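The size ranking can be illustrated outside Gephi as well. The sketch below is only an approximation of the idea and not Gephi's own implementation: it counts the outgoing edges of each tracker and scales the node size linearly between two bounds. The function name, the size bounds and tracker ID 300 are hypothetical; the example of ID 293 being present on pages 5, 98 and 179 is taken from the text. Colouring by modularity is left to Gephi and is not reproduced here.

```python
from collections import Counter

def node_sizes(edges, min_size=10.0, max_size=50.0):
    """Map each source node to a size that grows linearly with its out-degree."""
    out_degree = Counter(source for source, _ in edges)
    if not out_degree:
        return {}
    lowest = min(out_degree.values())
    highest = max(out_degree.values())
    span = highest - lowest
    sizes = {}
    for node, degree in out_degree.items():
        # If every node has the same degree, all nodes get the smallest size.
        fraction = (degree - lowest) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes

# Edges are (tracker_id, page_id) pairs.
edges = [(293, 5), (293, 98), (293, 179), (300, 5)]
sizes = node_sizes(edges)
print(sizes)
```

In this example, tracker 293 has three outgoing edges and receives the largest size, while the hypothetical tracker 300 has one and receives the smallest.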

We did, however, find that some of the trackers were present on a great number of web pages, such as third-party domains from Google Analytics and Siteimprove. We therefore split the cookie ecosystem into two separate ecosystems and did the same for the element ecosystem: one ecosystem containing the trackers with a presence on 10+ web pages and one where trackers with a presence on 10+ web pages were removed, meaning that we ended up with four ecosystems in total. For both cookies and elements, we included the ecosystem with the top trackers to show the complexity of the ecosystems, and the ecosystem without the top trackers in order to be able to identify the specific and underlying clusters within the ecosystems.
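The split can be sketched as a simple filter on the edge list. This is a minimal illustration under one reading of the text, namely that the first ecosystem keeps only the trackers present on 10+ web pages and the second keeps only the remaining trackers. The function name and the example IDs are hypothetical; only the threshold of 10 comes from the text.

```python
from collections import defaultdict

TOP_TRACKER_THRESHOLD = 10  # "10+ web pages" in the text

def split_ecosystem(edges, threshold=TOP_TRACKER_THRESHOLD):
    """Split (tracker_id, page_id) edges into top-tracker and remaining edges."""
    pages_per_tracker = defaultdict(set)
    for tracker, page in edges:
        pages_per_tracker[tracker].add(page)
    # A tracker counts as a top tracker if it appears on `threshold` or more pages.
    top = {t for t, pages in pages_per_tracker.items() if len(pages) >= threshold}
    with_top = [edge for edge in edges if edge[0] in top]
    without_top = [edge for edge in edges if edge[0] not in top]
    return with_top, without_top

# Hypothetical data: tracker 278 is on 12 pages, tracker 279 on 3 pages.
edges = [(278, page) for page in range(1, 13)] + [(279, page) for page in range(1, 4)]
with_top, without_top = split_ecosystem(edges)
print(len(with_top), len(without_top))
```

Applying the same function to the cookie edge list and to the element edge list would yield the four edge lists corresponding to the four ecosystems.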

The visualizations within our thesis serve different purposes. Some of our visualizations serve the sole purpose of making it easier for the reader to get an overview, such as the visualization used in the research question section. This visualization could have been left out, and the result would have been the same. According to Kennedy & Engebretsen (2020:19), “visual representations of statistics and other, often quantitative data can convey complex facts and patterns quickly and effectively”.

Other visualizations within our thesis serve a more specific purpose relating to the above quote, such as the information cycle in section 2.1.2. This illustration would be very hard to describe to a reader in text alone, as it contains many different factors. We therefore chose to illustrate the process and then explain it below the illustration, thereby leading the reader through the cycle. Another example is our visualizations of the ecosystems in our findings and analysis section. As previously described, the data behind the ecosystems conveys little to a reader without existing knowledge. We therefore visualized the ecosystems to show the complexity of and relations between different trackers and web pages, which would not have been possible without an illustration. Our visuals therefore serve the purpose of supporting sense-making and learning throughout the thesis and are tools for understanding (Kennedy & Engebretsen, 2020).

In document Surveillance in the Digitalized Public Sector (pages 56-59)
