
5.5 Results and Evaluation

5.5.1 Replication Likely Webpages

It is highly likely that the websites documented in this section are replicas of each other (or share a common root), or that their content is heavily borrowed from one another. We first describe the process used to obtain these results and then explain what the results show.

The website considered in this section is britishdragonshop.com.

An extract of HTML_VOCAB for britishdragonshop.com is produced with the tool, as shown in Figure 5.2. It is then imported into Gephi as the edge list, and the nodes CSV supplies the names of the websites in the database.
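The import step can be sketched as follows. This is a minimal illustration, not the tool's actual export code: the function name `write_gephi_csvs`, the input dictionaries and the 0.37 distance value are assumptions made for the example. The column headers `Id`, `Label`, `Source`, `Target`, `Type` and `Weight` are the ones Gephi's spreadsheet importer recognizes for node and edge tables.

```python
import csv
import io


def write_gephi_csvs(names, distances, edges_file, nodes_file):
    """Write a nodes table and an edges table in the CSV layout Gephi imports.

    names:     dict mapping a node id to a website name (hypothetical input)
    distances: dict mapping (id_a, id_b) to a Jaccard distance (hypothetical input)
    """
    node_writer = csv.writer(nodes_file, lineterminator="\n")
    node_writer.writerow(["Id", "Label"])
    for node_id, name in sorted(names.items()):
        node_writer.writerow([node_id, name])

    edge_writer = csv.writer(edges_file, lineterminator="\n")
    edge_writer.writerow(["Source", "Target", "Type", "Weight"])
    for (id_a, id_b), distance in sorted(distances.items()):
        # The distance is stored as the edge weight, so a LOW weight means similar pages.
        edge_writer.writerow([id_a, id_b, "Undirected", distance])


# Illustrative data only; the distance value is made up for the example.
edges_out, nodes_out = io.StringIO(), io.StringIO()
write_gephi_csvs(
    {1: "britishdragonshop.com", 2: "scheringshop.com"},
    {(1, 2): 0.37},
    edges_out,
    nodes_out,
)
print(nodes_out.getvalue())
print(edges_out.getvalue())
```

Passing real file handles opened with `open(path, "w", newline="")` instead of `io.StringIO()` would produce files that can be loaded through Gephi's spreadsheet import.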


Figure 5.2: Screenshot of the extraction of builtwith information for BritishDragonShop.com

Through observation and experimentation, we found that applying the following steps in sequence in Gephi produces graphs that display information directly relevant to our study.

All edges except those in the lowest 10% of edge weights are filtered out of the graph, which removes the connections that correspond to high distance values. Edge thickness is minimized to reduce clutter. Labels are added to all nodes to show the website each node corresponds to, and their font size is adjusted for readability.
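The effect of this filter can be approximated outside Gephi. The sketch below assumes that "the minimum 10%" means the 10% of edges with the smallest weights; if the filter was instead applied as a range over the weight values, the cut-off would differ. The function name `keep_lowest_fraction` and the tuple layout are hypothetical.

```python
def keep_lowest_fraction(edges, fraction=0.10):
    """Return the edges whose weight is among the lowest `fraction` of all edges.

    edges: list of (source, target, weight) tuples, where weight is a distance.
    At least one edge is kept whenever the input is non-empty.
    """
    if not edges:
        return []
    ranked = sorted(edges, key=lambda edge: edge[2])
    keep_count = max(1, int(len(ranked) * fraction))
    return ranked[:keep_count]


# Illustrative data: 20 edges with weights 0.00, 0.05, ..., 0.95.
sample = [("source", f"site{i}", i / 20) for i in range(20)]
for source, target, weight in keep_lowest_fraction(sample):
    print(source, target, weight)
```

Because the weights are distances, keeping the lowest fraction keeps the most similar pairs, and nodes that lose all their edges end up in the unconnected group.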

Some of the standard layouts offered by Gephi help us transform the rat's nest graph that we have at this point into a more meaningful and understandable form.

The "Fruchterman Reingold" layout is applied to distribute the nodes evenly, followed by the "Force Atlas" layout to separate the nodes into a connected group and an unconnected group. We follow this with a quick run of the "Label Adjust" layout so that the labels are clearly visible.

A color gradient based on weighted degree is applied to the graph to highlight the source node and its connected nodes. At this point, the graph looks as shown in Figure 5.3.


Figure 5.3: Screenshot of Gephi showing classification of webpages similar to BritishDragonShop.com

If we now hover over the source node, which in our case is britishdragonshop.com, we see only those nodes that are still connected to it through one of the unfiltered edges. This means that their vocabulary is likely to be similar, with a Jaccard distance on the order of 0.35-0.40 in this particular case. This can be seen in Figure 5.4.
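For reference, the Jaccard distance between two vocabularies is one minus the size of their intersection divided by the size of their union. The sketch below is a generic implementation of that definition, not the tool's own code, and the token sets are invented for illustration.

```python
def jaccard_distance(vocab_a, vocab_b):
    """Jaccard distance between two collections of tokens.

    0.0 means identical vocabularies; 1.0 means no token in common.
    Two empty vocabularies are treated as identical.
    """
    set_a, set_b = set(vocab_a), set(vocab_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Invented example: two pages share three tokens and differ in one token each.
page_a = {"home", "cart", "checkout", "category-a"}
page_b = {"home", "cart", "checkout", "category-b"}
print(jaccard_distance(page_a, page_b))  # 3 shared tokens out of 5 in the union
```

In this invented case the distance is about 0.4, which shows how a small number of differing tokens can already produce a value in the observed range when the shared vocabulary is small.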

It can be seen from these images that the websites likely to be similar to britishdragonshop.com are the following.

1. britishdispensaryshop.com
2. scheringshop.com
3. casablancapharmashop.com
4. asiapharmashop.com

To verify this claim manually, these websites were opened in a Chrome browser and their content was compared. There was a remarkable level of similarity between these websites in their principal content, as can be seen in the screenshots in Figures 5.8, 5.6, 5.9 and 5.7 below.

Figure 5.4: Screenshot of Gephi showing classification of webpages similar to BritishDragonShop.com - Source highlighted

Figure 5.5: Screenshot of BritishDragonShop.com homepage


Figure 5.6: Screenshot of CasablancaPharmaShop.com homepage

Figure 5.7: Screenshot of ScheringShop.com homepage


Figure 5.8: Screenshot of BritishDispensaryShop.com homepage

Figure 5.9: Screenshot of AsiaPharmaShop.com homepage

It is very clear that these websites differ only cosmetically and that their primary content and content structure bear a striking resemblance. When the underlying numbers were examined to identify the source of even the small deviation, it was discovered that the only reason the Jaccard distance was not exactly zero was the dynamic content on the homepage of each website. This consisted of the top-selling product categories, which varied from one website to another, and the actual products listed under them. Some of these appear random, while others seem to be based on sales trends.

It is therefore these elements that account for the small deviations in Jaccard distance between the vocabularies of the webpages. To verify these claims, we also look at the technologies used to create these webpages and at how similar the webpages are to each other when the builtwith data is used as the parameter for computing the Jaccard distance.

When visualized with Gephi, the builtwith similarity for britishdragonshop.com looks as shown in Figure 5.10. This graph clearly contains more websites, which are worth looking into separately, but it includes all the webpages identified using the vocabulary parameter. This is a strong indication that the vocabulary revealed sister websites in this particular case.
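The containment described here can also be checked programmatically rather than by eye. The sketch below assumes the filtered edge lists for both parameters are available as (source, target, weight) tuples; the helper name `neighbours`, the weights and the `extra-*.example` entries are placeholders, not results from the study.

```python
def neighbours(edges, source):
    """Return the set of nodes that share an edge with `source`.

    edges: list of (node_a, node_b, weight) tuples from an undirected graph.
    """
    found = set()
    for node_a, node_b, _weight in edges:
        if node_a == source:
            found.add(node_b)
        elif node_b == source:
            found.add(node_a)
    return found


source = "britishdragonshop.com"

# Placeholder edge lists; the weights are illustrative, not measured values.
vocab_edges = [
    (source, "britishdispensaryshop.com", 0.36),
    (source, "scheringshop.com", 0.37),
    ("casablancapharmashop.com", source, 0.38),
    (source, "asiapharmashop.com", 0.39),
]
builtwith_edges = vocab_edges + [
    (source, "extra-a.example", 0.10),
    (source, "extra-b.example", 0.12),
]

vocab_set = neighbours(vocab_edges, source)
builtwith_set = neighbours(builtwith_edges, source)

# True when every vocabulary neighbour also appears among the builtwith neighbours.
print(vocab_set <= builtwith_set)
# Sites found only through builtwith, which would need separate inspection.
print(sorted(builtwith_set - vocab_set))
```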

Figure 5.10: Screenshot of Gephi showing pages similar to britishdragonshop.com with builtwith as a parameter.

It can thus be said that the vocabulary is indeed a strong metric for identifying sister websites that could potentially be run by the same group of people. It may also be possible to identify the most popular and lucrative platforms by observing source material from one platform being reused on multiple other unrelated websites that are built with, say, different technologies.

The second hypothesis is difficult to evaluate because actual ground truth is lacking in most of these cases, as they happen to be the subject of actual criminal investigations.