
Magazine recommendations based on social media trends

Steffen Karlsson

Kongens Lyngby 2014 B.Eng-2014


Technical University of Denmark

Department of Applied Mathematics and Computer Science
Matematiktorvet, building 303B
2800 Kongens Lyngby, Denmark
Phone +45 4525 3351
compute@compute.dtu.dk
www.compute.dtu.dk
B.Eng-2014


Summary (English)

Issuu uses a recommendation engine for predicting what a certain reader will enjoy. It is based on collaborative filtering, such as the reading history of other similar users, and content-based filtering, reflected in the document’s topics etc. So far all of those parameters are completely isolated from any external (non-Issuu) sources, causing the Matthew Effect. This project, done in collaboration with Issuu, is the first attempt to solve the problem by investigating how to extract trends from social media and incorporate them to improve Issuu’s magazine recommendations.

Popular social media networks have been investigated and evaluated, resulting in Twitter being chosen as the data source. A framework for spotting trends in the data has been implemented. To map trends to Issuu, two approaches have been used - the Latent Dirichlet Allocation model and the Apache Solr search engine.


Summary (Danish)

Issuu benytter sig af et anbefalingssystem til at forudsige, hvad der vil glæde en given læser. Det er baseret på collaborative filtering såsom læsehistorik fra lignende brugere. Derudover er det baseret på indholdsbaseret filtrering, der afspejles som dokumentets tema mv. Hidtil er alle disse parametre fuldstændig isoleret fra eksterne (ikke-Issuu) kilder. Dette projekt er udført i samarbejde med Issuu og er det første forsøg på at løse problemet. Dette er gjort ved at undersøge, hvorledes man kan udtrække tendenser fra sociale medier og integrere dem for at forbedre Issuu’s magasinanbefalinger.

Populære sociale medier er blevet undersøgt og evalueret, hvilket resulterede i, at Twitter blev valgt som datakilde. Et system til at spotte tendenser i dataene er blevet implementeret. Der er benyttet to forskellige metoder til at integrere tendenserne på Issuu - Latent Dirichlet Allocation-modellen og Apache Solr-søgemaskinen.


Preface

This thesis was prepared at the Department of Applied Mathematics and Computer Science at the Technical University of Denmark (DTU) in fulfillment of the requirements for acquiring a B.Eng. in IT. The work was carried out in the period September 2013 to January 2014.

I would like to thank my supervisor Ole Winther from DTU, my external supervisor Andrius Butkus and Issuu for spending time and resources on having me around.

Lyngby, 10-January-2014

Steffen Karlsson


Contents

Summary (English) i

Summary (Danish) iii

Preface v

1 Introduction 1

1.1 Problem definition . . . 2

1.2 Social media . . . 2

1.3 What is a trend? . . . 5

1.4 Related work . . . 5

1.5 Methodology . . . 7

1.6 Expected results . . . 8

1.7 Outline . . . 8

2 Mining Twitter 9

2.1 Twitter API . . . 10

2.2 Tweet’s location problem . . . 11

3 Trending framework 15

3.1 Raw data . . . 16

3.2 Normalizing data . . . 18

3.3 Detecting trends . . . 19

3.4 Recurring trends . . . 20

3.5 Trend score . . . 21


3.6 Aggregating trends . . . 22

4 From trends to magazines 25

4.1 LDA . . . 25

4.2 Using LDA . . . 28

4.2.1 Results using LDA . . . 30

4.3 Solr . . . 31

4.4 Using Solr . . . 32

4.4.1 Results using Solr . . . 33

5 Conclusion 35

5.1 Improvements of the trending framework . . . 36

5.2 Improvements of the LDA model . . . 38

5.3 LDA vs. Solr . . . 39

A Dataset statistics 41

A.1 Location . . . 41

A.2 Hashtag . . . 42

B Example: #bostonstrong 43

C Implementation details 47

C.1 Flask . . . 47

C.2 Peewee . . . 48

C.3 Database . . . 49

C.4 MySQL . . . 49

Bibliography 51


List of Figures

1.1 Typical patterns for slow and fast trends. . . 6

1.2 Project flowchart . . . 7

2.1 Mining Twitter flowchart . . . 9

2.2 Visualization of the problem with the location . . . 12

2.3 Visualization of the solution to the location problem . . . 12

3.1 Total tweets per hour . . . 16

3.2 Raw tweet count for hashtags . . . 17

3.3 Weighted tweet count per hour . . . 18

3.4 Normalized hashtags . . . 18

3.5 Example sizes of w and r . . . 19

3.6 w - r, where r = 2 hours . . . 20

3.7 w - r, where r = 24 hours . . . 21

3.8 Displaying use of the threshold in the trending framework . . . 22

3.9 E/R Diagram v2 . . . 23

4.1 Plate notation of the LDA model [Ble09] . . . 26

4.2 LDA topic simplex, with three topics . . . 27

4.3 Representation of topic distribution using dummy data . . 27

4.4 #apple tag cloud . . . 28

4.5 From trend to magazines flowchart . . . 29

4.6 Topic distribution for #apple tweets . . . 29

4.7 Subset of the similar #apple documents using LDA . . . 30

4.8 Example of tokenizing and stemming . . . 31

4.9 Subset of the similar #apple documents using Solr . . . 33

5.1 Supported languages by Issuu . . . 36

5.2 Translation module to improve the solution. . . 37

5.3 Three simultaneously running trending frameworks. . . 37

5.4 Top words in the topics. . . 38

5.5 LDA per page solution. . . 39

A.1 Top and bottom 10 of used locations . . . 41


A.2 Top and bottom 10 of used hashtags . . . 42

B.1 Total tweets per hour . . . 43

B.2 Raw tweet count for hashtags . . . 44

B.3 Fully processed data . . . 44

B.4 #bostonstrong tag cloud . . . 45

B.5 Subset of #bostonstrong LDA documents . . . 46

B.6 Subset of #bostonstrong Solr documents . . . 46

C.1 E/R Diagram . . . 49


Chapter 1

Introduction

Issuu1 is a leading online publishing platform with more than 15 million publications - a pool that keeps growing by more than 20 thousand new ones each day. The main challenge for the reader then becomes the navigation and discovery of interesting content among the vast number of documents. To solve this problem Issuu uses a recommendation engine for predicting what a certain reader might enjoy.

Currently a whole range of parameters are part of Issuu’s recommendation algorithm: the reader’s location and language preferences (context), the reading history of other similar users (collaborative filtering [RIS+94], [SM95]), the document’s topics (content-based filtering [Sal89]) and the document’s overall popularity. There are also editorial and promoted documents. So far all of those parameters are completely isolated from any external (non-Issuu) sources.

1www.issuu.com


The main problem is that the same magazines constantly get recommended again and again. This highlights the shortcomings of collaborative filtering rather than the reading habits of Issuu users. Issuu does not allow readers to rate magazines, so read time is used instead. Naturally, popular magazines gather their read time very quickly and are then hard to beat by the newly uploaded ones. They get recommended more and thereby only become stronger - a phenomenon known as the Matthew Effect [Jac88].

Incorporating local trends (what is happening around the reader) into the recommendations would address this problem and add a bit more freshness and serendipity.

1.1 Problem definition

How to extract trends from social media and incorporate them to improve Issuu’s magazine recommendations.

1.2 Social media

In this project, social media is the data source from which trends can be extracted. There are many social media platforms that could be used as the data source for this project. Their suitability was evaluated based on these parameters:

Data - Defines the format of the data, the amount of data that is available and how semantically rich it is. This is the most important parameter, since it is all about the quality of the data, which will directly impact the ability to extract trends. Text is the preferred data format here. The more data, the better, since more data adds stability to the resulting trends. Semantic richness is about how much meaning can be extracted from the data.


We should not expect any highly organized and semantically rich taxonomies, since Twitter is a crowd-driven social medium instead of an editorially curated and organized one. In social networks we normally see data being organized as folksonomies2, where "multiple users tag particular content with a variety of terms from a variety of vocabularies, thus creating a greater amount of metadata for that content" [Wal05]. Semantic richness in folksonomies comes from multiple users tagging the data with the same labels, which shows that they agree on what it is about. A folksonomy can be narrow or broad. In narrow ones only the creator of the content is allowed to label it with tags, while in broad ones multiple users can label a piece of content. Broad folksonomies are more stable and informative, given that there are enough users to label things, and are the preferred kind in this project.

Real-time - Defines the time from something important happening in the world until it appears on the particular social media network. An API supporting real-time streaming of data is naturally preferable, but a small delay is also acceptable.

Accessibility - Defines whether there are any restrictions in the API that limit the accessibility of data.

The most popular social networks were evaluated based on these three parameters. The one that fitted best turned out to be Twitter3 (see Table 1.1). Facebook4 and Google+5 scored well on data and real-time but had to be ruled out due to their limited API access and strict privacy settings. "The largest study ever conducted on Facebook on privacy showed that in June 2011 around 53% of the profiles were private, which was an increase of 17% over 15 months." [Sag12].

2A term coined by Thomas Vander Wal, combining words folk and taxonomy.

3www.twitter.com

4www.facebook.com

5plus.google.com


Data
    Positive: Average of 58 million tweets each day6. Over 85% of topics are headline or persistent news in nature [KLPM10].
    Negative: Length of the tweet. Lack of reliability of the location precision of the tweets.

Real-time
    Positive: Pseudo real-time location-based streaming service.
    Negative: 2 hours behind.

Accessibility
    Positive: API easily accessible and usable.
    Negative: Unpaid plan limited to a 1% representative subset of data.

Table 1.1: Evaluation of Twitter’s suitability for the project.

LinkedIn7 was discarded because of the nature of the data - industry and career oriented. Instagram8, Pinterest9 and Flickr10 are all big and interesting, but the data they provide is mostly images and thus hard to interpret; their data is also not that close to trending news. The same goes for YouTube11 and Vine12.

It is worth mentioning that trends can be spotted on Issuu as well. One of the problems is that they have a huge delay, since Issuu users are not as active as the users of other social networks. Also, on Issuu trends would have to be inferred from what people read instead of what they are posting or commenting on.

7www.linkedin.com

8www.instagram.com

9www.pinterest.com

10www.flickr.com

11www.youtube.com

12vine.twitter.com


1.3 What is a trend?

A trend can be understood in many different ways depending on the context - stock market, fashion, music, news, etc. The dictionary defines a trend as:

"a general direction in which something is developing or changing"

In this project trends will be considered a bit differently. Basically Issuu is interested in knowing what topic or event is currently hot in which country (or an even smaller area) and recommending magazines similar to it. On Twitter trends can be spotted by looking at the hashtags, so in this project trending hashtags and trends will be considered the same thing. A trend is taken as a "hashtag-driven topic that is immediately popular at a particular time"13.

Trends vary in terms of how unexpected they are. Seasonal holidays like Christmas or Halloween are trends, but very expected ones. On the other hand, Schumacher’s skiing accident is a very unexpected one; both types are equally interesting and valuable for Issuu. Another parameter is how quickly a trend is rising. We can have slow or fast trends (see Figure 1.1); the priority is spotting trends that rise fast.

Trends and popularity are not the same thing. If something becomes popular all of a sudden, it is a trend. But if it keeps being popular, it is not a trend anymore.

1.4 Related work

Extracting trends from Twitter is nothing new. The two widely used approaches are parametric and non-parametric. The most popular one is the parametric approach, where a trending hashtag is detected by observing its deviation from some baseline [IHS06], [BNG11],

13www.hashtags.org/platforms/twitter/what-do-twitter-trends-mean/



Figure 1.1: Typical patterns for slow and fast trends.

[CDCS10], using a sliding window. It is the simplest of approaches and still quite successful, based on the assumption that different trends will behave similarly to one another. It is known that this is not the case in the real world - there are many types of trends, with all kinds of patterns.

To address that problem, non-parametric methods have been used as well [Nik12]. In those the parameters were not set in advance but were learned from the data instead. Many patterns were observed and grouped into the ones that became trends and the ones that did not. New hashtag patterns can then be compared to the observed ones using Euclidean distance, and the similarity can be used to determine whether a hashtag is trending or not.

The requirements for spotting trends at Issuu are not that strict - there is no need to capture all the trends from a certain day, but instead just the most significant ones. This makes things simpler, and that is why the parametric model was chosen for this project. It is the first time that Issuu is doing a project like this, so the idea was to try the simpler things first to see if they work. If not, the heavier non-parametric models could be applied.


1.5 Methodology

Figure 1.2 illustrates the methodology of the project.

Figure 1.2: Project flowchart (Twitter data → trending framework → trends → Issuu documents)

It is important to note early how this trend data will be used by Issuu, because it sets requirements on other parts of the project. Issuu is using Latent Dirichlet Allocation (LDA) [BNJ03] to extract topics from its documents, using the Gensim implementation [ŘS10]. Using the Jensen-Shannon distance (JSD) it is possible to compare documents to one another via their LDA topic distributions. This allows Issuu to find documents similar to the one that is being read, for example.

If we can capture trends from social media and express them as text (a "virtual document"), we can calculate LDA for the trend (one text file per trend) and use JSD to find similar documents.
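The comparison step can be sketched in plain Python. The distributions and function names below are illustrative dummies, not Issuu's actual code; in production the topic distributions would come from Gensim's LDA model.

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; zero-probability terms in p
    # contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    # Square root of the Jensen-Shannon divergence: the average KL
    # divergence of p and q to their midpoint distribution m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

# Dummy 3-topic LDA distributions for a trend "virtual document"
# and two magazines (illustrative values only).
trend = [0.7, 0.2, 0.1]
magazines = {"a": [0.6, 0.3, 0.1], "b": [0.1, 0.1, 0.8]}

# The smaller the distance, the more similar the magazine is to the trend.
ranked = sorted(magazines, key=lambda n: jensen_shannon_distance(trend, magazines[n]))
```

With log base 2 the distance is bounded by 1, which makes the similarity scores easy to compare across trends.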

Issuu is using the Apache Solr14 search engine, which takes text as input and can give similar documents as output. This is another approach, and it will be investigated whether it may be used as an alternative or complement to LDA.

With all that in mind, the plan is this:

• Access the Twitter API15 and retrieve tweets from a given country at a given time, storing them in a database.

14lucene.apache.org/solr/

15dev.twitter.com


• Calculate trends from the tweets; the output of this step is a list of trending hashtags per given time window.

• Find out how to feed those trends into both the LDA topic model and the Solr search engine.

• Get documents as the final result and evaluate.

1.6 Expected results

• Analysis of potential resources for mining data from social media networks, to be used at Issuu as a basis for recommendations.

• Data mining algorithms (Python16) to retrieve all the necessary data.

• An algorithm for extracting trends from tweets.

• A method of feeding trends into the LDA model and Solr.

• Evaluation of the results and final recommendations on the end-to-end solution for incorporating social media data into Issuu’s recommendation engine.

1.7 Outline

Chapter 2 explains how to retrieve tweets from the Twitter API service; these are processed and analyzed in Chapter 3. In Chapter 4 the trends are fed into the LDA model and the Solr search engine, resulting in similar documents. The final recommendations on the end-to-end solution are evaluated in Chapter 5.

16www.python.org


Chapter 2

Mining Twitter

This chapter is about retrieving tweets from Twitter and storing them for trend extraction later. The USA was chosen as the country for this project for several reasons. First of all, most Issuu readers are from the USA. Secondly, more than half of Twitter users are from the USA too [Bee12]. Finally, having tweets in English makes it simpler, because Issuu’s LDA model was trained on the English Wikipedia, and sticking to English tweets means that no translation will be needed.

Figure 2.1: Mining Twitter flowchart (all tweets in the world → location filter [USA] → tweets in USA → database; trend-related tweets → Issuu’s LDA → topic distribution → similar documents)


The data in Twitter are 140-character-long messages called tweets. Often they contain some additional meta-data:

# - Groups tweets together by type or topic, known as a hashtag.
    Example: "Wow, Mac OS X Mavericks is free and will be available for machines going back as far as 2007? #Apple #Keynote"

@ - Used for referencing, mentioning or replying to another user.
    Example: "@alastormspotter: iOS7 will release at around noon central time on Wednesday."

RT - Symbolizes a retweet (reposting an existing tweet from another user).
    Example: "RT @ThomasCDec: 50 days to #ElectionDay"

Table 2.1: Additional meta-data used in tweets.
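The meta-data above can be pulled out of a raw tweet text with simple pattern matching. A minimal sketch (the regular expressions are simplified and, for instance, ignore Unicode hashtags):

```python
import re

def extract_metadata(text):
    # Simplified patterns: '#' or '@' followed by word characters.
    # Real tweets allow more than this - the sketch is illustrative.
    hashtags = [h.lower() for h in re.findall(r"#(\w+)", text)]
    mentions = re.findall(r"@(\w+)", text)
    is_retweet = text.startswith("RT ")
    return hashtags, mentions, is_retweet

tags, users, rt = extract_metadata("RT @ThomasCDec: 50 days to #ElectionDay")
```

Lower-casing the hashtags makes "#Apple" and "#apple" count as the same label when trends are aggregated later.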

2.1 Twitter API

The Twitter API provides two different calls which may be suitable for this purpose:

GET search/tweets : Part of the ordinary API, i.e. with a rate limit of 450 requests per 15 minutes and a continuation URL, which means that there is a finite number of tweets per request before the next chunk must be requested.

POST statuses/filter : Part of the streaming API, as mentioned in Table 1.1, which for the unpaid plan has the limitation of only providing a 1% representative subset of the full dataset.


Twitter uses a three-step heuristic to determine whether a given tweet falls within the specified location, defined as a bounding box1:

1. If the tweet is geo-location tagged, this location will be used for comparison with the bounding box.

2. A user on Twitter can specify a location in the account settings, which the API refers to as place, and this will be used for comparison if the tweet is not geo-tagged.

3. If neither of the rules listed above matches, the tweet will be ignored by the streaming API.

The streaming API was chosen because it takes all three heuristics into account, whereas the search API only includes the second. Additionally, due to the limitations it is difficult to know how frequently to execute the API call in order to stay up-to-date.
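Setting up the location filter for POST statuses/filter can be sketched as follows. The bounding-box values are illustrative (the real ones come from GeoNames), and the actual HTTP call, which requires OAuth credentials, is only indicated in a comment:

```python
# SW/NE corners as (longitude, latitude) pairs; illustrative values only.
USA_SW = (-125.0, 24.4)
USA_NE = (-66.9, 49.4)

def locations_parameter(sw, ne):
    # The streaming endpoint expects "sw_lon,sw_lat,ne_lon,ne_lat".
    return ",".join(str(c) for c in sw + ne)

def in_bounding_box(lon, lat, sw=USA_SW, ne=USA_NE):
    # Client-side check, used later to discard tweets the service
    # lets through even though they fall outside the box.
    return sw[0] <= lon <= ne[0] and sw[1] <= lat <= ne[1]

params = {"locations": locations_parameter(USA_SW, USA_NE)}
# A real request would POST these parameters with OAuth credentials to
# the statuses/filter streaming endpoint and read the response as a stream.
```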

2.2 Tweet’s location problem

A couple of problems were spotted with the location accuracy:

1. Twitter’s API supports streaming by location, but only with coordinates in sets of SW and NE, defining each country by a rectangle. Figure 2.2 shows the tweets streamed from the USA (tweets that are actually from the USA have been filtered out, to provide a better overview).

2. Although the selected bounding box covers the USA and even more, tweets from Guatemala and Honduras are still present (see Figure 2.2).

1Two pairs of longitude and latitude coordinates: the south-west (SW) and north-east (NE) corners of a rectangle


Figure 2.2: Visualization of the problem with the Twitter service, where each red dot represents a tweet. Duration is 1 hour and the number of tweets with a wrong location is 7,240.

Figure 2.3: Visualization of the solution to the Twitter service problem, where each red dot represents a tweet. Duration is 1 hour and the number of tweets is 112,851; this means an error rate of approximately 7%.


Applying the location filter to the streaming API means that the bounding box needs to be known. GeoNames2 solves this problem by providing all the coordinates needed for all countries.

The two problems spotted regarding the location accuracy turned out to have the same solution. Algorithm 1 investigates whether the currently received tweet is from the desired country. The ones which are will be stored in the MySQL3 database for further analysis. Appendix C contains information about the database choice and implementation details, including the E/R diagram.

A problem occurred: some of the tweets were missing the country code, which meant that they could not be processed. To solve this, OpenStreetMap’s reverse geocoding API4 was used, which has the ability to convert longitude and latitude value pairs to a country code.

Algorithm 1: Parse tweet from data

if coordinates in data then
    if place not in data then
        country_code = reverse geocode coordinates
    if tweet.country_code is the chosen country_code then
        # Parse the rest of the tweet
        add tweet to database
else
    raise LocationNotAvailableException
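A Python version of Algorithm 1 might look like the sketch below; reverse_geocode stands in for OpenStreetMap's reverse geocoding call and store for the MySQL insert, both hypothetical parameters added for illustration:

```python
class LocationNotAvailableException(Exception):
    pass

def parse_tweet(data, chosen="US", reverse_geocode=None, store=None):
    # Keep a tweet only if it can be placed in the chosen country.
    # `reverse_geocode` and `store` are illustrative stand-ins for the
    # Nominatim call and the database write.
    coords = data.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]
        place = data.get("place")
        if place is None:
            # Country code missing: recover it from the coordinates.
            country_code = reverse_geocode(lon, lat)
        else:
            country_code = place["country_code"]
        if country_code == chosen:
            if store:
                store(data)  # parse the rest of the tweet and persist it
            return True
        return False
    raise LocationNotAvailableException("tweet has no coordinates")

stored = []
ok = parse_tweet({"coordinates": {"coordinates": [-100.0, 40.0]},
                  "place": {"country_code": "US"}}, store=stored.append)
```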

For debugging purposes an interactive tweet-map, a graphical and interactive way of visualizing tweets, has been created (used for Figure 2.2 and Figure 2.3). It is a JavaScript/HTML5-based website hosted locally in Python with the Flask5 module; implementation details are available in Section C.1.

2www.geonames.org - Licensed under a Creative Commons attribution license, which gives free access to: Share - to copy, distribute and transmit the work, and Remix - to adapt the work and to make commercial use of it

3www.mysql.com

4wiki.openstreetmap.org/wiki/Nominatim/

5flask.pocoo.org


Chapter 3

Trending framework

In the previous chapter it was described how the tweets were collected, their location accuracy ensured, and how they were stored in the MySQL database. This chapter focuses on how to turn those tweets into trends. "Fast" trends were chosen for this project because they have the most impact compared to the "slow" trends. Eventually most trends will appear on Issuu anyway - with a huge delay, since Issuu users are not as active as the users of other social networks. The challenge is therefore to reduce this delay.

To illustrate the idea, a time period of three days was chosen, knowing in advance that there were several trends in it, to test whether the algorithm can find them.

On October 22nd Apple held its annual event where it presented the updated product line (new iPads, MacBooks and of course the new OS X Mavericks). This event was chosen as one of the examples to start with.


3.1 Raw data

At first we will take a look at the raw data from the database, as well as a subset of 3 consecutive days, which will be used as the example to describe the trending framework:

Type                           Full dataset   Example
Duration (hours)               1462           72
Tweets                         127,930,378    4,103,273
Hashtags                       25,502,269     770,453
Unique hashtags                3,180,466      206,850
Avg. tweets per day            2,099,858      1,367,757
Avg. length per tweet (char)   56             54
Avg. words per tweet           9.5            9.1

Table 3.1: Facts about the dataset collected.

More statistics about the dataset are presented in Appendix A.

The plot of total tweets per hour in Figure 3.1 - where the x-axis represents the 3 days (72 hours) and 0, 24 and 72 are midnight (this also applies to the other plots in this chapter) - clearly shows that the frequency/fluctuation of tweets reflects the same day/night rhythm as humans, which was as expected.


Figure 3.1: Total tweets per hour


Hashtags are used to categorize/label a tweet with one word or phrase, and they can be used to spot the trends in the tweets. The full text of the tweets could also have been used; this option has been tested and found to be generally too vague.

Figure 3.2 shows the total amount of tweets for the chosen hashtags. As described before, Apple is one of them; the other two are the TV show "Pretty Little Liars", which was shown the same day, and the hashtag "jobs", which is a way companies identify a job opening on Twitter. These three hashtags represent different kinds of trends: one-time events, weekly recurring and daily recurring, which will be described later in this chapter.


Figure 3.2: Raw tweet count for hashtags

Due to the quite big fluctuation in the total amount of tweets during a day, a weight function has been created with the purpose of reducing the importance of tweets posted during the night, expressed as a sigmoid function1 (this could also have been another mathematical function, like the hyperbolic tangent):

w_t = 1 / (1 + exp(−(m × (|tweets ∈ t| − X)))),    (3.1)

where t defines a time period and m is the slope of the curve, which can be defined as how "expensive" it is to have a tweet count below the preferred amount X (see Figure 3.3).

1en.wikipedia.org/wiki/Sigmoid_function/


Figure 3.3: Weighted tweet count per hour

3.2 Normalizing data

To get reasonable results from data which vary in amount (in this case, the total amount of tweets per hour), it is highly recommended to normalize the data. In this project every hashtag is normalized by the total amount of tweets in the time period t, which results in a normalized value for each hashtag:

f_t = |tweets ∋ the hashtag| / |tweets ∈ t|    (3.2)

Figure 3.4 shows the result of applying Equation 3.2 to the data in Figure 3.2. The hashtags seem to follow the same pattern, although there are some differences during the night.
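A sketch of Equation 3.2 with illustrative hourly counts:

```python
def normalized_frequency(hashtag_count, total_tweets):
    # Equation 3.2: the share of the period's tweets carrying the hashtag.
    return hashtag_count / total_tweets if total_tweets else 0.0

# Illustrative hourly counts for a hashtag against the hourly totals.
totals = [30000, 12000, 28000]
counts = [300, 240, 2800]
f = [normalized_frequency(c, t) for c, t in zip(counts, totals)]
```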


Figure 3.4: Normalized hashtags


3.3 Detecting trends

To detect a trend, it is important to know how the hashtag behaved in the previous time window (the reference window r) before the current time window w. The sizes of w and r are tunable parameters in the framework.

Figure 3.5 shows an example of the values for the reference window and the current window, which are 2 hours and 1 hour respectively.


Figure 3.5: Example sizes of w and r.

To be able to know whether a term is a trend, the normalized reference window (f_t_ref) is subtracted from the current window, to find out whether the interest has increased:

f_t_ref = ( Σ_{t ∈ r} |tweets ∋ the hashtag| ) / ( Σ_{t ∈ r} |tweets ∈ t| )    (3.3)

where r is a list of reference windows. The outcome of this step can be seen in Figure 3.6. It clearly shows that this has a huge impact on the "jobs" hashtag, whose influence has dropped.



Figure 3.6: w - r, where r = 2 hours.

3.4 Recurring trends

The sizes of w (current window) and r (reference window) create problems regarding interfering recurring trends. A hashtag like "jobs" turns out as a trend each day, despite the fact that it follows the same pattern each day (see Figure 3.2).

These types can be daily, weekly or yearly recurring, defined as:

Day - An example of a daily recurring trend is a hashtag such as "jobs". This hashtag recurs each day, though not necessarily at the same time, and tests show that the amount tends to be a bit lower during the weekend.

Week - TV shows are a great example of trends which recur each week on the same day and time, as long as they are being shown. Another type of weekly recurring trend is the natural difference between weekdays with work and weekends.

Year - New Years Eve, Christmas or Halloween are all examples of yearly recurring trends.


In this project the focus is on getting rid of the daily recurring trends. The issue is solved by subtracting the maximal value of the hashtag from the day before the time period t:

max(f_i), i ∈ {t − 24; t}    (3.4)

This makes the framework sensitive to outliers. But in this project outliers are in fact what is being looked for - trends. Subtracting the maximum does not mean that a hashtag cannot be trending two days in a row; the amount of tweets containing it just needs to rise.

The weekly and yearly recurring trends were not implemented, but the principle is the same.
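The daily-recurrence correction of Equation 3.4 is a one-liner in practice; the frequencies below are illustrative:

```python
def daily_recurrence_penalty(previous_day_f):
    # Equation 3.4: the maximal normalized frequency the hashtag reached
    # in the 24 hours before the time period t.
    return max(previous_day_f) if previous_day_f else 0.0

# A hashtag like "jobs" peaks at roughly the same level every day, so
# subtracting yesterday's maximum cancels today's routine spike; a
# genuine outlier still has to rise above it to score as a trend.
penalty = daily_recurrence_penalty([0.010, 0.030, 0.020])
```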

3.5 Trend score

Combining Equations 3.1, 3.2, 3.3 and 3.4 gives the complete Equation 3.5 for calculating the trend_score of a hashtag at a given time t:

trend_score = (f_t − f_t_ref − max(f_i)) × w_t,  i ∈ {t − 24; t}    (3.5)

Figure 3.7 shows the final result of the trending framework after applying Equation 3.5 to the data:


Figure 3.7: w - r, where r = 24 hours.

where the x-axis represents time (as in the other plots) and the y-axis represents the trend_score.


In this plot the importance of the hashtag "jobs" is reduced to the point where it is insignificant (a trend score below 0), while the other two turn out to be trending, which is exactly what we want.

Sometimes the framework produces more trends than Issuu needs. Because of this a threshold has been set as a limit: if and only if the trend_score is above the threshold will the hashtag be accepted as a trend. The threshold is a tunable parameter.
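Applied to the scored hashtags, the threshold check is a single comparison. A minimal sketch, with illustrative scores rather than real output of Equation 3.5:

```python
def trending(scores, threshold):
    """Accept a hashtag as a trend if and only if its trend_score
    exceeds the tunable threshold."""
    return {tag: s for tag, s in scores.items() if s > threshold}

# Illustrative score values only.
scores = {"#apple": 0.07, "#pll": 0.06, "#jobs": -0.02}
print(trending(scores, 0.05))  # {'#apple': 0.07, '#pll': 0.06}
```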

[Figure: weighted tweet counts and trend scores over 72 hours for #pll, #apple and #jobs, with the threshold drawn as a horizontal red line]

Figure 3.8: Trend scores for all trends in the time period of 22 October. The red line displays the chosen threshold.

Figure 3.8 is a visual representation of the trend scores on the day of the Apple event, the 22nd of October. The red line represents the threshold.

3.6 Aggregating trends

At Issuu it is not very likely that users come back each hour, but rather once a day, so it would be unnecessary to recommend new documents every hour. A solution that makes it possible to aggregate trends over longer, tunable periods would be preferable.

The computationally most efficient and most extensible solution would be to extend the existing database (explained in Section C.3) to make it able to store trends and references to the corresponding tweets.


Two new tables, trend and tweet_trend_relation, were added to the database, containing each trend and the time at which it was trending (see Figure 3.9).

[Figure: E/R diagram of the extended database schema, with the new trend and tweet_trend_relation tables]

Figure 3.9: E/R Diagram v2
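A minimal sketch of the two new tables, using sqlite3 as an in-memory stand-in for the MySQL database; the column names are assumptions inferred from Figure 3.9 and the surrounding text, not the actual schema:

```python
import sqlite3

# In-memory stand-in for the MySQL database described in Section C.3.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trend (
    id         INTEGER PRIMARY KEY,
    hashtag    TEXT NOT NULL,
    trended_at TEXT NOT NULL          -- hour in which the hashtag trended
);
CREATE TABLE tweet_trend_relation (
    tweet_id INTEGER NOT NULL,        -- reference to the existing tweet table
    trend_id INTEGER NOT NULL REFERENCES trend(id)
);
""")
con.execute("INSERT INTO trend VALUES (1, '#apple', '2013-10-22T18:00')")
con.execute("INSERT INTO tweet_trend_relation VALUES (42, 1)")

# Aggregating trends over a whole day then becomes a simple range query:
rows = con.execute(
    "SELECT DISTINCT hashtag FROM trend WHERE trended_at LIKE '2013-10-22%'"
).fetchall()
print(rows)  # [('#apple',)]
```

Storing one row per trending hour is what makes the tunable aggregation of Section 3.6 a query-time decision rather than a change to the detection pipeline.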


Chapter 4

From trends to magazines

This chapter is about mapping the computed trends to Issuu, so that they can be presented as similar magazines/documents to the users. Two different approaches will be investigated in order to solve this problem: Latent Dirichlet Allocation (LDA) and the Apache Solr search engine.

4.1 LDA

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that automatically discovers the topics in a document. A topic is defined by a probability distribution over the words in a fixed vocabulary, which means that each topic contains a probability for each word.

LDA can be expressed as a graphical model, in the so-called plate notation (Figure 4.1).


[Figure: plate notation with the Dirichlet parameter α, per-document topic proportion θ_d, per-word topic assignment Z_{d,n}, observed word W_{d,n} and topic hyperparameter β, inside plates N and D]

Figure 4.1: Plate notation of the LDA model [Ble09]

where,

Variable  Definition
D         The number of documents.
N         The total number of words in all documents.
W         The observed word n in document d.
Z         The topic assigned to the n'th word in the d'th document.
α         K-dimensional per-document topic distribution vector, where K is the number of topics.
β         Y-dimensional per-topic word distribution vector, where Y is the number of words in the corpus.
θ         Topic proportions for the d'th document.

Table 4.1: Definition of LDA model parameters


Figure 4.2 is an example of a visual representation of an LDA space with three topics. A given document x has a probability of belonging to each topic, and these probabilities sum to 1. The corners of the simplex correspond to probability 1 for the given topic.

[Figure: triangular simplex with corners Topic 1, Topic 2 and Topic 3]

Figure 4.2: LDA topic simplex, with three dummy topics.

The topic distribution for a document can be visualized using a bar plot. This describes which topics are present in the document and thereby its underlying hidden (latent) structure:

[Figure: bar plot with one bar per topic 1…x]

Figure 4.3: Representation of a topic distribution using dummy data, where the x-axis represents the x topics and the y-axis represents the probability of belonging to the given topic x.

At Issuu, the LDA model is trained on 4.5 million English Wikipedia1 articles. LDA makes the assumption that all words in the same article are somehow related. Every article is unique in the sense that it has a unique distribution of words.

1 www.wikipedia.org


This could be interpreted as a unique topic for each article, resulting in 4.5 million topics. That would be useless, since the goal is to make a model that finds similarities among documents instead of declaring them all different. One of the main steps in LDA is dimensionality reduction, where the number of topics is reduced (in Issuu's case to 150 topics), forcing similar topics to "merge" and revealing deeper underlying patterns.

4.2 Using LDA

All the tweets containing the trending hashtag will be used as the data source for Issuu's LDA model, instead of only the hashtag itself. A hashtag alone does not provide enough information and context to give a stable result from the model. The model is context-dependent and would not be able to differentiate between the fruit and the electronics company based on the hashtag #apple without any context.
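Before the tweets reach the LDA model they benefit from light cleaning, since URLs and @-mentions carry no topical information while the words inside hashtags do. A minimal sketch under those assumptions (the project's actual preprocessing is not specified here):

```python
import re

def clean_tweet(text):
    """Strip URLs and @-mentions and unwrap hashtags, keeping the words
    themselves as context for the LDA model."""
    text = re.sub(r"https?://\S+", "", text)   # URLs carry no topic words
    text = re.sub(r"@\w+", "", text)           # mentions are usernames
    text = text.replace("#", "")               # keep the hashtag's word
    return " ".join(text.lower().split())

# Hypothetical tweets joined into one pseudo-document for the model:
doc = " ".join(clean_tweet(t) for t in [
    "#apple unveils OSX #mavericks http://t.co/x",
    "Free upgrade! @tim_cook #apple",
])
print(doc)  # "apple unveils osx mavericks free upgrade! apple"
```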

To give an idea of the richness of the context behind the tweets of a single hashtag like #apple (see Figure 4.4), two tag clouds were generated. A tag cloud is a visual representation of text which emphasizes the most frequently used words by either color or size.

Figure 4.4: Left: Tag cloud of all words in the tweets containing #apple. Right: The same tag cloud after removing #apple, #free and #mavericks, to get a deeper understanding.


The flowchart (Figure 4.5) contains four steps, visualized as arrows and denoted with numbers. It shows the overall structure of how trends from Twitter are turned into similar magazines/documents using LDA.

[Figure: flowchart in four numbered steps, from all tweets in the world through a location filter (USA) and the database into Issuu's LDA, producing a topic distribution and finally similar documents]

Figure 4.5: From trend to magazines flowchart

The tweets corresponding to the trending hashtag are fed into the LDA model (step 1), which produces the topic distribution (step 2). Figure 4.6 shows the topic distribution for the #apple tweets, where it is easy to see that the software/electronics topic is dominating, as expected.

[Figure: bar plot of the 150-topic distribution, dominated by a single "Software" topic with probability ≈ 0.45]

Figure 4.6: Topic distribution for the #apple tweets

Using this LDA topic distribution in combination with the Jensen-Shannon divergence (Equation 4.1) (step 3), it is possible to find similar magazines to recommend from Issuu (step 4):

JSD(P ‖ Q) = (1/2) D(P ‖ M) + (1/2) D(Q ‖ M), where M = (1/2)(P + Q),  (4.1)

where P and Q are two probability distributions (in Issuu's case two LDA topic distributions) and D is the Kullback-Leibler divergence.
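The Jensen-Shannon divergence can be transcribed directly, with D the Kullback-Leibler divergence; the three-topic distributions below are purely illustrative:

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence of two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

tweets_topics = [0.7, 0.2, 0.1]   # illustrative 3-topic distributions
magazine_a    = [0.6, 0.3, 0.1]
magazine_b    = [0.1, 0.1, 0.8]
# magazine_a is closer to the tweets' distribution than magazine_b:
print(jsd(tweets_topics, magazine_a) < jsd(tweets_topics, magazine_b))  # True
```

Ranking all candidate magazines by this divergence and taking the smallest values yields the recommendations of step 4.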


4.2.1 Results using LDA

Figure 4.7 shows a subset of the magazines found by LDA to be similar to the tweets containing the hashtag #apple.

Figure 4.7: Subset of the similar #apple documents using LDA.

NB: The documents are blurred due to copyright issues and the terms of service/privacy policy on Issuu. This applies to all figures which show magazine covers.

The resulting magazines range from learning material such as "... For Dummies" to magazines like "Computer Magazine" and "macworld".

The popularity and the upload date on Issuu differ from magazine to magazine. These are parameters which could also be used to weight the documents.


4.3 Solr

Lucene is a Java-based high-performance text search engine library. A document in Lucene terms is not a document as we know it, but merely a collection of fields which describe the document. For any given document, these fields could hold information like the title and the number of pages. Lucene uses a text analyzer, which tokenizes the data from a field into a series of words. After that, a process called stemming is performed, which reduces each word to its stem/base. See Figure 4.8 for an example.

[Figure: "Magazine recommendations based on social media trends" tokenized into words (dropping the stop word "on"), then stemmed to magazine, recommend, base, social, media, trend]

Figure 4.8: Example of tokenizing and stemming
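The tokenization and stemming of Figure 4.8 can be illustrated as follows. The three suffix rules are a deliberately crude stand-in chosen only to reproduce the figure's example; a real analyzer would use a proper stemmer such as Porter's:

```python
STOP_WORDS = frozenset({"on", "the", "a", "in", "and"})

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def stem(word):
    """Crude suffix stripping for illustration only -- not a real stemmer."""
    for suffix, repl in (("ations", ""), ("ed", "e"), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

words = tokenize("Magazine recommendations based on social media trends")
print([stem(w) for w in words])
# ['magazine', 'recommend', 'base', 'social', 'media', 'trend']
```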

Lucene uses term frequency-inverse document frequency (tf-idf) as part of its scoring model, where tf is the term frequency: a measure of how often a term appears in a document. Idf is the inverse document frequency, which down-weights terms that appear across many documents in the collection.
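A simplified version of this scoring ingredient can be sketched as follows; Lucene's actual similarity formula adds normalization and boost factors on top of this:

```python
from math import log

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` scaled by how rare the term is across
    `corpus` (a list of token lists)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical three-document corpus:
corpus = [["apple", "mavericks", "free"],
          ["apple", "iphone"],
          ["surf", "mavericks"]]
# "iphone" appears in fewer documents than "apple", so it scores higher:
print(tf_idf("iphone", corpus[1], corpus) > tf_idf("apple", corpus[1], corpus))  # True
```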

Solr is an open-source enterprise search server, widely used by services like Netflix2. It is a web application service built around Lucene, adding useful functionality such as geospatial search, replication and a web administrative interface for configuration.

2 www.netflix.com


4.4 Using Solr

In this project an integration with Issuu's Solr server has been created; the search engine is used through HTTP requests and JSON3 responses. JSON is an easily human-readable, open-standard text format, mostly used to transfer data between servers and web applications, as in this project. An example of a request could be:

<base url>?
    q=apple+mavericks+free+new&
    wt=json&
    debug=true&
    start=0&
    rows=50

where,

Parameter  Description
base url   Address used to access the Solr search engine.
debug      If true, the response will contain additional information, including the score and the reasoning for each document.
q          Main text to be queried in the request.
rows       Maximum number of results - used to paginate.
start      Used to define the current position, in combination with rows, to paginate.
wt         The format of the response, e.g. json.

Table 4.2: Explanation of parameters to Solr.

The q parameter is constructed using the x most occurring words in the tweets corresponding to the trending hashtag, where x is tunable.
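Building the q parameter and the request URL can be sketched as below; the word filtering, the example tweets and the endpoint are illustrative assumptions (the real Solr endpoint is internal to Issuu):

```python
from collections import Counter
from urllib.parse import urlencode

def build_query(tweets, hashtag, x=4):
    """Build the Solr q parameter from the x most frequent words in the
    trend's tweets, excluding the hashtag itself and URLs."""
    words = Counter(w for t in tweets for w in t.lower().split()
                    if w != hashtag and not w.startswith("http"))
    return " ".join(w for w, _ in words.most_common(x))

# Hypothetical tweets for the #apple trend:
tweets = ["#apple mavericks free", "free mavericks update #apple", "#apple free"]
params = {"q": build_query(tweets, "#apple"),
          "wt": "json", "start": 0, "rows": 50}
url = "https://solr.example/select?" + urlencode(params)
print(url)  # ...?q=free+mavericks+update&wt=json&start=0&rows=50
```

urlencode encodes the spaces in q as "+", matching the request format shown above.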

3 json.org


4.4.1 Results using Solr

The similar magazines produced by Solr (Figure 4.9) are more diverse than the similar documents from the LDA model (Figure 4.7). They range from Apple magazines and learning material to magazines about the surf spot "Mavericks" in California and the NBA (National Basketball Association) team Dallas Mavericks.

Figure 4.9: Subset of the similar #apple documents using Solr

Appendix B contains a complete example using the hashtag #bostonstrong, displaying the complete process from tweets to similar documents using both LDA and Solr.


Chapter 5

Conclusion

A prototype of an end-to-end solution has been developed, with the purpose of spotting location-based trends on a social media network and mapping them to Issuu. Twitter was selected as the data source because it suited the requirements best, as described in Section 1.2.

Both of the results (LDA: Figure 4.7 and Solr: Figure 4.9) suggest that improvements could be made. Four improvements for the trending framework and three for the LDA were found useful:

• Trending Framework:

1. Support all 28 languages1 supported by Issuu.

2. Capture "slow" trends.

3. Non-parametric model.

4. Recurring weekly and yearly trends.

1 Magazines written in other languages do not have an LDA topic distribution, because only these languages are incorporated into the translation framework and can therefore be translated to English.


• LDA:

1. Limited by Issuu’s LDA model.

2. Wikipedia lacks certain topics.

3. Big magazines result in many topics.

5.1 Improvements of the trending framework

USA was chosen as the location in this prototype, because most Issuu readers are from the USA and more than half of Twitter's users are from the country too. In the future the goal will be to support the existing 28 languages supported by Issuu (Figure 5.1).

[Figure: world map coloring the countries whose languages Issuu supports: English, Spanish, German, French, Portuguese, Russian, Arabic, Italian, Dutch, Turkish, Farsi, Polish, Indonesian, Swedish, Norwegian, Catalan, Czech, Hebrew, Danish, Finnish, Romanian, Hungarian, Croatian and Icelandic]

Figure 5.1: List of languages supported by Issuu, including a colored world map of the countries speaking those languages.

A solution is to extend the existing end-to-end solution with Issuu's translation framework (Figure 5.2), which can translate all non-English tweets to English (1st improvement of the trending framework).

A tunable trending framework has been developed, which is capable of spotting "fast" trends on Twitter and reducing the importance of daily recurring trends. The hashtags #apple and #pll were found to be trending on the 22nd of October, among a vast number of unimportant recurring trends. That was the day Apple presented their updated product line, including the new OS X Mavericks2, and the day the Halloween episode of the popular TV show "Pretty Little Liars" (pll) was shown.

2 Apple Special Event: http://www.apple.com/apple-events/october-2013/


[Figure: pipeline from the MySQL database through trend detection and an "English?" check, routing non-English trends through Issuu's translation framework before the LDA model and topic distribution]

Figure 5.2: Translation module to improve the solution.

Hashtags like #happyhalloween, described in Appendix B along with #bostonstrong, were possible to spot during the 31st of October, but the ability to spot "slow" trends (described in Section 1.3) needs to be improved (2nd improvement of the trending framework). A possible solution is to run multiple instances of the framework simultaneously, with various sizes of the current window (w) and the reference window (r) (Figure 5.3).

[Figure: tweets from the MySQL database fanned out to three trending frameworks with current windows of 2, 6 and 12 hours, producing fast and slow trends]

Figure 5.3: Three simultaneously running trending frameworks.

Creating a new trending framework built on a non-parametric model (3rd improvement of the trending framework) [Nik12] would make the system more robust and faster at spotting trends. More robust because parameters like the threshold are not defined up front but observed from training data, which is then used to decide whether a new dataset is a trend or not.
