A new ANEW: Evaluation of a word list for sentiment analysis in microblogs
Finn Årup Nielsen
DTU Informatics, Technical University of Denmark, Lyngby, Denmark.
fn@imm.dtu.dk, http://www.imm.dtu.dk/~fn/
Abstract. Sentiment analysis of microblogs such as Twitter has recently gained a fair amount of attention. One of the simplest sentiment analysis approaches compares the words of a posting against a labeled word list, where each word has been scored for valence: a "sentiment lexicon" or "affective word list". There exist several affective word lists, e.g., ANEW (Affective Norms for English Words), developed before the advent of microblogging and sentiment analysis. I wanted to examine how well ANEW and other word lists perform for the detection of sentiment strength in microblog posts in comparison with a new word list specifically constructed for microblogs. I used manually labeled postings from Twitter scored for sentiment. Using simple word matching I show that the new word list may perform better than ANEW, though not as well as the more elaborate approach found in SentiStrength.
1 Introduction
Sentiment analysis has become popular in recent years. Web services, such as socialmention.com, may even score microblog postings on Identi.ca and Twitter for sentiment in real time. One approach to sentiment analysis starts with labeled texts and uses supervised machine learning trained on the labeled text data to classify the polarity of new texts [1]. Another approach creates a sentiment lexicon and scores the text based on some function that describes how the words and phrases of the text match the lexicon. This approach is, e.g., at the core of the SentiStrength algorithm [2].
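The lexicon approach can be sketched in a few lines. The code below uses a tiny made-up lexicon (the words and scores are illustrative, not the actual AFINN entries) and averages the valences of the matched words:

```python
import re

# Tiny illustrative lexicon; the words and scores are made up for the example.
lexicon = {"good": 3, "bad": -3, "awesome": 4, "awful": -4}


def score(text):
    """Mean valence of the tokens in `text`, with unknown tokens counting as 0."""
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    return sum(lexicon.get(w, 0) for w in words) / len(words)


print(score("what an awesome day"))  # 4 / 4 words = 1.0
```

Other scoring functions are possible over the same lexicon lookup, as discussed in Section 3.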
It is unclear what the best way is to build a sentiment lexicon. There exist several word lists labeled with emotional valence, e.g., ANEW [3], General Inquirer, OpinionFinder [4], SentiWordNet and WordNet-Affect, as well as the word list included in the SentiStrength software [2]. These word lists differ in the words they include, e.g., some do not include strong obscene words and Internet slang acronyms, such as "WTF" and "LOL". The inclusion of such terms could be important for reaching good performance when working with the short informal texts found in Internet fora and microblogs. Word lists may also differ in whether the words are scored with sentiment strength or just positive/negative polarity.
I have begun to construct a new word list with sentiment strength and the inclusion of Internet slang and obscene words. Although we have used it for sentiment analysis on Twitter data [5], we have not yet validated it. Data sets with manually labeled texts can evaluate the performance of the different sentiment analysis methods. Researchers increasingly use Amazon Mechanical Turk (AMT) for creating labeled language data, see, e.g., [6]. Here I take advantage of this approach.
2 Construction of word list
My new word list was initially set up in 2009 for tweets downloaded for online sentiment analysis in relation to the United Nations Climate Conference (COP15). Since then it has been extended. The version termed AFINN-96, distributed on the Internet[1], has 1,468 different words, including a few phrases. The newest version has 2,477 unique words, including 15 phrases that were not used for this study. Like SentiStrength[2], it uses a scoring range from −5 (very negative) to +5 (very positive). For ease of labeling I only scored for valence, leaving out, e.g., subjectivity/objectivity, arousal and dominance. The words were scored manually by the author.
The word list was initiated from a set of obscene words [7, 8] as well as a few positive words. It was gradually extended by examining Twitter postings collected for COP15, particularly the postings which scored high on sentiment using the list as it grew. I included words from the public domain Original Balanced Affective Word List[3] by Greg Siegle. Later I added Internet slang by browsing the Urban Dictionary[4], including acronyms such as WTF, LOL and ROFL. The most recent additions come from the large word list by Steven J. DeRose, The Compass DeRose Guide to Emotion Words[5]. The words of DeRose are categorized but not scored for valence with numerical values. Together with the DeRose words I browsed Wiktionary and the synonyms it provided to further enhance the list. In some cases I used Twitter to determine in which contexts a word appeared. I also used the Microsoft Web n-gram similarity Web service ("Clustering words based on context similarity"[6]) to discover relevant words. I do not distinguish between word categories, so to avoid ambiguities I excluded words such as patient, firm, mean, power and frank. Words such as "surprise" (high arousal but variable sentiment) were not included in the word list.
Most of the positive words were labeled with +2 and most of the negative words with −2, see the histogram in Figure 1. I typically rated strong obscene words, e.g., as listed in [7], with either −4 or −5. A single phrase was labeled with valence 0. The word list has a bias towards negative words (1,598, corresponding to 65%) compared to positive words (878). The bias corresponds closely to the bias found in the OpinionFinder sentiment lexicon (4,911 (64%) negative and 2,718 positive words).
[1] http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=59819
[2] http://sentistrength.wlv.ac.uk/
[3] http://www.sci.sdsu.edu/CAL/wordlist/origwordlist.html
[4] http://www.urbandictionary.com
[5] http://www.derose.net/steve/resources/emotionwords/ewords.html
[6] http://web-ngram.research.microsoft.com/similarity/
Fig. 1. Histogram of my valences (valence, −5 to +5, against absolute frequency, for my word list).

I compared the score of each word with the mean valence of ANEW. Figure 2 shows a scatter plot for this comparison, yielding a Spearman's rank correlation of 0.81 when words are directly matched and only words in the intersection of the two word lists are included. I also tried to match entries in ANEW and my word list by applying Porter word stemming (on both word lists) and WordNet lemmatization (on my word list) as implemented in NLTK [9]. The results did not change significantly.
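The stemmed matching can be sketched as follows; here a crude suffix stripper stands in for NLTK's Porter stemmer, so this is an illustration of the matching procedure rather than the actual stemming used:

```python
def crude_stem(word):
    # Naive stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ation", "ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def match_stemmed(list_a, list_b):
    """Pair words from two lists whose (crude) stems coincide."""
    stems_b = {crude_stem(w): w for w in list_b}
    return {w: stems_b[crude_stem(w)] for w in list_a if crude_stem(w) in stems_b}


print(match_stemmed(["alienation", "happy"], ["alien", "sad"]))
```

With a proper Porter stemmer, pairs such as affection/affected would also be collapsed.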
When splitting the ANEW at valence 5 and my list at valence 0 I find a few discrepancies: aggressive, mischief, ennui, hard, silly, alert, mischiefs, noisy. Word stemming generates a few further discrepancies, e.g., alien/alienation, affection/affected, profit/profiteer.
Fig. 2. Correlation between ANEW and my new word list (Pearson correlation 0.91, Spearman correlation 0.81, Kendall correlation 0.63).
Apart from ANEW I also examined the General Inquirer and OpinionFinder word lists. As these word lists report polarity, I associated words with positive sentiment with the valence +1 and negative with −1. I furthermore obtained the sentiment strength from SentiStrength via its Web service[7] and converted its positive and negative sentiments to one single value by selecting the one with the numerically largest value and zeroing the sentiment if the positive and negative sentiment magnitudes were equal.
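That conversion rule can be sketched in a few lines (assuming SentiStrength's convention of a positive strength in +1…+5 and a negative strength in −1…−5):

```python
def combine(positive, negative):
    """Collapse SentiStrength's two scores into one signed value;
    zero when the positive and negative magnitudes tie."""
    if positive > abs(negative):
        return positive
    elif abs(negative) > positive:
        return negative
    return 0


print(combine(1, -2))  # -2, as for the example tweet in Table 1
```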
3 Twitter data
For evaluating and comparing the word list with ANEW, General Inquirer, OpinionFinder and SentiStrength, a data set of 1,000 tweets labeled with AMT was applied. These labeled tweets were collected by Alan Mislove for the Twittermood/"Pulse of a Nation"[8] study [10]. Each tweet was rated ten times to get a more reliable estimate of the human-perceived mood, and each rating was a sentiment strength given as an integer between 1 (negative) and 9 (positive). The average over the ten values represented the canonical "ground truth" for this study. The tweets were not used during the construction of the word list.
[7] http://sentistrength.wlv.ac.uk/
[8] http://www.ccs.neu.edu/home/amislove/twittermood/

Table 1. Example tweet scoring. 5 has been subtracted from the original ANEW scores. SentiStrength reported "positive strength 1 and negative strength −2". Only non-zero word scores are shown; the last column is the sum over all words.

Words: ear infection making it impossible 2 sleep headed 2 the doctors 2 get new prescription so fucking early

My     fucking: −4                      sum: −4
ANEW   infection: −3.34, sleep: +2.2   sum: −1.14
GI     infection: −1                   sum: −1
OF     impossible: −1                  sum: −1
SS                                          −2

To compute a sentiment score of a tweet I identified words and found the valence for each word by lookup in the sentiment lexicons. The sum of the valences of the words divided by the number of words represented the combined sentiment strength for a tweet. I also tried a few other weighting schemes: the sum of valences without normalization by the number of words, normalization of the sum by the number of words with non-zero valence, choosing the most extreme valence among the words, and quantizing the tweet valences to +1, 0 and −1. For ANEW I also applied a version with matching via the NLTK WordNet lemmatizer.
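The weighting schemes can be sketched as below. This is a simplified illustration: in particular, unlike the extreme scoring used in the study, this version does not zero ties of opposite sign, and the quantization is applied to the normalized tweet score.

```python
def tweet_scores(valences):
    """Apply the different weighting schemes to the per-word valences of one tweet."""
    n = len(valences)
    nonzero = [v for v in valences if v != 0]
    mean_all = sum(valences) / n                # sum normalized by number of words
    plain_sum = sum(valences)                   # sum without normalization
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else 0.0
    extreme = max(valences, key=abs)            # most extreme valence (simplified)
    sign = (mean_all > 0) - (mean_all < 0)      # quantize the tweet score to +1, 0, -1
    return mean_all, plain_sum, mean_nonzero, extreme, sign


print(tweet_scores([0, 0, -4, 2]))  # (-0.5, -2, -1.0, -4, -1)
```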
4 Results
Fig. 3. Scatter plot of sentiment strengths for 1,000 tweets, with AMT sentiment plotted against sentiment found by application of my word list (Pearson correlation 0.564, Spearman correlation 0.596, Kendall correlation 0.439).
My word tokenization identified 15,768 words in total among the 1,000 tweets, with 4,095 unique words. Of these 4,095 words, 422 hit my 2,477-word list, while the corresponding number for ANEW was 398 of its 1,034 words. Of the 3,392 words in General Inquirer that I labeled with non-zero sentiment, 358 were found in the Twitter corpus; for OpinionFinder this number was 562 of a total of 6,442. See Table 1 for a scored example tweet.
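The coverage numbers amount to counting the unique corpus tokens that hit a lexicon's keys; a minimal sketch with toy data (not the actual corpus or lexicon):

```python
def lexicon_coverage(tokens, lexicon):
    """Count how many unique tokens appear among the lexicon's keys."""
    return len(set(tokens) & set(lexicon))


# Toy example: two of the three unique tokens are in the lexicon.
print(lexicon_coverage(["sad", "sad", "happy", "the"], {"happy": 3, "sad": -2}))  # 2
```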
Table 2. Pearson correlations between sentiment strength detection methods on 1,000 tweets. AMT: Amazon Mechanical Turk, GI: General Inquirer, OF: OpinionFinder, SS: SentiStrength.

          My   ANEW    GI    OF    SS
AMT     .564   .525  .374  .458  .610
My             .696  .525  .675  .604
ANEW                 .592  .624  .546
GI                         .705  .474
OF                               .512
I found my list to have a higher correlation (Pearson correlation: 0.564, Spearman's rank correlation: 0.596, see the scatter plot in Figure 3) with the labeling from the AMT than ANEW had (Pearson: 0.525, Spearman: 0.544). In my application the General Inquirer word list did not perform well, having a considerably lower AMT correlation than my list and ANEW (Pearson: 0.374, Spearman: 0.422). OpinionFinder with its 90% larger lexicon performed better than General Inquirer but not as well as my list and ANEW (Pearson: 0.458, Spearman: 0.491). The SentiStrength analyzer showed superior performance with a Pearson correlation of 0.610 and a Spearman correlation of 0.616, see Table 2.
I saw little effect of the different tweet sentiment scoring approaches: for ANEW the four Pearson correlations were in the range 0.522–0.526. For my list I observed correlations in the range 0.543–0.581, with extreme scoring the lowest and sum scoring without normalization the highest. With quantization of the tweet scores to +1, 0 and −1 the correlation only dropped to 0.548. For the Spearman correlation, sum scoring with normalization by the number of words gave the highest value (0.596).
Fig. 4. Performance growth with word list extension from 5 words to 2,477 words. Upper panel: Pearson correlation, lower panel: Spearman rank correlation, generated from 50 resamples among the 2,477 words.
To examine whether the difference in performance between the application of ANEW and my list is due to a different lexicon or a different scoring, I looked at the intersection between the two word lists. With a direct match this intersection consisted of 299 words. Building two new sentiment lexicons from these 299 words, one with the valences from my list and the other with the valences from ANEW, and applying them to the Twitter data, I found Pearson correlations of 0.49 and 0.52, to ANEW's advantage.
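The construction of the two intersection lexicons can be sketched as follows (with toy valences, not the real lists); restricting both lexicons to the shared words makes any remaining performance difference attributable to the scoring rather than the vocabulary:

```python
def intersection_lexicons(lex_a, lex_b):
    """Restrict two sentiment lexicons to their shared words,
    keeping each list's own valences."""
    shared = set(lex_a) & set(lex_b)
    return ({w: lex_a[w] for w in shared}, {w: lex_b[w] for w in shared})


a, b = intersection_lexicons({"win": 4, "lol": 3}, {"win": 2.1, "fear": -3.5})
print(a, b)  # {'win': 4} {'win': 2.1}
```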
5 Discussion
On the simple word list approach for sentiment analysis I found my list performing slightly ahead of ANEW. However, the more elaborate sentiment analysis in SentiStrength showed the overall best performance, with a correlation to the AMT labels of 0.610. This figure is close to the correlations reported in the evaluation of the SentiStrength algorithm on 1,041 MySpace comments (0.60 and 0.56) [2].
Even though General Inquirer and OpinionFinder have the largest word lists, I could not make them perform as well as SentiStrength, my list and ANEW for sentiment strength detection in microblog postings. Both of these lists score words by polarity rather than strength, which could explain the difference in performance.
Is the difference between my list and ANEW due to better scoring or more words? The analysis of the intersection between the two word lists indicated that the ANEW scoring is better. The slightly better performance of my list with the entire lexicon may be due to its inclusion of Internet slang and obscene words.
Newer methods, e.g., as implemented in SentiStrength, use a range of techniques: detection of negation, handling of emoticons and spelling variations [2]. The present application of my list used none of these approaches and might have benefited from them. However, the SentiStrength evaluation showed that valence switching at negation and emoticon detection do not necessarily increase the performance of sentiment analyzers (Tables 4 and 5 in [2]).
The evolution of the performance (Figure 4) suggests that the addition of words to my list might still improve its performance slightly.
Although my list comes out slightly ahead of ANEW in Twitter sentiment analysis, ANEW is still preferable for scientific psycholinguistic studies, as its scoring has been validated across several persons. Also note that ANEW's standard deviations were not used in the scoring here; using them might have improved its performance.
Acknowledgment I am grateful to Alan Mislove and Sune Lehmann for providing the 1,000 tweets with the Amazon Mechanical Turk labels and to Steven J. DeRose and Greg Siegle for providing their word lists. Mislove, Lehmann and Daniela Balslev also provided input to the article. I thank the Danish Strategic Research Councils for generous support to the 'Responsible Business in the Blogosphere' project.
References
1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1-2) (2008) 1–135
2. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61 (12) (2010) 2544–2558
3. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida (1999)
4. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase- level sentiment analysis. In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, Association for Computational Linguistics (2005)
5. Hansen, L.K., Arvidsson, A., Nielsen, F.Å., Colleoni, E., Etter, M.: Good friends, bad news — affect and virality in Twitter. Accepted for The 2011 International Workshop on Social Computing, Network, and Services (SocialComNet 2011) (2011)
6. Akkaya, C., Conrad, A., Wiebe, J., Mihalcea, R.: Amazon Mechanical Turk for subjectivity word sense disambiguation. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Association for Computational Linguistics (2010) 195–203
7. Baudhuin, E.S.: Obscene language and evaluative response: an empirical study. Psychological Reports 32 (1973)
8. Sapolsky, B.S., Shafer, D.M., Kaye, B.K.: Rating offensive words in three television program contexts. BEA 2008, Research Division (2008)
9. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly, Sebastopol, California (June 2009)
10. Biever, C.: Twitter mood maps reveal emotional states of America. The New Scientist 207 (2771) (July 2010) 14
Supplementary material
A Extra results
Table 3. Pearson correlations with the Amazon Mechanical Turk (AMT) labeling. AFINN is my list. AFINN: sum of word valences normalized by the number of words in the text, AFINN(q): (+1, 0, −1)-quantization, AFINN(s): sum of word valences, AFINN(a): average of non-zero word valences, ANEW(r): ANEW with "raw" (direct match) and sum of word valences normalized by the number of words, ANEW(l): ANEW with NLTK WordNet lemmatizer word match, ANEW(rs): raw match with sum of word valences, ANEW(a): average of non-zero word valences.

     Mislove AFINN-96 AFINN AFINN(q) AFINN(s) AFINN(a) ANEW(r) ANEW(l) ANEW(rs) ANEW(a)
AMT    0.506    0.546 0.564    0.548    0.581    0.564   0.526   0.527    0.526   0.523

(The remaining pairwise correlations among the methods are not reproduced here.)
Table 4. Spearman rank correlations with the AMT labeling. For explanation of columns see Table 3.

     Mislove AFINN-96 AFINN AFINN(q) AFINN(s) AFINN(a) ANEW(r) ANEW(l) ANEW(rs) ANEW(a)
AMT    0.507    0.592 0.596    0.580    0.591    0.581   0.545   0.548    0.537   0.526

(The remaining pairwise correlations among the methods are not reproduced here.)
Listings
#!/usr/bin/env python
#
# This program goes through the list and captures double entries
# and ordering problems
#
# $Id: Nielsen2011New_check.py,v 1.4 2011/03/16 11:34:24 fn Exp $

filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'

lines = [line.split('\t') for line in open(filename_afinn)]
for line in lines:
    if len(line) != 2:
        print(line)

afinn = map(lambda (k, v): (k, int(v)),
            [line.split('\t') for line in open(filename_afinn)])
words = [word for word, valence in afinn]

swords = sorted(list(set(words)))

for n in range(len(words)):
    if words[n] != swords[n]:
        print(words[n] + " " + swords[n])
        break
#!/usr/bin/env python
#
# Construct histogram of AFINN valences.
#
# $Id: Nielsen2011New_hist.py,v 1.2 2011/03/14 20:53:17 fn Exp $

import numpy as np
import pylab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

filebase = '/home/fnielsen/'
filename = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename)]))

pylab.hist(afinn.values(), bins=11, range=(-5.5, 5.5), facecolor=(0, 1, 0))
pylab.xlabel('My valences')
pylab.ylabel('Absolute frequency')
pylab.title('Histogram of valences for my word list')

pylab.show()

# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_hist.eps')
1 #! / u s r / b i n / env python
2 #
3 # $ I d : N i e l se n 2 0 1 1 N e w e x a m p l e . py , v 1 . 2 2 0 1 1 / 0 4 / 1 1 1 7 : 3 1 : 2 2 f n Exp $ 4
5 import c s v
6 import r e
7 import numpy a s np
8 from n l t k.stem.wordnet import WordNetLemmatizer
9 from n l t k import s e n t t o k e n i z e 10 import p y l a b
11 from s c i p y.s t a t s.s t a t s import k e n d a l l t a u, sp e a r m a n r 12 import s i m p l e j s o n
13
14 # Th i s v a r i a b l e d e t e r m i n e s t h e d a t a s e t 15 m i s l o v e = 2
16
17 # F i l e n a m e s
18 f i l e b a s e = ’ /home/ f n / ’
19 f i l e n a m e a f i n n = f i l e b a s e + ’ f n i e l s e n / d a t a / N i e l s e n 2 0 0 9 R e s p o n s i b l e e m o t i o n . c s v ’ 20 f i l e n a m e a f i n n 9 6 = ’AFINN−96. t x t ’
21 f i l e n a m e m i s l o v e 1 = ” r e s u l t s . t x t ” 22 f i l e n a m e m i s l o v e 2 a = ” t u r k . t x t ”
23 f i l e n a m e m i s l o v e 2 b = ” t u r k−r e s u l t s . t x t ” 24 f i l e n a m e a n e w = f i l e b a s e + ’ d a t a /ANEW.TXT ’ 25 f i l e n a m e g i = f i l e b a s e + ” d a t a / i n q t a b s . t x t ”
26 f i l e n a m e o f = f i l e b a s e + ” d a t a / s u b j c l u e s l e n 1−HLTEMNLP05. t f f ” 27
28 # Word s p l i t t e r p a t t e r n
29 p a t t e r n s p l i t = r e.c o m p i l e(r”\W+”) 30
31
32 a f i n n = d i c t(map(lambda (k,v) : (k,i n t(v) ) ,
33 [ l i n e.s p l i t(’\t ’) f o r l i n e i n open(f i l e n a m e a f i n n) ] ) ) 34
35
36 a f i n n 9 6 = d i c t(map(lambda (k,v) : (k,i n t(v) ) ,
37 [ l i n e.s p l i t(’\t ’) f o r l i n e i n open(f i l e n a m e a f i n n 9 6) ] ) ) 38
39
40 # ANEW
41 anew = d i c t(map(lambda l: (l[ 0 ] , f l o a t(l[ 2 ] ) − 5 ) ,
42 [ r e.s p l i t(’\s+ ’, l i n e) f o r l i n e i n open(f i l e n a m e a n e w) .r e a d l i n e s( ) [ 4 1 : 1 0 7 5 ] ] ) ) 43
44
45 # O p i n i o n F i n d e r 46 o f = {}
47 f o r l i n e i n open(f i l e n a m e o f) .r e a d l i n e s( ) : 48 e l e m e n t s = r e.s p l i t(’ (\s| \= ) ’, l i n e) 49 i f e l e m e n t s[ 2 2 ] == ’ p o s i t i v e ’: 50 o f[e l e m e n t s[ 1 0 ] ] = +1 51 e l i f e l e m e n t s[ 2 2 ] == ’ n e g a t i v e ’: 52 o f[e l e m e n t s[ 1 0 ] ] = −1 53
54
55 # G e n e r a l i n q u i r e r
56 c s v r e a d e r = c s v.r e a d e r(open(f i l e n a m e g i) , d e l i m i t e r=’\t ’) 57 h e a d e r = [ ]
58 g i = {}
59 p r e v i o u s w o r d = [ ] 60 p r e v i o u s v a l e n c e = [ ] 61 f o r row i n c s v r e a d e r:
62 i f not h e a d e r:
63 h e a d e r = row
64 e l i f l e n(row) > 2 :
65 word = r e.s e a r c h(”\w+”, row[ 0 ] .l o w e r( ) ) .g r o u p( ) 66 i f row[ 2 ] == ” P o s i t i v ”:
67 v a l e n c e = 1
68 e l i f row[ 3 ] == ” N e g a t i v ”:
69 v a l e n c e = −1
70 e l s e:
71 v a l e n c e = 0
72 i f r e.s e a r c h(”#”, row[ 0 ] .l o w e r( ) ) :
73 i f p r e v i o u s w o r d == word:
74 i f p r e v i o u s v a l e n c e == [ ] :
75 p r e v i o u s v a l e n c e = v a l e n c e
76 e l i f p r e v i o u s v a l e n c e != v a l e n c e:
77 p r e v i o u s v a l e n c e = 0
78 e l s e:
79 i f p r e v i o u s v a l e n c e:
80 g i[p r e v i o u s w o r d] = p r e v i o u s v a l e n c e
81 p r e v i o u s w o r d = word
82 p r e v i o u s v a l e n c e = [ ]
83 e l i f v a l e n c e:
84 g i[word] = v a l e n c e
85 86 87
88 # Lemmatizer f o r WordNet
89 l e m m a t i z e r = WordNetLemmatizer( ) 90
91 92
93 d ef wo r d s2 a n e wse n t i m e n t(words) :
94 ”””
95 Co n v e r t words t o s e n t i m e n t b a se d on ANEW v i a WordNet Lemmatizer
96 ”””
97 s e n t i m e n t = 0 98 f o r word i n words:
99 i f word i n anew:
100 s e n t i m e n t += anew[word]
101 continue
102 l wo r d = l e m m a t i z e r.l e m m a t i z e(word)
103 i f l wo r d i n anew:
104 s e n t i m e n t += anew[l wo r d]
105 continue
106 l wo r d = l e m m a t i z e r.l e m m a t i z e(word, p o s=’ v ’)
107 i f l wo r d i n anew:
108 s e n t i m e n t += anew[l wo r d]
109 return s e n t i m e n t/l e n(words) 110
111
112 d ef e x t r e m e v a l e n c e(v a l e n c e s) :
113 ”””
114 Return t h e most e x t r e m e v a l e n c e . I f e x t r e m e s have d i f f e r e n t s i g n t h e n 115 z e r o i s r e t u r n e d
116 ”””
117 imax = np.a r g s o r t(np.a b s(v a l e n c e s) ) [−1 ]
118 e x t r e m e s = f i l t e r(lambda v: a b s(v) == a b s(v a l e n c e s[imax] ) , v a l e n c e s) 119 e x t r e m e s s a m e s i g n = f i l t e r(lambda v: v == v a l e n c e s[imax] , v a l e n c e s) 120 i f e x t r e m e s == e x t r e m e s s a m e s i g n:
121 return v a l e n c e s[imax] 122 e l s e:
123 return 0
124
125 d ef c o r r m a t r i x 2 l a t e x(C, columns=None) :
126 ”””
127 s = c o r r m a t r i x 2 l a t e x (C) 128 p r i n t ( s )
129 ”””
130 s = ’\n\\b e g i n{t a b u l a r}{’ + (’ r ’∗(C.sh a p e[ 0 ]−1 ) ) + ’}\n ’
131 i f columns:
132 s += ” & ”.j o i n(columns[ 1 : ] ) + ” \\\\\n\\h l i n e\n”
133 f o r n i n r a n g e(C.sh a p e[ 0 ] ) :
134 row = [ ]
135 f o r m i n r a n g e( 1 , C.sh a p e[ 1 ] ) :
136 i f m> n:
137 row.append(” %.3 f ” %C[n,m] )
138 e l s e:
139 row.append(” ”)
140 s += ” & ”.j o i n(row) + ’ \\\\\n ’
141 s += ’\\end{t a b u l a r}\n ’
142 return s 143
144
145 d ef sp e a r m a n m a t r i x(d a t a) :
146 ”””
147 Spearman r r a n k c o r r e l a t i o n m a t r i x
148 ”””
149 C= np.z e r o s( (d a t a.sh a p e[ 1 ] , d a t a.sh a p e[ 1 ] ) ) 150 f o r n i n r a n g e(d a t a.sh a p e[ 1 ] ) :
151 f o r m i n r a n g e(d a t a.sh a p e[ 1 ] ) :
152 C[n,m] = sp e a r m a n r(d a t a[ : ,n] , d a t a[ : ,m] ) [ 0 ]
153 return C
154 155
156 d ef k e n d a l l m a t r i x(d a t a) :
157 ”””
158 K e n d a l l t a u r a n k c o r r e l a t i o n m a t r i x
159 ”””
160 C= np.z e r o s( (d a t a.sh a p e[ 1 ] , d a t a.sh a p e[ 1 ] ) ) 161 f o r n i n r a n g e(d a t a.sh a p e[ 1 ] ) :
162 f o r m i n r a n g e(d a t a.sh a p e[ 1 ] ) :
163 C[n,m] = k e n d a l l t a u(d a t a[ : ,n] , d a t a[ : ,m] ) [ 0 ]
164 return C
165 166 167
168 # Read M i s l o v e CSV T w i t t e r d a t a : ’ t w e e t s ’ an a r r a y o f d i c t i o n a r i e s 169 i f m i s l o v e == 1 :
170 c s v r e a d e r = c s v.r e a d e r(open(f i l e n a m e m i s l o v e 1) , d e l i m i t e r=’\t ’)
171 h e a d e r = [ ]
172 t w e e t s = [ ]
173 f o r row i n c s v r e a d e r:
174 i f not h e a d e r:
175 h e a d e r = row
176 e l s e:
177 t w e e t s.append({’ i d ’: row[ 0 ] ,
178 ’ quant ’: i n t(row[ 1 ] ) ,
179 ’ s c o r e o u r ’: f l o a t(row[ 2 ] ) ,
180 ’ s c o r e m e a n ’: f l o a t(row[ 3 ] ) ,
181 ’ s c o r e s t d ’: f l o a t(row[ 4 ] ) ,
182 ’ t e x t ’: row[ 5 ] ,
183 ’ s c o r e s ’: map(i n t, row[ 6 : ] )})
184 e l i f m i s l o v e == 2 :
185 i f F a l s e:
186 t w e e t s = s i m p l e j s o n.l o a d(open(’ t w e e t s w i t h s e n t i s t r e n g t h . j s o n ’) ) 187 e l s e:
188 c s v r e a d e r = c s v.r e a d e r(open(f i l e n a m e m i s l o v e 2 a) , d e l i m i t e r=’\t ’)
189 t w e e t s = [ ]
190 f o r row i n c s v r e a d e r:
191 t w e e t s.append({’ i d ’: row[ 2 ] ,
192 ’ s c o r e m i s l o v e 2 ’: f l o a t(row[ 1 ] ) ,
193 ’ t e x t ’: row[ 3 ]})
194 c s v r e a d e r = c s v.r e a d e r(open(f i l e n a m e m i s l o v e 2 b) , d e l i m i t e r=’\t ’)
195 t w e e t s d i c t = {}
196 h e a d e r = [ ]
197 f o r row i n c s v r e a d e r:
198 i f not h e a d e r:
199 h e a d e r = row
200 e l s e:
201 t w e e t s d i c t[row[ 0 ] ] = {’ i d ’: row[ 0 ] ,
202 ’ s c o r e m i s l o v e ’: f l o a t(row[ 1 ] ) ,
203 ’ s c o r e a m t w r o n g ’: f l o a t(row[ 2 ] ) ,
204 ’ s c o r e a m t ’: np.mean(map(i n t, r e.s p l i t(”\s+”, ” ”.j o i n(row
205 }
206 f o r n i n r a n g e(l e n(t w e e t s) ) :
207 t w e e t s[n] [’ s c o r e m i s l o v e ’] = t w e e t s d i c t[t w e e t s[n] [’ i d ’] ] [’ s c o r e m i s l o v e ’] 208 t w e e t s[n] [’ s c o r e a m t w r o n g ’] = t w e e t s d i c t[t w e e t s[n] [’ i d ’] ] [’ s c o r e a m t w r o n g ’] 209 t w e e t s[n] [’ s c o r e a m t ’] = t w e e t s d i c t[t w e e t s[n] [’ i d ’] ] [’ s c o r e a m t ’]
210 211
212 # Add s e n t i m e n t s t o ’ t w e e t s ’ 213 f o r n i n r a n g e(l e n(t w e e t s) ) :
214 words = p a t t e r n s p l i t.s p l i t(t w e e t s[n] [ ’ t e x t ’] .l o w e r( ) ) 215 t w e e t s[n] [’ words ’] = words
216 a f i n n s e n t i m e n t s = map(lambda word: a f i n n.g e t(word, 0 ) , words) 217 a f i n n s e n t i m e n t = f l o a t(sum(a f i n n s e n t i m e n t s) ) /l e n(a f i n n s e n t i m e n t s) 218 a f i n n 9 6 s e n t i m e n t s = map(lambda word: a f i n n 9 6.g e t(word, 0 ) , words)
219 a f i n n 9 6 s e n t i m e n t = f l o a t(sum(a f i n n 9 6 s e n t i m e n t s) ) /l e n(a f i n n 9 6 s e n t i m e n t s) 220 a n e w s e n t i m e n t s = map(lambda word: anew.g e t(word, 0 ) , words)
221 a n e w s e n t i m e n t = f l o a t(sum(a n e w s e n t i m e n t s) ) /l e n(a n e w s e n t i m e n t s) 222 g i s e n t i m e n t s = map(lambda word: g i.g e t(word, 0 ) , words)
223 g i s e n t i m e n t = f l o a t(sum(g i s e n t i m e n t s) ) /l e n(g i s e n t i m e n t s) 224 o f s e n t i m e n t s = map(lambda word: o f.g e t(word, 0 ) , words) 225 o f s e n t i m e n t = f l o a t(sum(o f s e n t i m e n t s) ) /l e n(o f s e n t i m e n t s) 226 t w e e t s[n] [’ s e n t i m e n t a f i n n 9 6 ’] = a f i n n 9 6 s e n t i m e n t
227 t w e e t s[n] [’ s e n t i m e n t a f i n n ’] = a f i n n s e n t i m e n t
    tweets[n]['sentiment_afinn_quant'] = np.sign(afinn_sentiment)
    tweets[n]['sentiment_afinn_sum'] = sum(afinn_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, afinn_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_afinn_nonzero'] = sum(afinn_sentiments)/nonzeros
    tweets[n]['sentiment_afinn_extreme'] = extreme_valence(afinn_sentiments)
    tweets[n]['sentiment_anew_raw'] = anew_sentiment
    tweets[n]['sentiment_anew_lemmatize'] = words2anew_sentiment(words)
    tweets[n]['sentiment_anew_raw_sum'] = sum(anew_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, anew_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_anew_raw_nonzeros'] = sum(anew_sentiments)/nonzeros
    tweets[n]['sentiment_gi_raw'] = gi_sentiment
    tweets[n]['sentiment_of_raw'] = of_sentiment


# Index for example tweet
index = 10
words = tweets[index]['words'][:-1]
s = ""
s += "Text: & \\multicolumn{%d}{c}{" % len(words) + tweets[index]['text'] + "} \\\\[1pt] \\hline\n"
s += "Words: & " + " & ".join(words) + " \\\\[1pt]\n"
s += "My & " + " & ".join([str(afinn.get(w, 0)) for w in words]) + " & " + str(sum([afinn.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "ANEW & " + " & ".join([str(anew.get(w, 0)) for w in words]) + " & " + str(sum([anew.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "GI & " + " & ".join([str(gi.get(w, 0)) for w in words]) + " & " + str(sum([gi.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "OF & " + " & ".join([str(of.get(w, 0)) for w in words]) + " & " + str(sum([of.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "SS " + " & " * (len(words)+1) + " -1 " + "\\\\[1pt] \\hline\n"

# 'Ear infection making it impossible 2 sleep. headed 2 the doctors 2
# get new prescription. so fucking' has positive strength 1 and negative strength -2

print(s)


score_amt = np.asarray([t['score_amt'] for t in tweets])
score_afinn = np.asarray([t['sentiment_afinn'] for t in tweets])

t = """
               AMT
         positive neutral negative
AFINN positive %d %d %d
      neutral  %d %d %d
      negative %d %d %d
""" % (sum((1*(score_amt>5)) * (1*(score_afinn>0))),
       sum((1*(score_amt==5)) * (1*(score_afinn>0))),
       sum((1*(score_amt<5)) * (1*(score_afinn>0))),
       sum((1*(score_amt>5)) * (1*(score_afinn==0))),
       sum((1*(score_amt==5)) * (1*(score_afinn==0))),
       sum((1*(score_amt<5)) * (1*(score_afinn==0))),
       sum((1*(score_amt>5)) * (1*(score_afinn<0))),
       sum((1*(score_amt==5)) * (1*(score_afinn<0))),
       sum((1*(score_amt<5)) * (1*(score_afinn<0))))

print(t)

# 0.1*(277 + 299 + 5)
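The nine sums above cross-tabulate the Amazon Mechanical Turk labels (above, at, or below the neutral score 5) against the sign of the AFINN score. A minimal Python 3 sketch of the same cross-tabulation with toy scores (the function name `agreement_table` is hypothetical, not from the scripts):

```python
import numpy as np

def agreement_table(score_amt, score_afinn):
    """3x3 counts: rows = AFINN sign (+, 0, -), columns = AMT (>5, ==5, <5)."""
    score_amt = np.asarray(score_amt)
    score_afinn = np.asarray(score_afinn)
    amt_bins = [score_amt > 5, score_amt == 5, score_amt < 5]
    afinn_bins = [score_afinn > 0, score_afinn == 0, score_afinn < 0]
    # Each cell counts tweets falling in both the row and the column bin
    return np.array([[int(np.sum(a & b)) for b in amt_bins] for a in afinn_bins])

# Toy data: AMT scores 7, 3, 5 against AFINN scores 2.0, -1.0, 0.5
table = agreement_table([7, 3, 5], [2.0, -1.0, 0.5])
```

Diagonal-heavy tables indicate agreement between the word list and the human raters.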
#!/usr/bin/env python
#
# The program compares AFINN and ANEW word lists.
#
# (Copied from Hansen2010Diffusion and extended.)
#
# $Id: Hansen2010Diffusion_anew.py,v 1.3 2010/12/15 15:50:39 fn Exp $

from nltk.stem.wordnet import WordNetLemmatizer
import nltk
import numpy as np
import pylab
import re
from scipy.stats.stats import kendalltau, spearmanr
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


filebase = '/home/fn/'
filename = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename)]))

filename = filebase + 'data/ANEW.TXT'
anew = dict(map(lambda l: (l[0], float(l[2])),
                [re.split('\s+', line) for line in open(filename).readlines()[41:1075]]))

lemmatizer = WordNetLemmatizer()
stemmer = nltk.PorterStemmer()

anew_stem = dict([(stemmer.stem(word), valence) for word, valence in anew.items()])


def word2anew_sentiment_raw(word):
    return anew.get(word, None)


def word2anew_sentiment_wordnet(word):
    sentiment = None
    if word in anew:
        sentiment = anew[word]
    else:
        lword = lemmatizer.lemmatize(word)
        if lword in anew:
            sentiment = anew[lword]
        else:
            lword = lemmatizer.lemmatize(word, pos='v')
            if lword in anew:
                sentiment = anew[lword]
    return sentiment


def word2anew_sentiment_stem(word):
    return anew_stem.get(stemmer.stem(word), None)


sentiments_raw = []
for word in afinn.keys():
    sentiment_anew = word2anew_sentiment_raw(word)
    if sentiment_anew:
        sentiments_raw.append((afinn[word], sentiment_anew))


sentiments_wordnet = []
for word in afinn.keys():
    sentiment_anew = word2anew_sentiment_wordnet(word)
    if sentiment_anew:
        sentiments_wordnet.append((afinn[word], sentiment_anew))
        if (afinn[word] > 0 and sentiment_anew < 5) or \
                (afinn[word] < 0 and sentiment_anew > 5):
            print(word)


sentiments_stem = []
for word in afinn.keys():
    sentiment_stem_anew = word2anew_sentiment_stem(word)
    if sentiment_stem_anew:
        sentiments_stem.append((afinn[word], sentiment_stem_anew))
        if (afinn[word] > 0 and sentiment_stem_anew < 5) or \
                (afinn[word] < 0 and sentiment_stem_anew > 5):
            print(word)


sentiments_raw = np.asarray(sentiments_raw)
pylab.figure(1)
pylab.plot(sentiments_raw[:, 0], sentiments_raw[:, 1], '.')
pylab.xlabel('Our list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (Direct match)')
pylab.text(1, 3, "Pearson correlation = %.2f" % np.corrcoef(sentiments_raw.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments_raw[:, 0], sentiments_raw[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments_raw[:, 0], sentiments_raw[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_raw.eps')

# pylab.show()


sentiments = np.asarray(sentiments_wordnet)
pylab.figure(2)
pylab.plot(sentiments[:, 0], sentiments[:, 1], '.')
pylab.xlabel('Our list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (WordNet lemmatizer)')
pylab.text(1, 3, "Pearson correlation = %.2f" % np.corrcoef(sentiments.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments[:, 0], sentiments[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments[:, 0], sentiments[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_wordnet.eps')

# pylab.show()


sentiments_stem = np.asarray(sentiments_stem)
pylab.figure(3)
pylab.plot(sentiments_stem[:, 0], sentiments_stem[:, 1], '.')
pylab.xlabel('My list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (Porter stemmer)')
pylab.text(1, 3, "Correlation = %.2f" % np.corrcoef(sentiments_stem.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments_stem[:, 0], sentiments_stem[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments_stem[:, 0], sentiments_stem[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_stem.eps')

# pylab.show()
#!/usr/bin/env python
#
# $Id: Nielsen2011New.py,v 1.10 2011/03/16 13:41:36 fn Exp $

import csv
import re
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import sent_tokenize
import pylab
from scipy.stats.stats import kendalltau, spearmanr
import simplejson

# This variable determines the dataset
mislove = 2

# Filenames
filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_afinn96 = 'AFINN-96.txt'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"
filename_anew = filebase + 'data/ANEW.TXT'
filename_gi = filebase + "data/inqtabs.txt"
filename_of = filebase + "data/subjclueslen1-HLTEMNLP05.tff"

# Word splitter pattern
pattern_split = re.compile(r"\W+")


afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename_afinn)]))


afinn96 = dict(map(lambda (k, v): (k, int(v)),
                   [line.split('\t') for line in open(filename_afinn96)]))


# ANEW
anew = dict(map(lambda l: (l[0], float(l[2]) - 5),
                [re.split('\s+', line) for line in open(filename_anew).readlines()[41:1075]]))


# OpinionFinder
of = {}
for line in open(filename_of).readlines():
    elements = re.split('(\s|\=)', line)
    if elements[22] == 'positive':
        of[elements[10]] = +1
    elif elements[22] == 'negative':
        of[elements[10]] = -1


# General Inquirer
csv_reader = csv.reader(open(filename_gi), delimiter='\t')
header = []
gi = {}
previous_word = []
previous_valence = []
for row in csv_reader:
    if not header:
        header = row
    elif len(row) > 2:
        word = re.search("\w+", row[0].lower()).group()
        if row[2] == "Positiv":
            valence = 1
        elif row[3] == "Negativ":
            valence = -1
        else:
            valence = 0
        if re.search("#", row[0].lower()):
            if previous_word == word:
                if previous_valence == []:
                    previous_valence = valence
                elif previous_valence != valence:
                    previous_valence = 0
            else:
                if previous_valence:
                    gi[previous_word] = previous_valence
                previous_word = word
                previous_valence = []
        elif valence:
            gi[word] = valence
# Lemmatizer for WordNet
lemmatizer = WordNetLemmatizer()


def words2anew_sentiment(words):
    """
    Convert words to sentiment based on ANEW via WordNet Lemmatizer
    """
    sentiment = 0
    for word in words:
        if word in anew:
            sentiment += anew[word]
            continue
        lword = lemmatizer.lemmatize(word)
        if lword in anew:
            sentiment += anew[lword]
            continue
        lword = lemmatizer.lemmatize(word, pos='v')
        if lword in anew:
            sentiment += anew[lword]
    return sentiment/len(words)


def extreme_valence(valences):
    """
    Return the most extreme valence. If extremes have different sign then
    zero is returned
    """
    imax = np.argsort(np.abs(valences))[-1]
    extremes = filter(lambda v: abs(v) == abs(valences[imax]), valences)
    extremes_same_sign = filter(lambda v: v == valences[imax], valences)
    if extremes == extremes_same_sign:
        return valences[imax]
    else:
        return 0
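The extreme-valence scoring variant ("AFINN(x)" in the correlation tables) keeps only the word with the largest absolute valence; when equally extreme words disagree in sign the score cancels to zero. A self-contained Python 3 restatement of the same rule:

```python
import numpy as np

def extreme_valence(valences):
    """Most extreme valence in the list; zero when the extremes disagree in sign."""
    valences = list(valences)
    # Index of the value with the largest magnitude
    imax = int(np.argsort(np.abs(valences))[-1])
    extremes = [v for v in valences if abs(v) == abs(valences[imax])]
    extremes_same_sign = [v for v in valences if v == valences[imax]]
    if extremes == extremes_same_sign:
        return valences[imax]
    return 0
```

For example, [1, -3, 2] yields -3, while [3, -3, 1] yields 0 because +3 and -3 are equally extreme.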
def corrmatrix2latex(C, columns=None):
    """
    s = corrmatrix2latex(C)
    print(s)
    """
    s = '\n\\begin{tabular}{' + ('r' * (C.shape[0] - 1)) + '}\n'
    if columns:
        s += " & ".join(columns[1:]) + " \\\\\n\\hline\n"
    for n in range(C.shape[0]):
        row = []
        for m in range(1, C.shape[1]):
            if m > n:
                row.append(" %.3f " % C[n, m])
            else:
                row.append(" ")
        s += " & ".join(row) + ' \\\\\n'
    s += '\\end{tabular}\n'
    return s


def spearman_matrix(data):
    """
    Spearman r rank correlation matrix
    """
    C = np.zeros((data.shape[1], data.shape[1]))
    for n in range(data.shape[1]):
        for m in range(data.shape[1]):
            C[n, m] = spearmanr(data[:, n], data[:, m])[0]
    return C


def kendall_matrix(data):
    """
    Kendall tau rank correlation matrix
    """
    C = np.zeros((data.shape[1], data.shape[1]))
    for n in range(data.shape[1]):
        for m in range(data.shape[1]):
            C[n, m] = kendalltau(data[:, n], data[:, m])[0]
    return C
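The two rank-correlation helpers above lean on SciPy's `spearmanr` and `kendalltau`. For tie-free data, the Spearman coefficient is simply the Pearson correlation of the ranks, which a numpy-only sketch makes explicit (the function name `spearman` here is illustrative, not from the scripts):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for tie-free data: Pearson on the ranks."""
    # argsort of argsort turns values into their 0-based ranks
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```

Any strictly monotone relation, however nonlinear, scores exactly +1 or -1 under this measure, which is why rank correlations suit sentiment scores on incomparable scales.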
# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    if True:
        tweets = simplejson.load(open('tweets_with_sentistrength.json'))
    else:
        csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
        tweets = []
        for row in csv_reader:
            tweets.append({'id': row[2],
                           'score_mislove2': float(row[1]),
                           'text': row[3]})
        csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
        tweets_dict = {}
        header = []
        for row in csv_reader:
            if not header:
                header = row
            else:
                tweets_dict[row[0]] = {'id': row[0],
                                       'score_mislove': float(row[1]),
                                       'score_amt_wrong': float(row[2]),
                                       'score_amt': np.mean(map(int, re.split("\s+", " ".join(row[4:]))))
                                       }
        for n in range(len(tweets)):
            tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
            tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
            tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']
# Add sentiments to 'tweets'
for n in range(len(tweets)):
    words = pattern_split.split(tweets[n]['text'].lower())
    afinn_sentiments = map(lambda word: afinn.get(word, 0), words)
    afinn_sentiment = float(sum(afinn_sentiments))/len(afinn_sentiments)
    afinn96_sentiments = map(lambda word: afinn96.get(word, 0), words)
    afinn96_sentiment = float(sum(afinn96_sentiments))/len(afinn96_sentiments)
    anew_sentiments = map(lambda word: anew.get(word, 0), words)
    anew_sentiment = float(sum(anew_sentiments))/len(anew_sentiments)
    gi_sentiments = map(lambda word: gi.get(word, 0), words)
    gi_sentiment = float(sum(gi_sentiments))/len(gi_sentiments)
    of_sentiments = map(lambda word: of.get(word, 0), words)
    of_sentiment = float(sum(of_sentiments))/len(of_sentiments)
    tweets[n]['sentiment_afinn96'] = afinn96_sentiment
    tweets[n]['sentiment_afinn'] = afinn_sentiment
    tweets[n]['sentiment_afinn_quant'] = np.sign(afinn_sentiment)
    tweets[n]['sentiment_afinn_sum'] = sum(afinn_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, afinn_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_afinn_nonzero'] = sum(afinn_sentiments)/nonzeros
    tweets[n]['sentiment_afinn_extreme'] = extreme_valence(afinn_sentiments)
    tweets[n]['sentiment_anew_raw'] = anew_sentiment
    tweets[n]['sentiment_anew_lemmatize'] = words2anew_sentiment(words)
    tweets[n]['sentiment_anew_raw_sum'] = sum(anew_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, anew_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_anew_raw_nonzeros'] = sum(anew_sentiments)/nonzeros
    tweets[n]['sentiment_gi_raw'] = gi_sentiment
    tweets[n]['sentiment_of_raw'] = of_sentiment
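The loop above produces one score per word list by averaging word valences over all tokens of a tweet, with zero for out-of-lexicon words. A self-contained Python 3 sketch of that word-matching scheme, using a hypothetical three-word lexicon in place of the full AFINN list:

```python
import re

# Hypothetical miniature lexicon standing in for the AFINN dictionary
afinn = {'good': 3, 'bad': -3, 'love': 3}
pattern_split = re.compile(r"\W+")

def tweet_sentiment(text):
    """Average valence over all tokens (empty tokens dropped)."""
    words = [w for w in pattern_split.split(text.lower()) if w]
    sentiments = [afinn.get(word, 0) for word in words]
    return float(sum(sentiments)) / len(sentiments)
```

For example, "I love this good day" tokenizes to five words with total valence 6, giving an average score of 1.2; note that averaging over all tokens (not just matched ones) dilutes the score of long tweets, which is why the scripts also compute sum- and nonzero-normalized variants.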
# Numpy matrix
if mislove == 1:
    columns = ["AMT", 'AFINN', 'AFINN(q)', 'AFINN(s)', 'AFINN(a)',
               'ANEW(r)', 'ANEW(l)', 'ANEW(rs)', 'ANEW(a)']
    sentiments = np.matrix([[t['score_mean'],
                             t['sentiment_afinn'],
                             t['sentiment_afinn_quant'],
                             t['sentiment_afinn_sum'],
                             t['sentiment_afinn_nonzero'],
                             t['sentiment_anew_raw'],
                             t['sentiment_anew_lemmatize'],
                             t['sentiment_anew_raw_sum'],
                             t['sentiment_anew_raw_nonzeros']] for t in tweets])
elif mislove == 2:
    columns = ["AMT",
               'AFINN', 'AFINN(q)', 'AFINN(s)', 'AFINN(a)', 'AFINN(x)',
               'ANEW(r)', 'ANEW(l)', 'ANEW(rs)', 'ANEW(a)',
               "GI", "OF", "SS"]
    sentiments = np.matrix([[t['score_amt'],
                             t['sentiment_afinn'],
                             t['sentiment_afinn_quant'],
                             t['sentiment_afinn_sum'],
                             t['sentiment_afinn_nonzero'],
                             t['sentiment_afinn_extreme'],
                             t['sentiment_anew_raw'],
                             t['sentiment_anew_lemmatize'],
                             t['sentiment_anew_raw_sum'],
                             t['sentiment_anew_raw_nonzeros'],
                             t['sentiment_gi_raw'],
                             t['sentiment_of_raw'],
                             t['sentistrength']] for t in tweets])


x = np.asarray(sentiments[:, 1]).flatten()
y = np.asarray(sentiments[:, 0]).flatten()
pylab.plot(x, y, '.')
pylab.xlabel('My list')
pylab.ylabel('Amazon Mechanical Turk')
pylab.text(0.1, 2, "Pearson correlation = %.3f" % np.corrcoef(x, y)[1, 0])
pylab.text(0.1, 1.6, "Spearman correlation = %.3f" % spearmanr(x, y)[0])
pylab.text(0.1, 1.2, "Kendall correlation = %.3f" % kendalltau(x, y)[0])
pylab.title('Scatter plot of sentiment strengths for tweets')
pylab.show()
# pylab.savefig(filebase + 'fnielsen/eps/Nielsen2011New_tweet_scatter.eps')

# Ordinary correlation coefficient
C = np.corrcoef(sentiments.transpose())
s1 = corrmatrix2latex(C, columns)
print(s1)

f = open(filebase + '/fnielsen/tex/Nielsen2011New_corrmatrix.tex', 'w')
f.write(s1)
f.close()


# Spearman rank correlation
C2 = spearman_matrix(sentiments)
s2 = corrmatrix2latex(C2, columns)
print(s2)

f = open(filebase + '/fnielsen/tex/Nielsen2011New_spearman_matrix.tex', 'w')
f.write(s2)
f.close()


# Kendall rank correlation
C3 = kendall_matrix(sentiments)
s3 = corrmatrix2latex(C3, columns)
print(s3)

f = open(filebase + '/fnielsen/tex/Nielsen2011New_kendall_matrix.tex', 'w')
f.write(s3)
f.close()
#!/usr/bin/env python
#
# This script will call the SentiStrength Web service with the text from
# the 1000 tweets and write a JSON file with tweets and the SentiStrength.
#
# $Id: Nielsen2011New_sentistrength.py,v 1.1 2011/03/13 19:12:46 fn Exp $

import csv
import re
import numpy as np
import pylab
import random
from scipy import sparse
from scipy.stats.stats import kendalltau, spearmanr
import simplejson
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from time import sleep, time
from urllib import FancyURLopener, urlencode, urlretrieve
from xml.sax.saxutils import unescape as unescape_saxutils


class MyOpener(FancyURLopener):
    version = 'RBBBot, Finn Aarup Nielsen (http://www.imm.dtu.dk/~fn/, fn@imm.dtu.dk)'


# This variable determines the dataset
mislove = 2

# Filenames
filebase = '/home/fn/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"


url_base = "http://sentistrength.wlv.ac.uk/results.php?"

pattern_positive = re.compile("positive strength <b>(\d)</b>")
pattern_negative = re.compile("negative strength <b>(\-\d)</b>")


# Word splitter pattern
pattern_split = re.compile(r"\W+")

myopener = MyOpener()

# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
    tweets = []
    for row in csv_reader:
        tweets.append({'id': row[2],
                       'score_mislove2': float(row[1]),
                       'text': row[3]})
    csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
    tweets_dict = {}
    header = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets_dict[row[0]] = {'id': row[0],
                                   'score_mislove': float(row[1]),
                                   'score_amt_wrong': float(row[2]),
                                   'score_amt': np.mean(map(int, re.split("\s+", " ".join(row[4:]))))
                                   }
    for n in range(len(tweets)):
        tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
        tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
        tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']


for n in range(len(tweets)):
    url = url_base + urlencode({'text': tweets[n]['text']})
    try:
        html = myopener.open(url).read()
        positive = int(pattern_positive.findall(html)[0])
        negative = int(pattern_negative.findall(html)[0])
        tweets[n]['sentistrength_positive'] = positive
        tweets[n]['sentistrength_negative'] = negative
        if positive > abs(negative):
            tweets[n]['sentistrength'] = positive
        elif abs(negative) > positive:
            tweets[n]['sentistrength'] = negative
        else:
            tweets[n]['sentistrength'] = 0
    except Exception, e:
        error = str(e)
        tweets[n]['sentistrength_error'] = error
    print(n)


simplejson.dump(tweets, open("tweets_with_sentistrength.json", "w"))
#!/usr/bin/env python
#
# Generates a plot of the evolution of the performance as
# the word list is extended.
#
# $Id: Nielsen2011New_evolution.py,v 1.2 2011/03/13 23:48:38 fn Exp $


import csv
import re
import numpy as np
import pylab
import random
from scipy import sparse
from scipy.stats.stats import kendalltau, spearmanr

# This variable determines the dataset
mislove = 2

# Filenames
filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"


# Word splitter pattern
pattern_split = re.compile(r"\W+")


afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename_afinn)]))