afinn project

(1)

Finn Årup Nielsen

DTU Compute, Technical University of Denmark

March 28, 2017

(2)

afinn

Started out in 2009 as an English sentiment word list for use in the analysis of Twitter messages.

Later the approach was evaluated with manually labeled tweets in a published paper.

Python code snippets showing how to use it were posted on the Internet, including on my blog.

In July 2015, turned into a GitHub repository.

0.1 release in November 2016.

(3)

Philosophies for afinn

Simple approach with few dependencies: The package should do what it should do and nothing more.

Open source.

Test thoroughly all elements of the package.

Documentation in the code for everything.

Tutorials.

Easy installation for other developers.

Should work for a broad number of Python versions.

(4)

GitHub-based development

Git-based development with GitHub.

The repository contains the Python module itself with data, test functions, setup and package files (setup.py, README.rst), and notebooks with example code.

Other developers can work from it: 36 forks by different people.

(5)

The AFINN word list

Each word is associated with a sentiment score between −5 (most negative) and +5 (most positive):

    abandon     -2
    abandoned   -2
    abandons    -2
    abducted    -2
    abduction   -2
    abductions  -2
    abhor       -3
    abhorred    -3
    abhorrent   -3
    abhors      -3
    abilities   2
    ability     2
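A minimal sketch of how such a list could be read into a Python dictionary. It assumes one entry per line, with the word (or phrase) and the integer score separated by a tab. The function name parse_word_list and the inline sample are illustrative only and are not taken from the afinn package.

```python
def parse_word_list(lines):
    """Parse 'word<TAB>score' lines into a dict mapping word to int score."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so that multi-word phrases stay intact
        word, score = line.rsplit('\t', 1)
        lexicon[word] = int(score)
    return lexicon


# Three entries copied from the excerpt above, tab-separated
sample = "abandon\t-2\nabhor\t-3\nability\t2\n"
lexicon = parse_word_list(sample.splitlines())
print(lexicon['abhor'])
```

A plain dict gives constant-time lookup per word, which is what the scoring loop needs.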

(6)

Basic Afinn object

The word list is encapsulated in a Python class (object orientation).

The word list is loaded at object instantiation time, to avoid reading overhead during sentiment scoring.

A text is scored for sentiment, based on the sentiment of its individual words, with a method of the class:

    class Afinn():

        def __init__(self):
            # Load the word list once, at instantiation
            # (load_data, not shown here, returns a dict of word: score)
            self.data = self.load_data()

        def score(self, text):
            score = 0
            # Simplified tokenization: split the text on whitespace
            for word in text.split():
                # Words missing from the list contribute 0
                score += self.data.get(word, 0)
            return score

(7)

Basic use

Using the class: Object instantiation followed by a call to the score method:

    >>> from afinn import Afinn
    >>> afinn = Afinn()  # afinn is an object name now, not the module
    >>> afinn.score('It is so horrendously bad')
    -3.0
    >>> afinn.score('very funny')
    4.0

Or score multiple texts in a list:

    afinn_scores = [afinn.score(text) for text in texts]

(8)

Basic processing

The central part of the text processing uses regular expressions (Python module: re) to extract words or to match directly against the AFINN dictionary.

    import re  # Import regular expression standard library module

    # Setup
    lexicon = {'ikke god': -2, 'imponerende': 3, 'ineffektiv': -2}
    regex = re.compile('(ikke god|imponerende|ineffektiv)')

    # Match and scoring
    matched = regex.findall("Den er ineffektiv og ikke god")
    score = sum([lexicon[word] for word in matched])

score is now −4. Besides single words, a few phrases (here 'ikke god') can be matched.
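With a full word list the pattern would not be typed by hand. A sketch of how it could be built from the lexicon, not necessarily how afinn does it: build_regex is a hypothetical helper, and the escaping, ordering and word-boundary choices are assumptions.

```python
import re


def build_regex(lexicon):
    """Compile one alternation pattern from all lexicon entries."""
    # Longest entries first: the first alternative that matches at a
    # position wins, so a phrase must come before any shorter entry
    # that is a prefix of it.
    entries = sorted(lexicon, key=len, reverse=True)
    alternatives = '|'.join(re.escape(entry) for entry in entries)
    # \b anchors each match at word boundaries
    return re.compile(r'\b(' + alternatives + r')\b')


# Same lexicon as on the slide
lexicon = {'ikke god': -2, 'imponerende': 3, 'ineffektiv': -2}
regex = build_regex(lexicon)

matched = regex.findall("Den er ineffektiv og ikke god".lower())
score = sum(lexicon[word] for word in matched)
print(matched, score)
```

re.escape guards against entries that contain regular expression metacharacters, which matters if emoticons are added to the list.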

(9)

Code checking

The flake8 tool can check that the code conforms to the style convention (PEP 8).

    $ flake8 afinn

(Nothing is reported if there are no convention issues.)

Further checking can be done with pylint.

(10)

Documentation

Documentation in the "docstring" of an object method:

    def scores_with_pattern(self, text):
        """Score text based on pattern matching.

        Performs the actual sentiment analysis on a text. It uses a regular
        expression match against the word list.

        The output is a list of float variables for each matched word or
        phrase in the word list.

        Parameters
        ----------
        text : str
            Text to be analyzed for sentiment.

        Returns
        -------
        scores : list of floats
            Sentiment analysis scores for text

(11)

Documentation

and the documentation goes on with example code:

        Examples
        --------
        >>> afinn = Afinn()
        >>> afinn.scores_with_pattern('Good and bad')
        [3, -3]
        >>> afinn.scores_with_pattern('some kind of idiot')
        [0, -3]

        """
        # TODO: ":D" is not matched
        words = self.find_all(text)
        scores = [self._dict[word] for word in words]
        return scores

(12)

Documentation checking

There is a standard for documentation: PEP 257.

Tools exist to check whether the documentation is complete and whether it follows the standard: pydocstyle (previously called pep257).

I can call it with:

    pydocstyle afinn

(It should report nothing if everything is OK.)

There is also a plugin for flake8.

afinn uses the NumPy documentation convention. However, this cannot be checked automatically: as far as I know, there are currently no tools for it.

(13)

Testing

Unit tests are in afinn/tests/test_afinn.py.

Test functions have the prefix test_.

The prefix tells py.test, http://doc.pytest.org, to test it.

Example for testing the find_all method of the object:

    def test_find_all():
        afinn = Afinn()
        words = afinn.find_all("It is so bad")
        assert words == ['bad']

(14)

Testing

Starting py.test in the afinn directory will automatically identify all test functions that should be executed based on test_ prefix:

    $ py.test
    ================================= test session starts ================
    platform linux -- Python 3.5.2, pytest-3.0.6, py-1.4.32, pluggy-0.4.0
    rootdir: /home/faan/projects/afinn, inifile:
    collected 14 items

    tests/test_afinn.py ....

    ============================== 14 passed in 0.49 seconds =============

Succinct!

(15)

Testing: doctesting

From method documentation:

        Examples
        --------
        >>> afinn = Afinn()
        >>> afinn.scores_with_pattern('Good and bad')
        [3, -3]

This piece of code can be tested: “doctest”

    python -m doctest afinn/afinn.py

or ...
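What the doctest module does can be sketched in a few lines: it extracts the >>> examples from a docstring, runs them and compares the output. The function polarity below is a hypothetical stand-in, not part of afinn, so that the sketch runs without the package installed.

```python
import doctest


def polarity(score):
    """Return the sign of a sentiment score as a label.

    Examples
    --------
    >>> polarity(-3)
    'negative'
    >>> polarity(0)
    'neutral'

    """
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'


# Extract the examples from the docstring and run them
parser = doctest.DocTestParser()
test = parser.get_doctest(polarity.__doc__, {'polarity': polarity},
                          'polarity', None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```

A failed count of zero means every example in the docstring produced exactly the documented output.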

(16)

Testing: doctesting

Testing the entire module:

    $ py.test --doctest-modules afinn
    ==================================================== test ...
    platform linux -- Python 3.5.2, pytest-3.0.6, py-1.4.32, ...
    rootdir: /home/faan/projects/afinn, inifile:
    collected 7 items

    afinn/afinn.py ....

    ================================================= 7 passed

Here 7 example code snippets were found in the docstrings, extracted and tested and found to be ok.

(17)

Testing with tox

I would like to have afinn working with different versions of Python:

Versions 2.6, 2.7, 3.3, 3.4 and 3.5.

tox combines testing with virtual environments enabling the test of different versions of Python.

tox creates virtual environments in afinn/.tox/<virtualenv>, moves into them and executes whatever is specified in a tox.ini file (for afinn it is set up to execute py.test, doctesting and flake8).

tox neatly enables testing multiple versions with just a single command.
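A tox.ini along these lines would give that behaviour. This is a sketch based on the versions and tools listed above, not the project's actual file. The exact commands and the deps entries are assumptions.

```ini
# Sketch only: environment names follow the Python versions listed above
[tox]
envlist = py26, py27, py33, py34, py35, flake8

# Default settings for each Python version: unit tests and doctests
[testenv]
deps = pytest
commands =
    py.test --doctest-modules afinn

# Separate environment for the style check
[testenv:flake8]
deps = flake8
commands = flake8 afinn
```

Each name in envlist becomes its own virtual environment, so one failing Python version is reported separately from the others.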

(18)

Testing with tox

    $ tox
    GLOB sdist-make: /home/faan/projects/afinn/setup.py
    py26 inst-nodeps: /home/faan/projects/afinn/.tox/dist/afinn-0.1.zip
    ...
    Installing collected packages: afinn
      Running setup.py install for afinn ... done
    Successfully installed afinn-0.1
    py26 runtests: commands[1] | py.test test_afinn.py
    ==================================================== test session starts
    platform linux2 -- Python 2.6.9, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
    rootdir: /home/faan/projects/afinn, inifile:
    collected 14 items

    test_afinn.py ....
    ...
      py26: commands succeeded
      py27: commands succeeded
      py33: commands succeeded
      py34: commands succeeded
      py35: commands succeeded
      flake8: commands succeeded
      congratulations :)

(19)

Testing with Travis

Travis: cloud-based testing at https://travis-ci.org/fnielsen/afinn

Ensures that the package would also work on another system: Missing data? Missing dependencies?

Specified with a .travis.yml configuration file to run tox.

(20)

Jupyter notebooks

A couple of Jupyter notebooks are available in the GitHub repository.

Used to demonstrate how the module can be applied with a dataset.

GitHub formats the notebook for human readability.

It would otherwise be raw JSON.

This notebook computes accuracy on a manually sentiment-scored Twitter dataset.
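One plausible way to compute such an accuracy, sketched under assumptions: a text counts as correct when the sign of its afinn score agrees with the sign of the manual label. The helper names and the toy values are made up for illustration. They are not taken from the notebook or the Twitter dataset.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def accuracy(scores, labels):
    """Fraction of texts where the score's sign equals the label's sign."""
    if len(scores) != len(labels):
        raise ValueError('scores and labels must have the same length')
    hits = sum(sign(score) == sign(label)
               for score, label in zip(scores, labels))
    return hits / float(len(labels))


# Made-up toy values, for illustration only
toy_scores = [-3.0, 4.0, 0.0, 2.0]   # e.g. output of a score method
toy_labels = [-1, 1, 1, 1]           # manual labels: -1 negative, 1 positive
toy_accuracy = accuracy(toy_scores, toy_labels)
print(toy_accuracy)
```

A score of 0 (no matched words) counts as wrong against a positive or negative label here. Whether that is the right treatment depends on the labeling scheme.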

(21)

Python Package Index

afinn is distributed from the central open archive, the Python Package Index: https://pypi.python.org/pypi/afinn

Enables others to download the package seamlessly:

pip install afinn

Or search for it with:

pip search sentiment

(22)

Dependencies

Keep dependencies to a bare minimum: None so far, except the standard library (codecs, re, os).

Otherwise the dependencies should have been added to requirements.txt.

Example from another package:

    beautifulsoup4
    db.py
    docopt
    fasttext
    flask
    Flask-Bootstrap
    gensim
    jsonpickle
    ...

Enables pip install -r requirements.txt

(23)

Issue: Versioneering

Versioneering is a problem at the moment.

Version string “0.1” is hard-coded in the setup file:

    setup(
        name='afinn',
        packages=['afinn'],
        version='0.1',
        ...

The PyPI version is 0.1, but once the GitHub repository changes, this version string no longer reflects the differences.

In the old days, developers would manually update the version.
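One common partial remedy, sketched here and not something afinn is stated to do: keep the version string in a single place in the package and let setup.py read it from there. The helper find_version and the assumption that the package defines __version__ are hypothetical.

```python
import re


def find_version(source):
    """Extract the value of __version__ from Python source text."""
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                      source, flags=re.MULTILINE)
    if match is None:
        raise RuntimeError('No __version__ string found')
    return match.group(1)


# In setup.py the text would be read from a package file, for instance
# afinn/__init__.py (assuming __version__ is defined there).
example_source = "\"\"\"afinn package.\"\"\"\n\n__version__ = '0.1'\n"
version = find_version(example_source)
print(version)
```

This removes the duplication between package and setup file, but the string still has to be updated by hand for each release.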

(24)

Summary

The Python environment has good methods to standardize development.

Python can neatly enforce documentation.

A good number of tools help the developer to write in a best practice mode: testing frameworks, code and documentation style checkers.

Python provides a good framework for publishing open source code.

Persistent and versioned distribution.

Most of the “code” is documentation.

(25)

End