afinn project

(1)

Finn Årup Nielsen, DTU Compute

Technical University of Denmark, March 28, 2017

(2)

afinn

Started out as an English sentiment word list for use in analysis of Twitter messages in 2009.

Later the approach was evaluated with manually labeled tweets in a published paper.

Python code snippets showing how to use it have been shared on the Internet, including on my blog.

In July 2015, turned into a GitHub repository.

0.1 release in November 2016.

(3)

Philosophies for afinn

Simple approach with few dependencies: The package should do what it should do and nothing more.

Open source.

Test thoroughly all elements of the package.

Documentation in the code for everything.

Tutorials.

Easy installation for other developers.

Should work for a broad range of Python versions.

(4)

GitHub-based development

Git-based development with GitHub.

Repository contains the Python module itself with data, test functions, setup and package files (setup.py, README.rst), and notebooks with example code.

Other developers can work from it: 36 forks by different people.

(5)

The AFINN word list

Each word is associated with a sentiment score between −5 (most negative) and +5 (most positive):

    abandon       -2
    abandoned     -2
    abandons      -2
    abducted      -2
    abduction     -2
    abductions    -2
    abhor         -3
    abhorred      -3
    abhorrent     -3
    abhors        -3
    abilities      2
    ability        2
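A minimal sketch of reading such a list into a Python dictionary. It assumes one entry per line with a tab between word and score; the sample text and the function name load_wordlist are made up for illustration and are not taken from the afinn package.

```python
import io

# Hypothetical three-line excerpt in an assumed "word<TAB>score" layout.
SAMPLE = "abandon\t-2\nabhor\t-3\nability\t2\n"


def load_wordlist(lines):
    """Return a dict mapping each word or phrase to its integer score."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so that phrases containing spaces survive.
        word, score = line.rsplit("\t", 1)
        lexicon[word] = int(score)
    return lexicon


lexicon = load_wordlist(io.StringIO(SAMPLE))
print(lexicon["abhor"])
```

Any iterable of lines works as input, so an opened file can be passed in place of the StringIO object.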

(6)

Basic Afinn object

The word list is encapsulated in a Python class (object orientation).

The word list is loaded at object instantiation time, to avoid reading overhead during sentiment scoring.

A text is scored for sentiment based on the sentiment of its individual words, with a method from the class:

    class Afinn():
        def __init__(self):
            # Load the word list once; load_data is not shown here
            self.data = self.load_data()

        def score(self, text):
            score = 0
            for word in text.split():  # iterate over words, not characters
                score += self.data.get(word, 0)  # 0 for words not in the list
            return score

(7)

Basic use

Using the class: Object instantiation followed by calling the score method:

    >>> from afinn import Afinn
    >>> afinn = Afinn()  # afinn is an object name now, not the module
    >>> afinn.score('It is so horrendously bad')
    -3.0
    >>> afinn.score('very funny')
    4.0

Or score multiple texts in a list:

    afinn_scores = [afinn.score(text) for text in texts]

(8)

Basic processing

The central part of the text processing uses regular expressions (Python module: re) to extract words or to directly match against the AFINN dictionary.

    import re  # Import regular expression standard library module

    # Setup
    lexicon = {'ikke god': -2, 'imponerende': 3, 'ineffektiv': -2}
    regex = re.compile('(ikke god|imponerende|ineffektiv)')

    # Match and scoring
    matched = regex.findall("Den er ineffektiv og ikke god")
    score = sum([lexicon[word] for word in matched])

score is now −4. A few phrases can be matched.
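For a full word list the alternation would not be written by hand. A minimal sketch of building it from the lexicon keys follows. The entry 'god' with score 3, the word boundaries, the lowercasing and the longest-first ordering are assumptions made for this illustration; the afinn package may build its pattern differently.

```python
import re

# 'god': 3 is a hypothetical extra entry, added to show why ordering matters.
lexicon = {'ikke god': -2, 'imponerende': 3, 'ineffektiv': -2, 'god': 3}

# Longest entries first, so the phrase 'ikke god' is tried before 'god'.
entries = sorted(lexicon, key=len, reverse=True)

# re.escape protects entries containing regex metacharacters.
pattern = r'\b(?:' + '|'.join(re.escape(entry) for entry in entries) + r')\b'
regex = re.compile(pattern)


def score(text):
    """Sum the scores of all lexicon entries matched in the text."""
    return sum(lexicon[match] for match in regex.findall(text.lower()))


print(score("Den er ineffektiv og ikke god"))  # -4
```

The regular expression engine scans left to right and consumes 'ikke god' as one match, so the shorter entry 'god' is not counted a second time.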

(9)

Code checking

The flake8 tool can check that the code conforms to the style convention (PEP 8).

    $ flake8 afinn

(Nothing is reported if there are no convention issues.)

Further checking can be made with pylint.

(10)

Documentation

Documentation in the “docstring” of an object method:

    def scores_with_pattern(self, text):
        """Score text based on pattern matching.

        Performs the actual sentiment analysis on a text. It uses a regular
        expression match against the word list.

        The output is a list of float variables for each matched word or
        phrase in the word list.

        Parameters
        ----------
        text : str
            Text to be analyzed for sentiment.

        Returns
        -------
        scores : list of floats
            Sentiment analysis scores for text

(11)

Documentation

and the documentation goes on with example code:

        Examples
        --------
        >>> afinn = Afinn()
        >>> afinn.scores_with_pattern('Good and bad')
        [3, -3]
        >>> afinn.scores_with_pattern('some kind of idiot')
        [0, -3]

        """
        # TODO: ":D" is not matched
        words = self.find_all(text)
        scores = [self._dict[word] for word in words]
        return scores

(12)

Documentation checking

There is a standard for documentation: PEP 257.

Tools exist to check whether the documentation is complete and whether it follows the standard: pydocstyle (previously called pep257).

I can call it with:

    pydocstyle afinn

(It should report nothing if ok.)

There is also a plugin for flake8.

Afinn uses the NumPy docstring convention. However, this cannot be tested: currently there are no tools for it (AFAIK).

(13)

Testing

Unit tests are in afinn/tests/test_afinn.py.

Test functions have the prefix test_.

The prefix tells py.test, http://doc.pytest.org, to run them.

Example for testing the find_all method of the object:

    def test_find_all():
        afinn = Afinn()
        words = afinn.find_all("It is so bad")
        assert words == ['bad']
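A self-contained sketch of the same test style. TinyAfinn is a hypothetical stand-in with a three-word lexicon, so the tests run without the real package; the scores for 'bad', 'good' and 'funny' are the ones shown in the examples elsewhere in these slides.

```python
import re


class TinyAfinn:
    """Hypothetical stand-in for the real Afinn class."""

    def __init__(self):
        self._dict = {'bad': -3, 'good': 3, 'funny': 4}
        self._regex = re.compile(r'\b(?:' + '|'.join(self._dict) + r')\b')

    def find_all(self, text):
        """Return the lexicon words found in the text, in order."""
        return self._regex.findall(text.lower())

    def score(self, text):
        """Return the summed score of all matched words as a float."""
        return float(sum(self._dict[w] for w in self.find_all(text)))


# Plain functions with the test_ prefix and bare assert statements.
def test_find_all():
    assert TinyAfinn().find_all("It is so bad") == ['bad']


def test_score_of_neutral_text_is_zero():
    assert TinyAfinn().score("It is a table") == 0.0


def test_score_sums_matches():
    assert TinyAfinn().score("Good and bad and funny") == 4.0
```

Saved in a file whose name starts with test_, these functions should be collected by py.test without any further registration.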

(14)

Testing

Starting py.test in the afinn directory will automatically identify all test functions that should be executed based on test_ prefix:

    $ py.test
    ===================== test session starts =====================
    platform linux -- Python 3.5.2, pytest-3.0.6, py-1.4.32, pluggy-0.4.0
    rootdir: /home/faan/projects/afinn, inifile:
    collected 14 items

    tests/test_afinn.py ....

    ================== 14 passed in 0.49 seconds ==================

Succinct!

(15)

Testing: doctesting

From method documentation:

        Examples
        --------
        >>> afinn = Afinn()
        >>> afinn.scores_with_pattern('Good and bad')
        [3, -3]

This piece of code can be tested: “doctest”

    python -m doctest afinn/afinn.py

or ...

(16)

Testing: doctesting

Testing the entire module:

    $ py.test --doctest-modules afinn
    ===================== test ...
    platform linux -- Python 3.5.2, pytest-3.0.6, py-1.4.32, ...
    rootdir: /home/faan/projects/afinn, inifile:
    collected 7 items

    afinn/afinn.py ....

    ===================== 7 passed

Here 7 example code snippets were found in the docstrings, extracted and tested and found to be ok.
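The doctest machinery can also be driven directly from Python. A minimal sketch with a made-up function, scores, which is not part of afinn: the example in its docstring is collected and executed with the standard library doctest module.

```python
import doctest


def scores(words, lexicon):
    """Return the score of each word, with 0 for words not in the lexicon.

    Examples
    --------
    >>> scores(['good', 'and', 'bad'], {'good': 3, 'bad': -3})
    [3, 0, -3]

    """
    return [lexicon.get(word, 0) for word in words]


# Collect the examples in the docstring and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(scores, name='scores', globs={'scores': scores}):
    runner.run(test)

# summarize returns the number of failed and attempted examples.
results = runner.summarize(verbose=False)
print(results)
```

If the docstring and the code drift apart, the failed count becomes nonzero, which is how the documentation doubles as a test.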

(17)

Testing with tox

I would like to have afinn working with different versions of Python:

Versions 2.6, 2.7, 3.3, 3.4 and 3.5.

tox combines testing with virtual environments enabling the test of different versions of Python.

tox creates virtual environments in afinn/.tox/<virtualenv>, moves into them and executes whatever is specified in a tox.ini file (for afinn it is set up to execute py.test, doctesting and flake8).

tox neatly enables testing multiple versions with just a single command.
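A sketch of what such a tox.ini might contain, held here as a Python string and parsed with configparser to list the environments. The contents are hypothetical: the environment list is shortened and the commands are guesses based on the tools named above, not a copy of the afinn project's file.

```python
import configparser
import textwrap

# Hypothetical tox.ini contents; the project's real file may differ.
TOX_INI = textwrap.dedent("""\
    [tox]
    envlist = py27, py35, flake8

    [testenv]
    deps = pytest
    commands =
        py.test tests
        py.test --doctest-modules afinn

    [testenv:flake8]
    deps = flake8
    commands = flake8 afinn
    """)

config = configparser.ConfigParser()
config.read_string(TOX_INI)

# One virtual environment is created per name in envlist.
envs = [name.strip() for name in config['tox']['envlist'].split(',')]
print(envs)
```

The [testenv] section applies to every environment, while [testenv:flake8] overrides it for the style check.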

(18)

Testing with tox

    $ tox
    GLOB sdist-make: /home/faan/projects/afinn/setup.py
    py26 inst-nodeps: /home/faan/projects/afinn/.tox/dist/afinn-0.1.zip
    ...
    Installing collected packages: afinn
      Running setup.py install for afinn ... done
    Successfully installed afinn-0.1
    py26 runtests: commands[1] | py.test test_afinn.py
    ===================== test session starts
    platform linux2 -- Python 2.6.9, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
    rootdir: /home/faan/projects/afinn, inifile:
    collected 14 items

    test_afinn.py ....
    ...
      py26: commands succeeded
      py27: commands succeeded
      py33: commands succeeded
      py34: commands succeeded
      py35: commands succeeded
      flake8: commands succeeded
      congratulations :)

(19)

Testing with Travis

Travis: cloud-based testing at https://travis-ci.org/fnielsen/afinn

Ensures that the package would also work on another system: Missing data? Missing dependencies?

Specified with a .travis.yml configuration file to run tox.

(20)

Jupyter notebooks

A couple of Jupyter notebooks are available in the GitHub repository.

Used to demonstrate how the module can be applied with a dataset.

GitHub formats the notebook for human readability.

It would otherwise be raw JSON.

This notebook computes accuracy on a manually sentiment-scored Twitter dataset.
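A sketch of one way such an accuracy could be computed. It is an assumption that a prediction counts as correct when the sign of the score agrees with the sign of the manual label; the notebook may define accuracy differently. The numbers are made up and the code assumes Python 3 division.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def accuracy(predicted, labels):
    """Fraction of texts where predicted and labeled polarity agree."""
    hits = sum(sign(p) == sign(l) for p, l in zip(predicted, labels))
    return hits / len(labels)


# Hypothetical sentiment scores and manual labels for four tweets.
predicted = [-3.0, 4.0, 0.0, 2.0]
labels = [-1, 1, -1, 1]
print(accuracy(predicted, labels))  # 0.75
```

A score of zero is treated as its own class here, so the third tweet counts as a miss.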

(21)

Python Package Index

afinn is distributed from the central open archive Python Package Index: https://pypi.python.org/pypi/afinn

Enables others to download the package seamlessly:

    pip install afinn

Or search for it with:

    pip search sentiment

(22)

Dependencies

Keep dependencies to a bare minimum: None, except the standard library (codecs, re, os), so far.

Otherwise the dependencies should have been added to requirements.txt.

Example from another package:

    beautifulsoup4
    db.py
    docopt
    fasttext
    flask
    Flask-Bootstrap
    gensim
    jsonpickle
    ...

Enables pip install -r requirements.txt
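The same file can also feed the install_requires argument of setup(), so the list is written only once. A minimal sketch; the function name parse_requirements and the sample text are made up for illustration.

```python
def parse_requirements(text):
    """Return the requirement lines, skipping blank lines and comments."""
    requirements = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            requirements.append(line)
    return requirements


# Hypothetical requirements.txt contents.
SAMPLE = """\
beautifulsoup4
docopt
# web front end
flask

gensim
"""

print(parse_requirements(SAMPLE))
```

In a setup.py the text would be read from the file, and the resulting list passed as install_requires.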

(23)

Issue: Versioning

Versioning is a problem at the moment.

Version string “0.1” is hard-coded in the setup file:

    setup(
        name='afinn',
        packages=['afinn'],
        version='0.1',
        ...

The PyPI version is 0.1, but when the GitHub repository changes, this version string no longer reflects the differences.

In the old days, developers would manually update the version.
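One common way to reduce the manual work is to keep the version string in a single place in the package and let the setup file read it from there. A minimal sketch, not what afinn does: the file contents, the version value and the function name find_version are all hypothetical.

```python
import re

# Hypothetical contents of a package __init__.py file.
INIT_SOURCE = '''"""AFINN sentiment analysis."""

__version__ = '0.2.dev0'
'''


def find_version(source):
    """Extract the value assigned to __version__ in Python source text."""
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                      source, re.MULTILINE)
    if match is None:
        raise RuntimeError("No __version__ string found")
    return match.group(1)


print(find_version(INIT_SOURCE))  # 0.2.dev0
```

The version still has to be bumped by hand, but only in one file, and the setup file no longer carries its own copy.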

(24)

Summary

The Python environment has good methods to standardize development.

Python can neatly enforce documentation.

A good number of tools help the developer to write in a best practice mode: testing frameworks, code and documentation style checkers.

Python provides a good framework for publishing open source code.

Persistent and versioned distribution.

Most of the “code” is documentation.

(25)

End
