Testing - Data Mining with Python (Working draft)

2.8.1 Testing for type

In data mining applications numerical list-like objects can have different types: list of integers, list of floats, list of booleans and Numpy arrays or Numpy matrices with different types of elements. Proper testing should cover all relevant input argument types. Below is an example where a mean diff function is tested in the test mean difffunction for both floats and integers:

f r o m n u m p y i m p o r t max, min def m e a n _ d i f f ( a ):

""" C o m p u t e the m e a n d i f f e r e n c e in a s e q u e n c e . P a r a m e t e r s

-a : -a r r -a y _ l i k e

"""

r e t u r n f l o a t((max( a ) - min( a )) / (len( a ) - 1)) def t e s t _ m e a n _ d i f f ():

a s s e r t m e a n _ d i f f ([1. , 7. , 3. , 2. , 5 . ] ) == 1.5 a s s e r t m e a n _ d i f f ([7 , 3 , 2 , 1 , 5]) == 1.5

The test fails in Python 2 because the parenthesis for thefloatclass is not correct, so the division becomes an integer division. Either we need to move the parenthesis or we need to specifyfrom future import division. There are a range of other types we can test for in this case, e.g., should it work for Booleans and then what should be the result? Should it work for Numpy and Pandas data types? Should it work for higher order data types such as matrices, tensors and/or list of lists? A question is also what data type should be returned, — in this case it is always a float, but if the input was[2, 4] we could have returned an integer (2rather than2.0).

2.8.2 Zero-one-some testing

The testing patternzero-one-some attempts to ensure coverage for variables which may have multiple ele-ments. The pattern says you should test with zeros elements, one elements and ‘some’ (2 or more) eleele-ments.

The listing below shows the test_mean function testing an attempt on a mean function with the three zero-one-some cases:

def m e a n ( x ):

r e t u r n f l o a t(sum( x ))/len( x ) i m p o r t n u m p y as np

def t e s t _ m e a n ():

a s s e r t np . i s n a n ( m e a n ( [ ] ) ) a s s e r t m e a n ( [ 4 . 2 ] ) == 4.2 a s s e r t m e a n ([1 , 4.3 , 4]) == 3.1

Here the code fails with a ZeroDivisionError already at the first assert as the mean function does not handle the case for a zero-element list.

A fix for the zero division uses exception handling catching the raisedZeroDivisionErrorand returning a Numpy not-a-number (numpy.nan). Below is the test included as a doctest in the docstring of the function implementation:

i m p o r t n u m p y as np def m e a n ( x ):

""" C o m p u t e m e a n of l i s t of n u m b e r s . E x a m p l e s

-> -> -> np . i s n a n ( m e a n ( [ ] ) ) T r u e

> > > m e a n ( [ 4 . 2 ] ) 4.2

> > > m e a n ([1 , 4.3 , 4]) 3.1

"""

try:

r e t u r n f l o a t(sum( x ))/len( x ) e x c e p t Z e r o D i v i s i o n E r r o r :

r e t u r n np . nan

If we call the filedoctestmean.pywe can then perform doctesting by invoking the doctestmodule on the file bypython -m doctest doctestmean.py. This will report no output if no errors occur.

2.8.3 Test layout and test discovery

Python has a common practice for test layout and test discovery. Test layout is the schema for where you put and name you testing modules, classes and functions. Using a standard layout will help readers of your code to navigate and find the testing function and will ensure that testing tools can automatically identify classes, functions and method that should be executed to test your implementation, i.e., help test discovery.

py.testsupports two test directory layouts, seehttp://pytest.org/latest/goodpractises.html. One where you put atestsdirectory on the same level as the package:

s e t u p . py # y o u r d i s t u t i l s / s e t u p t o o l s P y t h o n p a c k a g e m e t a d a t a m y p k g /

_ _ i n i t _ _ . py

a p p m o d u l e . py ...

t e s t s /

t e s t _ a p p . py ...

For the other layout, the ‘inlining test directories’, you put atestdirectory on the same level as the module:

s e t u p . py # y o u r d i s t u t i l s / s e t u p t o o l s P y t h o n p a c k a g e m e t a d a t a m y p k g /

_ _ i n i t _ _ . py a p p m o d u l e . py ...

t e s t /

t e s t _ a p p . py ...

This second method allows you to distribute the test together with the implementation, letting other devel-opers use your tests as part of the application. In this case you should also add an init .py file to the test directory. For both layouts, files, methods and functions should be prefixed with test for the test discovery, while test classes whould be prefixed withTest.

In data mining where you work with machine learning training and test set, you should be careful not to name your ordinary (non-testing) function with a pre- or postfix of ‘test’, as this may invoke testing when you run the test from the package level.

2.8.4 Test coverage

Test coverage tells you the fraction of code tested, and a developer would hope to reach 100% coverage.

Provided you already have created the test function a Python tool exists for easy reporting of test coverage:

the coverage package. Consider the following Python file numerics.py file with an obviously erroneous computefunction tested with the functiontest compute: We can test this with the standardpy.testtools and find that it reports no errors:

> py . t e s t n u m e r i c s . py

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = t e s t s e s s i o n s t a r t s = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = p l a t f o r m l i n u x 2 - - P y t h o n 2 . 7 . 3 - - pytest - 2 . 3 . 5

c o l l e c t e d 1 i t e m s n u m e r i c s . py .

= = = = = = = = = = = = = = = = = = = = = = = = = = = 1 p a s s e d in 0 . 0 3 s e c o n d s = = = = = = = = = = = = = = = = = = = = = = = = = = = For some systems you will find that thecoveragesetup installs the central script aspython-coverage. You can execute this script from the command-line, first calling it with the run command argument and the filename of the Python source, and then with thereportcommand argument:

> python - c o v e r a g e run n u m e r i c s . py

> python - c o v e r a g e r e p o r t - m

N a m e S t m t s M i s s C o v e r M i s s i n g

-/ usr -/ s h a r e -/ p y s h a r e d -/ c o v e r a g e -/ c o l l e c t o r 132 127 4%

3 -229 , 236 -244 , 248 -292

/ usr / s h a r e / p y s h a r e d / c o v e r a g e / c o n t r o l 236 235 1% 3 -355 , 358 -624 / usr / s h a r e / p y s h a r e d / c o v e r a g e / e x e c f i l e 35 16 54%

3 -17 , 42 -43 , 48 , 54 -65

n u m e r i c s 8 1 88% 4

-T O -T A L 411 379 8%

The-m optional command line argument reports the line numbers missing in the test. It reports that the test did not cover line 4 innumerics. This is because we did not test for the case withx = 2so the block with theifcondictional would be executed.

With the coverage module installed, the nose package can also report the coverage. Here we use the command line scriptnosetestswith a optional input argument:

> n o s e t e s t s - - with - c o v e r n u m e r i c s . py .

N a m e S t m t s M i s s C o v e r M i s s i n g

-n u m e r i c s 8 1 88% 4

-Ran 1 t e s t in 0 . 0 0 3 s

A coverage plugin forpy.test is also available, so the coverage for a module which contains test function may be measured with the--covoption:

> py . t e s t - - cov t h e m o d u l e

The specific lines that the test is missing to test can be show with an option to the report command:

> c o v e r a g e r e p o r t - - show - m i s s i n g

2.8.5 Testing in different environments

It is a ‘good thing’ if a module works in several different environments, e.g., different versions of Python.

Virtual environments can be setup and tests executed in the environments. The tox program greatly simplifies the process allowing the developer to test the code in multiple versions of Python installations without much hassle after the tox initialization is setup.

If test functions and asetup.pypackage file are set up andpy.textinstalled thentoxwill automatically create the virtual environments for testing and perform the actually testing in each of them depending on a specification in the tox.ini configuration file. After the one-time setup of setup.py and tox.ini any subsequent testing needs only to call the command-line programtoxfor the all the test to run.

Data mining application may require expensive compilation of Numpy an SciPy in each virtual environ-ment. Tox can be setup so the virtual environment borrows the site packages, rather than installing new versions in each of the virtual environments.

In document Data Mining with Python (Working draft) (Sider 33-36)