As a data scientist who did not study computer science or software development in school, I’ve had to teach myself numerous skills on the fly (e.g., GitHub, the drawbacks of using for loops all over the place). Software engineering skills were particularly hard to learn in this fashion, as the wide array of approaches and tools (many ill-suited to the particular needs of a data scientist) made it hard to identify a limited set of tools and techniques that could meet my needs.
This difficulty is why I was particularly excited when professors at the University of Washington–including Jake VanderPlas, author of the wonderfully thorough Python Data Science Handbook–posted the lectures to their new course Software Engineering for Data Scientists online.
This post summarizes the topics in the course that I wish I had found practical guides for earlier: building Python packages, writing stylish code, and debugging. While I largely ignore the lectures on version control (1, 2), the iPython notebook, procedural Python, and software design, those lectures are all good introductions to those topics.
Building Python Packages
A Python module is a file that contains definitions and statements that can expose classes, functions, and global variables. A Python package is simply an installable group of Python modules. Packages structure the Python namespace by allowing you to use dotted module names; for example, after importing sklearn.ensemble, you access the RandomForestClassifier class in sklearn’s ensemble module as sklearn.ensemble.RandomForestClassifier. Popular Python packages like NumPy or Scikit-learn can be downloaded and installed through conda and pip, Python’s package management system, although packages can be installed from local files as well.
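As a small illustration of dotted module names, here is a sketch that uses the standard library's json package rather than sklearn, purely so that it runs without any third-party installs. The variable names are made up for the example.

```python
# json is a package, decoder is a module inside it, and JSONDecoder is a
# class defined in that module.
import json.decoder

decoder = json.decoder.JSONDecoder()
parsed = decoder.decode('{"n_estimators": 10}')

# The same class can also be imported directly into the local namespace,
# which saves typing the full dotted path each time.
from json.decoder import JSONDecoder

same_class = JSONDecoder is json.decoder.JSONDecoder
```

Both spellings refer to the same object; the dotted path just makes it explicit where in the package the class lives.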
(A word of warning for those who watch this lecture: While the Building Python Packages lecture is a great reference, if you’re like me, you will spend ~15 minutes at the end of the video futilely yelling at your computer, pleading for Jake to click the “Sync” button that has popped up on the screen five times and would solve his issue if he’d just notice the button, dang it! The perils of live coding. Spoiler alert: He eventually clicks the button.)
As with most software engineering tasks, there are many workable options someone building a Python package can choose from. In particular, Jake makes specific choices about how to implement continuous integration and unit testing.
Continuous integration (CI) services automatically run a specified series of commands each time the code in a GitHub repository is changed. While most frequently used to automate code testing, CI services can automate many other tasks, including updating a website (by generating HTML files from Markdown and uploading them to S3) or uploading a Python package to PyPI, Python’s package index.
Travis is the most widely used CI service today, and it’s the service used by Jake in the lecture. AppVeyor is the other CI service you’ll often see GitHub repositories using; it’s frequently described as “Travis for Windows” since it can be used to automatically generate Windows binaries for packages. Both of these services are the source of the build passing/failing badges you see on GitHub repositories like scikit-learn’s (which, at the time of writing, is passing according to both Travis and AppVeyor!).
Unit testing is the process of individually testing units of your code piece by piece. One option for unit testing is Python’s unittest package, which is covered in the course’s unit testing lecture. However, unittest is rather cumbersome to use; having been designed with an eye toward Java’s unit testing software, it uses a class inheritance structure that requires lots of boilerplate code. As a result, it’s become trendy to use simpler testing software built atop unittest.
For the purposes of building a Python package, Jake decided to instead use nose, a Python package that uses unit tests to “sniff” out code errors and describes itself as “nicer testing for Python.” Other testing tools that you might run across include the pytest package, which can run both nose- and unittest-style test suites, and domain-specific software like engarde, which helps run tests on pandas dataframes.
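To make the boilerplate difference concrete, here is a rough sketch of the same check written in both styles. The function compute_mean and the test names are hypothetical, invented for this illustration; they do not come from the lecture.

```python
import unittest


def compute_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# unittest style: tests are methods on a class that inherits from TestCase.
class TestComputeMean(unittest.TestCase):
    def test_mean_of_integers(self):
        self.assertEqual(compute_mean([1, 2, 3]), 2.0)


# nose/pytest style: a plain function whose name starts with test_ and
# that uses a bare assert statement.
def test_mean_of_integers():
    assert compute_mean([1, 2, 3]) == 2.0


# Run the unittest-style case programmatically so the sketch is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestComputeMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)

# The function-style test can simply be called; it raises AssertionError
# if the check fails and returns None otherwise.
test_mean_of_integers()
```

In practice a test runner finds and calls these for you; they are invoked by hand here only to keep the example in one file.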
Having chosen to use nose and Travis, the steps to creating a Python package out of a GitHub repository are:
- Create a folder with the same name as your GitHub repo that will contain your package’s modules. Example: Scikit-learn has a “sklearn” folder in their repo.
- Put a __init__.py file inside this folder. This file (which can be empty) will run whenever you import the Python package. Any variables or functions created in this file will be available in the package’s namespace, allowing you to do things like importing submodules into the namespace. Example: Scikit-learn’s __init__.py file defines a __version__ variable that can be accessed by sklearn.__version__. (The double underscores here are a Python convention to emphasize that these files/variables typically shouldn’t be directly accessed by developers.)
- Put a folder called tests inside this folder that will contain your unit tests. This folder should also have an __init__.py file, which can also be blank.
- For each file filename.py in the main folder of your package, create a file in the tests folder containing all of the functions that test the behavior of filename.py’s code. To work with the nose testing package, each of these functions’ names should begin with test_. (Although this is the way Jake sets it up, note that there is some flexibility here; scikit-learn, for example, has separate test folders inside each of its modules.)
- For more information on how to write out these tests, watch Ned Batchelder’s PyCon talk on Getting Started Testing.
- To make your package installable, place a setup.py file in the root directory of your GitHub repo. Shablona, a template for small Python projects, contains a setup.py file that can be configured with your package’s information.
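The folder layout described in the steps above can be sketched with a short script that builds a throwaway package in a temporary directory and imports it. Everything named here (mypackage_demo, stats.py, test_stats.py, compute_mean) is a hypothetical stand-in, and the script puts the folder on sys.path instead of writing a real setup.py, so treat it as an illustration of the structure rather than a complete recipe.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, used only for this illustration.
PACKAGE_NAME = "mypackage_demo"

# A temporary directory stands in for the root of the GitHub repo.
repo_root = Path(tempfile.mkdtemp())
package_dir = repo_root / PACKAGE_NAME
tests_dir = package_dir / "tests"
tests_dir.mkdir(parents=True)

# __init__.py runs on import; names defined in it land in the package namespace.
(package_dir / "__init__.py").write_text('__version__ = "0.1.0"\n')

# One module in the main folder of the package.
(package_dir / "stats.py").write_text(
    "def compute_mean(values):\n"
    "    return sum(values) / len(values)\n"
)

# The tests folder gets its own (blank) __init__.py.
(tests_dir / "__init__.py").write_text("")

# A test file for stats.py; the test function's name begins with test_.
(tests_dir / "test_stats.py").write_text(
    "from mypackage_demo.stats import compute_mean\n"
    "\n"
    "\n"
    "def test_compute_mean():\n"
    "    assert compute_mean([1, 2, 3]) == 2.0\n"
)

# Stand-in for installation: make the repo root importable for this session.
sys.path.insert(0, str(repo_root))
package = importlib.import_module(PACKAGE_NAME)
test_module = importlib.import_module(PACKAGE_NAME + ".tests.test_stats")

# Call the test by hand; a test runner would normally discover and run it.
test_module.test_compute_mean()

layout = sorted(p.relative_to(repo_root).as_posix() for p in repo_root.rglob("*.py"))
```

After it runs, layout lists the four .py files in the package, and package.__version__ is available because it was defined in __init__.py.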
Those five steps will allow you to turn a group of Python modules into an installable Python package that you can run tests on. To run your tests, you have two options:
- The Manual Way: At the command line, run nosetests path_to_project_folder. (You may need to install nose first by running conda install nose or, if you aren’t using Anaconda, pip install nose.)
- The Automatic Way: Use Travis for continuous integration by putting a configured version of the .travis.yml file in Shablona in the root directory of your GitHub repository. After going to Travis’s site and linking your repository, the .travis.yml file will cause Travis to automatically run nosetests on your code anytime your GitHub repository is updated.
The simplest way to install your package is to navigate to the root directory of your project on the command line and run python setup.py install. After doing that, Python will be able to import your package by name on your computer. Alternatively, if you configure the .travis.yml file in Shablona with your PyPI username and password, you can deploy your package to PyPI and then install it via pip.
Writing Stylish Code
Good code shouldn’t just work. Good code should be easily understood by others who will need to work with it, including your future self. Given that, writing stylish code isn’t just fashionable; it serves an important purpose, helping others easily build upon your work.
A few of the most useful style tips from the lecture are summarized below; for more details and additional tips, flip through the examples in the lecture notes. While these tips are geared toward Python, the principles work across languages.
- Avoid putting so-called “magic numbers,” numbers that do not have a clear meaning, in your code. Instead, create a variable with an informative name that is equal to the number. Even if you only use the variable once, your code will be much more readable.
- Classes are always named in CamelCase. Functions are always named in lower_case_with_underscores_when_needed.
- Begin the name of a function with a verb-y name. Common choices: compute, get/set, find, is/has/can, add/remove, first/last. This has the nice benefit of suggesting what the function will return, whether a number, string, boolean, or nothing.
- A good docstring has the following parts:
  - A single-sentence summary of the function.
  - A paragraph or two that elaborates on that summary (if necessary).
  - A description of each of the function’s parameters.
  - A description of each object that the function returns.
  When in doubt, look at scikit-learn (or your favorite well-maintained open source project) for examples of great docstrings.
- Organize your imports into three sections: first, packages in the standard Python library; second, third-party packages; third, application-specific packages. If you’re feeling like an overachiever, alphabetize the imports to make it easier to scan and see if a particular package is being used. Example:
    import os
    import sys

    import pandas as pd
    import seaborn as sns

    from . import myutils
- Put two blank lines around both top-level functions and classes to help them stand out when skimming the code.
- The pep8 package tests your code for compliance with Python’s PEP8 style guide. The TravisCI routine run in Shablona tests code quality with flake8, a package that runs both pep8 and pyflakes, which checks for non-PEP8 kinds of code errors.
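Several of the tips above can be seen together in one short sketch. The names here (SECONDS_PER_HOUR, TripLog, compute_total_hours, has_long_trip) are invented for the illustration, and the docstring follows the numpydoc layout that scikit-learn uses, which is one reasonable choice among several.

```python
# A named constant replaces the "magic number" 3600.
SECONDS_PER_HOUR = 3600


class TripLog:
    """Store trip durations, in seconds, for a single vehicle."""

    def __init__(self, durations_in_seconds):
        self.durations_in_seconds = list(durations_in_seconds)


def compute_total_hours(durations_in_seconds):
    """Compute the total length of a collection of trips, in hours.

    Each duration is given in seconds; the durations are summed and the
    sum is converted to hours.

    Parameters
    ----------
    durations_in_seconds : iterable of float
        Length of each trip, in seconds.

    Returns
    -------
    float
        Total length of all trips, in hours.
    """
    return sum(durations_in_seconds) / SECONDS_PER_HOUR


def has_long_trip(durations_in_seconds, threshold_in_hours=1.0):
    """Return True if any single trip lasts longer than the threshold."""
    return any(
        duration / SECONDS_PER_HOUR > threshold_in_hours
        for duration in durations_in_seconds
    )


trip_log = TripLog([1800, 5400])
total_hours = compute_total_hours(trip_log.durations_in_seconds)
long_trip_found = has_long_trip(trip_log.durations_in_seconds)
```

The verb-y prefixes hint at the return types: compute_ suggests a number and has_ suggests a boolean.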
Debugging
When I first started having to debug code, I frequently used print() statements to print out variables and check whether they were what I thought they were. Using the Python debugger (pdb) is much preferable to that. Simply insert the following line where you would like to inspect the state of various variables:
import pdb; pdb.set_trace()
This will run the code up to the line where you called pdb.set_trace() and then open a prompt where you can interactively look at the state of various variables, test the output of simple expressions, and step through the code following that statement. Pdb’s interactive prompt works particularly nicely in an iPython notebook, where it opens up below the active cell.
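Here is a minimal sketch of where that line might sit inside a function. The function compute_mean and its debug flag are hypothetical; the flag exists only so the example can run unattended, and the commands in the comments are standard pdb prompt commands.

```python
import pdb


def compute_mean(values, debug=False):
    """Return the mean of values, optionally pausing in the debugger."""
    total = sum(values)
    if debug:
        # Execution pauses here and a (Pdb) prompt opens. At the prompt:
        #   p total   print the value of a variable
        #   n         run the next line
        #   s         step into a function call
        #   c         continue running until the next breakpoint
        #   q         quit the debugger
        pdb.set_trace()
    return total / len(values)


# debug is left as False so the sketch runs without stopping;
# call compute_mean([2, 4, 6], debug=True) to try the prompt.
mean = compute_mean([2, 4, 6])
```

Once the bug is found, remember to remove the set_trace() call (or the flag) before committing the code.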
Frequently using keyboard shortcuts is a hallmark of a productive programmer, but–for me, at least–it takes time to break inefficient keyboard habits and replace them with better ones. Here are a few of my favorite keyboard shortcuts covered in the lectures that frequently save me time. These shortcuts work in iPython Notebooks and, for the most part, at the command line and in text editors like Sublime Text 2. (See the “Keyboard Shortcuts” link in the help menu of the iPython notebook for more timesavers.)
Control + a: Move to beginning of line
Control + e: Move to end of line
Control + d: Delete the next character (aka reverse backspace)
Command + [: Indent the current line (or highlighted set of lines) one space to the left
Command + ]: Indent the current line (or highlighted set of lines) one space to the right
Command + /: Comment out the current line (or highlighted set of lines)
- For more resources about software engineering for data scientists, see Trey Causey’s popular blog post on the topic.
- According to Jake, the shablona package is named for the Hebrew word for “template.” Google Translate is silent on the matter, and searching for “Shablona Hebrew” only reveals that there is a font called “Shablona” as well.
- Shablona also contains the structure needed to use Sphinx to automatically generate documentation for your code.
- In addition to unit testing, you may also run across regression testing and integration testing. Regression testing has nothing to do with statistical regressions; instead, regression tests are about testing that steps that once generated a bug no longer generate that bug. Integration testing–the most relevantly named type of testing–tests the interactions between different units of code.
- You can remind yourself that the Unix command grep works with regular expressions if you recall that it stands for globally search a regular expression and print. (This, apparently, was the one new thing I learned from the course’s brief overview of the command line.)
- Unrelated, yet nice-to-look-at, header photo c/o David Melchor Diaz