In [1]:
from IPython.display import SVG, Image

class SVGImage():
    """Display an image file (e.g. SVG) in the notebook via an HTML <img> tag."""
    def __init__(self, fname, width=None):
        self.fname = fname
        self.width = width
    def _repr_html_(self):
        if self.width:
            return "<img src='{}' width='{}'/>".format(self.fname, self.width)
        else:
            return "<img src='{}'/>".format(self.fname)
    
image_width = '80%'

Python on the lab bench

Bartosz Teleńczuk

with contributions from the FOSS community

Prelude

Science as an art of problem solving

  • scientific computing ≠ software engineering (although both can learn from each other)
  • rapid prototyping
  • mixing tools implemented in different labs, languages, systems, etc.
  • each problem is different and needs its own methods, tools, and process

Tips

  • focus on problems rather than tools
  • fail early, fail often
  • test your assumptions
  • learn the tools used in your field

Neural code 1

In [2]:
SVGImage('images/spike_train.svg', width=image_width)
Out[2]:

Neural code 2

In [3]:
Image('images/nrn1001-704a-i2.png', width=600)
Out[3]:

Data from Simmons et al., Transformation of Stimulus Correlations by the Retina. PLOS Computational Biology, 2013. doi:10.1371/journal.pcbi.1003344

Project phases

  • Phase 1: Data exploration
  • Phase 2: Analysis workflow
  • Phase 3: Batch processing
  • Phase 4: Automation

Phase 1:

Data exploration

What is exploratory data analysis (EDA)?

  • uses a graphical and interactive approach
  • focuses on getting insight into the data rather than formal statistical modelling
  • relies on our pattern recognition capabilities

Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone — as the first step.

— John W. Tukey in Exploratory Data Analysis

Why EDA?

  • checks data sanity
  • leads to serendipitous findings
  • helps to form new hypotheses
  • sets standards for further experiments
  • helps to select the best tools and procedures

Tools for EDA

In [4]:
SVGImage('images/logos.svg', width='90%')
Out[4]:

The power of shell (aka command line)

  • the shell is (yet another) programming language
  • specialises in operating on files and executing external tools (incl. Python scripts)
  • allows passing parameters to programs (command-line arguments), as shown in the sketch below
  • language agnostic
  • powerful editors (vim, emacs)
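
A minimal sketch of the command-line-arguments point above: a Python script can expose its parameters as command-line arguments with the standard-library argparse module (the script name analyse.py and its options are hypothetical):

# analyse.py -- hypothetical script meant to be run from the shell
import argparse

def main():
    parser = argparse.ArgumentParser(description='Analyse a single data file')
    parser.add_argument('datafile', help='path to the input data file')
    parser.add_argument('--threshold', type=float, default=0.5,
                        help='detection threshold (default: 0.5)')
    args = parser.parse_args()
    print('analysing', args.datafile, 'with threshold', args.threshold)

if __name__ == '__main__':
    main()

It can then be combined with other shell tools, e.g. python analyse.py ../data/file1.txt --threshold 0.8.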

Extra resources

JW Tukey, The future of data analysis, online

JW Tukey, Exploratory data analysis

Phase 2:

Data analysis workflow

What is data analysis workflow?

  • interchangeable elements connected by a common interface
  • data-flow oriented (usually represented as a graph)
  • data provenance

Unix philosophy

Unix philosophy emphasizes building short, simple, clear, modular, and extensible code that can be easily maintained and repurposed by developers other than its creators

— Wikipedia

  • Small is beautiful.
  • Make each program do one thing well.
  • Build a prototype as soon as possible.
  • Store data in flat text files (illustrated in the sketch below).
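
A minimal illustration of the last point, writing intermediate results to a flat, human-readable text file (the file name and columns are hypothetical):

import csv

# correlation values computed earlier (hypothetical)
correlations = [('cell1', 'cell2', 0.42), ('cell1', 'cell3', 0.17)]

with open('correlations.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['unit_a', 'unit_b', 'correlation'])
    writer.writerows(correlations)

Such files can later be inspected, grepped, or re-read by other tools without special libraries.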

Steps

  • database access / query
  • data analysis
  • visualisation

Workflow managers

General purpose:

  • command line: drake
  • Python-based: luigi, joblib, sumatra
  • Web/GUI-based: Taverna, Kepler, VisTrails

Specialised:

  • machine learning: Modular toolkit for Data Processing (MDP), RapidMiner
  • bioinformatics: Galaxy
  • 3D visualisation: VTK

Simple Python-based workflow

import parse_data
import calculate_correlations
import plot_histogram

def main(data_path):
    # step 1: read and parse the raw data file
    data = parse_data.main(data_path)

    # step 2: compute the correlations
    correlations = calculate_correlations.main(data)

    # step 3: plot and save the summary figure
    plot_histogram.main(correlations,
          saveto='correlation_histogram.svg')

if __name__ == '__main__':
    data_path = '/location/of/datafile'
    main(data_path)

Data management

  • keep backups
  • never change the raw data
  • maintain effective metadata
  • separate processed files from raw data
  • separate code from configuration files (see the sketch below)
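
A minimal sketch of the last point, assuming analysis parameters live in a separate params.json file next to the code (the file name and keys are hypothetical):

# load analysis parameters from a configuration file instead of hard-coding them
import json

with open('params.json') as f:
    params = json.load(f)

bin_size = params['bin_size']      # e.g. 0.01
threshold = params['threshold']    # e.g. 0.5

Changing parameters then never requires editing the analysis code itself.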

Directory structure

In [5]:
!tree .. -C -L 2 --dirsfirst --noreport -d
..
├── data
├── docs
│   └── images
├── figures
├── libs
│   └── pyNeuro
├── results
├── scripts
└── workflows

Extra resources

Andrew Davison, Best practices for data management in neurophysiology, online

Jeroen Janssens, Data Science at the Command Line, O'Reilly

V. Cuevas-Vicenttín et al., Scientific Workflows and Provenance: Introduction and Research Opportunities, arXiv

Phase 3:

Batch processing

Batch processing

  • run the same analysis on a set of data files
  • usually lends itself to easy parallelisation (embarrassingly parallel); see the sketch below

Simple Python-based batch processing

import single_analysis
files = ['../data/file1.txt', 
         '../data/file2.txt']

for fname in files:
    single_analysis.main(fname)
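
Because each file is processed independently, the loop above can be parallelised with little effort, for example with the standard-library multiprocessing module (a sketch, assuming single_analysis.main takes a single file name):

import multiprocessing

import single_analysis

files = ['../data/file1.txt',
         '../data/file2.txt']

if __name__ == '__main__':
    # run the per-file analyses in parallel worker processes
    with multiprocessing.Pool() as pool:
        pool.map(single_analysis.main, files)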

Phase 4:

Automation

Who is it good for?

Software engineers:

  • compiling computer source code into binary code
  • running automated tests
  • creating documentation from sources

Scientists:

  • running analyses
  • producing figures
  • compiling source documents (such as $\LaTeX$)

Dependency tracking

You specify rules and recipes; the build tool determines which ones to execute and in what order.

Rule 1:
input.txt --> intermediate.txt | script1.py

Rule 2:
intermediate.txt,params.json --> results.txt | script2.py
In [6]:
SVGImage('images/dependency_graph.svg', width='90%')
Out[6]:
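
For illustration, the two rules above could be written for a Python-based tool such as doit (listed under Automation tools below); how script1.py and script2.py take their arguments is an assumption:

# dodo.py -- task definitions picked up by doit
def task_intermediate():
    return {'file_dep': ['input.txt'],
            'targets': ['intermediate.txt'],
            'actions': ['python script1.py input.txt intermediate.txt']}

def task_results():
    return {'file_dep': ['intermediate.txt', 'params.json'],
            'targets': ['results.txt'],
            'actions': ['python script2.py intermediate.txt params.json results.txt']}

Running doit then executes only the tasks whose dependencies have changed, in the order implied by the graph above.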

Automation tools

  • build tools: make, cmake, ant
  • Python-based build tools: SCons, waf
  • general-purpose: doit, rake
  • specialised data analysis: drake, luigi

Extra resources

Anthony Scopatz & Kathryn Huff, Effective Computation in Physics, O'Reilly