Introducing Snakemake

Bioinformatics analyses, especially those involving next-generation sequencing (NGS) technologies, usually consist of applying many command-line tools or scripts, e.g. for mapping reads, calling variants, estimating differential expression, transforming intermediate results, and creating visualizations. As the amount of data collected in biology and medicine grows, reproducible and scalable automatic workflow management becomes increasingly important. Snakemake [1] is a workflow management system consisting of a text-based workflow specification language and a scalable execution environment. It allows workflows to be executed in parallel on workstations, compute servers, and clusters without modifying the workflow definition. Following the widely known GNU Make paradigm, Snakemake lets users define workflows in terms of rules that create output files from input files. Dependencies between rules are detected automatically by matching file names, and rules can be generalized using wildcards. Snakemake extends the Make paradigm by allowing more than one wildcard in file names and multiple output files per rule, which is particularly important when dealing with bioinformatics tools. Further, it provides data provenance functionality and ways to annotate rules with additional information, such as the computational resources they need. These resource annotations serve as input to a scheduling algorithm based on a multidimensional knapsack problem. This allows Snakemake to maximize workflow execution speed while not exceeding given constraints such as the number of available processor cores, cluster nodes, or auxiliary hardware like graphics cards.
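The idea of detecting dependencies by matching file names against wildcard patterns can be illustrated with a small Python sketch. This is not Snakemake's implementation or its workflow syntax. The rule names, file patterns, and the helpers pattern_to_regex and plan are hypothetical, and the sketch assumes that each wildcard appears at most once per pattern.

```python
import re

# Hypothetical rules: each maps one output pattern to a list of input patterns.
RULES = {
    "map_reads": {
        "output": "mapped/{sample}.bam",
        "input": ["reads/{sample}.fastq"],
    },
    "call_variants": {
        "output": "calls/{sample}.vcf",
        "input": ["mapped/{sample}.bam"],
    },
}


def pattern_to_regex(pattern):
    """Turn 'mapped/{sample}.bam' into a regex with one named group per wildcard."""
    # re.split with a capturing group yields: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard
    return re.compile(regex)


def plan(target, existing, steps=None):
    """Return the rule applications needed to create target, dependencies first."""
    steps = [] if steps is None else steps
    if target in existing:
        return steps  # file is already there, nothing to do
    for name, rule in RULES.items():
        match = pattern_to_regex(rule["output"]).fullmatch(target)
        if match is None:
            continue
        wildcards = match.groupdict()
        # Fill the matched wildcard values into the inputs and resolve them first.
        for input_pattern in rule["input"]:
            plan(input_pattern.format(**wildcards), existing, steps)
        step = (name, wildcards)
        if step not in steps:
            steps.append(step)
        return steps
    raise ValueError(f"no rule produces {target!r}")


if __name__ == "__main__":
    for rule_name, wildcards in plan("calls/A.vcf", existing={"reads/A.fastq"}):
        print(rule_name, wildcards)
```

Requesting calls/A.vcf matches the output pattern of call_variants with sample set to A, which in turn requires mapped/A.bam and therefore map_reads. The user only names the final target, and the chain of rules follows from the file names.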

Since its publication, Snakemake has been widely adopted and has been used to build analysis workflows for a variety of publications. With around 900 to 1200 homepage visits per month this year [2] and more than 65,000 downloads since the first release [3], it appears to have a stable community of regular users.

This tutorial will introduce the Snakemake workflow definition language with a typical NGS example and describe how to use the execution environment to scale workflows to compute servers and clusters while adapting to hardware-specific constraints. It will also show how Snakemake helps to create reproducible analyses that can be adapted to new data with little effort.
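To give an impression of what adapting to hardware-specific constraints means, the following Python sketch selects a set of ready jobs under several resource limits at once, which is the shape of a multidimensional knapsack problem. It is a brute-force toy and not Snakemake's actual scheduler. The job names, the per-job value, and the function select_jobs are assumptions made for illustration.

```python
from itertools import combinations


def select_jobs(jobs, capacity):
    """Pick the feasible subset of jobs with the highest total value.

    jobs:     dict mapping job name -> (value, {resource: amount needed})
    capacity: dict mapping resource -> amount available
    Resources that a job does not mention are treated as not needed.
    """
    best, best_value = (), 0
    names = sorted(jobs)
    # Enumerate every non-empty subset; fine for a handful of jobs only.
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            usage = {
                res: sum(jobs[name][1].get(res, 0) for name in subset)
                for res in capacity
            }
            if any(usage[res] > capacity[res] for res in capacity):
                continue  # exceeds at least one constraint
            value = sum(jobs[name][0] for name in subset)
            if value > best_value:
                best, best_value = subset, value
    return set(best)


if __name__ == "__main__":
    # Hypothetical jobs with an illustrative value and resource demands.
    jobs = {
        "map_A": (4, {"cores": 4}),
        "map_B": (3, {"cores": 4}),
        "train": (5, {"cores": 2, "gpu": 1}),
        "index": (1, {"cores": 6}),
    }
    print(select_jobs(jobs, capacity={"cores": 8, "gpu": 1}))
```

With 8 cores and one graphics card available, running map_A together with train gives the highest total value without exceeding either limit, while adding map_B would require 10 cores. A practical scheduler would replace the exhaustive search with a faster exact or approximate method.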

Organizer

Dr. Johannes Köster

Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute
Harvard School of Public Health