Using Docker for reproducible computational publications

Summary: Producing reproducible computational analyses is a growing concern. Many scientists are starting to publish their code and data, but it is often still a big hassle for others to run that code and reproduce results. Docker allows you to publish not only the necessary code and data, but also the exact compute environment used. Below I discuss ways Docker could be used to improve research reproducibility.

Reproducible research is hard

Just open up the Internet and you will see tons of talk about reproducible research: check it out on Twitter, in Nature, all over the blogosphere, and at more than a few genomics startups. Clearly, there is growing concern about how we can make our analyses more reliable and transparent.

Encouragingly, some publications are starting to actually publish the code and data used for analysis and for generating figures. For example, many articles in Nature and in the new eLife journal (and others) provide a "source data" download link next to figures. eLife sometimes even provides an option to download the source code for figures.

This is a great start, but there are still lots of problems: if I really want to rerun analyses from these papers, it is actually pretty annoying. I download the scripts and data, then realize I don't have all the required libraries, or that the code only runs on an Ubuntu version I don't have, or that some paths are hard-coded to match the authors' machines. If it's not easy for me to run and doesn't run "out of the box", the chances that I will ever actually run most of these scripts are close to zero.

For published code and data to be useful, the analysis not only has to be reproducible, it also has to be easy to reproduce and easy to interact with. Otherwise no one will ever touch it. Right now, we post code and data and say: "it worked for me, good luck." The burden is then on the user to figure out how in the world to get this stuff to run. Really, most of that burden should be on the authors if they want other people to use what they published.

I recently started experimenting with Docker, which might be a solution to many of these problems. Basically, it is kind of like a lightweight virtual machine in which you can set up a compute environment, including all the libraries, code, and data you need, in a single "image". That image can be distributed publicly and can run seamlessly on essentially any major operating system. No need for the user to mess with installation, paths, etc. They just run the Docker image you provided, and everything is set up to work out of the box. Many others (e.g. here and here) have already started discussing this.

I think it'd be pretty awesome if, when we published a paper, we also provided a Docker image that is already set up to run all the analyses and recreate all the figures in the study. Below I discuss how to use Docker to do this, show an example Docker image that reproduces figures in RStudio from a published paper, and talk about what I think are the promises and problems. I'd love to have a discussion about how people think this kind of approach can or cannot solve reproducibility issues.

Using Docker for reproducible analyses

Here I want to describe a little bit how the setup works.

Docker has a repository of images called Docker Hub that is pretty analogous to GitHub, except it's for Docker images instead of code. You can pull, push, track, and modify images and share them with other people. For instance, I made an image to run an RStudio server by pulling the base ubuntu:14.04 image, modifying it to include everything needed to run RStudio, and making a new image out of that (which I deposited at mgymrek/docker-rstudio-server; the GitHub repo it's built from has the same name). Everything needed to build the image is specified in a Dockerfile that you can view there, so you know exactly how I created that environment.
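To give a flavor of what such a Dockerfile looks like, here is a minimal sketch of building an RStudio Server image on top of ubuntu:14.04. This is an illustration, not the actual build recipe (view that in the repo above); the package list and RStudio version below are assumptions.

# Illustrative Dockerfile sketch; packages and version are assumptions
FROM ubuntu:14.04

# Install R plus tools needed to fetch and install RStudio Server
RUN apt-get update && apt-get install -y r-base gdebi-core wget

# Download and install RStudio Server (version number is illustrative)
RUN wget http://download2.rstudio.org/rstudio-server-0.98.1091-amd64.deb && \
    gdebi -n rstudio-server-0.98.1091-amd64.deb

# RStudio Server listens on port 8787 by default
EXPOSE 8787

# Run the server in the foreground when the container starts
CMD ["/usr/lib/rstudio-server/bin/rserver", "--server-daemonize=0"]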

A very cool feature is that Docker can link up with your GitHub account: you can make a GitHub repository containing a Dockerfile that specifies how to build an image and what scripts to add to it. Docker will automatically detect when you change that repository, build a Docker image using the instructions in your Dockerfile, and post it as a "trusted build" (see automated builds).

So my vision of how this would work for a publication: every publication would have its own GitHub repository, where you store all the code you used, under version control. You would include a Dockerfile in this repository specifying how to set up the exact environment required, and you would end up with an automatically built Docker image on Docker Hub that anyone can download and run out of the box. One possible repository layout is sketched below.
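For instance, a publication repository might look something like this (the layout and file names are hypothetical, just to make the idea concrete):

mypaper/
    Dockerfile        # recipe for building the exact compute environment
    scripts/          # analysis and figure-generation code
        figure1.R
        figure2.R
    data/             # small data files, or instructions for downloading them
    README.md         # how to run the resulting Docker image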

When you run a Docker container, you can specify a command for it to run when it starts up (the Docker documentation explains all about this). For instance, if you want the user to be able to play around inside the container, they can start it up with the command /bin/bash. However, you can also provide an image that doesn't require users to do any playing around in the shell. In the example below, I set it up so that the container automatically runs the web browser version of RStudio with all the code and data in place, so that all one needs to do is run the container, go to the web browser, and start having fun. Both modes of running a container are sketched below.
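Concretely, the two modes look something like this (standard docker run usage; the image name is the RStudio image from above, and the host port is arbitrary):

# Drop into an interactive shell inside the container to poke around
docker run -t -i mgymrek/docker-rstudio-server /bin/bash

# Or run detached (-d) with the image's default startup command,
# publishing RStudio Server's port 8787 to port 49000 on the host
docker run -d -p 49000:8787 mgymrek/docker-rstudio-server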

But that is only one way to do things. You could do a similar setup using an IPython notebook rather than RStudio. Or you could provide a single Makefile that regenerates all of the figures and reproduces the analyses with one command. Instead of shipping the data inside the Docker image, you can also provide the data as a separate download that users mount into the container when they run it (which I think is better anyway, since it makes sense to separate the code from the data); a sketch of this follows below.
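For example, assuming the data lives at /path/to/data on the host and the scripts inside the container expect it at /data (both paths are hypothetical), mounting it read-only would look like:

# Mount host data into the container at runtime, keeping
# potentially large data files out of the image itself
docker run -v /path/to/data:/data:ro -d -p 49000:8787 mgymrek/docker-reproducibility-example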

Below is a working example of how I got this to work going the RStudio route. But there are countless ways to set this up, and I'd love to discuss different possibilities.

Example - reproducing Martin et al., PLOS Genetics 2014

I wanted to go through an exercise of how it would actually work to provide a Docker image to reproduce the analyses of a publication. As an example, I am using code and data from "Transcriptome Sequencing from Diverse Human Populations Reveals Differentiated Regulatory Architecture" by Martin and Costa et al., recently published in PLOS Genetics. The authors published the code and data used for the analysis and figures along with the manuscript. First of all, I want to give the Bustamante Lab huge kudos for this: although many people are talking about publishing code and data, few people are actually providing it, and I was very happy to see this. I also want to point out that the code I use in the example below is modified from the code they published with the paper, and that I received permission from the authors to post it.

Even though the data and code were provided, it was not immediately smooth sailing: the code required many R packages that I did not have installed, some scripts depended on scripts from a previous publication, it wasn't immediately clear which scripts went with which figures, and I had to change some hard-coded paths that matched the authors' machines. But eventually I got the scripts to run with minimal modifications, so the figures really were reproducible.

This took a fair amount of fighting and setup time to install all the things I was missing, and I can bet that few people are going to be patient enough to go through that process. But these are exactly the problems that Docker can solve: I can set up this compute environment once and then provide it to other people in a form that just runs, almost regardless of what computing environment they're using.

For my example, I chose two supplementary figures from this study (Figures S7B and S1) and made a Docker image that provides everything needed to reproduce them in RStudio. I chose these mostly because they were the first scripts I looked at that ran in under a couple of minutes and for which I could easily tell which figure each script belonged to. The image is loaded with a running installation of RStudio Server and all the code, data, and dependencies needed to produce those figures. Each figure has its own script that you can open in RStudio and run out of the box to reproduce what you see in the published manuscript.

Run the Docker image yourself

I assume you have some familiarity with Docker and have it installed (if not, the tutorials on the Docker site are great and show you how much fun it is).
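A quick sanity check that the installation works:

# Print client and server version info; an error here usually
# means the Docker daemon isn't running
docker version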

The Docker image for this example is published on Docker Hub as mgymrek/docker-reproducibility-example. To run it, simply run:

docker run -p 49000:8787 -d mgymrek/docker-reproducibility-example

(You can replace 49000 with another open port if you want.) Navigate in your web browser to http://0.0.0.0:49000. This will open up RStudio. The username and password are both "guest". You can open the files in the scripts/ directory and run them to reproduce figures S7B and S1 from Martin et al. (martin_etal_figS1.R and martin_etal_figS7B.R). Voila: you didn't have to install RStudio, you didn't have to install any R packages, you didn't have to configure any paths, and it should just work (although this is my first Docker attempt, so give me a break if it doesn't!).
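When you're done, you can find and stop the running container with standard Docker commands (the container ID will differ on your machine):

# List running containers and note the container ID
docker ps

# Stop the container when you're finished
docker stop <container-id>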

Technical Notes

Challenges

I am no expert, but I think Docker is great and solves a lot of problems. That being said, there are still challenges.

Conclusion

I would love to hear what others think and to discuss this and other options for improving reproducibility. As I mention above, I don't think Docker is the perfect solution, or that a single perfect solution exists, but it solves a lot of problems and is a step in the right direction.

Special thanks to Assaf Gordon and Alon Goren for all the great conversations about this.
