Computer Science Notes

Computer Science Notes

CS Notes is a simple blog to keep track about CS-related stuff I consider useful.

26 Aug 2020

Notebooks in Data Science Development and other tools for reproducibility.

by Harpo MAxx (3 min read)

Thanks to my friends at DHARMA for inspiring this humble post 😊 .

In 2004 Robert Gentleman and Duncan Temple Lang wrote a BioConductor working paper called Statistical Analysis and Reproducible Research. In this paper they argued that one should tightly integrate code, methods, algorithms and narrative as closely as possible to facilitate easy reproduction and extension, this idea match perfectly with the concept of Literate programming produced by the genius mind of Donald Knuth.

D.K.

Notebooks

Literate programming (AKA notebook) has been in the Data Science scene for a while. They have become convenient for testing ideas, exploring the data, and even testing models. However, they have some limitations. As your project gets bigger, you eventually will need to move to plain old scripts. Notebooks were not created for replacing years of software engineering. The whole idea behind literate programming is to include the code for analyzing your data in your research report. The report part is way more important than the code part. In other words, if you put only a simple line of comment above your code, you are missing the literate part in literate programming. I think It would be better for you to use a plain old script and consider a real IDE. IMHO a notebook should follow a research paper’s structure with the benefits of having (hopefully) runnable code.

u-shape

An example of a notebook not providing any valuable information to the reader. These kind of notebooks clearly not follow the literate programming paradigm. Still it could be useful as a replacement of the usual presentation tools (AKA ppt)

Two interesting articles about the notebook vs. scripts topic:

  1. A Medium article discussing the benefits and limitations of notebooks vs scripts.
  2. A well-known python IDE is now including notebook support. Something similar to RMarkdown notebooks in Rstudio.

Reproducible Research

One of the ideas of Literate programming was to provide a way for facilitate research reproducibility. Ideally, you can share your notebook to any of your colleague and she should be able to reproduce your whole experiment. However, that situation is uncommon. A notebook requires also your complete environment (i.e. same interpreter, library, etc.), which not always match with your own instalation. So what are the option for dealing with the situation? These are some of the options I found during the last months/years.

  1. DVC stands for Open-source Version Control System for Machine Learning Projects, and as you probably already realized, it, SVC is a sort of git for machine learning projects. It keeps track not only of your code, but also your data and models. I haven’t tried it yet, but it seems very interesting.

  2. Ana Diedrichs from DHARMA point me to Binder. Binder is something new (at least for me). The idea of binder is to provide a mechanism for dealing with all the problems of notebooks environments. It work on Jupyter and Rstudio notebooks. For R, you need put your notebook in your github repo, include the install.R file with all your library dependencies and then call the Binder API via some special badge. More info here.

  3. Plumber for R is a package that allow you to create a web API by merely decorating your existing R source code. My personal choice, actually. I use plumber in combination with Docker for deploying every new version of my models. For instance, if I submit an article to some conference I usually deploy a version of the model using Plumber. Then I can continue with my usual workflow, and anytime I need to try my model, it is easily accessible from the service.

  4. Flask is basically the same as Plumber but for the Python language. Flask similarly to the Plumber approach works better in combination with Docker. Emiliano also from from DHARMA point me to this article

  5. renv package. The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.