Building Reproducible Analytical Pipelines with R

Bruno Rodrigues

Intro: Who am I

Bruno Rodrigues, head of the statistics and data strategy departments at the Ministry of Higher Education and Research of Luxembourg

Slides available online at https://is.gd/raps_reprotea

Goal of this workshop

  • Identify what must be managed for reproducibility
  • Learn the following tools to make your projects reproducible:
    1. {renv}
    2. {targets}
    3. Docker
  • What we will not cover (but is very useful!):
    1. Functional programming concepts
    2. Git and GitHub
    3. Documenting, testing and packaging code

Main reference for this workshop

Building Reproducible Analytical Pipelines with R, available at www.raps-with-r.dev

What I mean by reproducibility

  • Ability to recover exactly the same results from an analysis
  • Why would you want that?
  • Auditing purposes
  • Updating the data (the only change in the results must come from the updated data)
  • Reproducibility as a cornerstone of science
  • (Work on an immutable dev environment)
  • “But if I have the original script and data, what’s the problem?”

Reproducibility is on a continuum (1/2)

Here are the 4 main things influencing an analysis’ reproducibility:

  • Version of R used
  • Versions of packages used
  • Operating system
  • Hardware

Reproducibility is on a continuum (2/2)

Source: Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27

Risks to mitigate: R versions

R < 3.6 (set.seed(1234))

sample(seq(1, 10), 5)
[1] 2 6 5 8 9

R >= 3.6 (set.seed(1234))

sample(seq(1, 10), 5)
[1] 10  6  5  4  1

Real impact on papers published with R < 3.6! (ongoing research project)
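
The culprit: R 3.6.0 changed the default sample.kind from "Rounding" to "Rejection". On a recent R, the old behaviour can still be requested explicitly (R will emit a warning about the non-uniform "Rounding" sampler):

set.seed(1234, sample.kind = "Rounding")
sample(seq(1, 10), 5)
[1] 2 6 5 8 9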

Risks to mitigate: package versions

{stringr} < 1.5.0:

stringr::str_subset(c("", "a"), "")
[1] "a"

{stringr} >= 1.5.0:

stringr::str_subset(c("", "a"), "")
Error in `stringr::str_subset()`:
! `pattern` can't be the empty string (`""`).
Run `rlang::last_trace()` to see where the error occurred.

(Actually a good change, but if you rely on that old behaviour for your script…)
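
A cheap first check is knowing which version you have installed, using base R:

packageVersion("stringr")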

Risks to mitigate: operating systems (1/3)

Rarely an issue, but see Neupane et al. (2019):

While preparing a manuscript, to our surprise, attempts by team members to replicate these results produced different calculated NMR chemical shifts despite using the same Gaussian files and the same procedure outlined by Willoughby et al. […] these conclusions were based on chemical shifts that appeared to depend on the computer system on which step 15 of that protocol was performed.

Risks to mitigate: operating systems (2/3)

Risks to mitigate: operating systems (3/3)

The problem

Works on my machine!

We’ll ship your computer then.

Project start

  • Our project: housing in Luxembourg
  • Data to analyse: vente-maison-2010-2021.xlsx in the data folder
  • 2 scripts to analyse data (in the scripts/project_start folder):
    1. One to scrape the Excel file: save_data.R
    2. One to analyse the data: analysis.R

Project start - What’s wrong with these scripts?

  • These two scripts -> a script-based workflow
  • Just a long series of calls
  • No functions:
    • difficult to re-use!
    • difficult to test!
    • difficult to parallelise!
    • lots of repetition (plots)
  • Usually we want a report, not just a script
  • No record of the package versions, nor of the R version, used

Turning our scripts reproducible

We need to answer these questions

  1. How easy would it be for someone else to rerun the analysis?
  2. How easy would it be to update the project?
  3. How easy would it be to reuse this code for another project?
  4. What guarantee do we have that the output is stable through time?

The easiest, cheapest thing you should do

  • Generate a list of the packages and the R version used, with {renv}

Recording packages and R version used

Create a renv.lock file in 2 steps!

  • Open an R session in the folder containing the scripts
  • Run renv::init() and check the folder for renv.lock

(renv::init() will take some time to run the first time)
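
For example, from the folder containing the scripts:

install.packages("renv")  # if not already installed
renv::init()              # creates renv.lock and a project-specific library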

renv.lock file

  • Open the renv.lock file
{
  "R": {
    "Version": "4.2.2",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.rstudio.com/all/latest"
      }
    ]
  },
  "Packages": {
    "MASS": {
      "Package": "MASS",
      "Version": "7.3-58.1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "762e1804143a332333c054759f89a706",
      "Requirements": []
    },
    "Matrix": {
      "Package": "Matrix",
      "Version": "1.5-1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "539dc0c0c05636812f1080f473d2c177",
      "Requirements": [
        "lattice"
      ]
    },

    ***and many more packages***

Restoring a library using an renv.lock file

  • The renv.lock file is not just a record
  • It can be used to restore the library as well!
  • Go to scripts/renv_restore
  • Run renv::restore() (answer Y to activate the project when asked)
  • This will take some time to run (so maybe don’t do it now)… and it might not work!

{renv} conclusion

Shortcomings:

  1. Records, but does not restore, the version of R
  2. Installation of old packages can fail (due to missing OS dependencies)

but… :

  1. Generating a renv.lock file is “free”
  2. Provides a blueprint for dockerizing our pipeline
  3. Creates a project-specific library (no interference with other projects)

Where are we in the continuum?

  • Package and R versions are recorded
  • Packages can be restored (but not always!)
  • But where’s the pipeline?

Build automation using {targets}

  • Go to scripts/targets_pipeline_final/ folder
  • Anatomy of the _targets.R script (a minimal sketch follows this list)
  • What’s a “target”?
  • Dependencies of the pipeline?
  • How to inspect the pipeline?
  • How to run the pipeline?
  • How to inspect a computed target?
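
Before opening the real file, here is a minimal, hypothetical _targets.R sketch (the helper functions clean_raw_data() and make_plot(), and the packages listed, are placeholders, not the workshop's actual code):

library(targets)

source("functions/functions.R")                   # load our own functions
tar_option_set(packages = c("dplyr", "ggplot2"))  # packages the targets need

list(
  tar_target(data_file, "data/vente-maison-2010-2021.xlsx", format = "file"),
  tar_target(clean_data, clean_raw_data(data_file)),
  tar_target(plot_prices, make_plot(clean_data))
)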

Only outdated targets get recomputed

  • Inspect the pipeline using targets::tar_visnetwork()
  • Let’s remove a commune from the list of communes
  • Inspect the pipeline again
  • Rerun the pipeline! Only what’s needed gets recomputed (commands summarised below)
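
For reference, the {targets} commands used in this loop (the target name clean_data is a placeholder):

targets::tar_visnetwork()      # inspect the pipeline as a dependency graph
targets::tar_make()            # run the pipeline; only outdated targets rebuild
targets::tar_read(clean_data)  # read a computed target from the store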

Our analysis as a pipeline (1/3)

  • [Discussion] What’s the difference between this pipeline and our original scripts?
  • Benefits:
    1. _targets.R provides a clear view of what’s happening (beyond documentation)
    2. Functions can be easily re-used (and packaged!)
    3. The pipeline is pure (the results don’t depend on any extra manual manipulation!)

Our analysis as a pipeline (2/3)

  1. The pipeline compiles a document (which is often what we want or need)
  2. Computations can run in parallel! (see the sketch below)
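
One way to enable parallel execution, assuming the {future} package is installed (a sketch, not necessarily the workshop's setup):

# in _targets.R, declare a parallel plan:
# library(future)
# plan(multisession)

# then run the pipeline with transient parallel workers:
targets::tar_make_future(workers = 2)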

Our analysis as a pipeline (3/3)

  • [Discussion] Let’s take stock… what else should we do and what’s missing?
  • Call renv::init()
  • So we have a pipeline, but it’s not 100% reproducible.

Ensuring long-term reproducibility using Docker

Remember the problem: works on my machine?

It turns out we will solve the issue by shipping the whole computer, using Docker.

What is Docker

  • Docker is a containerisation tool that you install on your computer
  • Docker allows you to build images and run containers (a container is an instance of an image)
  • Docker images:
    1. contain all the software and code needed for your project
    2. are immutable (cannot be changed at run-time)
    3. can be shared on- and offline

A word of warning

  • Docker works best on Linux and macOS
  • Possible to run on Windows, but you need to enable virtualisation options in the BIOS and set up WSL2
  • This intro will be as gentle as possible

“Hello, Docker!”

  • Start by creating a Dockerfile (see scripts/Docker/hello_docker/Dockerfile)
  • Dockerfile = recipe for an image (a tiny example below)
  • Build the image: docker build -t hello .
  • Run a container: docker run --rm --name hello_container hello
  • --rm: remove the container after running
  • --name some_name: name your container some_name
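
Such a Dockerfile can be tiny. A hypothetical two-line version (the actual file is in scripts/Docker/hello_docker/):

FROM ubuntu:22.04
CMD ["echo", "Hello, Docker!"]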

Without Docker

With Docker

Dockerizing a project (1/2)

  • At image build-time (full sketch after this list):
    1. install R (or use an image that ships R)
    2. install packages (using our renv.lock file)
    3. copy all scripts to the image
    4. run the analysis using targets::tar_make()
  • At container run-time:
    1. copy the outputs of the analysis from the container to your computer
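
Put together, these steps could look like the following sketch of a Dockerfile (the base image tag matches the R version recorded in our renv.lock; file names and the output path are assumptions, not the workshop's exact file):

FROM rocker/r-ver:4.2.2

# build-time: install {renv} and restore the recorded package versions
RUN R -e "install.packages('renv')"
COPY renv.lock renv.lock
RUN R -e "renv::restore()"

# build-time: copy the scripts and run the pipeline
COPY _targets.R _targets.R
COPY functions/ functions/
RUN R -e "targets::tar_make()"

# run-time: copy the output to a folder shared with the host
CMD mv output/analysis.html /shared_folder/analysis.html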

Dockerizing a project (2/2)

  • The built image can be shared, or only the Dockerfile (and users can then rebuild the image)
  • The outputs will always stay the same!

Build-time vs run-time

  • Important to understand the distinction
  • Build-time:
    1. builds the image: software, packages and dependencies get installed using RUN statements
    2. you must ensure that the correct versions get installed (building today or in two years should make no difference)
  • Run-time:
    1. The last command, CMD, gets executed

The Rocker project

  • It’s possible to build new images on top of existing ones
  • The Rocker project provides many images with R, RStudio, Shiny, and other packages pre-installed
  • We will use Rocker’s “r-ver” images, which are specifically made for reproducibility

Docker Hub

  • Images get automatically downloaded from Docker Hub
  • You can build an image and share it on Docker Hub (see here for an example; the basic commands are sketched below)
  • It’s also possible to share images on another image registry, or without one at all
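
The basic commands, with a placeholder account name (my_account is hypothetical):

docker login                                        # authenticate with Docker Hub
docker tag housing_image my_account/housing_image   # name the image under your account
docker push my_account/housing_image                # upload the image
docker pull my_account/housing_image                # what others run to download it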

An example of a Dockerized project

Look at the Dockerfile here.

  • In your opinion, what does the first line do?
  • In your opinion, what are the lines 3 to 24 doing? See ‘system prerequisites’ here
  • What do all the lines starting with RUN do?
  • What do all the lines starting with COPY do?
  • What does the very last line do?

Dockerizing our project (1/2)

  • The project is dockerized in scripts/Docker/dockerized_project
  • It contains:
    1. A Dockerfile
    2. A renv.lock file
    3. A _targets.R script
    4. The source of our analysis: analyse_data.Rmd
    5. The required functions in the functions/ folder

Build the image: docker build -t housing_image .

Dockerizing our project (2/2)

  1. Run a container:
    1. First, create a shared folder on your computer
    2. Then, use this command (shown split up below), but change /path/to/shared_folder to the one you made: docker run --rm --name housing_container -v /path/to/shared_folder:/home/housing/shared_folder:rw housing_image
  2. Check the shared folder on your computer: the output is now there!
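
For readability, the run command from step 1, split across lines:

# --rm: remove the container once it exits
# -v host_path:container_path:rw: mount your shared folder read-write inside the container
docker run \
  --rm \
  --name housing_container \
  -v /path/to/shared_folder:/home/housing/shared_folder:rw \
  housing_image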

Docker: a panacea?

  • Docker is very useful and widely used
  • But the entry cost is high
  • And it’s a single point of failure (what happens if Docker gets bought, abandoned, etc.?)
  • There are alternatives (Podman, or, without containerisation, Nix)

Conclusion

  • At the very least, generate an renv.lock file
  • It’s always possible to rebuild a Docker image in the future (either you or someone else can do it!)
  • Consider using {targets}: not only good for reproducibility, but also an amazing tool all around
  • Long-term reproducibility: you must use Docker (or an alternative), and ongoing maintenance effort is required as well

The end

Contact me if you have questions:

  • bruno@brodrigues.co
  • Twitter: @brodriguesco
  • Mastodon: @brodriguesco@fosstodon.org
  • Blog: www.brodrigues.co
  • Book: www.raps-with-r.dev

Thank you!