
Reproducible workflow

A research project is computationally reproducible if a second investigator (including you in the future) can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions.

Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. Link to book

In this blog, I do not intend to repeat the content of this well-written book, but rather to collate some notes and resources for practicing a reproducible workflow. (I’m writing mainly for R users here, but a lot applies to Python users as well.)

As we gradually adopt the practices below, we can make our work reproducible for ourselves, our research team, colleagues, others in the research field, and even the general public.

Working environment

Working only on a personal computer comes with many restrictions: others, as well as ourselves in a different setting, might not be able to reproduce our work. It is important to set up a working environment that can be accessed easily from most places. I recommend using cloud computing and cloud storage.

  • Cloud computing

    • At UofM, there is the Great Lakes HPC Cluster. Check my documentation prepared for Zhu Lab for usage. Note the rate limits and quotas.
    • AWS EC2 instances are fairly popular and easy to use; recommended. You have full control over your instance. They are not free, but we can apply for research credits.
    • Microsoft Azure VMs are similar to EC2.
    • Google Colab is a hosted Jupyter Notebook service. Free. Good for machine learning tasks.
    • CyVerse, funded by NSF, is an open science workspace. They host lots of workshops.
  • Cloud storage

    • At UofM, there is Turbo. Check my documentation prepared for Zhu Lab for usage. Note the rate limits and quotas.
    • Google Drive provides large storage for many university accounts, and it can be mounted easily as a local drive. It can also be accessed from R after authentication, as shown below. It is not appropriate for sensitive data.
    # Install the googledrive package if it is not already installed
    if (!require(googledrive)) {
      install.packages("googledrive")
    }
    
    # Load the package
    library(googledrive)
    
    # OAuth 2.0 flow; this will prompt for browser authentication
    drive_auth()
    
    # List files in your Google Drive
    drive_ls()
    
    # Download a file from Google Drive by its name (use as_id() to download by file id)
    drive_download("example.txt")
    
    
    • AWS S3 buckets are fairly popular and easy to use; recommended. You can also access them from R.
    # Install aws.s3 package if not already installed
    if (!require(aws.s3)) {
      install.packages("aws.s3")
    }
    
    # Load aws.s3 library
    library("aws.s3")
    
    # Set up S3 access credentials
    Sys.setenv("AWS_ACCESS_KEY_ID" = "your_aws_access_key_here",
                "AWS_SECRET_ACCESS_KEY" = "your_aws_secret_access_key_here", 
                "AWS_DEFAULT_REGION" = "us-east-1") 
    
    # Get object list in the bucket
    bucketlist <- get_bucket(bucket = "your_bucket_name_here")
    
    # Read a csv file from the bucket
    data <- s3read_using(FUN = read.csv, object = "your_file_name_here", bucket = "your_bucket_name_here")
    
    # View dataframe
    View(data)
    
  • Reminders

    • It is important to use relative paths in your code so that they still work when you change your working environment (see the short example at the end of this list).
    • Use the correct working directory. Using R projects can help with that.
    • Use symbolic links when using external folders for storage. A symbolic link works like a shortcut: you won’t have to change the paths in your code, only re-create the link when you change computing environments. This works on Linux but not on Windows.
    ln -s [source] [destination]
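    
    As a small illustration of the relative-path reminder above (the folder layout and file name are hypothetical), reading data relative to the project root looks like this:
    # Read data with a path relative to the project root,
    # assuming the script is run from an R project at the repository root
    dat_climate <- read.csv(file.path("alldata", "input", "climate.csv"))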
    

Documentation and version control

We make a lot of changes to our project along the way, and we can easily lose track of them. We want to keep deprecated work without cluttering the working directory, and we want to be able to retrieve previous work and understand what we were doing.

  • Documentation
    • Always write comments for your code.
    • Write metadata. For example, use the EML (Ecological Metadata Language) template for ecological data.
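    A minimal sketch, assuming the EML R package (the names and title below are placeholders):
    # Build a minimal EML record as nested lists
    library(EML)
    me <- list(individualName = list(givenName = "Jane", surName = "Doe"))
    my_eml <- list(dataset = list(
      title = "Example plant trait dataset",
      creator = me,
      contact = me))
    
    # Write the record to XML and validate it against the EML schema
    write_eml(my_eml, "metadata.xml")
    eml_validate("metadata.xml")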
    • Name your functions and objects informatively. Here is what I do.
      • Key functions are named in the fashion of action_object_suffix. Types of actions include: down, tidy, read, calc, plot, test, summ. The object is the target of the action, such as “climate” or “plant.” The suffix represents the level of subsetting or a variation, such as “individual.” There can be multiple suffixes.
      • Key R objects are named in the fashion of prefix_class_object_suffix. Types of classes include: dat, v, ras, df, gg. The prefix “ls” means list. Object and suffix are similar to the above.
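      A runnable sketch of these conventions (the function and data are made up for illustration):
      # Function named action_object_suffix: calculate annual climate summaries
      calc_climate_annual <- function(df_climate) {
        aggregate(temp ~ year, data = df_climate, FUN = mean)
      }
      
      # Objects named prefix_class_object_suffix
      df_climate_daily <- data.frame(year = rep(2020:2021, each = 2),
                                     temp = c(10, 12, 11, 13))
      df_climate_annual <- calc_climate_annual(df_climate_daily)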
    • Write project meeting notes, perhaps in Google Docs. Google Docs now has a shortcut to generate headings for meeting notes: type @meeting notes and choose the meeting you are taking notes for.
  • Version control
    • GitHub is strongly recommended. See my previous blog for a quick start.

    • Use different branches for different stages of the project. Here is what we recommend.

      • An “analysis” branch for preliminary analysis. This branch may be messy.
      • A “manuscript” branch to reproduce all analyses used in the manuscript during the submission stage. Code in this branch should be tidy, preferably in package form. The data folder might be bulky, so it can be an external folder mounted with a symbolic link. An example of the structure looks like this:
      your_package/
      |-- DESCRIPTION
      |-- NAMESPACE
      |-- R/
      |   |-- file1.R
      |   |-- file2.R
      |-- man/
      |   |-- file1.Rd
      |   |-- file2.Rd
      |-- alldata/
      |   |-- input/
      |   |-- intermediate/
      |   |-- output/
      |-- vignettes/
      |-- LICENSE
      
      • A “release” branch to provide a concise, reproducible version for the public after publication. Code should be a subset of the code in the “manuscript” branch, organized as a package. Small data files should be included in the package, and there should be no dependency on external data folders. An example of the structure looks like this:
      your_package/
      |-- DESCRIPTION
      |-- NAMESPACE
      |-- R/
      |   |-- file1.R
      |   |-- file2.R
      |-- man/
      |   |-- file1.Rd
      |   |-- file2.Rd
      |-- data/
      |-- inst/
      |   |-- extdata/
      |   |-- figures/
      |   |-- doc/
      |-- Meta/
      |-- vignettes/
      |-- LICENSE
      
    • Use “releases” in GitHub to mark important versions that you might come back to, and try to write release notes.
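    If you prefer to do this from R, one option (an assumption on my part, requiring the usethis package and a configured GitHub token) is usethis::use_github_release(), which creates a GitHub release from your package version and NEWS file:
    # Create a GitHub release for the current package version
    # (review the resulting release on GitHub afterwards)
    usethis::use_github_release()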

Sharing work

This is a key part of computational reproducibility.

  • Sharing code

    • Again, GitHub (or Bitbucket).
  • Sharing code and data

    • For publications, there are many repositories that accept bundles of code and data for peer review and publication. They also give you a permanent DOI for your code and data. Popular ones are Zenodo, Figshare, and Dryad; Zenodo and Figshare integrate easily with GitHub.
  • Sharing working environment, code, and data

    • This is sometimes important if your project uses some software or packages that are not commonly available, or requires specific versions of them. As you know, package installation can be very frustrating and can discourage others from reproducing your work.
    • Provide session info in your README, for example the output of:
    sessionInfo()
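    
    To paste it into a README or keep it with the project, one option (the file name is just a suggestion) is:
    # Save the current session info to a text file
    writeLines(capture.output(sessionInfo()), "sessionInfo.txt")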
    
    • Use established base images such as rocker.
    • Build your own Docker images and upload them to Docker Hub. You will first need a Dockerfile; after that, you can follow the steps below.
    # build
    docker build -t <your_image_name> .
    
    # test
    docker run -it <your_image_name>
    
    # login
    docker login
    
    # tag
    docker tag <your_image_name> <your_docker_username>/<your_docker_repo>:<tag>
    
    
    # push
    docker push <your_docker_username>/<your_docker_repo>:<tag>
    

Automating workflow

Even with all the code and data, a second investigator might not know the exact order in which to load and run them.

  • Write scripts as functions and make packages

    • The devtools and roxygen2 packages are powerful tools for package development. You can also use the RStudio GUI.
    # create package (from parent directory)
    devtools::create("myPackage")
    
    # generate the documentation
    devtools::document("myPackage")
    
    # install and test package
    devtools::install("myPackage")
    library(myPackage)
    
    # build package
    devtools::build("myPackage")
    
  • Write R vignettes (from R markdown)

devtools::build_vignettes()

# Skip some steps if cached vignettes have already been generated and the package is installed
devtools::build_vignettes(clean = FALSE, install = FALSE)

browseVignettes("myPackage")
  • Execute a series of tasks automatically
  • GitHub Actions
    • Build, test, and deploy our code.
    • Run on GitHub’s virtual machines.
    • Check out their recommended workflows and documentation.
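    If the project is an R package, one convenient starting point (assuming a recent version of the usethis package) is to pull in a standard R CMD check workflow template:
    # Add a standard R CMD check workflow file under .github/workflows/
    usethis::use_github_action("check-standard")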

Reporting results

There is still a gap between the raw results of running our code and what we present in the manuscript.

  • R Markdown
  • Bookdown
  • Blogdown
    • A series of blog posts, also written in R Markdown
    • This current webpage is published through blogdown. Check my previous blog for a quick start.
    • Introduction
    • You can publish it as a GitHub page or on Netlify.
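    A few one-line reminders for rendering each format (file names are placeholders; each tool has many more options):
    # Render a single R Markdown report
    rmarkdown::render("analysis.Rmd")
    
    # Render a book from inside a bookdown project
    bookdown::render_book("index.Rmd")
    
    # Build and preview a blogdown site locally
    blogdown::build_site()
    blogdown::serve_site()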