Introduction to Computational Reproducibility

Contents

  1. What is reproducibility?
  2. Project organization
  3. Documentation
  4. Code quality
  5. Git and GitHub
  6. Resources

1. What is reproducibility?

Reproducibility

A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer. (The Turing Way Community, 2022)

Why reproducible code?

  • Know what you did last summer (without having to memorize it)
  • Trust that your results are reliable and correct
  • Others can re-use your methods
  • Transparent science = trust in its results

What is needed for reproducibility?

  • 🗄️ The data

  • 📰 The code

  • 🧾 The analysis workflow

  • 💻 The dependencies to run the code on the data (operating system, software, packages, etc.)

  • 💬 Documentation, documentation, documentation
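Recording the dependencies can be partly automated. The sketch below is one possible approach, not a complete solution: it collects the Python version, the operating system and the versions of a few packages using only the standard library. The function name `describe_environment` and the example package list are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict with the Python version, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "pip" is only an example; list the packages your analysis imports.
    print(json.dumps(describe_environment(["pip"]), indent=2))
```

Saving this output next to your results gives future readers a starting point for rebuilding the environment.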

Reproducibility spectrum

A study may be more or less reproducible than another depending on what data and code are made available (Peng, 2011).

2. Project organization

Project structure

Why?

  • Easily find files
  • Colleagues understand what you do
  • Machine-readable files and folders (sorting, parsing)

How?

  • Folder structure
  • File and folder naming
  • Version control

Folder structure

  • Put all project-related documents in one project folder
  • Separate raw, intermediate and end outputs
  • Never edit raw data
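A minimal sketch of the folder guidelines above, assuming a layout with separate folders for raw data, processed data, code, documentation and output. The folder names and the function `create_project` are hypothetical; adapt them to your own conventions.

```python
from pathlib import Path

# Hypothetical folder names: raw data is kept apart from everything derived from it.
SUBFOLDERS = ["data/raw", "data/processed", "code", "docs", "output"]


def create_project(root):
    """Create an empty project skeleton under `root` and return its path."""
    root = Path(root)
    for subfolder in SUBFOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (root / subfolder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n")
    return root
```

Calling `create_project("my-project")` creates the folders in the current working directory without overwriting anything that already exists.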

File and folder naming

  • Meaningful & human-readable
  • No special characters (& % $ # / , +), spaces or dots (.)
  • < 25 characters
  • Separate meaningful chunks by - or _
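The naming rules above can be checked automatically. The sketch below is one possible interpretation: it assumes the rules apply to the part of the name before the file extension, and it treats anything other than ASCII letters, digits, - and _ as a special character. The function name `check_name` is hypothetical.

```python
import re
from pathlib import PurePath

MAX_LENGTH = 25  # the "< 25 characters" guideline


def check_name(filename):
    """Return a list of guideline violations for `filename` (empty if none)."""
    stem = PurePath(filename).stem  # the name without its final extension
    problems = []
    if len(stem) >= MAX_LENGTH:
        problems.append("too long")
    if " " in stem:
        problems.append("contains spaces")
    if "." in stem:
        problems.append("contains dots")
    # Spaces and dots are reported separately above, so they are excluded here.
    if re.search(r"[^A-Za-z0-9_. -]", stem):
        problems.append("contains special characters")
    return problems


print(check_name("2022-05-01_survey_raw.csv"))
print(check_name("final version (2).docx"))
```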


3. Documentation

What is documentation?

Documentation provides context for your work. It allows your collaborators, colleagues and future you to understand what has been done and why. (The Turing Way Community, 2022)

For example:

  • Project documentation (project proposal, analysis log, README file, etc.)

  • Data documentation (e.g., codebook)

  • Code documentation (e.g., comments, requirements.txt)

README file

Project information:

  • Who, what, when, where?
  • Files in the project folder
  • License
  • Contact details

Code information:

  • Dependencies
  • Instructions for installation, configuration, usage
  • Known bugs
  • Troubleshooting

Comments

Comments: annotations in your code that explain what your code does (to others, and your future self)

  • No replacement for clear and structured code

  • May be used to generate user documentation (if in specific format)

  • In Python and R, comments start with #:

    # This line is not executed
    print("Hello world") # This line is executed
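Comments in a specific format can also feed documentation tools. In Python this role is played by docstrings. The example below uses the NumPy docstring layout, which is one of several common conventions; the function `celsius_to_fahrenheit` is a made-up illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    Parameters
    ----------
    celsius : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in degrees Fahrenheit.
    """
    # Regular comments like this one explain the code to people reading the source.
    return celsius * 9 / 5 + 32


# help() and documentation generators read the docstring, not the # comments.
print(celsius_to_fahrenheit.__doc__.splitlines()[0])
```

Running `help(celsius_to_fahrenheit)` in an interactive session shows the full docstring.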

4. Code quality

Aspects of good quality code

  • Readable
  • Reusable
  • (Robust)

Source: xkcd

Code readability: white space

Code is for computers, comments are for humans.

  • Use whitespace and newlines strategically.

Compare:

this <- function(arg1,arg2){res<-arg1*arg2;return(res)}
hurts <- mean(c(this(3,4),this(3,1),this(9,9))); print(hurts)

this <- function(arg1,arg2){
  res <- arg1 * arg2
  return(res)
}

hurts <- mean(
  c(
    this(3,4),
    this(3,1),
    this(9,9)
    )
  )
print(hurts)

Code readability: names

  • use descriptive names for functions and variables
    • start functions with a verb
    • make variable names just long enough to be meaningful

Compare:

for i in my_shopping_basket:
  if(test(i)) > 10:
    purch(i)
  else:
    disc(i)

for item in basket:
  if test_necessity(item) > 10:
    purchase(item)
  else:
    discard(item)

Code readability: consistency

Consistency will make your code easier to understand and maintain

  • consult a style guide for your language (keep conventions, and don’t reinvent the wheel)
  • use tools to enforce style (linters, formatters)



                Python            R
Style manual    PEP 8             Tidyverse style guide
Linters         flake8, pylint    lintr
Formatters      black             styler

Code reusability: some guidelines

  • Do One Thing (and do it well)

    • One function for one purpose
    • One script for one purpose

Identify potential functions by action: functions perform tasks (e.g. sorting, plotting, saving a file, transforming data)

If you copy-paste a piece of code, it is often a good candidate for a function
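As an illustration of turning copy-pasted code into a function: suppose the same rescaling was pasted once for heights and once for weights. The sketch below puts that logic in a single function with one purpose. The function name `standardize` and the example data are invented for this illustration.

```python
def standardize(values):
    """Rescale `values` to mean 0 and (population) standard deviation 1.

    Assumes `values` is non-empty and not constant; otherwise the
    standard deviation is zero and the division fails.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std_dev = variance ** 0.5
    return [(v - mean) / std_dev for v in values]


# Made-up example data.
heights_cm = [150.0, 160.0, 170.0]
weights_kg = [50.0, 60.0, 70.0]

# One function, reused, instead of two pasted copies of the same arithmetic.
heights_std = standardize(heights_cm)
weights_std = standardize(weights_kg)
```

If the calculation ever needs to change, it now changes in one place only.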

Code reusability: some guidelines

  • Separate code and data: data is specific, code need not be

    • consider using a config file for project-specific (meta)data
    • but DO hard-code unchanging variables, e.g. gravity = 9.80665, once.
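A small sketch of this separation, under a few assumptions: project-specific values are stored as JSON, while the unchanging constant is defined once in the code. The file name `config.json`, the keys `input_file` and `mass_kg`, and the function `weight_in_newtons` are all hypothetical. The configuration is embedded as a string here only to keep the example self-contained.

```python
import json

# Unchanging constant: hard-coded once, in one place.
STANDARD_GRAVITY = 9.80665  # m/s^2

# In a real project this text would live in a file such as config.json
# (hypothetical name) and be read with json.load().
CONFIG_TEXT = """
{
    "input_file": "data/raw/drop_heights.csv",
    "mass_kg": 2.5
}
"""


def weight_in_newtons(mass_kg):
    """Return the weight of an object of mass `mass_kg` under standard gravity."""
    return mass_kg * STANDARD_GRAVITY


config = json.loads(CONFIG_TEXT)
print(weight_in_newtons(config["mass_kg"]))
```

Reusing the code for another project then means editing the configuration, not the code.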

Code quality: concluding remarks

  • Code quality is important for reproducibility
  • Not only for others, but most importantly for future YOU
  • Invest time in learning to write good code: it will pay off

5. Git and GitHub

Why do you need version control?


What is git?

  • Distributed version control system, originally written by Linus Torvalds

  • Allows you to:

    • log snapshots of your project
    • branch your work (so you can experiment without losing the original!)
    • keep all backups
    • while efficiently using your storage
  • Current standard for code

GitHub, GitLab, Bitbucket, etc.

  • GitHub is an online platform for hosting your coding projects

  • Based on git

  • Social coding platform: share, collaborate, and contribute to projects

  • The UU has an institutional GitHub

Terminology

  • Repository: a project folder that is being tracked by git
  • Commit: a snapshot of your project at a certain point in time
  • Branch: a parallel version of your project
  • Clone: a local copy of a remote repository (e.g., one hosted on GitHub) on your PC

github.com/UtrechtUniversity

Git History

Working with git

  • Originally designed as a command-line tool
  • Now also built into many IDEs (RStudio, Visual Studio Code, etc.)
  • GUIs available (GitHub Desktop, Sourcetree, etc.)
  • Work on your PC, push to GitHub
  • Making changes via the GitHub website is also possible

Your turn: starting with git

  • Learning to use git efficiently takes time and practice.
  • Take a course or workshop, or follow online tutorials to get started.
  • We will now do a short workshop to familiarize you with GitHub

edu.nl/6gn8x

Bonus: dissemination

Learn more

Thank you!