Setting up a project

Research compendium

A research compendium is a collection of all digital parts of a research project including data, code, texts (…). The collection is created in such a way that reproducing all results is straightforward.

Source: The Turing Way

Getting started

  • Contain your project in a single recognizable folder

  • Distinguish folder types, name them accordingly:

    • Read-only: data, metadata
    • Human-generated: code, paper, documentation
    • Project-generated: clean data, figures, models…
  • Initialize a README file, document your project

  • Choose a license

  • Publish your project.
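
The read-only folder type in the list above can also be enforced on disk, not only by naming. The following is a minimal sketch, assuming a POSIX-style file system where clearing the write bits is enough to protect files from accidental edits; the function name `make_read_only` is hypothetical and not part of any template.

```python
import stat
from pathlib import Path


def make_read_only(folder):
    """Clear the write bits on every file below `folder`; return the files changed."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # Keep the existing permission bits, minus write access for everyone
            mode = stat.S_IMODE(path.stat().st_mode)
            path.chmod(mode & ~write_bits)
            changed.append(path)
    return changed
```

Calling `make_read_only("data/raw")` once, right after the original data has been copied in, would be one way to use it. This only guards against accidental changes: anyone with sufficient rights can restore the write bits.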

A Good Enough Project

The tags in the tree mark the folder type: (RO) read-only, (HW) human-writeable, (PG) project-generated.

.
├── .gitignore
├── CITATION.md
├── LICENSE.md
├── README.md
├── requirements.txt
├── bin                <- Compiled and external code, ignored by git (PG)
│   └── external       <- Any external source code, ignored by git (RO)
├── config             <- Configuration files (HW)
├── data               <- All project data, ignored by git
│   ├── processed      <- The final, canonical data sets for modeling. (PG)
│   ├── raw            <- The original, immutable data dump. (RO)
│   └── temp           <- Intermediate data that has been transformed. (PG)
├── docs               <- Documentation notebook for users (HW)
│   ├── manuscript     <- Manuscript source, e.g., LaTeX, Markdown, etc. (HW)
│   └── reports        <- Other project reports and notebooks (e.g. Jupyter, .Rmd) (HW)
├── results
│   ├── figures        <- Figures for the manuscript or reports (PG)
│   └── output         <- Other output for the manuscript or reports (PG)
└── src                <- Source code for this project (HW)
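
The layout above can be created by hand, but a short script makes it repeatable. The following is a minimal sketch, not an official template: the function name `create_skeleton` is hypothetical, the top-level files are created empty, and the `.gitignore` content is an assumption based on the "ignored by git" notes in the tree.

```python
from pathlib import Path

# Folder layout copied from the tree above
FOLDERS = [
    "bin/external",
    "config",
    "data/processed",
    "data/raw",
    "data/temp",
    "docs/manuscript",
    "docs/reports",
    "results/figures",
    "results/output",
    "src",
]

# Top-level files from the tree above; they are created empty
FILES = ["CITATION.md", "LICENSE.md", "README.md", "requirements.txt"]

# Assumption: ignoring these two folders matches the "ignored by git" notes
GITIGNORE = "bin/\ndata/\n"


def create_skeleton(root):
    """Create the folder layout under `root` without overwriting existing files."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE)
    return root
```

For example, `create_skeleton("my-project")` builds the structure in a new folder called `my-project` (a placeholder name). Running it again is safe, since existing folders and files are left untouched. Note that git does not track empty directories, so the empty folders only show up in the repository once they contain a file.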

Licenses

  • Copyright is implicit; others cannot use your code without your permission.

  • Licensing gives that permission, and its boundaries and conditions.

  • Choosing a license early on means being aware of your license as the project proceeds (and not creating conflicts).

  • There are over 80 OSI-approved licenses (and many, many others) to choose from.

We will dive into licenses in the Software Publication chapter.

Public or Private?

When creating a GitHub repository for your code, you need to decide whether to make it publicly accessible or keep it private.

Publishing your project at an early stage:

  • Consider readability throughout

  • Get feedback during development from your community

  • May generate collaborations

  • Makes it easier to create a publication or a software package

→ Open Science

But what if someone scoops my code! I’m a revolutionary, they will steal my ideas!

You can always opt for a private repository.

Data

How do you handle large, sensitive, or unpublished data?

  • Don’t include your data in your software repository.

  • Include simulated data or a small example dataset to test your code.

  • Research Data and full datasets:

    • Your data should be separate from your code!
    • Provide instructions on where to store the data relative to the software.
    • Provide a configuration file in which users define where the data is stored.
    • Make the software callable with different data paths. (No hard-coded absolute paths!)
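
The last two points can be combined: the data location comes from a configuration file, and a command-line option can override it. The following is a minimal sketch under stated assumptions: the file name `config/paths.json`, the key `data_dir`, the `--data-dir` option, and both function names are hypothetical, and relative paths are read relative to the folder that holds the configuration file.

```python
import argparse
import json
from pathlib import Path


def resolve_data_dir(config_path, override=None):
    """Return the data folder from the command-line override or the config file."""
    config_path = Path(config_path)
    if override is not None:
        return Path(override)
    config = json.loads(config_path.read_text())
    data_dir = Path(config["data_dir"])
    # Assumption: relative entries are relative to the config file's folder
    if not data_dir.is_absolute():
        data_dir = config_path.parent / data_dir
    return data_dir


def main(argv=None):
    """Parse the command line and report which data folder will be used."""
    parser = argparse.ArgumentParser(description="Run the analysis on a data folder.")
    parser.add_argument("--config", default="config/paths.json",
                        help="JSON file with a 'data_dir' entry")
    parser.add_argument("--data-dir", default=None,
                        help="data folder; overrides the config file")
    args = parser.parse_args(argv)
    data_dir = resolve_data_dir(args.config, args.data_dir)
    print(f"Reading data from {data_dir}")
    return data_dir
```

With a `config/paths.json` containing `{"data_dir": "../data/raw"}`, calling `main([])` from the project root points the code at the project's own `data/raw` folder, while `main(["--data-dir", "some/other/folder"])` redirects it without touching the code. No path is fixed in the source, so each user decides where the data lives.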