CRISPRscape user manual

Quick start

Install dependencies: git, mamba and Snakemake.

Download the repository (including submodules):

git clone --recurse-submodules https://github.com/UtrechtUniversity/campylobacter-crisprscape.git

Move into the downloaded directory:

cd campylobacter-crisprscape

Download genomes from AllTheBacteria:

bash bin/prepare_genomes.sh

Download reference databases for geNomad and SpacePHARER:

mamba env create -f envs/genomad.yaml
mamba activate genomad
genomad download-database data/

bash bin/download_spacepharer_database.sh

Optional: Do a dry-run to check if Snakemake can find all the right input and output files:

snakemake --profile config -n

Run the actual analysis workflow:

snakemake --profile config

For a more detailed explanation and information for adjusting parameters, please see below.

1. Before you start

The CRISPRscape workflow relies on two main tools for managing the workflow and installing software: Snakemake and mamba. Furthermore, since the project is hosted on GitHub, we expect you to use git.

CRISPRscape is designed to work with the AllTheBacteria resource to download all high-quality genomes of a given species. (The example on which it was first tested is Campylobacter coli and C. jejuni, combined.)

Estimated disk use

Warning

CRISPRscape requires downloading multiple databases. Prepare to use hundreds of GBs!

- AllTheBacteria metadata: ~1.5GB
- AllTheBacteria genomes: depends on species (e.g., 197GB for ~130,000 Campylobacter genomes)
- Databases of SpacePHARER:
  - PLSDB: ~80GB
  - Phagescope: ~320GB
- geNomad database: 1.4GB
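
Given these estimates, it may help to check the free disk space before starting any download. The following is a minimal sketch, not part of CRISPRscape: the function names are hypothetical, and the 600 GB threshold is an assumption obtained by adding up the figures listed above for the Campylobacter example (1.5 + 197 + 80 + 320 + 1.4 GB). Adjust it for your own species.

```shell
# Print the free space (in whole GB, rounded down) of the file system
# that holds directory $1.
free_gb() {
    # df -P prints one POSIX-format line per file system; -k reports
    # 1024-byte blocks. Column 4 is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed if directory $1 has at least $2 GB free.
has_free_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: approximate sum of the estimates listed above.
REQUIRED_GB=600

if has_free_space . "$REQUIRED_GB"; then
    echo "Enough free space for the estimated ${REQUIRED_GB} GB."
else
    echo "Less than ${REQUIRED_GB} GB free here ($(free_gb .) GB available)." >&2
fi
```

Run it from the directory in which you plan to clone the repository, since the databases are stored below it by default.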

Download and install software

Before you begin, you need to install git, mamba and Snakemake. Please follow each tool's own documentation for installation instructions.

We recommend installing Snakemake via mamba, which is also its default installation method.
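
To confirm that the three tools can be found, a quick check along these lines may be useful. This is only a sketch: `missing_tools` is a hypothetical helper, and it assumes the tools are reachable through your PATH. If Snakemake lives in its own mamba environment, activate that environment first.

```shell
# Print the name of every command in the argument list that cannot be found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools git mamba snakemake)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing >&2
else
    echo "git, mamba and snakemake are all available."
fi
```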

When you have these tools installed, you can download CRISPRscape:

git clone --recurse-submodules https://github.com/UtrechtUniversity/campylobacter-crisprscape.git

(Note: the --recurse-submodules option is necessary to also automatically download CRISPRidentify, which is one of the two CRISPR-Cas screening tools included.)

Move your current working directory into this newly downloaded one to get started!

cd campylobacter-crisprscape

(You may of course rename this directory if you want to. Just make sure you remember it.)

Tunable parameters

CRISPRscape includes some options that can be modified by the user. These are stored in YAML and TXT files under the config directory.

Species of interest

The species of interest can be modified to your liking by changing the file config/species_of_interest.txt

This file simply lists species names, one per line. The default is:

Campylobacter_D jejuni
Campylobacter_D coli

By changing the species name, one can adjust the species that can be automatically downloaded from AllTheBacteria (ATB). When changing the name, make sure to use the taxonomy from GTDB.
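
After editing the file, a small sanity check can catch formatting slips. The sketch below flags every non-empty line that does not consist of exactly two words. Note that this two-word rule ("Genus species") is an assumption inferred from the default entries, not a documented requirement, and `check_species_file` is a hypothetical helper. It does not verify that a name actually exists in GTDB.

```shell
# Report lines of species file $1 that are not exactly two words.
# Blank lines are ignored. The exit status is non-zero if any line is flagged.
check_species_file() {
    awk 'BEGIN { bad = 0 }
         NF > 0 && NF != 2 { printf "line %d: %s\n", NR, $0; bad = 1 }
         END { exit bad }' "$1"
}

# Only check the file if it exists in the current directory.
if [ -f config/species_of_interest.txt ]; then
    check_species_file config/species_of_interest.txt \
        && echo "config/species_of_interest.txt: format looks fine"
fi
```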

This also affects multilocus sequence typing (MLST): CRISPRscape includes automated MLST, which requires downloading the proper marker gene database. This information is stored under config/parameters.yaml. To find valid species names, please consult pyMLST.

Technical parameters

There are also some technical parameters that you can adjust to fit your system. These range from the input directory in which your genomes are stored (default: data/tmp/ATB/) and the location of databases to the number of CPU threads to use.

Please open config/config.yaml and config/parameters.yaml to review the default parameters and adjust where needed to make it work for your system.

The default CPU settings are:

Use a maximum total of 60 CPU threads
Use 20 CPU threads for most compute-intensive tasks
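
If your machine has fewer than 60 CPU threads, you may want to lower the total. The sketch below prints the smaller of the number of online CPUs and a cap of your choice. `capped_cores` is a hypothetical helper. The commented command relies on Snakemake's `--cores` option and on the assumption that a value given on the command line takes precedence over the one in the profile; the per-task setting is still controlled by the config files.

```shell
# Print the smaller of the number of online CPUs and the cap given as $1.
capped_cores() {
    # getconf works on most Linux and macOS systems; nproc is a fallback.
    available=$(getconf _NPROCESSORS_ONLN 2> /dev/null || nproc)
    if [ "$available" -lt "$1" ]; then
        echo "$available"
    else
        echo "$1"
    fi
}

echo "Suggested total threads: $(capped_cores 60)"

# One possible use (not run here):
#   snakemake --profile config --cores "$(capped_cores 60)"
```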

Download input genomes

CRISPRscape includes a convenient script to automatically download genomes of interest from ATB. When the desired species name is saved in config/species_of_interest.txt, you can start downloading with:

bash bin/prepare_genomes.sh

Here, you may optionally add a command-line parameter to select which part of ATB to download from: the complete database (all = largest), only the original version (original) or the incremental update (update = smallest; default). For example:

bash bin/prepare_genomes.sh all
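
Because a mistyped option could start a long download of the wrong subset, a thin wrapper that accepts only the three values described above might be handy. This is a sketch with hypothetical function names. How the script itself reacts to an unknown value is not covered here.

```shell
# Succeed only for the three documented values.
is_valid_atb_part() {
    case "$1" in
        all|original|update) return 0 ;;
        *) return 1 ;;
    esac
}

# Default to "update" (the documented default) and refuse unknown values.
download_genomes() {
    part="${1:-update}"
    if ! is_valid_atb_part "$part"; then
        echo "Unknown ATB part '$part' (expected: all, original or update)" >&2
        return 2
    fi
    bash bin/prepare_genomes.sh "$part"
}

# Example (not run here): download_genomes all
```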

Downloading databases

geNomad

This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. The tool uses both a neural network classifier and a marker-based approach to calculate prediction scores. The marker-based method requires a database, which can be downloaded using geNomad itself. If you have installed mamba, this can be done as follows:

mamba env create -f envs/genomad.yaml
mamba activate genomad
genomad download-database data/

Note that this will create the subdirectory data/genomad_db/, which is the default that is also defined in config/parameters.yaml.

The current version of the database, v1.7, uses 1.4GB of disk space.
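
To verify that the download ended up where the workflow expects it, you could check that the directory exists and is not empty. A minimal sketch, assuming the default location data/genomad_db/ mentioned above; `dir_is_populated` is a hypothetical helper and does not validate the database contents.

```shell
# Succeed if $1 is a directory that contains at least one entry.
dir_is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2> /dev/null)" ]
}

GENOMAD_DB="data/genomad_db"   # default location named above

if dir_is_populated "$GENOMAD_DB"; then
    echo "geNomad database found: $(du -sh "$GENOMAD_DB" | cut -f1)"
else
    echo "No geNomad database in $GENOMAD_DB yet" >&2
fi
```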

SpacePHARER

The bin folder also includes scripts to download and extract pre-selected databases for use in SpacePHARER. These are Phagescope for annotated phage sequences and PLSDB for annotated plasmid sequences, both chosen for their broad taxonomic coverage. To download them, run:

bash bin/download_spacepharer_database.sh

This downloads and extracts both databases, then merges them for use in SpacePHARER. If you wish to use a different database or add to them, see doc/spacepharer.md for advice.

2. Running the workflow

The workflow is fully automated and should complete with one command. For details on what happens under the hood, see the tab 'Workflow details'.

One can do a 'dry-run' to test if all preparations have been satisfied:

snakemake --profile config -n

To run the actual workflow:

snakemake --profile config
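
The two commands can also be chained so that the real run starts only when the dry-run succeeds. The following is a sketch, not part of CRISPRscape: `run_workflow` and the log file name dry_run.log are hypothetical, and the optional first argument exists only so that the function can be tried out with a stand-in command.

```shell
# Run a dry-run first and start the real workflow only if it succeeds.
#   $1  executable to call (default: snakemake)
#   $2  file that receives the dry-run output (default: dry_run.log)
run_workflow() {
    runner="${1:-snakemake}"
    log="${2:-dry_run.log}"
    if "$runner" --profile config -n > "$log" 2>&1; then
        "$runner" --profile config
    else
        echo "Dry-run failed; see $log for details" >&2
        return 1
    fi
}

# Example (not run here): run_workflow
```

Keeping the dry-run output in a file makes it easier to see which input files or databases Snakemake could not find.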

3. Interpreting results

After running the workflow, the user is presented with a number of output files. These are described in detail under the tab 'Output files'.