CRISPRscape user manual

Quick start

Install dependencies: git, mamba and Snakemake.

Download the repository (including submodules):

git clone --recurse-submodules https://github.com/UtrechtUniversity/campylobacter-crisprscape.git

Move into the downloaded directory:

cd campylobacter-crisprscape

Download genomes from AllTheBacteria:

bash bin/prepare_genomes.sh

Optional: Do a dry-run to check if Snakemake can find all the right input and output files:

snakemake --profile config -n

Run the actual analysis workflow:

snakemake --profile config

For a more detailed explanation and information for adjusting parameters, please see below.

1. Before you start

The CRISPRscape workflow relies on two main tools for managing the workflow and installing software: Snakemake and mamba. Furthermore, since the project is hosted on GitHub, we expect you to use git.

CRISPRscape is designed to work with the AllTheBacteria resource to download all high-quality genomes of a given species. (It has been tested on Campylobacter coli and C. jejuni together.)

Estimated disk use

Warning

CRISPRscape requires downloading multiple databases. Prepare to use hundreds of GBs!

  • AllTheBacteria metadata: ~1.5GB

  • AllTheBacteria genomes: depends on species (e.g., 209GB for ~130,000 Campylobacter genomes)

  • AllTheBacteria genome annotations (Bakta): depends on species (e.g., 531GB for ~130,000 Campylobacter genomes)

  • Databases of SpacePHARER:

    • PLSDB: 14GB initial download + 68GB after indexing = ~80GB

    • Phagescope: 41GB initial download + 155GB after indexing = ~200GB

  • geNomad database: 1.4GB
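As a rough pre-flight check, the free space on the target filesystem can be compared with the expected footprint. The sketch below is illustrative only: REQUIRED_GB is a hypothetical threshold (roughly the sum of the Campylobacter example figures listed above), and it checks the current directory, which assumes that genomes and databases are all stored under the repository.

```shell
# Hypothetical threshold in GB; adjust it to the species and databases you need.
REQUIRED_GB=1000

# df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
# the fourth column of the second line is the available space.
available_kb=$(df -Pk . | awk 'NR==2 {print $4}')
available_gb=$((available_kb / 1024 / 1024))

if [ "$available_gb" -lt "$REQUIRED_GB" ]; then
    echo "Only ${available_gb} GB free; about ${REQUIRED_GB} GB may be needed."
else
    echo "${available_gb} GB free; this should be enough for the estimate."
fi
```

If the genomes or databases live on a different disk, run the check from that location instead.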

Download and install software

Before you begin, install the following tools (see each tool's own documentation for installation instructions):

  1. git

  2. mamba

  3. Snakemake

We recommend installing Snakemake via mamba; this is also the default installation method.

When you have these tools installed, you can download CRISPRscape:

git clone --recurse-submodules https://github.com/UtrechtUniversity/campylobacter-crisprscape.git

(Note: the --recurse-submodules option is necessary to also automatically download CRISPRidentify, which is one of the two CRISPR-Cas screening tools included.)

Move your current working directory into this newly downloaded one to get started!

cd campylobacter-crisprscape

(You may of course rename this directory if you want to. In that case, use the new name instead of campylobacter-crisprscape.)

Tunable parameters

CRISPRscape includes some options that can be modified by the user. These are stored in YAML and TXT files under the config directory.

Species of interest

The species of interest can be modified to your liking by changing the file config/species_of_interest.txt

This file simply lists species names, one per line. The default is:

Campylobacter_D jejuni
Campylobacter_D coli

Changing the names in this file changes which species are automatically downloaded from AllTheBacteria (ATB). The names must be the taxonomic names as defined in GTDB.
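For example, to restrict the download to a single species, the file can be rewritten from the command line. This is only a sketch: it reuses one of the default names shown above, and the mkdir line is a no-op inside the repository, included only so the snippet also runs elsewhere.

```shell
# Path taken from the text above; run this from the repository root.
SPECIES_FILE="config/species_of_interest.txt"
mkdir -p "$(dirname "$SPECIES_FILE")"

# One species name per line, spelled as in GTDB.
printf '%s\n' "Campylobacter_D jejuni" > "$SPECIES_FILE"

# Show the result to confirm what will be downloaded.
cat "$SPECIES_FILE"
```

Note that this overwrites the existing list; edit the file in a text editor instead if you want to keep the other entries.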

The species choice also affects multilocus sequence typing (MLST): CRISPRscape includes automated MLST, which requires downloading the matching marker gene database. The corresponding setting is stored in config/parameters.yaml. To find valid species names, please consult pyMLST.

Technical parameters

There are also some technical parameters that you can adjust to fit your system. These include the input directory in which your genomes are stored (default: resources/ATB/), the location of the databases, and the number of CPU threads to use.

Please open config/config.yaml and config/parameters.yaml to review the default parameters and adjust where needed to make it work for your system.

The default CPU settings are:

  • Use a maximum total of 60 CPU threads

  • Use 20 CPU threads for most compute-intensive tasks
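Before keeping these defaults, it may help to check how many threads your machine actually offers. The snippet below is a minimal sketch: the variable names are illustrative, and it only reports a mismatch; the values themselves still have to be changed in the config files mentioned above.

```shell
# nproc is part of GNU coreutils; getconf serves as a fallback on other systems.
available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Default limits as listed above.
default_max=60
default_per_task=20

if [ "$available" -lt "$default_max" ]; then
    echo "This machine has $available threads, fewer than the default maximum of $default_max."
fi
if [ "$available" -lt "$default_per_task" ]; then
    echo "Consider lowering the per-task setting of $default_per_task threads as well."
fi
```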

Download input genomes

CRISPRscape includes a convenient script to automatically download genomes of interest from ATB. Once the desired species names are saved in config/species_of_interest.txt, start the download with:

bash bin/prepare_genomes.sh

Optionally, add a command-line parameter to select which part of ATB to download from: the complete database (all = largest), only the original version (original), or the incremental update (update = smallest; default). For example:

bash bin/prepare_genomes.sh --part all

One may also set the output directory using the --directory or -d option:

bash bin/prepare_genomes.sh --part all --directory my/genomes

It also has a built-in help function to show the supported options:

$ bash bin/prepare_genomes.sh --help
Prepare genomes from AllTheBacteria for use with CRISPRscape

Syntax: prepare_genomes.sh -p [part] -d [directory] [-h]
Options:
-p/--part      Select which part of AllTheBacteria to download
               for the selected species ('all', 'original', or
               'update', default=update)
-d/--directory Directory in which to download the files
               (default=resources/ATB/)
-h/--help      Print this help message

Downloading databases

Reference databases required by the different tools are downloaded automatically by the Snakemake workflow; no user action is required. Below is a summary of the databases that are used.

PADLOC

CRISPRscape uses PADLOC: Prokaryotic Antiviral Defence LOCator to screen genomes for the presence of different antiviral defence systems, including Cas genes. PADLOC uses a database of genes and HMM classifications to predict presence of a large number of different defence systems.

The current version, v2.0.0, uses 954MB disk space.

geNomad

This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. This tool uses both a neural network classifier and a marker-based approach to calculate prediction scores. For the marker-based method, it requires a database which can be downloaded using the tool geNomad itself.

The current version of the database, v1.7, uses 1.4GB disk space.

SpacePHARER

For details of SpacePHARER, please see the corresponding page.

2. Running the workflow

The workflow is fully automated and should complete with one command. For details on what happens under the hood, see the tab 'Workflow details'.

One can do a 'dry-run' to test if all preparations have been satisfied:

snakemake --profile config -n

To run the actual workflow:

snakemake --profile config
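The two commands can also be chained so that the real run only starts when the dry-run succeeds. The function below is a small convenience sketch; run_workflow is a hypothetical helper name and not part of CRISPRscape.

```shell
# Run the dry-run first; && starts the real run only if the dry-run exits successfully.
# Any extra arguments are passed through to both snakemake calls.
run_workflow() {
    snakemake --profile config -n "$@" && snakemake --profile config "$@"
}
```

After defining the function in your shell, call run_workflow from the repository root.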

3. Interpreting results

After running the workflow, the user is presented with a number of output files. These are described in detail under the tab 'Output files'.