CRISPRscape user manual
Quick start
Install dependencies: git, mamba and Snakemake.
Download the repository (including submodules):
Move into the downloaded directory:
Download genomes from AllTheBacteria:
Optional: Do a dry-run to check if Snakemake can find all the right input and output files:
Run the actual analysis workflow:
For a more detailed explanation and information for adjusting parameters, please see below.
1. Before you start
The CRISPRscape workflow relies on two main tools for managing the workflow and installing software: Snakemake and mamba. Furthermore, since the project is hosted on GitHub, we expect you to use git.
CRISPRscape is designed to work with the AllTheBacteria resource to download all high-quality genomes of a given species. (It has been tested on Campylobacter coli and C. jejuni together.)
Estimated disk use
Warning
CRISPRscape requires downloading multiple databases. Be prepared to use hundreds of GB of disk space!
- AllTheBacteria metadata: ~1.5GB
- AllTheBacteria genomes: depends on species (e.g., 209GB for ~130,000 Campylobacter genomes)
- AllTheBacteria genome annotations (Bakta): depends on species (e.g., 531GB for ~130,000 Campylobacter genomes)
- Databases of SpacePHARER:
  - PLSDB: 14GB initial download + 68GB after indexing = ~80GB
  - Phagescope: 41GB initial download + 155GB after indexing = ~200GB
- geNomad database: 1.4GB
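To put these numbers in perspective, the following is a minimal sketch that adds up the estimates listed above for the Campylobacter example and compares the total with the free space in the current directory. The variable names are only illustrative and the sizes are rounded up to whole GB; it is not part of CRISPRscape itself.

```shell
# Estimates in GB, taken from the list above (rounded up); Campylobacter example
atb_metadata=2
atb_genomes=209
atb_annotations=531
spacepharer_plsdb=80
spacepharer_phagescope=200
genomad_db=2

required_gb=$((atb_metadata + atb_genomes + atb_annotations + spacepharer_plsdb + spacepharer_phagescope + genomad_db))

# Free space (GB) on the file system that holds the current directory.
# df -P gives POSIX-style output, -k reports 1024-byte blocks;
# column 4 of the second line is the available space.
available_gb=$(df -Pk . | awk 'NR==2 {print int($4 / 1024 / 1024)}')

echo "Estimated requirement: ${required_gb} GB"
echo "Currently available:   ${available_gb} GB"
if [ "$available_gb" -lt "$required_gb" ]; then
    echo "Warning: not enough free disk space for the full Campylobacter example."
fi
```

For other species, replace the genome and annotation estimates with values that match the number of genomes you expect to download.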
Download and install software
Before you begin, you need to install git, mamba and Snakemake. Please follow the installation instructions of each tool.
We recommend installing Snakemake via mamba; this is also the default installation method.
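As a quick sanity check, one could verify that the three tools are available on the PATH before continuing. The sketch below uses a hypothetical helper function, check_tools, which is not part of CRISPRscape; it only relies on the standard shell built-in command -v.

```shell
# Hypothetical helper: report which of the given tools can be found on the PATH.
# Returns 0 if all tools are found and 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools git mamba snakemake || echo "Install the missing tools before continuing."
```

If Snakemake was installed in a dedicated mamba environment, activate that environment first; otherwise snakemake will be reported as missing.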
When you have these tools installed, you can download CRISPRscape:
(Note: the --recurse-submodules option is necessary to also automatically
download CRISPRidentify, which is one of the two CRISPR-Cas screening tools
included.)
Move your current working directory into this newly downloaded one to get started!
(You may of course rename this directory if you want to. In that case,
use the new name instead of campylobacter-crisprscape.)
Tunable parameters
CRISPRscape includes some options that can be modified by the user.
These are stored in YAML and TXT files under the config directory.
Species of interest
The species of interest can be modified to your liking by changing the file
config/species_of_interest.txt.
This file simply lists species names, one per line. The default is:
The species listed in this file determine which genomes are downloaded automatically from AllTheBacteria (ATB). Species must be named according to the taxonomy defined in GTDB.
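As an illustration, the species file could be written from the command line as sketched below. The species name "Escherichia coli" is only an example and is not the CRISPRscape default; the exact spelling expected by the download script is an assumption here, so please compare with the file shipped in the repository.

```shell
# Assumption: one GTDB species name per line, written as "Genus species".
# "Escherichia coli" is an example only, not the CRISPRscape default.
mkdir -p config
printf '%s\n' "Escherichia coli" > config/species_of_interest.txt

# Count the non-empty lines, i.e. the number of species listed in the file
grep -c . config/species_of_interest.txt
```

Note that this overwrites the existing file; to list several species, pass more names to printf, one argument per species.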
This also affects multilocus sequence typing (MLST): CRISPRscape includes
automated MLST, which requires downloading the proper marker gene database.
This information is stored in config/parameters.yaml.
For finding valid species names, please consult pyMLST.
Technical parameters
Then, there are some technical parameters that you can adjust to fit your
system. These range from the input directory in which your genomes are stored
(default: resources/ATB/) and the location of databases to the number of CPU
threads to use.
Please open config/config.yaml and config/parameters.yaml to review
the default parameters and adjust where needed to make it work for your
system.
The default CPU settings are:
- Use a maximum total of 60 CPU threads
- Use 20 CPU threads for most compute-intensive tasks
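Whether these defaults fit your machine can be checked with a few lines of shell. The sketch below only compares the defaults quoted above with the number of CPU threads currently online; the variable names are illustrative and it does not read or change the CRISPRscape configuration files.

```shell
# Defaults quoted in this manual
max_threads=60
task_threads=20

# Number of CPU threads currently online; falls back to nproc if getconf fails
available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)

if [ "$available" -lt "$max_threads" ]; then
    echo "Only $available threads available: consider lowering the maximum of $max_threads."
else
    echo "$available threads available: the default maximum of $max_threads fits."
fi

if [ "$available" -lt "$task_threads" ]; then
    echo "Consider lowering the per-task setting of $task_threads threads as well."
fi
```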
Download input genomes
CRISPRscape includes a convenient script to automatically download genomes
of interest from ATB. When the desired species name is saved in
config/species_of_interest.txt, you can start downloading with:
$ bash bin/prepare_genomes.sh
Here, you may optionally add a command-line parameter to select which part
of ATB to download: the complete database (all = largest), only the
original version (original) or the incremental update (update = smallest;
default). For example:
$ bash bin/prepare_genomes.sh --part all
One may also set the output directory using the --directory or -d option
(shown here with the default location):
$ bash bin/prepare_genomes.sh --directory resources/ATB/
It also has a built-in help function to show the supported options:
$ bash bin/prepare_genomes.sh --help
Prepare genomes from AllTheBacteria for use with CRISPRscape
Syntax: prepare_genomes.sh -p [part] -d [directory] [-h]
Options:
  -p/--part        Select which part of AllTheBacteria to download
                   for the selected species ('all', 'original', or
                   'update', default=update)
  -d/--directory   Directory in which to download the files
                   (default=resources/ATB/)
  -h/--help        Print this help message
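Because downloads can take a long time, it may help to validate the chosen options before starting. The following is a minimal sketch of a hypothetical wrapper function, download_part, which is not part of CRISPRscape. It only uses the --part and --directory options and the defaults shown in the help message above; the DRY_RUN variable is an assumption of this sketch and prints the command instead of running it.

```shell
# Hypothetical wrapper around bin/prepare_genomes.sh (not part of CRISPRscape).
# Usage: download_part [part] [directory]
download_part() {
    part="${1:-update}"               # default taken from the help message
    directory="${2:-resources/ATB/}"  # default taken from the help message

    # Only the three values listed in the help message are accepted
    case "$part" in
        all|original|update) ;;
        *) echo "Unknown part: $part (use all, original or update)" >&2
           return 1 ;;
    esac

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it
        echo "bash bin/prepare_genomes.sh --part $part --directory $directory"
    else
        bash bin/prepare_genomes.sh --part "$part" --directory "$directory"
    fi
}

# Show the command that would download the complete database
DRY_RUN=1 download_part all
# prints: bash bin/prepare_genomes.sh --part all --directory resources/ATB/
```

Without DRY_RUN=1 the function has to be called from the top-level directory of the repository, since it refers to the script by its relative path.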
Downloading databases
Reference databases required by the different tools are downloaded automatically by the Snakemake workflow; no user action is required. Below is a summary of the databases that are used.
PADLOC
CRISPRscape uses PADLOC (Prokaryotic Antiviral Defence LOCator) to screen genomes for the presence of different antiviral defence systems, including Cas genes. PADLOC uses a database of genes and HMM classifications to predict the presence of a large number of different defence systems.
The current version, v2.0.0, uses 954MB of disk space.
geNomad
This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. This tool uses both a neural network classifier and a marker-based approach to calculate prediction scores. The marker-based method requires a database, which can be downloaded using geNomad itself.
The current version of the database, v1.7, uses 1.4GB of disk space.
SpacePHARER
For details of SpacePHARER, please see the corresponding page.
2. Running the workflow
The workflow is fully automated and should complete with one command. For details on what happens under the hood, see the tab 'Workflow details'.
One can do a 'dry-run' to test if all preparations have been satisfied:
To run the actual workflow:
3. Interpreting results
After running the workflow, the user is presented with a number of output files. These are described in detail under the tab 'Output files'.