CRISPRscape user manual
1. Before you start
The CRISPRscape workflow relies on two main tools for managing the workflow and installing software: Snakemake and mamba. Furthermore, since the project is hosted on GitHub, we expect you to use git.
CRISPRscape is designed to work with the AllTheBacteria resource to download all high-quality genomes of a given species. (The example on which it was first tested is Campylobacter coli and C. jejuni, combined.)
Download and install software
Before you begin, you need to install git, mamba and Snakemake; see each tool's own documentation for installation instructions.
We recommend installing Snakemake via mamba, which is also its default installation method.
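As a quick sanity check before downloading CRISPRscape, the following minimal sketch reports which of the three tools cannot be found on your PATH. It assumes the executables are named git, mamba and snakemake; the helper name missing_tools is made up for this illustration and is not part of CRISPRscape.

```python
import shutil

# Tools named in this manual; the executable names are assumed to match.
REQUIRED_TOOLS = ["git", "mamba", "snakemake"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If Snakemake was installed into its own mamba environment, it will only be found after that environment has been activated.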
When you have these tools installed, you can download CRISPRscape:
(Note: the --recurse-submodules option is necessary to also automatically download CRISPRidentify, which is one of the two CRISPR-Cas screening tools included.)
Move your current working directory into this newly downloaded one to get started!
(You may of course rename this directory if you want to. Just make sure you remember it.)
Tunable parameters
CRISPRscape includes some options that can be modified by the user.
These are stored in YAML and TXT files under the config directory.
Species of interest
The species of interest can be modified to your liking by changing the file config/species_of_interest.txt.
This file simply lists species names, one per line. The default is:
Changing the species name changes which species is automatically downloaded from AllTheBacteria (ATB). When changing the name, make sure to use the species name as it appears in the GTDB taxonomy.
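To illustrate the one-name-per-line format, here is a minimal sketch that reads such a file and reports duplicate entries. The function names are hypothetical and this is not how CRISPRscape itself parses the file. Skipping blank lines is an assumption, and the species names used in the test are only illustrative (the GTDB spelling may differ).

```python
from pathlib import Path


def parse_species_list(text):
    """Return species names from one-per-line text, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_duplicates(names):
    """Return names that occur more than once, in order of first repeat."""
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


if __name__ == "__main__":
    species_file = Path("config/species_of_interest.txt")
    if species_file.exists():
        names = parse_species_list(species_file.read_text())
        print(f"{len(names)} species listed:", names)
        print("Duplicates:", find_duplicates(names))
```

Run from the CRISPRscape directory, this prints the names that would be read from the file; elsewhere it prints nothing.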
This also affects multilocus sequence typing (MLST): CRISPRscape includes automated MLST, which requires downloading the proper marker gene database. This information is stored under config/parameters.yaml.
To find valid species names, please consult pyMLST.
Technical parameters
Then, there are some technical parameters that you can adjust to fit your system. These range from the input directory in which your genomes are stored (default: data/tmp/ATB/) and the location of databases to the number of CPU threads to use.
Please open config/config.yaml and config/parameters.yaml to review the default parameters and adjust them where needed for your system.
The default CPU settings are:
- Use a maximum total of 60 CPU threads
- Use 20 CPU threads for most compute-intensive tasks
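To see what these two numbers mean together, the sketch below computes how many compute-intensive tasks can run side by side. It assumes the two settings map onto Snakemake's total core limit and per-rule thread count, which is an assumption about how CRISPRscape is configured. Under Snakemake's scheduler, jobs run concurrently as long as their combined threads stay within the total, and a job's threads are capped at that total.

```python
def max_parallel_jobs(total_threads, threads_per_job):
    """Number of jobs of a given size that fit within the total thread budget."""
    if total_threads < 1 or threads_per_job < 1:
        raise ValueError("thread counts must be positive")
    # Snakemake caps a job's threads at the total number of cores it may use.
    effective = min(threads_per_job, total_threads)
    return total_threads // effective


print(max_parallel_jobs(60, 20))  # prints 3
print(max_parallel_jobs(16, 20))  # prints 1
```

With the defaults, up to three 20-thread tasks can run at once. On a smaller machine, lowering both values keeps the same ratio if you want to preserve that degree of parallelism.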
Download input genomes
CRISPRscape includes a convenient script to automatically download genomes of interest from ATB. When the desired species name is saved in config/species_of_interest.txt, you can start downloading with:
Here, you may optionally add a command-line parameter to select which part of ATB to search: the complete database (all = largest), only the original version (original) or the incremental update (update = smallest; default). For example:
Downloading databases
geNomad
This workflow uses geNomad to predict whether genomic contigs derive from chromosomal DNA, plasmids or viruses. This tool uses both a neural network classifier and a marker-based approach to calculate prediction scores. For the marker-based method, it requires a database which can be downloaded using the tool geNomad itself. If you have installed mamba this can be done as follows:
Note that this will create the subdirectory data/genomad_db/, which is the default that is also defined in config/parameters.yaml.
The current version of the database, v1.7, uses 1.4 GB of disk space.
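Before downloading, it can be useful to confirm that enough disk space is available. The following minimal sketch checks the free space of the current filesystem against the 1.4 GB stated above. It assumes decimal gigabytes and does not account for any temporary space needed during download or extraction. The helper has_free_space is hypothetical and not part of CRISPRscape or geNomad.

```python
import shutil

GENOMAD_DB_GB = 1.4  # size stated in this manual for database v1.7


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9  # assumes decimal gigabytes


if __name__ == "__main__":
    # "." is used because data/genomad_db/ may not exist before the download.
    print("Enough space for geNomad database:", has_free_space(".", GENOMAD_DB_GB))
```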
SpacePHARER
The bin folder also includes scripts to download and extract pre-selected databases for use in SpacePHARER. These include PhageScope for annotated phage sequences and PLSDB for annotated plasmid sequences, which have been chosen for their broad taxonomic coverage. By running:
both databases are downloaded, extracted and then merged for use in SpacePHARER.
If you wish to use a different database or add to them, see doc/spacepharer.md for advice.
2. Running the workflow
The workflow is fully automated and should complete with one command. For details on what happens under the hood, see the tab 'Workflow details'.
You can do a 'dry-run' to check that all preparations are complete:
To run the actual workflow:
3. Interpreting results
After running the workflow, the user is presented with a number of output files. These are described in detail under the tab 'Output files'.