Development notes
During the development of this workflow, we made a number of decisions and ran into unexpected situations with certain tools. This document describes the rationale behind the workflow in its current form and sheds light on why we use the tools and parameters we use.
Downloading databases
(Rationale for including database downloads in the Snakefile versus providing separate scripts that need to be run before running the workflow with Snakemake.)
Databases are downloaded for:

- ATB input genomes (script `bin/prepare_genomes.sh`)
- Multilocus sequence typing (Snakemake rule `download_mlst_database`)
- PADLOC anti-phage defence system screening (Snakemake rule `download_padloc_database`)
- geNomad plasmid/phage predictions (not included, undocumented)
- SpacePHARER CRISPR target prediction (script `download_spacepharer_database.sh`)
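As a rough sketch of how these pieces might be invoked before a full run (a sketch only: it assumes the scripts are run from the repository root and that the two download rules can be requested directly as Snakemake targets):

```bash
# Fetch the ATB input genomes and the SpacePHARER database via the
# provided scripts (run once, before starting the workflow):
bash bin/prepare_genomes.sh
bash download_spacepharer_database.sh

# The MLST and PADLOC databases are handled by dedicated Snakemake
# rules; assuming these work as standalone targets, request them directly:
snakemake --cores 1 download_mlst_database download_padloc_database
```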
Use of GNU parallel over Snakemake
Snakemake is very good at running processes in parallel and making efficient use of computational resources. However, its file-based nature, combined with the need to generate a directed acyclic graph (DAG) before running the actual workflow, can lead to excessive waiting times when there are tens of thousands of input files. We therefore decided to use the batches provided by AllTheBacteria as input units and to use a separate program, GNU parallel, to handle the files within these batches. This solution is taken from an online post by John van Dam: https://stackoverflow.com/a/55007872. As described there, this makes the whole DAG simpler and the processing faster, at the cost of losing Snakemake's checks that all expected output files are generated.
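To illustrate the pattern, here is a minimal sketch of batch-level processing with GNU parallel; the batch name, the tool `analyse_genome`, and the output layout are hypothetical placeholders, not the workflow's actual rule:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Snakemake sees only the batch directory as a single input unit;
# GNU parallel fans out over the individual FASTA files inside it.
BATCH_DIR="data/tmp/ATB/batch_example"   # placeholder batch name
OUT_DIR="results/batch_example"          # placeholder output location
mkdir -p "$OUT_DIR"

# {/.} is GNU parallel's replacement string for the input file's
# basename without its extension; -j 8 runs eight jobs concurrently.
find "$BATCH_DIR" -name '*.fa*' \
    | parallel -j 8 "analyse_genome {} > $OUT_DIR/{/.}.txt"
```

The trade-off described above is visible here: Snakemake only tracks the batch directory, so a missing per-genome output file would not by itself trigger a rerun.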
Using other inputs than ATB
This workflow was designed to work with bacterial genomes from AllTheBacteria (ATB) as input and provides scripts to automatically download genomes of the bacterial species of interest. However, the processing steps should work equally well on other genomes in FASTA format. Currently, the easiest way to 'trick' the workflow into using custom input is to either point the input directory setting in `config/parameters.yaml` to wherever you stored your files, or to create this directory and copy your genomes there. Furthermore, Snakemake expects the input to be stored in batches, in directories named 'batch_*', where the asterisk can match anything. For example, you can create a directory `data/tmp/ATB/batch_custom`, put your FASTA files in it, and Snakemake should pick them up and have them analysed (see the sketch below).
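As a concrete sketch of the second option (the source directory `my_genomes/` is a hypothetical placeholder for wherever your FASTA files live):

```bash
# Create a custom batch directory matching the expected 'batch_*' pattern
# and copy the custom genomes into it ('my_genomes/' is hypothetical).
mkdir -p data/tmp/ATB/batch_custom
cp my_genomes/*.fasta data/tmp/ATB/batch_custom/
```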
Also note that downstream processing, i.e., the statistical analyses and visualisation, makes use of the metadata provided with ATB. This would have to be adjusted for custom data as well.