Step-by-step explanation of the CRISPR-Cas screening workflow (simplified)
flowchart LR
A([Genome]) -->|Screen CRISPR-Cas| B[Tables]
A([Genome]) -->|Screen CRISPR-Cas| F(FASTA)
B -->|Parse| C[CSV file]
B -->|Concatenate| D[Table / batch]
F(FASTA) -->|Concatenate| G(Spacers / batch)
C[CSV file] -->|Concatenate| H[CSV / batch]
H[CSV / batch] -->|"Combine
batches"| I[("CRISPR-Cas
database")]
1. Screening genomes for presence of CRISPR-Cas
We screen each input genome for the presence of CRISPR-Cas loci using CCTyper. CCTyper produces tab-separated output tables summarising results of:
- Complete CRISPR-Cas systems (CRISPR + Cas locus)
- Orphan CRISPR arrays (CRISPR spacers and repeats, without cas genes)
- Cas operons (any putative operon of cas genes, with or without a CRISPR array)
It also saves the CRISPR spacer sequences as .fa files: each array gets one FASTA file, with each spacer as a separate entry.
This is described in the Snakefile rule crisprcastyper. The output tables have the extension .tab.
1.1 Summarising the summaries
To facilitate further processing of the results reported by CCTyper, we combine the most relevant results from the different .tab files into one .csv file: CRISPR-Cas.csv. At the same time, the array/operon names, start and stop positions, and DNA sequence orientation are collected and saved as .bed files.
This corresponds to rule parse_cctyper, which uses a custom script: bin/cctyper_expender.py.
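To illustrate the .bed output, the sketch below formats one locus as a six-column BED line. It is not the code in bin/cctyper_expender.py: the `Locus` record, its field names, and the assumption that the reported positions are 1-based and inclusive are all hypothetical and should be checked against the actual CCTyper tables. The only fixed point is the BED convention itself (0-based start, end not included).

```python
from typing import Iterable, NamedTuple


class Locus(NamedTuple):
    """Hypothetical record for one CRISPR array or cas operon."""
    contig: str
    name: str
    start: int   # assumed 1-based, inclusive
    end: int     # assumed 1-based, inclusive
    strand: str  # "+" or "-", or "." if unknown


def to_bed_line(locus: Locus) -> str:
    """Format a locus as a six-column BED line (0-based, half-open)."""
    bed_start = locus.start - 1  # BED start coordinates are 0-based
    # Column 5 (score) is not used here, so it is set to 0.
    fields = [locus.contig, str(bed_start), str(locus.end),
              locus.name, "0", locus.strand]
    return "\t".join(fields)


def write_bed(loci: Iterable[Locus], path: str) -> None:
    """Write one BED line per locus to the given path."""
    with open(path, "w") as handle:
        for locus in loci:
            handle.write(to_bed_line(locus) + "\n")
```

If the positions in the source tables turn out to be 0-based already, the subtraction in `to_bed_line` would have to be dropped.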
1.2 Practical detail on processing tens of thousands of genomes
Since we are working with large numbers of genome files, these have been split into batches of up to 4,000 genomes each. For each batch, we concatenate the tabular files mentioned above. This is described in the Snakefile rule collect_cctyper.
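How the genomes were assigned to batches is not described here. As a minimal sketch, assuming a plain list of genome file names that is cut into consecutive chunks, batching with an upper limit of 4,000 could look like this (the function name `make_batches` and the file names are hypothetical):

```python
from typing import Iterator, List, Sequence


def make_batches(genomes: Sequence[str],
                 batch_size: int = 4000) -> Iterator[List[str]]:
    """Yield consecutive batches with at most batch_size genomes each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(genomes), batch_size):
        yield list(genomes[start:start + batch_size])


if __name__ == "__main__":
    genomes = [f"genome_{i}.fasta" for i in range(9000)]
    for number, batch in enumerate(make_batches(genomes), start=1):
        print(f"batch {number}: {len(batch)} genomes")
```

Only the last batch can be smaller than the limit; all others are filled completely.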
This step concatenates the results from all .tab files in a batch, the .csv files generated in step 1.1, and the .fa files containing the separate CRISPR spacer sequences generated by CCTyper.
This step uses two custom bash scripts to concatenate tables so that the concatenated file contains only one header line: bin/concatenate_cctyper_output.sh and bin/concatenate_cctyper_csv.sh.
To concatenate the spacer FASTA files, we simply use cat.
Note that CCTyper reports putative cas operons. The concatenate_cctyper_output.sh script also makes a .tab file of predicted 'true' operons, named cas_operons-[batch name].tab.
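The header-aware concatenation used for the tables in this step can be sketched in Python as follows. This is an illustration of the idea, not a translation of the two bash scripts: the function name `concatenate_tables` is hypothetical, and the sketch assumes that every input table starts with the same single header line (it does not verify this).

```python
from pathlib import Path
from typing import Iterable


def concatenate_tables(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate tables, keeping only the first header line.

    Returns the number of data lines written (header excluded).
    """
    header_written = False
    n_rows = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                header = handle.readline()
                if not header:
                    continue  # skip empty files
                if not header_written:
                    # Make sure the header ends with a newline.
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                for line in handle:
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_rows += 1
    return n_rows
```

The same function works for both the .tab and the .csv files, because it never splits lines into columns.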
1.3 Collect all identified CRISPR spacers
After the concatenation of CCTyper's output files per batch, the CRISPR spacer sequences (.fa files) of all batches are concatenated into one file by the rule concatenate_all_spacers.
This results in a single file, all_spacers.fasta, for all input genomes together.
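FASTA files have no header line, so they can be joined byte for byte, which is what cat does. The sketch below shows an equivalent in Python together with a simple sanity check that counts records before and after. The function names and the idea of checking record counts are illustrative assumptions, not part of the workflow described above.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_fasta(inputs: Iterable[Path], output: Path) -> None:
    """Append the input files to the output byte for byte, like cat."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)


def count_records(fasta: Path) -> int:
    """Count FASTA records by counting lines that start with '>'."""
    with open(fasta) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

As with cat, an input file that does not end with a newline would fuse its last sequence line with the next header. Comparing the summed `count_records` of the inputs with that of the combined file is one way to detect this.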