Step-by-step explanation of the CRISPR-Cas screening workflow (simplified)

flowchart LR
    A([Genome]) -->|Screen CRISPR-Cas| B[Tables]
    A([Genome]) -->|Screen CRISPR-Cas| F(FASTA)
    B -->|Parse| C[CSV file]
    B -->|Concatenate| D[Table / batch]
    F(FASTA) -->|Concatenate| G(Spacers / batch)
    C[CSV file] -->|Concatenate| H[CSV / batch]
    H[CSV / batch] -->|"Combine
                        batches"| I[("CRISPR-Cas
                                      database")]

1. Screening genomes for presence of CRISPR-Cas

We screen each input genome for the presence of CRISPR-Cas loci using CCTyper. CCTyper produces tab-separated output tables summarising results of:

  1. Complete CRISPR-Cas systems (CRISPR + Cas locus)
  2. Orphan CRISPR arrays (CRISPR spacers and repeats, without cas genes)
  3. Cas operons (any putative operon of cas genes, with or without a CRISPR array)

It also saves the CRISPR spacer sequences as .fa files: each array gets one FASTA file, with each spacer as a separate entry.

This is described in the Snakefile rule crisprcastyper. The output files have the extension .tab.

1.1 Summarising the summaries

To facilitate further processing of the results reported by CCTyper, we combine the most relevant results from the different .tab files into one .csv file: CRISPR-Cas.csv. At the same time, the array/operon names, start and stop positions, and DNA strand orientation are collected and saved as .bed files.

This corresponds to rule parse_cctyper. We do this using a custom script: bin/cctyper_expender.py.
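As an illustration of the .bed output described above, here is a minimal sketch of how one array or operon could be written as a six-column BED line. It is not the code in the custom script. The function name `to_bed_line`, the example values, and the assumption that the reported positions are 1-based and inclusive are all hypothetical; BED itself uses 0-based, half-open intervals, so the start is shifted by one under that assumption.

```python
def to_bed_line(contig, start, end, name, strand, one_based=True):
    """Format one feature (array or operon) as a six-column BED line.

    Assumption: input positions are 1-based and inclusive unless
    one_based=False. BED intervals are 0-based and half-open, so only
    the start coordinate changes in the conversion.
    """
    if strand not in {"+", "-", "."}:
        raise ValueError(f"Unexpected strand value: {strand!r}")
    bed_start = start - 1 if one_based else start
    if bed_start < 0 or end <= bed_start:
        raise ValueError(f"Invalid interval: {start}-{end}")
    # Columns: chrom, chromStart, chromEnd, name, score, strand.
    # The score is set to 0 because no score is available here.
    return "\t".join([contig, str(bed_start), str(end), name, "0", strand])


def write_bed(features, output_path):
    """Write an iterable of (contig, start, end, name, strand) tuples to a file."""
    with open(output_path, "w") as out:
        for contig, start, end, name, strand in features:
            out.write(to_bed_line(contig, start, end, name, strand) + "\n")
```

Keeping the coordinate conversion in one small function makes it easy to switch off (`one_based=False`) if the input positions turn out to be 0-based already.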

1.2 Practical detail on processing tens of thousands of genomes

Since we are working with large numbers of genome files, these have been split into batches of up to 4,000 genomes each. For each batch, we concatenate the tabular files mentioned above, as described in Snakefile rule collect_cctyper. This step concatenates the results from all .tab files in a batch, the .csv files generated in step 1.1, and the .fa files containing the separate CRISPR spacer sequences generated by CCTyper.

This step uses two custom bash scripts to concatenate the tables so that the concatenated file contains only one header line: bin/concatenate_cctyper_output.sh and bin/concatenate_cctyper_csv.sh. To concatenate the spacer FASTA files, we simply use cat.
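The idea of concatenating tables while keeping a single header line can be sketched in Python as follows. This is not a translation of the bash scripts named above. The function name `concatenate_with_single_header` is hypothetical, and the sketch assumes that every input table starts with exactly one header line and that all headers are identical.

```python
def _with_newline(line):
    """Make sure a line ends with a newline so that rows never get glued together."""
    return line if line.endswith("\n") else line + "\n"


def concatenate_with_single_header(input_paths, output_path):
    """Concatenate tables, writing the header line once.

    Returns the number of data rows written. Empty input files are skipped.
    Assumption: each non-empty file has exactly one header line.
    """
    header = None
    n_rows = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                first_line = handle.readline()
                if not first_line:
                    continue  # empty file, nothing to add
                first_line = _with_newline(first_line)
                if header is None:
                    # The first non-empty file provides the header.
                    header = first_line
                    out.write(header)
                elif first_line != header:
                    raise ValueError(f"Header of {path} differs from the first file")
                # Copy the remaining lines (the data rows) as they are.
                for line in handle:
                    out.write(_with_newline(line))
                    n_rows += 1
    return n_rows
```

Checking that each header matches the first one is an extra safeguard added in this sketch: it catches tables of different types being mixed into one output file.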

Note that CCTyper reports putative cas operons. The concatenate_cctyper_output.sh script also makes a .tab file of predicted 'true' operons as cas_operons-[batch name].tab.

1.3 Collect all identified CRISPR spacers

After CCTyper's output files have been concatenated per batch, the CRISPR spacer sequences (.fa files) of all batches are concatenated into one file by the rule concatenate_all_spacers. This results in a single file, all_spacers.fasta, covering all input genomes.
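Because FASTA files have no header line, combining them is a plain byte-for-byte concatenation, just like cat. The sketch below shows that step in Python together with a simple sanity check that counts the entries afterwards. The function names and any file names passed to them are hypothetical, and this is not the implementation of the Snakefile rule.

```python
import shutil


def concatenate_fasta(input_paths, output_path):
    """Append the given FASTA files to one output file, in order (like `cat`)."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)


def count_fasta_entries(path):
    """Count sequence entries by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A quick consistency check is to compare the sum of `count_fasta_entries` over the per-batch files with the count for the combined file; the two numbers should match, provided every input file ends with a newline.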