CRISPR-Cas refinement workflow (simplified)
2. Assessing putative CRISPR-Cas loci
Rationale of two-step approach
CCTyper, the primary method we use for CRISPR-Cas identification, has been reported to have a high false-positive rate. More precisely, the tool it relies on (MinCED, derived from CRT) has a high false-positive rate. Furthermore, we have found that when highly similar CRISPR arrays from different genomes are compared, the predicted start and stop positions of repeats and spacers may shift by one or more positions. Therefore, we include a second tool, CRISPRidentify, to evaluate the loci identified by CCTyper. For a detailed description of how CRISPRidentify works, we refer you to its publication. See also the related CRISPRloci publication for information on the different modules that make up that suite.
Word of caution regarding CRISPRidentify's superior accuracy
An important comment has been posted below the CRISPRidentify publication, criticising the authors' claims of superiority over previous tools. Link to comment
In short, results should be checked by an expert. There is no CRISPR-Cas tool that outclasses the other available tools in all aspects.
In brief, CRISPRidentify has two crucial advantages compared to CCTyper:
- It has a more sophisticated method of identifying the correct start and stop positions of CRISPR repeats (details in Supplementary file 1).
- It uses a complex confidence scoring algorithm. This machine learning-based approach should exclude nearly all false-positive CRISPR-Cas loci.
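The downside, however, is that running CRISPRidentify on complete genome assemblies takes considerably longer than running CCTyper. We therefore use a two-step approach: CCTyper screens the genomes first, and CRISPRidentify then re-evaluates only the loci that CCTyper reports.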
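The time saving comes from giving the slower tool only short candidate regions instead of whole assemblies. As a minimal sketch of that idea, and not the workflow's actual code, the function below cuts a putative array plus some flanking sequence out of a contig. The function name `extract_with_flanks`, the 0-based half-open coordinates and the default flank length of 100 are assumptions made for illustration only.

```python
def extract_with_flanks(sequence, start, end, flank=100):
    """Return (region, offset) for sequence[start:end] plus flanks.

    Hypothetical helper: coordinates are 0-based and half-open, and the
    flanks are clipped at the contig boundaries. `offset` is the position
    of the region's first base in the original sequence, so coordinates
    reported on the region can be mapped back to the contig.
    """
    if not 0 <= start < end <= len(sequence):
        raise ValueError("need 0 <= start < end <= len(sequence)")
    left = max(0, start - flank)
    right = min(len(sequence), end + flank)
    return sequence[left:right], left


# Toy contig: 50 bases, a 30-base "array", then 50 more bases.
contig = "A" * 50 + "GTTTTAGAGC" * 3 + "T" * 50
region, offset = extract_with_flanks(contig, 50, 80, flank=20)
```

Keeping the offset alongside the extracted region matters because positions found in the second step are relative to the excised region, not to the original genome.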
2.1 Passing CCTyper output to CRISPRidentify
2.2 Re-evaluation of the CRISPR arrays
(Brief description of how CRISPRidentify works)
2.3 Calculating CRISPR confidence
(Brief description of what the machine learning thing does)
2.4 Combining the output with CCTyper's
Output files generated in the process
Each step in the process generates a number of output files, which by default are written to:
data/
  tmp/
    crispridentify/                  # Here go overall files, such as 'all_spacers.fa'
      batch_[number]/                # Here is only one subfolder
        CRISPR_arrays-with_flanks/   # In here are subfolders for each CRISPR
                                     # array identified with CCTyper.
          [CRISPR_ID]/               # Here are CRISPRidentify's output files
For more details on the output files, see output.
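To inspect these results programmatically, the per-array output folders can be collected with a glob pattern that mirrors the layout above. This is a minimal sketch: the helper name `list_array_dirs` is hypothetical, and the default base path assumes the default output location described here.

```python
from pathlib import Path


def list_array_dirs(base="data/tmp/crispridentify"):
    """Return the sorted per-array output directories below `base`.

    Assumes the layout batch_*/CRISPR_arrays-with_flanks/[CRISPR_ID]/.
    Returns an empty list if nothing matches.
    """
    pattern = "batch_*/CRISPR_arrays-with_flanks/*"
    return sorted(p for p in Path(base).glob(pattern) if p.is_dir())
```

Calling `list_array_dirs()` from the repository root would then yield one `Path` per CRISPR array that was passed on from CCTyper, provided the default output directory is in use.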