CRISPR-Cas refinement workflow (simplified)

2. Assessing putative CRISPR-Cas loci

Rationale of two-step approach

It has been shown that CCTyper, the primary method we use for CRISPR-Cas identification, may have a high false-positive rate. More precisely, the tool it relies on (MinCED, derived from CRT) has a high false-positive rate. Furthermore, we have found that when highly similar CRISPR arrays from different genomes are compared, the start and stop positions of repeats and spacers may shift by one or more positions. Therefore, we include a second tool, CRISPRidentify, to evaluate the loci identified by CCTyper. For a detailed description of how CRISPRidentify works, we refer you to the publication. See also the related publication on CRISPRloci for information on the different modules that make up this suite.

Word of caution regarding CRISPRidentify's superior accuracy

An important comment has been posted below the CRISPRidentify publication, criticising the authors' claims of superiority over previous tools. Link to comment

In short, results should be checked by an expert. There is no CRISPR-Cas tool that outclasses the other available tools in all aspects.

That said, CRISPRidentify has two crucial advantages over CCTyper:

It has a more sophisticated method of identifying the correct start and stop positions of CRISPR repeats (details in Supplementary file 1).
It uses a complex confidence scoring algorithm. This machine learning-based approach should exclude nearly all false-positive CRISPR-Cas loci.

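The downside, however, is that running CRISPRidentify on complete genome assemblies takes considerably longer than running CCTyper. Therefore, we use a two-step approach: CCTyper first screens the complete genomes, and CRISPRidentify then re-evaluates only the loci that CCTyper reports.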

2.1 Passing CCTyper output to CRISPRidentify

2.2 Re-evaluation of the CRISPR arrays

Using the arrays parsed from CCTyper, CRISPRidentify reaches its final conclusion in two separate steps: candidate generation and candidate evaluation. In candidate generation, CRISPRidentify uses Vmatch to find putative repeat pairs, which by default need to be between 21 and 55 nucleotides long and 18-78 nucleotides apart. This process is relatively sensitive, so usually more than one repeat candidate is generated. All the repeat candidates are then aligned, and two strings are derived from the alignment: a maximum element and a minimum element. The maximum element is the longest repeat string, built from the most common nucleotide at each position across all candidates. The minimum element is the most common substring of all repeats; by definition, it is a substring of the maximum element with 100% identity.
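To make the idea of the maximum element concrete, here is a minimal sketch of a column-wise majority consensus. It is not CRISPRidentify's actual code: the function name `maximum_element`, the `-` gap character and the assumption that the candidates arrive already aligned to equal length are all ours, and the example sequences are made up.

```python
from collections import Counter


def maximum_element(aligned_candidates, gap="-"):
    """Return the majority nucleotide of every alignment column.

    Assumes (hypothetically) that the repeat candidates are already
    aligned and padded with `gap` to the same length.
    """
    if not aligned_candidates:
        return ""
    width = len(aligned_candidates[0])
    if any(len(c) != width for c in aligned_candidates):
        raise ValueError("candidates must be aligned to equal length")

    consensus = []
    for column in zip(*aligned_candidates):
        # Ignore gaps so that a base seen in any candidate can still
        # contribute to the longest possible consensus string.
        bases = [b for b in column if b != gap]
        if bases:
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Made-up example candidates, padded with gaps at the ends.
candidates = [
    "-TTTCAATCCACG",
    "GTTTCAATCCACG",
    "GTTTCTATCCAC-",
]
print(maximum_element(candidates))
```

The example deliberately has no ties; how ties between equally common nucleotides are resolved in the real tool is not covered here.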

Every possible repeat between the minimum and the maximum element is then generated and combined with the matches found by Vmatch. After duplicates are filtered out, this forms the set of repeat candidates. The set is extended even further by omitting up to 3 nucleotides on each side of each repeat, which generates an additional 15 candidates per repeat (4 × 4 trimming combinations, minus the untrimmed repeat itself). In essence, many possible repeats are considered for analysis.
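The count of 15 extra candidates follows from trimming 0 to 3 nucleotides independently from each end. The sketch below enumerates them; the function name `trimmed_variants` and the example repeat are hypothetical and only illustrate the arithmetic, not the tool's implementation.

```python
def trimmed_variants(repeat, max_trim=3):
    """List every version of `repeat` with 0..max_trim nucleotides
    removed from the left and/or right end, excluding the untrimmed
    repeat itself."""
    variants = []
    for left in range(max_trim + 1):
        for right in range(max_trim + 1):
            if left == 0 and right == 0:
                continue  # skip the original, untrimmed repeat
            end = len(repeat) - right
            if left < end:  # never emit an empty string
                variants.append(repeat[left:end])
    return variants


# Made-up 30-nt repeat, for illustration only.
repeat = "GTTTCAATCCACGCGCCCACGCGGGGCGAC"
variants = trimmed_variants(repeat)
print(len(variants))
```

For low-complexity repeats some trimmed variants can be identical to each other, which is one reason a duplicate filter is useful after this step.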

CRISPR array candidates are created by searching the provided sequence for each candidate repeat. This search attempts to minimise the number of editing operations needed for the consensus repeat, while still allowing mutations in the repeats to be detected.

2.3 Calculating CRISPR confidence

After all CRISPR array candidates are generated, they are evaluated by CRISPRidentify's internal scoring system. This scoring system was created by considering 13 features that can predict array viability in multiple ways. The 13 features are listed in Supplementary file 1, table S2 (page 20). When feature subset selection was performed on all combinations of these 13 features, three models containing 8, 9 and 10 of the 13 features achieved similar accuracy. By default, CRISPRidentify uses the average of these three models to score the candidate arrays.

The scores are divided into three categories. Candidates scoring 0-0.4 are low-scoring and unlikely to be CRISPR arrays. Candidates scoring 0.4-0.75 are possible CRISPR arrays. Candidates scoring 0.75-1.0 are Bona-Fide CRISPR arrays and are very likely to be valid. When Bona-Fide candidates overlap, the lower-scoring arrays are placed in the alternative candidate category instead.
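The scoring logic described above can be sketched as follows. This is an illustration under stated assumptions, not CRISPRidentify's code: all function names are hypothetical, the example scores are made up, assigning the boundary values 0.4 and 0.75 to the higher category is our choice (the text does not specify it), and the greedy overlap handling is one simple way to keep the best-scoring array.

```python
from statistics import mean


def combined_score(model_scores):
    """Average the scores of the three models (default behaviour as
    described in the text)."""
    return mean(model_scores)


def categorise(score):
    """Map a score in [0, 1] to one of the three categories.
    Boundary handling (>=) is an assumption."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.75:
        return "Bona-Fide"
    if score >= 0.4:
        return "Possible"
    return "Low score"


def split_overlapping(arrays):
    """Split Bona-Fide arrays, given as (start, end, score) tuples,
    into kept arrays and alternative candidates. Greedy sketch: the
    highest-scoring array wins wherever arrays overlap."""
    kept, alternative = [], []
    for start, end, score in sorted(arrays, key=lambda a: a[2], reverse=True):
        overlaps = any(start <= k_end and k_start <= end
                       for k_start, k_end, _ in kept)
        (alternative if overlaps else kept).append((start, end, score))
    return kept, alternative


# Made-up scores from three hypothetical models for one candidate.
score = combined_score([0.81, 0.78, 0.84])
print(categorise(score))

# Made-up Bona-Fide arrays; the first two overlap.
kept, alternative = split_overlapping(
    [(100, 500, 0.80), (450, 900, 0.92), (2000, 2400, 0.77)]
)
print(kept, alternative)
```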

2.4 Combining the output with CCTyper's

Output files generated in the process

Each step in the process generates a number of output files, which by default are written to:

data/
  tmp/
    crispridentify/                 # Here go overall files, such as 'all_spacers.fa'
      batch_[number]/               # Here is only one subfolder
        CRISPR_arrays-with_flanks/  # In here are subfolders for each CRISPR
                                    #  array identified with CCTyper.
          [CRISPR_ID]/              # Here are CRISPRidentify's output files

For more details on the output files, see output.
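To inspect the results per array, the folder layout shown above can be walked with a few lines of Python. This is a minimal sketch that assumes the default output location; the function name `list_array_dirs` is ours, and the files inside each [CRISPR_ID] folder are simply listed without assuming their names.

```python
from pathlib import Path


def list_array_dirs(base="data/tmp/crispridentify"):
    """Return the per-array output folders below `base`, following
    the layout batch_[number]/CRISPR_arrays-with_flanks/[CRISPR_ID]/."""
    base = Path(base)
    if not base.is_dir():
        return []  # nothing has been generated (yet)
    pattern = "batch_*/CRISPR_arrays-with_flanks/*"
    return sorted(p for p in base.glob(pattern) if p.is_dir())


for array_dir in list_array_dirs():
    # Print each CRISPR ID with the names of its output files.
    print(array_dir.name, sorted(f.name for f in array_dir.iterdir()))
```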

Next steps

→ Cluster spacers