Output files
Note
Files that we consider to be the 'final output' of this workflow are labelled with a box like this:
Final output: results/filename
They are stored in the directory results/.
Other output files are often not needed for end-users and are stored in
subfolders within results/.
1. CRISPR-Cas screening
1.1 CCTyper
The results of CCTyper are by default written to:
For example:
$ ls -F results/cctyper/atb.assembly.incr_release.202408.batch.38/
arguments.tab crisprs_all.tab crisprs_putative.tab Flank.ntf plot.png
cas_operons_putative.tab crisprs.gff Flank.ndb Flank.nto plot.svg
CRISPR_Cas_putative.tab crisprs_near_cas.tab Flank.njs genes.tab spacers/
CRISPR_Cas.tab crisprs_orphan.tab Flank.not hmmer.tab
For the *.tab files, plot.[png|svg], Flank.* and spacers/, please see
CCTyper's
own documentation.
This comprises the fundamental output of our workflow.
For other (derived) files, please read below.
1.2 Collecting many outputs from CCTyper
After running CCTyper, there is a script
bin/cctyper_extender
that reads these tabular output files (*.tab files) and converts them to
BED files for each CRISPR array (spacers and repeats), CRISPR-Cas system
(CRISPR array with cas genes), or cas operon.
These files contain the contig ID in which the locus was found, its start and
stop positions, the CRISPR array or cas operon ID as listed by CCTyper,
the number of CRISPR repeats or cas genes found,
and whether the cas genes were on the template or reverse complement strand.
For example:
$ head -2 results/cctyper-parse/atb.assembly.incr_release.202408.batch.38/*bed
==> results/cctyper-parse/atb.assembly.incr_release.202408.batch.38/Cas_operons.bed <==
SAMEA115501754.contig00002 112331 116642 SAMEA115501754.contig00002@4 4 +
SAMEA115501782.contig00002 121539 125851 SAMEA115501782.contig00002@4 3 +
==> results/cctyper-parse/atb.assembly.incr_release.202408.batch.38/CRISPR_arrays.bed <==
SAMEA115501754.contig00002 116789 117088 SAMEA115501754.contig00002_56 5 +
SAMEA115501782.contig00002 125998 127282 SAMEA115501782.contig00002_80 20 +
==> results/cctyper-parse/atb.assembly.incr_release.202408.batch.38/CRISPR-Cas.bed <==
SAMEA115501754.contig00002 112331 117088 SAMEA115501754.contig00002_56 5 +
SAMEA115501782.contig00002 121539 127282 SAMEA115501782.contig00002_80 20 +
Furthermore, the script summarises all characteristics of the CRISPR array and
cas genes in a single file: results/cctyper-parse/[batch]/CRISPR-Cas.tsv.
It has 28 columns, which are:
$ head -1 results/cctyper-parse/atb.assembly.incr_release.202408.batch.38/CRISPR-Cas.tsv | tr "\t" "\n" | nl
1 Sample_accession
2 Contig
3 System
4 CRISPR_ID
5 CRISPR_start
6 CRISPR_end
7 Trusted
8 Repeat_subtype
9 Repeat_type_probability
10 Consensus_repeat
11 N_repeats
12 Repeat_len
13 Repeat_identity
14 Spacer_len_avg
15 Spacer_identity
16 Spacer_len_sem
17 Operon_ID
18 Distance_to_CRISPR
19 Cas_start
20 Cas_end
21 Best_type
22 Best_score
23 Genes
24 Complete_interference
25 Complete_adaptation
26 Strand_cas
27 N_genes
28 Gene_lengths_aa
Note that while these are taken from CCTyper, some names have been adjusted
or added. These per-batch tables are then concatenated into one overall
results table: results/crisprs-primary.tsv.
1.3 Use BED files to extract sequences
Using the BED files generated in the step above, we use
seqkit
to extract the loci of interest from the original assemblies based on the
contig IDs and positions.
For this, we use a custom script:
bin/extract_crispr-cas_from_fasta.sh
.
This script extracts the regions from start to stop, with (up to)
5000bp up- and downstream of the region. These are then written to FASTA files
in the subdirectory results/crispr_fasta/
$ ls -sh results/crispr_fasta/atb.assembly.incr_release.202408.batch.38/
total 188K
48K CRISPR-Cas.fasta 140K CRISPR-Cas-with_flanks.fasta
1.4 Concatenate spacers
All spacers detected during the primary screening step of CRISPRscape
are concatenated in one file with cat.
$ ls -sh results/spacers-primary.fasta
360K results/spacers-primary.fasta
$ head results/spacers-primary.fasta
>SAMN39693127.contig00003_1:1
TGCGTGGGCGACCTTTTGTATAAACAAGTT
>SAMN39693127.contig00003_1:2
TGTATCCCTTACCACTACTTTGCCATATTG
>SAMN39693127.contig00003_1:3
CTCTTACAAGCTCTATTGCAATAATTTTAA
>SAMN39693127.contig00003_1:4
TCTAAAAAGGAAATTTTTTAAAGCTGACTT
>SAMN39693127.contig00003_1:5
CAATGTTTTGTCAAGTTTCAAGCGAAGGCG
Cluster spacers: CCTyper
All CRISPR spacer sequences identified by CCTyper are clustered with
CD-HIT to determine the number of unique spacers. This happens in the rule
cluster_unique_spacers, and generates:
ls -sh results/cluster-cctyper/spacers-clustered*
948K results/cluster-cctyper/spacers-clustered
20M results/cluster-cctyper/spacers-clustered.clstr
The file without extension is the FASTA file that CD-HIT outputs,
spacers-clustered.clstr is the list of clusters as determined
by CD-HIT (in FASTA-like format).
Also see the user guide on bioinformatics.org for details on CD-HIT's output files and extra tools.
Spacer cluster table
To facilitate further analyses on the spacer level, we have included a script
to summarise the spacer cluster information in one table. This scripts,
bin/make_cluster_table.py,
parses the output of CD-HIT to create a tab-separated table
results/spacers-primary.tsv that looks like:
$ head results/spacers-primary.tsv
Spacer_ID Sequence Length Cluster Spacer_base_ID Spacer_position
SAMN39693127.contig00003_1:1 TGCGTGGGCGACCTTTTGTATAAACAAGTT 30 1129 SAMN39693127.contig00003_1 1
SAMN39693127.contig00003_1:2 TGTATCCCTTACCACTACTTTGCCATATTG 30 1026 SAMN39693127.contig00003_1 2
SAMN39693127.contig00003_1:3 CTCTTACAAGCTCTATTGCAATAATTTTAA 30 983 SAMN39693127.contig00003_1 3
SAMN39693127.contig00003_1:4 TCTAAAAAGGAAATTTTTTAAAGCTGACTT 30 1156 SAMN39693127.contig00003_1 4
SAMN39693127.contig00003_1:5 CAATGTTTTGTCAAGTTTCAAGCGAAGGCG 30 904 SAMN39693127.contig00003_1 5
SAMN39693127.contig00003_1:6 AAACCTAAATCGTTATTCTCACCGCTCTTT 30 1051 SAMN39693127.contig00003_1 6
SAMN39693127.contig00003_1:7 AATAATGGCTAAATATTTCATGAGAATGGA 30 840 SAMN39693127.contig00003_1 7
SAMN39693127.contig00007_2:1 CGTAGAATACGATGAAAG 18 1635 SAMN39693127.contig00007_2 1
SAMN39693127.contig00010_3:1 TTAAAGTCCCTAAAAATAGGCATTTTTTGCTTTAGGTTAGGTAC 44 781 SAMN39693127.contig00010_3 1
The results per cluster are summarised in a separate table that lists
the number of sequences per cluster, the most common spacer sequence,
the shortest and the longest sequence: results/spacer_clusters-primary.tsv.
Final output:
results/crisprs-primary.tsv
results/spacers-primary.fasta
results/spacers-primary-deduplicated_and_counted.fasta
results/spacers-primary.tsv
results/spacer_clusters-primary.tsv
2. CRISPR-Cas refinement
2.1 Evaluation by CRISPRidentify
See the original documentation on CRISPRidentify's output.
Complete CRISPR-Cas arrays detected by CCTyper are passed on to CRISPRidentify for further evaluation and refinement. This generates per array output files, and summaries per batch:
$ ls -F results/crispridentify/atb.assembly.incr_release.202408.batch.38/SAMEA112866769-contig00010_2/
Alternative_Candidates.txt Bona-Fide_Candidates.txt gff_result/ Possible_Candidates.txt Repeats.fasta Summary.csv
Arrays.fasta combined.gff Low_Score_Candidates.txt Possible_Discarded_Candidates.txt Spacers.fasta
$ ls -F results/crispridentify/atb.assembly.incr_release.202408.batch.38/
Complete_array_dataset.fasta Complete_repeat_dataset.fasta SAMEA112866769-contig00010_2/
Complete_Cassette_summary.csv Complete_spacer_dataset.fasta SAMEA112867010-contig00031_10-11/
Complete_Cas_summary.csv Complete_summary.csv
The 'Complete_summary.csv' files are combined for all batches into one
large table: results/crisprs-final.tsv.
$ head -3 results/crisprs-final.tsv
Name Global ID ID Region index Start End Length Consensus repeat Repeat Length Average Spacer Length Number of spacers Strand Category Score batch
SAMN39693127-contig00003_1 1 1 1 9409 9906 498 ATTTTACCATAAAGAAATTTAAAAAGGGACTAAAAC 36 30 7 Forward (Orientation was not computed) Bona-fide 0.9462372113536278 atb.assembly.incr_release.202408.batch.23
SAMN39693157-contig00006_6 2 1 1 9410 10039 630 ATTTTACCATAAAGAAATTTAAAAAGGGACTAAAAC 36 30 9 Forward (Orientation was not computed) Bona-fide 0.9462372113536278 atb.assembly.incr_release.202408.batch.23
Also, spacers are collected in one overall file and clustered into unique
sequences, generating: results/spacers-final.fasta,
results/spacers-final-deduplicated_and_counted.fasta,
results/spacers-final.tsv, and results/spacer_clusters-final.tsv.
$ ls -1sh results/spacer*final*
76K results/spacer_clusters-final.tsv
52K results/spacers-final-deduplicated_and_counted.fasta
108K results/spacers-final.fasta
176K results/spacers-final.tsv
$ head -2 results/spacer*final*
==> results/spacer_clusters-final.tsv <==
Cluster Number_of_spacers Most_common_sequence Most_common_length Most_common_number Shortest_sequence Shortest_length Longest_sequence Longest_length
0 3 TCTAGCTTGCCTAGTATAGTTACACTTCCAC 31 3 TCTAGCTTGCCTAGTATAGTTACACTTCCAC 31 TCTAGCTTGCCTAGTATAGTTACACTTCCAC 31
==> results/spacers-final-deduplicated_and_counted.fasta <==
>15.AATAATGGCTAAATATTTCATGAGAATGGA
AATAATGGCTAAATATTTCATGAGAATGGA
==> results/spacers-final.fasta <==
>SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1
TGCGTGGGCGACCTTTTGTATAAACAAGTT
==> results/spacers-final.tsv <==
Spacer_ID Sequence Length Cluster Spacer_base_ID Spacer_position
SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1 TGCGTGGGCGACCTTTTGTATAAACAAGTT 30 252 SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906 1
Final output:
results/crisprs-final.tsv
results/spacers-final.fasta
results/spacers-final-deduplicated_and_counted.fasta
results/spacers-final.tsv
results/spacer_clusters-final.tsv
Target prediction
To predict the targets of CRISPR spacers, CRISPRscape maps the unique spacer sequences back to the input genomes while ignoring the loci from which the spacers derive. This is done with KMA, generating output like described in its documentation.
From that, we collect a list of spacers and the genome contig to which it maps:
results/kma/CRISPR-alignments.tsv.
$ ls -shF results/kma
total 564K
276K CRISPR_alignment.tsv 44K CRISPR.aln 184K CRISPR.frag.gz 20K CRISPR.fsa 36K CRISPR.res 4.0K spacer_DB/
$ head -5 results/kma/CRISPR_alignment.tsv
spacer genome
SAMN39693127-contig00003_9409_9906 SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1 SAMEA115501831.contig00001
SAMN39693127-contig00003_9409_9906 SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1 SAMEA115501785.contig00006
SAMN39693127-contig00003_9409_9906 SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1 SAMEA115501713.contig00005
SAMN39693127-contig00003_9409_9906 SAMN39693127-contig00003_1_-_CRISPR_1_9409_9906_spacer_1 SAMEA115501708.contig00005
This contig information can be matched with the output of geNomad and Jaeger to predict if spacers match bacteriophage or plasmid sequences.
Multilocus Sequence Typing
The workflow includes classical MLST with pyMLST and the pubMLST database. The output is a simple table with two columns containing the genome's name and its assigned sequence type (if any):
$ head results/mlst_table.tsv
Genome ST
SAMN10081961 48
SAMN09605354 12411
SAMN09343354 8
SAMN09405115 8
SAMN09489478 9111
Final output: results/mlst_table.tsv
Antiviral defence systems
For identifying diverse antiviral defence systems in bacteria, we use PADLOC. The developers have explained its output in table format as described on GitHub.
We concatenate output from all genome batches in one file:
results/padloc_table.csv. A comma-separated table with 19
columns:
$ ls -sh results/padloc_table.csv
9.0M results/padloc_table.csv
$ head -1 results/padloc_table.csv | tr "," "\n" | nl
1 system.number
2 seqid
3 system
4 target.name
5 hmm.accession
6 hmm.name
7 protein.name
8 full.seq.E.value
9 domain.iE.value
10 target.coverage
11 hmm.coverage
12 start
13 end
14 strand
15 target.description
16 relative.position
17 contig.end
18 all.domains
19 best.hits
Final output: data/processed/padloc_table.csv
Virus and plasmid predictions
For each contig, we collect chromosome/plasmid/virus predictions derived from geNomad and Jaeger.
For geNomad, we combine the prediction scores with available plasmid or
virus information into one tab-separatede dataframe:
results/genomad_predictions.tsv.
$ ls -sh results/genomad_predictions.tsv
2.0M results/genomad_predictions.tsv
$ head -5 results/genomad_predictions.tsv
genome batch contig chromosome_score plasmid_score virus_score plasmid_topology plasmid_genes conjugation_genes amr_genes virus_topology virus_genes virus_taxonomy genomad_prediction
SAMN39693127 atb.assembly.incr_release.202408.batch.23 SAMN39693127.contig00001 0.6773 0.2585 0.0641 NA NA NA NA NA NA NA chromosome
SAMN39693127 atb.assembly.incr_release.202408.batch.23 SAMN39693127.contig00002 0.7124 0.23 0.0576 NA NA NA NA NA NA NA chromosome
SAMN39693127 atb.assembly.incr_release.202408.batch.23 SAMN39693127.contig00003 0.7214 0.2219 0.0567 NA NA NA NA NA NA NA chromosome
SAMN39693127 atb.assembly.incr_release.202408.batch.23 SAMN39693127.contig00004 0.7716 0.1783 0.0502 NA NA NA NA NA NA NA chromosome
If geNomad did not assign virus or plasmid information, we classify the contig as 'chromosome'. This is listed in the final column, 'genomad_prediction'.
Final output: results/genomad_predictions.tsv
We also run Jaeger as additional virus check. The output is described in its own documentation.
$ ls -sh results/jaeger_predictions.tsv
440K results/jaeger_predictions.tsv
$ head -5 results/jaeger_predictions.tsv
accession_id contig length prediction reliability_score prophage_contam
SAMN39830054 SAMN39830054.contig00012 10686 non-phage 0.892 TRUE
SAMN39706706 SAMN39706706.contig00037 3106 non-phage 0.957 FALSE
SAMN39847329 SAMN39847329.contig00016 41162 non-phage 0.932 TRUE
SAMN39693127 SAMN39693127.contig00017 6561 non-phage 0.882 TRUE
Final output: results/jaeger_predictions.tsv
Genome dereplication
As described in the extra functions,
CRISPRscape looks for identical genomes and provides a 'dereplication table'
that may be used to discard duplicates for statistical analyses. The output
is written to results/dereplication_table.tsv.
This table lists for each genome the corresponding cluster representative,
which may be used to flag and remove duplicates.
$ head -4 results/dereplication_table.tsv
batch genome secondary_cluster representative score cluster_method comparison_algorithm threshold primary_cluster
atb.assembly.incr_release.202408.batch.23 SAMN39748578.fa 1_1 SAMN39748578.fa 1.6179841263202184 complete fastANI 9.9999999999989e-5 1
atb.assembly.incr_release.202408.batch.23 SAMN40530429.fa 1_2 SAMN40530429.fa 1.6180244997520763 complete fastANI 9.9999999999989e-5 1
atb.assembly.incr_release.202408.batch.23 SAMN39746878.fa 1_3 SAMN39746878.fa 1.6179033569300467 complete fastANI 9.9999999999989e-5 1
Final output: results/dereplication_table.tsv