Output files
Note
Files that we consider to be the 'final output' of this workflow are labelled with a box like this:
Final output: data/processed/filename
They are stored in the directory data/processed/
.
Other output files are often not needed for end-users and are stored in
data/tmp/
.
1. CRISPR-Cas screening
1.1 CCTyper
The results of CCTyper are by default written to:
For example:
$ ls -F data/tmp/cctyper/batch_22/SAMN36083307
arguments.tab cas_operons_putative.tab CRISPR-Cas.bed CRISPR_Cas.tab crisprs.gff fasta/ Flank.njs Flank.ntf genes.tab plot.png spacers/
Cas_operons.bed CRISPR_arrays.bed CRISPR-Cas.csv crisprs_all.tab crisprs_near_cas.tab Flank.ndb Flank.not Flank.nto hmmer.tab plot.svg
For the *.tab
files, plot.[png|svg]
, Flank.*
and spacers/
, please see
CCTyper's
own documentation.
This comprises the fundamental output of our workflow.
For other (derived) files, please read below.
1.2 Collecting many outputs from CCTyper
After running CCTyper, there is a script
bin/cctyper_extender
that reads these tabular output files (*.tab
files) and converts them to
BED files for each CRISPR array (spacers and repeats), CRISPR-Cas system
(CRISPR array with cas genes), or cas operon.
These files contain the contig ID in which the locus was found, its start and
stop positions, the CRISPR array or cas operon ID as listed by CCTyper,
the number of CRISPR repeats or cas genes found,
and whether the cas genes were on the template or reverse complement strand.
For example:
$ head data/tmp/cctyper/batch_22/SAMN36083307/*bed
==> data/tmp/cctyper/batch_22/SAMN36083307/Cas_operons.bed <==
SAMN36083307.contig00015 38536 42801 SAMN36083307.contig00015@1 3 -
==> data/tmp/cctyper/batch_22/SAMN36083307/CRISPR_arrays.bed <==
SAMN36083307.contig00015 37630 38392 SAMN36083307.contig00015_1 12 -
==> data/tmp/cctyper/batch_22/SAMN36083307/CRISPR-Cas.bed <==
SAMN36083307.contig00015 37630 42801 SAMN36083307.contig00015_1 12 -
Furthermore, the script summarises all characteristics of the CRISPR array and
cas genes in a single file: CRISPR-Cas.csv
.
The columns that it includes are:
Sample_accession Contig System CRISPR_ID CRISPR_start CRISPR_end
Trusted Repeat_subtype Repeat_type_probability Consensus_repeat N_repeats
Repeat_len Repeat_identity Spacer_len_avg Spacer_identity Spacer_len_sem
Operon_ID Distance_to_CRISPR Cas_start Cas_end Best_type
Best_score Genes Complete_interference Complete_adaptation Strand_cas
N_genes Gene_lengths_aa
Note that while these are taken from CCTyper, some names have been adjusted or added.
1.3 Use BED files to extract sequences
Using the BED files generated in the step above, we use
seqkit
to extract the loci of interest from the original assemblies based on the
contig IDs and positions.
For this, we use a custom script:
bin/extract_crispr-cas_from_fasta.sh
.
This script extracts the regions from start to stop, and also with (up to)
5000bp up- and downstream of the region. These are then written to FASTA files
in the subdirectory fasta/
Besides, for each locus it also generates FASTA files with only either the upstream or downstream region. These can be used to infer the genomic neighbourhood and compare between genomes.
$ ls -sh data/tmp/cctyper/batch_22/SAMN36083307/fasta/
total 100K
8.0K Cas_operons-downstream.fasta 12K Cas_operons-with_flanks.fasta 8.0K CRISPR_arrays-upstream.fasta 8.0K CRISPR-Cas.fasta
8.0K Cas_operons.fasta 8.0K CRISPR_arrays-downstream.fasta 12K CRISPR_arrays-with_flanks.fasta 4.0K CRISPR-Cas-upstream.fasta
4.0K Cas_operons-upstream.fasta 4.0K CRISPR_arrays.fasta 8.0K CRISPR-Cas-downstream.fasta 16K CRISPR-Cas-with_flanks.fasta
1.4 Concatenate batches
We currently split our input samples in batches and to facilitate overall
analyses, we concatenate the output files within batches using the rule
collect_cctyper
, which in turn invokes
bin/concatenate_cctyper_output.sh
and
bin/concatenate_cctyper_csv.sh
as well as a cat
command to concatenate the FASTA files of spacer sequences
find $(dirname {input.cctyper}) -mindepth 3 -maxdepth 3 -name "*.fa"\
-exec cat {{}} + > {output.spacers} 2>> {log}
This creates a set of files with the batch name as suffix, for example:
$ ls -sh data/tmp/cctyper/batch_22
4.0K all_spacers-batch_22.fa 4.0K CRISPR-Cas-batch_22.csv 4.0K crisprs_orphan-batch_22.tab
4.0K cas_operons-batch_22.tab 4.0K CRISPR_Cas-batch_22.tab 4.0K crisprs_all-batch_22.tab
4.0K CRISPR_Cas-batch_22.bed 4.0K crisprs_near_cas-batch_22.tab
1.5 Concatenate overall
To then go to one final file for the whole dataset, the files listed above
are further concatenated in rule concatenate_all_spacers
with cat
.
$ ls -sh data/tmp/cctyper/all_spacers.fa
22M data/tmp/cctyper/all_spacers.fa
$ head data/tmp/cctyper/all_spacers.fa
>SAMN37284265.contig00007_1:1
TTATTTTTGTCGCTAATTGCACCTAAAGAC
>SAMN37284265.contig00007_1:2
TTTAGAAGAAGAAATATCTTATCTTCAAAG
>SAMN37284265.contig00007_1:3
ATAAGAAGTTAATTTGCTTACTTTCAAGTA
>SAMN37284265.contig00007_1:4
GTGGGCATAAGCAACTTAATGGCAGGATTT
>SAMN37284265.contig00007_1:5
TAAAGCTGGAAATTTTTCAACTAATTTACC
Warning: missing piece
Note that we currently do not have a separate step to concatenate the CSV files derived from CCTyper.
Cluster spacers: CCTyper
All CRISPR spacer sequences identified by CCTyper are clustered with
CD-HIT to determine the number of unique spacers. This happens in the rule
cluster_unique_spacers
, and generates:
ls -sh data/tmp/cctyper/all_spacers-clustered*
948K data/tmp/cctyper/all_spacers-clustered
20M data/tmp/cctyper/all_spacers-clustered.clstr
4.0K data/tmp/cctyper/all_spacers-clustered-distribution.tsv
The file without extension is the FASTA file that CD-HIT outputs,
all_spacers-clustered.clstr
is the list of clusters as determined
by CD-HIT (in FASTA-like format). And the file
all_spacers-clustered-distribution.tsv
uses the script plot_len1.pl
that
is bundled with CD-HIT to create a tab-separated table of the number of
members in clusters of various sizes.
Also see the user guide on bioinformatics.org for details on CD-HIT's output files and extra tools.
Spacer cluster table
To facilitate further analyses on the spacer level, we have included the
step create_crispr_cluster_table
, which calls
bin/make_cluster_table.py
to parse the output of CD-HIT and create a tab-separated table that looks
like:
$ head data/processed/all_spacers_table.tsv
Genome Contig Locus Cluster Length Cluster_representative Sequence Identity Strand
SAMN30649373 contig00012 SAMN30649373.contig00012_1:1 0 75nt SAMN30649373.contig00012_1:1 TATAGCAGTAAAATAGGCTTTAGAATGTATTTTATAAAAGCGTAAAAAATCAATATTATACCGACATTTTGCGAC NA NA
SAMN30649373 contig00012 SAMN30649373.contig00012_1:2 0 55nt SAMN30649373.contig00012_1:1 TATAGCAGTAAAATAGGCTTTAGAATGTATTTTATAAAAGCGTAAAAAATCAATA 100.00% +
SAMN19302369 contig00024 SAMN19302369.contig00024_2:1 1 57nt SAMN19302369.contig00024_2:1 TATTATATAATATATACTAGTGCATTTTGATAATAATTATCGACTTCAAGTAAAATT NA NA
SAMN36876523 contig00043 SAMN36876523.contig00043_2:1 2 56nt SAMN36876523.contig00043_2:1 ATGCTTGGCTTTTTTGTAGTATATAGCAATGCTATATATCCTAGTATATACTTAAA NA NA
SAMN11571935 contig00011 SAMN11571935.contig00011_2:1 3 56nt SAMN11571935.contig00011_2:1 ATAATGGTAAAATAGTCTTTAGAATGTATTTTAAAAAAGCATAAAAAGCTAATATA NA NA
SAMN32601407 contig00036 SAMN32601407.contig00036_2:1 4 56nt SAMN32601407.contig00036_2:1 ATAATGGTAAAATAGTCTTTAGAATGTATTTTAAATAAACTTAAAAAGCTAATATA NA NA
SAMEA1676287 contig00014 SAMEA1676287.contig00014_1:2 5 56nt SAMEA1676287.contig00014_1:2 TAGAAAATTTTTTATAAAAAATGTATATTTTTTTACCAAAAAATGAGACAAAAGCA NA NA
SAMEA1676287 contig00014 SAMEA1676287.contig00014_1:4 5 56nt SAMEA1676287.contig00014_1:2 TAGAAAATTTTTTATAAAAAATGTATATTTTTTTACCAAAAAATGAGACAAAAGCA 100.00% +
SAMN15582136 contig00006 SAMN15582136.contig00006_1:9 6 55nt SAMN15582136.contig00006_1:9 CTCAAACTTAGATTTATGGTAAAATTCAAATTCAAATATGACATCTAATAGTGAC NA NA
Final output: data/processed/all_spacers_table.tsv
2. CRISPR-Cas refinement
2.1 Evaluation by CRISPRidentify
See the original documentation on CRISPRidentify's output .
(Write a couple of sentences describing the output and maybe include some examples from the command-line.)
Final output: data/processed/all_CRISPRS_with_identify.tab
Multilocus Sequence Typing
The workflow includes classical MLST with pyMLST and the pubMLST database. The output is a simple table with two columns containing the genome's name and its assigned sequence type (if any):
$ head data/processed/mlst_table.tsv
Genome ST
SAMN10081961 48
SAMN09605354 12411
SAMN09343354 8
SAMN09405115 8
SAMN09489478 9111
Final output: data/processed/mlst_table.tsv
Antiviral defence systems
For identifying diverse antiviral defence systems in bacteria, we use PADLOC. The developers have explained its output in table format as described on GitHub.
We concatenate output from all genome batches in one file:
Final output: data/processed/padloc_table.csv
Virus and plasmid predictions
For each contig, we collect chromosome/plasmid/virus predictions derived from geNomad and Jaeger.
For geNomad, we combine the prediction scores with available plasmid or virus information into one comma-separatede dataframe. For example:
$ head data/processed/genomad_predictions.csv
genome,batch,contig,chromosome_score,plasmid_score,virus_score,plasmid_topology,plasmid_genes,conjugation_genes,amr_genes,virus_topology,virus_genes,virus_taxonomy,genomad_prediction
SAMEA114408902,batch_22,SAMEA114408902.contig00032,0.1828,0.2246,0.5926,NA,NA,NA,NA,NA,NA,NA,chromosome
SAMEA114461799,batch_22,SAMEA114461799.contig00001,0.0133,0.0547,0.932,NA,NA,NA,NA,No terminal repeats,62,Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;,virus
SAMEA114461799,batch_22,SAMEA114461799.contig00014,0.7204,0.2192,0.0604,NA,NA,NA,NA,NA,NA,NA,chromosome
SAMEA114461799,batch_22,SAMEA114461799.contig00015,0.8178,0.1384,0.0439,NA,NA,NA,NA,NA,NA,NA,chromosome
SAMEA114461799,batch_22,SAMEA114461799.contig00016,0.0017,0.9941,0.0042,No terminal repeats,19,T_virB9;T_virB8;T_virB6;virb4;T_virB2,NA,NA,NA,NA,plasmid
If geNomad did not assign virus or plasmid information, we classify the contig as 'chromosome'. This is listed in the final column, 'genomad_prediction'.
Final output: data/processed/genomad_predictions.csv
For Jaeger, the output is described in its own documentation.
Final output: data/processed/jaeger_predictions.csv