Prepare genomes

CRISPRscape includes scripts to automatically download bacterial genomes from AllTheBacteria, along with the available metadata (from the European Nucleotide Archive).

This page provides a step-by-step explanation of the bin/prepare_genomes.sh script. The script can be run as:

bash bin/prepare_genomes.sh

(Also see the manual.)

The script performs the following steps:

Download ATB metadata
Extract species of interest
Filter the corresponding metadata
Identify the right batches
Download genomes in batches
Remove non-target species
Download functional annotations in batches

1. Download ATB metadata

The script starts by calling a separate script: bin/download_ATB_metadata.sh. This script downloads:

ENA metadata (standard metadata from the European Nucleotide Archive)
- for this, it downloads both the '20240801' and '0.2.20240606' files
Sample list (accession IDs for all samples included in ATB)
Sylph output (taxonomic classification based on the Genome Taxonomy Database)
Assembly statistics (total length, number of contigs, N50, and more)
Assembly quality assessment (CheckM2)
Species calls (an interpretation of the Sylph and CheckM2 output)
- three columns: accession ID, species name and high-quality (T/F)
Two lists of 'all' files
- One is actually a table with accession IDs, species name, which ATB batch file they are part of, batch MD5 checksum and file size
- The other has no sample accessions, but lists all 'projects' within ATB along with their respective filenames, URLs, MD5 checksums and file sizes.

Output files

By default, these are all downloaded to the directory data/ATB/. You then get:

ena_metadata.0.2.20240606.tsv.gz
ena_metadata.20240801.tsv.gz
sample_list.txt.gz
sylph.tsv.gz
assembly-stats.tsv.gz
checkm2.tsv.gz
species_calls.tsv.gz
file_list.all.20240805.tsv.gz
all_atb_files.tsv

2. Look up accession IDs of species of interest

Using the file config/species_of_interest.txt, which contains one species name per line, the script looks up all matches in the species_calls.tsv.gz file from ATB and keeps only the high-quality assemblies. The accession IDs are stored in a separate file: data/ATB/all_samples_of_interest.txt. (To select different species, edit config/species_of_interest.txt.)

As an extra, the script counts the total number of selected genomes and prints it to the command line (stdout). It also counts the number of genomes per species and stores the result as: data/ATB/number_of_genomes_per_species.txt.

Note: the script works with the GTDB taxonomy and GNU grep. This means it can only find names as defined by GTDB, and a search term also matches any longer name that contains it. For example, Campylobacter jejuni is known in GTDB as 'Campylobacter_D jejuni'; using this as the search term also matches 'Campylobacter_D jejuni_A' or any other suffix.
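The lookup described in this step can be sketched as follows. This is a minimal illustration on toy data, not the script's actual code: the column order of the species calls table (accession ID, species name, high-quality flag) follows the description in step 1, while the header names and sample accessions are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory with toy data; the real files live in data/ATB/.
workdir=$(mktemp -d)

# Toy stand-in for config/species_of_interest.txt (one GTDB name per line).
printf '%s\n' 'Campylobacter_D jejuni' 'Campylobacter_D coli' \
  > "$workdir/species_of_interest.txt"

# Toy stand-in for species_calls.tsv.gz (hypothetical header and accessions).
printf '%s\t%s\t%s\n' \
  'Sample'  'Species'                  'HQ' \
  'SAMN001' 'Campylobacter_D jejuni'   'T' \
  'SAMN002' 'Campylobacter_D jejuni_A' 'T' \
  'SAMN003' 'Campylobacter_D coli'     'F' \
  'SAMN004' 'Escherichia coli'         'T' \
  | gzip > "$workdir/species_calls.tsv.gz"

# Match every species name as a fixed string (-F) read from a file (-f),
# then keep only rows flagged as high-quality in the third column.
gzip -dc "$workdir/species_calls.tsv.gz" \
  | grep -F -f "$workdir/species_of_interest.txt" \
  | awk -F '\t' '$3 == "T"' \
  > "$workdir/matches.tsv"

# Accession IDs of the selected genomes.
cut -f 1 "$workdir/matches.tsv" > "$workdir/all_samples_of_interest.txt"

# Number of genomes per species name.
cut -f 2 "$workdir/matches.tsv" | sort | uniq -c \
  > "$workdir/number_of_genomes_per_species.txt"

echo "Selected genomes: $(wc -l < "$workdir/all_samples_of_interest.txt")"
```

Because grep matches substrings, 'Campylobacter_D jejuni_A' is selected along with 'Campylobacter_D jejuni', while the low-quality C. coli sample is dropped by the awk filter.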

3. Filter metadata

The complete metadata file is big (677 MiB compressed, 3,112,707 lines). To make it easier to work with, the script extracts only the lines that refer to the species of interest and stores them as a separate file: data/ATB/enametadata.20240801-filtered.tsv.gz.
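A filter of this kind could look like the sketch below. It runs on toy data and assumes that accession IDs appear as whole words in the metadata lines; the column names and values are invented for the example, and the actual script may select lines differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Toy list of accession IDs (stand-in for all_samples_of_interest.txt).
printf '%s\n' 'SAMN001' 'SAMN002' > "$workdir/all_samples_of_interest.txt"

# Toy metadata table (hypothetical columns and values).
printf '%s\t%s\t%s\n' \
  'sample_accession' 'country'     'collection_date' \
  'SAMN001'          'Netherlands' '2019' \
  'SAMN0011'         'France'      '2020' \
  'SAMN002'          'Kenya'       '2021' \
  | gzip > "$workdir/ena_metadata.tsv.gz"

# -F: fixed strings, -w: whole-word matches only, so that SAMN001
# does not also select SAMN0011. -f reads the patterns from a file.
gzip -dc "$workdir/ena_metadata.tsv.gz" \
  | grep -F -w -f "$workdir/all_samples_of_interest.txt" \
  | gzip > "$workdir/ena_metadata-filtered.tsv.gz"

gzip -dc "$workdir/ena_metadata-filtered.tsv.gz"
```

Note that this sketch drops the header line; keeping it would require one extra step.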

4. Find batches that contain species of interest

ATB has created batches of genomes to use clever compression and significantly reduce file sizes. To find which batches contain the species of interest, the script:

reads sample accession IDs from data/ATB/all_samples_of_interest.txt
uses the accession IDs to filter matching lines in data/ATB/file_list.all.20240805.tsv.gz
extracts the columns containing file names, download URLs and MD5 checksums
deduplicates them, keeping one copy of each batch containing species of interest
and stores this in a separate file: data/ATB/batches_to_download.tsv

Besides the genomes of interest, these batches also contain lower-quality genomes, and sometimes different species are combined in one batch. To exclude these genomes from further analyses, the script also makes a list of their sample accession IDs so that they can be removed after downloading. Also see step 6.
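The batch selection can be sketched as below, using toy data. The column layout of the file list (sample, file name, URL, MD5 checksum), the example.org URLs and the output name other_samples.txt are assumptions for this illustration and may differ from the real ATB file and the script's own file names.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

printf '%s\n' 'SAMN001' 'SAMN002' > "$workdir/all_samples_of_interest.txt"

# Toy file list (hypothetical column order: sample, filename, url, md5).
printf '%s\t%s\t%s\t%s\n' \
  'sample'  'filename'       'url'                                'md5' \
  'SAMN001' 'batch_1.tar.xz' 'https://example.org/batch_1.tar.xz' 'aaa' \
  'SAMN002' 'batch_1.tar.xz' 'https://example.org/batch_1.tar.xz' 'aaa' \
  'SAMN003' 'batch_1.tar.xz' 'https://example.org/batch_1.tar.xz' 'aaa' \
  'SAMN004' 'batch_2.tar.xz' 'https://example.org/batch_2.tar.xz' 'bbb' \
  | gzip > "$workdir/file_list.tsv.gz"

# 1) Lines for samples of interest -> batch columns -> one line per batch.
gzip -dc "$workdir/file_list.tsv.gz" \
  | grep -F -w -f "$workdir/all_samples_of_interest.txt" \
  | cut -f 2-4 \
  | sort -u \
  > "$workdir/batches_to_download.tsv"

# 2) Samples that share a selected batch but are not of interest.
#    awk first loads the selected batch file names, then prints the
#    sample ID of every file-list row that belongs to one of them.
gzip -dc "$workdir/file_list.tsv.gz" \
  | awk -F '\t' 'NR == FNR { batch[$1]; next } ($2 in batch) { print $1 }' \
      "$workdir/batches_to_download.tsv" - \
  | grep -F -x -v -f "$workdir/all_samples_of_interest.txt" \
  > "$workdir/other_samples.txt"

cat "$workdir/batches_to_download.tsv"
cat "$workdir/other_samples.txt"
```

Here batch_2 is never selected because it holds no sample of interest, and SAMN003 ends up on the removal list because it shares batch_1 with the selected samples.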

5. Download the genomes

The actual downloading of the genomes happens here! The script calls yet another separate script: bin/download_genomes.sh

download_genomes.sh reads the files that need to be downloaded from data/ATB/batches_to_download.tsv and downloads them one by one to data/tmp/ATB/. It checks file integrity with the MD5 checksum and deletes corrupted files. (It does not retry downloading automatically.)

The batches are downloaded as XZ archives, which are extracted to subdirectories named after the batch number. This yields FASTA files in a directory called data/tmp/ATB/batch_[number].
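The integrity check can be illustrated with a small offline sketch. The function name verify_or_delete and the placeholder files are hypothetical; the sketch only shows the general pattern of comparing an MD5 checksum with md5sum (GNU coreutils) and deleting a file that does not match, without retrying.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Placeholder files standing in for downloaded batches (not real archives).
printf 'genome batch placeholder\n' > "$workdir/batch_good.tar.xz"
printf 'truncated download\n'       > "$workdir/batch_bad.tar.xz"

# Expected checksum, as it would be listed in batches_to_download.tsv.
expected_md5=$(md5sum "$workdir/batch_good.tar.xz" | cut -d ' ' -f 1)

# Hypothetical helper: keep the file if its checksum matches, else delete it.
verify_or_delete() {
  local file=$1 expected=$2 actual
  actual=$(md5sum "$file" | cut -d ' ' -f 1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "Checksum mismatch, deleting: $file" >&2
    rm -f "$file"
    return 1
  fi
}

verify_or_delete "$workdir/batch_good.tar.xz" "$expected_md5"
# The second file has different content, so it fails the check and is removed.
verify_or_delete "$workdir/batch_bad.tar.xz" "$expected_md5" || true
```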

Default species: Campylobacter coli and C. jejuni

This workflow has been developed for and tested on Campylobacter genomes. As of 2024-09-19, AllTheBacteria includes 129,080 C. coli and C. jejuni genomes, up from 104,146 before the incremental update (24,934 additional genomes).

Quality filtering

Genomes are pre-filtered to include only 'high-quality' genomes. AllTheBacteria has its own quality criteria:

≥99% species abundance (practically pure)
≥90% completeness (CheckM2)
≤5% contamination (CheckM2)
total length between 100 kbp and 15 Mbp
≤2,000 contigs
N50 ≥2,000
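To make the thresholds concrete, the sketch below applies them to a toy table with awk. ATB already provides the resulting high-quality flag in species_calls.tsv.gz, so the script does not need to recompute it; the merged table, its column order and its values are hypothetical, and treating the length limits as inclusive is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical merged table: accession, species abundance (%),
# completeness (%), contamination (%), total length (bp), contigs, N50.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  'sample' 'abundance' 'completeness' 'contamination' 'length' 'contigs' 'n50' \
  'G1' '99.5' '98.0' '1.2' '1700000' '40'   '150000' \
  'G2' '97.0' '99.0' '0.5' '1650000' '30'   '200000' \
  'G3' '99.9' '85.0' '0.3' '1500000' '60'   '90000' \
  'G4' '99.9' '95.0' '0.3' '1600000' '2500' '1500' \
  > "$workdir/quality.tsv"

# Skip the header (NR > 1) and print accessions that pass every criterion.
awk -F '\t' '
  NR > 1 &&
  $2 >= 99 && $3 >= 90 && $4 <= 5 &&
  $5 >= 100000 && $5 <= 15000000 &&
  $6 <= 2000 && $7 >= 2000 { print $1 }
' "$workdir/quality.tsv" > "$workdir/high_quality.txt"

cat "$workdir/high_quality.txt"
```

In this toy table only G1 passes: G2 fails on abundance, G3 on completeness, and G4 on both the number of contigs and N50.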

6. Remove other species

Some batches contain more than only the species of interest. In those cases, FASTA files that contain genomes from other species are deleted. This is based on the sample accession IDs listed for each batch: accessions that do not match high-quality genomes of the species of interest are automatically removed. For the curious, a list of the removed genomes is stored as: data/tmp/other_genomes-numbers.txt.
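The clean-up can be sketched as a loop over the extracted FASTA files. This is an illustration on toy files: the naming scheme `<accession>.fa` and the output name removed_genomes.txt are assumptions made for the example, not taken from the script.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
mkdir -p "$workdir/batch_1"

# Toy FASTA files, assuming one file per sample named <accession>.fa.
for acc in SAMN001 SAMN002 SAMN003; do
  printf '>%s_contig1\nACGT\n' "$acc" > "$workdir/batch_1/$acc.fa"
done

# Accessions to keep (stand-in for all_samples_of_interest.txt).
printf '%s\n' 'SAMN001' 'SAMN002' > "$workdir/all_samples_of_interest.txt"

# Delete every FASTA file whose accession is not on the keep list,
# and record what was removed.
: > "$workdir/removed_genomes.txt"
for fasta in "$workdir"/batch_*/*.fa; do
  acc=$(basename "$fasta" .fa)
  # -x: the whole line must match, so SAMN001 never matches SAMN0011.
  if ! grep -q -F -x "$acc" "$workdir/all_samples_of_interest.txt"; then
    echo "$acc" >> "$workdir/removed_genomes.txt"
    rm -f "$fasta"
  fi
done

ls "$workdir/batch_1"
```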

7. Download functional annotations

For each genome assembled in ATB, functional (gene) annotations have also been generated using Bakta. ATB stores only the JSON files, which are also wrapped in batches. The script downloads these automatically so that you can use them in your analyses.

General remarks

The whole process is set up as a Bash script and uses GNU tools such as grep and wget. This should work on most Unix-like systems.
The script accepts a command-line option to download genomes from the complete ATB, only the first version, or the incremental update. For this, it uses the options 'all', 'original' or 'update', respectively. By default, the script uses 'update', which has the smallest file sizes. The option can be provided as, for example:

bash bin/prepare_genomes.sh all

This option is most relevant to bin/download_genomes.sh (Step 5) and bin/download_bakta_annotations.sh (step 7).
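Handling such an option in Bash typically comes down to a default value and a case statement. The sketch below is one way to do it; the function name select_download_set is hypothetical and the actual script may parse its argument differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: validate the first argument, defaulting to 'update'.
select_download_set() {
  local part=${1:-update}
  case "$part" in
    all|original|update)
      echo "$part"
      ;;
    *)
      echo "Unknown option: $part (expected all, original or update)" >&2
      return 1
      ;;
  esac
}

select_download_set          # no argument: falls back to the default
select_download_set all      # explicit choice
```

The `${1:-update}` expansion supplies the default when no argument is given, and any other value is rejected with a message on stderr.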

The script checks whether files already exist. Files that are already present are not downloaded again.
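A skip-if-present check can be as simple as the sketch below. To keep it runnable offline, the real download command is replaced by a hypothetical stand-in function named fetch; in practice this would be a call such as `wget -O "$target" "$url"`. The function names and the example.org URL are assumptions for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Stand-in for the real download command, so the sketch needs no network.
fetch() {
  local url=$1 target=$2
  printf 'downloaded from %s\n' "$url" > "$target"
}

# Hypothetical helper: only download when the target file is missing.
download_if_missing() {
  local url=$1 target=$2
  if [ -e "$target" ]; then
    echo "skipped"
  else
    fetch "$url" "$target"
    echo "downloaded"
  fi
}

first=$(download_if_missing 'https://example.org/sylph.tsv.gz' "$workdir/sylph.tsv.gz")
second=$(download_if_missing 'https://example.org/sylph.tsv.gz' "$workdir/sylph.tsv.gz")
echo "First call: $first, second call: $second"
```

A check on existence alone cannot tell a complete file from a partial one, which is why a checksum comparison such as the one in step 5 is a useful complement.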

Next steps

→ Screen CRISPRs