GAMBITcore

This application takes in assemblies, identifies the species, then calculates the completeness of each assembly against the core genome of that species. This makes it a useful quality control step. It is achieved by comparing the GAMBIT k-mers in the assembly with the GAMBIT k-mers in the core genome. A poor-quality assembly is expected to have lower completeness. If GAMBIT cannot make a species- or subspecies-level call, GAMBITcore will be skipped.
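The comparison described above can be illustrated with plain Python sets. This is a minimal sketch, not the GAMBITcore implementation: the function names `core_kmers` and `completeness` are hypothetical, k-mers are represented as toy strings, and the inclusive threshold comparison is an assumption. The 0.98 value mirrors the documented default of `--core_proportion`.

```python
from collections import Counter


def core_kmers(genome_kmer_sets, core_proportion=0.98):
    """Return the k-mers present in at least core_proportion of the genomes."""
    if not genome_kmer_sets:
        return set()
    counts = Counter()
    for kmers in genome_kmer_sets:
        counts.update(set(kmers))  # count each k-mer once per genome
    threshold = core_proportion * len(genome_kmer_sets)
    # Assumption: a k-mer at exactly the threshold counts as core.
    return {kmer for kmer, n in counts.items() if n >= threshold}


def completeness(assembly_kmers, core):
    """Return (percentage, found, expected) of core k-mers in the assembly."""
    if not core:
        return 0.0, 0, 0
    found = len(set(assembly_kmers) & core)
    return 100.0 * found / len(core), found, len(core)


# Toy data: three genomes of one species and one query assembly.
species = [{"A", "B", "C", "D"}, {"A", "B", "C"}, {"A", "B", "D"}]
core = core_kmers(species, core_proportion=0.98)
pct, found, total = completeness({"A", "C", "X"}, core)
print(f"{pct:.2f}% ({found}/{total})")
```

With three genomes, only k-mers found in all three pass the 0.98 threshold, so the toy core is `A` and `B`, and the query assembly contains one of the two.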

Installation

If you want to quickly try out the software, please use the Docker container.

Dependencies

Python 3.9 or higher
GAMBIT

Docker

To build the container, run this command from the root of the repository:

docker build -t gambitcore:latest .

To run the software from the container, run this command:

docker run -v /path/to/gambit/database_directory:/gambit -v /path/to/fasta/files:/fasta -it --rm gambitcore:latest gambitcore /gambit /fasta/*.fasta

Usage

gambitcore

This script takes in assemblies and a GAMBIT database and calculates the completeness of the assemblies against the core genome of a species.

usage: gambitcore [options]

How complete is an assembly compared to the core genome of its species?

positional arguments:
  gambit_directory      A directory containing GAMBIT files (database and signatures)
  fasta_filenames       A list of FASTA files of genomes

options:
  -h, --help            show this help message and exit
  --concise, -e         concise output (default: False)
  --cpus CPUS, -p CPUS  Number of cpus to use (default: 1)
  --kmer KMER, -k KMER  Length of the k-mer to use (default: 11)
  --kmer_prefix KMER_PREFIX, -f KMER_PREFIX
                        Kmer prefix (default: ATGAC)
  --max_species_genomes MAX_SPECIES_GENOMES, -t MAX_SPECIES_GENOMES
                        Max number of genomes in a species to consider, ignore all others above this (default: 500)
  --core_proportion CORE_PROPORTION, -c CORE_PROPORTION
                        Proportion of genomes a kmer must be in for a species to be considered core (default: 0.98)
  --num_genomes_per_species NUM_GENOMES_PER_SPECIES, -r NUM_GENOMES_PER_SPECIES
                        Number of genomes to keep for a species (0 means keep all) (default: 1)
  --verbose, -v         Turn on verbose output (default: False)

gambit_directory: The GAMBIT directory must contain a file ending in *.gdb and a file ending in *.gs. If the files do not have these suffixes, the script will not work.

fasta_filenames: This is a list of 1 or more assemblies in FASTA format. They can be gzipped or not.

help: The help text and usage information

concise: This gives a short version of the output. The information is provided in a tab delimited format to standard out.

cpus: The number of CPUs to use for GAMBIT. This is set to 1 by default. Increasing it has only a marginal impact on the overall running time.

kmer: The length of the k-mer to use. This is set to 11 by default. Don't change this because it needs to match the GAMBIT signatures file.

kmer_prefix: The k-mer prefix to use. This is set to ATGAC by default. Don't change this because it needs to match the GAMBIT signatures file.

max_species_genomes: The maximum number of genomes to use for a species. This is set to 500 by default. If there are more than 500 genomes for a species, then the script will ignore all genomes above this number. This is to speed up the script because as you add more genomes to a pangenome, you get diminishing returns, unless it has a very open pangenome. More genomes will also probably mean more random noise in the assemblies which would be captured, leading to a smaller core.

core_proportion: The proportion of a species' genomes that a k-mer must be in to be considered core. This is set to 0.98 by default, meaning a k-mer must be in 98% of the genomes of a species to be considered core. This removes k-mers that are in a small number of genomes, which are probably not core, whilst also allowing some wiggle room for assembly errors.

num_genomes_per_species: The number of genomes to keep for a species. This is set to 1 by default. This means that if there are multiple genomes for a species, then only the first one will be used. This parameter will probably be removed because setting it to anything other than 1 will probably lead to incorrect results.

verbose: Turn on verbose output. This is set to False by default. This will give you more information about what the script is doing.
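Since the script will not work without the expected file suffixes in the GAMBIT directory, it can be useful to check the directory before starting a run. The sketch below is a hypothetical pre-flight helper, not part of GAMBITcore: the name `find_gambit_files` and the example file names are assumptions, and it simply picks the first match if several files share a suffix.

```python
import tempfile
from pathlib import Path


def find_gambit_files(gambit_directory):
    """Return (database, signatures) paths, or raise if either is missing."""
    directory = Path(gambit_directory)
    databases = sorted(directory.glob("*.gdb"))
    signatures = sorted(directory.glob("*.gs"))
    if not databases or not signatures:
        raise FileNotFoundError(
            f"{directory} must contain a *.gdb file and a *.gs file"
        )
    # Assumption: if several files match, use the first in sorted order.
    return databases[0], signatures[0]


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.gdb").touch()
    (Path(tmp) / "example.gs").touch()
    database, signatures = find_gambit_files(tmp)
    print(database.name, signatures.name)
```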

Output

gambitcore writes tab-delimited output to standard out that looks like this:

Filename	Species	Completeness (%)	Assembly core/Species Core	Closest accession	Closest distance	Assembly k-mers	Species k-mers Mean	Species k-mers Std Dev	Assembly QC
test/fasta/GCF_002800775.1.fna.gz	Mycobacteroides abscessus	100.00%	(5296/5296)	GCF_000758385.1	0.0360	10847	10635	403	green
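For downstream processing, the tab-delimited output can be read with Python's standard `csv` module. The following is a sketch only: `parse_gambitcore_output` is a hypothetical helper, the column names are taken from the header row above, and it assumes every line after the header is a regular result row.

```python
import csv
import io

# The header and example row shown above, joined with tabs.
sample = (
    "Filename\tSpecies\tCompleteness (%)\tAssembly core/Species Core\t"
    "Closest accession\tClosest distance\tAssembly k-mers\t"
    "Species k-mers Mean\tSpecies k-mers Std Dev\tAssembly QC\n"
    "test/fasta/GCF_002800775.1.fna.gz\tMycobacteroides abscessus\t100.00%\t"
    "(5296/5296)\tGCF_000758385.1\t0.0360\t10847\t10635\t403\tgreen\n"
)


def parse_gambitcore_output(text):
    """Convert gambitcore's tab-delimited output into typed dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # "(5296/5296)" -> found and expected core k-mer counts.
        found, expected = row["Assembly core/Species Core"].strip("()").split("/")
        rows.append({
            "filename": row["Filename"],
            "species": row["Species"],
            "completeness": float(row["Completeness (%)"].rstrip("%")),
            "core_found": int(found),
            "core_expected": int(expected),
            "closest_accession": row["Closest accession"],
            "closest_distance": float(row["Closest distance"]),
            "assembly_kmers": int(row["Assembly k-mers"]),
            "qc": row["Assembly QC"],
        })
    return rows


records = parse_gambitcore_output(sample)
print(records[0]["species"], records[0]["completeness"], records[0]["qc"])
```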

For concise output, use the -e flag:

Filename	Species	Completeness (%)
/fasta/file1.fa	Mycobacteroides abscessus	99.75%

The columns are:

Filename: The name of the input FASTA file upon which the analysis was performed.

Species: The predicted species from GAMBIT.

Completeness (%): This is the percentage of core k-mers from the species found in the input assembly. A fully complete assembly should contain 100% of the core k-mers. It is normal that some k-mers may not be present due to assembly errors, although a good quality assembly should be very close to 100%. The absolute number of core k-mers found, and the number of core k-mers expected, are given in brackets in the Assembly core/Species Core column.

Closest accession: This is the accession number of the genome from the database which is closest to the input assembly, as determined by GAMBIT. All GAMBIT k-mers are used to calculate this.

Closest distance: The GAMBIT distance/diameter to the closest accession. This is a decimal number between 0 and 1, with a lower number indicating a closer match. All GAMBIT k-mers are used to calculate this.

Assembly k-mers: The total number of GAMBIT k-mers in the assembly.

Species k-mers mean: The mean GAMBIT k-mers for the species (all GAMBIT k-mers, not just core ones). This gives you an indication of how large the core is compared to the mean k-mers (roughly the average size).

Species k-mers std dev: The standard deviation of the number of GAMBIT k-mers in a sample.

Assembly QC: This is a colour-coded output to give you an indication of the quality of the assembly. Green means the Assembly k-mers are within 2 standard deviations (95%) of the species k-mers mean. Amber means the Assembly k-mers are between 2 and 3 standard deviations (99.7%) from the species k-mers mean. Red means the Assembly k-mers are more than 3 standard deviations from the species k-mers mean and something might be very wrong. This is a very rough guide, but it can be useful to quickly identify assemblies that are an unusual size relative to the species.
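The colour rule for Assembly QC can be written as a small function. This is a sketch of the rule as described, not the tool's code: `assembly_qc` is a hypothetical name, and treating values exactly on the 2 and 3 standard deviation boundaries as the milder colour is an assumption.

```python
def assembly_qc(assembly_kmers, species_mean, species_std):
    """Classify an assembly by how far its k-mer count is from the species mean."""
    deviation = abs(assembly_kmers - species_mean)
    if deviation <= 2 * species_std:
        return "green"
    if deviation <= 3 * species_std:
        return "amber"
    return "red"


# Values from the example output row: 10847 k-mers, mean 10635, std dev 403.
print(assembly_qc(10847, 10635, 403))
```

For the example row, the assembly is 212 k-mers above the mean, well inside 2 standard deviations (806), which agrees with the green value reported in that row.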

⚠️ Warning

If GAMBIT fails to make a species- or subspecies-level assignment, GAMBITcore will be skipped with the message "Species could not be identified, skipping core genome assessment".

gambitcore-species

This is a script which takes in a GAMBIT database and calculates the core k-mers for every species in the database. It then outputs the details for each species to a tab delimited file.

The usage for the script is:

usage: gambitcore-species [options]

Kmer statistics for all species in a database. Warning it can take a long time to run.

positional arguments:
  gambit_directory      A directory containing GAMBIT files (database and signatures)

options:
  -h, --help            show this help message and exit
  --species SPECIES, -s SPECIES
                        Provide the name of a single species, default is to use everything in the database (default: None)
  --cpus CPUS, -p CPUS  Number of cpus to use (default: 1)
  --kmer KMER, -k KMER  Length of the k-mer to use (default: 11)
  --kmer_prefix KMER_PREFIX, -f KMER_PREFIX
                        Kmer prefix (default: ATGAC)
  --max_species_genomes MAX_SPECIES_GENOMES, -t MAX_SPECIES_GENOMES
                        Max number of genomes in a species to consider, ignore all others above this (default: 500)
  --core_proportion CORE_PROPORTION, -c CORE_PROPORTION
                        Proportion of genomes a kmer must be in for a species to be considered core (default: 0.98)
  --num_genomes_per_species NUM_GENOMES_PER_SPECIES, -r NUM_GENOMES_PER_SPECIES
                        Number of genomes to keep for a species (0 means keep all) (default: 1)
  --verbose, -v         Turn on verbose output (default: False)

The parameters are the same as above with the following exceptions:

species: A single string containing the name of the species to generate statistics for. It must exactly match the species name in the database.

Output

gambitcore-species writes tab-delimited output to standard out that looks like this:

Species	Core k-mers	Mean k-mers	k-mers Std Dev	Min k-mers	Max k-mers	Available Genomes	Used Genomes
Achromobacter xylosoxidans	4983	11076	365	10366	12167	133	133
Acinetobacter baumannii	3014	8542	221	7907	9410	1276	500
Bifidobacterium bifidum	3709	6218	121	5957	6539	23	23
Campylobacter coli	164	1256	61	1039	1437	252	252
Campylobacter jejuni	380	1306	49	1105	1454	556	500
Salmonella enterica	5319	10616	324	9729	12333	4048	500

The columns are as described previously, with the following additions:

min k-mers: The lowest number of k-mers in a genome in the database for that species. If the min is far lower than the mean, accounting for the standard deviation, then it could indicate a highly variable species or assembly errors in the genomes in the database.

max k-mers: The highest number of k-mers in a genome in the database for that species. If the max is far higher than the mean, accounting for the standard deviation, then it could indicate a highly variable species or assembly errors in the genomes in the database.

available genomes: This is the number of genomes in the database for the given species.

used genomes: The number of genomes used to generate the core k-mers. This can be lower than the available genomes when the species has more genomes than the max_species_genomes limit (500 by default). The limit is used to reduce the overall running time.
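To apply the min and max guidance across many species, the distance of each extreme from the mean can be expressed in standard deviations. The sketch below uses three rows from the example table. The helper name `flag_variable_species` and the cut-off of 3 standard deviations are assumptions chosen for illustration, not values used by gambitcore-species.

```python
import csv
import io

# Three rows copied from the example gambitcore-species output.
sample = (
    "Species\tCore k-mers\tMean k-mers\tk-mers Std Dev\tMin k-mers\t"
    "Max k-mers\tAvailable Genomes\tUsed Genomes\n"
    "Achromobacter xylosoxidans\t4983\t11076\t365\t10366\t12167\t133\t133\n"
    "Campylobacter coli\t164\t1256\t61\t1039\t1437\t252\t252\n"
    "Salmonella enterica\t5319\t10616\t324\t9729\t12333\t4048\t500\n"
)


def flag_variable_species(text, cutoff=3.0):
    """Return species whose min or max k-mer count is beyond cutoff std devs."""
    flagged = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        mean = float(row["Mean k-mers"])
        std = float(row["k-mers Std Dev"])
        if std == 0:
            continue  # no spread, nothing to compare against
        low = (mean - float(row["Min k-mers"])) / std
        high = (float(row["Max k-mers"]) - mean) / std
        if max(low, high) > cutoff:
            flagged.append(row["Species"])
    return flagged


flagged = flag_variable_species(sample)
print(flagged)
```

A flagged species is not necessarily a problem. As noted above, it could reflect a highly variable species or assembly errors in the database genomes.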
