Skip to content

gambit-suite/gambitcore

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GAMBITcore

This application takes in assemblies, identifies the species, then calculates the completeness of the assemblies against the species core genome. This is a good quality control step. It is acheived by looking at the GAMBIT k-mers in the assembly and comparing them to the GAMBIT k-mers in the core genome. If the assembly is poor quality, it is expected that the completeness of the assembly will be lower. If GAMBIT cannot make a species or subspecies level call, GAMBITcore will be skipped.

Installation

If you want to quickly try out the software, please use the docker container.

Dependencies

  • Python 3.9 or higher
  • GAMBIT

Docker

To build the container, run this command from the root of the repository:

docker build -t gambitcore:latest .

To run the software from the container, run this command:

docker run -v /path/to/gambit/database_directory:/gambit -v /path/to/fasta/files:/fasta -it --rm gambitcore:latest gambitcore /gambit /fasta/*.fasta

Usage

gambitcore

This script takes in assemblies and a GAMBIT database and caculates the completeness of the assemblies against the core genome of a species.

usage: gambitcore [options]

How complete is an assembly compared to the core genome of its species?

positional arguments:
  gambit_directory      A directory containing GAMBIT files (database and signatures)
  fasta_filenames       A list of FASTA files of genomes

options:
  -h, --help            show this help message and exit
  --concise, -e         concise output (default: False)
  --cpus CPUS, -p CPUS  Number of cpus to use (default: 1)
  --kmer KMER, -k KMER  Length of the k-mer to use (default: 11)
  --kmer_prefix KMER_PREFIX, -f KMER_PREFIX
                        Kmer prefix (default: ATGAC)
  --max_species_genomes MAX_SPECIES_GENOMES, -t MAX_SPECIES_GENOMES
                        Max number of genomes in a species to consider, ignore all others above this (default: 500)
  --core_proportion CORE_PROPORTION, -c CORE_PROPORTION
                        Proportion of genomes a kmer must be in for a species to be considered core (default: 0.98)
  --num_genomes_per_species NUM_GENOMES_PER_SPECIES, -r NUM_GENOMES_PER_SPECIES
                        Number of genomes to keep for a species (0 means keep all) (default: 1)
  --verbose, -v         Turn on verbose output (default: False)

gambit_directory: The GAMBIT directory must contain a file ending in *.gdb and a file ending in *.gs. If it doesnt have these file suffixes, then the script will not work.

fasta_filenames: This is a list of 1 or more assemblies in FASTA format. They can be gzipped or not.

help: The help text and usage information

concise: This gives a short version of the output. The information is provided in a tab delimited format to standard out.

cpus: The number of CPUs to use for GAMBIT. This is set to 1 by default, but it will have a marginal impact on the overall running time.

kmer: The length of the k-mer to use. This is set to 11 by default. Dont change this because it needs to match the GAMBIT signatures file.

kmer_prefix: The k-mer prefix to use. This is set to ATGAC by default. Dont change this because it needs to match the GAMBIT signatures file.

max_species_genomes: The maximum number of genomes to use for a species. This is set to 500 by default. If there are more than 500 genomes for a species, then the script will ignore all genomes above this number. This is to speed up the script because as you add more genomes to a pangenome, you get diminishing returns, unless it has a very open pangenome. More genomes will also probably mean more random noise in the assemblies which would be captured, leading to a smaller core.

core_proportion: The proportion of genomes a k-mer must be in for a species to be considered core. This is set to 0.98 by default. This means that a k-mer must be in 98% of the genomes for a species to be considered core. This is to remove k-mers that are in a small number of genomes, which are probably not core, whilst also allowing some wiggle room for assembly errors.

num_genomes_per_species: The number of genomes to keep for a species. This is set to 1 by default. This means that if there are multiple genomes for a species, then only the first one will be used. This parameter will probably be removed because setting it to anything other than 1 will probably lead to incorrect results.

verbose: Turn on verbose output. This is set to False by default. This will give you more information about what the script is doing.

Output

gambitcore will then output a tab delimited output to standard out that looks like this:

Filename Species Completeness (%) Assembly core/Species Core Closest accession Closest distance Assembly k-mers Species k-mers Mean Species k-mers Std Dev Assembly QC
test/fasta/GCF_002800775.1.fna.gz Mycobacteroides abscessus 100.00% (5296/5296) GCF_000758385.1 0.0360 10847 10635 403 green

To get a concise output then use the -e flag:

Filename Species Completeness (%)
/fasta/file1.fa Mycobacteroides abscessus 99.75%

The columns are:

Filename: The name of the input FASTA file upon which the analysis was performed.

Species: The predicted species from GAMBIT.

Completeness (%): This is the percentage of core k-mers from the species found in the input assembly. A fully complete assembly should contain 100% of all the core k-mers. It is normal that some k-mers may not be present due to assembly errors, although a good quality assembly should be very close to 100%. The absolute number of core k-mers found, and the number of core k-mers expected, are in brackets.

Closest accession: This is the accession number of the genome from the database which is closest to the input assembly, as determined by GAMBIT. All GAMBIT k-mers are used to calcuate this.

Closest distance: The GAMBIT distance/diameter to the closest accession. This is a decimal number between 0 and 1, which a lower number indicating a closer match. All GAMBIT k-mers are used to calcuate this.

Assembly k-mers: The total number of GAMBIT k-mers in the assembly.

Species k-mers mean: The mean GAMBIT k-mers for the species (all GAMBIT k-mers, not just core ones). This gives you an indication of how large the core is compared to the mean k-mers (roughly the average size).

Species k-mers std dev: The standard deviation of the number of GABMIT k-mers in a sample.

Assembly QC: This is a colour coded output to give you an indication of the quality of the assembly. Green means the Assembly k-mers are within 2 standard deviations (95%) of the species k-mers mean. Amber means the Assembly k-mers are between 2 and 3 standard deviations (99.7%) of the species k-mers mean. Red means the Assembly k-mers are more than 3 standard deviations of the species k-mers mean and something might be very wrong. This is a very rough guide, but it can be useful to quickly identify assemblies that are an unusual size relative to the species.

⚠️ Warning

If GAMBIT failes to make a species- or subspecies-level assignment, GAMBITcore will be skipped with the message "Species could not be identified, skipping core genome assessment".

gambitcore-species

This is a script which takes in a GAMBIT database and calculates the core k-mers for every species in the database. It then outputs the details for each species to a tab delimited file.

The usage for the script is:

usage: gambitcore-species [options]

Kmer statistics for all species in a database. Warning it can take a long time to run.

positional arguments:
  gambit_directory      A directory containing GAMBIT files (database and signatures)

options:
  -h, --help            show this help message and exit
  --species SPECIES, -s SPECIES
                        Provide the name of a single species, default is to use everything in the database (default: None)
  --cpus CPUS, -p CPUS  Number of cpus to use (default: 1)
  --kmer KMER, -k KMER  Length of the k-mer to use (default: 11)
  --kmer_prefix KMER_PREFIX, -f KMER_PREFIX
                        Kmer prefix (default: ATGAC)
  --max_species_genomes MAX_SPECIES_GENOMES, -t MAX_SPECIES_GENOMES
                        Max number of genomes in a species to consider, ignore all others above this (default: 500)
  --core_proportion CORE_PROPORTION, -c CORE_PROPORTION
                        Proportion of genomes a kmer must be in for a species to be considered core (default: 0.98)
  --num_genomes_per_species NUM_GENOMES_PER_SPECIES, -r NUM_GENOMES_PER_SPECIES
                        Number of genomes to keep for a species (0 means keep all) (default: 1)
  --verbose, -v         Turn on verbose output (default: False)

The parameters are the same as above with the following exceptions:

species: This is a single string containing a species name which will be used to generate statistics for. It must be present in the database with the exact same string.

Output

gambitcore-species will then output a tab delimited output to standard out that looks like this:

Species Core k-mers Mean k-mers k-mers Std Dev Min k-mers Max k-mers Available Genomes Used Genomes
Achromobacter xylosoxidans 4983 11076 365 10366 12167 133 133
Acinetobacter baumannii 3014 8542 221 7907 9410 1276 500
Bifidobacterium bifidum 3709 6218 121 5957 6539 23 23
Campylobacter coli 164 1256 61 1039 1437 252 252
Campylobacter jejuni 380 1306 49 1105 1454 556 500
Salmonella enterica 5319 10616 324 9729 12333 4048 500

The columns are as previous with the following additions:

min k-mers: The lowest number of k-mers in a genome in the database for that species. If the min is far lower than the mean, accounting for the standard deviation, then it could indicate a highly variable species or assembly errors in the genomes in the database.

max k-mers: The highest number of k-mers in a genome in the database for that species. If the max is far higher than the mean, accounting for the standard deviation, then it could indicate a highly variable species or assembly errors in the genomes in the database.

available genomes: This is the number of genomes in the database for the given species.

used genomes: The number of genomes used to generate the core k-mers. This can be lower than the avialable genomes if the user supplied parameter is set. This parameter is used to reduce the overall running time.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published