Prepare GTDB Tk data

Pierre Chaumeil edited this page Apr 10, 2024 · 14 revisions

1 - Prepare FastANI database

  • Get all genomes used to generate Archaeal and Bacterial tree:

awk '{print $1}' data_from_db/gtdb_bac_taxonomy.tsv > raw_genomes.lst
awk '{print $1}' data_from_db/gtdb_arc_taxonomy.tsv >> raw_genomes.lst

  • Pull the fna files of those genomes in a folder:

gtdb genomes pull --batchfile raw_genomes.lst --genomic --output fastani/

  • Compress all genomes in the postprocessed/fastani folder:

pgzip *.fna
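Before moving on, it may be worth confirming that every accession in raw_genomes.lst actually has a file in the fastani folder. The sketch below is not part of the GTDB tooling: missing_genomes is a hypothetical helper, and it assumes each pulled file name starts with the accession exactly as it is written in the list.

```shell
# Hypothetical helper: print every accession from a list file that has
# no matching file in a directory.
# Assumption: each pulled file name starts with the accession string.
missing_genomes() {
    list_file=$1
    genome_dir=$2
    while IFS= read -r accession; do
        # Skip blank lines in the list
        [ -n "$accession" ] || continue
        # ls exits non-zero when the glob matches no file
        if ! ls "$genome_dir"/"$accession"* >/dev/null 2>&1; then
            printf '%s\n' "$accession"
        fi
    done < "$list_file"
}

# Usage: missing_genomes raw_genomes.lst fastani/
```

An empty output means every listed genome has at least one matching file, whether or not it has been compressed yet.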

Prepare genome_paths.tsv

Copy the create_genome_paths.sh script from the scripts folder to the fastani folder (the one above database/) and run it.

2 - Prepare untrimmed MSA

GTDB-Tk also needs the untrimmed version of each MSA:

gtdb tree create --no_trim --no_tree --genome_batchfile raw_bacterial.lst --guaranteed_batchfile raw_bacterial.lst --output bacterial_msa --marker_set_ids 1 --classic_header
gtdb tree create --no_trim --no_tree --genome_batchfile raw_archaeal.lst --guaranteed_batchfile raw_archaeal.lst --output archaeal_msa --marker_set_ids 19 --classic_header

  • Copy the new MSA files to the GTDB-Tk package:

cp bacterial_msa/gtdb_concatenated.faa gtdbtk_package/msa/gtdb_r<#>_bac120.faa
cp archaeal_msa/gtdb_concatenated.faa gtdbtk_package/msa/gtdb_r<#>_ar53.faa
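A quick sanity check on the copied files can catch a truncated or malformed alignment early. The sketch below is a generic FASTA check, not a GTDB command: msa_summary is a hypothetical helper that counts the sequences and the number of distinct sequence lengths, on the assumption that every row of a concatenated MSA should have the same length.

```shell
# Hypothetical helper: report the number of sequences and the number of
# distinct sequence lengths in a FASTA file. An alignment is expected to
# report distinct_lengths=1.
msa_summary() {
    awk '
        /^>/ {
            # Close the previous record before starting a new one
            if (n > 0) lengths[seq_len]++
            n++
            seq_len = 0
            next
        }
        { seq_len += length($0) }
        END {
            if (n > 0) lengths[seq_len]++
            for (l in lengths) distinct++
            printf "sequences=%d distinct_lengths=%d\n", n, distinct
        }
    ' "$1"
}

# Usage: msa_summary gtdbtk_package/msa/gtdb_r<#>_bac120.faa
```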

3 - Copy the masks to the GTDB-Tk package

Get the masks from the original run, for example /srv/projects/gtdb/release207/bacteria/pre_curation/bac120/20211110/msa/gtdb_r207_bac120_mask.txt

3 - Create Metadata document

We use the original trees (before they are imported into ARB) as the reference trees, because ARB rounds the branch lengths of the tree from 6 to 4 decimal places.

  • Decorate the rooted tree with the taxonomy:
    phylorank decorate gtdb_r207_bac120.rooted.fullids.tree ../../taxonomy/bac120_taxonomy_r207_reps.tsv gtdb_r207_bac120_decorated_fullids.tree --skip_rd_refine
    TODO: convert the original trees (arc and bac) from canonical ids to full ids.
    phylorank outliers gtdb_r207_bac120_decorated_fullids.tree ../../taxonomy/bac120_taxonomy_r207_reps.tsv phylorank_outliers --skip_mpld3
  • Copy the 2 dictionaries from the output of the outliers command and paste them into the metadata.txt file
  • Edit the version variable

4 - Create pplacer Package:

The pplacer packages are created from the official tree and the official trimmed MSA.

  • Optional: remove the dummy node using gtdb_validation_tk:
    gtdb_validation_tk remove_dummy gtdb_<release>_ar_curated.tree gtdb_<release>_ar_no_dummy.tree
  • Strip the taxonomy from the decorated tree:
    conda activate genometreetk-0.1.8
    genometreetk strip gtdb_<release>_bac_no_dummy.tree bac120_<release>_stripped.tree
    genometreetk strip gtdb_<release>_ar_no_dummy.tree ar53_<release>_stripped.tree
  • Use FastTree to generate a fitting log for the archaeal tree only:
    FastTreeMP -wag -nome -mllen -intree ar53_<release>_stripped.tree -log fitting_stats.log < ar_msa_<release>.faa > ar_<release>_fitted.tree
    We use the original FastTree log file for the bacterial tree.

  • Unroot the tree

hatchet unroot --input_tree gtdb_r207_bac120_decorated_fullids.tree --output_tree gtdb_r207_bac120_decorated_unrooted.tree

  • Remove spaces from gtdb_r207_bac120_decorated_unrooted.tree
  • Generate the package folder:

conda activate taxtastic-0.9.0
taxit create -l gtdbtk.refpkg -P gtdbtk.refpkg --aln-fasta <msa_file> --tree-stats <fasttree_log_file> --tree-file <decorated_unrooted.tree>

  • Copy the pplacer package to the GTDB-Tk data folder
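The "remove spaces" step in the list above has no command attached. One minimal way to do it, assuming that every space character in the Newick file should be dropped (including spaces inside decorated taxon labels), is sketched below. strip_spaces is a hypothetical helper, and the output file name in the usage line is illustrative only.

```shell
# Hypothetical helper: write a copy of a Newick file with every space
# character removed. Tabs and newlines are left untouched.
strip_spaces() {
    tr -d ' ' < "$1" > "$2"
}

# Usage (the output name is an assumption):
# strip_spaces gtdb_r207_bac120_decorated_unrooted.tree gtdb_r207_bac120_decorated_unrooted.nospace.tree
```

Writing to a separate file keeps the original tree available for comparison if the package build fails.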

5 - Run Hatchet to split the tree

conda activate hatchet-0.0.2
hatchet hatchet_wf -d bac -t ../phylorank/gtdb_r220_bac120.decorated.fullids.tree --msa bac120_msa_r220.faa --tax ../../taxonomy_files_reps/bac120_taxonomy_r220_reps.tsv -o split/ --red_file ../phylorank/phylorank_outliers_bac120/gtdb_r220_bac120.decorated.fullids.node_rd.tsv --original_log gtdb_r220_bac120_fasttree.log --metadata ../../metadata_files/bac120_metadata_r220.tsv

Copy the following items from the output directory to the GTDB-Tk package:

  • high_level/gtdbtk_package_backbone.refpkg/
  • high_level/high_red_value.tsv
  • species_level/gtdbtk.package.*.refpkg/
  • species_level/red_value*.tsv
  • species_level/tree_mapping.tsv

Prepare gtdb_radii file

cat sp_clusters.tsv | awk 'BEGIN {FS="\t"}; {printf ("%s\t%s\t%s\n", $2, $1, $4)}' > gtdb_radii.tsv
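To make the column reordering concrete, the sketch below runs the same awk program on a two-line toy file. The rows and the meaning given to each column are assumptions made up for illustration; only the reordering (columns 2, 1 and 4) comes from the command above.

```shell
# Toy input: the values and the column meanings are assumed, not real data.
demo_dir=$(mktemp -d)
printf 'GENOME_1\ts__Example one\td__X;s__Example one\t95.0\n' > "$demo_dir/sp_clusters.tsv"
printf 'GENOME_2\ts__Example two\td__X;s__Example two\t96.5\n' >> "$demo_dir/sp_clusters.tsv"

# Same reordering as the command above: columns 2, 1 and 4, tab-separated.
awk 'BEGIN {FS="\t"} {printf("%s\t%s\t%s\n", $2, $1, $4)}' \
    "$demo_dir/sp_clusters.tsv" > "$demo_dir/gtdb_radii.tsv"

cat "$demo_dir/gtdb_radii.tsv"
```

Setting FS to a tab matters here: with awk's default whitespace splitting, a second-column value that contains a space would be split across two fields and the wrong columns would be written.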

Misc commands

  • Rename versions:

find . -type l -name 'ar*' -exec rename 's/86/86.1/' {} \;
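Because a bulk rename is hard to undo, a dry run can be useful first. The sketch below is an alternative that does not depend on which rename implementation is installed. rename_links is a hypothetical helper, and unlike the command above it applies the substitution to the file name only, not to the whole path returned by find. The old string is passed to sed as a regular expression, so it is assumed to contain no special characters.

```shell
# Hypothetical helper: rename symlinks whose names start with a prefix,
# replacing the first occurrence of one string in the file name.
# With dry_run=1 (the default) it only prints the planned renames.
rename_links() {
    dir=$1; prefix=$2; old=$3; new=$4; dry_run=${5:-1}
    find "$dir" -type l -name "${prefix}*" | while IFS= read -r link; do
        base=$(basename "$link")
        target="$(dirname "$link")/$(printf '%s' "$base" | sed "s/$old/$new/")"
        # Nothing to do when the name does not contain the old string
        [ "$link" = "$target" ] && continue
        if [ "$dry_run" = 1 ]; then
            printf '%s -> %s\n' "$link" "$target"
        else
            # mv renames the link itself, not the file it points to
            mv "$link" "$target"
        fi
    done
}

# Dry run:      rename_links . ar 86 86.1
# Real rename:  rename_links . ar 86 86.1 0
```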