Missing Archaea summary when using ANI screen #508

Closed
avw-adifranco opened this issue Apr 17, 2023 · 12 comments
Labels
error Help required for a GTDB-Tk error.

Comments

@avw-adifranco

Hello,

I've tested the new ANI screen method, using the mash DB, for classify_wf.

I've observed that a few genomes are missing from each run, as is the summary output gtdbtk.ar53.summary.tsv. The missing outputs correspond to archaeal genomes that were identified during the ANI screen, as I can find them in classify/ani_screen/gtdbtk.ar53.ani_summary.tsv.

I guess the implementation of the ANI screen missed the archaeal part of the pipeline?
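A minimal sketch of how the mismatch described above could be checked, assuming both TSV files keep the genome name in their first column and that the final archaeal summary would sit at classify/gtdbtk.ar53.summary.tsv (by analogy with the bacterial one). The function names are hypothetical and not part of GTDB-Tk.

```python
from pathlib import Path


def first_column(lines):
    """Collect first-column values from TSV lines, skipping the header row."""
    values = set()
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index == 0 or not line.strip():
            continue  # header row or blank line
        values.add(line.split("\t", 1)[0])
    return values


def read_ids(path):
    """Return genome IDs from a TSV file, or an empty set if the file is absent."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open() as handle:
        return first_column(handle)


def screened_but_unreported(out_dir, domain="ar53"):
    """Genomes listed in the ANI screen summary but absent from the final summary.

    The summary location below is an assumption based on the bacterial symlink
    target; adjust it if your output layout differs.
    """
    out = Path(out_dir)
    screened = read_ids(out / "classify" / "ani_screen" / f"gtdbtk.{domain}.ani_summary.tsv")
    reported = read_ids(out / "classify" / f"gtdbtk.{domain}.summary.tsv")
    return sorted(screened - reported)


if __name__ == "__main__":
    for genome in screened_but_unreported("gtdbtk2_classify"):
        print(genome)
```

Calling it with domain="bac120" runs the same comparison on the bacterial files.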

I am using version 2.2.6 in a conda environment created by installing GTDB-Tk from Bioconda, but I guess the issue is independent of the installation method. Here is my command:

gtdbtk classify_wf --mash_db ./GTDB/gtdb-tk-r207v2.msh --genome_dir ./ALL/ -x fasta --out_dir gtdbtk2_classify --cpus 18 --pplacer_cpus 18 --tmpdir ./tmp --scratch_dir ./pplacer

As it is somewhat related, I was wondering whether the archaeal and bacterial summaries could be consolidated into a single output in a future release?
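In the meantime, the two summaries can be concatenated by hand. The sketch below assumes both files share exactly the same header line (the code checks this and stops if they differ); merge_summaries and the output file name are hypothetical, not a GTDB-Tk feature.

```python
from pathlib import Path


def merge_summaries(paths, merged_path):
    """Concatenate TSV summaries that share a header; return the number of data rows."""
    header = None
    body = []
    for path in map(Path, paths):
        if not path.exists():
            continue  # e.g. a run with no archaeal genomes
        lines = path.read_text().splitlines()
        if not lines:
            continue
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError(f"header mismatch in {path}")
        body.extend(line for line in lines[1:] if line.strip())
    if header is None:
        raise FileNotFoundError("none of the summary files were found")
    Path(merged_path).write_text("\n".join([header, *body]) + "\n")
    return len(body)


# Example call (paths follow the output layout shown in this thread):
# merge_summaries(
#     ["gtdbtk2_classify/classify/gtdbtk.bac120.summary.tsv",
#      "gtdbtk2_classify/classify/gtdbtk.ar53.summary.tsv"],
#     "gtdbtk2_classify/gtdbtk.all.summary.tsv",
# )
```

Missing input files are skipped, so the same call works for runs that contain only one domain.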

Thank you.

@avw-adifranco avw-adifranco added the error Help required for a GTDB-Tk error. label Apr 17, 2023
@pchaumeil
Collaborator

Hi,
Could you please provide the gtdbtk.log file of the run?
Thanks

@avw-adifranco
Author

Here is the content of gtdbtk.log

The 3 archaeal genomes are among the 164 genomes identified by the ANI screen.
I do not see any sign of them in the log apart from the mention of the gtdbtk.ar53.ani_summary.tsv file creation.

[2023-04-11 17:57:47] INFO: GTDB-Tk v2.2.6
[2023-04-11 17:57:47] INFO: gtdbtk classify_wf --mash_db /home/adf/DB/GTDB/gtdb-tk-r207v2.msh --genome_dir ./ALL/ -x fasta --out_dir gtdbtk2_classify_metadec --cpus 18 --pplacer_cpus 18 --tmpdir ./tmp --scratch_dir ./pplacer
[2023-04-11 17:57:47] INFO: Using GTDB-Tk reference data version r207: /home/adf/DB/GTDB/release207_v2
[2023-04-11 17:57:47] INFO: Loading reference genomes.
[2023-04-11 17:57:48] INFO: Using Mash version 2.3
[2023-04-11 17:57:48] INFO: Creating Mash sketch file: gtdbtk2_classify_metadec/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-04-11 17:57:49] INFO: Completed 269 genomes in 0.94 seconds (285.30 genomes/second).
[2023-04-11 17:57:49] INFO: Loading data from existing Mash sketch file: /home/adf/DB/GTDB/gtdb-tk-r207v2.msh
[2023-04-11 17:57:52] INFO: Calculating Mash distances.
[2023-04-11 17:58:56] INFO: Calculating ANI with FastANI v1.32.
[2023-04-11 17:59:13] INFO: Completed 440 comparisons in 16.66 seconds (26.41 comparisons/second).
[2023-04-11 17:59:15] INFO: Summary of results saved to: gtdbtk2_classify_metadec/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-04-11 17:59:15] INFO: Summary of results saved to: gtdbtk2_classify_metadec/classify/ani_screen/gtdbtk.ar53.ani_summary.tsv
[2023-04-11 17:59:15] INFO: 164 genome(s) have been classified using the ANI pre-screening step.
[2023-04-11 17:59:15] INFO: Done.
[2023-04-11 17:59:15] INFO: Identifying markers in 105 genomes with 18 threads.
[2023-04-11 17:59:15] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-04-11 17:59:55] INFO: Completed 105 genomes in 40.69 seconds (2.58 genomes/second).
[2023-04-11 17:59:56] TASK: Identifying TIGRFAM protein families.
[2023-04-11 18:00:12] INFO: Completed 105 genomes in 16.48 seconds (6.37 genomes/second).
[2023-04-11 18:00:12] TASK: Identifying Pfam protein families.
[2023-04-11 18:00:13] INFO: Completed 105 genomes in 1.29 seconds (81.10 genomes/second).
[2023-04-11 18:00:13] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).
[2023-04-11 18:00:13] TASK: Summarising identified marker genes.
[2023-04-11 18:00:14] INFO: Completed 105 genomes in 0.86 seconds (121.56 genomes/second).
[2023-04-11 18:00:14] INFO: Done.
[2023-04-11 18:00:14] INFO: Aligning markers in 105 genomes with 18 CPUs.
[2023-04-11 18:00:14] INFO: Processing 105 genomes identified as bacterial.
[2023-04-11 18:00:20] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2023-04-11 18:00:20] TASK: Generating concatenated alignment for each marker.
[2023-04-11 18:00:21] INFO: Completed 105 genomes in 0.07 seconds (1,568.95 genomes/second).
[2023-04-11 18:00:21] TASK: Aligning 120 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-04-11 18:00:27] INFO: Completed 120 markers in 4.80 seconds (25.00 markers/second).
[2023-04-11 18:00:27] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-04-11 18:01:35] INFO: Completed 62,396 sequences in 1.14 minutes (54,954.13 sequences/minute).
[2023-04-11 18:01:35] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2023-04-11 18:01:35] INFO: 16 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-04-11 18:01:35] INFO: Creating concatenated alignment for 62,380 bacterial GTDB and user genomes.
[2023-04-11 18:01:50] INFO: Creating concatenated alignment for 89 bacterial user genomes.
[2023-04-11 18:01:50] INFO: Done.
[2023-04-11 18:01:51] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:01:51] TASK: Placing 89 bacterial genomes into backbone reference tree with pplacer using 18 CPUs (be patient).
[2023-04-11 18:01:51] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-04-11 18:04:34] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:04:35] INFO: 89 out of 89 have an class assignments. Those genomes will be reclassified.
[2023-04-11 18:04:35] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:04:35] TASK: Placing 35 bacterial genomes into class-level reference tree 2 (1/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:10:48] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:10:50] TASK: Traversing tree to determine classification method.
[2023-04-11 18:10:50] INFO: Completed 35 genomes in 0.04 seconds (967.49 genomes/second).
[2023-04-11 18:10:50] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:11:27] INFO: Completed 1,190 comparisons in 36.80 seconds (32.33 comparisons/second).
[2023-04-11 18:11:28] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:11:28] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:11:28] TASK: Placing 19 bacterial genomes into class-level reference tree 3 (2/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:17:45] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:17:47] TASK: Traversing tree to determine classification method.
[2023-04-11 18:17:47] INFO: Completed 19 genomes in 0.04 seconds (539.78 genomes/second).
[2023-04-11 18:17:47] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:18:47] INFO: Completed 1,896 comparisons in 1.00 minutes (1,895.25 comparisons/minute).
[2023-04-11 18:18:48] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:18:48] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:18:48] TASK: Placing 11 bacterial genomes into class-level reference tree 7 (3/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:22:45] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:22:46] TASK: Traversing tree to determine classification method.
[2023-04-11 18:22:46] INFO: Completed 11 genomes in 0.00 seconds (19,649.64 genomes/second).
[2023-04-11 18:22:46] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:22:54] INFO: Completed 128 comparisons in 7.30 seconds (17.53 comparisons/second).
[2023-04-11 18:22:54] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:22:54] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:22:54] TASK: Placing 10 bacterial genomes into class-level reference tree 6 (4/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:29:27] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:29:29] TASK: Traversing tree to determine classification method.
[2023-04-11 18:29:29] INFO: Completed 10 genomes in 0.00 seconds (3,434.58 genomes/second).
[2023-04-11 18:29:29] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:29:34] INFO: Completed 334 comparisons in 5.49 seconds (60.82 comparisons/second).
[2023-04-11 18:29:35] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:29:35] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:29:35] TASK: Placing 10 bacterial genomes into class-level reference tree 5 (5/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:34:24] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:34:25] TASK: Traversing tree to determine classification method.
[2023-04-11 18:34:25] INFO: Completed 10 genomes in 0.01 seconds (673.56 genomes/second).
[2023-04-11 18:34:25] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:34:40] INFO: Completed 622 comparisons in 15.73 seconds (39.54 comparisons/second).
[2023-04-11 18:34:41] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:34:41] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-11 18:34:41] TASK: Placing 4 bacterial genomes into class-level reference tree 4 (6/6) with pplacer using 18 CPUs (be patient).
[2023-04-11 18:41:26] INFO: Calculating RED values based on reference tree.
[2023-04-11 18:41:27] TASK: Traversing tree to determine classification method.
[2023-04-11 18:41:27] INFO: Completed 4 genomes in 0.00 seconds (1,607.78 genomes/second).
[2023-04-11 18:41:27] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-11 18:41:31] INFO: Completed 432 comparisons in 3.60 seconds (119.84 comparisons/second).
[2023-04-11 18:41:31] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-04-11 18:41:31] WARNING: 27 of 250 genomes have a warning (see summary file).
[2023-04-11 18:41:31] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-04-11 18:41:31] INFO: Done.
[2023-04-11 18:41:31] INFO: Removing intermediate files.
[2023-04-11 18:41:31] INFO: Intermediate files removed.
[2023-04-11 18:41:31] INFO: Done.

@pchaumeil
Collaborator

Hi,
Thanks for the log file.
I can't find any error explaining why these archaeal genomes are not reported, and I can't reproduce the error here with my set of archaea. There seems to be potentially another problem with this set of genomes:

When the ANI screen is used, the FastANI/pplacer step should always report 0 genomes, but it does not in this log file:

[2023-04-11 18:11:28] INFO: 2 genome(s) have been classified using FastANI and pplacer.
...
[2023-04-11 18:18:48] INFO: 1 genome(s) have been classified using FastANI and pplacer.
...
[2023-04-11 18:29:35] INFO: 1 genome(s) have been classified using FastANI and pplacer.
....

Would you mind sending the genomes you are trying to analyse?
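To spot such lines quickly in a longer gtdbtk.log, something like the following sketch could help. It only relies on the message wording visible in the log above; the function name is hypothetical.

```python
import re

# Matches the per-tree message exactly as it appears in the log excerpts above.
PATTERN = re.compile(
    r"INFO: (\d+) genome\(s\) have been classified using FastANI and pplacer\."
)


def tree_ani_counts(log_lines):
    """Return the per-tree counts of genomes classified using FastANI and pplacer."""
    counts = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_tree_ani.py gtdbtk.log
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            per_tree = tree_ani_counts(handle)
        print(per_tree, "total:", sum(per_tree))
```

Any non-zero entry points to genomes that were not caught by the ANI pre-screening step.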

@avw-adifranco
Author

Do you need the full set, or mainly a mix of the archaeal and bacterial genomes assigned by the ani_screen and pplacer?

@pchaumeil
Collaborator

Mainly a mix of bacteria and archaea classified with FastANI/pplacer, plus the archaeal genomes missing from the summary file.

Thanks

@avw-adifranco
Author

Hi,

Here is a link to a tar.gz archive with 10 MAGs: 3 archaeal, 4 bacterial species from the ani_screen, and 3 bacterial species identified by pplacer.

https://we.tl/t-a6WzX7GcoG

@avw-adifranco
Author

Hi,

I've used the set of genomes I sent to redo the classify analysis and got the same issue with the archaeal genomes.

Here is the structure of my output directory:

drwxrwxr-x 2 adf adf    5 avril 27 15:20 align
drwxrwxr-x 3 adf adf    8 avril 27 15:20 classify
lrwxrwxrwx 1 adf adf   34 avril 27 15:20 gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
-rw-rw-r-- 1 adf adf 4,5K avril 27 15:20 gtdbtk.json
-rw-rw-r-- 1 adf adf 5,9K avril 27 15:20 gtdbtk.log
-rw-rw-r-- 1 adf adf    0 avril 27 15:09 gtdbtk.warnings.log
drwxrwxr-x 2 adf adf    6 avril 27 15:20 identify

and the corresponding gtdbtk.log:

[2023-04-27 15:09:57] INFO: GTDB-Tk v2.2.6
[2023-04-27 15:09:57] INFO: gtdbtk classify_wf --mash_db /home/adf/DB/GTDB/gtdb-tk-r207v2.msh --genome_dir ./bins4debug/ -x fasta --out_dir gtdbtk2_classify_debug --cpus 18 --pplacer_cpus 18 --tmpdir ./tmp --scratch_dir ./pplacer
[2023-04-27 15:09:57] INFO: Using GTDB-Tk reference data version r207: /home/adf/DB/GTDB/release207_v2
[2023-04-27 15:09:57] INFO: Loading reference genomes.
[2023-04-27 15:09:57] INFO: Using Mash version 2.3
[2023-04-27 15:09:57] INFO: Creating Mash sketch file: gtdbtk2_classify_debug/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-04-27 15:09:57] INFO: Completed 10 genomes in 0.14 seconds (70.15 genomes/second).
[2023-04-27 15:09:57] INFO: Loading data from existing Mash sketch file: /home/adf/DB/GTDB/gtdb-tk-r207v2.msh
[2023-04-27 15:09:59] INFO: Calculating Mash distances.
[2023-04-27 15:10:03] INFO: Calculating ANI with FastANI v1.32.
[2023-04-27 15:10:04] INFO: Completed 20 comparisons in 0.95 seconds (20.99 comparisons/second).
[2023-04-27 15:10:04] INFO: Summary of results saved to: gtdbtk2_classify_debug/classify/ani_screen/gtdbtk.ar53.ani_summary.tsv
[2023-04-27 15:10:04] INFO: Summary of results saved to: gtdbtk2_classify_debug/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-04-27 15:10:04] INFO: 7 genome(s) have been classified using the ANI pre-screening step.
[2023-04-27 15:10:04] INFO: Done.
[2023-04-27 15:10:04] INFO: Identifying markers in 3 genomes with 18 threads.
[2023-04-27 15:10:04] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-04-27 15:10:08] INFO: Completed 3 genomes in 3.64 seconds (1.21 seconds/genome).
[2023-04-27 15:10:08] TASK: Identifying TIGRFAM protein families.
[2023-04-27 15:10:10] INFO: Completed 3 genomes in 2.19 seconds (1.37 genomes/second).
[2023-04-27 15:10:10] TASK: Identifying Pfam protein families.
[2023-04-27 15:10:10] INFO: Completed 3 genomes in 0.13 seconds (22.63 genomes/second).
[2023-04-27 15:10:10] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).
[2023-04-27 15:10:10] TASK: Summarising identified marker genes.
[2023-04-27 15:10:10] INFO: Completed 3 genomes in 0.02 seconds (141.81 genomes/second).
[2023-04-27 15:10:10] INFO: Done.
[2023-04-27 15:10:10] INFO: Aligning markers in 3 genomes with 18 CPUs.
[2023-04-27 15:10:11] INFO: Processing 3 genomes identified as bacterial.
[2023-04-27 15:10:15] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2023-04-27 15:10:15] TASK: Generating concatenated alignment for each marker.
[2023-04-27 15:10:16] INFO: Completed 3 genomes in 0.07 seconds (44.82 genomes/second).
[2023-04-27 15:10:16] TASK: Aligning 114 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-04-27 15:10:18] INFO: Completed 114 markers in 1.36 seconds (83.68 markers/second).
[2023-04-27 15:10:18] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-04-27 15:11:26] INFO: Completed 62,294 sequences in 1.13 minutes (55,181.42 sequences/minute).
[2023-04-27 15:11:26] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2023-04-27 15:11:26] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-04-27 15:11:26] INFO: Creating concatenated alignment for 62,294 bacterial GTDB and user genomes.
[2023-04-27 15:11:41] INFO: Creating concatenated alignment for 3 bacterial user genomes.
[2023-04-27 15:11:41] INFO: Done.
[2023-04-27 15:11:41] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-27 15:11:41] TASK: Placing 3 bacterial genomes into backbone reference tree with pplacer using 18 CPUs (be patient).
[2023-04-27 15:11:41] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-04-27 15:13:52] INFO: Calculating RED values based on reference tree.
[2023-04-27 15:13:52] INFO: 3 out of 3 have an class assignments. Those genomes will be reclassified.
[2023-04-27 15:13:52] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-27 15:13:52] TASK: Placing 2 bacterial genomes into class-level reference tree 2 (1/2) with pplacer using 18 CPUs (be patient).
[2023-04-27 15:17:04] INFO: Calculating RED values based on reference tree.
[2023-04-27 15:17:05] TASK: Traversing tree to determine classification method.
[2023-04-27 15:17:05] INFO: Completed 2 genomes in 0.00 seconds (16,644.06 genomes/second).
[2023-04-27 15:17:05] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-27 15:17:05] INFO: Completed 16 comparisons in 0.47 seconds (34.09 comparisons/second).
[2023-04-27 15:17:05] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-04-27 15:17:05] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-04-27 15:17:05] TASK: Placing 1 bacterial genomes into class-level reference tree 7 (2/2) with pplacer using 18 CPUs (be patient).
[2023-04-27 15:20:35] INFO: Calculating RED values based on reference tree.
[2023-04-27 15:20:36] TASK: Traversing tree to determine classification method.
[2023-04-27 15:20:36] INFO: Completed 1 genome in 0.00 seconds (13,189.64 genomes/second).
[2023-04-27 15:20:36] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-04-27 15:20:37] INFO: Completed 6 comparisons in 0.54 seconds (11.05 comparisons/second).
[2023-04-27 15:20:37] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-04-27 15:20:37] WARNING: 1 of 7 genome has a warning (see summary file).
[2023-04-27 15:20:37] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-04-27 15:20:37] INFO: Done.
[2023-04-27 15:20:37] INFO: Removing intermediate files.
[2023-04-27 15:20:37] INFO: Intermediate files removed.
[2023-04-27 15:20:37] INFO: Done.

This should help simplify the investigation.

Did you manage to reproduce the issue?

@pchaumeil
Collaborator

Hi, I think I have found the issue, but I need to run extra checks.
I am hoping to have an update next week.

@pchaumeil
Collaborator

Hi,
I am able to generate the same log as you, and I have implemented a patch to solve this.
However, in your last log, none of the 7 genomes were classified with 'FastANI and pplacer'.
For each tree, the log file returns the following line:
0 genome(s) have been classified using FastANI and pplacer.

In your first log file you had the following lines:

[2023-04-11 18:11:28] INFO: 2 genome(s) have been classified using FastANI and pplacer.
...
[2023-04-11 18:18:48] INFO: 1 genome(s) have been classified using FastANI and pplacer.
...
[2023-04-11 18:29:35] INFO: 1 genome(s) have been classified using FastANI and pplacer.
....

I am interested to know what these 4 genomes are and why they were not classified in the first ANI step. Unfortunately, they don't seem to be among the 10 genomes provided.
Could you please let me know which genomes these are and whether they are in the package I downloaded?
They should have the text "taxonomic classification defined by topology and ANI" in the classification_method column.

Thank you
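A rough sketch of how such rows could be pulled out of a summary file, assuming the summary is tab-separated with user_genome and classification_method columns. The helper name is hypothetical.

```python
import csv

TOPOLOGY_AND_ANI = "taxonomic classification defined by topology and ANI"


def genomes_with_method(lines, method=TOPOLOGY_AND_ANI):
    """Return user_genome values whose classification_method contains `method`.

    `lines` can be an open file handle or any iterable of TSV lines.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row["user_genome"]
        for row in reader
        if method in (row.get("classification_method") or "")
    ]


# Example call on the bacterial summary from this thread:
# with open("gtdbtk2_classify/classify/gtdbtk.bac120.summary.tsv", newline="") as handle:
#     print(genomes_with_method(handle))
```

The match is a substring test, so it still works if the column carries extra text around the phrase.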

@avw-adifranco
Author

Hi Pierre,

Thanks for the patch.

I am checking for those 4 genomes, but 68 of the 269 input genomes have the "taxonomic classification defined by topology and ANI" classification_method.
I guess those 4 are bacterial genomes, but I have not found a way to pinpoint them yet.
Do you have any idea how I could narrow my search down to those? Maybe by using the class-level reference tree subdivision?

I'll send those to you once I've figured them out.

@avw-adifranco
Author

Hi,

I've managed to find the genomes by looking into classify/gtdbtk.bac120.tree.mapping.tsv

user_genome	is_ani_classification	class_tree_mapped	classification_rule
BBP7947_metadecBins.3	True	6	Rule 7
BBP8560_metadecBins.86	True	5	Rule 7
BBP8561_metadecBins.63	True	2	Rule 7
BBP8562_metadecBins.40	True	3	Rule 7
BBP8562_metadecBins.60	True	2	Rule 7
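For reference, a small sketch of how these rows can be extracted from the mapping file, using only the column names shown in the header above. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def ani_classified_by_tree(lines):
    """Group genomes flagged is_ani_classification=True by class-level tree."""
    by_tree = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["is_ani_classification"].strip() == "True":
            by_tree[row["class_tree_mapped"]].append(row["user_genome"])
    return dict(by_tree)


# Example call on the mapping file mentioned above:
# with open("classify/gtdbtk.bac120.tree.mapping.tsv", newline="") as handle:
#     for tree, genomes in ani_classified_by_tree(handle).items():
#         print(tree, genomes)
```

Grouping by class_tree_mapped makes it easy to compare the result with the per-tree messages in gtdbtk.log.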

I've rerun GTDB-Tk on them and it still behaves the same way:

[2023-05-05 16:01:39] INFO: GTDB-Tk v2.2.6
[2023-05-05 16:01:39] INFO: gtdbtk classify_wf --mash_db /home/adf/DB/GTDB/gtdb-tk-r207v2.msh --genome_dir ./bins4debug2/ -x fasta --out_dir gtdbtk2_classify_debug2 --cpus 18 --pplacer_cpus 18 --tmpdir ./tmp --scratch_dir ./pplacer
[2023-05-05 16:01:39] INFO: Using GTDB-Tk reference data version r207: /home/adf/DB/GTDB/release207_v2
[2023-05-05 16:01:39] INFO: Loading reference genomes.
[2023-05-05 16:01:39] INFO: Using Mash version 2.3
[2023-05-05 16:01:39] INFO: Creating Mash sketch file: gtdbtk2_classify_debug2/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-05-05 16:01:39] INFO: Completed 5 genomes in 0.06 seconds (85.51 genomes/second).
[2023-05-05 16:01:39] INFO: Loading data from existing Mash sketch file: /home/adf/DB/GTDB/gtdb-tk-r207v2.msh
[2023-05-05 16:01:41] INFO: Calculating Mash distances.
[2023-05-05 16:01:44] INFO: Calculating ANI with FastANI v1.32.
[2023-05-05 16:01:44] INFO: 0 genome(s) have been classified using the ANI pre-screening step.
[2023-05-05 16:01:44] INFO: Done.
[2023-05-05 16:01:44] INFO: Identifying markers in 5 genomes with 18 threads.
[2023-05-05 16:01:44] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-05-05 16:01:46] INFO: Completed 5 genomes in 1.68 seconds (2.98 genomes/second).
[2023-05-05 16:01:46] TASK: Identifying TIGRFAM protein families.
[2023-05-05 16:01:46] INFO: Completed 5 genomes in 0.90 seconds (5.56 genomes/second).
[2023-05-05 16:01:46] TASK: Identifying Pfam protein families.
[2023-05-05 16:01:47] INFO: Completed 5 genomes in 0.07 seconds (74.92 genomes/second).
[2023-05-05 16:01:47] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).
[2023-05-05 16:01:47] TASK: Summarising identified marker genes.
[2023-05-05 16:01:47] INFO: Completed 5 genomes in 0.01 seconds (449.76 genomes/second).
[2023-05-05 16:01:47] INFO: Done.
[2023-05-05 16:01:47] INFO: Aligning markers in 5 genomes with 18 CPUs.
[2023-05-05 16:01:47] INFO: Processing 5 genomes identified as bacterial.
[2023-05-05 16:01:51] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2023-05-05 16:01:51] TASK: Generating concatenated alignment for each marker.
[2023-05-05 16:01:52] INFO: Completed 5 genomes in 0.00 seconds (1,524.98 genomes/second).
[2023-05-05 16:01:52] TASK: Aligning 80 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-05-05 16:01:53] INFO: Completed 80 markers in 0.80 seconds (99.68 markers/second).
[2023-05-05 16:01:53] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-05-05 16:03:01] INFO: Completed 62,296 sequences in 1.13 minutes (55,275.13 sequences/minute).
[2023-05-05 16:03:01] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2023-05-05 16:03:01] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-05-05 16:03:01] INFO: Creating concatenated alignment for 62,296 bacterial GTDB and user genomes.
[2023-05-05 16:03:15] INFO: Creating concatenated alignment for 5 bacterial user genomes.
[2023-05-05 16:03:15] INFO: Done.
[2023-05-05 16:03:16] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-05-05 16:03:16] TASK: Placing 5 bacterial genomes into backbone reference tree with pplacer using 18 CPUs (be patient).
[2023-05-05 16:03:16] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-05-05 16:04:49] INFO: Calculating RED values based on reference tree.
[2023-05-05 16:04:50] INFO: 5 out of 5 have an class assignments. Those genomes will be reclassified.
[2023-05-05 16:04:50] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-05-05 16:04:50] TASK: Placing 2 bacterial genomes into class-level reference tree 2 (1/4) with pplacer using 18 CPUs (be patient).
[2023-05-05 16:06:26] INFO: Calculating RED values based on reference tree.
[2023-05-05 16:06:27] TASK: Traversing tree to determine classification method.
[2023-05-05 16:06:27] INFO: Completed 2 genomes in 0.00 seconds (19,418.07 genomes/second).
[2023-05-05 16:06:27] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-05-05 16:06:28] INFO: Completed 12 comparisons in 0.38 seconds (31.31 comparisons/second).
[2023-05-05 16:06:28] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2023-05-05 16:06:28] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-05-05 16:06:28] TASK: Placing 1 bacterial genomes into class-level reference tree 6 (2/4) with pplacer using 18 CPUs (be patient).
[2023-05-05 16:07:44] INFO: Calculating RED values based on reference tree.
[2023-05-05 16:07:45] TASK: Traversing tree to determine classification method.
[2023-05-05 16:07:45] INFO: Completed 1 genome in 0.00 seconds (1,976.58 genomes/second).
[2023-05-05 16:07:45] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-05-05 16:07:46] INFO: Completed 108 comparisons in 1.49 seconds (72.27 comparisons/second).
[2023-05-05 16:07:46] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-05-05 16:07:47] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-05-05 16:07:47] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (3/4) with pplacer using 18 CPUs (be patient).
[2023-05-05 16:08:55] INFO: Calculating RED values based on reference tree.
[2023-05-05 16:08:56] TASK: Traversing tree to determine classification method.
[2023-05-05 16:08:56] INFO: Completed 1 genome in 0.00 seconds (776.15 genomes/second).
[2023-05-05 16:08:56] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-05-05 16:09:00] INFO: Completed 200 comparisons in 3.47 seconds (57.60 comparisons/second).
[2023-05-05 16:09:00] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-05-05 16:09:00] INFO: Using a scratch file for pplacer allocations. This decreases memory usage and performance.
[2023-05-05 16:09:00] TASK: Placing 1 bacterial genomes into class-level reference tree 3 (4/4) with pplacer using 18 CPUs (be patient).
[2023-05-05 16:10:54] INFO: Calculating RED values based on reference tree.
[2023-05-05 16:10:55] TASK: Traversing tree to determine classification method.
[2023-05-05 16:10:55] INFO: Completed 1 genome in 0.00 seconds (1,633.93 genomes/second).
[2023-05-05 16:10:55] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-05-05 16:10:58] INFO: Completed 120 comparisons in 2.76 seconds (43.41 comparisons/second).
[2023-05-05 16:10:58] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2023-05-05 16:10:58] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-05-05 16:10:58] INFO: Done.
[2023-05-05 16:10:58] INFO: Removing intermediate files.
[2023-05-05 16:10:58] INFO: Intermediate files removed.
[2023-05-05 16:10:58] INFO: Done.

Here are the MAGs producing the unexpected behavior:
https://we.tl/t-Lm2Ss8xgfy

@pchaumeil
Collaborator

pchaumeil commented May 10, 2023

Hello,
we have released GTDB-Tk v2.3.
This version should fix the problem of missing genomes.
We have also increased the maximum Mash distance used for the ANI step. This will allow all genomes to be classified in the first ANI step instead of being classified once placed in the tree.

@pchaumeil pchaumeil reopened this May 10, 2023