Alignments complete with only x's #101

DanielleMStevens · 2025-01-24T18:50:11Z

Hi! Love this tool and how easy it is to use. I've previously used GToTree to make large phylogenies of 1000s of genomes. As long as I had the compute capacity, I've never had any issues. Recently I wanted to rerun some old analysis with a larger genome dataset (about 100,000 genomes). I tried to run GToTree with the genomes (all from NCBI RefSeq) on our HPC (which should be more than enough) and it completes in a reasonable time, 36-48 hours. However, both times I tried, the individual alignments appear empty and the concatenated alignments are just a bunch of XXXs, despite Genbank_genomes_summary_info.tsv providing info on gene hits and filtering.

GToTree -g filenames.txt -H Bacteria -n 24 -j 10 -k -X -G 0.2 -N -o GToTree_output

My best guess is maybe a small number of genomes are trash (bad cds or something) and causing the issue. But figured it would be good to check and see if you have any insight.

Thanks!!

AstrobioMike · 2025-01-24T20:50:06Z

Hey there, Danielle!

Thanks so much for the kind words about GToTree :)

And holy smokes! 100,000 genomes!?! I'm so sorry it's running that whole time and then silently breaking/failing, that's super-annoying :/

I unfortunately don't have any useful insight at the ready.

I too suspect there are some problematic genomes in there causing some kind of problem, and that it would happen with fewer genomes if those problematic ones were included. But with GToTree not giving you any useful info at this point, finding them is going to be a pain.

It's also possible i have a command somewhere that's breaking things due to not being able to grab everything because it's just too many files for a normal unix command or something. I've tried to purge all of those in the past, but maybe i missed one.
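For readers unfamiliar with the failure mode mentioned above: when a shell glob expands to more paths than the kernel's argument-size limit allows, the command fails with "Argument list too long". The sketch below shows one generic way to avoid that by splitting a long path list into size-bounded batches. It is not GToTree's code; the function name `batch_paths` and the 100,000-byte default are assumptions chosen for illustration.

```python
import os
from typing import Iterable, Iterator, List


def batch_paths(paths: Iterable[str], max_bytes: int = 100_000) -> Iterator[List[str]]:
    """Yield lists of paths whose combined encoded length stays under max_bytes.

    Hypothetical helper: max_bytes is an illustrative budget, not the real
    system limit (which can be queried with os.sysconf("SC_ARG_MAX") on Unix).
    """
    batch: List[str] = []
    size = 0
    for path in paths:
        cost = len(os.fsencode(path)) + 1  # +1 for the separator between arguments
        if batch and size + cost > max_bytes:
            # Adding this path would exceed the budget, so emit the current batch first.
            yield batch
            batch, size = [], 0
        batch.append(path)
        size += cost
    if batch:
        yield batch
```

Each batch could then be passed to a separate command invocation, so no single call has to carry all 100,000 filenames at once.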

Either way, I'm not sure how to efficiently start looking into this other than trying some runs on like 100,000 genomes too and hoping it does the same thing, then digging into things.

So it may take me a while to get back to you on this, but i will as soon as i can figure anything out.

Thanks for letting me know about it!

Oh and i doubt i'll see anything useful in there you missed, but can you send me the log file from one of the failed runs anyway to MikeLee<at>bmsis<dot>org? And also the filenames.txt if that's ok? (Assuming it has the GCF accessions listed in there somewhere, so i could use the exact same refseq genomes)

DanielleMStevens · 2025-01-24T21:05:56Z

Hi Mike!

Thanks for the quick response! Maybe 100,000 sounds too crazy but I've been able to do 5-10K easy no issue. And no need to apologize! I am honestly impressed it is able to process (even if it failed under the hood) in that timeline. I kinda expect everything takes longer.

Sounds good! Yeah, unfortunately just one of the battles when using other people's data at this scale. I'm going to first pull and check the NCBI logs for BUSCO/ANI values and see if anything there could explain it. Sure, I can send the log and filenames.txt (I'll just remove the institution-specific file path). If you want anything else, I am happy to provide. :)

Thanks again for your help!!
Dani

AstrobioMike · 2025-01-31T15:47:44Z

adding a note on this, it seems to be due to muscle running out of memory during the alignment step, and GToTree doing nothing at all to gracefully deal with that. Adding logic to hopefully catch, exit, and report when this happens
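As a rough illustration of the kind of check described here, the sketch below inspects an alignment FASTA file and reports whether it is missing, empty, or contains nothing but X and gap characters. It is only a sketch under stated assumptions: it is not the logic added in the referenced commit, and the function name `alignment_problem` is made up for this example.

```python
from pathlib import Path
from typing import Optional


def alignment_problem(fasta_path: str) -> Optional[str]:
    """Return a short description of what is wrong, or None if the file looks usable.

    Hypothetical helper, not part of GToTree.
    """
    path = Path(fasta_path)
    if not path.is_file() or path.stat().st_size == 0:
        return "missing or empty file"

    residues = set()
    n_seqs = 0
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_seqs += 1  # header line, one per sequence
            elif line:
                residues.update(line.upper())  # collect distinct characters seen

    if n_seqs == 0:
        return "no sequences"
    if residues <= {"X", "-"}:
        # Also true when headers exist but no sequence characters follow them.
        return "no residues other than X or gap characters"
    return None
```

Running a check like this on each per-gene alignment right after the aligner exits would let a pipeline stop early and report the failure, rather than carrying placeholder sequences through to the concatenated alignment.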

AstrobioMike · 2025-01-31T18:44:17Z

this particular situation should be addressed (2ad2d27) as of v1.8.9

Thank you for bringing this to my attention, @DanielleMStevens! I'll get back to you in email about other things :)

AstrobioMike pushed a commit that referenced this issue Jan 31, 2025

adding some logic to catch if muscle doesn't produce an alignment (#101)

2ad2d27

AstrobioMike closed this as completed Jan 31, 2025
