Alignments complete with only x's #101
Comments
Hey there, Danielle! Thanks so much for the kind words about GToTree :) And holy smokes! 100,000 genomes!?! I'm so sorry it's running that whole time and then silently failing, that's super annoying :/ I unfortunately don't have any useful insight at the ready. I too suspect there are some problematic genomes in there, and that the same thing would happen with fewer genomes if those problematic ones were included. But with GToTree not giving you any useful info at this point, finding them is going to be a pain. It's also possible I have a command somewhere that breaks because there are just too many files for a normal Unix command to handle. I've tried to purge all of those in the past, but maybe I missed one. Either way, I'm not sure how to efficiently start looking into this other than trying some runs on around 100,000 genomes too, hoping they do the same thing, and then digging in. So it may take me a while to get back to you on this, but I will as soon as I can figure anything out. Thanks for letting me know about it! Oh, and I doubt I'll see anything useful in there that you missed, but can you send me the log file from one of the failed runs anyway, to MikeLee<at>bmsis<dot>org? And also the filenames.txt, if that's ok? (Assuming it has the GCF accessions listed in there somewhere, so I could use the exact same RefSeq genomes.)
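To illustrate the "too many files for a normal Unix command" failure mode mentioned above: the operating system caps the total size of a command's arguments (ARG_MAX, which also counts environment variables), so passing 100,000 file paths to one command can fail. A minimal sketch of the usual workaround, splitting the paths into batches that each stay under a byte budget. The function name `batch_paths` and the budget are hypothetical, and this is not taken from GToTree:

```python
from typing import Iterable, Iterator, List


def batch_paths(paths: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group paths so each batch's combined argument size stays within max_bytes.

    A single path larger than max_bytes is still yielded, alone in its own batch.
    """
    batch: List[str] = []
    size = 0
    for path in paths:
        # Each argument costs its encoded length plus one terminating NUL byte.
        cost = len(path.encode()) + 1
        if batch and size + cost > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(path)
        size += cost
    if batch:
        yield batch
```

On Unix, `os.sysconf("SC_ARG_MAX")` reports the system limit; a budget well below that value leaves room for the environment and the command name itself.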
Hi Mike! Thanks for the quick response! Maybe 100,000 sounds too crazy, but I've been able to do 5-10K easily with no issues. And no need to apologize! I am honestly impressed it was able to process everything (even if it failed under the hood) in that timeline. I kind of expected everything to take longer. Sounds good! Yeah, unfortunately that's just one of the battles when using other people's data at this scale. I'm going to first pull and check the NCBI logs for BUSCO/ANI values and see if anything there could explain it. Sure, I can send the log and filenames.txt (I'll just remove the institution-specific file path). If you want anything else, I am happy to provide it. :) Thanks again for your help!!
Adding a note on this: it seems to be due to muscle running out of memory during the alignment step, and GToTree doing nothing at all to gracefully deal with that. Adding logic to hopefully catch, exit, and report when this happens.
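For readers wondering what "catch, exit, and report" can look like in general, here is a minimal Python sketch. It is not GToTree's actual code; the names `run_aligner` and `AlignmentError` are hypothetical, and the command is whatever aligner invocation the caller supplies. It treats three outcomes as failures: the process being killed by a signal (which is how an out-of-memory kill typically shows up on Linux), a non-zero exit status, and a missing or empty output file.

```python
import os
import subprocess
import sys
from typing import List


class AlignmentError(RuntimeError):
    """Raised when the aligner fails or produces no usable output."""


def run_aligner(cmd: List[str], output_path: str) -> None:
    """Run an aligner command and raise AlignmentError instead of failing silently."""
    result = subprocess.run(cmd, capture_output=True, text=True)

    # On POSIX, subprocess reports death-by-signal as a negative return code.
    if result.returncode < 0:
        raise AlignmentError(
            f"aligner was killed by signal {-result.returncode} "
            "(possibly out of memory)"
        )
    if result.returncode != 0:
        raise AlignmentError(
            f"aligner exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    # A zero exit status alone is not enough: also require non-empty output.
    if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
        raise AlignmentError(f"aligner produced no output at {output_path}")
```

A caller would wrap this in `try`/`except AlignmentError`, print the message to the log, and stop with `sys.exit(1)` rather than carrying on with empty alignments.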
This particular situation should be addressed (2ad2d27) as of v1.8.9. Thank you for bringing this to my attention, @DanielleMStevens! I'll get back to you by email about other things :)
Hi! Love this tool and how easy it is to use. I've previously used GToTree to make large phylogenies of 1000s of genomes. As long as I had the compute capacity, I've never had any issues. Recently I wanted to rerun some old analyses with a larger genome dataset (about 100,000 genomes). I tried to run GToTree with the genomes (all from NCBI RefSeq) on our HPC (which should be more than enough) and it completes in a reasonable time, 36-48 hours. However, both times I tried, the individual alignments appear empty and the concatenated alignments are just a bunch of XXXs, despite Genbank_genomes_summary_info.tsv providing info on gene hits and filtering.
My best guess is that a small number of genomes are trash (bad CDS or something) and are causing the issue. But I figured it would be good to check and see if you have any insight.
Thanks!!
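A quick way to confirm the symptom described above is to count how many sequences in an alignment are empty or made up entirely of X characters. The following is a minimal sketch, assuming FASTA-formatted alignment files; the function names are hypothetical and are not part of GToTree:

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def x_fraction(seq: str) -> float:
    """Fraction of a sequence that is 'X'; an empty sequence counts as 1.0."""
    if not seq:
        return 1.0
    return seq.upper().count("X") / len(seq)


def summarize_alignment(lines: Iterable[str]) -> Dict[str, int]:
    """Count the sequences, and how many of them are empty or entirely X."""
    total = 0
    flagged = 0
    for _, seq in read_fasta(lines):
        total += 1
        if x_fraction(seq) == 1.0:
            flagged += 1
    return {"sequences": total, "all_x_or_empty": flagged}
```

Because an open file is an iterable of lines, it can be passed in directly, for example `with open(path) as fh: summarize_alignment(fh)`. If the two counts are equal, the alignment holds no real residues at all.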