Inconsistent output across runs of all-versus-all ANI computation #67
Comments
Thanks for sharing this problem. I've noticed multiple issues that highlight it (see also #37 and #58); unfortunately, I've failed to reproduce it on our compute clusters. I'm willing to invest time in this problem, but I need help reproducing the behavior at my end for debugging. Are you able to provide more details (e.g., Mac/Linux, GCC version, input data files)? |
I got this issue with both the Conda version (I believe they use GCC 7.*) and a statically compiled version built on my personal computer (master branch, Ubuntu 16.04, GCC 7.5.0). I executed the runs on a cluster with SUSE Linux Enterprise Server 15. By the way, I hit a bug while compiling FastANI on my PC and submitted a PR fixing it: #68. I don't think I can share this specific dataset because it isn't mine, but I'll try to replicate the issue with my own genomes so I can send you the data. I can't promise that I'll be able to do that in the next few days, though. Just to illustrate the extent of the inconsistency: I executed the all-versus-all comparison eight times and each run had ~16 comparisons that were not found in any of the other ones. I also noticed that this greatly influenced the definition of species (using an algorithm similar to the one used by GTDB). |
Got it. Since one of the issues filed previously involved the use of SLURM, I'm curious whether you are using SLURM too? |
Is this a locally owned cluster? I'm wondering if you could arrange a temporary account for me (perhaps for a week)? |
Yes, I'm using SLURM. Unfortunately it is a big shared cluster and I have no control over it, otherwise I'd be happy to give you access to it. |
Thanks! I guess the bug might be related to SLURM. When you get a chance, can you send me your SLURM job script/commands and the job output log while running:
|
cc'ing @luke-dt |
#!/bin/bash
#SBATCH --job-name=fastani
#SBATCH --account=fnglanot
#SBATCH --qos=genepool
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=haswell
ls -1 -d $PWD/Genomes/*.fna > genome_list.txt
srun --cpus-per-task=64 --ntasks=1 /usr/bin/time fastANI --rl genome_list.txt --ql genome_list.txt -t 64 -o fastani_output.txt
printenv
Here's the log: slurm-31935719.txt |
I executed the same command twice on a different cluster that uses PBS. For some reason it took much longer for FastANI to finish, but the outputs were different anyway.
|
Here is the script that I run with sbatch:
#SBATCH --job-name=fastani
#SBATCH --mem=30G
#SBATCH --cpus-per-task=4
#SBATCH --output=slurm_out/fastani/z_fastani_%A.out
#SBATCH --error=slurm_out/fastani/z_fastani_%A.out
module load fastani/1.3.1a
basedir="$PWD"
outdir="${basedir}/d03_species_analysis/fastani"
fastANI --ql ${outdir}/genomepaths.txt \
--rl ${outdir}/genomepaths.txt \
-o ${outdir}/fastani_out.txt \
-t ${SLURM_CPUS_PER_TASK}
Log file for 4 threads (analysis worked) |
After executing with 4 cores I got consistent outputs. However there are some missing comparisons. The output I got from the execution with 4 cores has 3149 lines and a file that I built by aggregating multiple executions has 3271 lines. Here are ANI vs. % aligned plots for these two files: It seems that most of the missing comparisons are from pairs with high ANI and low % aligned. |
Thanks! I'm able to reproduce the inconsistent output at my end on a cluster with SLURM, which is good! I will reach out if I need more info. I replicated a single publicly available genome and did an all-to-all comparison among the copies. For a few pairs, I do see <100% ANI reported in an inconsistent manner. Please give me some time to investigate. |
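A reproduction along these lines could be set up roughly as follows. This is only a sketch: the function name, the `copy_N.fna` naming, and the directory layout are assumptions, and the only FastANI options used are the ones that already appear in the scripts in this thread (`--ql`, `--rl`, `-t`, `-o`).

```shell
# Sketch: replicate one genome N times and build a list for an all-vs-all run.
# The function name, "copy_N.fna" names, and directory layout are hypothetical.
make_replicates() {
    # $1: source genome FASTA, $2: number of copies, $3: output directory
    mkdir -p "$3"
    i=1
    while [ "$i" -le "$2" ]; do
        cp "$1" "$3/copy_$i.fna"
        i=$((i + 1))
    done
    # One path per line, as in the job script above.
    ls -1 -d "$3"/*.fna > "$3/genome_list.txt"
}

# Usage (hypothetical paths), running the same comparison twice:
#   make_replicates genome.fna 20 replicates
#   fastANI --ql replicates/genome_list.txt --rl replicates/genome_list.txt -t 64 -o run_1.txt
#   fastANI --ql replicates/genome_list.txt --rl replicates/genome_list.txt -t 64 -o run_2.txt
#   sort run_1.txt > run_1.sorted; sort run_2.txt > run_2.sorted
#   cmp run_1.sorted run_2.sorted
```

Since every copy is identical, any pair that is missing or reported with a different ANI between the two runs points to the inconsistency rather than to the data.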
You're welcome! For further context: to build the first figure I executed FastANI several times with 64 cores on a PBS cluster and aggregated the results into a single file. For lines in which the first and second genomes were the same but a different ANI was reported, I chose the one with the highest % aligned (which usually corresponded to the lowest ANI). |
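The aggregation rule described here can be sketched as below. It assumes that "% aligned" is computed as column 4 divided by column 5 of the FastANI output (mapped fragments over total query fragments), which may differ from the script that was actually used; the function and file names are hypothetical.

```shell
# Sketch: merge several FastANI outputs, keeping one line per query/reference
# pair: the line with the highest aligned fraction.
# Assumption: aligned fraction = column 4 / column 5 (tab-separated output).
aggregate_keep_most_aligned() {
    awk -F'\t' '
        $5 > 0 {
            key = $1 FS $2
            frac = $4 / $5
            # Keep the first line seen, then replace it only by a better one.
            if (!(key in best) || frac > best[key]) {
                best[key] = frac
                line[key] = $0
            }
        }
        END { for (key in line) print line[key] }
    ' "$@" | LC_ALL=C sort
}

# Usage (hypothetical file names):
#   aggregate_keep_most_aligned run_*.txt > aggregated.txt
```

Ties keep the first line encountered, so the order of the input files matters only when two runs report the same aligned fraction with different ANI values.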
Hi @apcamargo, @luke-dt, thanks again for your help! There was a bug in my code associated with file I/O. I've committed the fix to the master branch. When you get a chance, please run the code again and let me know if it also fixes the issue at your end. I will create a new FastANI version after I hear from you. |
Hi guys (@apcamargo , @luke-dt) |
Hi @cjain7! |
The SLURM cluster I have access to is in maintenance, so I executed fastANI in a PBS cluster with
The bug seems to be fixed! Thank you @cjain7! |
Good to know. Thanks! |
Hey @cjain7 Even though the results are now consistent across runs, I noticed that there are still many comparisons missing from the output. I know that fastANI won't report comparisons of genomes with low % of alignment, but some of the missing comparisons were present in previous runs. Is this behaviour expected? |
Yeah, I think (or at least I hope) that the output will be consistent from now on. The cases you mention are probably borderline cases that cleared the ~80% cutoff by a small margin because of the previous bug. |
The strange thing is that the number of genomes in the output (520) is less than the total number of genomes (522), meaning that there are two genomes that are not being compared with themselves (certainly more than 80%) |
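One way to find which genomes are absent is to compare the input list against the paths that appear in the output. The sketch below assumes the output reports paths exactly as they are written in the list file and that columns 1 and 2 hold the query and reference paths; the function name is hypothetical.

```shell
# Sketch: print genomes from the input list that never appear in the output.
# Assumption: FastANI output columns 1 and 2 contain the same path strings
# as the list file (one path per line).
genomes_missing_from_output() {
    # $1: genome list, $2: FastANI output
    awk -F'\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; seen[$2] = 1; next }  # output file
        !($0 in seen)                                              # list file
    ' "$2" "$1"
}

# Usage (file names taken from the script above):
#   genomes_missing_from_output genomepaths.txt fastani_out.txt
```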
Can you check if they have the same file names? |
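A quick way to run this check, assuming "same file names" means identical base names in different directories (the function name is hypothetical):

```shell
# Sketch: list base names that occur more than once in a genome list.
# Assumption: one path per line, with "/" as the directory separator.
duplicate_basenames() {
    # $1: genome list
    awk -F'/' '{ print $NF }' "$1" | LC_ALL=C sort | uniq -d
}

# Usage (file name taken from the script above):
#   duplicate_basenames genomepaths.txt
```

An empty result means every genome in the list has a distinct file name.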
Please create a new issue with more information (e.g., log files, input command etc.) if you would like me to look further. |
I was just preparing a bug report and I noticed that the bug was in the script I was using to process the output. Sorry for the trouble! |
Hello, I'm having the same inconsistency problem but I'm not running FastANI via SLURM. I'm running it on an Ubuntu machine and installed it using the compiled version from the master branch.
The output table has different ANI values for the same compared genomes. |
Hi @cjain7!
I'm using FastANI to compare a set of approximately 500 MAGs. To do that, I'm executing:
Across multiple runs I observed that the output varies significantly. For instance, in some cases a comparison of a genome with itself would (a) have a low aligned fraction (~40%), (b) have ~100% of the genome aligned, or (c) wouldn't even show in the output (presumably due to low coverage of the alignment). I've also seen different genomes with high ANI between them (~98%) sometimes appear in the output and sometimes not.
In all my 1 vs. 1 comparisons the output was consistent. The discrepant results appeared only when comparing two lists (in this case, the same list was used as both query and reference).
Here are the outputs of two independent runs:
dereplicated_mags_ani_raw_1.txt
dereplicated_mags_ani_raw_2.txt
EDIT: I performed a new test using the master branch. The results are still inconsistent and comparisons are missing from all the outputs I'm obtaining.