
multi-threading not working? #74

Closed
adamfreedman opened this issue Mar 11, 2017 · 7 comments

@adamfreedman

Both for my own projects and to do testing so as to provide guidelines for our cluster users, I've been running a handful of angsd analyses on a fairly big data set, allocating 16 threads (-P 16). Every time I've checked the running processes with htop, angsd is only using one core. I'm not sure why this would be the case.

@ANGSD (Owner) commented Mar 11, 2017 via email

@hmoral commented Mar 15, 2017

Hello, I'm pretty new to angsd.
I have the same problem when running the following command:
angsd -P 24 -b ALL.bamlist.OutlierHW -ref $REF -out Results/ALL.OutlierHW \
  -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
  -minMapQ 20 -minQ 20 -minInd 108 -setMinDepth 1 -setMaxDepth 80 -doCounts 1 \
  -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \
  -SNP_pval 0.001 \
  -doGeno 32 -doPost 1

I never get more than one process running, so for 207 smallish bam files this takes 5 hours!
I'm running on a 64-core (AMD Opteron) machine with precise1-Ubuntu SMP x86_64 GNU/Linux.

Is this likely to be related to my system, or is it expected angsd behaviour? Are there other analyses that should run correctly multi-threaded, so I can test?

@claudiuskerth

Hi,

I encountered the same problem last year. The bottleneck is NOT just disk I/O, since HWE/MAF/SAF calculation can be sped up considerably with the following workaround:

  1. split your regions file into a few dozen smaller files with the Unix split command
  2. spawn a separate angsd process for each of these small regions files with GNU parallel
  3. combine the resulting output files

Example:

  1. split -l 500 keep.rf SPLIT_RF/

  2. ls SPLIT_RF/* | parallel -j 12 "angsd -rf {} -bam bamfile.list -ref Big_Data_ref.fa
    -out PCA/GlobalMAFprior/EryPar.{/} -only_proper_pairs 0 -sites keep.sites -minMapQ 5
    -baq 1 -doCounts 1 -GL 1 -domajorminor 1 -doMaf 1 -skipTriallelic 1
    -SNP_pval 1e-3 -doGeno 32 -doPost 1"

  3. combine the output files with either Unix cat (*geno and *maf files) or realSFS cat (*saf files, but see issue #60)
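
For the *.mafs.gz files in particular, note that each chunk carries its own header line, so a plain cat leaves duplicate headers inside the combined file. A minimal sketch of one way to handle this, assuming the output names from the example above and the default two-letter split suffixes (aa, ab, ...); the combined file name is just an example:

  # keep the header line from the first chunk only (assumed name: EryPar.aa.mafs.gz)
  zcat PCA/GlobalMAFprior/EryPar.aa.mafs.gz | head -n 1 | gzip > EryPar.combined.mafs.gz
  # append the header-less body of every chunk; concatenated gzip streams are still valid gzip
  for f in PCA/GlobalMAFprior/EryPar.*.mafs.gz; do zcat "$f" | tail -n +2; done | gzip >> EryPar.combined.mafs.gz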

claudius

@ANGSD (Owner) commented May 14, 2017 via email

@claudiuskerth

Hi ANGSD,

I completely agree that a huge speedup can be achieved by parallelizing the file reading itself, and this is something we are planning to do at some point.

Sure. The GNU parallel hack should help to get by in the meantime.

Since I don't know how to submit a feature request on GitHub: phasing and LD estimation from genotype likelihoods would be awesome.

claudius

@ANGSD (Owner) commented Jun 15, 2017

I'm closing this issue, feel free to reopen if needed.

ANGSD closed this as completed Jun 15, 2017
@SethMusker

For those looking for an equivalent to @claudiuskerth's solution for beagle likelihoods, the following worked for me (e.g. using 40 threads):

mkdir SPLIT_RF
split -a 3 -l 500 regions.regions SPLIT_RF/
mkdir temp_likes
ls SPLIT_RF/* | parallel -j 40 "angsd -GL 1 -rf {} -out temp_likes/{/} -P 1 -uniqueOnly 1 -minMapQ 20 -only_proper_pairs 1 -baq 1 -ref reference.fa -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 2e-6 -minInd 45 -bam bamlist.txt"
zgrep -m1 '^marker' temp_likes/aaa.beagle.gz | gzip > likes.beagle.gz
ls temp_likes/*.beagle.gz | parallel -j 40 --keep-order "zgrep -v '^marker' {} | gzip" >> likes.beagle.gz
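
The first zgrep keeps the single header line (every per-chunk beagle file starts with the same 'marker' header) from the aaa chunk, and the second strips that header from every chunk before appending, with --keep-order preserving the region order. As a quick sanity check of the combined file, a minimal sketch assuming bamlist.txt lists one bam per line (a beagle row has 3 site columns plus 3 likelihood columns per individual):

N=$(wc -l < bamlist.txt)
# count sites and flag any row whose column count is not 3 + 3*N
zcat likes.beagle.gz | awk -v n="$N" 'NF != 3 + 3*n { bad++ } END { print NR-1, "sites,", bad+0, "rows with unexpected column count" }'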
