
multi-threading not working? #74

Closed
adamfreedman opened this issue Mar 11, 2017 · 7 comments

@adamfreedman

Both for my own projects and to do testing so as to provide guidelines for our cluster users, I've been running a handful of angsd analyses on a fairly big data set, allocating 16 threads (-P 16). Every time I've checked the running processes with htop, angsd is only using one core. I'm not sure why this would be the case.

@ANGSD (Owner) commented Mar 11, 2017 via email

@hmoral commented Mar 15, 2017

Hello, I'm pretty new to angsd.
I have the same problem when running the following command:
angsd -P 24 -b ALL.bamlist.OutlierHW -ref $REF -out Results/ALL.OutlierHW \
  -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
  -minMapQ 20 -minQ 20 -minInd 108 -setMinDepth 1 -setMaxDepth 80 -doCounts 1 \
  -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 \
  -SNP_pval 0.001 \
  -doGeno 32 -doPost 1

I never get more than one process running, so for 207 smallish bam files this takes 5 hours!
I'm running on a 64-core (AMD Opteron) machine with precise1-Ubuntu SMP x86_64 GNU/Linux.

Is this likely to be related to my system, or is it expected angsd behaviour? Are there other analyses that should run correctly multi-threaded, so I can test?

@claudiuskerth

Hi,

I encountered the same problem last year. The bottleneck is NOT just disk I/O, since HWE/MAF/SAF calculation can be sped up considerably with the following workaround:

  1. split your regions file into a few dozen smaller files with the Unix split command
  2. spawn a separate angsd process for each of these small regions files with GNU parallel
  3. combine the resulting output files

Example:

  1. split -l 500 keep.rf SPLIT_RF/

  2. ls SPLIT_RF/* | parallel -j 12 "angsd -rf {} -bam bamfile.list -ref Big_Data_ref.fa
    -out PCA/GlobalMAFprior/EryPar.{/} -only_proper_pairs 0 -sites keep.sites -minMapQ 5
    -baq 1 -doCounts 1 -GL 1 -domajorminor 1 -doMaf 1 -skipTriallelic 1
    -SNP_pval 1e-3 -doGeno 32 -doPost 1"

  3. combine the output files with either Unix cat (*geno and *maf files) or realSFS cat (*saf files, but see issue #60)
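
For the *.mafs.gz files in particular, note that each chunk carries its own header line, so a plain cat leaves duplicate headers inside the combined file. A minimal sketch of one way to handle this, assuming the output names from the example above and the default two-letter split suffixes (aa, ab, ...); the combined file name is just an example:

  # keep the header line from the first chunk only (assumed name: EryPar.aa.mafs.gz)
  zcat PCA/GlobalMAFprior/EryPar.aa.mafs.gz | head -n 1 | gzip > EryPar.combined.mafs.gz
  # append the header-less body of every chunk; concatenated gzip streams are still valid gzip
  for f in PCA/GlobalMAFprior/EryPar.*.mafs.gz; do zcat "$f" | tail -n +2; done | gzip >> EryPar.combined.mafs.gz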

claudius

@ANGSD (Owner) commented May 14, 2017 via email

@claudiuskerth

Hi ANGSD,

I completely agree that a huge speedup can be achieved by parallelizing the file reading itself, and this is something we are planning to do at some point.

Sure. The GNU parallel hack should help to get by in the meantime.

Since I don't know how to submit a feature request on GitHub: phasing and LD estimation from genotype likelihoods would be awesome.

claudius

@ANGSD (Owner) commented Jun 15, 2017

I'm closing this issue, feel free to reopen if needed.

ANGSD closed this as completed Jun 15, 2017
@SethMusker

For those looking for an equivalent to @claudiuskerth's solution for beagle likelihoods, the following worked for me (e.g. using 40 threads):

mkdir SPLIT_RF
split -a 3 -l 500 regions.regions SPLIT_RF/
mkdir temp_likes
ls SPLIT_RF/* | parallel -j 40 "angsd -GL 1 -rf {} -out temp_likes/{/} -P 1 -uniqueOnly 1 -minMapQ 20 -only_proper_pairs 1 -baq 1 -ref reference.fa -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 2e-6 -minInd 45 -bam bamlist.txt"
zgrep -m1 '^marker' temp_likes/aaa.beagle.gz | gzip > likes.beagle.gz
ls temp_likes/*.beagle.gz | parallel -j 40 --keep-order "zgrep -v '^marker' {} | gzip" >> likes.beagle.gz
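
The first zgrep keeps the single header line (every per-chunk beagle file starts with the same 'marker' header) from the aaa chunk, and the second strips that header from every chunk before appending, with --keep-order preserving the region order. As a quick sanity check of the combined file, a minimal sketch assuming bamlist.txt lists one bam per line (a beagle row has 3 site columns plus 3 likelihood columns per individual):

N=$(wc -l < bamlist.txt)
# count sites and flag any row whose column count is not 3 + 3*N
zcat likes.beagle.gz | awk -v n="$N" 'NF != 3 + 3*n { bad++ } END { print NR-1, "sites,", bad+0, "rows with unexpected column count" }'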
