
Parallel deduplication #203

Open
1 of 6 tasks
TomSmithCGAT opened this issue Oct 19, 2017 · 5 comments

TomSmithCGAT commented Oct 19, 2017

We've discussed parallel de-duplication a few times to avoid occasional run time issues (e.g. #31) and previously tried implementing parallelisation of the edit_distance calculation (#69). Although we have made multiple modifications to reduce run time bit by bit, this issue still crops up (#173). I therefore suggest we revisit parallelisation.

I've started trying to implement parallelisation of the read bundle deduplication step but ran into the same issue with pickling pysam.AlignedSegment objects described by @IanSudbery in #69. I therefore suggest the following order of tasks:

To do

- [x] 1. Implement contig-based parallelisation
- [ ] 2. Benchmark (runtime + memory)
- [ ] 3. Design a bundle-based approach
- [ ] 4. Implement 3.
- [ ] 5. Benchmark 4.
- [ ] 6. Merge and release!

I've implemented the simplest form of parallelisation: one process per contig. This has the major advantage that only the infile name and the contig need to be pickled: each process makes a separate call to pysam.fetch() and then writes the deduplicated reads to the outfile, using a multiprocessing.Lock to ensure only one process is writing at a time.

The use of multiple processes is available on {TS}-para via --num-processes=. There are still a few issues to resolve, including logging the number of parsed and output reads, and outputting stats, which has been intentionally broken for the time being. On the stats front, arguably we shouldn't allow stats together with parallel contig processing, since the null distribution of UMI edit distances will be poorly estimated for very short contigs, and will be nonsense when run with --per-gene, since we would calculate stats for each gene separately! We also need to deal with cases where the options --per-gene --per-contig --gene-transcript-map are combined, since this currently initiates a specialised fetcher so that all transcripts from the same gene are processed at the same time. This should be straightforward, but I wanted to mention it here so I don't forget!

So far I've just quickly checked runtime on a couple of inputs:

  1. Input with a single chromosome but multiple contigs according to the Samfile header: time taken (34 s) is the same with 1-8 processes, i.e. no clear increase in run time from unnecessary processes and calls to pysam.fetch().
  2. Input with multiple chromosomes: below is a table of run times for 1-8 processes (I've got 8 CPUs on my desktop). For processes = 1, multiprocessing isn't used at all, so this reflects current performance.

Running command (note: not sorting the output, and outputting SAM):
umi_tools dedup --out-sam --random-seed=123456789 --method=adjacency --stdin=hgmm_100_hg_mm.bam.featureCounts_sorted.bam --per-gene --gene-tag=XT --processes=[NUM_PROCESSES] -S test_10.out --no-sort-output

Values below are approximate complete run times from the umi_tools log:

| # Processes | Runtime (s) |
|------------:|------------:|
| 1 | 131-139 |
| 2 | 112-118 |
| 3 | 75-84 |
| 4 | 62-68 |
| 8 | 54-65 |

So not linear by any means, but a ~2-fold decrease in run time with 4 processes seems pretty good to me. We could possibly get an even better reduction for larger inputs by writing out to temporary SAMs (avoiding the locking, which is probably what prevents any improvement from 4 to 8 processes?) and then concatenating them afterwards.

We could do with some more appropriate files for benchmarking (more reads, greater depth/more unique UMIs per position), then benchmark run time and memory. There's also an issue that the output when using multiprocessing is not identical across runs. I assume this relates to the random choice of representative reads, but I haven't formally checked this.
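One way to make the parallel output reproducible would be deterministic per-contig seeding, so the representative read chosen no longer depends on process scheduling. This is a sketch of the idea only, with hypothetical helper names; it is an assumption about how the single global --random-seed could be extended, not a description of what umi_tools currently does:

```python
import random
import zlib

def contig_rng(global_seed: int, contig: str) -> random.Random:
    # crc32 gives a stable integer from the contig name, so the per-contig
    # seed is the same in every run and in every worker process.
    return random.Random(global_seed + zlib.crc32(contig.encode()))

def pick_representative(reads, global_seed, contig):
    # Choice depends only on (seed, contig, reads), not on scheduling.
    return contig_rng(global_seed, contig).choice(reads)

same1 = pick_representative(["r1", "r2", "r3"], 123456789, "chr1")
same2 = pick_representative(["r1", "r2", "r3"], 123456789, "chr1")
```

With this, any worker processing chr1 makes the same choices regardless of how contigs are distributed over processes.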

@TomSmithCGAT

Note: it appears the writing out is not working yet, since the output SAM is truncated.

@TomSmithCGAT

For the time being, I've switched the contig-based parallelisation to output each contig to a separate outfile, to avoid the issues with concurrent writing. The run time with 4 processes on the above input is now ~75s.


Poshi commented Oct 15, 2018

Will this parallelization also include umi_tools group? That step looks time-dominant over the whole FastQ->BAM process.

@fanglingcloud

I used umi_tools (version 1.0.0) for scRNA-seq data. It took more than one hour to handle 58 million reads (just 20% of my real data). I want to use --processes to speed up dedup, but it returns:
dedup: error: no such option: --processes
Can someone help me?


seahyfs commented Feb 15, 2024

Hi there,
Just curious whether this issue has been addressed. I'm also finding that both umi_tools group and umi_tools dedup are very time-consuming in my pipeline. I'd be keen to test out parallelization to try to reduce these times. Thank you :)
