
Parallel deduplication #203

Open
1 of 6 tasks
TomSmithCGAT opened this issue Oct 19, 2017 · 5 comments

TomSmithCGAT commented Oct 19, 2017

We've discussed parallel de-duplication a few times to avoid occasional run time issues (e.g. #31) and previously tried implementing parallelisation of the edit_distance calculation (#69). Although we have made multiple modifications to reduce run time bit by bit, this issue still crops up (#173). I therefore suggest we revisit parallelisation.

I've started trying to implement parallelisation of the read bundle deduplication step but ran into the same issue with pickling pysam.AlignedSegment objects described by @IanSudbery in #69. I therefore suggest the following order of tasks:

To do

- [x] 1. Implement contig-based parallelisation
- [ ] 2. Benchmark (runtime + memory)
- [ ] 3. Design a bundle-based approach
- [ ] 4. Implement 3.
- [ ] 5. Benchmark 4.
- [ ] 6. Merge and release!

I've implemented the simplest form of parallelisation: one process per contig. This has the major advantage that only the infile name and the contig need to be pickled: each process makes a separate call to pysam.fetch() and then writes the deduplicated reads to the outfile, using a multiprocessing.Lock to ensure only one process is writing at a time.

The use of multiple processes is available on {TS}-para via --num-processes=. There are still a few issues to resolve, including logging the number of parsed and output reads, and outputting stats, which has been intentionally broken for the time being. On the stats front, arguably we shouldn't allow stats together with parallel contig processing, since the null distribution of UMI edit distances will be poorly estimated for very short contigs, and will be nonsense when run with --per-gene, since we would calculate stats for each gene separately! We also need to deal with cases where the options --per-gene --per-contig --gene-transcript-map are combined, since this currently initiates a specialised fetcher so that all transcripts from the same gene are processed at the same time. This should be straightforward, but I wanted to mention it here so I don't forget!

So far I've just quickly checked runtime on a couple of inputs:

  1. Input with a single chromosome but multiple contigs according to the Samfile header: time taken (34 s) is the same with 1-8 processes, i.e. no clear increase in run time from unnecessary processes and calls to pysam.fetch().
  2. Input with multiple chromosomes: below is a table of run times for 1-8 processes (I've got 8 CPUs on my desktop). For processes = 1, multiprocessing isn't used at all, so this reflects current performance.

Running command (note: not sorting the output, and outputting SAM):
umi_tools dedup --out-sam --random-seed=123456789 --method=adjacency --stdin=hgmm_100_hg_mm.bam.featureCounts_sorted.bam --per-gene --gene-tag=XT --processes=[NUM_PROCESSES] -S test_10.out --no-sort-output

Values below are approximate complete run times from the umi_tools log:

| # Processes | Runtime (s) |
|------------:|------------:|
| 1 | 131-139 |
| 2 | 112-118 |
| 3 | 75-84 |
| 4 | 62-68 |
| 8 | 54-65 |

So not linear by any means, but a ~2-fold decrease in run time with 4 processes seems pretty good to me. We could possibly get an even better reduction for larger inputs by writing out to temporary SAMs (avoiding the locking, which is probably what prevents any improvement from 4 to 8 processes?) and then concatenating them afterwards.

We could do with some more appropriate files for benchmarking (more reads, greater depth/more unique UMIs per position), then benchmark run time and memory. There's also an issue that the output when using multiprocessing is not identical across runs. I assume this relates to the random choice of representative reads, but I haven't formally checked this.
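One way to make the parallel output reproducible would be deterministic per-contig seeding, so the representative read chosen no longer depends on process scheduling. This is a sketch of the idea only, with hypothetical helper names; it is an assumption about how the single global --random-seed could be extended, not a description of what umi_tools currently does:

```python
import random
import zlib

def contig_rng(global_seed: int, contig: str) -> random.Random:
    # crc32 gives a stable integer from the contig name, so the per-contig
    # seed is the same in every run and in every worker process.
    return random.Random(global_seed + zlib.crc32(contig.encode()))

def pick_representative(reads, global_seed, contig):
    # Choice depends only on (seed, contig, reads), not on scheduling.
    return contig_rng(global_seed, contig).choice(reads)

same1 = pick_representative(["r1", "r2", "r3"], 123456789, "chr1")
same2 = pick_representative(["r1", "r2", "r3"], 123456789, "chr1")
```

With this, any worker processing chr1 makes the same choices regardless of how contigs are distributed over processes.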

@TomSmithCGAT

Note: it appears the writing out is not working yet, since the output SAM is truncated.

@TomSmithCGAT

For the time being, I've switched the contig-based parallelisation to output each contig to a separate outfile, to avoid the issues with concurrent writing. The run time with 4 processes on the above input is now ~75s.


Poshi commented Oct 15, 2018

Will this parallelization also include umi_tools group? That step looks time-dominant over the whole FastQ->BAM process.

@fanglingcloud

I used umi_tools (version 1.0.0) for scRNA-seq data. It took more than one hour to handle 58 million reads (just 20% of my real data). I want to use --processes to speed up dedup, but it returns:
dedup: error: no such option: --processes
Can someone help me?


seahyfs commented Feb 15, 2024

Hi there,
Just curious whether this issue has been addressed. I'm also finding that both umi_tools group and umi_tools dedup are very time-consuming in my pipeline. I'd be keen to test out parallelization to try to reduce these times. Thank you :)
