Parallel deduplication #203
Note: It appears the writing out is not working yet, since the output SAM is truncated.
For the time being, I've switched the contig-based parallelisation to output each contig to a separate outfile to avoid the issues with concurrent writing. The run time with 4 processes on the above input is now ~75s.
Will this parallelisation also include umi_tools group? That step looks time-dominant over the whole FastQ->BAM process.
I used umi_tools (version 1.0.0) for scRNA-seq data. It took more than an hour to process 58 million reads (just 20% of my real data). I want to use --processes to speed up the dedup, but it returns:
Hi there,

We've discussed parallel de-duplication a few times to avoid occasional run time issues (e.g. #31) and previously tried implementing parallelisation of the `edit_distance` calculation (#69). Although we have made multiple modifications to reduce run time bit by bit, this issue still crops up (#173). I suggest we therefore revisit parallelisation.

I've started trying to implement parallelisation of the read bundle deduplication step, but ran into the same issue with pickling `pysam.AlignedSegment` objects described by @IanSudbery in #69. I therefore suggest the following order of tasks:

To do
I've implemented the simplest form of parallelisation: one process per contig. This has the major advantage that only the infile name and contig need to be pickled - each process makes a separate call to `pysam.fetch()` and then writes out the deduped reads to `outfile`, using a `multiprocessing.Lock` to ensure only one process is writing out at a time.
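For illustration, here's a minimal sketch of that pattern. This isn't the actual code on the {TS}-para branch; `dedup_bundle()` and the file names are hypothetical placeholders for the real dedup logic.

```python
# Rough sketch of the one-process-per-contig pattern: only the file names and
# the contig string are pickled, so pysam.AlignedSegment objects never cross
# process boundaries (the pickling problem from #69).
import multiprocessing
import pysam

def init_worker(shared_lock):
    # Locks can't be passed through Pool.map(), so workers inherit one at start-up.
    global write_lock
    write_lock = shared_lock

def dedup_bundle(reads):
    # Placeholder for the real bundle deduplication step.
    return list(reads)

def dedup_contig(args):
    infile_name, outfile_name, contig = args
    with pysam.AlignmentFile(infile_name, "rb") as infile:
        # Each worker makes its own fetch() call for its contig.
        deduped = dedup_bundle(infile.fetch(contig))
        sam_lines = [read.to_string() for read in deduped]
    # Only one process writes to the shared output at a time
    # (SAM header handling omitted for brevity).
    with write_lock:
        with open(outfile_name, "a") as outfile:
            outfile.write("".join(line + "\n" for line in sam_lines))
    return contig, len(sam_lines)

if __name__ == "__main__":
    infile_name, outfile_name = "input.bam", "deduped.sam"  # hypothetical paths
    with pysam.AlignmentFile(infile_name, "rb") as bam:
        contigs = list(bam.references)
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(processes=4, initializer=init_worker,
                              initargs=(lock,)) as pool:
        counts = pool.map(dedup_contig,
                          [(infile_name, outfile_name, c) for c in contigs])
```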
The use of multiple processes is available on {TS}-para via `--num-processes=`. There are still a small number of issues to resolve, including logging the number of parsed and output reads, and outputting stats, which has been intentionally broken for the time being. On the stats front, arguably we shouldn't allow stats together with parallel contig processing, since the null distribution of UMI edit distances will be poorly estimated for very short contigs, and nonsensical when run with `--per-gene`, since we will calculate stats for each gene separately! Also, we need to deal with cases where the following options are combined: `--per-gene --per-contig --gene-transcript-map`, since this currently initiates use of a specialised fetcher so that all transcripts from the same gene are processed at the same time. This should be straightforward, but I just wanted to mention it here so I don't forget!

So far I've just quickly checked runtime on a couple of inputs:
(With a single process, reads for each contig are still retrieved by a separate call to `pysam.fetch()`, but `multiprocessing` isn't used at all - this is current performance.)

Running command (note: no sorting of output, and outputting SAM):
umi_tools dedup --out-sam --random-seed=123456789 --method=adjacency --stdin=hgmm_100_hg_mm.bam.featureCounts_sorted.bam --per-gene --gene-tag=XT --processes=[NUM_PROCESSES] -S test_10.out --no-sort-output
Values below are approximations of complete run times from the `umi_tools` log.

So not linear by any means, but a ~2-fold decrease in runtime with 4 processes seems pretty good to me. It's possible we could get an even better reduction in run time for larger inputs by writing out to temp SAMs (avoiding the locking issue, which is probably preventing improvement from 4 to 8 processes?) and then concatenating afterwards.
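As a rough sketch of that temp-file approach (the helper and paths below are hypothetical, and the final step could equally be done with `samtools cat` on the command line):

```python
# Rough sketch of the temp-file alternative: each worker writes its own
# per-contig BAM (no lock needed) and the parent concatenates them afterwards.
# write_deduped_contig() and the paths are hypothetical placeholders.
import multiprocessing
import tempfile
import pysam

def write_deduped_contig(args):
    infile_name, contig = args
    tmp = tempfile.NamedTemporaryFile(suffix=".bam", delete=False)
    tmp.close()
    with pysam.AlignmentFile(infile_name, "rb") as infile, \
         pysam.AlignmentFile(tmp.name, "wb", template=infile) as out:
        for read in infile.fetch(contig):  # real code would dedup here
            out.write(read)
    return tmp.name

if __name__ == "__main__":
    infile_name = "input.bam"  # hypothetical path; must be indexed
    with pysam.AlignmentFile(infile_name, "rb") as bam:
        contigs = list(bam.references)
    with multiprocessing.Pool(processes=4) as pool:
        per_contig_bams = pool.map(write_deduped_contig,
                                   [(infile_name, c) for c in contigs])
    # Concatenate the per-contig BAMs; pysam.cat wraps "samtools cat".
    pysam.cat("-o", "deduped.bam", *per_contig_bams)
```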
We could do with some more appropriate files for benchmarking (more reads, greater depth / more unique UMIs per position), and then benchmark both runtime and memory. There's also an issue that the output when using multiprocessing is not identical. I assume this relates to the random choice of representative reads, but I haven't formally checked this.
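One possible way to make the parallel output reproducible - purely a suggestion, not something on the branch - would be to derive a fixed seed per contig from the global `--random-seed`, so the random choice of representative reads no longer depends on how contigs are scheduled across processes:

```python
# Hypothetical sketch (not implemented): derive a fixed per-contig seed from
# the global --random-seed so random tie-breaking is independent of scheduling.
import hashlib
import random

def contig_rng(global_seed, contig):
    # Hash the contig name together with the global seed so every run, and
    # every process layout, uses the same RNG stream for a given contig.
    digest = hashlib.sha256(f"{global_seed}:{contig}".encode()).hexdigest()
    return random.Random(int(digest, 16) % (2 ** 32))

rng = contig_rng(123456789, "chr1")
representative = rng.choice(["read_A", "read_B", "read_C"])  # dummy reads
```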