Hamming distance implementation with Numba #512
Conversation
Awesome! I assume we can close #481 instead?
Yes, we can close #481 instead. I will port the changes related to the normalized Hamming distance to this PR.
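For context, a "normalized" Hamming distance is usually the mismatch count expressed relative to sequence length. A minimal sketch of that idea (the function name and the percentage scaling are assumptions for illustration, not scirpy's actual API):

```python
def normalized_hamming(s1: str, s2: str) -> float:
    """Hamming distance as a percentage of sequence length.

    Illustrative sketch only; assumes equal-length sequences,
    as required by the Hamming distance.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1) * 100.0
```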
@grst I just pushed a version of the Hamming distance that uses Numba for parallelization via `parallel=True` for the JIT compiler. I could run 1 million cells in 40 seconds (80 with joblib) and 8 million cells in 2400 seconds on 64 cores. That way threads are used instead of processes, and I only needed 128 GB of RAM for 8 million cells.
I've been thinking about this before, but wouldn't have thought that there is so much to gain, since the blocks were already quite large. The only downside I can see is that out-of-machine parallelization is no longer possible that way. 2400 s is impressive for this number of cells, but if you have the compute power you could just split the work across several nodes and be faster. But we'd probably have to resolve other bottlenecks first before this becomes relevant.
@grst We could just introduce two parameters, `number_of_processes` (joblib jobs) and `number_of_threads_per_process` (Numba threads), or something like that instead of `n_jobs`, because everything is already set up for it anyway. That way we could get the best of both worlds and the user can decide :)
That's a great idea! I would maybe rather control the number of blocks via a parameter than the number of jobs. When using dask, smaller blocks may be beneficial to balance the load, since some workers might be faster than others. Then the final call would look something like:

```python
with joblib.parallel_config(backend="dask", n_jobs=200, verbose=10):
    ir.pp.ir_dist(
        metric="hamming",
        n_jobs=8,  # jobs per worker
        n_blocks=2000,  # number of blocks sent to dask
    )
```
To document how to do this properly, I'd like to setup a "large dataset tutorial" (#479) at some point.
src/scirpy/ir_dist/metrics.py (Outdated)

```python
arguments = [(split_seqs[x], seqs2, is_symmetric, start_columns[x]) for x in range(n_blocks)]

delayed_jobs = [joblib.delayed(self._calc_dist_mat_block)(*args) for args in arguments]
results = list(_parallelize_with_joblib(delayed_jobs, total=len(arguments), n_jobs=self.n_jobs))
```
You could directly use `Parallel(return_as="list", n_jobs=self.n_jobs)(delayed_jobs)` here. The `_parallelize_with_joblib` wrapper is only there for the progress bar, and a progress bar doesn't make sense with this small number of jobs; it is also not compatible with the dask backend anyway.
Codecov Report

Attention: Patch coverage is

```
@@           Coverage Diff            @@
##             main     #512      +/-   ##
==========================================
+ Coverage   80.19%   81.58%   +1.39%
==========================================
  Files          49       49
  Lines        4079     4204     +125
==========================================
+ Hits         3271     3430     +159
+ Misses        808      774      -34
==========================================
```

☔ View full report in Codecov by Sentry.
I implemented the requested changes now. The histogram feature might need some adaptations in a future pull request.
Hi @felixpetschko, still anything pending from your side here? I'll try it out locally one more time and then I'd merge this. |
There is nothing else pending from my side :)
Closes #256.
Hamming distance implementation with Numba, in the same fashion as the improved TCRdist calculator.