
Hamming distance implementation with Numba #512

Merged · 80 commits · Aug 19, 2024

Conversation

@felixpetschko (Collaborator) commented Apr 29, 2024

Close #256

Hamming distance implementation with Numba in the same fashion as the improved TCRdist calculator.
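A minimal sketch of what such a Numba kernel can look like (illustrative only, not the PR's actual code; the encoding of sequences as equal-length integer arrays and the cutoff handling are assumptions):

import numba as nb

@nb.njit(cache=True)
def hamming_distance(seq1, seq2, cutoff):
    # Illustrative kernel: count mismatching positions, with an early exit
    # once the distance exceeds the cutoff. The sequence encoding is assumed.
    if seq1.shape[0] != seq2.shape[0]:
        return cutoff + 1  # different lengths: treat as beyond the cutoff
    dist = 0
    for i in range(seq1.shape[0]):
        if seq1[i] != seq2[i]:
            dist += 1
            if dist > cutoff:
                return cutoff + 1
    return dist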

@grst (Collaborator) commented Apr 29, 2024

Awesome! I assume we can close #481 instead?

@felixpetschko (Collaborator, Author) replied:

> Awesome! I assume we can close #481 instead?

Yes, we can close #481 instead; I will port the changes related to the normalized Hamming distance over to this PR.

This PR still needs some performance-related adaptations that will follow soon. Today I tested a Numba version that runs the Hamming distance on 1 million cells in 80 seconds with 64 cores (TCRdist takes around 210 seconds). This is around 8 times faster than the current Hamming distance implementation in scirpy. The implementation in #481 was fast with fewer resources and smaller inputs, but I don't think it could really outperform the current scirpy Hamming distance on the cluster. I also have a pure NumPy implementation, not shown in any PR so far, which is around 25% slower than the upcoming Numba version.

@felixpetschko (Collaborator, Author) commented May 2, 2024

@grst I just pushed a version of the Hamming distance that uses Numba for the parallelization, via parallel=True for the JIT compiler. I could run 1 million cells in 40 seconds (80 with joblib) and 8 million cells in 2400 seconds with 64 cores. That way threads are used instead of processes, and I only needed 128 GB of RAM for 8 million cells.
What do you think about using Numba parallel threads instead of joblib? The advantages:
- Speed: better optimization by Numba, threads start faster than processes, less copying of data around.
- Memory: shared memory between threads -> less redundancy -> lower memory consumption.
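Roughly, this means letting Numba distribute the outer loop over its own threads with prange. A minimal sketch, assuming the hamming_distance kernel sketched above and 2D integer-encoded sequence arrays (both assumptions, not the PR's actual code):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def hamming_dist_matrix(seqs1, seqs2, cutoff):
    # Illustrative: rows of the result are distributed across Numba threads.
    n, m = seqs1.shape[0], seqs2.shape[0]
    result = np.empty((n, m), dtype=np.int32)
    for i in nb.prange(n):
        for j in range(m):
            result[i, j] = hamming_distance(seqs1[i], seqs2[j], cutoff)
    return result

The thread count can then be controlled with numba.set_num_threads() or the NUMBA_NUM_THREADS environment variable.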

@grst (Collaborator) commented May 2, 2024

I've been thinking about this before, but I wouldn't have thought that there is so much to gain, since the blocks were already quite large. The only downside I can see is that parallelization across multiple machines is no longer possible that way. 2400 seconds is impressive for this number of cells, but if you have the compute power you could just split the work across several nodes and be faster.

But probably we'd have to resolve other bottlenecks first before this becomes relevant.

@felixpetschko (Collaborator, Author) commented:
@grst We could just introduce two parameters, number_of_processes (joblib jobs) and number_of_threads_per_process (Numba threads), or something like that instead of n_jobs, since everything is already set up for it anyway. That way we could get the best of both worlds and the user can decide :)
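Hypothetically, the two layers could be combined along these lines (parameter names taken from the suggestion above; hamming_dist_matrix is the placeholder kernel sketched earlier and the cutoff is an arbitrary example value, so none of this is the PR's actual API):

import joblib
import numba as nb

def compute_all_blocks(blocks, number_of_processes, number_of_threads_per_process, cutoff=2):
    # Illustrative: joblib spawns processes, each of which limits its own
    # Numba thread pool before running the jitted kernel on its block.
    def run_block(seqs1, seqs2):
        nb.set_num_threads(number_of_threads_per_process)
        return hamming_dist_matrix(seqs1, seqs2, cutoff)

    return joblib.Parallel(n_jobs=number_of_processes)(
        joblib.delayed(run_block)(seqs1, seqs2) for seqs1, seqs2 in blocks
    )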

@grst (Collaborator) left a comment:

That's a great idea! I would maybe rather control the number of blocks via a parameter than the number of jobs. When using dask, smaller blocks may be beneficial for balancing the load, since some workers might be faster than others. The final call would then look something like

with joblib.parallel_config(backend="dask", n_jobs=200, verbose=10):
    ir.pp.ir_dist(
        metric="hamming",
        n_jobs=8, # jobs per worker
        n_blocks = 2000, # number of blocks sent to dask
    )

To document how to do this properly, I'd like to set up a "large dataset tutorial" (#479) at some point.

Reviewed snippet:

arguments = [(split_seqs[x], seqs2, is_symmetric, start_columns[x]) for x in range(n_blocks)]
delayed_jobs = [joblib.delayed(self._calc_dist_mat_block)(*args) for args in arguments]
results = list(_parallelize_with_joblib(delayed_jobs, total=len(arguments), n_jobs=self.n_jobs))
@grst (Collaborator) commented May 3, 2024:

You could directly use

Parallel(return_as="list", n_jobs=self.n_jobs)(delayed_jobs)

here. The _parallelize_with_joblib wrapper is only there for the progress bar; a progress bar doesn't make sense with this small number of jobs, and it is not compatible with the dask backend anyway.
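Applied to the snippet above, that might look roughly like this (a sketch reusing the variables from the reviewed code; return_as="list" requires joblib >= 1.3):

from joblib import Parallel, delayed

# Sketch only: split_seqs, seqs2, is_symmetric, start_columns, n_blocks and
# self._calc_dist_mat_block come from the reviewed snippet above.
arguments = [(split_seqs[x], seqs2, is_symmetric, start_columns[x]) for x in range(n_blocks)]
delayed_jobs = [delayed(self._calc_dist_mat_block)(*args) for args in arguments]
results = Parallel(return_as="list", n_jobs=self.n_jobs)(delayed_jobs)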

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 53.69458% with 94 lines in your changes missing coverage. Please review.

Project coverage is 81.58%. Comparing base (d1db848) to head (f566211).
Report is 18 commits behind head on main.

Files                           Patch %   Lines missing
src/scirpy/ir_dist/metrics.py   52.06%    93 ⚠️
src/scirpy/util/__init__.py     85.71%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   80.19%   81.58%   +1.39%     
==========================================
  Files          49       49              
  Lines        4079     4204     +125     
==========================================
+ Hits         3271     3430     +159     
+ Misses        808      774      -34     


felixpetschko mentioned this pull request on Aug 16, 2024
@felixpetschko (Collaborator, Author) commented Aug 16, 2024

I implemented the requested changes now. The histogram feature might need some adaptations in a future pull request.
The failing codecov/patch check doesn't make sense here, because the Numba functions are not recognized as covered by tests even though they are (indirectly). From my side this pull request could be integrated. Afterwards I would open a new pull request with a draft of a GPU implementation of the Hamming distance.

@grst (Collaborator) commented Aug 19, 2024

Hi @felixpetschko,

is anything still pending from your side here? I'll try it out locally one more time and then merge this.

@felixpetschko (Collaborator, Author) replied:
There is nothing else pending from my side :)

grst enabled auto-merge (squash) on August 19, 2024, 19:20
grst disabled auto-merge on August 19, 2024, 20:08
grst merged commit 2b8b8e6 into scverse:main on Aug 19, 2024
9 checks passed