use Dask to parallelize document learning #182

Closed
wants to merge 1 commit into from

dlqqq
Member

@dlqqq dlqqq commented May 22, 2023

Description

Uses Dask to parallelize document learning via /learn. The current approach sees roughly a 2-3x performance boost; in the sample run below, total time drops from 8117ms to 4573ms (about 1.8x).

Before:

(LearnActor pid=13273) [/learn] Finished chunking documents. Time: 1308ms
(LearnActor pid=13273) [/learn] Completed. Time: 8117ms

After:

(LearnActor pid=33141) [/learn] Finished chunking documents. Time: 2685ms
(LearnActor pid=33141) [/learn] Finished computing embeddings. Time: 4571ms
(LearnActor pid=33141) [/learn] Complete. Time: 4573ms
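
For readers unfamiliar with the pattern, here is a minimal sketch of parallelizing the two stages in the logs above (chunking, then embedding) with `dask.delayed`. It is not the code from this PR: `split_into_chunks`, `embed_chunk`, and `learn_documents` are hypothetical names, and the embedding function is a toy stand-in for a real embedding model call.

```python
from typing import List, Tuple


def split_into_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Split a document into fixed-size character chunks (hypothetical helper)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed_chunk(chunk: str) -> List[float]:
    """Toy stand-in for an embedding model call; NOT a real embedding."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 1000)]


def learn_documents(
    docs: List[str], chunk_size: int = 100
) -> List[Tuple[str, List[float]]]:
    """Chunk and embed documents in parallel using Dask worker processes."""
    # Imported lazily so the pure helpers above work without Dask installed.
    import dask

    # Stage 1: one delayed chunking task per document.
    chunk_tasks = [dask.delayed(split_into_chunks)(doc, chunk_size) for doc in docs]
    chunked = dask.compute(*chunk_tasks, scheduler="processes")
    all_chunks = [chunk for chunks in chunked for chunk in chunks]

    # Stage 2: one delayed embedding task per chunk.
    embed_tasks = [dask.delayed(embed_chunk)(chunk) for chunk in all_chunks]
    embeddings = dask.compute(*embed_tasks, scheduler="processes")

    return list(zip(all_chunks, embeddings))
```

Calling `learn_documents(["first document ...", "second document ..."])` returns a list of `(chunk, embedding)` pairs. With the `"processes"` scheduler, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork worker processes.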

Notes

Most of the time spent in Dask is actually overhead from spawning processes. The grey bars represent deserialization of pickled data in each worker process:

[Screenshot, 2023-05-22: Dask task profile; grey bars mark deserialization of pickled data in each worker process]

There are some substantial improvements that can be made here.
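
One rough way to gauge the serialization cost described in these notes is to time a pickle round trip on a payload shaped like the chunked documents. This is a standard-library sketch only: `measure_pickle_roundtrip` is a hypothetical helper, the payload is synthetic, and the numbers it prints will not match what the Dask profiler reports.

```python
import pickle
import time


def measure_pickle_roundtrip(payload):
    """Return (restored, size_bytes, dump_seconds, load_seconds) for a payload."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    dump_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = pickle.loads(blob)
    load_seconds = time.perf_counter() - start

    return restored, len(blob), dump_seconds, load_seconds


# Synthetic stand-in for a list of document chunks (an assumption, not real data).
chunks = [f"chunk {i}: " + "lorem ipsum " * 50 for i in range(2_000)]

restored, size_bytes, dump_seconds, load_seconds = measure_pickle_roundtrip(chunks)
print(f"pickled size: {size_bytes} bytes")
print(f"dumps: {dump_seconds * 1000:.2f}ms, loads: {load_seconds * 1000:.2f}ms")
```

Comparing the `loads` time against the per-task compute time gives a first indication of whether data transfer or process startup dominates.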

@dlqqq dlqqq added the enhancement New feature or request label May 22, 2023
@3coins
Collaborator

3coins commented May 22, 2023

@dlqqq
Thanks for putting this together. This looks great. Can you explore how the actual request/reply flow will work with the Dask setup?

@dlqqq dlqqq mentioned this pull request Jun 23, 2023
@dlqqq
Member Author

dlqqq commented Jun 27, 2023

Superseded by #244.

@dlqqq dlqqq closed this Jun 27, 2023