use Dask to parallelize document learning #182

dlqqq · 2023-05-22T19:43:38Z

Description

Uses Dask to parallelize document learning via /learn. The current approach sees roughly a 2-3x performance boost.
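For readers unfamiliar with the pattern, here is a minimal sketch of how per-document work could be fanned out with `dask.delayed`. It is not the code in this PR: `split_into_chunks`, `embed_chunks`, and `learn` are hypothetical stand-ins, and the `"threads"` scheduler is an assumption chosen so the snippet runs anywhere (the PR itself discusses worker processes).

```python
from dask import compute, delayed


def split_into_chunks(text: str, chunk_size: int = 20) -> list[str]:
    """Hypothetical stand-in for the real document splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for an embedding model call."""
    return [[float(len(chunk)), float(sum(map(ord, chunk)) % 97)] for chunk in chunks]


def learn(documents: list[str], scheduler: str = "threads") -> list[list[list[float]]]:
    # Build one lazy chunk -> embed pipeline per document; nothing runs yet.
    tasks = [delayed(embed_chunks)(delayed(split_into_chunks)(doc)) for doc in documents]
    # Execute all pipelines in parallel on the chosen Dask scheduler.
    return list(compute(*tasks, scheduler=scheduler))


if __name__ == "__main__":
    embeddings = learn(["first document " * 5, "second document " * 3])
    print([len(doc_embeddings) for doc_embeddings in embeddings])
```

Passing `scheduler="processes"` would run the same graph in worker processes, at the cost of pickling inputs and outputs across the process boundary.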

Before:

(LearnActor pid=13273) [/learn] Finished chunking documents. Time: 1308ms
(LearnActor pid=13273) [/learn] Completed. Time: 8117ms

After:

(LearnActor pid=33141) [/learn] Finished chunking documents. Time: 2685ms
(LearnActor pid=33141) [/learn] Finished computing embeddings. Time: 4571ms
(LearnActor pid=33141) [/learn] Complete. Time: 4573ms
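
As a quick check on the numbers above, the end-to-end ratio for this particular pair of runs can be computed from the two completion lines. This is only arithmetic on the logs shown here; the `elapsed_ms` helper is hypothetical, and other runs may give a different ratio.

```python
import re

BEFORE_LOG = "(LearnActor pid=13273) [/learn] Completed. Time: 8117ms"
AFTER_LOG = "(LearnActor pid=33141) [/learn] Complete. Time: 4573ms"


def elapsed_ms(log_line: str) -> int:
    """Extract the trailing 'Time: <n>ms' value from a /learn log line."""
    match = re.search(r"Time: (\d+)ms$", log_line)
    if match is None:
        raise ValueError(f"no timing found in: {log_line!r}")
    return int(match.group(1))


speedup = elapsed_ms(BEFORE_LOG) / elapsed_ms(AFTER_LOG)
print(f"{speedup:.2f}x")  # → 1.77x
```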

Notes

Most of the time spent by Dask actually goes to the overhead of spawning processes. The grey bars represent deserialization of pickled data in each worker process:

There are some substantial improvements that can be made here.
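
To get a rough feel for the serialization cost mentioned in these notes, one could time a pickle round trip on a payload of similar shape. The sketch below uses only the standard library; the payload (1,000 chunks of about 1 KB each) and the `measure_pickle_round_trip` helper are assumptions for illustration, not measurements from this PR.

```python
import pickle
import time


def measure_pickle_round_trip(payload: object) -> tuple[int, float, float]:
    """Return (size in bytes, serialize ms, deserialize ms) for one payload."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    serialize_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    restored = pickle.loads(blob)
    deserialize_ms = (time.perf_counter() - start) * 1000

    # Sanity check: the round trip must reproduce the payload exactly.
    assert restored == payload
    return len(blob), serialize_ms, deserialize_ms


# Hypothetical payload: 1,000 chunks of roughly 1 KB each.
chunks = [f"chunk {i}: " + "x" * 1000 for i in range(1000)]
size, ser_ms, de_ms = measure_pickle_round_trip(chunks)
print(f"{size} bytes, dumps {ser_ms:.2f}ms, loads {de_ms:.2f}ms")
```

Each worker process pays the `loads` cost for whatever data it receives, so sending fewer or smaller objects per task is one way the overhead might be reduced.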

3coins · 2023-05-22T19:50:05Z

@dlqqq
Thanks for putting this together. This looks great. Can you explore how the actual request/reply will work with the Dask setup?

dlqqq · 2023-06-27T18:00:10Z

Superseded by #244.

use dask to parallelize document learning

7b9db6d

dlqqq added the enhancement New feature or request label May 22, 2023

dlqqq mentioned this pull request Jun 23, 2023

Migrate to Dask #242

Closed

dlqqq closed this Jun 27, 2023
