Minimize copying in `maybe_compress` & `byte_sample` #6273

jakirkham · 2022-05-05T09:27:32Z

Currently several copies occur in `maybe_compress` and `byte_sample`. Some are explicit (like calling `ensure_bytes`) and some are implicit (like slicing). In either case it would be good to avoid additional memory allocation and copying in these functions when it is not needed. After all, these code paths can be triggered when sending data over the wire or spilling to disk, and either could be happening because of memory pressure that we don't want to add to.
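
To illustrate the implicit case: slicing a `bytes` object allocates a new object and copies the data, while slicing a `memoryview` gives a view onto the same buffer. The snippet below is only a minimal sketch of that difference; the variable names are illustrative and not taken from the PR.

```python
# A mutable buffer, so that sharing vs. copying can be observed.
payload = bytearray(b"0123456789" * 4)

# Slicing bytes allocates a new object and copies the selected data.
copied = bytes(payload)[10:20]

# Slicing a memoryview creates a view that shares the original buffer.
view = memoryview(payload)[10:20]

# A later change to the buffer is visible through the view,
# but not through the copy made earlier.
payload[10] = ord("X")
view_first = view[0]
copied_first = copied[0]
```

Here `view_first` reflects the modified byte while `copied_first` still holds the original one, which is the behaviour that lets `memoryview` slices avoid extra allocations.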

Tests added / passed
Passes `pre-commit run --all-files`

To allow more efficient access to the `payload` (like when selecting portions in `byte_sample`), take a `memoryview` of the data. Ensure that it is 1-D contiguous `uint8` data. This makes it very similar to `bytes`, which will work well in `byte_sample` and in compressors that handle only a narrow form of the Python Buffer Protocol. This allows us to drop various `ensure_bytes` calls in compression that would otherwise copy the data. It should reduce memory usage when serializing as part of transmission or spilling.
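
A rough sketch of the coercion described above is shown here. It is an illustration only: `as_flat_uint8_view` is a hypothetical name, and the real `ensure_memoryview` in `distributed.utils` may differ in its details.

```python
import array


def as_flat_uint8_view(obj):
    """Return a 1-D contiguous uint8 memoryview over ``obj``.

    Hypothetical helper written for illustration; not the PR's code.
    """
    # Only wrap the object if it is not a memoryview already.
    mv = obj if type(obj) is memoryview else memoryview(obj)

    if mv.nbytes == 0:
        # Nothing to share, so an empty view is enough.
        return memoryview(b"")
    if mv.contiguous:
        # Zero-copy: reinterpret the same buffer as flat unsigned bytes.
        return mv.cast("B")
    # Non-contiguous data cannot be cast, so copy once into bytes.
    return memoryview(mv.tobytes())


# Contiguous, non-byte data: viewed as bytes without copying.
doubles = array.array("d", [1.0, 2.0, 3.0])
flat = as_flat_uint8_view(doubles)

# Strided (non-contiguous) data: copied once to make it contiguous.
strided = memoryview(b"abcdef")[::2]
flat_strided = as_flat_uint8_view(strided)
```

The result behaves much like `bytes` for consumers that only accept flat byte buffers, while the contiguous path avoids a copy.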

jakirkham · 2022-05-05T10:18:39Z

cc @dask/maintenance (in case anyone has thoughts on this)

Also cc-ing @madsbk given the `ensure_memoryview` changes

github-actions · 2022-05-05T11:14:18Z

Unit Test Results

16 files ±0 · 16 suites ±0 · 7h 40m 45s ⏱️ +19m 41s
2 762 tests +5 · 2 683 ✔️ +4 · 78 💤 ±0 · 1 ❌ +1
22 058 runs +40 · 21 037 ✔️ +39 · 1 020 💤 ±0 · 1 ❌ +1

For more details on these failures, see this check.

Results for commit 8661564. ± Comparison against base commit 2286896.

♻️ This comment has been updated with latest results.

martindurant

I just have a couple of thoughts, but it's definitely a good thing to do.

jakirkham · 2022-05-06T00:51:54Z

Planning to merge end of day tomorrow if no comments

madsbk

LGTM

jakirkham · 2022-05-06T15:22:04Z

Thanks Mads! 🙏

jakirkham · 2022-05-06T15:23:14Z

One of the CI failures is a known flaky test that was very recently addressed (#6233). The other CI failure is a new flaky test, so I filed it as an issue (#6292).

jakirkham · 2022-05-06T23:46:00Z

Thanks all! 🙏

Going to get this in. If anything else comes up, happy to follow up separately.

jakirkham force-pushed the avoid_memcpy_compress branch 4 times, most recently from 2981f08 to 705399f · May 5, 2022 09:46

jakirkham added 17 commits May 5, 2022 02:51

Move ensure_memoryview to distributed.utils

8e3d41f

Replace ret with mv for clarity

20b3f9f

Coerce obj to memoryview only if needed

6fd7179

Require contiguous data for .cast("B")

007c8d1

Otherwise `memoryview` will raise a `TypeError`.

Copy to bytes first in non-contiguous case

ec4220c
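
The two commits above go together: `memoryview.cast` only works on C-contiguous views, so non-contiguous data has to be copied first. The following is a small sketch of that behaviour using only the standard library; the variable names are illustrative.

```python
# Taking every other byte gives a strided, non-contiguous view.
strided = memoryview(b"abcdef")[::2]
is_contiguous = strided.contiguous

# Casting a non-contiguous view is rejected with a TypeError.
try:
    strided.cast("B")
except TypeError as exc:
    cast_error = exc
else:
    cast_error = None

# Copying to bytes first yields contiguous data that can be viewed freely.
flat = memoryview(strided.tobytes())
```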

Shortcut trivial memoryview case

0a63b8d

No need to pay the cost for copying here. Just use an empty `bytes` object for the `memoryview`. Should be faster in this case and saves us a check in the `cast` case.

Fill out docstring & add comments

5fa88cf

Drop unused conversion of min_size

e932c42

This is converted to `int` here, but is unused below. So go ahead and drop it as it doesn't seem to be needed.

Join if cases together

c3297f7

Use nbytes to get payload size

288bdb4

Also rename `nbytes` variable to `payload_nbytes` for clarity.
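
One likely reason for preferring `nbytes` (an assumption, since the commit message does not spell it out) is that `len()` of a `memoryview` counts elements along the first dimension, not bytes. A short sketch with the standard library `array` module:

```python
import array

# A buffer of three 8-byte floats.
payload = memoryview(array.array("d", [1.0, 2.0, 3.0]))

n_items = len(payload)    # number of elements along the first dimension
n_bytes = payload.nbytes  # total size of the buffer in bytes
```

For data that is not already flat `uint8`, the two values differ, so size thresholds are better compared against `nbytes`.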

Consolidate payload size checks

7871181

Add tests of ensure_memoryview

d5b9c69

Test compression with memoryview

8a8125d

Drop blank line & add a comment

bbed121

Unwrap comment

2658a4e

Test empty bytes with memoryview

c4299c1

jakirkham force-pushed the avoid_memcpy_compress branch from 705399f to c4299c1 · May 5, 2022 09:51

jakirkham added 2 commits May 5, 2022 03:19

Coerce b to memoryview to avoid copies

6ef14e7

Add blank line

e74d3c8

jakirkham mentioned this pull request May 5, 2022

Cleanup old compression workarounds #6259

Merged

2 tasks

martindurant reviewed May 5, 2022

distributed/protocol/compression.py (outdated, resolved)

distributed/protocol/compression.py (outdated, resolved)

distributed/utils.py (resolved)

jakirkham mentioned this pull request May 5, 2022

Compress larger buffers #6286

Open

Special case fewer parts in byte_sample

f99b523

Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>

jakirkham added 16 commits May 5, 2022 16:27

From random just import randint

689c8a5

Fast path not compression case

a25fb6c

Go ahead and exit immediately in this case before doing anything else.

Normalize args after size check

72dbad0

Use mv.nbytes in compression check

3fd096a

This is a bit clearer while being just as fast.

Consolidate size check code

2dab947

Consolidate size & n handling

d02a4e9

Compute largest start once

dcb8c60

Consolidate fast paths

63d94e0

Simplify final comment

5d6aa1c

Fuse loops in byte_sample to make parts

cf760cc

Set start to next_start at end

8612caa

Tidy comments

883f43f

Tweak wording

8de6296

Also note shape change in comment

e4e86bc

Clarify size given sample selection behavior

57df6fd

Tweak comment

3aded1e

jakirkham added 5 commits May 5, 2022 23:43

Fix comparisons

8aaf04d

As comparisons were effectively flipped from how they were before, these should have `=`s as a condition as well.

Shorten docstring in ensure_memoryview

dc04e4c

Preallocate parts to match intended size

75213a4

This can be quite a bit faster than `append`ing each value (particularly if resizing of the underlying array needs to occur).
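
As a rough illustration of the preallocation idea, the sketch below fills a list of known length by index instead of growing it with `append`. It is a simplified, hypothetical stand-in (`sample_views` is not a function from the PR) and it assumes `mv` is a 1-D `uint8` view with at least `size` bytes.

```python
from random import randint


def sample_views(mv, size, n):
    """Take ``n`` slices of ``size`` bytes from ``mv`` (hypothetical sketch)."""
    # Preallocate the list at its final length instead of appending.
    parts = n * [None]
    max_start = mv.nbytes - size
    for i in range(n):
        start = randint(0, max_start)
        # Slicing a memoryview does not copy the underlying data.
        parts[i] = mv[start : start + size]
    return parts


data = memoryview(bytes(range(100)))
parts = sample_views(data, size=10, n=4)
```

Each entry in `parts` is a view into `data`, so no bytes are copied until the samples are joined or consumed.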

Just use ints for min_size & sample_size

a2d891a

These are basically unused and are expected to be `int`s internally. So just pick default values that are `int`s to start.

Call x.tobytes() once and assign it

8661564

Avoid repeated copies while testing that don't add value here.

jakirkham force-pushed the avoid_memcpy_compress branch from 46c1b8f to 8661564 · May 6, 2022 07:29

madsbk approved these changes May 6, 2022

jakirkham merged commit 4d6a438 into dask:main May 6, 2022

jakirkham deleted the avoid_memcpy_compress branch May 6, 2022 23:45

jakirkham mentioned this pull request May 7, 2022

Use ensure_memoryview in array deserialization #6300

Merged

2 tasks
