Reduce allocations & accept bytes and bytearray inputs #22
Conversation
It may be worth asking … Failing that, a bytearray would be totally fine; it's a drop-in in most places in Python where you would use a bytes object (or memoryview). Note that for parquet, we do always know the uncompressed output size. If we make use of this, I expect everything can be made to work fine with the current code; in fact better, since we never over-allocate and we don't have to ask the decompressor how long the output might need to be. Obviously this is the case for decompression only.
Another nice feature might be a …
I'm uncertain about the buffer protocol, as from my understanding it would mean exposing a Rust type to Python which would own the de/compressed bytes, and that would significantly alter the API. But I can dedicate more time to seeing what details lie beneath. Anyway, the points I'm taking from this to complete the desired speed-up are:

- [ ] All compression variants should accept `bytes` or `bytearray`s.
- [ ] Add an optional parameter where, from Python, one can specify the exact length of the de/compressed buffer, allowing both `PyBytes` and `PyByteArray` to do a single allocation for the output.

Would this work for you?

As a side note, I don't think `decompress_into` would be very difficult: we'd just need to pass the pointer and length of the buffer to Rust, and from there we can simply hand the resulting slice to the underlying de/compression variant (but I guess that's a separate feature/issue from this). I'm assuming it's not really for performance reasons? Whether it's Python/numpy doing the allocation or Rust is really splitting hairs, unless one is wanting to re-use a buffer?

Interesting point about parquet output size. Is this only for snappy, or is that regardless of compression type in parquet?
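For concreteness, here is a rough PyO3-side sketch of what those two points could look like; the function name `decompress`, the `output_len` parameter, and the size estimate are illustrative assumptions, not cramjam's actual API.

```rust
use pyo3::prelude::*;
use pyo3::types::{PyAny, PyByteArray, PyBytes};

/// Illustrative sketch: accept `bytes` or `bytearray`, plus an optional exact
/// output length so the output buffer can be allocated in a single shot.
#[pyfunction]
fn decompress(py: Python<'_>, data: &PyAny, output_len: Option<usize>) -> PyResult<PyObject> {
    // View the input as a byte slice regardless of which Python type was passed.
    let input: &[u8] = if let Ok(b) = data.downcast::<PyBytes>() {
        b.as_bytes()
    } else {
        let ba = data.downcast::<PyByteArray>()?;
        unsafe { ba.as_bytes() } // we don't call back into Python while holding this
    };

    // With an exact `output_len` (e.g. read from parquet metadata) this is a
    // single exact allocation; otherwise fall back to a rough estimate.
    let len = output_len.unwrap_or_else(|| input.len() * 4);
    let output = PyBytes::new_with(py, len, |_out: &mut [u8]| {
        // A real implementation would decompress `input` into `_out` here.
        Ok(())
    })?;
    Ok(output.to_object(py))
}
```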
Parquet always gives the uncompressed size for any codec.
Preallocated numpy is important, because this is how fastparquet gets around creating dataframes by concatenation; it would avoid a copy. Rather specific to the use case, but I think still a great feature.
@martindurant You are welcome to review if you'd like, but as it is now:
I've made a separate issue for de/compressing into a preallocated buffer: #23
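A minimal sketch of what such a `decompress_into` might look like on the Rust side, purely illustrative since the actual design belongs to #23: the caller owns the output buffer, and Rust just receives its pointer/length as a slice and hands it to the codec. A plain copy stands in for real decompression here, and a real version might accept any buffer-protocol object (e.g. a numpy array) rather than only `bytearray`.

```rust
use pyo3::prelude::*;
use pyo3::types::PyByteArray;

/// Illustrative `decompress_into` sketch: the caller allocates the output
/// buffer; we hand its mutable slice to the de/compression codec and report
/// how many bytes were written.
#[pyfunction]
fn decompress_into(data: &[u8], output: &PyByteArray) -> PyResult<usize> {
    // Safety: we don't call back into Python or resize `output` while this
    // slice is alive.
    let out: &mut [u8] = unsafe { output.as_bytes_mut() };

    // Stand-in for a real codec writing into `out`.
    let n = data.len().min(out.len());
    out[..n].copy_from_slice(&data[..n]);

    Ok(n) // number of bytes written
}
```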
That's excellent news! Obviously, the inline comment docs will also need to be updated, including all of the variants that now exist. Also, there is repeated code between the algorithms that could be simplified. So do I understand correctly that if you pass bytes, you get bytes back, and if you pass bytearray, you get bytearray back? I'm not certain there's a good reason for ever having bytes anymore. I notice you redid the gzip benchmarks with the size parameter (although I'm not sure what it means for the non-cramjam version). Do you have numbers for all the algorithms?
Right-o, that's me being lazy. :-) Will fix.
I agree; there is a TODO in there for this. I was using a macro, as it's basically the same code, but there are subtle differences in some places that made it annoying enough for me to put it in a different PR/issue; it'll be functionally the same.
I would think that since most (all?) of the existing Python variants at least take …
Just saving time/space; with snappy there is hardly a difference, and I gave gzip as an example of using the …
Agreed for the input, but we might choose to consistently return bytearrays. Just a thought.
Wouldn't that be somewhat unexpected behavior when things like …
That's a fair argument.
@martindurant
Here's the start of it.

Problem:

The timings I showed earlier used the pre-allocated `PyBytes::new_with`, which allowed writing the de/compressed bytes into a single allocated buffer handed back to Python. However, one cannot know the exact length of the de/compression output, so we were stuck with dangling null values at the end of the byte string output. And it seemed quite impossible at the moment to pass the `Vec` allocated on the Rust side directly to Python without creating a custom Python-exported Rust type which would govern the ownership of those bytes; resizing the resulting byte string also seemed ill-advised.

Solution?:

We can use `bytearray`/`PyByteArray` for the Python/Rust types respectively, as well as `bytes`/`PyBytes`. If passed `bytes`, you'd see the current benchmark timings; if passed `bytearray`, you'd see the benchmarks below. This is because a `bytearray` can be resized from Rust, so we only need to over-allocate to the estimated size of the de/compressed bytes and then resize down to the actual size, keeping a single allocation for the output. Is that acceptable, do you think? If so, I'll update the remaining variants, which should further improve their performance as well. And if, in the future, a way is worked out to avoid double-allocating buffers when using `bytes` objects, that improvement would require no external API changes.
Benchmarks