
Reduce allocations & accept bytes and bytearray inputs #22

Merged: 12 commits into master from reduce-allocations, Feb 15, 2021

Conversation

@milesgranger (Owner) commented Feb 13, 2021

@martindurant

Here's the start of it. πŸƒβ€β™‚οΈ

Problem: πŸ““

The timings I showed earlier used the pre-allocated PyBytes::new_with, which allowed writing the de/compressed bytes into a single allocated buffer handed back to Python. However, one cannot know the exact length of the de/compression output, so we were stuck with dangling null values in the byte string output, i.e.:

>>> snappy_compress(b'bytes')
b'\xff\x06\x00\x00sNaPpY\x01\t\x00\x00\xb5\x8b\xa8\xdbbytes\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

And it seems quite impossible at the moment to pass the Vec allocated on the Rust side directly to Python without creating a custom Rust type, exported to Python, which would govern the ownership of those bytes; resizing the resulting byte string also seemed ill-advised.

Solution?: ✏️

We can accept bytearray (PyByteArray on the Rust side) as well as bytes (PyBytes). If passed bytes, you'd see the current benchmark timings; if passed bytearray, you'd see the benchmarks below. This is because a bytearray can be resized from Rust, so we only need to over-allocate to the estimated size of the de/compressed output and then resize to the actual length, keeping a single allocation for the bytes. Do you think that's acceptable? If so, I'll update the remaining variants, which should further improve their performance as well.
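The key property, sketched in pure Python (independent of cramjam; the codec write is faked here): a bytearray can be shrunk in place once the actual output length is known, whereas a bytes object is immutable and would keep the trailing nulls shown above.

```python
# Over-allocate to an estimated worst-case output size.
buf = bytearray(32)

# Pretend the codec wrote 5 bytes into the front of the buffer.
buf[:5] = b"bytes"
written = 5

# A bytearray can be trimmed in place to the real length.
del buf[written:]
assert buf == bytearray(b"bytes")

# bytes has no equivalent: trimming means allocating a brand-new object.
padded = bytes(32)
trimmed = padded[:written]  # fresh allocation + copy
```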

Thus, if a way is later worked out to avoid double-allocating buffers when using bytes objects, the improvement would require no external API changes.

Benchmarks πŸ‹οΈ

--------------------------------------------------------------------------------------------------------- benchmark: 24 tests ---------------------------------------------------------------------------------------------------------
Name (time in us)                                         Min                    Max                  Mean              StdDev                Median                 IQR            Outliers          OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_snappy[Mark.Twain-Tom.Sawyer.txt-cramjam]        67.3280 (4.51)        132.1710 (3.13)        71.2730 (4.50)       7.1238 (3.52)        68.6950 (4.45)       1.7293 (8.12)      653;883  14,030.5583 (0.22)       5961           1
test_snappy[Mark.Twain-Tom.Sawyer.txt-snappy]         52.7370 (3.53)        109.9360 (2.60)        55.6604 (3.52)       5.2637 (2.60)        53.8750 (3.49)       0.8910 (4.18)    1399;2602  17,966.1021 (0.28)      14300           1
test_snappy[alice29.txt-cramjam]                     292.9650 (19.60)       463.7460 (10.97)      307.4758 (19.42)     25.5281 (12.60)      295.1015 (19.12)     18.5890 (87.27)     313;207   3,252.2885 (0.05)       2798           1
test_snappy[alice29.txt-snappy]                      600.6720 (40.19)       926.6740 (21.92)      648.9096 (40.98)     63.7229 (31.45)      625.4750 (40.53)     52.1817 (244.98)    176;151   1,541.0468 (0.02)       1487           1
test_snappy[asyoulik.txt-cramjam]                    311.4210 (20.84)       448.3640 (10.61)      321.6205 (20.31)     14.8783 (7.34)       313.8235 (20.34)     11.6220 (54.56)     351;301   3,109.2548 (0.05)       3026           1
test_snappy[asyoulik.txt-snappy]                     532.7030 (35.65)       794.6070 (18.80)      551.8453 (34.85)     35.5147 (17.53)      536.2410 (34.75)     21.4005 (100.47)      90;90   1,812.1019 (0.03)       1321           1
test_snappy[fireworks.jpeg-cramjam]                   42.1390 (2.82)        121.4490 (2.87)        47.2472 (2.98)       6.3711 (3.14)        45.1470 (2.93)       2.7930 (13.11)   2082;2469  21,165.2932 (0.34)      13886           1
test_snappy[fireworks.jpeg-snappy]                    14.9440 (1.0)          42.2700 (1.0)         15.8342 (1.0)        2.0263 (1.0)         15.4320 (1.0)        0.3100 (1.46)    1478;2811  63,154.3509 (1.0)       37304           1
test_snappy[geo.protodata-cramjam]                   108.2500 (7.24)        203.1130 (4.81)       116.1689 (7.34)      12.4194 (6.13)       112.2710 (7.28)       6.2570 (29.38)     682;798   8,608.1584 (0.14)       5617           1
test_snappy[geo.protodata-snappy]                    143.4240 (9.60)        271.0080 (6.41)       152.2461 (9.62)      12.5653 (6.20)       147.0180 (9.53)       7.7725 (36.49)     727;737   6,568.3114 (0.10)       5828           1
test_snappy[html-cramjam]                            147.4370 (9.87)        282.2160 (6.68)       168.3078 (10.63)     22.2651 (10.99)      157.7880 (10.22)     21.6680 (101.73)    905;552   5,941.4963 (0.09)       5653           1
test_snappy[html-snappy]                             156.2900 (10.46)       311.5290 (7.37)       172.9241 (10.92)     20.5035 (10.12)      162.7250 (10.54)     16.7462 (78.62)     610;456   5,782.8828 (0.09)       4459           1
test_snappy[html_x_4-cramjam]                        158.2180 (10.59)       273.7340 (6.48)       163.7128 (10.34)      9.7208 (4.80)       159.6970 (10.35)      4.5480 (21.35)     338;362   6,108.2588 (0.10)       3200           1
test_snappy[html_x_4-snappy]                         633.3160 (42.38)       895.1340 (21.18)      663.2439 (41.89)     35.0721 (17.31)      654.7660 (42.43)     39.6860 (186.32)     205;73   1,507.7409 (0.02)       1534           1
test_snappy[kppkn.gtb-cramjam]                       203.7440 (13.63)       371.9820 (8.80)       224.6235 (14.19)     24.4348 (12.06)      213.2410 (13.82)     22.7395 (106.76)    547;336   4,451.8939 (0.07)       3793           1
test_snappy[kppkn.gtb-snappy]                        503.5790 (33.70)       761.1290 (18.01)      554.5705 (35.02)     46.7447 (23.07)      539.8105 (34.98)     46.4450 (218.05)    327;152   1,803.1971 (0.03)       1866           1
test_snappy[lcet10.txt-cramjam]                      283.4160 (18.97)       483.2450 (11.43)      294.2689 (18.58)     19.0632 (9.41)       285.3810 (18.49)     10.8567 (50.97)     338;345   3,398.2519 (0.05)       2979           1
test_snappy[lcet10.txt-snappy]                     1,591.0730 (106.47)    2,333.1600 (55.20)    1,694.1689 (106.99)   115.2361 (56.87)    1,661.4130 (107.66)    93.0265 (436.74)      56;40     590.2599 (0.01)        461           1
test_snappy[paper-100k.pdf-cramjam]                   48.3040 (3.23)        107.3130 (2.54)        50.5630 (3.19)       5.2656 (2.60)        48.9860 (3.17)       0.7368 (3.46)    1181;1873  19,777.3166 (0.31)      13659           1
test_snappy[paper-100k.pdf-snappy]                    19.8800 (1.33)         58.6310 (1.39)        21.0168 (1.33)       2.5887 (1.28)        20.4370 (1.32)       0.2130 (1.0)     1420;4174  47,580.9635 (0.75)      27046           1
test_snappy[plrabn12.txt-cramjam]                    520.6660 (34.84)       758.5590 (17.95)      536.5952 (33.89)     28.4559 (14.04)      523.2170 (33.90)     19.7690 (92.81)     159;115   1,863.6023 (0.03)       1682           1
test_snappy[plrabn12.txt-snappy]                   2,311.4000 (154.67)    3,600.0620 (85.17)    2,573.6428 (162.54)   303.2057 (149.63)   2,484.4860 (161.00)   258.8528 (>1000.0)     70;54     388.5543 (0.01)        421           1
test_snappy[urls.10K-cramjam]                        476.6270 (31.89)    12,832.7190 (303.59)     660.0966 (41.69)    338.7897 (167.19)     693.0820 (44.91)    160.3975 (753.04)        5;5   1,514.9299 (0.02)       1808           1
test_snappy[urls.10K-snappy]                       1,934.0540 (129.42)    3,372.8020 (79.79)    2,345.1850 (148.11)   377.9098 (186.50)   2,155.1240 (139.65)   720.5043 (>1000.0)     127;0     426.4056 (0.01)        367           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@martindurant commented Feb 14, 2021

It may be worth asking pyo3 whether they want to expose PyBytes_Resize specifically to make this simpler (EDIT: I see the conversation now where it explains this is actually a reallocation). Probably the real way to do this is with the buffer protocol, which is also referenced extensively in the pyo3 issue.

Failing that, bytearray would be totally fine; it's a drop-in in most places in Python where you would use a bytes object (or memoryview).

Note that for parquet, we always know the uncompressed output size. If we refine for this, I expect everything can be made to work fine with the current code; in fact better, since we never over-allocate, and we don't have to ask the decompressor how long the output might need to be. Obviously this is the case for decompression only.

@martindurant

Another nice feature might be a decompress_into, for the case where the output is already allocated, e.g. a numpy array. Am I asking too much? These kinds of nice-to-haves can make a huge difference to performance, but are rarely seen in any compression library, at least as seen from Python.

@milesgranger (Owner, Author) commented Feb 14, 2021

I'm uncertain about the buffer protocol; from my understanding, it would mean exposing a Rust type to Python which would own the de/compressed bytes, and that would significantly alter the API. But I can dedicate more time to seeing what details lie beneath.

Anyway, the points I gather are needed to complete this desired speed-up:

  • All compression variants should accept bytes or bytearray.
  • Add an optional parameter where, from Python, one can specify the exact length of the de/compressed buffer, allowing both PyBytes and PyByteArray to do a single allocation for the output.

Would this work for you?
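The second bullet can be sketched in pure Python. Here zlib stands in for the Rust codec, and output_len is the proposed (not yet final) parameter name; only the allocation pattern is the point:

```python
import zlib

def decompress(data, output_len=None):
    # With output_len known up front (e.g. from parquet metadata),
    # allocate exactly once at the final size; otherwise over-allocate
    # to an estimate and shrink in place afterwards.
    estimate = output_len if output_len is not None else 2 * len(data)
    result = zlib.decompress(data)           # stand-in for the Rust codec
    n = len(result)
    buf = bytearray(max(estimate, n))        # single output allocation
    buf[:n] = result
    if output_len is None:
        del buf[n:]                          # trim in place to the real length
    return buf

payload = zlib.compress(b"x" * 100)
assert decompress(payload, output_len=100) == bytearray(b"x" * 100)
```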

As a side note, I don't think decompress_into would be very difficult: we just need to pass the pointer and length of the buffer to Rust, and from there we can simply hand the resulting slice to the underlying de/compression variant (but I guess that's a separate feature/issue from this). And I'm assuming it's then not purely for performance reasons? Whether it's Python/numpy doing the allocation or Rust is really splitting hairs; albeit, unless one wants to re-use a buffer??
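The decompress_into idea, sketched in pure Python with a memoryview standing in for the pointer-and-length pair handed to Rust (decompress_into and the zlib stand-in codec are illustrative, not cramjam's actual API):

```python
import zlib

def decompress_into(data, out):
    # 'out' is any writable buffer the caller pre-allocated
    # (bytearray, numpy array, mmap, ...).
    view = memoryview(out)
    result = zlib.decompress(data)   # stand-in for the Rust codec
    n = len(result)
    if n > view.nbytes:
        raise ValueError("output buffer too small")
    view[:n] = result                # write in place; no new output allocation
    return n                         # number of bytes written

payload = zlib.compress(b"hello world")
buf = bytearray(64)                  # re-usable, caller-owned buffer
n = decompress_into(payload, buf)
assert bytes(buf[:n]) == b"hello world"
```

Returning the number of bytes written lets the caller slice the buffer without the library ever resizing or copying it.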

Interesting point about parquet output size. Is this only for snappy, or is that regardless of compression type in parquet?

@martindurant commented Feb 14, 2021 via email

@milesgranger (Owner, Author) commented Feb 15, 2021

@martindurant You are welcome to review if you'd like, but as it is now:

  • Every variant is now in its own module, e.g. cramjam.snappy_compress is now cramjam.snappy.compress
  • Every regular compress/decompress variant supports output_len for both bytes/bytearray
  • Each accepts bytes and bytearray
  • Snappy is much faster using bytearray even without output_len; with bytes and output_len=None it matches the old benchmarks (pretty much as fast, but always slightly behind)

I've made a separate issue for de/compressing into a preallocated buffer #23

@milesgranger milesgranger marked this pull request as ready for review February 15, 2021 14:40
@martindurant

That's excellent news!

Obviously, the inline comment docs will also need to be updated, including all of the variants that now exist. Actually, there is repeated code between the algorithms that could be simplified. So do I understand that if you pass bytes, you get bytes back, and if you pass bytearray, you get bytearray back? I'm not certain there's a good reason for ever having bytes anymore.

I notice you redid the gzip benchmarks with the size parameter (although I'm not sure what it means for the non-cramjam version). Do you have numbers for all the algorithms?

@milesgranger (Owner, Author)

the inline comment docs will also need to be updated, including all of the variants that now exist

Right-o, that's me being lazy :-) Will fix.

Actually, there is repeated code between the algorithms that could be simplified

I agree; there is a TODO in there for this. I was using a macro, as it's basically the same code, but there are subtle differences in some places that made it annoying enough for me to put it in a different PR/issue; it'll be functionally the same.

So do I understand that if you pass bytes, you get bytes back, and if you pass bytearray, you get bytearray back?
I'm not certain there's a good reason for ever having bytes anymore.

I would think that since most (all?) of the existing Python variants at least take bytes, we should keep support for both. And bytes is, as I understand it, what Python returns by default when reading file objects in binary mode.

I notice you redid the gzip benchmarks with the size parameter (although I'm not sure what it means for the non-cramjam version). Do you have numbers for all the algorithms?

Just saving time/space; with snappy there is hardly a difference, and gzip served as an example of using output_len. I could go through all the variants, I suppose, but I'm going to punt and put it into another issue. As for what it means for the non-cramjam version: good question. It's not a supported parameter there, so I wanted to make it clear that it's an option supported here, and probably not a fair comparison against the existing gzip.

@martindurant

I would think that since most (all?) of the existing Python variants at least take bytes, we should keep support for both. And bytes is, as I understand it, what Python returns by default when reading file objects in binary mode.

Agreed for the input - but we might choose to consistently return bytearrays. Just a thought.

@milesgranger (Owner, Author)

Wouldn't that be somewhat unexpected behavior, when things like gzip.compress(b'bytes') give back bytes? I was hoping we could eventually serve as a 'drop-in' replacement for some variants.

@martindurant

That's a fair argument

@milesgranger milesgranger changed the title Reduce allocations on Rust side Reduce allocations & accept bytes/bytearray inputs Feb 15, 2021
@milesgranger milesgranger changed the title Reduce allocations & accept bytes/bytearray inputs Reduce allocations & accept bytes and bytearray inputs Feb 15, 2021
@milesgranger milesgranger merged commit a7c41df into master Feb 15, 2021
@milesgranger milesgranger deleted the reduce-allocations branch February 15, 2021 19:49