SAS7BDAT parser: Speed up RLE/RDC decompression #47405

jonashaag · 2022-06-17T11:22:59Z

Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

Also adds a bunch of assert statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.
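To make the idea concrete, here is a minimal Python sketch of decoding into one preallocated output buffer with explicit bounds checks. It uses a toy run-length scheme of (count, value) byte pairs. That scheme is an assumption for illustration and not the SAS7BDAT RLE command set. `toy_rle_decode` is a hypothetical name, and the real code in this PR is Cython operating on a C-level buffer.

```python
def toy_rle_decode(src: bytes, result_length: int) -> bytes:
    """Decode (count, value) byte pairs into a buffer of known size."""
    if len(src) % 2 != 0:
        raise ValueError("Truncated input")
    out = bytearray(result_length)  # allocated once, up front
    pos = 0
    for i in range(0, len(src), 2):
        count, value = src[i], src[i + 1]
        # Refuse to write past the end of the output buffer.
        if pos + count > result_length:
            raise ValueError("Out of bounds write")
        out[pos:pos + count] = bytes([value]) * count
        pos += count
    # The decoded size must match what the caller expects.
    if pos != result_length:
        raise ValueError(f"RLE: {pos} != {result_length}")
    return bytes(out)
```

For example, `toy_rle_decode(b"\x03A\x02B", 5)` returns `b"AAABB"`, while a run that would overflow the buffer raises `ValueError` instead of writing out of bounds.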

Alternatives considered:

Fast NumPy array creation: Didn't find a way to do it.
Using Python's bytearray: Much slower.
Using array.array: Much slower. Cython has a fast path but it is incompatible with PyPy.

jonashaag · 2022-06-27T18:55:44Z

@jbrockmendel mind reviewing this as well?

jbrockmendel · 2022-07-06T20:29:38Z

@jonashaag thanks for your patience; I'm just coming off a semi-vacation and starting to dig into the ping backlog now.

jbrockmendel · 2022-07-06T20:31:08Z

asv_bench/benchmarks/io/sas.py


 from pandas import read_sas

+ROOT = Path(__file__).parents[3] / "pandas" / "tests" / "io" / "sas" / "data"


IIUC this is a style choice orthogonal to the rest of the PR? no real problem with it, but in general best to minimize these to make it easier to focus on the important bits

The ASV file would’ve been very confusing if I left the old code because my additions can’t use the old code and then we’d end up with two almost identical but different versions.
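As a quick illustration of how the `parents[3]` index in the added `ROOT` line resolves, the sketch below uses `PurePosixPath` with a hypothetical checkout at `/repo` (an assumption; only the directory layout beneath it comes from the diff).

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the layout below it matters.
bench_file = PurePosixPath("/repo/asv_bench/benchmarks/io/sas.py")

# parents[0] is the containing directory; each further index climbs one level:
# io -> benchmarks -> asv_bench -> repository root.
repo_root = bench_file.parents[3]
ROOT = repo_root / "pandas" / "tests" / "io" / "sas" / "data"

print(repo_root)
print(ROOT)
```

With this layout, `repo_root` is `/repo` and `ROOT` is `/repo/pandas/tests/io/sas/data`.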

jbrockmendel · 2022-07-06T20:31:34Z

pandas/io/sas/sas.pyx

-# cython: profile=False
-# cython: boundscheck=False, initializedcheck=False
+# cython: language_level=3, initializedcheck=False
+# cython: warn.undeclared=True, warn.maybe_uninitialized=True, warn.unused=True


aren't these the defaults?

No, see

https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L182

https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L184

https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L226

https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L228

https://github.com/cython/cython/blob/827e5188cadb006d85b31702e32993c70f909bc2/Cython/Compiler/Options.py#L229

I set language_level because I was getting warnings from Cython, and removed profile because it defaults to False.

pandas/tests/io/sas/test_sas7bdat.py

jbrockmendel · 2022-07-06T20:33:21Z

pandas/io/sas/sas.pyx

+
+
+cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0:
+    assert offset < buf.length, f"Out of bounds read"


do these assertions get expensive?

Negligible, Cython compiles it to just one check + jump (no Python involved). That said, it’s not free, but leaving them out sacrifices robustness and security.

jbrockmendel · 2022-07-06T20:33:47Z

pandas/io/sas/sas.pyx

+    size_t length
+
+
+cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0:


why 0 instead of -1 (which we use elsewhere)

Because size_t is unsigned. Could also use a signed type here but it doesn’t feel right to me.

But actually I guess it makes sense to use some other value because null bytes are pretty common in SAS files
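The declared return type of `buf_get` in the diff is `uint8_t`. A small Python check with `ctypes` (used here only as a stand-in for C integer semantics) shows why -1 is not available as a distinct error value for such a type:

```python
import ctypes

# A uint8_t holds 0..255, so -1 wraps around to an ordinary byte value.
wrapped = ctypes.c_uint8(-1).value
print(wrapped)
```

Every value an unsigned 8-bit return can take is also a legal data byte, so any sentinel will collide with real data at least occasionally.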

Rather than using a sentinel mixed into the return value you may be better off passing a separate error argument and checking if that is set or not

Not sure I understand this suggestion entirely. This is using the recommended Cython error signalling machinery.

The idea would be to have a signature that looks like this:

cdef inline uint8_t buf_get(Buffer buf, size_t offset, int *error):

Then within the function do something like:

if something_bad_happened: *error = 1

At least in C. I'm not familiar enough with Cython semantics to know how that works there. The caller passes error as an argument by address (&error) and then checks after the call whether it was set to 1 or not

This is just a generic approach; whether it matters in your current design comes back to whether a sentinel can safely be reserved

Yeah, Cython has its own error signaling that doesn't use error output variables.

Thanks for clarifying. I see the ? making the difference here to help disambiguate an error from a valid return value
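The disambiguation discussed in this thread can be emulated in plain Python. This is a sketch only: in Cython the check is done in generated C against the interpreter's error indicator, and the names `SENTINEL`, `_pending_error`, `buf_get_raw`, and `buf_get_checked` are hypothetical.

```python
SENTINEL = 0
_pending_error = None  # stands in for the interpreter's "exception set" flag


def buf_get_raw(data: bytes, offset: int) -> int:
    """Return the byte at offset, or SENTINEL with an error recorded."""
    global _pending_error
    if not 0 <= offset < len(data):
        _pending_error = IndexError("Out of bounds read")
        return SENTINEL
    return data[offset]


def buf_get_checked(data: bytes, offset: int) -> int:
    """Caller side: look at the error flag only when the sentinel comes back."""
    global _pending_error
    value = buf_get_raw(data, offset)
    if value == SENTINEL and _pending_error is not None:
        err, _pending_error = _pending_error, None
        raise err
    return value
```

A legitimate zero byte still comes back as `0`, because the error flag is not set in that case. The extra flag check runs only when the sentinel value is returned, which is why a sentinel that is rare in real data is the cheaper choice.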

jbrockmendel · 2022-07-06T20:37:04Z

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

jbrockmendel · 2022-07-06T20:39:39Z

cc @WillAyd

jonashaag · 2022-07-07T06:39:54Z

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

Essentially the call to calloc. Cython will always call into NumPy and that will be done thousands/millions of times for a SAS file.

jreback

not really in love with 'custom buffer stuff' as this can cause a lot of mental overhead for code readers; but i get the perf is worth it and it's not that crazy to understand. prob worth adding comments around L20 about why and what is happening.

jreback · 2022-07-08T23:02:32Z

can you also add a whatsnew note

jonashaag · 2022-08-08T07:34:27Z

@jbrockmendel mind reviewing this? thanks! :)

jbrockmendel · 2022-08-25T23:05:33Z

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

where is the ndarray creation that is so expensive? I don't have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem.

jonashaag · 2022-08-25T23:31:09Z

pandas/io/sas/sas.pyx

            int64_t[:] column_types
            int64_t[:] lengths
            int64_t[:] offsets
            uint8_t[:, :] byte_chunk
            object[:, :] string_chunk
-
-        source = np.frombuffer(


@jbrockmendel here

jonashaag · 2022-08-25T23:31:28Z

pandas/io/sas/sas.pyx

-    if <Py_ssize_t>len(result) != <Py_ssize_t>result_length:
-        raise ValueError(f"RLE: {len(result)} != {result_length}")
-
-    return np.asarray(result)


@jbrockmendel here

jonashaag · 2022-08-25T23:31:40Z

pandas/io/sas/sas.pyx

-    if <Py_ssize_t>len(outbuff) != <Py_ssize_t>result_length:
-        raise ValueError(f"RDC: {len(outbuff)} != {result_length}\n")
-
-    return np.asarray(outbuff)


@jbrockmendel here

jbrockmendel · 2022-08-29T18:57:40Z

fine by me

jonashaag · 2022-09-09T18:48:32Z

@mroeschke FYI the What's New for 1.5 already includes this PR and #47403, but we haven't merged so far.

mroeschke · 2022-09-12T19:10:08Z

Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in #47403 (comment) and I think these would be more suitable for 1.6/2.0

jonashaag · 2022-09-30T11:14:53Z

@mroeschke can we please merge this together with #47403 and #47656

mroeschke

Could you add a whatsnew note for 1.6.0.rst?

jonashaag · 2022-10-03T18:07:09Z

It’s in the other PR

mroeschke · 2022-10-03T18:23:18Z

It’s in the other PR

Any particular order these PRs should be reviewed/merged in? I haven't been in the loop with these PRs much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatsnew), I think it might be easier to review

jonashaag · 2022-10-03T20:37:00Z

Feel free to merge in any order. I can fix any conflicts. Making separate what’s new will require a conflict resolution on each PR after each merge

jonashaag · 2022-10-03T20:37:52Z

Code changes are independent, just the what’s new is in one PR to avoid conflicts

mroeschke · 2022-10-03T21:03:20Z

Thanks @jonashaag

* Speed up RLE/RDC decompression
* Update tests
* ssize_t -> size_t
* Update sas.pyx
* Don't use null byte as except value
* Nit
* Simplify condition
* Review feedback
* Docstring -> comment
* Revert "Simplify condition"
  This reverts commit 263aea6.
* Lint
* Speed up some Cython `except`
* Typo

Speed up RLE/RDC decompression

0e02b8d

jonashaag mentioned this pull request Jun 17, 2022

Meta issue: SAS7BDAT parser improvements #47339

Open

jonashaag added 4 commits June 17, 2022 16:02

Update tests

eca0db4

ssize_t -> size_t

041a04b

Merge branch 'main' into sas/decompress3

0451c31

Update sas.pyx

f2c8b0e

jonashaag added 3 commits July 2, 2022 18:17

Merge branch 'main' into sas/decompress3

17c72f8

Merge branch 'main' into sas/decompress3

91f8436

Merge branch 'main' into sas/decompress3

221f20c

jbrockmendel reviewed Jul 6, 2022


jbrockmendel reviewed Jul 6, 2022


jonashaag added 2 commits July 7, 2022 22:57

Don't use null byte as except value

213b08f

Nit

4b24773

jreback added the IO SAS (SAS: read_sas) and Performance (Memory or execution speed performance) labels Jul 8, 2022

jreback added this to the 1.5 milestone Jul 8, 2022

jreback approved these changes Jul 8, 2022


jonashaag mentioned this pull request Jul 9, 2022

SAS7BDAT parser: Fast byteswap #47403

Merged

5 tasks

jonashaag added 3 commits July 9, 2022 10:02

Simplify condition

263aea6

Review feedback

785f752

Docstring -> comment

1f36f99

jonashaag added 2 commits July 10, 2022 22:19

Merge branch 'main' into sas/decompress3

afdfc1c

Merge branch 'main' into sas/decompress3

6a3fd55

mroeschke removed this from the 1.5 milestone Aug 22, 2022

jonashaag commented Aug 25, 2022


jonashaag added 2 commits August 29, 2022 22:39

Merge branch 'main' into sas/decompress3

fc5621b

Merge branch 'main' into sas/decompress3

0d3daa8

jonashaag added 4 commits September 15, 2022 10:26

Merge branch 'main' into sas/decompress3

0588d18

Lint

21ba0b2

Speed up some Cython except

55cceb7

Typo

ba9b019

mroeschke reviewed Oct 3, 2022


mroeschke added this to the 1.6 milestone Oct 3, 2022

mroeschke approved these changes Oct 3, 2022


mroeschke merged commit 053305f into pandas-dev:main Oct 3, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022


