Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAS7BDAT parser: Speed up RLE/RDC decompression #47405

Merged
merged 22 commits into from
Oct 3, 2022

Conversation

jonashaag
Copy link
Contributor

Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

Also adds a bunch of assert statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.

Alternatives considered:

  • Fast NumPy array creation: Didn't find a way to do it.

  • Using Python's bytearray: Much slower.

  • Using array.array: Much slower. Cython has a fast path but it is incompatible with PyPy.

  • closes #xxxx (Replace xxxx with the Github issue number)

  • Tests added and passed if fixing a bug or adding a new feature

  • All code checks passed.

  • Added type annotations to new arguments/methods/functions.

  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jonashaag
Copy link
Contributor Author

@jbrockmendel mind reviewing this as well?

@jbrockmendel
Copy link
Member

@jonashaag thanks for your patience; im just coming off of a semi-vacation, starting to dig into the ping backlog now.


from pandas import read_sas

ROOT = Path(__file__).parents[3] / "pandas" / "tests" / "io" / "sas" / "data"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is a style choice orthogonal to the rest of the PR? no real problem with it, but in general best to minimize these to make it easier to focus on the important bits

Copy link
Contributor Author

@jonashaag jonashaag Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASV file would’ve been very confusing if I left the old code because my additions can’t use the old code and then we’d end up with two almost identical but different versions.

# cython: profile=False
# cython: boundscheck=False, initializedcheck=False
# cython: language_level=3, initializedcheck=False
# cython: warn.undeclared=True, warn.maybe_uninitialized=True, warn.unused=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arent these the defaults?

Copy link
Contributor Author



cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0:
assert offset < buf.length, f"Out of bounds read"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these assertions get expensive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negligible, Cython compiles it to just one check + jump (no Python involved). That said, it’s not free, but leaving them out sacrifices robustness and security.

size_t length


cdef inline uint8_t buf_get(Buffer buf, size_t offset) except? 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 0 instead of -1 (which we use elsewhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because size_t is unsigned. Could also use a signed type here but it doesn’t feel right to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually I guess it makes sense to use some other value because null bytes are pretty common in SAS files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using a sentinel mixed into the return value you may be better of passing a separate error argument and checking if that is set or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this suggestion entirely. This is using the recommended Cython error signalling machinery.

Copy link
Member

@WillAyd WillAyd Jul 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would be to have a signature that looks like this:

cdef inline uint8_t buf_get(Buffer buf, size_t offset, int *error):

Then within the function do something like:

if something_bad_happened:
  *error = 1

At least in C. I'm not as familiar with Cython semantics to know how that works. The caller passes error as an argument by address (&error) and then check after the call if it was set to 1 or not

This is just a generic approach; if it matters or not in your current design comes back to whether or not a sentinel can safely be reserved or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Cython has its own error signaling that doesn't use error output variables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I see the ? making the difference here to help disambiguate an error from a valid return value

@jbrockmendel
Copy link
Member

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

@jbrockmendel
Copy link
Member

cc @WillAyd

@jonashaag
Copy link
Contributor Author

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

Essentially the call to calloc. Cython will always call into NumPy and that will be done thousands/millions of times for a SAS file.

@jreback jreback added IO SAS SAS: read_sas Performance Memory or execution speed performance labels Jul 8, 2022
@jreback jreback added this to the 1.5 milestone Jul 8, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really in love with 'custom buffer stuff' as this can cause a lot of mental overhead for code readers; but i get the perf is worth it and its not that crazy to understand. prob worth adding add comments around L20 about why and what is happening.

@jreback
Copy link
Contributor

jreback commented Jul 8, 2022

can you also add a whatsnew note

@jonashaag jonashaag mentioned this pull request Jul 9, 2022
5 tasks
@jonashaag
Copy link
Contributor Author

@jbrockmendel mind to review this? thanks! :)

@mroeschke mroeschke removed this from the 1.5 milestone Aug 22, 2022
@jbrockmendel
Copy link
Member

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

where is the ndarray creation that is so expensive? i dont have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem.

int64_t[:] column_types
int64_t[:] lengths
int64_t[:] offsets
uint8_t[:, :] byte_chunk
object[:, :] string_chunk

source = np.frombuffer(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if <Py_ssize_t>len(result) != <Py_ssize_t>result_length:
raise ValueError(f"RLE: {len(result)} != {result_length}")

return np.asarray(result)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if <Py_ssize_t>len(outbuff) != <Py_ssize_t>result_length:
raise ValueError(f"RDC: {len(outbuff)} != {result_length}\n")

return np.asarray(outbuff)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Copy link
Member

fine by me

@jonashaag
Copy link
Contributor Author

@mroeschke FYI the What's New for 1.5 already include this PR and #47403, but we haven't merged so far.

@mroeschke
Copy link
Member

Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in #47403 (comment) and I think these would be more suitable for 1.6/2.0

@jonashaag
Copy link
Contributor Author

@mroeschke can we please merge this together with #47403 and #47656

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew note for 1.6.0.rst?

@jonashaag
Copy link
Contributor Author

It’s in the other Pr

@mroeschke
Copy link
Member

It’s in the other Pr

Any particular order these PRs should be reviewed/merged? I haven't been in the loop with these PR much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatnew), I think it might be easier to review

@jonashaag
Copy link
Contributor Author

Feel free to merge in any order. I can fix any conflicts. Making separate what’s new will require a conflict resolution on each PR after each merge

@jonashaag
Copy link
Contributor Author

Code changes are independent, just the what’s new is in one PR to avoid conflicts

@mroeschke mroeschke added this to the 1.6 milestone Oct 3, 2022
@mroeschke mroeschke merged commit 053305f into pandas-dev:main Oct 3, 2022
@mroeschke
Copy link
Member

Thanks @jonashaag

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* Speed up RLE/RDC decompression

* Update tests

* ssize_t -> size_t

* Update sas.pyx

* Don't use null byte as except value

* Nit

* Simplify condition

* Review feedback

* Docstring -> comment

* Revert "Simplify condition"

This reverts commit 263aea6.

* Lint

* Speed up some Cython `except`

* Typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants