SAS7BDAT parser: Faster string parsing #47404
Conversation
string_chunk[js, current_row] = np.array(source[start:(
    start + lngt)]).tobytes().rstrip(b"\x00 ")
# Skip trailing whitespace
while lngt > 0 and source[start+lngt-1] in b"\x00 ":
source[start+lngt-1] is a uint8_t, isn't it? Am I wrong to think this looks weird?
Read this as: check if the last byte is a null byte or white space.
Yeah, I get what it means; I just suspect there's implicit casting going on here.
What kind of casting? Actually I think it doesn't matter, because both characters are on the positive side of char.
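For reference, at the Python level this comparison is well defined without any conversion: indexing a bytes-like object yields a plain int, and the `in` operator on a `bytes` object accepts an integer in range(256). A minimal sketch:

```python
# In Python 3, `x in some_bytes` accepts either a bytes subsequence
# or an int in range(256). Indexing a bytes object yields an int,
# so the membership test works with no explicit conversion.
buf = b"abc \x00\x00"

last = buf[len(buf) - 1]      # indexing bytes yields an int (here 0)
print(type(last).__name__)    # -> int
print(last in b"\x00 ")       # -> True: 0 matches the null byte
print(ord(" ") in b"\x00 ")   # -> True: 32 matches the space
print(ord("a") in b"\x00 ")   # -> False
```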
Cython will compile this to a switch
switch ((*((__pyx_t_6pandas_2io_3sas_4_sas_uint8_t const *) ( /* dim=0 */ (__pyx_v_source.data + __pyx_t_17 * __pyx_v_source.strides[0]) )))) {
  case '\x00':
  case ' ':
    __pyx_t_8 = 1;
    break;
  default:
    __pyx_t_8 = 0;
    break;
}
with
typedef unsigned char __pyx_t_6pandas_2io_3sas_4_sas_uint8_t
@@ -426,8 +426,10 @@ cdef class Parser:
    jb += 1
elif column_types[j] == column_type_string:
    # string
    string_chunk[js, current_row] = np.array(source[start:(
Is the perf bottleneck in the np.array call?
Looks like both the .tobytes and the .rstrip calls are non-optimized. Could Cython optimize it if the bytes object were declared as such?
Let me check. IIRC tobytes will never be optimized and it's very slow. The call to np.array is completely useless, so I suggest removing it in any case.
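To illustrate why the np.array() call is redundant: a uint8 buffer slice can be converted to bytes directly, and the result is identical to going through an extra array copy. A small sketch (illustrative values, not the PR's data):

```python
import numpy as np

# A stand-in for the parser's uint8 `source` buffer.
source = np.frombuffer(b"ABC \x00", dtype=np.uint8)
start, lngt = 0, 5

# The original expression: extra np.array() copy, then tobytes().
via_nparray = np.array(source[start:start + lngt]).tobytes().rstrip(b"\x00 ")

# Converting the slice to bytes directly gives the same result.
direct = bytes(source[start:start + lngt]).rstrip(b"\x00 ")

print(via_nparray == direct)  # -> True: the np.array() copy adds nothing
```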
Here are some benchmarks on 2 files, times in ms for min duration of 10 reads of each file:
- Baseline: 7.6 / 15
- Remove redundant np.array(): 6.6 / 12.7 (0.87x / 0.85x)
- Additionally declare the slice as cdef bytes: 6.5 / 12.5
- Additionally use rfind plus manual slicing instead of rstrip: 6.4 / 11.5
- My solution: 6.2 / 10.6 (0.82x / 0.71x)
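The final variant can be sketched at the Python level as follows: shrink the length past the trailing padding first, then take a single slice, so no intermediate array, tobytes(), or rstrip() objects are created. This is an illustrative sketch (in the actual Cython code `source` is a typed uint8 memoryview, not bytes):

```python
def trim_sas_string(source: bytes, start: int, lngt: int) -> bytes:
    """Return source[start:start+lngt] with trailing NUL/space
    padding removed, using one slice instead of an
    np.array().tobytes().rstrip() chain. Illustrative only."""
    # Skip trailing whitespace / NUL padding.
    while lngt > 0 and source[start + lngt - 1] in b"\x00 ":
        lngt -= 1
    return source[start:start + lngt]

print(trim_sas_string(b"HELLO   \x00\x00", 0, 10))  # -> b'HELLO'
```

In Cython the byte-wise loop compiles to a plain C loop (the switch shown above), which is why it beats the generic bytes-method calls.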
Reason for slowness is that calls to bytes.xyz() are not optimized by Cython.
Thanks for looking into this. Is it worth opening an issue in Cython about optimizing this?
Will do!
pandas/io/sas/sas.pyx (Outdated)
@@ -426,8 +426,10 @@ cdef class Parser:
    jb += 1
elif column_types[j] == column_type_string:
    # string
    string_chunk[js, current_row] = np.array(source[start:(
        start + lngt)]).tobytes().rstrip(b"\x00 ")
    # Skip trailing whitespace
Can you add some comments here (e.g. "this is like ..... but slower so we are doing xyz")? Also, do we have ASVs for this case? And can you add a whatsnew note?
I've added a whatsnew for all 3 PRs here, is that OK?
I ran against all SAS7BDAT test files in the repo; here are the ASV results.
Run 1
Run 2
Shall I add those to the ASV suite (or simply all of the files)?
thanks @jonashaag
Speed up SAS7BDAT string reading.
Today this brings a modest 10% performance improvement, but together with the other changes I will be proposing, string parsing would otherwise become a major bottleneck.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.