
SAS7BDAT parser: Faster string parsing #47404

Merged (4 commits) Jul 10, 2022
Conversation

jonashaag (Contributor):

Speed up SAS7BDAT string reading.

Today this brings a modest 10% performance improvement, but together with the other changes I will be proposing, string parsing would otherwise become a major bottleneck.

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

-                    string_chunk[js, current_row] = np.array(source[start:(
-                        start + lngt)]).tobytes().rstrip(b"\x00 ")
+                    # Skip trailing whitespace
+                    while lngt > 0 and source[start+lngt-1] in b"\x00 ":
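A pure-Python stand-in for the new Cython loop (the `strip_trailing` helper is hypothetical, not the actual pandas function; `start`/`lngt` follow the names in the snippet above):

```python
# Pure-Python sketch of the trailing-byte strip in the new code path.
# `source` is the raw page buffer; `start`/`lngt` delimit one string field.
def strip_trailing(source: bytes, start: int, lngt: int) -> bytes:
    # Walk backwards past trailing NUL bytes and spaces, then slice once.
    while lngt > 0 and source[start + lngt - 1] in b"\x00 ":
        lngt -= 1
    return source[start:start + lngt]

print(strip_trailing(b"ABC \x00\x00", 0, 6))  # b'ABC'
```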
Member:

source[start+lngt-1] is a uint8_t, isn't it? Am I wrong to think this looks weird?

Contributor Author:

Read this as: check whether the last byte is a null byte or whitespace.
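For intuition, a small interpreted-Python demonstration of the semantics being discussed (in Python 3, indexing bytes yields an int, and `int in bytes` tests byte-value membership, which mirrors what the typed Cython loop does):

```python
# Indexing a bytes/buffer yields an integer, not a 1-byte bytes object.
buf = b"AB \x00"
last = buf[len(buf) - 1]
print(last)              # 0 (an int)
print(last in b"\x00 ")  # True: 0 is one of the byte values {0, 32}

# Both sentinel values are < 128, so signedness of C `char` cannot flip them.
print(all(v < 128 for v in b"\x00 "))  # True
```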

Member:

Yeah, I get what it means; I just suspect there's implicit casting going on here.

Contributor Author:

What kind of casting? Actually, I think it doesn't matter, because both characters are on the positive side of char.

Contributor Author:

Cython will compile this to a switch.

Contributor Author:

switch ((*((__pyx_t_6pandas_2io_3sas_4_sas_uint8_t const *) ( /* dim=0 */ (__pyx_v_source.data + __pyx_t_17 * __pyx_v_source.strides[0]) )))) {
    case '\x00':
    case ' ':
        __pyx_t_8 = 1;
        break;
    default:
        __pyx_t_8 = 0;
        break;
}

with

typedef unsigned char __pyx_t_6pandas_2io_3sas_4_sas_uint8_t;

@@ -426,8 +426,10 @@ cdef class Parser:
                     jb += 1
                 elif column_types[j] == column_type_string:
                     # string
-                    string_chunk[js, current_row] = np.array(source[start:(
-                        start + lngt)]).tobytes().rstrip(b"\x00 ")
Member:

Is the perf bottleneck in the np.array call?

Member:

Looks like both the .tobytes and the .rstrip calls are non-optimized. Could Cython optimize it if the bytes object were declared as such?

Contributor Author:

Let me check. IIRC, tobytes will never be optimized, and it's very slow. The call to np.array is completely useless, so I suggest removing it in any case.
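To illustrate why the np.array wrapper is redundant (a sketch of the point, not the actual pandas code): a memoryview slice can be converted to bytes directly, producing the same result as the array round-trip.

```python
import numpy as np

# Stand-in for the parser's buffer and field bounds.
source = memoryview(b"XYZ \x00padding")
start, lngt = 0, 5

# Old path: wrap the slice in an ndarray, then serialize it back to bytes.
with_array = np.array(source[start:start + lngt]).tobytes().rstrip(b"\x00 ")
# Equivalent without the detour through NumPy.
without = bytes(source[start:start + lngt]).rstrip(b"\x00 ")

print(with_array == without)  # True; both yield b'XYZ'
```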

Contributor Author:

Here are some benchmarks on 2 files, times in ms for the minimum duration of 10 reads of each file:

  • Baseline: 7.6 / 15
  • Remove redundant np.array(): 6.6 / 12.7 (0.87x / 0.85x)
  • Additionally declare the slice as cdef bytes: 6.5 / 12.5
  • Additionally use rfind plus manual slicing instead of rstrip: 6.4 / 11.5
  • My solution: 6.2 / 10.6 (0.82x / 0.71x)

Contributor Author:

The reason for the slowness is that calls to bytes.xyz() are not optimized by Cython.

Member:

Thanks for looking into this. Is it worth opening an issue in Cython about optimizing this?

Contributor Author:

Will do!


@jreback jreback added the IO SAS SAS: read_sas label Jul 8, 2022
@jreback jreback added this to the 1.5 milestone Jul 8, 2022
@jreback jreback added the Performance Memory or execution speed performance label Jul 8, 2022
@@ -426,8 +426,10 @@ cdef class Parser:
                     jb += 1
                 elif column_types[j] == column_type_string:
                     # string
-                    string_chunk[js, current_row] = np.array(source[start:(
-                        start + lngt)]).tobytes().rstrip(b"\x00 ")
+                    # Skip trailing whitespace
Contributor:

Can you add some comments here (e.g. "this is like ..... but slower, so we are doing xyz")? Also, do we have ASVs for this case? And can you add a whatsnew note?

Contributor Author:

I've added a whatsnew entry for all 3 PRs here; is that OK?

jonashaag (Contributor Author) commented Jul 9, 2022

I ran this against all the SAS7BDAT test files in the repo; here are the ASV results.

asv continuous main sas/rstrip -b sas

Run 1

       before           after         ratio
     [e915b0a4]       [e20cf659]
     <main>           <sas/rstrip>
-     8.13±0.03ms      5.02±0.02ms     0.62  io.sas.SAS.time_productsalessas7bdat
-      14.9±0.3ms      7.32±0.05ms     0.49  io.sas.SAS.time_load_logsas7bdat

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Run 2

-      8.35±0.2ms      4.96±0.02ms     0.59  io.sas.SAS.time_productsalessas7bdat
-      14.8±0.2ms      7.29±0.02ms     0.49  io.sas.SAS.time_load_logsas7bdat

Shall I add those to the ASV suite (or simply all of the files)?
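For reference, the benchmark names in the ASV output above (io.sas.SAS.time_load_logsas7bdat, io.sas.SAS.time_productsalessas7bdat) come from a class of roughly this shape; this is a simplified sketch, and the data-file paths here are assumptions, not the suite's actual paths:

```python
# Simplified sketch of an ASV benchmark class for read_sas.
# ASV times any zero-argument method whose name starts with "time_".
from pandas import read_sas

class SAS:
    def time_load_logsas7bdat(self):
        # Path is a placeholder for the test file in the pandas repo.
        read_sas("pandas/tests/io/sas/data/load_log.sas7bdat")

    def time_productsalessas7bdat(self):
        # Path is a placeholder for the test file in the pandas repo.
        read_sas("pandas/tests/io/sas/data/productsales.sas7bdat")
```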

@jreback jreback merged commit 56dc719 into pandas-dev:main Jul 10, 2022
jreback (Contributor) commented Jul 10, 2022

thanks @jonashaag

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@jonashaag jonashaag mentioned this pull request Sep 10, 2022