SAS7BDAT parser: Faster string parsing #47404
Conversation
string_chunk[js, current_row] = np.array(source[start:(
    start + lngt)]).tobytes().rstrip(b"\x00 ")
# Skip trailing whitespace
while lngt > 0 and source[start+lngt-1] in b"\x00 ":
source[start+lngt-1] is a uint8_t, isn't it? Am I wrong to think this looks weird?
Read this as: check if the last byte is a null byte or white space.
Yeah, I get what it means; I just suspect there's implicit casting going on here.
What kind of casting? Actually I think it doesn't matter, because both characters are on the positive side of char.
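For reference, at the Python level this comparison is well defined without any conversion: indexing a bytes-like object yields a plain int, and the `in` operator on a `bytes` object accepts an integer in range(256). A minimal sketch:

```python
# In Python 3, `x in some_bytes` accepts either a bytes subsequence
# or an int in range(256). Indexing a bytes object yields an int,
# so the membership test works with no explicit conversion.
buf = b"abc \x00\x00"

last = buf[len(buf) - 1]      # indexing bytes yields an int (here 0)
print(type(last).__name__)    # -> int
print(last in b"\x00 ")       # -> True: 0 matches the null byte
print(ord(" ") in b"\x00 ")   # -> True: 32 matches the space
print(ord("a") in b"\x00 ")   # -> False
```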
Cython will compile this to a switch
switch ((*((__pyx_t_6pandas_2io_3sas_4_sas_uint8_t const *) ( /* dim=0 */ (__pyx_v_source.data + __pyx_t_17 * __pyx_v_source.strides[0]) )))) {
  case '\x00':
  case ' ':
    __pyx_t_8 = 1;
    break;
  default:
    __pyx_t_8 = 0;
    break;
}
with
typedef unsigned char __pyx_t_6pandas_2io_3sas_4_sas_uint8_t
@@ -426,8 +426,10 @@ cdef class Parser:
    jb += 1
elif column_types[j] == column_type_string:
    # string
    string_chunk[js, current_row] = np.array(source[start:(
Is the perf bottleneck in the np.array call?
Looks like both the .tobytes and the .rstrip calls are non-optimized. Could Cython optimize it if the bytes object were declared as such?
Let me check. IIRC tobytes will never be optimized and it's very slow. The call to np.array is completely useless, so I suggest removing it in any case.
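To illustrate why the np.array() call is redundant: a uint8 buffer slice can be converted to bytes directly, and the result is identical to going through an extra array copy. A small sketch (illustrative values, not the PR's data):

```python
import numpy as np

# A stand-in for the parser's uint8 `source` buffer.
source = np.frombuffer(b"ABC \x00", dtype=np.uint8)
start, lngt = 0, 5

# The original expression: extra np.array() copy, then tobytes().
via_nparray = np.array(source[start:start + lngt]).tobytes().rstrip(b"\x00 ")

# Converting the slice to bytes directly gives the same result.
direct = bytes(source[start:start + lngt]).rstrip(b"\x00 ")

print(via_nparray == direct)  # -> True: the np.array() copy adds nothing
```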
Here are some benchmarks on 2 files, times in ms for min duration of 10 reads of each file:
- Baseline: 7.6 / 15
- Remove redundant np.array(): 6.6 / 12.7 (0.87x / 0.85x)
- Additionally declare the slice as cdef bytes: 6.5 / 12.5
- Additionally use rfind plus manual slicing instead of rstrip: 6.4 / 11.5
- My solution: 6.2 / 10.6 (0.82x / 0.71x)
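The final variant can be sketched at the Python level as follows: shrink the length past the trailing padding first, then take a single slice, so no intermediate array, tobytes(), or rstrip() objects are created. This is an illustrative sketch (in the actual Cython code `source` is a typed uint8 memoryview, not bytes):

```python
def trim_sas_string(source: bytes, start: int, lngt: int) -> bytes:
    """Return source[start:start+lngt] with trailing NUL/space
    padding removed, using one slice instead of an
    np.array().tobytes().rstrip() chain. Illustrative only."""
    # Skip trailing whitespace / NUL padding.
    while lngt > 0 and source[start + lngt - 1] in b"\x00 ":
        lngt -= 1
    return source[start:start + lngt]

print(trim_sas_string(b"HELLO   \x00\x00", 0, 10))  # -> b'HELLO'
```

In Cython the byte-wise loop compiles to a plain C loop (the switch shown above), which is why it beats the generic bytes-method calls.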
Reason for slowness is that calls to bytes.xyz() are not optimized by Cython.
Thanks for looking into this. Is it worth opening an issue in Cython about optimizing this?
Will do!
pandas/io/sas/sas.pyx (Outdated)
@@ -426,8 +426,10 @@ cdef class Parser:
    jb += 1
elif column_types[j] == column_type_string:
    # string
    string_chunk[js, current_row] = np.array(source[start:(
        start + lngt)]).tobytes().rstrip(b"\x00 ")
    # Skip trailing whitespace
Can you add some comments here (e.g. "this is like ..... but slower so we are doing xyz")? Also, do we have ASVs for this case? And can you add a whatsnew note?
I've added a whatsnew for all 3 PRs here, is that OK?
I ran against all SAS7BDAT test files in the repo; here are the ASV results.
Run 1
Run 2
Shall I add those to the ASV suite (or simply all of the files)?
thanks @jonashaag
Speed up SAS7BDAT string reading.
Today this brings a modest 10% performance improvement, but together with the other changes I will be proposing, string parsing would otherwise become a major bottleneck.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.