REF: use public pandas API in dataframe.empty #571

jbrockmendel · 2021-03-13T18:56:37Z

The current usage of non-public API is breaking the docbuild on a pandas PR: https://github.com/pandas-dev/pandas/pull/40149/checks?check_run_id=2102847175#step:5:152

See also: pandas-dev/pandas#40226

A slightly more performant implementation may become possible following pandas-dev/pandas#39776

martindurant · 2021-03-15T17:12:58Z

Thanks for taking an interest and helping out here!

Unfortunately, this function and the convoluted code it contains exist because of pandas' poor performance in forming dataframes to be assigned into. Benchmarking, I find:

# main branch
In [2]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
903 µs ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# this branch
In [3]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
43.6 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastparquet does not want this factor of 50 slowdown! I really do hope that Pandas itself creates this functionality, so that we don't have to.
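For context, a preallocation routine restricted to public pandas API might look roughly like the sketch below. This is not the code from either branch: `empty_public` and its column-naming scheme are hypothetical, the dtype strings are a simplified subset of those in the benchmark, and the extra array construction and copying done by the `DataFrame` constructor is only assumed to be where the slowdown comes from.

```python
import numpy as np
import pandas as pd


def empty_public(dtypes, size, cats=None):
    """Hypothetical helper: preallocate a frame using only public pandas API.

    dtypes: list of numpy dtype strings, or "category"
    cats:   optional {column position: number of categories}
    """
    cats = cats or {}
    columns = {}
    for i, dt in enumerate(dtypes):
        name = str(i)  # illustrative column names: "0", "1", ...
        if dt == "category":
            n = cats.get(i, 0)
            # A code of -1 marks a missing value, so it is valid for any n.
            codes = np.full(size, -1, dtype="i1")
            columns[name] = pd.Categorical.from_codes(
                codes, categories=pd.RangeIndex(n)
            )
        else:
            # Uninitialised storage; values are meant to be overwritten later.
            columns[name] = np.empty(size, dtype=dt)
    # The constructor may copy and consolidate the inputs.
    return pd.DataFrame(columns)


df = empty_public(["i4", "f8", "M8[ns]", "category"], 1000, cats={3: 5})
print(df.dtypes)
```

Timing a helper like this against `fastparquet.dataframe.empty` with `%timeit` would be one way to see how much of the gap comes from the constructor itself.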

jbrockmendel · 2021-03-15T17:22:52Z

Yep, that's a pretty big slowdown. I'll see what I can do upstream, and take another try at this in the interim.

martindurant · 2021-03-15T17:48:14Z

Thanks, @jbrockmendel !

Interestingly, I wrote about this all the way back in 2017. @jreback helped write this initially, and @TomAugspurger has helped keep this module in sync with pandas changes.

jbrockmendel · 2021-03-15T19:09:44Z

Updated with an implementation that slightly outperforms the status quo

In [3]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
741 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- main
712 µs ± 4.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

This still accesses internals in a way that is not officially supported, but should at least be future-proof against pandas-dev/pandas#40149 (though not ArrayManager)

martindurant · 2021-03-15T19:27:33Z

Perfect - that persuades me. I'll leave this a day or so to see if there are other comments.

martindurant · 2021-03-16T12:47:01Z

Thanks @jbrockmendel !

jbrockmendel · 2021-03-28T00:09:00Z

Some of what I'm working on in pandas would be simplified if this made it into a released version of fastparquet. Is that on the horizon?

martindurant · 2021-03-29T17:44:15Z

Not in the next day or two, but yes.

jbrockmendel · 2021-05-07T15:01:41Z

fastparquet/dataframe.py

-            new_block = block.make_block_same_class(values=values)
+            # Note: this will break on any ExtensionDtype other than
+            #  Categorical and DatetimeTZ
+            values = np.empty(shape=shape, dtype=bvalues.dtype)


i think the fix to pandas-dev/pandas#41366 (comment) is to add here

if not isinstance(bvalues, np.ndarray):
    # e.g. DatetimeLikeBlock backed by DatetimeArray/TimedeltaArray
    values = type(bvalues)._from_sequence(values)

i'll try to reproduce the problem locally after some caffeine

Thanks for looking into it. Happy to push a point release to fix pandas' CI, if needed.

I hope the line above doesn't make a copy!

if it's a 3rd-party EA then there's no telling, but for the dt64 case this should be copy-free

reproduced the failure locally and the edit above fixes 2 of the 8 tests. the others look pytz-related.

If it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?

No, correctness wouldn't be affected.
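The copy question for the dt64 case can also be checked empirically rather than taken on trust. The sketch below is only an illustration: `wraps_without_copy` is a hypothetical helper, `_from_sequence` is private pandas API (used here only because it is the call proposed above), and the check assumes `np.asarray` on the wrapped array does not itself copy. The result may differ between pandas versions, so no expected output is given.

```python
import numpy as np
import pandas as pd


def wraps_without_copy(values: np.ndarray) -> bool:
    """Hypothetical helper: report whether wrapping keeps a view on `values`."""
    # Private API, mirroring the `type(bvalues)._from_sequence(values)` call
    # suggested in the thread for the datetime64 case.
    wrapped = pd.arrays.DatetimeArray._from_sequence(values)
    # If the two share memory, writes into `values` reach the wrapped array.
    return bool(np.shares_memory(values, np.asarray(wrapped)))


values = np.zeros(1000, dtype="M8[ns]")
print(wraps_without_copy(values))
```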

jbrockmendel added 2 commits March 13, 2021 10:52

REF: use public pandas API in dataframe.empty

5d156ef

Address failing tests

7dc397f

REF/PERF: faster implementation, equally kludgy, somewhat more future-proof

55abb87

martindurant merged commit 0597805 into dask:main Mar 16, 2021

jbrockmendel deleted the compat-pd branch March 16, 2021 15:03

martindurant mentioned this pull request May 7, 2021

CI: Fastparquet release broke ci pandas-dev/pandas#41366

Closed
