Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: use public pandas API in dataframe.empty #571

Merged
merged 3 commits into from
Mar 16, 2021

Conversation

jbrockmendel
Copy link

The current usage of non-public API is breaking the docbuild on a pandas PR: https://github.com/pandas-dev/pandas/pull/40149/checks?check_run_id=2102847175#step:5:152

See also: pandas-dev/pandas#40226

A slightly more performant implementation may become possible following pandas-dev/pandas#39776

@martindurant
Copy link
Member

Thanks for taking an interest and helping out here!

Unfortunately, the existence of this function and the convoluted code it contains, is because of pandas' poor performance in forming dataframes to be assigned into. Benchmarking, I find:

# main branch
In [2]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
903 µs ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# this branch
In [3]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
43.6 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastparquet does not want this factor of 50 slowdown! I really do hope that Pandas itself creates this functionality, so that we don't have to.

@jbrockmendel
Copy link
Author

Yep, thats a pretty big slowdown. I'll see what I can do upstream, and take another try at this in the interim.

@martindurant
Copy link
Member

Thanks, @jbrockmendel !

Interestingly, I wrote about this all the way back in 2017. @jreback helped write this initially, and @TomAugspurger has helped keep this module in sync with pandas changes.

@jbrockmendel
Copy link
Author

Updated with an implementation that slightly outperforms the status quo

In [3]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
741 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- main
712 µs ± 4.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

This still accesses internals in a way that is not officially supported, but should be at least future-proof to pandas-dev/pandas#40149 (though not ArrayManager)

@martindurant
Copy link
Member

Perfect - that persuades me. I'll leave this a day or so to see if there are other ocmments.

@martindurant martindurant merged commit 0597805 into dask:main Mar 16, 2021
@martindurant
Copy link
Member

Thanks @jbrockmendel !

@jbrockmendel jbrockmendel deleted the compat-pd branch March 16, 2021 15:03
@jbrockmendel
Copy link
Author

Some of what I'm working on in pandas would be simplified if this made it into a released version of fastparquet. Is that on the horizon?

@martindurant
Copy link
Member

Not in the next day or two, but yes.

new_block = block.make_block_same_class(values=values)
# Note: this will break on any ExtensionDtype other than
# Categorical and DatetimeTZ
values = np.empty(shape=shape, dtype=bvalues.dtype)
Copy link
Author

@jbrockmendel jbrockmendel May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the fix to pandas-dev/pandas#41366 (comment) is to add here

            if not isinstance(bvalues, np.ndarray):
                # e.g. DatetimeLikeBlock backed by DatetimeArray/TimedeltaArray
                values = type(bvalues)._from_sequence(values)

i'll try to reproduce the problem locally after some caffeine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it. Happy to push a point release to fix pandas' CI, if needed.

I hope the line above doesn't make a copy!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if its a 3rd-party EA then there's no telling, but for the dt64 case this should be copy-free

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reproduced the failure locally and the edit above fixes 2 of the 8 tests. the others look pytz-related.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?

No, correctness wouldn't be affected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants