only interchange necessary columns #4286

Merged 5 commits on Jul 21, 2023

MarcoGorelli
Contributor

Trying to address this comment: #3901 (comment)

If it's not a wide-form plot, interchange only the columns that are needed.
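To make the idea concrete, here is a minimal sketch of "select only what the plot references before converting". It is not the code from this PR: `StubInterchangeFrame` and `needed_columns` are hypothetical names, and the stub only mimics the `column_names()` and `select_columns_by_name()` methods that the interchange object returned by `__dataframe__()` is expected to provide.

```python
class StubInterchangeFrame:
    """Hypothetical stand-in for the object returned by df.__dataframe__()."""

    def __init__(self, data):
        self._data = dict(data)

    def column_names(self):
        return list(self._data)

    def select_columns_by_name(self, names):
        # Return a new frame holding only the requested columns.
        return StubInterchangeFrame({n: self._data[n] for n in names})


def needed_columns(args, available):
    """Return the available column names referenced by args, in frame order."""
    referenced = set()
    for value in args.values():
        if isinstance(value, str):
            referenced.add(value)
        elif isinstance(value, (list, tuple)):
            referenced.update(v for v in value if isinstance(v, str))
    # Strings that are not column names (e.g. a title) are filtered out here.
    return [name for name in available if name in referenced]


frame = StubInterchangeFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
args = {"x": "a", "y": "b", "hover_data": ["d"], "title": "demo"}
subset = frame.select_columns_by_name(needed_columns(args, frame.column_names()))
print(subset.column_names())  # ['a', 'b', 'd']
```

Only the reduced frame would then be handed to the (potentially expensive) conversion step; column `c` is never copied.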

MarcoGorelli marked this pull request as ready for review on July 19, 2023, 17:13
@MarcoGorelli
Contributor Author

🤔 bit confused by the CI failure, it fails on "install chrome driver"?

@alexcjohnson
Collaborator

@LiamConnors would you mind making another "pin chrome" PR in this repo? (Not in this PR but so this and other PRs can update and succeed again!)

@nicolaskruchten
Contributor

Wow, I'm excited that someone is biting the bullet on this one, thank you!

Would be nice for this to be reused for the jankier to_pandas() path as well if possible too, for the shorter term :)

@LiamConnors
Contributor

> @LiamConnors would you mind making another "pin chrome" PR in this repo? (Not in this PR but so this and other PRs can update and succeed again!)

Opened a PR here to fix it: #4288 @alexcjohnson

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jul 19, 2023

> Would be nice for this to be reused for the jankier to_pandas() path as well if possible too, for the shorter term :)

Sure, but there's no guarantee of what the API to do that would be, right? Before having called to_pandas, the object could in theory be anything, with any API to select columns by name. (I might be missing something though, sorry)

df_pandas = df_not_pandas.to_pandas()
args["data_frame"] = df_pandas
args["data_frame"] = df_not_pandas.__dataframe__()
columns = args["data_frame"].column_names()
Collaborator

Can we be 100% sure anything returned by __dataframe__() will have column_names and select_columns_by_name methods? If there's any chance an object will come in with either of these missing we should fall back on interchanging the whole thing up front.

Contributor Author

Collaborator

Oh I know they're in the spec, but I also know not everyone follows a spec to the letter 😉

Contributor Author

this can help highlight shortcomings in their implementation then 😉 I tried it out with polars and it works fine there

Collaborator

I guess we'll know how to respond when we see:
AttributeError: 'MyDataFrame' object has no attribute 'select_columns_by_name'
And in principle you're right that it's not our problem, but we'll be the ones responding to the issue and having to tell our users "don't use this dataframe directly until they fix it." Whereas if we caught this case explicitly we could emit a warning like "This dataframe only partially implements the dataframe interchange protocol. Falling back on a slower full-copy algorithm" so it wouldn't affect usage in px, only performance, and it would be clear where the issue needs to be raised.
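The fallback described above could look roughly like the following sketch. This is an illustration under assumptions, not the code that was merged: `restrict_columns`, `PartialFrame`, and `FullFrame` are hypothetical names, and the stubs only imitate the two protocol methods under discussion. The warning text is the one proposed in the comment.

```python
import warnings

FALLBACK_MESSAGE = (
    "This dataframe only partially implements the dataframe interchange "
    "protocol. Falling back on a slower full-copy algorithm."
)


def restrict_columns(xchg, needed):
    """Return a frame limited to `needed` when possible, else the whole frame."""
    if hasattr(xchg, "select_columns_by_name"):
        return xchg.select_columns_by_name(list(needed))
    # Partial implementation: keep working, but tell the user why it is slower.
    warnings.warn(FALLBACK_MESSAGE, UserWarning, stacklevel=2)
    return xchg


class PartialFrame:
    """Hypothetical partial implementation: column_names() only."""

    def __init__(self, data):
        self._data = dict(data)

    def column_names(self):
        return list(self._data)


class FullFrame(PartialFrame):
    """Hypothetical implementation that also supports column selection."""

    def select_columns_by_name(self, names):
        return FullFrame({n: self._data[n] for n in names})


data = {"a": [1], "b": [2], "c": [3]}
print(restrict_columns(FullFrame(data), ["a", "c"]).column_names())  # ['a', 'c']

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    whole = restrict_columns(PartialFrame(data), ["a", "c"])
print(whole.column_names(), len(caught))  # ['a', 'b', 'c'] 1
```

With this shape, a partially conforming dataframe still plots; only performance suffers, and the warning points at where the issue should be raised.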

Contributor Author

thanks for explaining - OK I've added a condition so it'll only use select_columns_by_name if that attribute is present

Collaborator

Nice. I took a look at also adding a fallback for missing column_names and that would be pretty awkward... but if someone has a partial implementation of the protocol presumably column_names is an easy piece so would get included early, whereas select_columns_by_name could be trickier. So let's leave it as you have it now. Thanks!

Contributor Author

yeah if they don't have column_names then from_dataframe wouldn't work either, as it uses that internally

https://github.com/pandas-dev/pandas/blob/92792ec063031ae41443dabeb9d12f8daaac3ef1/pandas/core/interchange/from_dataframe.py#L112

@nicolaskruchten
Contributor

> Sure, but there's no guarantee of what the API to do that would be, right? Before having called to_pandas, the object could in theory be anything, with any API to select columns by name. (I might be missing something though, sorry)

Heh, no, I think it's me that's forgotten that this is exactly why we have the data-interchange protocol, you're right ;)

Collaborator

@alexcjohnson left a comment

💃 Great work @MarcoGorelli, lovely tests!
