-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available #1408
Conversation
Can this be closed? |
My opinion is to merge this, but I had the feeling nobody else was feeling strongly in favor of it. See the top-level post for my reasoning. |
OK, if @cpcloud could take a look at this and advise (since he worked on this code most recently) I'm fine with merging |
Looking now. |
------- | ||
name : str | ||
""" | ||
if index.name is not None and index.name not in column_names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be concerned about the linear search for index.name not in column_names
? If so, let's create a set outside the loop below that we can check so that we don't need to do a full scan of the column names for every index column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some timings, and conversion to a set typically takes twice the time of a single search in the list. So you already need to have 3 index levels to benefit from this, and I don't think this is the typical use case?
So I would personally leave it as is, but can certainly also easily add the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me.
LGTM other than the comment. Should be rebased to run tests against current master. |
Rebased |
54c5bf5
to
b7e0560
Compare
LGTM |
Regarding the PR backlog, given the comments above I think there was agreement to merge this. |
@jorisvandenbossche Yep that's a good idea, I can merge on green. |
Seems like this could be a stale merge -- doesn't look like it got the ARROW-2062 patch |
I see the ARROW-2062 commit in the history of this branch: https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched upstream master just before I merged / pushed) But, it is failing on travis (amongst others, a timeout for the first (gcc) build), is that the reason you were thinking this is not up to date? |
…name if available Change-Id: I68ca058b7d038a9f30d265aeaad192d0f86757cc
It looked a lot like the failure that was happening before ARROW-2062, I triggered a new build to see if it's transient |
Hmm, still timing out on the first one (but the other failures seems resolved) |
No problem, I'm merging this, thanks @jorisvandenbossche! |
Thanks for merging! |
Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.
I don't open this PR because it should necessarily be merged, I just want to show that it is not that difficult to both fix ARROW-1754 and preserve index names as field names when possible (as this was mentioned in pandas-dev/pandas#18201 as the reason to make this change to not preserve index names).
The diff is partly a revert of #1271, but then adapted to the current codebase.
Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you would want to work with pyarrow Tables created from pandas) and 2) when interchanging parquet files with other people / other non-pandas systems, then it would be much nicer to not have
__index_level_n__
column names if possible.