Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1754: [Python] alternative fix for duplicate index/column name that preserves index name if available #1408

Closed

Conversation

jorisvandenbossche
Copy link
Member

Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.

I don't open this PR because it should necessarily be merged, I just want to show that it is not that difficult to both fix ARROW-1754 and preserve index names as field names when possible (as this was mentioned in pandas-dev/pandas#18201 as the reason to make this change to not preserve index names).
The diff is partly a revert of #1271, but then adapted to the current codebase.

Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you would want to work with pyarrow Tables created from pandas) and 2) when interchanging parquet files with other people / other non-pandas systems, then it would be much nicer to not have __index_level_n__ column names if possible.

@wesm
Copy link
Member

wesm commented Jan 17, 2018

Can this be closed?

@jorisvandenbossche
Copy link
Member Author

My opinion is to merge this, but I had the feeling nobody else was feeling strongly in favor of it. See the top-level post for my reasoning.

@wesm
Copy link
Member

wesm commented Jan 17, 2018

OK, if @cpcloud could take a look at this and advise (since he worked on this code most recently) I'm fine with merging

@cpcloud
Copy link
Contributor

cpcloud commented Jan 17, 2018

Looking now.

-------
name : str
"""
if index.name is not None and index.name not in column_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be concerned about the linear search for index.name not in column_names? If so, let's create a set outside the loop below that we can check so that we don't need to do a full scan of the column names for every index column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some timings, and conversion to a set typically takes twice the time of a single search in the list. So you already need to have 3 index levels to benefit from this, and I don't think this is the typical use case?
So I would personally leave it as is, but can certainly also easily add the suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me.

@cpcloud
Copy link
Contributor

cpcloud commented Jan 18, 2018

LGTM other than the comment. Should be rebased to run tests against current master.

@wesm
Copy link
Member

wesm commented Jan 18, 2018

Rebased

@cpcloud
Copy link
Contributor

cpcloud commented Jan 24, 2018

LGTM

@jorisvandenbossche
Copy link
Member Author

Regarding the PR backlog, given the comments above I think there was agreement to merge this.
There are no merge conflicts yet, but should I update with master to ensure tests are still passing?

@cpcloud
Copy link
Contributor

cpcloud commented Feb 1, 2018

@jorisvandenbossche Yep that's a good idea, I can merge on green.

@wesm
Copy link
Member

wesm commented Feb 1, 2018

Seems like this could be a stale merge -- doesn't look like it got the ARROW-2062 patch

@jorisvandenbossche
Copy link
Member Author

I see the ARROW-2062 commit in the history of this branch: https://github.com/jorisvandenbossche/arrow/commits/index-names (I fetched upstream master just before I merged / pushed)

But, it is failing on travis (amongst others, a timeout for the first (gcc) build), is that the reason you were thinking this is not up to date?

…name if available

Change-Id: I68ca058b7d038a9f30d265aeaad192d0f86757cc
@wesm
Copy link
Member

wesm commented Feb 2, 2018

It looked a lot like the failure that was happening before ARROW-2062, I triggered a new build to see if it's transient

@jorisvandenbossche
Copy link
Member Author

Hmm, still timing out on the first one (but the other failures seems resolved)

@wesm
Copy link
Member

wesm commented Feb 2, 2018

No problem, I'm merging this, thanks @jorisvandenbossche!

@wesm wesm closed this in 5042863 Feb 2, 2018
@jorisvandenbossche jorisvandenbossche deleted the index-names branch February 2, 2018 19:50
@jorisvandenbossche
Copy link
Member Author

Thanks for merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants