-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1883: [Python] Fix handling of metadata in to_pandas when not all columns are present #1386
ARROW-1883: [Python] Fix handling of metadata in to_pandas when not all columns are present #1386
Conversation
…ll columns are present
The travis failure looks unrelated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Thanks for reporting and fixing! Couple of minor things to address before merging.
python/pyarrow/pandas_compat.py
Outdated
raw_name = col_meta['name'] | ||
if i >= n_columns: | ||
# index columns | ||
raw_name = pandas_metadata['index_columns'][i - n_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pull pandas_metadata['index_columns']
out into a variable?
@@ -1221,6 +1221,21 @@ def test_array_from_pandas_typed_array_with_mask(self, t, data, expected): | |||
assert pa.Array.from_pandas(expected, | |||
type=pa.list_(t())).equals(result) | |||
|
|||
def test_table_column_subset_metadata(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add 2 tests: 1 against a datetime index and 1 against another index that isn't datetime? We need to make sure the i >= ncolumns
condition is tested.
You'll need to pass preserve_index=True
to pa.Table.from_pandas()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was actually playing with such frames interactively testing it out, but forgot to add to the actual tests.
preserve_index=True
is the default, but I can add it for explicitness if you want
Thanks for the review, added some more tests. |
How does this patch align with #1397? |
@wesm I believe @jorisvandenbossche said he was going to rebase on top of #1397 when that's merged |
@wesm yes, and I have time to that today (or tomorrow) once the other is merged |
a800b5a
to
ea891b2
Compare
I updated this now #1397 is merged. But, I could not really simplify the code to just use the new |
For reference, the test file is written as:
(similar code to construct the dataframe is in the test to create the expected result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We should probably remove the backwards compatibility code after some statute of limitations has passed (1 or 2 major releases) so we aren't maintaining it forever. Having this release available will give people the ability to "fix" their files if needed
To make this cleaner, I could also move all back compat-handling code to a function that takes a metadata object and rewrites it in a new form, so the rest of the code doesn't need to care about it, and it is easier to remove later on. If you would be interested in that, I can do that here, or in a new PR (as that is not release critical) |
No need to rock the boat here, I will merge this once the Appveyor build runs |
thanks @jorisvandenbossche!! |
You're welcome! And happy to have a first patch to arrow :) |
This closes ARROW-1883.
So basically what I did in
_add_any_metadata
was replacingcol = table[i]
with:to check that the column is actually present in the schema. However, that involved some more code to get to
raw_name
(the name how the column is present in the schema), as this does not always match the name inpandas_metadata['column'][..]['name']
. Not sure if there is a better way to get that name.(or if it would be better to filter
pandas_metadata
earlier on, instead of checking when actually trying to process the metadata of that column)