Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.DataFrame.from_records() raises a KeyError if passed a string index and an empty iterable #47319

Conversation

Kelvingandhi
Copy link

@Kelvingandhi Kelvingandhi commented Jun 12, 2022

@pep8speaks
Copy link

pep8speaks commented Jun 12, 2022

Hello @Kelvingandhi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-24 20:23:11 UTC

@simonjayhawkins simonjayhawkins changed the title Bug/47285 from records key error BUG: pd.DataFrame.from_records() raises a KeyError if passed a string index and an empty iterable Jun 13, 2022
@simonjayhawkins
Copy link
Member

cc @phofl (from issue)

@simonjayhawkins simonjayhawkins added Bug Constructors Series/DataFrame/Index/pd.array Constructors labels Jun 13, 2022
if isinstance(index, str) or not hasattr(index, "__iter__"):
if index not in columns:
result_index = [index]
index = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This swallows KeyErrors when the iterable is populated but the column is not present.

data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_14")
print(result)

This should still raise, I think.

cc @simonjayhawkins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that maybe is another bug, see #47285 (comment) where we get NaN in the index for a missing key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main, this raises a KeyError right now, which is fine I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. if we do have data (not the empty list case) we should probably raise on missing keys

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I think so too.

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. if we do have data (not the empty list case) we should probably raise on missing keys

Yeah agree with you.
Furthermore, when we have EMPTY list and we pass some list index then currently on Main, it returns that list index as output with empty list. We want same behavior for string index as well for EMPTY list instead of raising KeyError.

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This swallows KeyErrors when the iterable is populated but the column is not present.

data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_14")
print(result)

This should still raise, I think.

cc @simonjayhawkins

As @simonjayhawkins mentioned, and I also tested it, it generates ValueError: Length of values (4) does not match length of index (1) when we've iterable populated. And if it should raise KeyError then that is another issue.

Here, in my PR, I tried to address actual issue when we have empty list ([]) and we pass as string index, It gives KeyError. But Ideally in output we want Empty list along with passed string as index.

We want Exact same behaviour when we have EMPTY list and we pass some list index then currently on Main, it returns that list index as output with empty list.

Code change I put here, it will swallow KeyError only when we have EMPTY list passed with string index. Otherwise it won't get executed.

But please correct me if I misunderstood any of your concerns in discussion. Thanks.

Screenshot from 2022-06-19 18-06-26

Screenshot from 2022-06-19 18-12-29

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My code example should still raise the same error. We do not want to change this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to fix the bug without breaking anything else.

Please also add my example as a test with the KeyError as expected

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohk. I apologize, earlier I misunderstood it. I fixed this issue and now it is behaving as expected (exactly what currently in main this raises a KeyError). I have added your example as a test as well. Thank you.

@Kelvingandhi
Copy link
Author

Kelvingandhi commented Jun 19, 2022

Hello @simonjayhawkins , Just wanted to follow up on this PR. I'm not sure, is it waiting for another approval or is it being checked or need more discussion. If you could provide your feedback, that would be appreciated.

@phofl
Copy link
Member

phofl commented Jun 19, 2022

Did you address our comments?

@Kelvingandhi
Copy link
Author

Did you address our comments?

Please check my response above.

#47319 (comment)

@simonjayhawkins
Copy link
Member

@github-actions pre-commit

@@ -491,3 +491,19 @@ def test_info_int_columns():
"""
)
assert result == expected


def test_info_from_rec():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add the test we discussed where the behavior changed initially?

Also pre-commit is failing

Copy link
Author

@Kelvingandhi Kelvingandhi Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our discussion, I fixed that issue and also added that test in test_info.py (please check lines 502-509) plus in that test case, I made correction as well suggested by @simonjayhawkins (#47319 (comment)).

Furthermore, I already have initial test case included as well (lines 498-500).

I'm unsure what else is required.

On pre-commit failure, I'm beginner here and not sure what causing it to fail (when last time it passed for earlier commits). Any help to get this resolved would be appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, missed that. Could you split that into two tests? Since technically they are testing different things.

You can run pre-commit locally and check the log. This should show the errors

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I will do that. Thank you for the response.

@Kelvingandhi
Copy link
Author

Kelvingandhi commented Jun 24, 2022

All checks are successful except these 2 which are keep getting canceled after 90 min timeout. Not sure what action is needed now.

  • Windows-MacOS / windows-latest actions-38.yaml (pull_request)
  • Windows-MacOS / windows-latest actions-310.yaml (pull_request)

@Kelvingandhi
Copy link
Author

@phofl @simonjayhawkins Have you gotten a chance to look at this 2 checks getting cancelled issue. #47319 (comment)

Comment on lines +2294 to +2303

if len(arrays) == 0:
if index is not None:
if isinstance(index, str) or not hasattr(index, "__iter__"):
if index not in columns:
# error: Incompatible types in assignment
# type "List[Any]"; expected "Optional[Index]"
result_index = [index] # type: ignore[assignment]
index = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use only one if here to be consistent with other code.

Suggested change
if len(arrays) == 0:
if index is not None:
if isinstance(index, str) or not hasattr(index, "__iter__"):
if index not in columns:
# error: Incompatible types in assignment
# type "List[Any]"; expected "Optional[Index]"
result_index = [index] # type: ignore[assignment]
index = None
if (
len(arrays) == 0
and index is not None
and (isinstance(index, str) or not hasattr(index, "__iter__"))
and index not in columns
):
# error: Incompatible types in assignment
# type "List[Any]"; expected "Optional[Index]"
result_index = [index] # type: ignore[assignment]
index = None

Copy link
Author

@Kelvingandhi Kelvingandhi Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check that if condition and earlier I had both conditions with one if but it was failing for other case where we pass this pd.DataFrame.from_records([]) (it was failing for if index is not None condition)

Basically we can’t add these conditions in one if to filter out specific edge case.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 4, 2022
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.DataFrame.from_records() raises a KeyError if passed a string index and an empty iterable
6 participants