-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#3515: Make loc with slice and 1 column return dataframe instead of series #3517
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3517 +/- ##
===========================================
- Coverage 85.66% 61.49% -24.17%
===========================================
Files 150 144 -6
Lines 15969 15544 -425
===========================================
- Hits 13680 9559 -4121
- Misses 2289 5985 +3696
Continue to review full report at Codecov.
|
9dcb9ce
to
8bdf0a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mvashishtha, left a couple of comments
8bdf0a6
to
0a8c7c5
Compare
We already have a logic of determining the proper modin/modin/pandas/indexing.py Lines 301 to 317 in 9c386b0
the bug that this PR is fixing occurs because in certain cases loc.__getitem__ flow doesn't go to the base indexer getitem but to the df.__getitem__ bypassing the logic of constructing proper output type:modin/modin/pandas/indexing.py Lines 532 to 542 in 9c386b0
This fast path to the df.__getitem__ seems redundant and more likely it's some kind of legacy. At the current moment, we can process .loc[:, something] efficiently in the natural loc.__getitem__ flow without this fast path that ruins proper type inference.
So, it seems that the proper fix for the #3515 is to remove this fast path. I already did this in #3513 PR where I'm optimizing |
@dchigarev The fastpath for |
You mean that |
Isn't this logic is hidden inside >>> df
a b c
0 1 3 5
1 2 4 4
2 3 5 2
>>> type(df["a"])
<class 'pandas.core.series.Series'>
>>> type(df[["a"]])
<class 'pandas.core.frame.DataFrame'> |
Ideally you would be correct that the performance is the same, but in practice it isn't. I commented out the fastpath code and ran the query and got 20x slowdown when using the normal
|
Agree, let's keep this fast path |
…rame instead of series. The root cause was that in cases like [this](modin-project#3515 (comment)), _LocIndexer.__getitem__ passed a list of columns instead of a scalar column to df.__get__item [here](https://github.com/modin-project/modin/blob/c2b399a0d69baa519dab67acd715061e93986b22/modin/pandas/indexing.py#L539). Signed-off-by: mvashishtha <mahesh@ponder.io>
0a8c7c5
to
74751da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of doc changes, then looks great!
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Signed-off-by: mvashishtha <mahesh@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvashishtha can you clear the conflict? That should also allow the updated commitlint to run. Thanks!
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @mvashishtha
The root cause was that in cases like this, _LocIndexer.getitem passed a list of columns instead of a scalar column to df.__get__item here.
The new test case fails without this change.
Signed-off-by: mvashishtha mahesh@ponder.io
What do these changes do?
Fix issue 3515.
flake8 modin
black --check modin
git commit -s
loc
with slice for rows and single column returns dataframe instead of series #3515docs/developer/architecture.rst
is up-to-date