-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem #36353
Conversation
This is still a WIP PR now, only make ci passed with panda 1.4.x.
|
I haven't got enough info to see why we added databricks/koalas#1592 before (performance or follow pandas old behavior). cc @HyukjinKwon @ueshin Maybe you could give some input or suggestion in here. |
I think this needs @ueshin's look |
That's to follow the pandas behavior at that time and seems like it's still necessary. The examples in the description of databricks/koalas#1592 are still valid. >>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0 0.0
1 2.0
2 3.0
3 4.0
4 0.0
5 6.0
Name: x, dtype: float64
>>> pdf
x y
0 0.0 NaN
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
4 0.0 NaN
5 6.0 6.0 If we revert the change, it won't work anymore. Have you tried to revert the changes except for the tests and run the whole tests? If there are behavior changes in pandas, we should fix our behavior. |
@ueshin OK, thanks for your info, I will take an another look soon. |
Just confirming, any update on this ? |
@itholic We might want to do some special process when calling |
Test with Panda 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20 All UT passed, but doctest failed due to other unrelated failed. This PR is ready for review. @HyukjinKwon @xinrong-databricks @itholic |
The renaming is so much better, thanks Yikun! LGTM. |
Let me defer to @ueshin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition to generate a new dataframe seems a bit more complex?
I can still see the behavior difference:
>>> pd.__version__
'1.4.2'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pdf.fillna(0, inplace=True)
>>> pdf
x y
0 0.0 0.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
4 0.0 0.0
5 6.0 6.0
>>> pser
0 0.0
1 2.0
2 3.0
3 4.0
4 0.0
5 6.0
Name: x, dtype: float64
whereas:
>>> psdf = ps.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> psser = psdf.x
>>> psdf.fillna(0, inplace=True)
>>> psdf
x y
0 0.0 0.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
4 0.0 0.0
5 6.0 6.0
>>> psser
0 NaN
1 2.0
2 3.0
3 4.0
4 NaN
5 6.0
Name: x, dtype: float64
In this case, pandas seems to not make a copy and reuse the underlying array.
I'm not sure whether this is a pandas bug or not, though.
# SPARK-38946: Since Spark 3.4, inplace set generate a new dataframe to follow | ||
# pandas 1.4 behaviors | ||
if LooseVersion(pd.__version__) >= LooseVersion("1.4"): | ||
self.assert_eq(psser, pser) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is a common test with the old pandas?
Shall we move it out of if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we follow latest behavior, so before 1.3 will raise not equal. added
@ueshin Yes, it is really yes from your example, the |
According to pandas-dev/pandas#47188 and pandas-dev/pandas#47449 , we should only address |
@ueshin ready for review, it would be good if you could find some time. : ) |
cd1a689
to
a8da2e1
Compare
Let me do a brief conclusion to help review:
I also test 1.4.2 CI with this PR: Yikun#113, all tests passed, so we can upgrade pandas to So, it's ready for review! |
I will leave it to @ueshin |
just a rebase because we update numpy version in infra, and add a line in migration guide, |
|
||
pser = pdf.z | ||
psser = psdf.z | ||
pdf.fillna(0, inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a testcase which @ueshin mentioned before
psdf.fillna({("x", "b"): -2, "x": -1}), pdf.fillna({("x", "b"): -2, "x": -1}) | ||
) | ||
# See also: https://github.com/pandas-dev/pandas/issues/47649 | ||
if LooseVersion("1.4.3") != LooseVersion(pd.__version__): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test to make sure 1.4.3 pass. it doesn't casue by this patch, it is a regression since pandas 1.4.3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for working with pandas community.
@ueshin Thanks for review! |
@HyukjinKwon Mind to take another look? |
Just a rebase |
Merged to master. |
### What changes were proposed in this pull request? Change `requires_same_anchor` to `check_same_anchor` for `_update_internal_frame` func ### Why are the changes needed? There were some discussion in #36353 (comment) , this parameter is a flag for whether checking the same anchor. Consider it's a parameter of internal use function, it can be renamed safely. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #37585 from Yikun/SPARK-39310. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? Upgrade pandas to 1.4.4 in infra and doc gen ### Why are the changes needed? https://pandas.pydata.org/docs/whatsnew/v1.4.4.html Especially, fix bugs which mentioned in #36353: - Fixed regression in DataFrame.fillna() not working on a DataFrame with a MultiIndex (GH47649) - Fixed regression in DataFrame.eval() creating a copy when updating inplace (GH47449) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #37810 from Yikun/pandas-1.4.4. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Generates a new dataframe instead of operating inplace in setitem
Why are the changes needed?
Make CI passed in with pandas 1.4.3
Since pandas 1.4.0 pandas-dev/pandas@03dd698 , dataframe.setitem should always make a copy and never write into the existing array.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI test with current pandas (1.3.x) and latest pandas 1.4.2, 1.4.3