Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem #36353

Closed
wants to merge 2 commits into from

Conversation

Yikun
Copy link
Member

@Yikun Yikun commented Apr 26, 2022

What changes were proposed in this pull request?

Generates a new dataframe instead of operating inplace in setitem

Why are the changes needed?

Make CI passed in with pandas 1.4.3

Since pandas 1.4.0 pandas-dev/pandas@03dd698 , dataframe.setitem should always make a copy and never write into the existing array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI test with current pandas (1.3.x) and latest pandas 1.4.2, 1.4.3

@Yikun
Copy link
Member Author

Yikun commented Apr 26, 2022

This is still a WIP PR now, only make ci passed with panda 1.4.x.

- If we want to follow pandas behavior, we might want to revert databricks/koalas#1592
- If we only want to still keep this, we might only need this test fix PR.

@Yikun
Copy link
Member Author

Yikun commented Apr 26, 2022

I haven't got enough info to see why we added databricks/koalas#1592 before (performance or follow pandas old behavior).

cc @HyukjinKwon @ueshin Maybe you could give some input or suggestion in here.

@HyukjinKwon
Copy link
Member

I think this needs @ueshin's look

@ueshin
Copy link
Member

ueshin commented Apr 26, 2022

I haven't got enough info to see why we added databricks/koalas#1592 before (performance or follow pandas old behavior).

That's to follow the pandas behavior at that time and seems like it's still necessary.

The examples in the description of databricks/koalas#1592 are still valid.
E.g.,:

>>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

If we revert the change, it won't work anymore.

Have you tried to revert the changes except for the tests and run the whole tests?
I guess some tests fail, especially some in test_series.py.

If there are behavior changes in pandas, we should fix our behavior.

@Yikun
Copy link
Member Author

Yikun commented Apr 27, 2022

@ueshin OK, thanks for your info, I will take an another look soon.

@itholic
Copy link
Contributor

itholic commented May 24, 2022

Just confirming, any update on this ?

@Yikun
Copy link
Member Author

Yikun commented May 24, 2022

@itholic We might want to do some special process when calling _update_internal_frame for pandas 1.4.x. Will update today.

@Yikun Yikun changed the title [SPARK-38946][PYTHON] Fix different behavior in setitem [SPARK-38946][PYTHON][PS] Never operate inplace in eval/update/fillna/setitem since pandas 1.4 May 24, 2022
@Yikun Yikun changed the title [SPARK-38946][PYTHON][PS] Never operate inplace in eval/update/fillna/setitem since pandas 1.4 [SPARK-38946][PYTHON][PS] Never operate inplace in dataframe inplace operations since pandas 1.4 May 24, 2022
@Yikun Yikun changed the title [SPARK-38946][PYTHON][PS] Never operate inplace in dataframe inplace operations since pandas 1.4 [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in df.eval/update/fillna/setitem May 24, 2022
@Yikun Yikun marked this pull request as ready for review May 24, 2022 11:13
@Yikun Yikun marked this pull request as draft May 24, 2022 13:23
@Yikun Yikun marked this pull request as ready for review May 24, 2022 13:44
@Yikun
Copy link
Member Author

Yikun commented May 24, 2022

Test with Panda 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20

All UT passed, but doctest failed due to other unrelated failed.

This PR is ready for review.

@HyukjinKwon @xinrong-databricks @itholic

python/pyspark/pandas/frame.py Outdated Show resolved Hide resolved
@xinrong-meng
Copy link
Member

The renaming is so much better, thanks Yikun! LGTM.

@HyukjinKwon
Copy link
Member

Let me defer to @ueshin

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition to generate a new dataframe seems a bit more complex?

I can still see the behavior difference:

>>> pd.__version__
'1.4.2'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pdf.fillna(0, inplace=True)
>>> pdf
     x    y
0  0.0  0.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  0.0
5  6.0  6.0
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64

whereas:

>>> psdf = ps.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> psser = psdf.x
>>> psdf.fillna(0, inplace=True)
>>> psdf
     x    y
0  0.0  0.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  0.0
5  6.0  6.0
>>> psser
0    NaN
1    2.0
2    3.0
3    4.0
4    NaN
5    6.0
Name: x, dtype: float64

In this case, pandas seems to not make a copy and reuse the underlying array.
I'm not sure whether this is a pandas bug or not, though.

python/pyspark/pandas/frame.py Outdated Show resolved Hide resolved
python/pyspark/pandas/frame.py Outdated Show resolved Hide resolved
# SPARK-38946: Since Spark 3.4, inplace set generate a new dataframe to follow
# pandas 1.4 behaviors
if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
self.assert_eq(psser, pser)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is a common test with the old pandas?
Shall we move it out of if LooseVersion(pd.__version__) >= LooseVersion("1.4"):?

Copy link
Member Author

@Yikun Yikun Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we follow latest behavior, so before 1.3 will raise not equal. added

python/pyspark/pandas/tests/test_dataframe.py Show resolved Hide resolved
python/pyspark/pandas/tests/test_dataframe.py Show resolved Hide resolved
python/pyspark/pandas/tests/test_dataframe.py Show resolved Hide resolved
@Yikun
Copy link
Member Author

Yikun commented Jun 1, 2022

The condition to generate a new dataframe seems a bit more complex?

@ueshin Yes, it is really yes from your example, the setitem has behavior influence on several functions and change the final or part of behaviors for these functions (but pandas not mention this behavior change). Anyway, I will raise a issue on Pandas comunity to get the detail attitude for these beahavior changes.

@Yikun
Copy link
Member Author

Yikun commented Jun 1, 2022

pandas-dev/pandas#47188

@Yikun Yikun changed the title [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in df.eval/update/fillna/setitem [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem Jun 23, 2022
@Yikun
Copy link
Member Author

Yikun commented Jun 23, 2022

According to pandas-dev/pandas#47188 and pandas-dev/pandas#47449 , we should only address set_item in here, so just skip test for oldest 1.4.x pandas.

@Yikun
Copy link
Member Author

Yikun commented Jun 24, 2022

@ueshin ready for review, it would be good if you could find some time. : )

@Yikun Yikun force-pushed the SPARK-38946 branch 2 times, most recently from cd1a689 to a8da2e1 Compare July 8, 2022 08:03
@Yikun
Copy link
Member Author

Yikun commented Jul 8, 2022

Let me do a brief conclusion to help review:

I also test 1.4.2 CI with this PR: Yikun#113, all tests passed, so we can upgrade pandas to 1.4.2 in CI after this PR merged.

So, it's ready for review!

@HyukjinKwon
Copy link
Member

I will leave it to @ueshin

@Yikun
Copy link
Member Author

Yikun commented Jul 14, 2022

just a rebase because we update numpy version in infra, and add a line in migration guide,


pser = pdf.z
psser = psdf.z
pdf.fillna(0, inplace=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a testcase which @ueshin mentioned before

psdf.fillna({("x", "b"): -2, "x": -1}), pdf.fillna({("x", "b"): -2, "x": -1})
)
# See also: https://github.com/pandas-dev/pandas/issues/47649
if LooseVersion("1.4.3") != LooseVersion(pd.__version__):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test to make sure 1.4.3 pass. it doesn't casue by this patch, it is a regression since pandas 1.4.3.

@Yikun
Copy link
Member Author

Yikun commented Aug 9, 2022

All test passed and

Test passed with Pandas 1.4.2:
Yikun#148

Test passed with Pandas 1.4.3:
Yikun#147

@ueshin Would you mind take a look? Thanks!

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for working with pandas community.

@Yikun
Copy link
Member Author

Yikun commented Aug 11, 2022

@ueshin Thanks for review!

@Yikun
Copy link
Member Author

Yikun commented Aug 11, 2022

@HyukjinKwon Mind to take another look?

@Yikun
Copy link
Member Author

Yikun commented Aug 17, 2022

Just a rebase

@HyukjinKwon
Copy link
Member

Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Aug 20, 2022
### What changes were proposed in this pull request?
Change `requires_same_anchor` to `check_same_anchor` for `_update_internal_frame` func

### Why are the changes needed?
There were some discussion in #36353 (comment) , this parameter is a flag for whether checking the same anchor.

Consider it's a parameter of internal use function, it can be renamed safely.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #37585 from Yikun/SPARK-39310.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Sep 7, 2022
### What changes were proposed in this pull request?
Upgrade pandas to 1.4.4 in infra and doc gen

### Why are the changes needed?
https://pandas.pydata.org/docs/whatsnew/v1.4.4.html

Especially, fix bugs which mentioned in #36353:
- Fixed regression in DataFrame.fillna() not working on a DataFrame with a MultiIndex (GH47649)
- Fixed regression in DataFrame.eval() creating a copy when updating inplace (GH47449)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #37810 from Yikun/pandas-1.4.4.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants