[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem #36353

Yikun · 2022-04-26T07:11:24Z

What changes were proposed in this pull request?

Generates a new dataframe instead of operating inplace in setitem

Why are the changes needed?

Make CI pass with pandas 1.4.3.

Since pandas 1.4.0 (pandas-dev/pandas@03dd698), dataframe.setitem always makes a copy and never writes into the existing array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI tests with the current pandas (1.3.x) and the latest pandas (1.4.2, 1.4.3).

Yikun · 2022-04-26T07:16:17Z

This is still a WIP PR; for now it only makes CI pass with pandas 1.4.x.

~~- If we want to follow pandas behavior, we might want to revert databricks/koalas#1592~~
~~- If we only want to still keep this, we might only need this test fix PR.~~

Yikun · 2022-04-26T07:16:46Z

I haven't got enough info to see why we added databricks/koalas#1592 before (for performance, or to follow the old pandas behavior).

cc @HyukjinKwon @ueshin Maybe you could give some input or suggestions here.

HyukjinKwon · 2022-04-26T10:17:39Z

I think this needs @ueshin's look

ueshin · 2022-04-26T22:06:59Z

> I haven't got enough info to see why we added databricks/koalas#1592 before (for performance, or to follow the old pandas behavior).

That's to follow the pandas behavior at that time and seems like it's still necessary.

The examples in the description of databricks/koalas#1592 are still valid.
E.g.:

>>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

If we revert the change, it won't work anymore.

Have you tried to revert the changes except for the tests and run the whole tests?
I guess some tests fail, especially some in test_series.py.

If there are behavior changes in pandas, we should fix our behavior.

Yikun · 2022-04-27T13:10:41Z

@ueshin OK, thanks for the info. I will take another look soon.

itholic · 2022-05-24T03:20:42Z

Just confirming: any update on this?

Yikun · 2022-05-24T04:00:23Z

@itholic We might want to do some special handling when calling _update_internal_frame for pandas 1.4.x. I will update today.

Yikun · 2022-05-24T13:45:07Z

Tested with pandas 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20

All unit tests passed, but the doctests failed due to other, unrelated failures.

This PR is ready for review.

@HyukjinKwon @xinrong-databricks @itholic

python/pyspark/pandas/frame.py

xinrong-meng · 2022-05-27T16:14:27Z

The renaming is so much better, thanks Yikun! LGTM.

HyukjinKwon · 2022-05-30T02:07:45Z

Let me defer to @ueshin

ueshin

The condition to generate a new dataframe seems a bit more complex?

I can still see the behavior difference:

>>> pd.__version__
'1.4.2'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pdf.fillna(0, inplace=True)
>>> pdf
     x    y
0  0.0  0.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  0.0
5  6.0  6.0
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64

whereas:

>>> psdf = ps.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> psser = psdf.x
>>> psdf.fillna(0, inplace=True)
>>> psdf
     x    y
0  0.0  0.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  0.0
5  6.0  6.0
>>> psser
0    NaN
1    2.0
2    3.0
3    4.0
4    NaN
5    6.0
Name: x, dtype: float64

In this case, pandas seems not to make a copy and instead reuses the underlying array.
I'm not sure whether this is a pandas bug or not, though.
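
To make the difference concrete, here is a toy sketch in plain Python. It is not pandas or pandas-on-Spark internals: `ToyFrame` and its method names are hypothetical, and `None` stands in for NaN. It only illustrates why a Series taken before an inplace call sees the update when the frame writes into the existing array, and does not when the frame swaps in new data.

```python
class ToyFrame:
    """Toy stand-in for a dataframe: a dict of column name -> list of values."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def column(self, name):
        # Hands out the very same list object, so the result acts like a view.
        return self.columns[name]

    def fillna_shared(self, value):
        # Writes into the existing lists: views taken earlier see the change.
        for values in self.columns.values():
            for i, item in enumerate(values):
                if item is None:
                    values[i] = value

    def fillna_copy(self, value):
        # Rebinds every column to a new list: views taken earlier keep old data.
        self.columns = {
            name: [value if item is None else item for item in values]
            for name, values in self.columns.items()
        }


shared = ToyFrame({"x": [None, 2, 3], "y": [None, 2, 3]})
shared_view = shared.column("x")
shared.fillna_shared(0)  # the earlier view now shows the filled values

copied = ToyFrame({"x": [None, 2, 3], "y": [None, 2, 3]})
copied_view = copied.column("x")
copied.fillna_copy(0)  # the earlier view still holds the missing value
```

In this toy model, `fillna_shared` corresponds to the pandas result above (the earlier `pser` is filled) and `fillna_copy` to the pandas-on-Spark result (the earlier `psser` still has NaN).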

python/pyspark/pandas/frame.py

ueshin · 2022-05-31T23:29:44Z

python/pyspark/pandas/tests/test_dataframe.py

+        # SPARK-38946: Since Spark 3.4, inplace set generate a new dataframe to follow
+        # pandas 1.4 behaviors
+        if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
+            self.assert_eq(psser, pser)


Seems like this is a common test with the old pandas?
Shall we move it out of if LooseVersion(pd.__version__) >= LooseVersion("1.4"):?

We follow the latest pandas behavior, so with older pandas (1.3 and before) the comparison would fail as not equal. Added.

python/pyspark/pandas/tests/test_dataframe.py

Yikun · 2022-06-01T03:59:35Z

> The condition to generate a new dataframe seems a bit more complex?

@ueshin Yes, your example shows that it really is. The setitem change influences the behavior of several functions, changing all or part of their final behavior (although pandas does not mention this behavior change). Anyway, I will raise an issue with the pandas community to find out their stance on these behavior changes.

Yikun · 2022-06-01T08:48:20Z

pandas-dev/pandas#47188

Yikun · 2022-06-23T12:25:49Z

According to pandas-dev/pandas#47188 and pandas-dev/pandas#47449, we should only address setitem here, so I just skip the tests for the older pandas 1.4.x releases.

Yikun · 2022-06-24T15:31:47Z

@ueshin This is ready for review; it would be great if you could find some time. :)

Yikun · 2022-07-08T12:54:20Z

Here is a brief summary to help with the review:

- Change setitem to make a copy: since pandas-dev/pandas@03dd698, dataframe.setitem makes a copy and never writes into the existing array, so we follow this as well.
- Skip update/fillna for 1.4.0-1.4.2: pandas-dev/pandas#47188 ("BUG: inplace behavior is inconsist for fillna") covers the update and fillna behaviors. I confirmed it has been resolved in pandas 1.4.3, so we skip the copy validation check for these versions.
- Skip eval for 1.4.0-1.4.3: eval still has a regression (pandas-dev/pandas#47449, "REGR: DataFrame.eval not respecting inplace argument in pandas 1.4"), so we skip it for 1.4.0~1.4.3.
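
The version ranges above can be sketched as a small gating helper. This is a minimal illustration, not the code in this PR: `parse_version` and `skipped_inplace_checks` are hypothetical names, and the tests in this PR use `LooseVersion` comparisons instead of the tuple parsing shown here.

```python
from itertools import takewhile


def parse_version(version):
    """Turn a string such as '1.4.2' into a comparable tuple such as (1, 4, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits, so '0rc1' becomes 0.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def skipped_inplace_checks(pandas_version):
    """Return the operations whose inplace checks are skipped for this version."""
    version = parse_version(pandas_version)
    skipped = set()
    if (1, 4, 0) <= version <= (1, 4, 2):
        # update/fillna: affected by pandas-dev/pandas#47188, resolved in 1.4.3.
        skipped.update({"update", "fillna"})
    if (1, 4, 0) <= version <= (1, 4, 3):
        # eval: affected by the regression in pandas-dev/pandas#47449.
        skipped.add("eval")
    return skipped
```

For example, `skipped_inplace_checks("1.4.3")` leaves only the eval check skipped, while any 1.3.x version skips nothing.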

I also tested CI on pandas 1.4.2 with this PR (Yikun#113) and all tests passed, so we can upgrade pandas to 1.4.2 in CI after this PR is merged.

So, it's ready for review!

HyukjinKwon · 2022-07-11T02:02:24Z

I will leave it to @ueshin

Yikun · 2022-07-14T01:48:56Z

Just a rebase, because we updated the numpy version in infra, plus an added line in the migration guide.

Yikun · 2022-08-09T07:27:07Z

python/pyspark/pandas/tests/test_dataframe.py

+
+        pser = pdf.z
+        psser = psdf.z
+        pdf.fillna(0, inplace=True)


Added a test case that @ueshin mentioned before.

Yikun · 2022-08-09T07:27:28Z

python/pyspark/pandas/tests/test_dataframe.py

-            psdf.fillna({("x", "b"): -2, "x": -1}), pdf.fillna({("x", "b"): -2, "x": -1})
-        )
+        # See also: https://github.com/pandas-dev/pandas/issues/47649
+        if LooseVersion("1.4.3") != LooseVersion(pd.__version__):


Adjusted the test to make sure it passes with pandas 1.4.3. The failure is not caused by this patch; it is a regression in pandas 1.4.3.

Yikun · 2022-08-09T22:58:23Z

All tests passed:

Tests passed with pandas 1.4.2:
Yikun#148

Tests passed with pandas 1.4.3:
Yikun#147

@ueshin Would you mind taking a look? Thanks!

ueshin

LGTM. Thanks for working with the pandas community.

Yikun · 2022-08-11T01:23:38Z

@ueshin Thanks for the review!

Yikun · 2022-08-11T12:20:14Z

@HyukjinKwon Would you mind taking another look?

Yikun · 2022-08-17T10:14:26Z

Just a rebase

HyukjinKwon · 2022-08-18T03:34:36Z

Merged to master.

### What changes were proposed in this pull request?
Change `requires_same_anchor` to `check_same_anchor` for the `_update_internal_frame` function.

### Why are the changes needed?
There was some discussion in #36353 (comment): this parameter is a flag for whether to check for the same anchor. Since it is a parameter of an internal-use function, it can be renamed safely.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #37585 from Yikun/SPARK-39310.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request?
Upgrade pandas to 1.4.4 in infra and doc gen.

### Why are the changes needed?
https://pandas.pydata.org/docs/whatsnew/v1.4.4.html

In particular, it fixes the bugs mentioned in #36353:
- Fixed regression in DataFrame.fillna() not working on a DataFrame with a MultiIndex (GH47649)
- Fixed regression in DataFrame.eval() creating a copy when updating inplace (GH47449)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #37810 from Yikun/pandas-1.4.4.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added CORE PYTHON labels Apr 26, 2022

Yikun force-pushed the SPARK-38946 branch from 45a4a1c to 5d73865 Compare May 24, 2022 07:23

Yikun changed the title ~~[SPARK-38946][PYTHON] Fix different behavior in setitem~~ [SPARK-38946][PYTHON][PS] Never operate inplace in eval/update/fillna/setitem since pandas 1.4 May 24, 2022

github-actions bot added the PANDAS API ON SPARK label May 24, 2022

Yikun changed the title ~~[SPARK-38946][PYTHON][PS] Never operate inplace in eval/update/fillna/setitem since pandas 1.4~~ [SPARK-38946][PYTHON][PS] Never operate inplace in dataframe inplace operations since pandas 1.4 May 24, 2022

Yikun mentioned this pull request May 24, 2022

[WIP][PYTHON][PS] Upgrade Pandas to 1.4.2 (current latest) #36650

Closed

Yikun force-pushed the SPARK-38946 branch from 5d73865 to b0c1efc Compare May 24, 2022 08:44

Yikun changed the title ~~[SPARK-38946][PYTHON][PS] Never operate inplace in dataframe inplace operations since pandas 1.4~~ [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in df.eval/update/fillna/setitem May 24, 2022

Yikun marked this pull request as ready for review May 24, 2022 11:13

Yikun marked this pull request as draft May 24, 2022 13:23

Yikun marked this pull request as ready for review May 24, 2022 13:44

Yikun force-pushed the SPARK-38946 branch from b0c1efc to 44e5604 Compare May 25, 2022 02:54

itholic approved these changes May 25, 2022

xinrong-meng reviewed May 27, 2022

xinrong-meng reviewed May 27, 2022

xinrong-meng approved these changes May 27, 2022

ueshin reviewed May 31, 2022

ueshin reviewed May 31, 2022

Yikun force-pushed the SPARK-38946 branch from 01777dc to 0c3ac97 Compare June 23, 2022 12:19

Yikun changed the title ~~[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in df.eval/update/fillna/setitem~~ [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem Jun 23, 2022

Yikun force-pushed the SPARK-38946 branch from 0c3ac97 to 019e958 Compare June 24, 2022 01:19

Yikun force-pushed the SPARK-38946 branch 2 times, most recently from cd1a689 to a8da2e1 Compare July 8, 2022 08:03

Yikun force-pushed the SPARK-38946 branch from a8da2e1 to 91904c0 Compare July 14, 2022 01:48

Yikun force-pushed the SPARK-38946 branch from 91904c0 to 8e2aec5 Compare August 9, 2022 07:26

ueshin approved these changes Aug 10, 2022

Yikun added 2 commits August 17, 2022 18:13

Generates a new dataframe instead of operating inplace in setitem

1e129d7

Add more test and make test pass in 1.4.3

656a296

Yikun force-pushed the SPARK-38946 branch from 8e2aec5 to 656a296 Compare August 17, 2022 10:14

HyukjinKwon closed this in 532c500 Aug 18, 2022

Yikun mentioned this pull request Aug 19, 2022

[SPARK-39310][PS] Change requires_same_anchor to check_same_anchor #37585

Closed

Yikun mentioned this pull request Sep 6, 2022

[SPARK-40356][INFRA][PS] Upgrade pandas to 1.4.4 (infra and docs) #37810

Closed
