[SPARK-47995][INFRA][PYTHON] Refresh testing image for pyarrow 17 #47965

Closed

Contributor

@zhengruifeng zhengruifeng commented Sep 3, 2024

What changes were proposed in this pull request?

Refresh testing image for pyarrow 17

Why are the changes needed?

Currently, CI uses the cached pyarrow==15.0.2; we need to test Spark with the latest pyarrow.

Does this PR introduce any user-facing change?

No, infra only

How was this patch tested?

updated ci

Was this patch authored or co-authored using generative AI tooling?

no

Closes #46232

@github-actions github-actions bot added the BUILD label Sep 3, 2024
@github-actions github-actions bot added the INFRA label Sep 3, 2024
@zhengruifeng zhengruifeng changed the title from "[WIP][INFRA] Refresh testing image for pyarrow 17" to "[SPARK-49496][INFRA][PYTHON] Refresh testing image for pyarrow 17" Sep 3, 2024
@@ -723,7 +723,7 @@ jobs:
 # See 'ipython_genutils' in SPARK-38517
 # See 'docutils<0.18.0' in SPARK-39421
 python3.9 -m pip install 'sphinx==4.5.0' mkdocs 'pydata_sphinx_theme>=0.13' sphinx-copybutton nbsphinx numpydoc jinja2 markupsafe 'pyzmq<24.0.0' \
-ipython ipython_genutils sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
+ipython ipython_genutils sphinx_plotly_directive 'numpy==1.26.4' pyarrow pandas 'plotly>=4.8' 'docutils<0.18.0' \
Contributor Author

Pin numpy==1.26.4 to avoid test failures:

https://github.com/zhengruifeng/spark/actions/runs/10675688719/job/29589058669

The NumPy failures need more investigation.
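
A minimal sketch of how a CI step might confirm that a pin such as `numpy==1.26.4` is what actually got installed, so that a stale cached image is noticed early. The helper name `pin_satisfied` and its `lookup` parameter are hypothetical and are not part of Spark's build scripts; only `importlib.metadata.version` is a real standard-library call.

```python
from importlib import metadata


def pin_satisfied(package, pinned, lookup=metadata.version):
    """Return True if the installed version of `package` equals `pinned`.

    `lookup` is a hypothetical injection point used for illustration; by
    default it is importlib.metadata.version, which returns the installed
    distribution's version string.
    """
    try:
        return lookup(package) == pinned
    except metadata.PackageNotFoundError:
        # The package is not installed at all, so the pin cannot be satisfied.
        return False


if __name__ == "__main__":
    # Reports whether the environment matches the pin from the diff above.
    print(pin_satisfied("numpy", "1.26.4"))
```

Passing the lookup function as a parameter keeps the check testable without installing any particular package.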

Member

Hm, that sounds like a regression somewhere. We fixed it in #47083.

Member

Alright, I think the initial fix was only partial, and we need a similar fix for pandas API on Spark too. cc @xinrong-meng @itholic FYI.

Contributor Author

Interestingly, the output type of pandas itself also changes after the NumPy upgrade:

before

In [4]: import pandas as pd

In [5]: import numpy as np

In [6]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[6]: 300

In [7]: pd.__version__
Out[7]: '2.2.2'

In [8]: np.__version__
Out[8]: '1.26.4'

after

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: pd.Series([None, None, 3, 4, 5], index=[100, 200, 300, 400, 500]).first_valid_index()
Out[3]: np.int64(300)

In [4]: pd.__version__
Out[4]: '2.2.2'

In [5]: np.__version__
Out[5]: '2.1.0'
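
One likely explanation, stated here as an assumption rather than a confirmed diagnosis: NumPy 2 changed how scalars are printed (NEP 51), so a value such as `np.int64(300)` now shows its type in the repr, whereas NumPy 1.x printed it as `300`. Under that reading, the returned value is a NumPy integer in both sessions and only its display differs. The sketch below uses a hypothetical helper, `to_builtin`, to convert numeric scalars to plain Python numbers so that printed output does not depend on the NumPy version. It relies only on the standard-library `numbers` ABCs, which NumPy's integer and floating scalar types register with.

```python
import numbers


def to_builtin(value):
    """Convert a numeric scalar to a plain Python int or float.

    Hypothetical helper for illustration. Non-numeric values and bools are
    returned unchanged.
    """
    if isinstance(value, bool):
        # bool is a subclass of int; keep it as a bool.
        return value
    if isinstance(value, numbers.Integral):
        # Covers Python ints and integer types registered as Integral.
        return int(value)
    if isinstance(value, numbers.Real):
        # Covers Python floats and floating types registered as Real.
        return float(value)
    return value


if __name__ == "__main__":
    # With pandas installed, one could wrap a result like this:
    #   to_builtin(series.first_valid_index())
    print(repr(to_builtin(300)))
```

Normalizing at the boundary keeps doctest-style expected output stable, at the cost of hiding the original scalar type.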

Contributor Author

@zhengruifeng zhengruifeng Sep 9, 2024

another example:

NumPy 1.26.4

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])

In [3]: df.at[4, 'B']
Out[3]: 2

NumPy 2.1.0

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])

In [3]: df.at[4, 'B']
Out[3]: np.int64(2)
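
Where a test has to compare printed output, one option is to choose the expected text by NumPy major version. This is a rough sketch that assumes the display change happens at NumPy 2 and that the value is an `int64`, as in the two sessions above. The function name `expected_scalar_repr` is hypothetical.

```python
def expected_scalar_repr(value, numpy_version):
    """Return the text an int64 scalar is expected to print as.

    `numpy_version` is a version string such as '1.26.4' or '2.1.0'.
    Assumption: NumPy 2 and later include the type name in the repr,
    while earlier versions print the bare number.
    """
    major = int(numpy_version.split(".")[0])
    if major >= 2:
        return f"np.int64({value})"
    return str(value)


if __name__ == "__main__":
    # The two versions that appear in the sessions above.
    for version in ("1.26.4", "2.1.0"):
        print(version, expected_scalar_repr(2, version))
```

Comparing values with `==` instead of comparing printed text avoids the version check entirely, where the test allows it.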

Member

Thanks for sharing!
Also, pandas only specifies a minimum supported version of NumPy (here), similar to what we do, rather than a “recommended” version.
It’s surprising to see such changes in return results across supported NumPy versions.

Member

I didn't find any existing discussion in the pandas community on this. I'm wondering if we should raise an issue there.

Contributor Author

@zhengruifeng zhengruifeng Sep 9, 2024

Me neither; I cannot find any related documentation. Please help file a pandas issue, thanks!

Member

Sounds good, filed pandas-dev/pandas#59838

Member

@dongjoon-hyun dongjoon-hyun left a comment

Oh, is MLflow 2.16.0 ready? In the community, I've been testing this until now here.

#46232

MLflow was the blocker through 2.15.x. If you don't mind, please use SPARK-47995 instead of a new JIRA ID, because it was filed earlier. Then I'll close my PR.

Thank you for working on this, @zhengruifeng .

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-49496][INFRA][PYTHON] Refresh testing image for pyarrow 17" to "[SPARK-47995][INFRA][PYTHON] Refresh testing image for pyarrow 17" Sep 3, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

I revised the PR title to use SPARK-47995 and added Closes #46232 to the PR description.

Thank you, @zhengruifeng and @HyukjinKwon .

Merged to master.

@zhengruifeng zhengruifeng deleted the infra_refresh_test_doc branch September 4, 2024 00:25
@zhengruifeng
Contributor Author

@dongjoon-hyun thanks for taking care of it. I was not aware of that ticket, so I filed a new one :)

@dongjoon-hyun
Member

No problem at all~ Thank you for doing this. I've been waiting for this for so long. ;)

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?
Refresh testing image for pyarrow 17

### Why are the changes needed?
Currently, [CI](https://github.com/apache/spark/actions/runs/10674534002/job/29585233434) uses the cached `pyarrow==15.0.2`; we need to test Spark with the latest pyarrow.

### Does this PR introduce _any_ user-facing change?
No, infra only

### How was this patch tested?
updated ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#46232

Closes apache#47965 from zhengruifeng/infra_refresh_test_doc.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024