feat(pyspark): add support for pyarrow and python UDFs #9753

Merged: 8 commits, Aug 2, 2024

jstammers (Contributor)

Description of changes

This PR modifies the pyspark Backend so that it can support UDFs implemented in pure python or pyarrow by making use of the pyspark.sql.udf wrapper.

To successfully run the unit tests for a pyarrow UDF, I needed to ensure that the spark worker executed the correct python interpreter by setting

export PYSPARK_PYTHON=$(which python)

within my nix shell. This might require some further configuration of the nix environment in order to pass during the CI tests.
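
The same effect as the shell `export` can be sketched from Python, assuming the variable is set before the Spark session is created so that the workers pick it up. The helper name `pin_pyspark_python` is hypothetical and not part of Ibis or PySpark.

```python
import os
import sys


def pin_pyspark_python(env=None):
    """Point PYSPARK_PYTHON at the current interpreter unless it is already set."""
    env = os.environ if env is None else env
    # setdefault keeps any value the user has already exported.
    env.setdefault("PYSPARK_PYTHON", sys.executable)
    return env["PYSPARK_PYTHON"]
```

Calling `pin_pyspark_python()` at the top of a test session, before any `SparkSession` is built, is one way to keep the driver and the workers on the same interpreter.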

Issues closed

cpcloud (Member) commented Aug 2, 2024

export PYSPARK_PYTHON=$(which python)

🤔 interesting...

We don't run the full backend test suite through nix (only the backends that can be run locally with the exception of PySpark). You might have figured out the way we can now do that. I'll give it a try 🥳

cpcloud (Member) commented Aug 2, 2024

It looks like you might need to xfail the UDF tests for the earlier version of PySpark that we test against (3.3.3).

cpcloud (Member) commented Aug 2, 2024

Sweet, setting PYSPARK_PYTHON does allow pyspark to work in the nix environment!

I'll put up a separate PR to add that to our shellHook so everyone using nix can pick it up.

jstammers (Contributor, Author)

Thanks for that spot @cpcloud - I'd mistakenly assumed that it was implemented in a previous version. The pyspark docs weren't exactly clear on that. I've pushed some changes that should hopefully check that the pyarrow UDFs fail for pyspark < 3.5.
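
The version check described above could look roughly like the sketch below. The 3.5 threshold is taken from this thread. The helper name `supports_pyarrow_udfs` is hypothetical and is not the code merged in this PR.

```python
def supports_pyarrow_udfs(pyspark_version: str) -> bool:
    """Return True when the given PySpark version is at least 3.5."""
    # Compare only major and minor, so "3.5.0" and "3.5.1" behave the same.
    major, minor = (int(part) for part in pyspark_version.split(".")[:2])
    return (major, minor) >= (3, 5)
```

In a test suite, the result for `pyspark.__version__` could feed the condition of a `pytest.mark.xfail` marker so that the pyarrow UDF tests are expected to fail on 3.3.3.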

cpcloud (Member) left a comment

Looks good! I will commit my suggested changes and then push up any fixes needed!

cpcloud force-pushed the feat/pyspark-udfs branch from 60e0526 to edc7838 (Aug 2, 2024 15:14)
cpcloud changed the title from "feat: add support for pyarrow and python UDFs for pyspark backend" to "feat(pyspark): add support for pyarrow and python UDFs" (Aug 2, 2024)
cpcloud enabled auto-merge (squash) (Aug 2, 2024 15:22)
cpcloud added the feature, udf, and pyspark labels (Aug 2, 2024)
cpcloud merged commit 02a1d48 into ibis-project:main (Aug 2, 2024)
89 checks passed
jstammers deleted the feat/pyspark-udfs branch (Aug 2, 2024 15:49)
cpcloud added this to the 9.3 milestone (Aug 5, 2024)
Labels: feature (Features or general enhancements), pyspark (The Apache PySpark backend), udf (Issues related to user-defined functions)
Linked issue: feat: support pyarrow UDFs for pyspark backend
2 participants