Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support pyarrow UDFs for pyspark backend #9074

Closed
1 task done
jstammers opened this issue Apr 29, 2024 · 0 comments · Fixed by #9753
Closed
1 task done

feat: support pyarrow UDFs for pyspark backend #9074

jstammers opened this issue Apr 29, 2024 · 0 comments · Fixed by #9753
Labels
feature Features or general enhancements pyspark The Apache PySpark backend udf Issues related to user-defined functions
Milestone

Comments

@jstammers
Copy link
Contributor

Is your feature request related to a problem?

Pyspark now supports Arrow UDFs that facilitate efficient row-by-row executions using Arrow as a backend e.g.

import pandas as pd
from pyspark.sql.functions import udf

@udf(returnType="int",useArrow=True)
def add_one(x:int) -> int:
    return x + 1

#Create column using pyarrow-udf
df = pd.DataFrame({"a":[1,2,3]})
dfs = spark.createDataFrame(df)
dfs.withColumn("b", add_one("a")).show()

However, the equivalent function using ibis raises a NotImplementedError, because only Pandas-based vectorized UDFs are supported

import ibis
from ibis import _

@ibis.udf.scalar.pyarrow
def add_one_pyarrow(x:int) -> int:
    return x + 1

@ibis.udf.scalar.pandas
def add_one_pandas(x:int) -> int:
    return x + 1

con = ibis.pyspark.connect(spark)
con.create_table("df", df, format="delta", overwrite=True)

table = con.table("df")
table.mutate(b=add_one_pandas(_.a)).execute()
table.mutate(b=add_one_pyarrow(_.a)).execute() #raises NotImpletmentedError

What is the motivation behind your request?

Pandas-based UDFs are not supported for the DuckDB backend, but Arrow-based ones are. For my use case, I would like to ensure parity between using either backend as much as possible, so being able to use Arrow-based UDFs on a pyspark table would be very useful

Describe the solution you'd like

I'd like a solution that would allow me to use an Arrow-based UDF on a pyspark table

What version of ibis are you running?

9.0.0.dev686

What backend(s) are you using, if any?

Pyspark

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jstammers jstammers added the feature Features or general enhancements label Apr 29, 2024
@gforsyth gforsyth added udf Issues related to user-defined functions pyspark The Apache PySpark backend labels Apr 29, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Aug 2, 2024
@cpcloud cpcloud added this to the 9.3 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements pyspark The Apache PySpark backend udf Issues related to user-defined functions
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants