feat: support pyarrow UDFs for pyspark backend #9074

jstammers · 2024-04-29T08:11:56Z

Is your feature request related to a problem?

PySpark now supports Arrow-optimized Python UDFs (useArrow=True), which make row-by-row execution more efficient by using Arrow as a backend, e.g.:

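import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

# Assumes a local session; reuse an existing `spark` if one is already available
spark = SparkSession.builder.getOrCreate()

@udf(returnType="int", useArrow=True)
def add_one(x: int) -> int:
    return x + 1

# Create a column using the Arrow-optimized UDF
df = pd.DataFrame({"a": [1, 2, 3]})
dfs = spark.createDataFrame(df)
dfs.withColumn("b", add_one("a")).show()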

However, the equivalent function using ibis raises a NotImplementedError, because the PySpark backend only supports pandas-based vectorized UDFs:

import ibis
from ibis import _

@ibis.udf.scalar.pyarrow
def add_one_pyarrow(x:int) -> int:
    return x + 1

@ibis.udf.scalar.pandas
def add_one_pandas(x:int) -> int:
    return x + 1

# Reuses `spark` and `df` from the PySpark example above
con = ibis.pyspark.connect(spark)
con.create_table("df", df, format="delta", overwrite=True)

table = con.table("df")
table.mutate(b=add_one_pandas(_.a)).execute()
table.mutate(b=add_one_pyarrow(_.a)).execute()  # raises NotImplementedError
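
Until the backend supports Arrow UDFs, one possible interim workaround is to try the Arrow-based expression first and fall back to the pandas-based one when NotImplementedError is raised. The sketch below is only an illustration: `first_supported` is a hypothetical helper, not part of ibis or PySpark, and it assumes the error surfaces when the expression is executed, as in the snippet above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def first_supported(*attempts: Callable[[], T]) -> T:
    """Return the result of the first attempt that does not raise NotImplementedError.

    `first_supported` is a hypothetical helper, not part of ibis or PySpark.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    last_error = None
    for attempt in attempts:
        try:
            return attempt()
        except NotImplementedError as exc:
            # This UDF flavor is unsupported on the current backend; try the next one.
            last_error = exc
    # Every attempt was unsupported: surface the last error to the caller.
    raise last_error


# Usage with the objects from the snippet above
# (left commented out because it needs a live Spark session):
# result = first_supported(
#     lambda: table.mutate(b=add_one_pyarrow(_.a)).execute(),
#     lambda: table.mutate(b=add_one_pandas(_.a)).execute(),
# )
```

Each attempt is wrapped in a lambda so that both building and executing the expression happen inside the try block. Other exception types are deliberately not caught, so genuine failures still propagate.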

What is the motivation behind your request?

Pandas-based UDFs are not supported by the DuckDB backend, but Arrow-based ones are. For my use case, I would like to keep the two backends as close to parity as possible, so being able to use Arrow-based UDFs on a PySpark table would be very useful.

Describe the solution you'd like

I'd like a solution that would allow me to use an Arrow-based UDF on a PySpark table.

What version of ibis are you running?

9.0.0.dev686

What backend(s) are you using, if any?

Pyspark

Code of Conduct

I agree to follow this project's Code of Conduct


jstammers added the feature Features or general enhancements label Apr 29, 2024

github-project-automation bot added this to Ibis planning and roadmap Apr 29, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Apr 29, 2024

gforsyth added udf Issues related to user-defined functions pyspark The Apache PySpark backend labels Apr 29, 2024

jstammers mentioned this issue Aug 2, 2024

feat(pyspark): add support for pyarrow and python UDFs #9753

Merged

cpcloud closed this as completed in #9753 Aug 2, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Aug 2, 2024

cpcloud added this to the 9.3 milestone Aug 5, 2024
