[BUG] Pandas UDF hang in Databricks #6157

Closed
viadea opened this issue Jul 29, 2022 · 3 comments · Fixed by #6166
viadea (Collaborator) commented Jul 29, 2022

Env:
Databricks 10.4 ML LTS with the 22.06 GA jar.

The sample Pandas UDF below (note: this is NOT a cuDF Pandas UDF) hangs:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
 
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())
 
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
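
For reference, when this grouped-map UDF completes normally (e.g., on CPU Spark), each group's mean is subtracted from its values, so show() should print roughly the following (row order within groups may vary):

# Expected output when the query completes (instead, it hangs):
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+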

I also tried setting the parameters below in the Spark config before starting the cluster, but then the query fails with "no module for cudf":

 spark.rapids.sql.python.gpu.enabled true
 spark.python.daemon.module rapids.daemon_databricks
 spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.06.0.jar:/databricks/spark/python
viadea added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 29, 2022
firestarman (Collaborator) commented Aug 1, 2022

> I also tried setting the parameters below in the Spark config before starting the cluster, but then the query fails with "no module for cudf":

That is expected: the cuDF Python module is required to run cuDF UDFs.
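
For anyone hitting the same "no module for cudf" error, here is a minimal sketch (assuming an active SparkSession named spark; probe is a hypothetical helper written for this issue) to check whether the cuDF Python module is importable where the executors run:

# Sketch: probe whether `import cudf` works on an executor.
# cuDF is provided by a RAPIDS Python environment, not by the plugin jar.
def probe(_):
    try:
        import cudf
        yield "cudf " + cudf.__version__
    except ImportError as exc:
        yield "cudf missing: %s" % exc

print(spark.sparkContext.parallelize([0], numSlices=1).mapPartitions(probe).collect())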

firestarman (Collaborator) commented Aug 1, 2022

The integration tests run udf_test nightly and do not hit this issue, but I can reproduce it on a Databricks node.

firestarman (Collaborator) commented Aug 1, 2022

A quick workaround (WAR) is to set the config below:
spark.conf.set("spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled", "false")

I am still debugging it. Not sure whether the protocol was changed when pandasZeroConfConversion is true.
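
Put together with the original repro, the workaround looks like this (a sketch; the config must be set on the session before the query runs):

# Workaround: disable the Databricks zero-conf conversion path for
# groupby-apply, then run the grouped-map UDF from the repro above.
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "false")

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()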
