[BUG] Pandas UDF hang in Databricks #6157

Closed
viadea opened this issue Jul 29, 2022 · 3 comments · Fixed by #6166
viadea (Collaborator) commented Jul 29, 2022

Env:
Databricks 10.4 ML LTS with the 22.06 GA jar.

The sample Pandas UDF below (note: this is NOT a cuDF Pandas UDF) hangs:

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
 
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())
 
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
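
For reference, when this grouped-map UDF completes normally (e.g., on CPU Spark), each group's mean is subtracted from its values, so show() should print roughly the following (row order within groups may vary):

# Expected output when the query completes (instead, it hangs):
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+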

I also tried setting the parameters below in the Spark config before starting the cluster, but then the query fails with "no module for cudf":

 spark.rapids.sql.python.gpu.enabled true
 spark.python.daemon.module rapids.daemon_databricks
 spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.06.0.jar:/databricks/spark/python
viadea added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 29, 2022
firestarman (Collaborator) commented Aug 1, 2022

> I also tried setting the parameters below in the Spark config before starting the cluster, but then the query fails with "no module for cudf":

That is expected: the cuDF Python module is required to run cuDF UDFs.
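
For anyone hitting the same "no module for cudf" error, here is a minimal sketch (assuming an active SparkSession named spark; probe is a hypothetical helper written for this issue) to check whether the cuDF Python module is importable where the executors run:

# Sketch: probe whether `import cudf` works on an executor.
# cuDF is provided by a RAPIDS Python environment, not by the plugin jar.
def probe(_):
    try:
        import cudf
        yield "cudf " + cudf.__version__
    except ImportError as exc:
        yield "cudf missing: %s" % exc

print(spark.sparkContext.parallelize([0], numSlices=1).mapPartitions(probe).collect())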

firestarman (Collaborator) commented Aug 1, 2022

The integration tests run udf_test nightly and do not hit this issue, but I can reproduce it on a Databricks node.

firestarman (Collaborator) commented Aug 1, 2022

A quick workaround (WAR) is to set the config below:
spark.conf.set("spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled", "false")

I am still debugging it. Not sure whether the protocol was changed when pandasZeroConfConversion is true.
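
Put together with the original repro, the workaround looks like this (a sketch; the config must be set on the session before the query runs):

# Workaround: disable the Databricks zero-conf conversion path for
# groupby-apply, then run the grouped-map UDF from the repro above.
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "false")

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()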
