[BUG] very long tail task is observed when many tasks are contending for PrioritySemaphore #11573

Closed
binmahone opened this issue Oct 9, 2024 · 1 comment · Fixed by #11574
Labels: bug (Something isn't working)

@binmahone (Collaborator)

In some of our customer queries, very long tail tasks are observed when many tasks are contending for PrioritySemaphore, taking as long as 3 hours to finish (the whole stage lasts about 3 hours as well).
A long tail task occupies its CPU slot while doing nothing, which can harm CPU resource utilization.
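To see why contention on a priority semaphore can produce this pattern, consider a wait queue ordered by priority alone, with no stable tie-break among equal priorities. The sketch below is a hypothetical toy model, not the plugin's actual PrioritySemaphore (Waiter, the arrival field, and the newest-first tie-break are illustrative assumptions); it makes the unfairness deterministic so the starvation is visible:

import scala.collection.mutable

// Toy model of an unfair wait queue: priority is the primary key, and the
// secondary key deliberately favors the NEWEST arrival, mimicking an
// ordering with no stable tie-break among equal priorities.
object StarvationSketch {
  final case class Waiter(taskId: Long, priority: Long, arrival: Long)

  def main(args: Array[String]): Unit = {
    // Max-heap: higher priority first; among equal priorities, newest first.
    val unfair = Ordering.by((w: Waiter) => (w.priority, w.arrival))
    val waiters = mutable.PriorityQueue.empty[Waiter](unfair)

    waiters.enqueue(Waiter(taskId = 0, priority = 0, arrival = 0))
    for (round <- 1L to 5L) {
      // A fresh task with the same priority arrives before each permit grant.
      waiters.enqueue(Waiter(taskId = round, priority = 0, arrival = round))
      val winner = waiters.dequeue() // always the newest arrival, never task 0
      println(s"round $round: permit granted to task ${winner.taskId}")
    }
    println(s"still waiting: ${waiters.iterator.map(_.taskId).mkString(", ")}")
  }
}

Under sustained contention the oldest waiter is overtaken on every round, which is exactly the shape of the multi-hour tail described above.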

This bug can be reproduced with the following test code:

echo "reproduce long tail problem, at aggv3 latest" && bin/spark-shell    \
       --master 'local[16]'  --driver-memory 20g  --conf spark.rapids.sql.concurrentGpuTasks=2  \
       --conf spark.celeborn.client.shuffle.compression.codec=zstd --conf spark.io.compression.codec=zstd \
       --conf spark.rapids.memory.pinnedPool.size=10G --conf spark.rapids.memory.host.spillStorageSize=40G \
       --conf spark.sql.files.maxPartitionBytes=2g \
       --conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf  spark.rapids.sql.metrics.level='DEBUG' \
       --conf spark.eventLog.enabled=true \
       --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
       --conf spark.celeborn.master.endpoints=10.19.129.151:9097 \
       --jars /home/hongbin/develop/spark-3.2.1-bin-hadoop2.7/rapids_jars/fresh.jar -i query_1009_long_tail_semaphore.scala  2>&1 | tee spill_`date +'%Y-%m-%d-%H-%M-%S'`.output

where query_1009_long_tail_semaphore.scala contains:

spark.conf.set("spark.rapids.sql.agg.singlePassPartialSortEnabled", false)

spark.time(
  spark.range(0, 9000000000L, 1, 100)
    .selectExpr(
      "cast(CAST(rand(0) * 100000000000 AS LONG) DIV 1 as string) as id",
      "id % 2 as data")
    .groupBy("id")
    .agg(count(lit(1)), avg(col("data")))
    .orderBy("id")
    .show())

System.exit(0)

The long tail tasks can be seen in the snapshot below:

[Screenshot: Spark UI showing the long tail tasks in the stage timeline]

@binmahone binmahone added ? - Needs Triage Need team to review and classify bug Something isn't working labels Oct 9, 2024
@binmahone binmahone self-assigned this Oct 9, 2024
@binmahone (Collaborator, Author)

This issue is fixed by #11574 + #11587

binmahone added a commit that referenced this issue Oct 11, 2024

    avoid long tail tasks due to PrioritySemaphore (#11574)

    * use task id as tie breaker
    * save threadlocal lookup
    * address Jason's comment

    Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>
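The "use task id as tie breaker" change above can be illustrated with the same toy model: keep priority as the primary key, but break ties in favor of the lower (earlier) task id, making the ordering total so the oldest waiter wins ties. This is a minimal sketch of the idea under those assumptions, not the actual patch in #11574:

import scala.collection.mutable

// Same toy model, with a task-id tie-break: among equal priorities the
// LOWER (earlier) task id wins, so no waiter can be overtaken forever.
object TieBreakSketch {
  final case class Waiter(taskId: Long, priority: Long)

  def main(args: Array[String]): Unit = {
    // Max-heap: higher priority first; ties go to the lower task id.
    val fair = Ordering.by((w: Waiter) => (w.priority, -w.taskId))
    val waiters = mutable.PriorityQueue.empty[Waiter](fair)

    waiters.enqueue(Waiter(taskId = 0, priority = 0))
    for (round <- 1L to 5L) {
      waiters.enqueue(Waiter(taskId = round, priority = 0))
      val winner = waiters.dequeue() // the oldest waiter now wins each tie
      println(s"round $round: permit granted to task ${winner.taskId}")
    }
  }
}

With the secondary key in place, the grant order becomes FIFO among equal priorities, so each task's wait is bounded by the number of tasks ahead of it rather than by the arrival rate of new tasks.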
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 11, 2024