
[BUG] test_parquet_read_round_trip hanging forever in spark 32x standalone mode #6017

Closed
pxLi opened this issue Jul 19, 2022 · 5 comments · Fixed by #6049
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

pxLi (Collaborator) commented Jul 19, 2022

Describe the bug
If parquet-hadoop-1.12.2-tests.jar exists in the classpath (related commit d84b518), test_parquet_read_round_trip hangs forever when submitted to a standalone cluster (Spark 3.2.0, 3.2.1),

and keeps dumping the following into the app/executor stdout:
stdout.log

Removing parquet-hadoop-1.12.2-tests.jar from the classpath resolves the issue.

Notes:
Running the script in Spark local mode works fine.
Standalone clusters on all other Spark versions (3.1.x, 3.3.0) work fine.
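
A quick way to confirm the test jar bundles its own logging config (a hedged check, assuming the unzip CLI is available; per the discussion below, this jar does ship a log4j.properties at its root):

# List the jar entries and look for a bundled log4j config.
unzip -l parquet-hadoop-1.12.2-tests.jar | grep -i log4j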

Steps/Code to reproduce bug

# SPARK_HOME points to a Spark 3.2.0 or 3.2.1 installation
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://$HOSTNAME:7077

export TEST_PARALLEL=0
export LOCAL_JAR_PATH=<path_to_nightly_artifacts>

SPARK_SUBMIT_FLAGS="--master spark://$HOSTNAME:7077" integration_tests/run_pyspark_from_build.sh -k test_parquet_read_round_trip

LOCAL_JAR_PATH includes:

-rw-rw-r--  1 peixinl peixinl    425468 Jul 19 00:50 parquet-hadoop-1.12.2-tests.jar
-rw-rw-r--  1 peixinl peixinl 429939804 Jul 19 00:48 rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar
-rw-rw-r--  1 peixinl peixinl     92801 Jul 19 00:47 rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark321.jar
-rw-rw-r--  1 peixinl peixinl    186483 Jul 19 00:49 spark-avro_2.12-3.2.1.jar

Expected behavior
The test passes.

pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 19, 2022
pxLi (Collaborator, Author) commented Jul 19, 2022

related to #5771

tgravescs (Collaborator) commented

Note:
Spark 3.2.1 uses Parquet 1.12.2.
Spark 3.2.0 uses Parquet 1.12.1.

It's very odd to see a hang; we should try to get a thread dump to see where it's stuck or what is going on.
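
A minimal sketch for grabbing that thread dump from a standalone executor (assuming a JDK on the worker host; CoarseGrainedExecutorBackend is the executor's main class, and the variable name here is ours):

# Find the executor JVM and dump its threads to a file.
EXEC_PID=$(jps -l | grep CoarseGrainedExecutorBackend | awk '{print $1}')
jstack "$EXEC_PID" > executor-threaddump.txt

# Alternatively, SIGQUIT writes the dump into the executor's stdout log:
kill -3 "$EXEC_PID"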

sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jul 19, 2022
sameerz added the P0 (Must have for release) label on Jul 20, 2022
tgravescs (Collaborator) commented

All tasks are blocked on one task thread, which seems to be trying to do some logging:

Lock(java.util.concurrent.ThreadPoolExecutor$Worker@574319057}), Monitor(java.lang.Throwable@1417496863}), Monitor(org.apache.log4j.ConsoleAppender@16411375}), Monitor(org.apache.log4j.spi.RootLogger@390212036})

stack:

java.lang.Throwable.getStackTraceElement(Native Method)
java.lang.Throwable.getOurStackTrace(Throwable.java:828) => holding Monitor(java.lang.Throwable@1417496863})
java.lang.Throwable.getStackTrace(Throwable.java:817)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.log4j.spi.LocationInfo.<init>(LocationInfo.java:139)
org.apache.log4j.spi.LoggingEvent.getLocationInformation(LoggingEvent.java:253)
org.apache.log4j.helpers.PatternParser$LocationPatternConverter.convert(PatternParser.java:500)
org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) => holding Monitor(org.apache.log4j.ConsoleAppender@16411375})
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
org.apache.log4j.Category.callAppenders(Category.java:206) => holding Monitor(org.apache.log4j.spi.RootLogger@390212036})
org.apache.log4j.Category.forcedLog(Category.java:391)
org.apache.log4j.Category.log(Category.java:856)
org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:273)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.log(MessageColumnIO.java:283)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.setRepetitionLevel(MessageColumnIO.java:388)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addLong(MessageColumnIO.java:458)
org.apache.parquet.io.RecordConsumerLoggingWrapper.addLong(RecordConsumerLoggingWrapper.java:98)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeDecimalWriter$3(ParquetWriteSupport.scala:271)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeDecimalWriter$3$adapted(ParquetWriteSupport.scala:269)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1737/1868824230.apply(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:162)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1963/1827561993.apply$mcV$sp(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:480)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:162)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:152)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1961/297385568.apply$mcV$sp(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:468)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:152)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:181)
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:43)
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:39)
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:304)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$1957/831858247.apply(Unknown Source)
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:311)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$1319/1579026595.apply(Unknown Source)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
org.apache.spark.executor.Executor$TaskRunner$$Lambda$1001/1407820288.apply(Unknown Source)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

tgravescs (Collaborator) commented

Spark 3.3 does work; it's also using log4j2. Spark < 3.2 should not run the encryption test.
For Spark 3.2, removing the log4j.properties file from the parquet-hadoop test jar also fixes the issue, I think because it then doesn't have debug logging on.

Also note the tasks are actually making progress, but I think generating so many stack traces and so much logging just takes forever.
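
A minimal sketch of that jar workaround, assuming the zip CLI is available (it deletes the entry in place, so back the jar up first):

# Strip the bundled log4j.properties so it can no longer be picked up
# for automatic log4j configuration.
cp parquet-hadoop-1.12.2-tests.jar parquet-hadoop-1.12.2-tests.jar.bak
zip -d parquet-hadoop-1.12.2-tests.jar log4j.properties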

tgravescs (Collaborator) commented

Note: executors are loading the wrong log4j.properties:
log4j: Using URL [jar:file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/dependency/parquet-hadoop-1.12.2-tests.jar!/log4j.properties] for automatic log4j configuration.
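
One possible workaround sketch, not from this thread: pinning the executors to Spark's own log4j 1.x config via spark.executor.extraJavaOptions, so whichever log4j.properties happens to sit first on the classpath no longer matters (the config file path is an assumption; adjust for your install):

# Point log4j 1.x explicitly at a known config instead of letting it
# auto-discover the one inside parquet-hadoop-1.12.2-tests.jar.
SPARK_SUBMIT_FLAGS="--master spark://$HOSTNAME:7077 \
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:$SPARK_HOME/conf/log4j.properties" \
  integration_tests/run_pyspark_from_build.sh -k test_parquet_read_round_trip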
