
[BUG] test_parquet_read_round_trip hanging forever in spark 32x standalone mode #6017

Closed
pxLi opened this issue Jul 19, 2022 · 5 comments · Fixed by #6049
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

pxLi (Collaborator) commented Jul 19, 2022

Describe the bug
If parquet-hadoop-1.12.2-tests.jar exists in the classpath (related commit d84b518), test_parquet_read_round_trip hangs forever when submitted to a standalone cluster (Spark 3.2.0, 3.2.1),

and keeps dumping the following into the app/executor stdout:
stdout.log

Removing parquet-hadoop-1.12.2-tests.jar from the classpath resolves the issue.

Notes:
Running the script in Spark local mode works fine.
Standalone clusters on all other Spark versions (3.1.x, 3.3.0) work fine.
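
A quick way to confirm the test jar bundles its own logging config (a hedged check, assuming the unzip CLI is available; per the discussion below, this jar does ship a log4j.properties at its root):

# List the jar entries and look for a bundled log4j config.
unzip -l parquet-hadoop-1.12.2-tests.jar | grep -i log4j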

Steps/Code to reproduce bug

# SPARK_HOME points to a Spark 3.2.0 or 3.2.1 installation
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://$HOSTNAME:7077

export TEST_PARALLEL=0
export LOCAL_JAR_PATH=<path_to_nightly_artifacts>

SPARK_SUBMIT_FLAGS="--master spark://$HOSTNAME:7077" integration_tests/run_pyspark_from_build.sh -k test_parquet_read_round_trip

LOCAL_JAR_PATH includes:

-rw-rw-r--  1 peixinl peixinl    425468 Jul 19 00:50 parquet-hadoop-1.12.2-tests.jar
-rw-rw-r--  1 peixinl peixinl 429939804 Jul 19 00:48 rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar
-rw-rw-r--  1 peixinl peixinl     92801 Jul 19 00:47 rapids-4-spark-integration-tests_2.12-22.08.0-SNAPSHOT-spark321.jar
-rw-rw-r--  1 peixinl peixinl    186483 Jul 19 00:49 spark-avro_2.12-3.2.1.jar

Expected behavior
The test passes.

pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jul 19, 2022
pxLi (Collaborator, Author) commented Jul 19, 2022

related to #5771

tgravescs (Collaborator) commented

Note:
Spark 3.2.1 uses Parquet 1.12.2.
Spark 3.2.0 uses Parquet 1.12.1.

It's very odd to see a hang; we should try to get a thread dump to see where it's stuck or what is going on.
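
A minimal sketch for grabbing that thread dump from a standalone executor (assuming a JDK on the worker host; CoarseGrainedExecutorBackend is the executor's main class, and the variable name here is ours):

# Find the executor JVM and dump its threads to a file.
EXEC_PID=$(jps -l | grep CoarseGrainedExecutorBackend | awk '{print $1}')
jstack "$EXEC_PID" > executor-threaddump.txt

# Alternatively, SIGQUIT writes the dump into the executor's stdout log:
kill -3 "$EXEC_PID"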

sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jul 19, 2022
sameerz added the P0 (Must have for release) label on Jul 20, 2022
tgravescs (Collaborator) commented

All tasks are blocked on one task thread, which seems to be trying to do some logging:

Lock(java.util.concurrent.ThreadPoolExecutor$Worker@574319057}), Monitor(java.lang.Throwable@1417496863}), Monitor(org.apache.log4j.ConsoleAppender@16411375}), Monitor(org.apache.log4j.spi.RootLogger@390212036})

stack:

java.lang.Throwable.getStackTraceElement(Native Method)
java.lang.Throwable.getOurStackTrace(Throwable.java:828) => holding Monitor(java.lang.Throwable@1417496863})
java.lang.Throwable.getStackTrace(Throwable.java:817)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.log4j.spi.LocationInfo.<init>(LocationInfo.java:139)
org.apache.log4j.spi.LoggingEvent.getLocationInformation(LoggingEvent.java:253)
org.apache.log4j.helpers.PatternParser$LocationPatternConverter.convert(PatternParser.java:500)
org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
org.apache.log4j.PatternLayout.format(PatternLayout.java:506)
org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) => holding Monitor(org.apache.log4j.ConsoleAppender@16411375})
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
org.apache.log4j.Category.callAppenders(Category.java:206) => holding Monitor(org.apache.log4j.spi.RootLogger@390212036})
org.apache.log4j.Category.forcedLog(Category.java:391)
org.apache.log4j.Category.log(Category.java:856)
org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:273)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.log(MessageColumnIO.java:283)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.setRepetitionLevel(MessageColumnIO.java:388)
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addLong(MessageColumnIO.java:458)
org.apache.parquet.io.RecordConsumerLoggingWrapper.addLong(RecordConsumerLoggingWrapper.java:98)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeDecimalWriter$3(ParquetWriteSupport.scala:271)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeDecimalWriter$3$adapted(ParquetWriteSupport.scala:269)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1737/1868824230.apply(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:162)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1963/1827561993.apply$mcV$sp(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:480)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:162)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:152)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$Lambda$1961/297385568.apply$mcV$sp(Unknown Source)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:468)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:152)
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:181)
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:43)
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:39)
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:304)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$1957/831858247.apply(Unknown Source)
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:311)
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$1319/1579026595.apply(Unknown Source)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
org.apache.spark.executor.Executor$TaskRunner$$Lambda$1001/1407820288.apply(Unknown Source)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

tgravescs (Collaborator) commented

Spark 3.3 does work; it's also using log4j2. Spark < 3.2 should not run the encryption test.
For Spark 3.2, removing the log4j.properties file from the parquet-hadoop test jar also fixes the issue, I think because it then doesn't have debug logging on.

Also note the tasks are actually making progress, but I think generating so many stack traces and so much logging just takes forever.
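
A minimal sketch of that jar workaround, assuming the zip CLI is available (it deletes the entry in place, so back the jar up first):

# Strip the bundled log4j.properties so it can no longer be picked up
# for automatic log4j configuration.
cp parquet-hadoop-1.12.2-tests.jar parquet-hadoop-1.12.2-tests.jar.bak
zip -d parquet-hadoop-1.12.2-tests.jar log4j.properties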

tgravescs (Collaborator) commented

Note: executors are loading the wrong log4j.properties:
log4j: Using URL [jar:file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/dependency/parquet-hadoop-1.12.2-tests.jar!/log4j.properties] for automatic log4j configuration.
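
One possible workaround sketch, not from this thread: pinning the executors to Spark's own log4j 1.x config via spark.executor.extraJavaOptions, so whichever log4j.properties happens to sit first on the classpath no longer matters (the config file path is an assumption; adjust for your install):

# Point log4j 1.x explicitly at a known config instead of letting it
# auto-discover the one inside parquet-hadoop-1.12.2-tests.jar.
SPARK_SUBMIT_FLAGS="--master spark://$HOSTNAME:7077 \
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:$SPARK_HOME/conf/log4j.properties" \
  integration_tests/run_pyspark_from_build.sh -k test_parquet_read_round_trip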
