
SPARK-3052. Misleading and spurious FileSystem closed errors whenever a ... #1956

Closed
wants to merge 1 commit

Conversation

@sryza (Contributor) commented Aug 15, 2014

...job fails while reading from Hadoop

@ash211 (Contributor) commented Aug 15, 2014

Lowering the log level hides it, but what's the cause of these issues?

@SparkQA commented Aug 15, 2014

QA tests have started for PR 1956. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18582/consoleFull

@sryza (Contributor, Author) commented Aug 15, 2014

This occurs when an executor process shuts down while tasks are still executing (e.g. because the driver disassociated or an OutOfMemoryError occurred).

Hadoop FileSystems register a shutdown hook to close themselves. RecordReaders get closed in a finally block after the tasks they're used in complete.

So there's a race between these two, and I can't think of a good way to make one execute after the other. I'm a little confused as to why the HadoopRDD finally block is running at all. Some googling seems to indicate that finally blocks don't run during a System.exit(), and I would think a shutdown hook would run after that anyway. So I can't claim to have 100% understanding of what's going on here. Spark isn't closing the FileSystem on its own.

More generally, I think logging a warning is overkill on a reader close error.
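To make the race concrete, here is a rough Scala sketch of the pattern being described. It is not Spark's actual HadoopRDD code, and the helper name readSplit is made up for illustration:

```scala
import java.io.IOException
import org.apache.hadoop.mapred.RecordReader

// Hypothetical illustration of the race: the reader is closed in a finally block,
// but if the executor JVM is already shutting down, Hadoop's FileSystem shutdown
// hook may have closed the shared FileSystem first, so close() throws
// "java.io.IOException: Filesystem closed".
def readSplit[K, V](reader: RecordReader[K, V]): Unit = {
  val key = reader.createKey()
  val value = reader.createValue()
  try {
    while (reader.next(key, value)) {
      // process (key, value) ...
    }
  } finally {
    try {
      reader.close() // races with the FileSystem shutdown hook during JVM exit
    } catch {
      case e: IOException =>
        // This is where the misleading warning currently gets logged.
        System.err.println(s"Exception in RecordReader.close(): $e")
    }
  }
}
```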

@ash211 (Contributor) commented Aug 15, 2014

Ah, and the order they should be shut down in is RecordReader then FileSystem?

Thanks for catching this -- I've seen it myself and was wondering why the job output seemed to be correct.

@SparkQA commented Aug 15, 2014

QA results for PR 1956:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18582/consoleFull

@sryza (Contributor, Author) commented Aug 15, 2014

> Ah and the order they should be shut down in is RecordReader then FileSystem?

Right.

@mateiz (Contributor) commented Aug 27, 2014

@sryza what's the stack trace printed here? I think it would be better to check whether we're shutting down (with Utils.inShutdown) and log a warning if we're not shutting down. A failed close() seems bad in other situations.

@sryza (Contributor, Author) commented Sep 2, 2014

Here's the exception:

java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:775)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:610)
at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks, I wasn't aware of Utils.inShutdown. I'll post a patch that uses that. I haven't yet figured out how to reliably reproduce this, so I can't verify that it will safeguard against the warning in all situations where it should, but it seems like an improvement.
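For reference, a minimal sketch of the guarded close being discussed, assuming Spark 1.x's private[spark] Utils.inShutdown() is reachable from the calling code; the helper name closeQuietlyDuringShutdown is made up, and this is an illustration of the idea rather than the exact code in the eventual commit:

```scala
import java.io.IOException
import org.apache.spark.util.Utils

// Sketch only: close the reader, but only warn about an IOException if the JVM
// is not already running its shutdown hooks. During shutdown, Hadoop's
// FileSystem hook may have closed the filesystem first, so the failure is benign.
def closeQuietlyDuringShutdown(reader: java.io.Closeable): Unit = {
  try {
    reader.close()
  } catch {
    case e: IOException =>
      if (!Utils.inShutdown()) {
        // A failed close() outside of shutdown is still worth surfacing.
        System.err.println(s"Exception in RecordReader.close(): $e")
      }
  }
}
```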

@SparkQA commented Sep 2, 2014

QA tests have started for PR 1956 at commit 815813a.

  • This patch merges cleanly.

@SparkQA commented Sep 2, 2014

QA tests have finished for PR 1956 at commit 815813a.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ByteArrayChunkOutputStream(chunkSize: Int) extends OutputStream

@sryza (Contributor, Author) commented Sep 2, 2014

I believe the failure is unrelated. I noticed it on SPARK-2461 as well.

@mateiz (Contributor) commented Sep 2, 2014

Thanks Sandy, merged this.

@asfgit closed this in 81b9d5b Sep 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… a ...

...job fails while reading from Hadoop

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#1956 from sryza/sandy-spark-3052 and squashes the following commits:

815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024