
SPARK-3052. Misleading and spurious FileSystem closed errors whenever a ... #1956

Closed
wants to merge 1 commit

Conversation

@sryza (Contributor) commented Aug 15, 2014

...job fails while reading from Hadoop

@ash211 (Contributor) commented Aug 15, 2014

Lowering the log level hides it, but what's the cause of these issues?

@SparkQA commented Aug 15, 2014

QA tests have started for PR 1956. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18582/consoleFull

@sryza (Contributor, Author) commented Aug 15, 2014

This occurs when an executor process shuts down while tasks are still executing (e.g. because the driver disassociated or an OutOfMemoryError occurred).

Hadoop FileSystems register a shutdown hook to close themselves. RecordReaders get closed in a finally block after the tasks they're used in complete.

So there's a race between these two, and I can't think of a good way to make one execute after the other. I'm a little confused as to why the HadoopRDD finally block is running at all. Some googling seems to indicate that finally blocks don't run during a System.exit(), and I would think a shutdown hook would run after that anyway. So I can't claim to have 100% understanding of what's going on here. Spark isn't closing the FileSystem on its own.

More generally, I think logging a warning is overkill on a reader close error.
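To make the race concrete, here is a rough Scala sketch of the pattern being described. It is not Spark's actual HadoopRDD code, and the helper name readSplit is made up for illustration:

```scala
import java.io.IOException
import org.apache.hadoop.mapred.RecordReader

// Hypothetical illustration of the race: the reader is closed in a finally block,
// but if the executor JVM is already shutting down, Hadoop's FileSystem shutdown
// hook may have closed the shared FileSystem first, so close() throws
// "java.io.IOException: Filesystem closed".
def readSplit[K, V](reader: RecordReader[K, V]): Unit = {
  val key = reader.createKey()
  val value = reader.createValue()
  try {
    while (reader.next(key, value)) {
      // process (key, value) ...
    }
  } finally {
    try {
      reader.close() // races with the FileSystem shutdown hook during JVM exit
    } catch {
      case e: IOException =>
        // This is where the misleading warning currently gets logged.
        System.err.println(s"Exception in RecordReader.close(): $e")
    }
  }
}
```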

@ash211 (Contributor) commented Aug 15, 2014

Ah, and the order they should be shut down in is RecordReader then FileSystem?

Thanks for catching this -- I've seen it myself and was wondering why the job output seemed to be correct.

@SparkQA commented Aug 15, 2014

QA results for PR 1956:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18582/consoleFull

@sryza (Contributor, Author) commented Aug 15, 2014

> Ah and the order they should be shut down in is RecordReader then FileSystem?

Right.

@mateiz (Contributor) commented Aug 27, 2014

@sryza what's the stack trace printed here? I think it would be better to check whether we're shutting down (with Utils.inShutdown) and log a warning if we're not shutting down. A failed close() seems bad in other situations.

@sryza (Contributor, Author) commented Sep 2, 2014

Here's the exception:

java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:775)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:610)
at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks, I wasn't aware of Utils.inShutdown. I'll post a patch that uses that. I haven't yet figured out how to reliably reproduce this, so I can't verify that it will safeguard against the warning in all situations where it should, but it seems like an improvement.
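For reference, a minimal sketch of the guarded close being discussed, assuming Spark 1.x's private[spark] Utils.inShutdown() is reachable from the calling code; the helper name closeQuietlyDuringShutdown is made up, and this is an illustration of the idea rather than the exact code in the eventual commit:

```scala
import java.io.IOException
import org.apache.spark.util.Utils

// Sketch only: close the reader, but only warn about an IOException if the JVM
// is not already running its shutdown hooks. During shutdown, Hadoop's
// FileSystem hook may have closed the filesystem first, so the failure is benign.
def closeQuietlyDuringShutdown(reader: java.io.Closeable): Unit = {
  try {
    reader.close()
  } catch {
    case e: IOException =>
      if (!Utils.inShutdown()) {
        // A failed close() outside of shutdown is still worth surfacing.
        System.err.println(s"Exception in RecordReader.close(): $e")
      }
  }
}
```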

@SparkQA commented Sep 2, 2014

QA tests have started for PR 1956 at commit 815813a.

  • This patch merges cleanly.

@SparkQA commented Sep 2, 2014

QA tests have finished for PR 1956 at commit 815813a.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ByteArrayChunkOutputStream(chunkSize: Int) extends OutputStream

@sryza (Contributor, Author) commented Sep 2, 2014

I believe the failure is unrelated. I noticed it on SPARK-2461 as well.

@mateiz (Contributor) commented Sep 2, 2014

Thanks Sandy, merged this.

@asfgit closed this in 81b9d5b Sep 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… a ...

...job fails while reading from Hadoop

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#1956 from sryza/sandy-spark-3052 and squashes the following commits:

815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024