[SPARK-4195][Core] Retry fetching blocks when the fetch failure's reason is a connection timeout #3061
Conversation
Test build #22761 has started for PR 3061 at commit
Test build #22761 has finished for PR 3061 at commit
Test FAILed.
@aarondav we should incorporate this into the new transport thing
```
@@ -39,6 +41,10 @@ final class NioBlockTransferService(conf: SparkConf, securityManager: SecurityMa

  private var blockDataManager: BlockDataManager = _

  private val blockFailedCounts = new HashMap[Seq[String], Int]
```
To avoid memory leaks, we need to be sure that this won't grow without bound. Let me try to walk through the cases...
- If no errors occur, this will remain empty since entries are only added on error.
- If a fetch fails and a retry succeeds, then the entry is removed from this map.
- If the maximum number of attempts is exceeded, we don't remove an entry from this map.
So, looks like this adds a memory leak?
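To make the cleanup requirement concrete, here is a minimal sketch of the bookkeeping under discussion (`FetchRetryTracker` and its method names are invented for illustration, not the PR's actual code): the failure counter must be removed on every exit path, including the give-up path.

```scala
import scala.collection.mutable

// Minimal sketch of a per-request failure counter that avoids the leak:
// the entry is removed both when a retry succeeds and when we give up.
class FetchRetryTracker(maxRetries: Int) {
  private val blockFailedCounts = mutable.HashMap[Seq[String], Int]()

  // Returns true if the caller should retry fetching these blocks.
  def shouldRetry(blockIds: Seq[String]): Boolean = synchronized {
    val count = blockFailedCounts.getOrElse(blockIds, 0) + 1
    if (count < maxRetries) {
      blockFailedCounts(blockIds) = count
      true
    } else {
      // Without this removal the entry would live forever once the
      // maximum number of attempts is exceeded -- the leak noted above.
      blockFailedCounts.remove(blockIds)
      false
    }
  }

  // Must be called when a retry succeeds, so the entry is cleared.
  def onFetchSuccess(blockIds: Seq[String]): Unit = synchronized {
    blockFailedCounts.remove(blockIds)
  }
}
```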
Yes, thank you for finding this error.
There are a bunch of minor style issues here; I don't want to comment on them individually, so please take a look at https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide.
```scala
blockIds.foreach { blockId =>
  listener.onBlockFetchFailure(blockId, exception)
  exception match {
    case connectExcpt: IOException =>
```
Why not catch ConnectException? I suppose it's always safe to retry as long as the number of retries is bounded, but it's probably better to catch the narrower exception if we're only trying to deal with connection establishment errors.
Because ConnectionManager.scala does not catch ConnectException at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L963; it throws an IOException instead. So here we can only catch IOException. If we want to handle only connection errors, ConnectionManager.scala would need to catch ConnectException first.
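To illustrate the trade-off both comments are circling (a hypothetical sketch with invented names, not code from either file): matching on the broad IOException retries any I/O failure, while matching narrowly only works if the lower layer preserves the ConnectException, for example as the IOException's cause.

```scala
import java.io.IOException
import java.net.ConnectException

object RetryDecision {
  // Broad match: what the PR can do today, since ConnectionManager
  // rethrows a plain IOException with no cause attached.
  def shouldRetryBroad(exception: Throwable): Boolean = exception match {
    case _: IOException => true // retries on any I/O failure
    case _              => false
  }

  // Narrow match: only possible if ConnectionManager kept the original
  // exception, e.g. `throw new IOException("connect failed", cause)`.
  def shouldRetryNarrow(exception: Throwable): Boolean = exception match {
    case _: ConnectException => true
    case e: IOException if e.getCause.isInstanceOf[ConnectException] => true
    case _ => false
  }
}
```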
Test build #24757 has started for PR 3061 at commit
Test build #24757 has finished for PR 3061 at commit
Test FAILed.
Actually let's close this one. Based on Tencent's feedback, we have already implemented the same functionality in Netty. I'm not sure whether it is worth it to fix the current connection manager that very few people understand.
OK. I will close this PR. |
When there are many executors in an application (for example, 1000), connection timeouts often occur.
The exception is:
```
WARN nio.SendingConnection: Error finishing connection
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.spark.network.nio.SendingConnection.finishConnect(Connection.scala:342)
    at org.apache.spark.network.nio.ConnectionManager$$anon$11.run(ConnectionManager.scala:273)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
```
This makes the driver think these executors are lost, when in fact they are alive. So this PR adds a retry mechanism to reduce the probability of this problem occurring. @rxin
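For context, the retry idea in rough form (a sketch with invented names, not the PR's implementation): re-issue a failed fetch a bounded number of times before surfacing the failure, so a transient connection timeout does not cause the driver to mark a live executor as lost.

```scala
import java.io.IOException

object FetchRetry {
  // Re-run `fetchOnce` up to `maxRetries` extra times on IOException,
  // with a simple linear backoff between attempts.
  def fetchWithRetry(blockIds: Seq[String], maxRetries: Int)
                    (fetchOnce: Seq[String] => Unit): Unit = {
    var attempt = 0
    var done = false
    while (!done) {
      try {
        fetchOnce(blockIds)
        done = true
      } catch {
        case _: IOException if attempt < maxRetries =>
          attempt += 1
          Thread.sleep(1000L * attempt) // back off before retrying
        // once retries are exhausted, the IOException propagates
      }
    }
  }
}
```

Spark ultimately shipped this behavior in the Netty transport instead, as noted in the closing comment above.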