[SPARK-4188] [Core] Perform network-level retry of shuffle file fetches #3101

aarondav · 2014-11-05T02:55:59Z

This adds a RetryingBlockFetcher to the NettyBlockTransferService. It wraps our typical OneForOneBlockFetcher and adds retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:

unit tests
put in ExternalShuffleClient too
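
For readers of this thread, here is a minimal sketch of the wrapping idea described above: a fetch attempt is retried when it fails with an IOException, up to a bounded number of retries with a wait in between. The names `RetrySketch`, `Fetch`, `fetchWithRetry`, `maxRetries`, and `waitMs` are hypothetical and chosen for illustration; this is not the RetryingBlockFetcher code from the patch, which works with asynchronous callbacks rather than a blocking loop.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetrySketch {
  /** Hypothetical stand-in for one fetch attempt; not Spark's actual interface. */
  public interface Fetch<T> {
    T attempt() throws IOException;
  }

  /**
   * Runs the fetch and retries up to maxRetries more times when an
   * IOException is thrown, sleeping waitMs between attempts. Once the
   * retries are exhausted, the last IOException is rethrown.
   */
  public static <T> T fetchWithRetry(Fetch<T> fetch, int maxRetries, long waitMs)
      throws IOException, InterruptedException {
    int retries = 0;
    while (true) {
      try {
        return fetch.attempt();
      } catch (IOException e) {
        if (retries >= maxRetries) {
          throw e; // give up: the caller sees the original failure
        }
        retries++;
        Thread.sleep(waitMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated fetch that fails twice before succeeding (assumption for the demo).
    String block = fetchWithRetry(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IOException("simulated connection reset");
      }
      return "block-data";
    }, 3, 10);
    System.out.println(block + " fetched after " + calls.get() + " attempts");
  }
}
```

Only IOException triggers a retry in this sketch; any other exception propagates immediately, which mirrors the intent stated in the description of retrying on network-level failures only.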

aarondav · 2014-11-05T02:56:29Z

@rxin @lianhuiwang PTAL

rxin · 2014-11-05T02:57:55Z

This should replace #3061

SparkQA · 2014-11-05T02:59:51Z

Test build #22912 has started for PR 3101 at commit c293a3f.

This patch merges cleanly.

SparkQA · 2014-11-05T04:20:11Z

Test build #22912 has finished for PR 3101 at commit c293a3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class RetryingBlockFetcher

AmplabJenkins · 2014-11-05T04:20:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22912/
Test PASSed.

SparkQA · 2014-11-05T09:10:11Z

Test build #22929 has started for PR 3101 at commit d406db7.

This patch merges cleanly.

aarondav · 2014-11-05T09:21:38Z

@rxin Added unit test and ExternalShuffleClient support. This is good to go from my end.

SparkQA · 2014-11-05T10:34:24Z

Test build #22929 has finished for PR 3101 at commit d406db7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class RetryingBlockFetcher

AmplabJenkins · 2014-11-05T10:34:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22929/
Test PASSed.


SparkQA · 2014-11-05T23:25:02Z

Test build #22956 has started for PR 3101 at commit 6f594cd.

This patch merges cleanly.

rxin · 2014-11-05T23:28:22Z

network/shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java

+   * Lightweight method which initiates a retry in a different thread. The retry will involve
+   * calling fetchAllOutstanding() after a configured wait time.
+   */
+  private synchronized  void initiateRetry() {


extra space
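
As context for the Javadoc quoted in this comment, the following is a rough sketch of what "initiates a retry in a different thread" after "a configured wait time" could look like. It assumes a `ScheduledExecutorService` for the delay, which is a substitution made for brevity and not necessarily how the patch does it; the class name `DeferredRetrySketch`, the `retryWaitMs` field, and the `Runnable` standing in for `fetchAllOutstanding()` are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeferredRetrySketch {
  // Assumption: a single background thread is enough to run delayed retries.
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  private final long retryWaitMs;
  private final Runnable fetchAllOutstanding; // stand-in for the real refetch logic
  private int retryCount = 0;

  public DeferredRetrySketch(long retryWaitMs, Runnable fetchAllOutstanding) {
    this.retryWaitMs = retryWaitMs;
    this.fetchAllOutstanding = fetchAllOutstanding;
  }

  /**
   * Lightweight: only bumps the counter and schedules the refetch, so the
   * calling thread (for example a network callback thread) never blocks
   * for the wait time itself.
   */
  public synchronized void initiateRetry() {
    retryCount++;
    executor.schedule(fetchAllOutstanding, retryWaitMs, TimeUnit.MILLISECONDS);
  }

  public synchronized int getRetryCount() {
    return retryCount;
  }

  public void shutdown() {
    executor.shutdown();
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch refetched = new CountDownLatch(1);
    DeferredRetrySketch sketch = new DeferredRetrySketch(50, refetched::countDown);
    sketch.initiateRetry();
    boolean ran = refetched.await(5, TimeUnit.SECONDS);
    System.out.println("refetch ran: " + ran + ", retries: " + sketch.getRetryCount());
    sketch.shutdown();
  }
}
```

The method is `synchronized` here for the same reason the quoted signature is: the retry counter is shared state that a failure callback and the retry thread may touch concurrently.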

AmplabJenkins · 2014-11-05T23:32:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22955/
Test FAILed.

rxin · 2014-11-05T23:42:56Z

network/common/src/test/java/org/apache/spark/network/TransportClientFactorySuite.java

@@ -57,7 +57,7 @@ public void tearDown() {
  }

  @Test
-  public void createAndReuseBlockClients() throws TimeoutException {
+  public void createAndReuseBlockClients() throws Exception {


are we throwing more than Timeout now?

IOException instead of TimeoutException

aarondav · 2014-11-06T00:33:52Z

Jenkins, retest this please.

SparkQA · 2014-11-06T00:40:18Z

Test build #22968 has started for PR 3101 at commit 6f594cd.

This patch merges cleanly.

SparkQA · 2014-11-06T00:48:21Z

Test build #22956 has finished for PR 3101 at commit 6f594cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public final class LimitedInputStream extends FilterInputStream
- public class RetryingBlockFetcher

AmplabJenkins · 2014-11-06T00:48:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22956/
Test PASSed.

SparkQA · 2014-11-06T00:55:39Z

Test build #22970 has started for PR 3101 at commit e80e4c2.

This patch merges cleanly.

SparkQA · 2014-11-06T02:02:03Z

Test build #22968 has finished for PR 3101 at commit 6f594cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class RetryingBlockFetcher

SparkQA · 2014-11-06T02:02:06Z

Test build #22970 has finished for PR 3101 at commit e80e4c2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class RetryingBlockFetcher

AmplabJenkins · 2014-11-06T02:02:07Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22968/
Test PASSed.

AmplabJenkins · 2014-11-06T02:02:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22970/
Test FAILed.

SparkQA · 2014-11-06T02:52:30Z

Test build #22980 has started for PR 3101 at commit c7fd107.

This patch merges cleanly.

SparkQA · 2014-11-06T04:17:50Z

Test build #22980 has finished for PR 3101 at commit c7fd107.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-06T04:17:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22980/
Test PASSed.

SparkQA · 2014-11-07T02:14:59Z

Test build #23033 has started for PR 3101 at commit 72a2a32.

This patch merges cleanly.

rxin · 2014-11-07T02:39:01Z

Merging in master. Thanks.

This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:

- [x] unit tests
- [x] put in ExternalShuffleClient too

Author: Aaron Davidson <aaron@databricks.com>

Closes #3101 from aarondav/retry and squashes the following commits:

72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
c7fd107 [Aaron Davidson] Fix unit tests
e80e4c2 [Aaron Davidson] Address initial comments
6f594cd [Aaron Davidson] Fix unit test
05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches

(cherry picked from commit f165b2b)
Signed-off-by: Reynold Xin <rxin@databricks.com>

SparkQA · 2014-11-07T03:42:30Z

Test build #23033 has finished for PR 3101 at commit 72a2a32.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class RetryingBlockFetcher

AmplabJenkins · 2014-11-07T03:42:33Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23033/
Test PASSed.

aarondav added 2 commits November 5, 2014 15:12

Add to external shuffle client and add unit test

05ff43c

aarondav force-pushed the retry branch from d406db7 to 05ff43c Compare November 5, 2014 23:16

Fix unit test

6f594cd

rxin reviewed Nov 5, 2014

Address initial comments

e80e4c2

Fix unit tests

c7fd107

aarondav changed the title ~~[SPARK-4238] [Core] Perform network-level retry of shuffle file fetches~~ [SPARK-4188] [Core] Perform network-level retry of shuffle file fetches Nov 6, 2014

Add that we should remove the condition around the retry thingy

72a2a32

asfgit closed this in f165b2b Nov 7, 2014
