Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4188] [Core] Perform network-level retry of shuffle file fetches #3101

Closed
wants to merge 6 commits into from

Conversation

aarondav
Copy link
Contributor

@aarondav aarondav commented Nov 5, 2014

This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:

  • unit tests
  • put in ExternalShuffleClient too

@aarondav
Copy link
Contributor Author

aarondav commented Nov 5, 2014

@rxin @lianhuiwang PTAL

@rxin
Copy link
Contributor

rxin commented Nov 5, 2014

This should replace #3061

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22912 has started for PR 3101 at commit c293a3f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22912 has finished for PR 3101 at commit c293a3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class RetryingBlockFetcher

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22912/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22929 has started for PR 3101 at commit d406db7.

  • This patch merges cleanly.

@aarondav
Copy link
Contributor Author

aarondav commented Nov 5, 2014

@rxin Added unit test and ExternalShuffleClient support. This is good to go from my end.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22929 has finished for PR 3101 at commit d406db7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class RetryingBlockFetcher

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22929/
Test PASSed.

This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:
- [ ] unit tests
- [ ] put in ExternalShuffleClient too
@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22956 has started for PR 3101 at commit 6f594cd.

  • This patch merges cleanly.

* Lightweight method which initiates a retry in a different thread. The retry will involve
* calling fetchAllOutstanding() after a configured wait time.
*/
private synchronized void initiateRetry() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22955/
Test FAILed.

@@ -57,7 +57,7 @@ public void tearDown() {
}

@Test
public void createAndReuseBlockClients() throws TimeoutException {
public void createAndReuseBlockClients() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we throwing more than Timeout now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException instead of TimeoutException

@aarondav
Copy link
Contributor Author

aarondav commented Nov 6, 2014

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22968 has started for PR 3101 at commit 6f594cd.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22956 has finished for PR 3101 at commit 6f594cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class LimitedInputStream extends FilterInputStream
    • public class RetryingBlockFetcher

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22956/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22970 has started for PR 3101 at commit e80e4c2.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22968 has finished for PR 3101 at commit 6f594cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class RetryingBlockFetcher

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22970 has finished for PR 3101 at commit e80e4c2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class RetryingBlockFetcher

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22968/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22970/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22980 has started for PR 3101 at commit c7fd107.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22980 has finished for PR 3101 at commit c7fd107.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22980/
Test PASSed.

@aarondav aarondav changed the title [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches [SPARK-4188] [Core] Perform network-level retry of shuffle file fetches Nov 6, 2014
@SparkQA
Copy link

SparkQA commented Nov 7, 2014

Test build #23033 has started for PR 3101 at commit 72a2a32.

  • This patch merges cleanly.

@rxin
Copy link
Contributor

rxin commented Nov 7, 2014

Merging in master. Thanks.

@asfgit asfgit closed this in f165b2b Nov 7, 2014
asfgit pushed a commit that referenced this pull request Nov 7, 2014
This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:
- [x] unit tests
- [x] put in ExternalShuffleClient too

Author: Aaron Davidson <aaron@databricks.com>

Closes #3101 from aarondav/retry and squashes the following commits:

72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
c7fd107 [Aaron Davidson] Fix unit tests
e80e4c2 [Aaron Davidson] Address initial comments
6f594cd [Aaron Davidson] Fix unit test
05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches

(cherry picked from commit f165b2b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@SparkQA
Copy link

SparkQA commented Nov 7, 2014

Test build #23033 has finished for PR 3101 at commit 72a2a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class RetryingBlockFetcher

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23033/
Test PASSed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants