[SPARK-4188] [Core] Perform network-level retry of shuffle file fetches #3101
@rxin @lianhuiwang PTAL
This should replace #3061
Test build #22912 has started for PR 3101 at commit
Test build #22912 has finished for PR 3101 at commit
Test PASSed.
Test build #22929 has started for PR 3101 at commit
@rxin Added unit test and ExternalShuffleClient support. This is good to go from my end.
Test build #22929 has finished for PR 3101 at commit
Test PASSed.
This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.
This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.
TODO:
- [ ] unit tests
- [ ] put in ExternalShuffleClient too
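For readers skimming the thread, the wrapping idea described above can be sketched roughly as follows. This is not the code in this PR: `RetryingFetchSketch`, the `BlockFetcher` interface, `fetchAll`, and the `maxRetries` / `retryWaitMs` parameters are all hypothetical names chosen for illustration, and the sketch blocks the calling thread, which the real fetcher may well not do. It only shows the shape of "retry on IOException, give up after a fixed budget".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of retry-on-IOException; not Spark's RetryingBlockFetcher. */
public class RetryingFetchSketch {

  /** Stand-in for a one-shot fetcher (the role OneForOneBlockFetcher plays in the PR). */
  public interface BlockFetcher {
    byte[] fetchBlock(String blockId) throws IOException;
  }

  /**
   * Fetches every block, retrying after an IOException up to maxRetries times in total.
   * Blocks that were already fetched are never requested again.
   */
  public static Map<String, byte[]> fetchAll(
      BlockFetcher fetcher, List<String> blockIds, int maxRetries, long retryWaitMs)
      throws IOException, InterruptedException {
    Map<String, byte[]> results = new LinkedHashMap<>();
    Deque<String> outstanding = new ArrayDeque<>(blockIds);
    int retryCount = 0;
    while (!outstanding.isEmpty()) {
      String blockId = outstanding.peek();
      try {
        results.put(blockId, fetcher.fetchBlock(blockId));
        outstanding.remove(); // success: this block is no longer outstanding
      } catch (IOException e) {
        if (retryCount >= maxRetries) {
          throw e; // budget exhausted: surface the failure to the caller
        }
        retryCount++;
        Thread.sleep(retryWaitMs); // give a GC pause or network blip time to pass
      }
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Fails twice with an IOException, then succeeds, to mimic a transient outage.
    BlockFetcher flaky = id -> {
      if (calls[0]++ < 2) {
        throw new IOException("simulated transient failure");
      }
      return id.getBytes(StandardCharsets.UTF_8);
    };
    Map<String, byte[]> fetched = fetchAll(flaky, Arrays.asList("b0", "b1"), 3, 10L);
    System.out.println("fetched " + fetched.size() + " blocks in " + calls[0] + " calls");
  }
}
```

Only IOException is retried here; any other exception propagates immediately, which matches the "in the event of an IOException" wording of the description.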
Test build #22956 has started for PR 3101 at commit
   * Lightweight method which initiates a retry in a different thread. The retry will involve
   * calling fetchAllOutstanding() after a configured wait time.
   */
  private synchronized void initiateRetry() {
extra space
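The Javadoc quoted above describes handing the retry to another thread after a wait. A minimal sketch of that pattern, assuming a `ScheduledExecutorService` (the PR excerpt does not show which mechanism the real method uses): `DeferredRetrySketch`, its constructor parameters, and the boolean return value are hypothetical; only the names `initiateRetry` and `fetchAllOutstanding` come from the excerpt.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a deferred retry; not the method under review. */
public class DeferredRetrySketch {
  // Daemon thread so a pending retry never keeps the JVM alive.
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "retry-sketch");
        t.setDaemon(true);
        return t;
      });

  private final Runnable fetchAllOutstanding; // stand-in for the real fetch routine
  private final int maxRetries;               // assumed retry budget
  private final long retryWaitMs;             // assumed wait before each retry
  private int retryCount = 0;

  public DeferredRetrySketch(Runnable fetchAllOutstanding, int maxRetries, long retryWaitMs) {
    this.fetchAllOutstanding = fetchAllOutstanding;
    this.maxRetries = maxRetries;
    this.retryWaitMs = retryWaitMs;
  }

  /**
   * Schedules fetchAllOutstanding on the retry thread after the wait and returns
   * immediately. Returns false, scheduling nothing, once the budget is used up.
   */
  public synchronized boolean initiateRetry() {
    if (retryCount >= maxRetries) {
      return false;
    }
    retryCount++;
    executor.schedule(fetchAllOutstanding, retryWaitMs, TimeUnit.MILLISECONDS);
    return true;
  }

  public void shutdown() {
    executor.shutdownNow();
  }
}
```

Keeping the method this small means the caller, which may be a network callback thread, never sleeps itself; the wait happens on the dedicated retry thread.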
Test FAILed.
@@ -57,7 +57,7 @@ public void tearDown() {
   }

   @Test
-  public void createAndReuseBlockClients() throws TimeoutException {
+  public void createAndReuseBlockClients() throws Exception {
are we throwing more than Timeout now?
IOException instead of TimeoutException
Jenkins, retest this please.
Test build #22968 has started for PR 3101 at commit
Test build #22956 has finished for PR 3101 at commit
Test PASSed.
Test build #22970 has started for PR 3101 at commit
Test build #22968 has finished for PR 3101 at commit
Test build #22970 has finished for PR 3101 at commit
Test PASSed.
Test FAILed.
Test build #22980 has started for PR 3101 at commit
Test build #22980 has finished for PR 3101 at commit
Test PASSed.
Test build #23033 has started for PR 3101 at commit
Merging in master. Thanks.
This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.
This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.
TODO:
- [x] unit tests
- [x] put in ExternalShuffleClient too
Author: Aaron Davidson <aaron@databricks.com>
Closes #3101 from aarondav/retry and squashes the following commits:
72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
c7fd107 [Aaron Davidson] Fix unit tests
e80e4c2 [Aaron Davidson] Address initial comments
6f594cd [Aaron Davidson] Fix unit test
05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches
(cherry picked from commit f165b2b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Test build #23033 has finished for PR 3101 at commit
Test PASSed.