SPARK-41415/SPARK-42090 Backport to 3.3 #39634

akpatnam25

What changes were proposed in this pull request?

Add the ability to retry SASL requests. A metric to track SASL retries will be added soon.

Why are the changes needed?

We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and failures have decreased significantly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests, and tested on cluster to ensure the retries are being triggered correctly.

Closes #38959 from akpatnam25/SPARK-41415.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

What changes were proposed in this pull request?

This PR introduces a SASL retry count in RetryingBlockTransferor.

Why are the changes needed?

Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

  1. SaslTimeoutException
  2. IOException
  3. SaslTimeoutException
  4. IOException

Even though the IOException at step 2 is retried (incrementing retryCount), retryCount would be cleared at step 4, so the retry from step 2 is no longer counted.
Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.
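
The difference can be illustrated with a small, self-contained simulation. This is only a sketch of the bookkeeping described above, not the RetryingBlockTransferor code itself: the class name SaslRetryCountSketch, the method names, and the representation of each failure as a boolean (true for SaslTimeoutException, false for IOException) are assumptions made for illustration, and the maximum-retry check is left out.

```java
// Hypothetical, simplified model of the retry bookkeeping; not the Spark source.
class SaslRetryCountSketch {

  /** Boolean-flag approach: only remembers that some SASL timeout was seen. */
  static int withBooleanFlag(boolean... failureIsSaslTimeout) {
    int retryCount = 0;
    boolean saslTimeoutSeen = false;
    for (boolean isSaslTimeout : failureIsSaslTimeout) {
      if (!isSaslTimeout && saslTimeoutSeen) {
        // Clears everything, including IOException retries counted earlier.
        retryCount = 0;
        saslTimeoutSeen = false;
      }
      retryCount++; // every retried failure increments the count
      if (isSaslTimeout) {
        saslTimeoutSeen = true;
      }
    }
    return retryCount;
  }

  /** Counter approach: subtracts only the retries caused by SASL timeouts. */
  static int withCounter(boolean... failureIsSaslTimeout) {
    int retryCount = 0;
    int saslRetryCount = 0;
    for (boolean isSaslTimeout : failureIsSaslTimeout) {
      if (!isSaslTimeout && saslRetryCount > 0) {
        // Undo just the SASL-related increments; IOException retries remain.
        retryCount -= saslRetryCount;
        saslRetryCount = 0;
      }
      retryCount++; // every retried failure increments the count
      if (isSaslTimeout) {
        saslRetryCount++;
      }
    }
    return retryCount;
  }

  public static void main(String[] args) {
    // SaslTimeoutException, IOException, SaslTimeoutException, IOException
    boolean[] scenario = {true, false, true, false};
    System.out.println("boolean flag: " + withBooleanFlag(scenario)); // prints 1
    System.out.println("counter:      " + withCounter(scenario));     // prints 2
  }
}
```

In this model the flag version ends with a count of 1 because the IOException retry from step 2 is wiped at step 4, while the counter version ends with 2, one for each IOException retry.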

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test is added, courtesy of Mridul.

Closes #39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

Aravind Patnam and others added 2 commits January 17, 2023 15:55
@github-actions github-actions bot added the CORE label Jan 17, 2023
@akpatnam25
Author

@mridulm @otterc @tedyu @dongjoon-hyun backport into 3.3

Member

@dongjoon-hyun dongjoon-hyun left a comment

Can we backport one by one?

@akpatnam25
Author

Are we sure we want to backport one by one? Asking because the 2nd backport fixes a corner case that the 1st one does not handle. Ideally, I feel they should be backported together. WDYT @dongjoon-hyun?
cc @mridulm

@dongjoon-hyun
Member

One-by-one is clearer. I'm sure I want to keep it that way, @akpatnam25.

@srowen
Member

srowen commented Jan 18, 2023

Also please write [SPARK-xxxxxx] in the titles, as that will connect it to the JIRA

@akpatnam25 akpatnam25 closed this Jan 18, 2023