
Fix #887 Enable automatic retry by a handy way #1084

Merged
8 commits merged into slackapi:main from the issue-887-retry-handlers branch on Aug 7, 2021

Conversation

seratch (Member) commented Aug 5, 2021

Summary

This pull request fixes #887 by adding a new RetryHandler feature to all the API clients (except the legacy ones under the slack package).

With the default settings, the API clients retry a request only once, and only for connectivity issues such as the "Connection reset by peer" error. For retry intervals, the built-in retry handlers use exponential backoff with jitter.
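For intuition, the interval before the Nth retry grows roughly like this (a sketch of the idea, not the exact slack_sdk formula; the 0.5 default factor is an assumption):

import random

def sleep_duration(retry_count: int, backoff_factor: float = 0.5) -> float:
    # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 1s of random jitter
    return backoff_factor * (2 ** retry_count) + random.random()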

To customize the behavior, you can pass your own retry_handlers argument to API client constructors:

import os

from slack_sdk.http_retry.handler import RetryHandler
from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler
from slack_sdk.web import WebClient

class MyRetryHandler(RetryHandler):
    # Retry when the server returned an HTTP 5xx response
    def _can_retry(self, *, state, request, response, error) -> bool:
        return response is not None and response.status_code >= 500

my_retry_handler = MyRetryHandler(max_retry_count=2)
ratelimit_retry_handler = RateLimitErrorRetryHandler(max_retry_count=1)

client = WebClient(
    token=os.environ["SLACK_BOT_TOKEN"],
    retry_handlers=[my_retry_handler, ratelimit_retry_handler],
)

If an API client with retry handlers encounters an error, it calls each handler's can_retry(args) -> bool method. If any of these calls returns True, the client calls that handler's prepare_for_next_attempt(args) -> None method to wait for the right timing, and then performs the same API request again until it hits the handler's max_retry_count.
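The flow looks roughly like this (a simplified sketch of the client-side loop, not the exact slack_sdk internals; perform_http_request, request, and retry_state are illustrative names):

response = None
error = None
while True:
    try:
        response = perform_http_request(request)  # illustrative helper
        error = None
    except Exception as e:
        error = e
    retry_state.next_attempt_requested = False
    for handler in retry_handlers:
        if handler.can_retry(state=retry_state, request=request, response=response, error=error):
            # Waits for the calculated interval and increments the attempt counter
            handler.prepare_for_next_attempt(state=retry_state, request=request, response=response, error=error)
            break
    if not retry_state.next_attempt_requested:
        break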

In this pull request, I've updated the following API clients:

  • slack_sdk.web.WebClient
  • slack_sdk.webhook.WebhookClient
  • slack_sdk.audit_logs.AuditLogsClient
  • slack_sdk.scim.SCIMClient
  • slack_sdk.web.async_client.AsyncWebClient (aiohttp/asyncio compatible)
  • slack_sdk.webhook.async_client.AsyncWebhookClient (aiohttp/asyncio compatible)
  • slack_sdk.audit_logs.async_client.AsyncAuditLogsClient (aiohttp/asyncio compatible)
  • slack_sdk.scim.async_client.AsyncSCIMClient (aiohttp/asyncio compatible)

You can reuse retry handlers across the API clients listed above:

from slack_sdk.scim import SCIMClient

client = SCIMClient(
    token=os.environ["SLACK_ADMIN_TOKEN"],
    retry_handlers=[my_retry_handler],
)

from slack_sdk.audit_logs.async_client import AsyncAuditLogsClient
from slack_sdk.http_retry.builtin_async_handlers import AsyncConnectionErrorRetryHandler
from slack_sdk.http_retry.builtin_interval_calculators import BackoffRetryIntervalCalculator
from slack_sdk.http_retry.jitter import RandomJitter

client = AsyncAuditLogsClient(
    token=os.environ["SLACK_ADMIN_TOKEN"],
    retry_handlers=[AsyncConnectionErrorRetryHandler(
        max_retry_count=2,
        interval_calculator=BackoffRetryIntervalCalculator(
            backoff_factor=0.2,
            jitter=RandomJitter(),
        )
    )],
)
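As a usage sketch, the async client is awaited inside an event loop (the logs() arguments here are illustrative; the Audit Logs API requires an Enterprise Grid admin token):

import asyncio

async def main():
    # AsyncAuditLogsClient exposes the Audit Logs API's logs endpoint
    response = await client.logs(action="user_login", limit=10)
    print(response.status_code)

asyncio.run(main())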

TODOs

  • Implement the features
  • Add new unit tests for the changes
  • Run all the integration tests to verify that there are no regressions
  • Update the documentation to cover how to customize retry handlers (in a separate PR; we'll merge it after releasing v3.9)

Category (place an x in each of the [ ])

  • slack_sdk.web.WebClient (sync/async) (Web API client)
  • slack_sdk.webhook.WebhookClient (sync/async) (Incoming Webhook, response_url sender)
  • slack_sdk.socket_mode (Socket Mode client)
  • slack_sdk.signature (Request Signature Verifier)
  • slack_sdk.oauth (OAuth Flow Utilities)
  • slack_sdk.models (UI component builders)
  • slack_sdk.scim (SCIM API client)
  • slack_sdk.audit_logs (Audit Logs API client)
  • slack_sdk.rtm_v2 (RTM client)
  • /docs-src (Documents, have you run ./docs.sh?)
  • /docs-src-v2 (Documents, have you run ./docs-v2.sh?)
  • /tutorial (PythOnBoardingBot tutorial)
  • tests/integration_tests (Automated tests for this library)

Requirements (place an x in each [ ])

  • I've read and understood the Contributing Guidelines and have done my best effort to follow them.
  • I've read and agree to the Code of Conduct.
  • I've run python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.sh after making the changes.

@seratch seratch added this to the 3.9.0 milestone Aug 5, 2021
retry_response: Optional[RetryHttpResponse] = None
response_body = ""

if self.logger.level <= logging.DEBUG:
seratch (Member, Author): Moved this debug logging so that it prints every time the client performs a retry.

@@ -49,6 +55,7 @@ def __init__(
    user_agent_prefix: Optional[str] = None,
    user_agent_suffix: Optional[str] = None,
    logger: Optional[logging.Logger] = None,
+   retry_handlers: List[RetryHandler] = async_default_handlers,
seratch (Member, Author): The default list consists of only the (Async)ConnectionErrorRetryHandler instance with its default settings.

default_interval_calculator = BackoffRetryIntervalCalculator()


class RetryHandler:
seratch (Member, Author): This class is the main interface introduced in this pull request.
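For orientation, the interface looks roughly like this (a condensed sketch of the design described in this PR; the RetryState attribute names current_attempt and increment_current_attempt are assumptions):

import time

class RetryHandler:
    def __init__(self, max_retry_count=1, interval_calculator=default_interval_calculator):
        self.max_retry_count = max_retry_count
        self.interval_calculator = interval_calculator

    def can_retry(self, *, state, request, response=None, error=None) -> bool:
        # Give up once this handler's retry budget is exhausted
        if state.current_attempt >= self.max_retry_count:
            return False
        return self._can_retry(state=state, request=request, response=response, error=error)

    def _can_retry(self, *, state, request, response=None, error=None) -> bool:
        # Subclasses decide whether this particular failure is retriable
        raise NotImplementedError()

    def prepare_for_next_attempt(self, *, state, request, response=None, error=None) -> None:
        # Sleep for the calculated interval, then record the new attempt
        state.next_attempt_requested = True
        time.sleep(self.interval_calculator.calculate_sleep_duration(state.current_attempt))
        state.increment_current_attempt()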

from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class MyRetryHandler(RetryHandler):
seratch (Member, Author): A custom retry handler for testing.

@@ -111,7 +111,7 @@ def _handle(self):
        return
    if pattern == "rate_limited":
        self.send_response(429)
-       self.send_header("Retry-After", 30)
+       self.send_header("Retry-After", 1)
seratch (Member, Author): Changed the value for faster test execution (we really don't want to wait for 30 seconds).

codecov bot commented Aug 5, 2021

Codecov Report

Merging #1084 (2bf87c4) into main (8a0c802) will increase coverage by 0.46%.
The diff coverage is 90.58%.


@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
+ Coverage   85.62%   86.09%   +0.46%     
==========================================
  Files          99      110      +11     
  Lines        9324     9847     +523     
==========================================
+ Hits         7984     8478     +494     
- Misses       1340     1369      +29     
Impacted Files Coverage Δ
slack_sdk/http_retry/interval_calculator.py 66.66% <66.66%> (ø)
slack_sdk/web/async_internal_utils.py 81.81% <80.95%> (+2.78%) ⬆️
slack_sdk/audit_logs/v1/async_client.py 89.16% <85.18%> (+0.41%) ⬆️
slack_sdk/http_retry/jitter.py 85.71% <85.71%> (ø)
slack_sdk/web/base_client.py 89.55% <87.32%> (+0.45%) ⬆️
slack_sdk/audit_logs/v1/client.py 91.20% <88.57%> (+1.66%) ⬆️
slack_sdk/webhook/async_client.py 92.23% <90.00%> (-1.42%) ⬇️
slack_sdk/scim/v1/async_client.py 94.20% <90.19%> (-1.72%) ⬇️
slack_sdk/scim/v1/client.py 93.75% <90.27%> (+3.27%) ⬆️
slack_sdk/http_retry/builtin_handlers.py 92.10% <92.10%> (ø)
... and 25 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a0c802...2bf87c4.

seratch (Member, Author) commented Aug 5, 2021

Applied the following changes:

  • Renamed RetryHandler#can_retry_custom(...) to RetryHandler#_can_retry(...)
  • Renamed RetryHandler#prepare_for_next_retry(...) to RetryHandler#prepare_for_next_attempt(...)

seratch force-pushed the issue-887-retry-handlers branch from 5e418a6 to 84b12e1 on August 5, 2021 12:56
seratch force-pushed the issue-887-retry-handlers branch from 84b12e1 to 85eab23 on August 6, 2021 04:22
filmaj (Contributor) commented Aug 6, 2021

I plan on reviewing today - it is a big PR so I did not end up having time yesterday.

seratch (Member, Author) commented Aug 6, 2021

@filmaj Thanks! No rush at all. I know this includes a lot of changes. I am thinking this pull request could also add more unit tests covering rate-limited error patterns, for safety.

seratch (Member, Author) left a review:

Added more comments for reviewers.

@@ -190,7 +199,7 @@ def api_call(
    return self._perform_http_request(
        http_verb=http_verb,
        url=url,
-       body_params=body_params,
+       body=body_params,
seratch (Member, Author): Since _perform_http_request is an internal method, we can safely rename this argument.

Comment on lines +259 to +260
counter_for_safety = 0
while counter_for_safety < 100:
seratch (Member, Author): We may want to remove this counter for simplicity. A while True here should be safe enough, as retry_state.next_attempt_requested is usually False.

Comment on lines +19 to +24
error_types: List[Exception] = [
    ServerConnectionError,
    ServerDisconnectedError,
    # ClientOSError: [Errno 104] Connection reset by peer
    ClientOSError,
],
seratch (Member, Author): These are aiohttp-specific exceptions.
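For example, the retried exception types can be extended when constructing the handler (a sketch assuming the error_types constructor argument shown above; whether to retry timeouts at all is a judgment call):

import asyncio

from aiohttp import ClientOSError, ServerConnectionError, ServerDisconnectedError
from slack_sdk.http_retry.builtin_async_handlers import AsyncConnectionErrorRetryHandler

handler = AsyncConnectionErrorRetryHandler(
    max_retry_count=2,
    # The defaults above, plus client-side timeouts
    error_types=[ServerConnectionError, ServerDisconnectedError, ClientOSError, asyncio.TimeoutError],
)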

return False


class ServerErrorRetryHandler(RetryHandler):
seratch (Member, Author): I've added this one as a reference implementation, but it's unused. We may want to remove it for now.

filmaj (Contributor): I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases like a legitimate outage on Slack's side.

seratch (Member, Author): Yes, this is fair enough 👍

tgiardina commented Jun 5, 2024, quoting the suggestion above:

> I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases like a legitimate outage on Slack's side.

Sorry to necro, but I think it's worth reconsidering this decision. While it may be undesirable from Slack's perspective to exponentially back off when their API is returning 5xxs, I think this is what SDK consumers will want. And I think it makes sense to do what the consumer wants here, because if we don't, the consumer is just going to implement their own exponential backoff logic that includes 5xxs. This is my plan, anyway.

Thanks for your work here!

(Contributor): Fair!

(Contributor): This feels like a good time to mention the Circuit Breaker pattern.
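For context, a circuit breaker stops issuing requests entirely once failures pile up, rather than retrying each one. A minimal sketch in terms of this PR's RetryHandler interface (the class, threshold, and cooldown values are hypothetical, not part of slack_sdk):

import time

from slack_sdk.http_retry.handler import RetryHandler

class CircuitBreakerRetryHandler(RetryHandler):
    def __init__(self, max_retry_count=1, failure_threshold=5, cooldown_seconds=60.0):
        super().__init__(max_retry_count=max_retry_count)
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._consecutive_failures = 0
        self._opened_at = 0.0

    def _can_retry(self, *, state, request, response=None, error=None) -> bool:
        failed = error is not None or (response is not None and response.status_code >= 500)
        if not failed:
            self._consecutive_failures = 0  # a healthy response closes the circuit
            return False
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            if self._opened_at == 0.0:
                self._opened_at = time.time()  # open the circuit
            if time.time() - self._opened_at < self.cooldown_seconds:
                return False  # circuit open: fail fast instead of retrying
            self._opened_at = 0.0  # cooldown elapsed: half-open, allow one probe
        return True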

    duration += random.random()
else:
    duration = (
        int(response.headers.get(retry_after_header_name)[0]) + random.random()
    )
seratch (Member, Author): The random.random() adds random jitter, but it might not be necessary here; this is not backoff.

from .interval_calculator import RetryIntervalCalculator


class FixedValueRetryIntervalCalculator(RetryIntervalCalculator):
seratch (Member, Author): Just a reference implementation; it's unused with the default settings. I should add some tests for this.

filmaj (Contributor): If it's unused, you can probably skip the tests. Unless you want to keep the coverage scores high 😆

seratch (Member, Author): Haha, yeah, I always like better coverage!
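For reference, the fixed-value calculator discussed in this thread amounts to something like this (a sketch against the RetryIntervalCalculator interface; the constructor argument name is illustrative):

from slack_sdk.http_retry.interval_calculator import RetryIntervalCalculator

class FixedIntervalSketch(RetryIntervalCalculator):
    def __init__(self, fixed_interval: float = 0.5):
        self.fixed_interval = fixed_interval

    def calculate_sleep_duration(self, current_attempt: int) -> float:
        # The same wait on every attempt, unlike the backoff calculator
        return self.fixed_interval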

filmaj (Contributor) left a review:
Wow, lots of work and you did a great job! Thanks for involving me in the review.

I left a few comments, mostly for my own learning and education.


duration = (
    int(response.headers.get(retry_after_header_name)[0]) + random.random()
)
time.sleep(duration)
filmaj (Contributor): Theoretically, using the synchronous client, if the API responds with a relatively large value in the Retry-After header (e.g., the docs for this header show an example value of 30), would this freeze the entire process?

seratch (Member, Author):

> would this freeze the entire process?

When it comes to the same thread, yes. Thinking about the behavior of the app as a whole, it depends on how the app is implemented. In the case of Bolt for Python, all the code except ack() is executed in a background thread, so it does not result in a 3-second timeout.

By default, we don't enable retries for rate-limited errors. Developers should turn that on with a good understanding of the potentially long pause.
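For the record, opting in with the sync client looks like this (based on the RateLimitErrorRetryHandler from the summary above; the handler can also be passed to the constructor instead):

import os

from slack_sdk.web import WebClient
from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
# Off by default: add this only if a long Retry-After pause is acceptable
client.retry_handlers.append(RateLimitErrorRetryHandler(max_retry_count=1))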

from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class AsyncConnectionErrorRetryHandler(RetryHandler):
filmaj (Contributor): Since this async implementation relies on the same base class that is shared with the sync implementation, and the base RetryHandler class's prepare_for_next_attempt uses Python's built-in sleep method, could this lead to a situation where we block the process even when using an async handler?

I am not very familiar with aiohttp, but it seems to be based on the asyncio library, which has its own async-friendly sleep implementation (or, at least, an aiohttp documentation page implies that such an async sleep exists; search for asyncio on that page for the relevant section).

I am posing this question from a place of ignorance and a desire to learn, so it is likely I am completely off. But asking dumb questions is helpful for me to learn 🤪

seratch (Member, Author): @filmaj Ah, this is a great point! Yes, we should use asyncio.sleep here instead, and I was aware of it, but somehow I forgot to override the method. We can have an async base class for retry handlers that uses asyncio's sleep method; all the methods in it will be async methods. I will update this part shortly.
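For illustration, an asyncio-friendly handler would override the wait along these lines (a sketch; the async method name and RetryState attributes are assumptions modeled on the sync interface):

import asyncio

from slack_sdk.http_retry.handler import RetryHandler

class AsyncRetryHandlerSketch(RetryHandler):
    # Same decision logic as the sync handler, but the wait must not block the event loop
    async def prepare_for_next_attempt_async(self, *, state, request, response=None, error=None) -> None:
        state.next_attempt_requested = True
        duration = self.interval_calculator.calculate_sleep_duration(state.current_attempt)
        await asyncio.sleep(duration)  # yields control instead of blocking the thread
        state.increment_current_attempt()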

# The mock test server derives the scenario to simulate from the suffix of the token
header = self.headers["Authorization"]
if header is not None and "xoxp-" in header:
    pattern = str(header).split("xoxp-", 1)[1]
    if "remote_disconnected" in pattern:
filmaj (Contributor): Very nice pattern, I like this a lot!

seratch (Member, Author) left a review:

@filmaj Thanks for your review! I will update some parts before merging this.

seratch (Member, Author) commented Aug 7, 2021

Fixed all the issues in the latest revision. Let me merge this PR now. I will release an RC version to get feedback from the community.

@seratch seratch merged commit c6efe45 into slackapi:main Aug 7, 2021
@seratch seratch deleted the issue-887-retry-handlers branch August 7, 2021 02:34
seratch added a commit to seratch/python-slack-sdk that referenced this pull request Aug 13, 2021
seratch added a commit to seratch/python-slack-sdk that referenced this pull request Aug 13, 2021
seratch added a commit to seratch/python-slack-sdk that referenced this pull request Aug 13, 2021
seratch added a commit to seratch/python-slack-sdk that referenced this pull request Aug 17, 2021