Fix a bug where servers could be marked as up when they were failing #16506
Changes from 3 commits
@@ -0,0 +1 @@
+Fix a bug introduced in Synapse 1.59.0 where servers would be incorrectly marked as available when a request resulted in an error.
@@ -170,10 +170,10 @@ def __init__(
                database in milliseconds, or zero if the last request was
                successful.
            backoff_on_404: Back off if we get a 404

            backoff_on_failure: set to False if we should not increase the
                retry interval on a failure.

            notifier: A notifier used to mark servers as up.
            replication_client A replication client used to mark servers as up.
            backoff_on_all_error_codes: Whether we should back off on any
                error code.
        """
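The flags documented above decide whether a given outcome counts against the destination. As a rough, self-contained illustration only (the real decision is made inside `RetryDestinationLimiter.__exit__`; the status-code thresholds below are assumptions, not Synapse's exact rules):

    # Hypothetical sketch: classify a response using the documented flags.
    def should_back_off(
        status_code: int,
        backoff_on_404: bool = False,
        backoff_on_all_error_codes: bool = False,
    ) -> bool:
        """Return True if the destination should be pushed (further) into backoff."""
        if status_code < 400:
            return False  # success: never back off
        if backoff_on_all_error_codes:
            return True  # any error code counts as a failure
        if status_code == 404:
            return backoff_on_404  # only a failure if backoff_on_404 is set
        # Assumed default: server errors count as failures, most client errors do not.
        return status_code >= 500

    # Example outcomes under this sketch:
    print(should_back_off(200))                                   # False
    print(should_back_off(404, backoff_on_404=True))              # True
    print(should_back_off(403))                                   # False
    print(should_back_off(502))                                   # True
    print(should_back_off(400, backoff_on_all_error_codes=True))  # True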
@@ -237,6 +237,9 @@ def __exit__(
         else:
             valid_err_code = False

+        # Store whether the destination had previously been failing.
+        previously_failing = bool(self.failure_ts)
+
         if success:
             # We connected successfully.
             if not self.retry_interval:
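For reference, the docstring above describes the stored failure time as a value in milliseconds that is zero when the last request succeeded; `bool(self.failure_ts)` relies on exactly that truthiness. A tiny illustration with made-up values:

    # failure_ts is assumed to be None/0 while healthy, or the time (ms) the
    # destination first started failing; its truthiness captures "already failing".
    failure_ts = None                 # healthy destination
    print(bool(failure_ts))           # False -> was not previously failing
    failure_ts = 1_697_040_000_000    # first failure recorded at this time (ms)
    print(bool(failure_ts))           # True  -> was previously failing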
@@ -291,17 +294,15 @@ async def store_retry_timings() -> None:
                     self.retry_interval,
                 )

-                if self.notifier:
-                    # Inform the relevant places that the remote server is back up.
-                    self.notifier.notify_remote_server_up(self.destination)
-
-                if self.replication_client:
-                    # If we're on a worker we try and inform master about this. The
-                    # replication client doesn't hook into the notifier to avoid
-                    # infinite loops where we send a `REMOTE_SERVER_UP` command to
-                    # master, which then echoes it back to us which in turn pokes
-                    # the notifier.
-                    self.replication_client.send_remote_server_up(self.destination)
+                # If the server was previously failing, but is no longer.
+                if previously_failing:
@erikjohnston this might need some thoughts from you as the original author of #12500 -- was this done on purpose and I'm missing some understanding?

Actually, I think this is not quite right still, it will end up calling this code if we were previously failing & still failing. I think?

It should be OK now. 👍

The logic could be simplified to only check
+                    if self.notifier:
+                        # Inform the relevant places that the remote server is back up.
+                        self.notifier.notify_remote_server_up(self.destination)
+
+                    if self.replication_client:
+                        # Inform other workers that the remote server is up.
+                        self.replication_client.send_remote_server_up(self.destination)

             except Exception:
                 logger.exception("Failed to store destination_retry_timings")
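Taken together, the intent of the change is that the "remote server is back up" notifications fire only when a destination transitions from failing to healthy, not on every stored result. A minimal, self-contained sketch of that pattern (not Synapse's actual class; the names and the explicit `currently_failing` check are illustrative of where the review lands, while the snapshot above still gates on `previously_failing` alone):

    import time
    from typing import Optional


    class RetryLimiterSketch:
        """Toy stand-in for RetryDestinationLimiter's bookkeeping."""

        def __init__(self) -> None:
            # Time of the first failure in the current failing streak, or None
            # while the destination is considered healthy.
            self.failure_ts: Optional[float] = None

        def record_result(self, success: bool) -> None:
            # Mirror `previously_failing = bool(self.failure_ts)` from the diff.
            previously_failing = self.failure_ts is not None

            if success:
                self.failure_ts = None
            elif self.failure_ts is None:
                self.failure_ts = time.time()

            currently_failing = self.failure_ts is not None

            # Notify only on a failing -> healthy transition; notifying on every
            # result was the bug the changelog entry describes.
            if previously_failing and not currently_failing:
                self.notify_remote_server_up()

        def notify_remote_server_up(self) -> None:
            print("destination is back up")


    limiter = RetryLimiterSketch()
    limiter.record_result(False)  # starts failing: no notification
    limiter.record_result(False)  # still failing: no notification (the reviewer's concern)
    limiter.record_result(True)   # recovered: notification fires once
    limiter.record_result(True)   # already healthy: no notification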
I'm not sure if there's a more visible symptom? Perhaps it would cause things to be retried too often?