-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(clustering/rpc): sync retry timeout due to block #14195
Conversation
1b010c6
to
8c14f00
Compare
7c4fcb6
to
b89f0e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kong/clustering/services/sync/rpc.lua:380:16: unused function start_sync_once_timer
Could you fix the linting?
b89f0e5
to
dd60315
Compare
done |
abd0016
to
1ee8325
Compare
6570804
to
c43194d
Compare
timeout should not be considered a failure trial KAG-6224
c43194d
to
51e634d
Compare
if retry_count > MAX_RETRY then | ||
ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count) | ||
return | ||
end | ||
|
||
return start_sync_once_timer(retry_count + 1) | ||
-- we do not count a timed out sync. just retry | ||
if err ~= "timeout" then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now timeout
option is 0
:
local SYNC_MUTEX_OPTS = { name = "get_delta", timeout = 0, }
So how the error timeout
will happen? Or it will always be timeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chronolaw This is a locked operation, where timeout = 0
means it will fail instantly if the lock can not be acquired. By tolerating timeout in this case, we allow sync_once
to actually retry instead of giving up when another coroutine holds the lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that it will always get a timeout
error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chronolaw No. When the callback fails or errors out it will emit a different error. That is what we count for a real error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e, if we have no error but the version still does not match, we count it as 1 trial; if it's a real failure (internal or external), we still count it as 1 trial. This way we prevent it from looping forever; but meanwhile we do not want a running coroutine to block the sync_once
call and make it quickly exhaust all the chances, so we do not count timeouts(lock failure)
@StarlightIbuki do not forget cherry-pick. |
Summary
timeout should not be considered a failure trial
Checklist
changelog/unreleased/kong
orskip-changelog
label added on PR if changelog is unnecessary. README.mdIssue reference
KAG-6224