fix(clustering/rpc): sync retry timeout due to block #14195

Merged: 2 commits, Jan 21, 2025
34 changes: 19 additions & 15 deletions kong/clustering/services/sync/rpc.lua
@@ -374,28 +374,17 @@ local function sync_handler(premature)
   if not res and err ~= "timeout" then
     ngx_log(ngx_ERR, "unable to create worker mutex and sync: ", err)
   end
-end
-
-
-local sync_once_impl
-
-
-local function start_sync_once_timer(retry_count)
-  local ok, err = kong.timer:at(0, sync_once_impl, retry_count or 0)
-  if not ok then
-    return nil, err
-  end

-  return true
+  return res, err
 end


-function sync_once_impl(premature, retry_count)
+local function sync_once_impl(premature, retry_count)
   if premature then
     return
   end

-  sync_handler()
+  local _, err = sync_handler()
StarlightIbuki marked this conversation as resolved.

   -- check if "kong.sync.v2.notify_new_version" updates the latest version

@@ -413,12 +402,27 @@ function sync_once_impl(premature, retry_count)

   -- retry if the version is not updated
   retry_count = retry_count or 0

   if retry_count > MAX_RETRY then
     ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count)
     return
   end

-  return start_sync_once_timer(retry_count + 1)
+  -- we do not count a timed out sync. just retry
+  if err ~= "timeout" then
@chronolaw (Contributor), Jan 21, 2025:

The timeout option is now 0:

local SYNC_MUTEX_OPTS = { name = "get_delta", timeout = 0, }

So how can the "timeout" error occur? Or will it always be a timeout?

Contributor Author:

@chronolaw This is a locking operation, where timeout = 0 means it fails immediately if the lock cannot be acquired. By tolerating the timeout in this case, we allow sync_once to actually retry instead of giving up when another coroutine holds the lock.
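
The behaviour described in this reply can be sketched in plain Lua. This is only an illustration under stated assumptions: `with_mutex` and `held` are hypothetical stand-ins, not Kong's mutex helper, and the sketch models nothing beyond "fail at once with a timeout error when the lock is busy, report a different error when the callback itself fails".

```lua
-- Hypothetical stand-in for a mutex configured with `timeout = 0`.
-- Not Kong's implementation; it only mimics the error contract.
local held = false

local function with_mutex(callback)
  if held then
    -- The lock is busy: fail immediately instead of waiting.
    return nil, "timeout"
  end

  held = true
  local ok, res = pcall(callback)
  held = false

  if not ok then
    -- A failing callback yields an error that is not "timeout".
    return nil, "callback failed: " .. tostring(res)
  end

  return res
end

-- A nested attempt made while the lock is held fails with "timeout".
local inner_res, inner_err
with_mutex(function()
  inner_res, inner_err = with_mutex(function() return true end)
  return true
end)
print(inner_res, inner_err)
```

In this sketch the caller can tell "someone else is already syncing" apart from "the sync itself failed" by comparing the error against "timeout", which is the distinction the retry logic relies on.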

Contributor:

Do you mean that it will always get a timeout error?

Contributor Author:

@chronolaw No. When the callback fails or errors out, it emits a different error. That is what we count as a real error.

Contributor Author:

I.e., if there is no error but the version still does not match, we count it as one attempt; if it is a real failure (internal or external), we also count it as one attempt. This prevents it from looping forever. At the same time, we do not want a running coroutine to block the sync_once call and make it quickly exhaust all its attempts, so we do not count timeouts (lock failures).

+    retry_count = retry_count + 1
+  end
+
+  -- in some cases, the new spawned timer will be switched to immediately,
+  -- preventing the coroutine who possesses the mutex to run
+  -- to let other coroutines has a chance to run
+  local ok, err = kong.timer:at(0.1, sync_once_impl, retry_count or 0)
chronolaw marked this conversation as resolved.
+  -- this is a workaround for a timerng bug, where tail recursion causes failure
chronolaw marked this conversation as resolved.
+  -- ok could be a string so let's convert it to boolean
+  if not ok then
+    return nil, err
+  end
+  return true
 end
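
The counting rule discussed in the review thread can be isolated as a small pure function. This is a minimal sketch, not the merged code: `next_retry_count` is a hypothetical helper, and the value of `MAX_RETRY` is an assumed placeholder because the diff does not show it.

```lua
-- Assumed placeholder; the real MAX_RETRY value is not visible in this diff.
local MAX_RETRY = 5

-- Hypothetical helper: given the current retry count and the error returned
-- by the last sync attempt, return the count to pass to the next attempt,
-- or nil once the retry budget is exhausted.
local function next_retry_count(retry_count, err)
  retry_count = retry_count or 0

  if retry_count > MAX_RETRY then
    return nil
  end

  -- A lock timeout means another coroutine is syncing, so it is not counted.
  -- A stale version (no error) or any other error consumes one attempt.
  if err ~= "timeout" then
    retry_count = retry_count + 1
  end

  return retry_count
end

print(next_retry_count(0, "timeout"))            -- lock busy: count is unchanged
print(next_retry_count(0, nil))                  -- stale version: one attempt used
print(next_retry_count(0, "connection refused")) -- real failure: one attempt used
```

Only non-timeout outcomes move the counter toward `MAX_RETRY`, so a long-running sync in another coroutine delays the retry without using up the budget.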

