-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic LB sync non-external backends only when necessary #5418
Conversation
Hi @ZxYuan. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ZxYuan The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ZxYuan can you share some more numbers please, ideally graphs? I'd like to see how much impact does this change have on CPU usage. |
/ok-to-test |
/assign @ElvinEfendi |
rootfs/etc/nginx/lua/balancer.lua
Outdated
@@ -36,6 +36,8 @@ local IMPLEMENTATIONS = { | |||
|
|||
local _M = {} | |||
local balancers = {} | |||
local external_backends = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about backends_with_external_name
?
rootfs/etc/nginx/lua/balancer.lua
Outdated
@@ -36,6 +36,8 @@ local IMPLEMENTATIONS = { | |||
|
|||
local _M = {} | |||
local balancers = {} | |||
local external_backends = {} | |||
local last_timestamp = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be more specific and indicate that it is specifically about backend sync - what about backends_last_synced_at
?
rootfs/etc/nginx/lua/balancer.lua
Outdated
@@ -92,6 +94,12 @@ local function format_ipv6_endpoints(endpoints) | |||
return formatted_endpoints | |||
end | |||
|
|||
local function use_external_name(backend) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to https://github.com/kubernetes/ingress-nginx/pull/5418/files#r412948805, I'd go with is_backend_with_external_name
.
rootfs/etc/nginx/lua/balancer.lua
Outdated
end | ||
return | ||
end | ||
last_timestamp = timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you deliberately updating this in the beginning of function body to avoid retries in case of i.e json decoding error happens? Intuitively I'd have expected this update to be in the end of the function body. I wonder how we can make the intention clearer for the reader.
@@ -17,6 +17,14 @@ function _M.get_general_data() | |||
return configuration_data:get("general") | |||
end | |||
|
|||
function _M.get_timestamp_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to https://github.com/kubernetes/ingress-nginx/pull/5418/files#r412949876, get_raw_backends_last_sycned_at
.
@@ -17,6 +17,14 @@ function _M.get_general_data() | |||
return configuration_data:get("general") | |||
end | |||
|
|||
function _M.get_timestamp_data() | |||
local timestamp = configuration_data:get("timestamp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/timestamp/raw_backends_last_sycned_at
@@ -177,6 +185,14 @@ local function handle_backends() | |||
return | |||
end | |||
|
|||
local timestamp = os.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ngx.update_time
and ngx.time
?
local timestamp = os.time() | ||
success, err = configuration_data:set("timestamp", timestamp) | ||
if not success then | ||
ngx.log(ngx.ERR, "dynamic-configuration: error setting sync timestamp: " .. tostring(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include some information in the error message about the potential consequences of not being able to update this timestamp.
rootfs/etc/nginx/lua/balancer.lua
Outdated
@@ -129,6 +135,15 @@ local function sync_backend(backend) | |||
end | |||
|
|||
local function sync_backends() | |||
local timestamp = configuration.get_timestamp_data() | |||
if timestamp <= last_timestamp then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we force sync if last_timestamp
- timestamp
is too large?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should force sync if current timestamp is too larger than backends_last_sycned_at
stored in each worker, avoiding failure of backends syncing caused by error updating raw_backends_last_sycned_at
in shared configuration data?
Thanks for your nice suggestions. I'll give a patch later.
|
@ZxYuan: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@ZxYuan do you have a script to generate this scenario? |
FYI. Create a nginx deployment on port 80 with label
Run the bash script
|
The PR records a timestamp in nginx shared configuration when API "/configuration/backends" is called by ingress controller. Every second each nginx worker checks whether it is necessary to sync non-external backends.
What this PR does / why we need it:
In our Kubernetes cluster with thousands of service and ingress resources, each of nginx workers syncing backends every second leads to quite heavy load for CPUs. In my testing environment with no input requests, impact on cpu usage is showed as below. Heavy CPU load also seriously affects request time of Biz APIs.
Although it is necessary to sync backends of services using ExternalName since DNS may change anytime, frequent sync of non-external backends can be avoided.
Types of changes
I'm not sure about the type of change, but how dynamic LB works internally is actually modified. Perhaps Breaking change?
Which issue/s this PR fixes
How Has This Been Tested?
Create N ingress rules and corresponding N non-ExternalName services. Each ingress rule makes a route to one of the services. Without the patch, when N is large, each nginx worker runs out much of its CPU core as the figure above shows.
Checklist: