
Kong routes requests to targets deleted via Admin API #8757

Closed
mkornatz opened this issue May 4, 2022 · 9 comments
Labels
core/admin-api core/balancer task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not.

Comments

@mkornatz

mkornatz commented May 4, 2022

Is there an existing issue for this?

Kong version ($ kong version)

2.8.1

Current Behavior

When deleting a target from an upstream via the Admin API, we see a recurring issue where the Kong Gateway proxy nodes continue to route traffic to the target. The Admin API shows the correct state (the target is removed), but we can confirm from the request logs that traffic is still being routed to the target long after the target is removed.
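For reference, the deletion is a plain Admin API call of this shape (a hypothetical example; the Admin API address, upstream name, and target address are placeholders, not our real values):

curl -i -X DELETE http://localhost:8001/upstreams/my-upstream/targets/10.0.0.2:80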

Our current workaround has been to restart the proxy nodes when we detect this issue; that resolves it, but it causes other side effects, such as cutting off long-running API requests.

About 1 second after the target is removed via the DELETE /upstreams/{upstream_id}/targets/{target_id} Admin API endpoint, we see log entries like the following coming from the Kong Gateway. Note: these log lines may be out of order due to how our logging system collects entries with the same timestamp.

2022/05/02 15:10:39 [error] 1108#0: *20612 [lua] events.lua:194: do_handlerlist(): worker-events: event callback failed; source=balancer, event=targets, pid=1115 error='/usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of log_by_lua*
log_by_lua(nginx-kong.conf:136):2: in main chunk', data={"operation":"create","entity":{"upstream":{"id":"943f01f4-60d0-4435-933d-c76adbbe7060"}}} while logging request, client: REDACTED, server: kong, request: "POST / HTTP/1.1", upstream: "http://REDACTED:80/", host: "REDACTED"
/usr/local/share/lua/5.1/kong/init.lua:1380: in function 'log'
/usr/local/share/lua/5.1/kong/runloop/handler.lua:1651: in function 'after'
...l/share/lua/5.1/kong/runloop/balancer/healthcheckers.lua:169: in function 'report_http_status'
/usr/local/share/lua/5.1/resty/healthcheck.lua:582: in function 'report_http_status'
/usr/local/share/lua/5.1/resty/healthcheck.lua:560: in function 'run_mutexed_fn'
[C]: in function 'pcall'
/usr/local/share/lua/5.1/resty/healthcheck.lua:676: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:640>
/usr/local/share/lua/5.1/resty/healthcheck.lua:1242: in function 'raise_event'
/usr/local/share/lua/5.1/resty/worker/events.lua:361: in function 'post'
/usr/local/share/lua/5.1/resty/worker/events.lua:219: in function 'do_event_json'
/usr/local/share/lua/5.1/resty/worker/events.lua:185: in function 'do_handlerlist'
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/runloop/handler.lua:240: in function </usr/local/share/lua/5.1/kong/runloop/handler.lua:235>
/usr/local/share/lua/5.1/kong/runloop/balancer/targets.lua:154: in function 'on_target_event'
.../local/share/lua/5.1/kong/runloop/balancer/balancers.lua:198: in function 'create_balancer'
.../local/share/lua/5.1/kong/runloop/balancer/balancers.lua:105: in function 'create_balancer_exclusive'
/usr/local/share/lua/5.1/kong/cache/init.lua:168: in function 'fetch_targets'
/usr/local/share/lua/5.1/resty/mlcache.lua:685: in function 'get'
/usr/local/openresty/lualib/resty/lock.lua:153: in function 'lock'
[C]: in function 'sleep'
stack traceback:

Intermixed with those errors above, we also see several log entries like the following related to the given target/upstream.

1108#0: *26698 [lua] healthcheckers.lua:171: report_http_status(): [healthchecks] failed reporting status: true while logging request

This issue is intermittent and doesn't always occur. It seems to be most prevalent when the targets being deleted are already having issues and have become unresponsive or take a very long time (30s) to respond to requests.

Expected Behavior

When a target is removed from an upstream, we expect that traffic will no longer be routed to it. With eventual Kong worker consistency, we don't expect this change to be applied immediately. However, we do expect it to take effect within several seconds of the target being removed.

Steps To Reproduce

We are still working to reproduce the issue in an isolated environment. However, in our production environment we can confirm the following steps produce the behavior (a sketch of the equivalent Admin API calls follows the list).

1. In an environment with multiple gateway proxy nodes (at least 2)
2. Create a service and upstream and configure active and passive health checks with a standard configuration
3. Add target_1 and target_2 to the upstream
4. Force target_2 to become unresponsive or take a long time to respond to requests
5. Delete target_2 from the upstream
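For context, a rough sketch of steps 2–5 as Admin API calls (a hypothetical example; the Admin API address, names, ports, and health-check values are placeholders rather than our exact configuration):

# 2. Create an upstream with active and passive health checks, and a service pointing at it
curl -s -X POST http://localhost:8001/upstreams \
  -H 'Content-Type: application/json' \
  -d '{"name":"my-upstream","healthchecks":{"active":{"healthy":{"interval":5},"unhealthy":{"interval":5}},"passive":{"unhealthy":{"http_failures":3}}}}'
curl -s -X POST http://localhost:8001/services --data name=my-service --data host=my-upstream
# 3. Add target_1 and target_2
curl -s -X POST http://localhost:8001/upstreams/my-upstream/targets --data target=target_1:80
curl -s -X POST http://localhost:8001/upstreams/my-upstream/targets --data target=target_2:80
# 4. Make target_2 unresponsive or very slow to respond (done outside Kong)
# 5. Delete target_2
curl -s -X DELETE http://localhost:8001/upstreams/my-upstream/targets/target_2:80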

Anything else?

  • We are using the following config values. We have tried multiple variations of these configs with no change in behavior.
KONG_DB_UPDATE_FREQUENCY: "5"
KONG_WORKER_STATE_UPDATE_FREQUENCY: "5"
KONG_WORKER_CONSISTENCY: "eventual"
  • In our environment, we often have targets that become unresponsive or start serving incorrect data when they are under heavy load. A background job in a separate system is responsible for deleting these targets when they get into a problematic state. This often leads to targets being added and removed every 5 minutes or so.
  • Both the admin and proxy nodes are running in a Kubernetes cluster
  • The logs included in the report above seem to be the only relevant logs
  • The upstream is configured to use both active and passive checks for the targets
@ADD-SP added the core/balancer, task/needs-investigation, and core/admin-api labels on May 5, 2022
@23henne

23henne commented May 9, 2022

I'm wondering if it might be the same issue as #8517, or at least related to it. But in your case it at least logs a more or less meaningful error message.

@mkornatz
Author

@23henne Thanks for pointing me to that other issue. It does indeed look similar. I'm still having trouble reproducing in a non-production environment, but if I can, I'll also mention it in the other issue.

@uvish

uvish commented May 25, 2022

Same issue here on Kong 2.8.1: the logs show requests routed to the old IP, but the Admin API /targets endpoint shows only one target, with the new IP.
Update: on checking the DB we found multiple entries for the same upstreams, one in the old format and one in the new format. After clearing all the older entries the problem was fixed. This might have occurred because of an incomplete migration.
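For anyone wanting to check for the same thing, a query along these lines should surface duplicate target rows per upstream (a hypothetical sketch assuming the default Postgres schema, where targets live in a targets table keyed by upstream_id; the connection string is a placeholder):

psql "$KONG_PG_URI" -c "
  SELECT upstream_id, target, count(*) AS duplicate_rows
  FROM targets
  GROUP BY upstream_id, target
  HAVING count(*) > 1;"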

@esatterwhite

This may be related to #8976

@locao
Contributor

locao commented Jul 8, 2022

Hi @mkornatz! Thanks for your report. The log you shared seems to be missing the stack traceback itself. Can you share the lines that follow stack traceback:?

@ybce

ybce commented Jul 8, 2022

Hello @locao, I'm Matt's colleague and I believe that is the stack traceback. As Matt mentioned: "Note: these log lines may be out of order due to how our logging system collects entries with the same timestamp."

@altnti

altnti commented Jul 25, 2022

Same issue; Kong version is 2.1.4.

We used the Kong Admin API to add one target (10...:47179) and remove another target (10...:57593).
One of the Kong nodes continues to send requests to the removed target (10...:57593).
Fetching the target list via the Admin API shows the correct state.
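For reference, the output below came from a call of this shape (the Admin API address is a placeholder added here for illustration):

curl -s http://localhost:8001/upstreams/fadc44a1-550e-410f-be86-de67ce1a3acb/targets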
{
  "next": null,
  "data": [
    {
      "created_at": 1658732450.015,
      "id": "a2c1f98b-2142-43b4-b489-eb223b9ae970",
      "tags": null,
      "weight": 100,
      "target": "10...:47179",    # new target (is live)
      "upstream": {
        "id": "fadc44a1-550e-410f-be86-de67ce1a3acb"
      }
    }
  ]
}
The issue may fix itself automatically when the upstream is updated.
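If that is the case, one possible way to nudge the balancer without restarting the proxy nodes is to touch the upstream via the Admin API (a hypothetical sketch; the Admin API address is a placeholder, and we have not confirmed here that this forces a rebuild):

# Hypothetical: re-apply an existing field on the upstream to trigger an update event
curl -s -X PATCH http://localhost:8001/upstreams/fadc44a1-550e-410f-be86-de67ce1a3acb \
  --data slots=10000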

@raoxiaoyan
Contributor

The related PR is #10384. Closing this issue.

@1madhu9

1madhu9 commented Apr 7, 2024

Hi Team,
We are observing the exact same issue in Kong version 3.2.2 as well: when deleting a target from an upstream via the Admin API, we see a recurring issue where the Kong Gateway proxy nodes continue to route traffic to the target. We had instances in our production environment where some of the Kong nodes were proxying requests to an upstream target that had been terminated more than 2 days earlier.
The Admin API shows the correct state of all targets (the target is removed), but from the request logs we can see that traffic is still being proxied to the target long after the target is removed.

We are trying to reproduce the issue in a non-production environment, but we haven't been able to do so yet. We will share more details if we manage to reproduce it.

Kong configuration:
Version: 3.2.2
Mode: traditional mode with Postgres DB
