Services unavailable after route updates in db-less mode #8976
cc: @rainest |
I think it may be partly due to the health checker. When running in Kubernetes, the target IPs can change, sometimes frequently. If Kong is trying to keep up with the pod IPs as they roll rather than with the service name, that could produce this flaky behavior we are seeing. |
@fred-cardoso Good to know I'm not alone. This problem has been hard to replicate and has generally fixed itself, so we brushed it off as not critical. As we have moved more and larger services behind Kong, it has gotten worse and taken longer to fix. In the most recent case, it never recovered even after 2 hours; we had to manually restart all of the Kong pods. |
For us it's getting critical since it affects frontends and those are clearly noticed by the users.
Unfortunately on our setup it's not "easy" to change the deployment and deploy the DB, but maybe you are right. Definitely something worth testing. |
I agree, this was more a question for the Kong folks. Just looking for something that might help mitigate the problem. |
Hello @esatterwhite, thank you for reporting this issue. Are you seeing these periods of instability with all reconfigurations of Kong gateway, or just with some of them, like @fred-cardoso? Thank you! |
As an additional bit of information: we've found some situations in which multiple reconfiguration requests issued to Kong in short intervals could get it into an unstable state with overly long response times. A fix for this issue is being tested, but it is not yet certain when it will be released. |
It is hard to say; we certainly notice the problem much more when we change ingress rules, as that doesn't cause Kong to restart. We run db-less on Kubernetes, and changing anything about the deployment configuration, env vars, etc. causes the Kong pods to restart. We don't use a lot of plugins as of yet. |
Not sure if this is helpful as well, but I also noticed that the pods don't even get the requests from Kong. What I do see is a drop in requests reaching the pods, but those requests don't fail, so it's really Kong not being able to connect to them. |
Yes, I'm pretty sure this is what caused the lingering problem. Most of the 503s go away as the upstream targets are rebuilt, but there was a period of about 2 hours where about 10% of requests would 503, and the only way to fix it was to restart the Kong instances. In particular, when running on Kubernetes, scaling a deployment up/down or restarting one causes dozens of upstream target rebuilds, as all of the IPs of the pods change. For a large deployment of 300 pods, restarting 10% of the pods at a time, as I understand it, that is 30 router rebuilds/reconfigures in a very short period of time. |
It happened again where the synchronization was left in a bad state. We didn't change the configuration of anything directly, but some pods in our infrastructure that are associated with a Kong ingress restarted.
hey -n 1000 https://xxxx.com
Summary:
Total: 3.7963 secs
Slowest: 0.4579 secs
Fastest: 0.0348 secs
Average: 0.1398 secs
Requests/sec: 263.4139
Total data: 17110 bytes
Size/request: 17 bytes
Response time histogram:
0.035 [1] |
0.077 [266] |■■■■■■■■■■■■■■■■■■■■■■■■■
0.119 [9] |■
0.162 [434] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.204 [150] |■■■■■■■■■■■■■■
0.246 [59] |■■■■■
0.289 [25] |■■
0.331 [22] |■■
0.373 [29] |■■■
0.416 [4] |
0.458 [1] |
Latency distribution:
10% in 0.0411 secs
25% in 0.0479 secs
50% in 0.1413 secs
75% in 0.1726 secs
90% in 0.2341 secs
95% in 0.3038 secs
99% in 0.3591 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0089 secs, 0.0348 secs, 0.4579 secs
DNS-lookup: 0.0038 secs, 0.0000 secs, 0.0764 secs
req write: 0.0000 secs, 0.0000 secs, 0.0036 secs
resp wait: 0.0533 secs, 0.0348 secs, 0.4578 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0067 secs
Status code distribution:
[200] 705 responses
[503] 295 responses
Restarting it was the only fix. |
@hanshuebner @locao this was a community report that looked similar to the issue we were working on in EE PRs 3344 and 3212. Would it be possible to make an OSS image that includes those also? |
#3207 sounds like the actual fix? The commit makes it sound like the fix is mainly to remove the error log. Does it prevent requests from being sent to the missing upstream? |
Oh, those commits are really new. We are going to rebuild the image with @fred-cardoso, test it, and see if it gets better. |
@hanshuebner Thanks, this looks like it may be helpful, but I'm not entirely sure the size of the configuration is the problem. The problem seems to persist for long periods of time. After one or more reconfigurations happen, there seem to be invalid upstream targets lingering which Kong keeps sending requests to, even though the IPs do not exist anymore. I would think the health checker would eventually remove those from the balancer, or at least stop sending requests to them. |
This is an interesting point. |
That's a good point. The configuration on the upstream exists, but the default check interval is ... If this problem continues, it sounds like we'd have to do that. |
Although it feels unnecessary between Kubernetes and Kong. |
We also see this a lot when things restart:
ingress-kong-7fbc678578-4glkz kong-proxy 2022/06/22 15:30:03 [error] 1107#0: *6829049 [lua] balancers.lua:228: get_balancer(): balancer not found for test.default.80.svc, will create it, client: X.YYY.ZZ.11, server: kong, request: "GET /account/signin HTTP/1.1", host: "app.test.com"
The balancer loses track of everything for a while. Things are unusable while this is happening. |
@hanshuebner Is there an image/tag we can pull down to try? |
These commits are on the master branch, but they're not yet part of a release. If you want to try them, you'll have to build Kong yourself. Kong 2.8 is planned to be released soon, but I'm not able to give you an exact date. |
With @fred-cardoso we are currently testing kong/kong:2.8.0-d648489b6-alpine, which seems to be working great, but there are still some 50Xs. |
If it is still reporting 503s, what was the improvement? |
It's reporting far fewer 502s, and I'm not sure they are related to Kong; we need to understand why they are happening. |
For us it is pretty reproducible at scale. Restarting the Kubernetes deployment with 300 pods will trigger many reconfigurations and IP changes in close proximity, and several, if not all, of the services registered to the Kong ingress will be unavailable. We know it's Kong because customers report getting the failure to get a peer from the ring-balancer error, and the logs indicate Kong sent a request to an IP that isn't there anymore (can't connect to it). All the pods are up and responsive. |
We are going to come back to you in a few days with our conclusions. |
Hey folks,
Even though the behaviors are indeed similar, I'm afraid @esatterwhite and @fred-cardoso have different issues.
About Kong 2.7:
About Kong 2.8:
@fred-cardoso should get rid of the 503s by enabling health-checks on the upstream entities. Please note that this is still only a possibility, as we were not able to reproduce the behavior locally. Here are the docs on enabling the health-checks; passive health-checking would be enough. |
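For reference, a minimal sketch of what enabling passive health-checks could look like through the ingress controller, assuming the KongIngress override mechanism is available; the resource name, namespace, thresholds, and the service-one Service are illustrative, not taken from this report:
kubectl apply -f - <<EOF
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: passive-healthchecks   # illustrative name
  namespace: default
upstream:
  healthchecks:
    passive:
      healthy:
        successes: 1           # one good response marks a target healthy again
      unhealthy:
        http_failures: 3       # eject a target after 3 failed proxied requests
        timeouts: 3
EOF
# Attach the override to the backing Service (service name is an example):
kubectl annotate service service-one konghq.com/override=passive-healthchecks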
Hi,
Looks like kong/kong:2.8.0-d648489b6-alpine fixed our issue. We don't have the
Also, @locao, we aren't using Kong health-checks, only k8s ones. If the 503s start again, we'll try to enable them.
For us, as far as we are aware, the issue is fixed, but we'll be monitoring the situation and this issue in case something new comes up. |
@Tchekda I'm facing a similar problem. Could you share if you are using ingress.kubernetes.io/service-upstream annotation on your services or not? |
Hi, |
I used this on one of the services that was very problematic during successive config reloads, and it helped. But this isn't an ideal solution in all cases. There are cases where we want to control the load-balancing algorithm and take pressure off of kube-proxy. Using the annotation feels more like a workaround. |
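For anyone else landing here, a minimal sketch of the workaround being discussed, assuming the Service is named service-one as in the ingress output below:
# Route traffic to the ClusterIP Service instead of individual pod IPs,
# so Kong's target list no longer changes on every pod roll:
kubectl annotate service service-one ingress.kubernetes.io/service-upstream="true"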
I upgraded to kong/kong:2.8.0-d648489b6-alpine 4 days ago in our production env and the 503 status codes are gone! It works for me, thank you! |
I'm a little confused as to which versions do or do not have these fixes. Does 2.8.1 not have them? |
This version specifically |
We don't have a public release with fixes directly related to this issue, as we are not able to reproduce it in a test environment. We have a couple of theories that are being tested, but none has been proven to fix it. |
How are you trying to reproduce it? |
@locao Can you shed some light on how you all might be trying to recreate the issue? We're running on Kubernetes, so if you have a test cluster, it is fairly straightforward to scale the deployments up. |
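A rough reproduction sketch along those lines; the deployment and host names are taken from the report below and are illustrative, assuming the Deployment shares the Service's name:
# Scale a deployment behind a Kong-managed ingress to a large size, then force
# a rolling restart so many endpoint IPs change in a short window:
kubectl scale deployment service-one --replicas=300
kubectl rollout restart deployment service-one
# While the roll is in progress, measure 503s through the proxy:
hey -n 1000 https://logs.domain-one.com/healthcheck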
Also related to #9051 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Is there an existing issue for this?
Kong version ($ kong version)
2.7.0
Current Behavior
When running Kong in DB-less mode with the ingress controller, updating ingress routes can cause all services to become unavailable for several minutes. Given the event-driven nature of Kubernetes, updates to ingress rules can happen in rapid succession and in close proximity. When this happens, the configuration sync + route update process between the ingress controller and the Kong proxy seems to get into a very bad state.
Every service behind Kong will return a 503
"failure to get a peer from the ring-balancer"
for 3-5 minutes. It's particularly bad with larger Kubernetes deployments (100+ pods). In some situations it never completely recovers: some registered routes keep pointing to IPs that don't exist and ~10% of requests result in a 503. The only way to recover at this point is to restart all instances of Kong.
Expected Behavior
When the configuration is rebuilt, existing routes to backing services should continue to work until they can safely be removed from the router. Bringing the entire system down for minutes at a time every time we touch the configuration isn't a viable production solution.
Steps To Reproduce
Here is a rough output of the ingress that is used to configure Kong in DB-Less mode using the ingress controller
In this particular case, we added everything under logs.domain-two.com; everything else had previously existed. logs.domain-two.com is a copy of logs.domain-one.com.
Note this is a rather large deployment with 300 pods/IPs behind it.
Name:             ingress-kong
Namespace:        default
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
  Host                   Path                   Backends
  ----                   ----                   --------
  logs.domain-one.com
                         /logs                  service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /supertenant           service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /webhooks              service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /akam
                         /aptible               service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /p/ping                service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /healthcheck           service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
  logs.domain-two.com
                         /logs                  service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /supertenant           service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /webhooks              service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /akamai                service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /aptible               service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /ping                  service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /p/ping                service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                         /healthcheck           service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
  heroku.domain-one.com
                         /heroku/logplex        service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
                         /heroku-addon/logplex  service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
  heroku.domain-two.com
                         /heroku/logplex        service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
                         /heroku-addon/logplex  service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
  api.domain-one.com
                         /v1                    service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
                         /v2/export             service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
  api.domain-two.com
                         /v1                    service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
                         /v2/export             service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
                         /p/ping                service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
  api2.domain-one.com
                         /v1                    service-four:80 (XX.AA.BBB.CCC:7080)
                         /v2/export             service-four:80 (XX.AA.BBB.CCC:7080)
  api2.domain-two.com
                         /v1                    service-four:80 (XX.AA.BBB.CCC:7080)
                         /v2/export             service-four:80 (XX.AA.BBB.CCC:7080)
                         /p/ping                service-four:80 (XX.AA.BBB.CCC:7080)
  app.domain-one.com
                         /                      service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
  app.domain-two.com
                         /                      service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
  app2.domain-one.com
                         /                      service-six:80 (XX.XX.AAA.BBB:7080)
  app2.domain-two.com
                         /                      service-four:80 (XX.XX.AAA.BBB:7080)
  tail.domain-one.com
                         /                      service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
  tail.domain-two.com
                         /                      service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
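A minimal sketch of roughly what the triggering change looks like as a manifest, assuming networking.k8s.io/v1 Ingress resources and the kong ingress class; only two of the new paths under logs.domain-two.com are shown, and the resource name is illustrative rather than the actual ingress-kong object described above:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: logs-domain-two   # illustrative name
  namespace: default
spec:
  ingressClassName: kong
  rules:
  - host: logs.domain-two.com
    http:
      paths:
      - path: /logs
        pathType: Prefix
        backend:
          service:
            name: service-one
            port:
              number: 80
      - path: /healthcheck
        pathType: Prefix
        backend:
          service:
            name: service-one
            port:
              number: 80
EOF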