Services unavailable after route updates in db-less mode #8976

Closed · 1 task done
esatterwhite opened this issue Jun 17, 2022 · 41 comments
Labels
area/ingress-controller (Issues where Kong is running as a Kubernetes Ingress Controller) · area/kubernetes (Issues where Kong is running on top of Kubernetes) · core/performance · pending author feedback (Waiting for the issue author to get back to a maintainer with findings, more details, etc.) · stale

Comments

@esatterwhite

esatterwhite commented Jun 17, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

2.7.0

Current Behavior

When running Kong in DB-less mode with the ingress controller, updating ingress routes can cause all services to become unavailable for several minutes. Given the event-driven nature of Kubernetes, updates to ingress rules can happen in rapid succession and in close proximity. When this happens, the configuration sync + route update process between the ingress controller and the Kong proxy seems to get into a very bad state.

Every Service behind Kong returns a 503 "failure to get a peer from the ring-balancer" for 3-5 minutes. It's particularly bad with larger Kubernetes deployments (100+ pods). In some situations it never completely recovers: some registered routes keep pointing to IPs that no longer exist and ~10% of requests result in a 503. The only way to recover at that point is to restart all instances of Kong.

  • worker_consistency = strict
  • worker_state_update_frequency = 5
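(For context, a rough sketch of how these two settings are supplied to the proxy container as environment variables in a Kubernetes deployment; kong.conf keys map to KONG_-prefixed variables, and the container name and image tag below are illustrative.)

containers:
  - name: proxy
    image: kong:2.7.0                          # version from this report; tag is illustrative
    env:
      - name: KONG_DATABASE
        value: "off"                           # DB-less mode
      - name: KONG_WORKER_CONSISTENCY
        value: "strict"
      - name: KONG_WORKER_STATE_UPDATE_FREQUENCY
        value: "5"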

Expected Behavior

When the configuration is rebuilt, existing routes to backing services should continue to work until they can safely be removed from the router. Bringing the entire system down for minutes at a time every time we touch the configuration isn't a viable production solution.

Steps To Reproduce

Here is a rough output of the ingress used to configure Kong in DB-less mode via the ingress controller.

In this particular case, we added everything under logs.domain-two.com; everything else had previously existed. logs.domain-two.com is a copy of logs.domain-one.com.

Note this is a rather large deployment with 300 pods/IPs behind it.

Name:             ingress-kong
Namespace:        default
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
  Host               Path  Backends
  ----               ----  --------
  logs.domain-one.com  
                     /logs          service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /supertenant   service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /webhooks      service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /akam
                     /aptible       service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /p/ping        service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /healthcheck   service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
  logs.domain-two.com     
                     /logs          service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /supertenant   service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /webhooks      service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /akamai        service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /aptible       service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /ping          service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /p/ping        service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
                     /healthcheck   service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
  heroku.domain-one.com  
                     /heroku/logplex         service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
                     /heroku-addon/logplex   service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
  heroku.domain-two.com   
                     /heroku/logplex         service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
                     /heroku-addon/logplex   service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
  api.domain-one.com     
                     /v1          service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
                     /v2/export   service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
  api.domain-two.com      
                     /v1          service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
                     /v2/export   service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
                     /p/ping      service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
  api2.domain-one.com    
                     /v1          service-four:80 (XX.AA.BBB.CCC:7080)
                     /v2/export   service-four:80 (XX.AA.BBB.CCC:7080)
  api2.domain-two.com     
                     /v1          service-four:80 (XX.AA.BBB.CCC:7080)
                     /v2/export   service-four:80 (XX.AA.BBB.CCC:7080)
                     /p/ping      service-four:80 (XX.AA.BBB.CCC:7080)
  app.domain-one.com     
                     /   service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
  app.domain-two.com      
                     /   service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
  app2.domain-one.com    
                     /   service-six:80 (XX.XX.AAA.BBB:7080)
  app2.domain-two.com     
                     /   service-four:80 (XX.XX.AAA.BBB:7080)
  tail.domain-one.com    
                     /   service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
  tail.domain-two.com     
                     /   service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
@esatterwhite
Author

cc: @rainest

@esatterwhite
Author

I think it may be partly due to the health checker. When running in Kubernetes, the target IPs can change, sometimes frequently. If Kong is trying to keep up with pod IPs as they roll rather than using the service name, that could produce the flaky behavior we are seeing.

@fred-cardoso

I have a Kong deployment as well, with around 50 pods, and we are experiencing the same issue, or at least one that is very similar.

Kong version ($ kong version)

2.8.0-b4d44dac8
I'm using a docker image with the following tag: kong:2.8.0-b4d44dac8-alpine

Current Behavior

Kong is running in DB-less mode. Sometimes (not always), when the ingress controller updates the routes / syncs a new configuration into Kong, some services (not all) become unavailable, generally until the next sync.
I've been able to correlate the proxy returning 503 "failure to get a peer from the ring-balancer" with the level=info msg="successfully synced configuration to kong." log messages.

For example, I have a first spike of 503s at 17:07 (UTC+1) (the Prometheus scraping interval is not precise enough):
[screenshot: 503 spike]
And the logs show the following:
[screenshot: proxy logs]

The 503 spikes stop around 17:16:30 (UTC+1):
[screenshot: 503s subsiding]
And the logs show:
[screenshot: proxy logs]

Steps To Reproduce

In my configuration it's not easy to reproduce; it happens sporadically. CPU/memory usage can't be correlated to the issue itself.

@esatterwhite
Author

@fred-cardoso Good to know I'm not alone. This problem has been hard to replicate and has generally fixed itself, so we brushed it off as not critical. As we have moved more and larger services behind Kong, it has gotten worse and taken longer to fix.

In the most recent case, it never recovered even after 2 hours; we had to manually restart all of the Kong pods.
Wondering if the sync + reconcile process is more accurate and stable in Hybrid mode?

@fred-cardoso

For us it's getting critical since it affects frontends and those are clearly noticed by the users.

Wondering if the sync + reconcile process is more accurate and stable in Hybrid mode?

Unfortunately, in our setup it's not "easy" to change the deployment and deploy the DB, but maybe you are right. Definitely something worth testing.
Even though I think DB-less should work properly 😛

@esatterwhite
Author

I agree, this was more a question for the Kong folks. Just looking for something that might help mitigate the problem.

@hanshuebner
Contributor

Hello @esatterwhite,

thank you for reporting this issue. Are you seeing these periods of instability with all reconfigurations of Kong Gateway, or just with some of them, like @fred-cardoso?

Thank you!
Hans

@hanshuebner
Contributor

As an additional bit of information: we've found some situations in which multiple reconfiguration requests issued to Kong in short intervals could get it into an unstable state with overly long response times. A fix for this issue is being tested, but it is not yet certain when it will be released.

@esatterwhite
Author

thank you for reporting this issue. Are you seeing these periods of instability with all reconfigurations of Kong Gateway, or just with some of them, like @fred-cardoso?

It is hard to say; we certainly notice the problem much more when we change ingress rules, as that doesn't cause Kong to restart. We run DB-less on Kubernetes, and changing anything about the deployment configuration, env vars, etc. causes the Kong pods to restart.

We don't use a lot of plugins as of yet.

@fred-cardoso

Not sure if this is helpful as well, but I also noticed that the pods don't even receive the requests from Kong. What I do see is a drop in requests, but the ones that do arrive don't fail, so it's really Kong not being able to connect to the pods.

@esatterwhite
Author

esatterwhite commented Jun 21, 2022

As an additional bit of information: we've found some situations in which multiple reconfiguration requests issued to Kong in short intervals could get it into an unstable state with overly long response times.

Yes, I'm pretty sure this is what caused the lingering problem. Most of the 503s go away as the upstream targets are rebuilt, but there was a window of about 2 hours where roughly 10% of requests would 503. The only way to fix it was to restart the Kong instances.

In particular, when running on Kubernetes, scaling a deployment up/down or restarting one causes dozens of upstream target rebuilds, as all of the pod IPs change.

For a large deployment (300 pods) restarting 10% of pods at a time, as I understand it that is 30 router rebuilds / reconfigurations in a very short period of time.
It honestly has everyone rather scared of running Kong in production at the moment. For a number of apps we've had to turn on the upstream service target option for the ingress controller so Kong isn't tracking IP addresses, but this means we lose the load-balancing behaviors of Kong and put more pressure on kube-proxy.

@esatterwhite
Author

esatterwhite commented Jun 21, 2022

It happened again, where the synchronization was left in a bad state. We didn't change the configuration of anything directly, but some pods in our infrastructure that are associated with a Kong ingress restarted.

hey -n 1000 https://xxxx.com

Summary:
  Total:        3.7963 secs
  Slowest:      0.4579 secs
  Fastest:      0.0348 secs
  Average:      0.1398 secs
  Requests/sec: 263.4139
  
  Total data:   17110 bytes
  Size/request: 17 bytes

Response time histogram:
  0.035 [1]     |
  0.077 [266]   |■■■■■■■■■■■■■■■■■■■■■■■■■
  0.119 [9]     |■
  0.162 [434]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.204 [150]   |■■■■■■■■■■■■■■
  0.246 [59]    |■■■■■
  0.289 [25]    |■■
  0.331 [22]    |■■
  0.373 [29]    |■■■
  0.416 [4]     |
  0.458 [1]     |


Latency distribution:
  10% in 0.0411 secs
  25% in 0.0479 secs
  50% in 0.1413 secs
  75% in 0.1726 secs
  90% in 0.2341 secs
  95% in 0.3038 secs
  99% in 0.3591 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0089 secs, 0.0348 secs, 0.4579 secs
  DNS-lookup:   0.0038 secs, 0.0000 secs, 0.0764 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0036 secs
  resp wait:    0.0533 secs, 0.0348 secs, 0.4578 secs
  resp read:    0.0002 secs, 0.0000 secs, 0.0067 secs

Status code distribution:
  [200] 705 responses
  [503] 295 responses

Restarting it was the only fix.

@rainest
Contributor

rainest commented Jun 21, 2022

@hanshuebner @locao this was a community report that looked similar to the issue we were working on in EE PRs 3344 and 3212. Would it be possible to make an OSS image that includes those also?

@esatterwhite
Author

#3207 sounds like the actual fix? The commit makes it sound like the fix is mainly to remove the error log. Does it prevent requests from being sent to the missing upstream?

@hanshuebner
Contributor

The issue that we've recently fixed caused Kong to stall when a new reconfiguration cycle was started while another one was active.

95d704e
200f56e
ef58cdd
de37b51

@Tchekda

Tchekda commented Jun 22, 2022

Oh, those commits are really new. We are going to rebuild the image with @fred-cardoso, test it, and see if it gets better.

@esatterwhite
Author

esatterwhite commented Jun 22, 2022

@hanshuebner Thanks, this looks like it may be helpful, but I'm not entirely sure the size of the configuration is the whole problem. The problem seems to persist for long periods of time. After one or more reconfigurations happen, there seem to be invalid upstream targets lingering which Kong keeps sending requests to even though the IPs do not exist anymore.

I would think the health checker would eventually remove those from the balancer, or at least stop sending requests to them.

[screenshot: kong-service-503]

@fred-cardoso

This is an interesting point.
Do you have health checks enabled, @esatterwhite?
I don't have those. The health checks are only configured on the pods themselves; Kong is not doing them.

@esatterwhite
Author

esatterwhite commented Jun 22, 2022

This is an interesting point. Do you have health checks enabled, @esatterwhite? I don't have those. The health checks are only configured on the pods themselves; Kong is not doing them.

That's a good point. The health-check configuration on the upstream exists, but the check interval is 0 by default, so I suppose you'd have to configure one manually.

If this problem continues, it sounds like we'd have to do that.

@esatterwhite
Author

esatterwhite commented Jun 22, 2022

Although it feels unnecessary between Kubernetes + Kong.
This feels like behavior that shouldn't happen, let alone need to be configured explicitly.

@esatterwhite
Author

esatterwhite commented Jun 22, 2022

We also see this a lot when things restart:

ingress-kong-7fbc678578-4glkz kong-proxy 2022/06/22 15:30:03 [error] 1107#0: *6829049 [lua] balancers.lua:228: get_balancer(): balancer not found for test.default.80.svc, will create it, client: X.YYY.ZZ.11, server: kong, request: "GET /account/signin HTTP/1.1", host: "app.test.com"

The balancer loses track of everything for a while. Things are unusable while this is happening.

@esatterwhite
Author

The issue that we've recently fixed caused Kong to stall when a new reconfiguration cycle was started while another one was active.

95d704e 200f56e ef58cdd de37b51

@hanshuebner Is there an image/tag we can pull down to try?

@hanshuebner
Contributor

@hanshuebner Is there an image/tag we can pull down to try?

These commits are on the master branch, but they're not yet part of a release. If you want to try them, you'll have to build Kong yourself. Kong 2.8 is planned to be released soon, but I'm not able to give you an exact date.

@Tchekda

Tchekda commented Jun 23, 2022

@hanshuebner Is there an image/tag we can pull down to try?

With @fred-cardoso we are currently testing kong/kong:2.8.0-d648489b6-alpine, which seems to be working great, though we still see some 50Xs.
In the next few days we are going to decide whether we keep running the nightly build or revert back to the stable one.

@esatterwhite
Author

If it is still reporting 503s, what was the improvement?

@Tchekda

Tchekda commented Jun 23, 2022

If it is still reporting 503s what was the improvement?

It's reporting far fewer 502s, and I'm not sure those are related to Kong; we need to understand why they are happening.
We have been at almost 0 rps for the last day, so I would say it is an improvement.
We need to give it 1-2 more days to be sure, since the bug was happening randomly.

@esatterwhite
Author

For us it is pretty reproducible at scale. Restarting the Kubernetes deployment with 300 pods triggers many reconfigurations and IP changes in close proximity. Several, if not all, of the services registered to the Kong ingress become unavailable.
Sometimes that lingers for several hours.

We know it's Kong because customers report getting the "failure to get a peer from the ring-balancer" error, and the logs indicate Kong sent a request to an IP that isn't there anymore (can't connect). All the pods are up and responsive.

@Tchekda

Tchekda commented Jun 23, 2022

We are going to come back to you in a few days with our conclusions.
We also know that it's coming from Kong, for the same reasons as yours.

@locao
Contributor

locao commented Jun 24, 2022

Hey folks,

Even though the behaviors are indeed similar, I'm afraid @esatterwhite and @fred-cardoso have different issues.

About Kong 2.7:

  • "balancer not found for <upstream>, will create it" is mislabeled as an error message. It's informative and you can safely ignore it. It means that the load-balancers are still being created and this particular one was not touched yet, but as it is needed, it will be created before its turn.
  • I think it may be partly due to the health checker. When running in Kubernetes, the target IPs can change, sometimes frequently. If Kong is trying to keep up with pod IPs as they roll rather than using the service name, that could produce the flaky behavior we are seeing.
    Do you mean the Gateway health-checker or the KIC health-checker? The Gateway health-checker does not resolve DNS records, so it doesn't matter whether it's active or not. The DNS records are resolved by the load-balancer; if there's a problem there, proxying will fail either way.
  • Do you see any error level log messages?
  • I would recommend trying Kong 2.8.1, with health-checks enabled (see below). There are several improvements that may be related to the issues you are seeing, e.g. fix(targets) reschedule resolve timer after resolving #8344, perf(clustering) improve hash calculation performance #8204, and fix(runloop) reschedule rebuild timers after running them #8634.

About Kong 2.8:

  • Kong 2.8 has a big change on the health-checker side: the targets' health status is now kept between config reloads. So, if a target is unhealthy, making a change to the configuration doesn't make the target start proxying again. But with that we may have introduced an issue (not yet confirmed):
    • If the health-checks are not being used by any target, they are not attached to the upstream.
    • When they are attached, at load-balancer creation time, they ask the load-balancer to resolve the DNS records of its targets.
    • If they are not attached, that doesn't happen immediately. So, if the DNS server takes a bit longer to resolve the addresses, the IPs may not yet be resolved when the balancer becomes available, hence the 503s.

@fred-cardoso should get rid of the 503s by enabling health-checks on the upstream entities. Please note that this is still a hypothesis; we were not able to reproduce the behavior locally. Here are the docs on enabling the health-checks. Passive health-checking would be enough.
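For anyone following along, a rough sketch of what enabling health-checks through the ingress controller could look like; the resource name and thresholds below are illustrative, not a confirmed fix, and it includes a nonzero active interval since the interval defaults to 0 (disabled) as noted earlier in the thread:

apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: upstream-healthchecks      # illustrative name
upstream:
  healthchecks:
    active:
      healthy:
        interval: 5                # nonzero; the default of 0 leaves active probing disabled
        successes: 1
      unhealthy:
        interval: 5
        http_failures: 2
    passive:                       # per the suggestion above, passive checking alone would be enough
      healthy:
        successes: 1
      unhealthy:
        http_failures: 3
        timeouts: 3

The override would then be attached to a Service with an annotation along the lines of konghq.com/override: upstream-healthchecks; treat the exact thresholds as placeholders to tune.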

@Tchekda

Tchekda commented Jun 25, 2022

Hi,
As promised, here I am with our conclusions from our tests with @fred-cardoso:

Looks like kong/kong:2.8.0-d648489b6-alpine fixed our issue. We don't see the "failure to get a peer from the ring-balancer" message anymore, which is good news 👍!

Also, @locao, we aren't using Kong health-checks, only k8s ones. If the 503s start again, we'll try to enable them.

For us, as far as we are aware, the issue is fixed, but we'll be monitoring the situation and this issue in case something new comes up.

@hbagdi
Member

hbagdi commented Jun 29, 2022

@Tchekda I'm facing a similar problem. Could you share if you are using ingress.kubernetes.io/service-upstream annotation on your services or not?

@Tchekda

Tchekda commented Jun 29, 2022

@Tchekda I'm facing a similar problem. Could you share if you are using ingress.kubernetes.io/service-upstream annotation on your services or not?

Hi,
I don't think we are using this annotation, although I guess having it could have solved our issues, because kube-proxy should be better aware of which pods are ready and which aren't.
I can't really test this feature right now, but later we will give it a try and see if it helps.

@esatterwhite
Author

@Tchekda I'm facing a similar problem. Could you share if you are using ingress.kubernetes.io/service-upstream annotation on your services or not?

I used this on one of the services that was very problematic during successive config reloads, and it helped. But this isn't an ideal solution in all cases; there are cases where we want to control the load-balancing algorithm and take pressure off of kube-proxy. Using the annotation feels more like a workaround.
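For reference, the annotation goes on the Service and makes Kong proxy to the ClusterIP instead of tracking individual pod IPs, so kube-proxy does the pod-level balancing. A rough sketch, with the service name echoing the ingress above and the rest illustrative:

apiVersion: v1
kind: Service
metadata:
  name: service-one                                  # illustrative; one of the backends from the ingress
  annotations:
    ingress.kubernetes.io/service-upstream: "true"   # Kong sends traffic to the ClusterIP; kube-proxy picks the pod
spec:
  selector:
    app: service-one
  ports:
    - port: 80
      targetPort: 7080

As noted above, the trade-off is losing Kong's own load-balancing behavior and pushing that work onto kube-proxy.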

@thaonguyen-ct

thaonguyen-ct commented Jul 11, 2022

Hi, As promised, here I am with our conclusions from our tests with @fred-cardoso:

Looks like kong/kong:2.8.0-d648489b6-alpine fixed our issue. We don't see the "failure to get a peer from the ring-balancer" message anymore, which is good news 👍!

Also, @locao, we aren't using Kong health-checks, only k8s ones. If the 503s start again, we'll try to enable them.

For us, as far as we are aware, the issue is fixed, but we'll be monitoring the situation and this issue in case something new comes up.

I upgraded to kong/kong:2.8.0-d648489b6-alpine 4 days ago on our production env and the 503 status codes are gone! It works for me, thank you!

@esatterwhite
Author

I'm a little confused as to which versions do or do not have these fixes. Does 2.8.1 not have them?

@esatterwhite
Author

This version specifically

@locao
Contributor

locao commented Jul 11, 2022

We don't have a public release with fixes directly related to this issue as we are not able to reproduce it in a test environment. We have a couple of theories that are being tested, but none has been proved to fix it.

@esatterwhite
Author

How are you trying to reproduce it?

@esatterwhite
Author

@locao Can you shed some light on how you all might be trying to recreate the issue?
I think it's a problem at scale (not scale of throughput, but scale of deployment).
I could lay out our setup and you could try to reproduce it.

We're running on Kubernetes, so if you have a test cluster, it is fairly straightforward to scale the deployments up, something like the sketch below.
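A minimal sketch of the kind of backend we put behind the Kong ingress, assuming any simple HTTP container image (names and image are illustrative); the point is the replica count, since scaling it up/down or doing a rolling restart churns hundreds of endpoint IPs in a short window:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-one                    # illustrative; mirrors the largest backend above
spec:
  replicas: 300                        # the scale at which we see the problem
  selector:
    matchLabels:
      app: service-one
  template:
    metadata:
      labels:
        app: service-one
    spec:
      containers:
        - name: app
          image: kennethreitz/httpbin  # assumption: any HTTP backend works here
          ports:
            - containerPort: 80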

@esatterwhite
Author

Also related to #9051

@chronolaw chronolaw added area/kubernetes Issues where Kong is running on top of Kubernetes area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Dec 6, 2022
@stale

stale bot commented Dec 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 20, 2022
@stale stale bot closed this as completed Dec 27, 2022