Sleep-based synchronization between cache warmup and the first sync to Kong #2315
Comments
Do we indeed want this as an onboarding task? IIRC this was more to track that we're still using the sleep for lack of a great alternative and to collect any reports where the sleep is either insufficient or causes some unexpected issue. #2249 (review of the problem and non-viable options for fixing it) and #2250 (a PoC implementation to demonstrate why the known viable option is not ideal) show prior work, but to be clear the proposal here is basically something we'd like to avoid in order to preserve separation between the controllers, even if it works in a technical sense. This issue says "do this" but this should probably still be a spike to try and come up with alternatives.
Indeed, I don't think we should use this as an onboarding task; we should mark this one as a nice to have.
Sleeping is not a correct synchronization mechanism and as such IMO does not require further justification for fixing (for example, because changes to unrelated parts of the code may unexpectedly change the time it takes for cache warmup to finish). Whether it is likely to show visible outcomes in the foreseeable future is unknown, though (maybe it's unlikely, but we cannot know that for sure). Issues like this add up and result in surprising bugs across the codebase. I'm happy to be convinced otherwise, though.
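To illustrate the fragility being described, here is a minimal, purely hypothetical Go sketch of sleep-based startup synchronization (not the actual KIC code): as soon as warmup takes longer than the hard-coded delay, the first sync silently runs against an incomplete cache.

```go
package main

import (
	"fmt"
	"time"
)

// warmCache simulates cache warmup; how long it takes can change whenever
// unrelated code changes, which is exactly why a fixed sleep is fragile.
func warmCache(cache chan<- string, d time.Duration) {
	time.Sleep(d)
	cache <- "route-1"
	close(cache)
}

func main() {
	cache := make(chan string, 1)
	go warmCache(cache, 6*time.Second) // warmup now takes longer than the sleep below

	// Sleep-based "synchronization": correct only while warmup happens to
	// finish within 5 seconds, and nothing guarantees that it does.
	time.Sleep(5 * time.Second)

	select {
	case obj := <-cache:
		fmt.Println("first sync sees:", obj)
	default:
		fmt.Println("first sync ran against an empty cache") // silently pushes partial config
	}
}
```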
Good point. Would making this a spike, as @rainest proposed, make it a decent exploratory onboarding task (based on the prior art)?
To me, personally, "nice to have" means something like "this is a feature that is not likely to draw our attention ever". We could think that fixing any technical debt is a "nice to have" (and I'm viewing this issue as pure technical debt). Do we want to apply "nice to have" to technical debt items that pose only a maintenance problem? I see this label more on the functional side. To summarize, here's my thinking/alternative proposal:
WDYT?
As long as it's clear that this is a spike and the work that needs to be done first is to propose a solution, not immediately implement one, that's fine with me. Also you can have it marked as both
I have encountered related issues when testing in our prod env. Kong 2.8.0, DB-less. Our prod EKS env has 1000+ Services; I configured a single Ingress via a new ingressClass. After starting Kong and waiting for it to be running, requests to it returned 503, "failure to get a peer from the ring-balancer"; after a while, maybe 1 or 2 minutes, responses recovered to 200. Debugging via the Kong Admin API, I found 1 route and 1 service, but no target in the related upstream when the error occurred.
Ingress Controller log
The Service is running normally in EKS, and this occurs consistently in our scenario. Is there any plan or release version to fix it?
@jiachinzhao that's not this issue; if it were, you wouldn't have any service or route at all. That error generally means that your Service has no Endpoints because there are no Pods ready to serve it. If you do still think there's an issue with the controller not properly seeing Endpoints it should, please open a separate issue.
OK, I will open a separate issue to explain. Here are the steps to reproduce the related issue:
Start a Kong pod, then execute test-new.sh. The shell script checks whether the ingress controller reports ready before or after Kong's routes have synced.
The shell output shows that the ingress controller's healthz is ready before the routes have synced, which leads to 404 responses when traffic is proxied to this pod. In ingress controller 1.3.x, healthz only becomes ready after the cache sync has succeeded. Requests sent:
```
./test-new.sh 10.15.52.171
Fri Jul 29 09:33:59 UTC 2022
Fri Jul 29 09:34:00 UTC 2022
Fri Jul 29 09:34:01 UTC 2022
Fri Jul 29 09:34:02 UTC 2022
Fri Jul 29 09:34:03 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:05 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:06 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:07 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:08 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:09 UTC 2022
< HTTP/1.1 200 OK
Fri Jul 29 09:34:10 UTC 2022
< HTTP/1.1 200 OK
```
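The contents of test-new.sh are not shown in the thread. Purely as an illustration of what the description above says the script checks, a hypothetical Go equivalent could probe the controller's health endpoint and the Kong proxy once per second and print what it sees (the addresses and the 10254 health port below are assumptions, not taken from the report):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoints: the proxy address comes from the output above;
	// the controller health port is only an assumption.
	proxy := "http://10.15.52.171"
	healthz := "http://10.15.52.171:10254/healthz"

	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; i < 15; i++ {
		fmt.Println(time.Now().UTC().Format(time.UnixDate))

		if resp, err := client.Get(healthz); err == nil {
			fmt.Println("controller healthz:", resp.Status)
			resp.Body.Close()
		}
		if resp, err := client.Get(proxy); err == nil {
			fmt.Println("< "+resp.Proto, resp.Status) // e.g. "< HTTP/1.1 200 OK"
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```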
I thought that we had run into a startup/sync bug caused by this sleep but it actually turned out that the chart was using the wrong readiness endpoint: Kong/charts#716
Hi @dcarley 👋 The problem you're trying to solve is not related to this bug. What the readyz endpoint in KIC does as of now is return true once the first sync (i.e. sending Kong objects, translated from parsed Kubernetes resources, to the Kong instance(s)) has already happened. Sometimes this sync can send fewer objects than we'd intend because the parser https://github.com/Kong/kubernetes-ingress-controller/blob/91bd96dd47a7d15dcc340d6d2c032ddd1a8ab22d/internal/dataplane/parser/parser.go didn't manage to fill its caches yet. This is what #2250 tried to remediate, by using the manager's client to get those k8s resources, going around the parser cache(s). Your fix in Kong/charts#716 still makes sense, but it doesn't solve the issue described here.
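For context on the behavior described above, here is a minimal, hypothetical Go sketch (not the actual KIC implementation) of a readiness checker in the shape of controller-runtime's healthz.Checker that flips to ready after the first sync, regardless of how complete that sync was:

```go
package main

import (
	"errors"
	"net/http"
	"sync/atomic"
)

// firstSyncDone would be set by the sync loop after the first push of
// configuration to Kong, regardless of how complete that push was.
var firstSyncDone atomic.Bool

// readyz has the shape of controller-runtime's healthz.Checker
// (func(req *http.Request) error): returning nil means ready.
func readyz(_ *http.Request) error {
	if firstSyncDone.Load() {
		return nil // ready, even if the first sync sent a partial config
	}
	return errors.New("first sync to Kong has not happened yet")
}

func main() {
	// In a real manager this would be registered with something like
	// mgr.AddReadyzCheck("first-sync", readyz); invoked directly here.
	firstSyncDone.Store(true)
	if err := readyz(nil); err == nil {
		println("reporting ready")
	}
}
```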
Is there an existing issue for this?
Problem Statement
In order to mitigate #2249, #2255 has added a 5s sleep before the first sync.
Proposed Solution
Replace the sleep with a solution proposed in #2249 (comment)
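The concrete approach is in the linked comment. Purely as an illustration of one common alternative to a fixed sleep, a controller-runtime based process can block the first sync on the informer caches instead of on a timer; the names and wiring below are assumptions, not the actual KIC change:

```go
package startup

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// WaitForCacheThenSync blocks until the manager's informer caches have synced
// (or the timeout expires), then runs the first sync to Kong. firstSync is a
// hypothetical callback standing in for KIC's dataplane sync.
func WaitForCacheThenSync(ctx context.Context, mgr ctrl.Manager, firstSync func(context.Context) error) error {
	waitCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// WaitForCacheSync replaces a fixed sleep: it returns only once the
	// informers have populated their caches, however long that takes.
	if ok := mgr.GetCache().WaitForCacheSync(waitCtx); !ok {
		return errors.New("timed out waiting for caches to sync")
	}
	return firstSync(ctx)
}
```

Note that, per the discussion above, waiting on informer caches alone may not be what the linked proposal settles on, since the parser keeps its own caches; the sketch only shows the shape of event-based waiting versus a timer.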
Additional information
No response
Acceptance Criteria