Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sleep-based synchronization between cache warmup and the first sync to Kong #2315

Open
1 of 2 tasks
mflendrich opened this issue Mar 10, 2022 · 9 comments
Open
1 of 2 tasks

Comments

@mflendrich
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Problem Statement

In order to mitigate #2249, #2255 has added a 5s sleep before the first sync.

Proposed Solution

Replace the sleep with a solution proposed in #2249 (comment)

Additional information

No response

Acceptance Criteria

  • The first sync is blocked until the cache has been filled with all the relevant k8s objects in a reliable fashion.
@rainest
Copy link
Contributor

rainest commented May 3, 2022

Do we indeed want this as an onboarding task? IIRC this was more to track that we're still using the sleep for lack of a great alternative and to collect any reports where the sleep is either insufficient or causes some unexpected issue.

#2249 (review of the problem and non-viable options for fixing it) and #2250 (a PoC implementation to demonstrate why the known viable option is not ideal) show prior work, but to be clear the proposal here is basically something we'd like to avoid in order to preserve separation between the controllers, even if it works in a technical sense. This issue says "do this" but this should probably still be a spike to try and come up with alternatives.

@shaneutt
Copy link
Contributor

shaneutt commented May 4, 2022

Indeed, I don't think we should use this as an onboarding task, we should mark this one as a nice to have.

@mflendrich
Copy link
Contributor Author

collect any reports where the sleep is either insufficient or causes some unexpected issue

Sleeping is not a correct synchronization mechanism and as such IMO does not require further justification for fixing (for example because changes to unrelated parts of code may unexpectedly change the time it takes for the cache warmup to finish). Whether it is likely to show visible outcomes in predictable future is unknown (maybe it's unlikely, but we cannot know that for sure), though. Issues like this add up and result in surprising bugs across the codebase. I'm happy to be convinced otherwise, though.

This issue says "do this" but this should probably still be a spike to try and come up with alternatives.

Good point. Would making this a spike as @rainest proposed make it a decent onboarding exploratory (based on the prior art) task?

we should mark this one as a nice to have

To me, personally, "nice to have" means something like "this is a feature that is not likely to draw our attention ever". We could think that fixing any technical debt is a "nice to have" (and I'm viewing this issue as pure technical debt). Do we want to apply "nice to have" to technical debt items that pose only a maintenance problem? I see this label more on the functional side.

To summarize, here's my thinking/alternative proposal:

  • let's replace "do this" with "please explore and see if we can add a barrier that resolves the race otherwise" (seems uncontroversial)
  • keep "area/debt" rather than "nice to have"
  • consider this "onboarding" because it's a pure golang issue requiring not so much k8s/domain background

WDYT?

@shaneutt
Copy link
Contributor

shaneutt commented May 4, 2022

As long as it's clear that this is a spike and the work that needs to be done first is to propose a solution, not immediately implement one, that's fine with me. Also you can have it marked as both area/debt and nice-to-have at the same time.

@jiachinzhao
Copy link
Contributor

I have encountered related issues when testing in our prod env.

Kong 2.8.0 db-less
Kong ingress controller 2.4.2
EKS 1.18

Our prod EKS env have 1000+ service, I configured single ingress by a new ingressClass.

Start a kong, after it’s running, request to it cause 503, “failure to get a peer from the ring-balancer”, after a while maybe 1 or 2 min, response recover to 200

debug by kong admin api, I found there are 1 route, 1 service, but no target found in related upstream when error occured.

Ingress Controller log
2022-07-28 14:14:52 time="2022-07-28T06:14:52Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:49 time="2022-07-28T06:14:49Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:46 time="2022-07-28T06:14:46Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:43 time="2022-07-28T06:14:43Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:40 time="2022-07-28T06:14:40Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:37 time="2022-07-28T06:14:37Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080

Service is running normal in EKS and It ’s occured certainly in our scenario.

is there any plan or release version to fix it?

@rainest
Copy link
Contributor

rainest commented Jul 28, 2022

@jiachinzhao that's not this issue; if it was you wouldn't have any service or route at all. That error generally means that your Service has no Endpoints because there are no Pods ready to serve it.

If you do still think there's an issue with the controller not properly seeing Endpoints it should, please open a separate issue.

@jiachinzhao
Copy link
Contributor

jiachinzhao commented Jul 29, 2022

OK, i will open a separate issue to explain.

another issue related

step to reproduce

  1. Kong 2.8.0 db-less
  2. Kong ingress controller 2.2.0
  3. EKS 1.18
  4. apply a single ingress rule
  5. service and pod running normally

start a kong pod then excute test-new.sh

the shell aims to check whether ingress controller is ready before or after kong routes sync normally.

ip=`kubectl get pod -o wide -n kong | grep 'kong-new' | awk '{print $6}'`
echo $ip
while true
do
   date
   echo -e "\ningress controller healthz"
   curl -v $ip:10254/healthz
   echo -e "\nkong routes"
   curl $ip:8001/routes
   echo -e "\nrequest pong"
   curl -H 'host: kong-new.test.io'  $ip:8000/v1/frontlogsrv/pong
   sleep 1
done

the shell output shows that ingress controller healthz is ready before
kong routes sync normally.

this will lead to response 404 when traffic proxy to this pod

but in ingress controller 1.3.x, healthz always will be ready after waiting for cache sync success.

Requests sent ``` ./test-new.sh 10.15.52.171

Fri Jul 29 09:33:59 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • connect to 10.15.52.171 port 10254 failed: Connection refused
  • Failed to connect to 10.15.52.171 port 10254: Connection refused
  • Closing connection 0
    curl: (7) Failed to connect to 10.15.52.171 port 10254: Connection refused
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:00 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • connect to 10.15.52.171 port 10254 failed: Connection refused
  • Failed to connect to 10.15.52.171 port 10254: Connection refused
  • Closing connection 0
    curl: (7) Failed to connect to 10.15.52.171 port 10254: Connection refused
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:01 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • connect to 10.15.52.171 port 10254 failed: Connection refused
  • Failed to connect to 10.15.52.171 port 10254: Connection refused
  • Closing connection 0
    curl: (7) Failed to connect to 10.15.52.171 port 10254: Connection refused
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:02 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • connect to 10.15.52.171 port 10254 failed: Connection refused
  • Failed to connect to 10.15.52.171 port 10254: Connection refused
  • Closing connection 0
    curl: (7) Failed to connect to 10.15.52.171 port 10254: Connection refused
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:03 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:04 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:05 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:05 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:06 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:06 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:07 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:07 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:08 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:08 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[],"next":null}
    request pong
    {"message":"no Route matched with those values"}

Fri Jul 29 09:34:09 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:09 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[{"snis":null,"tags":null,"request_buffering":true,"response_buffering":true,"created_at":1659087249,"paths":["/v1/frontlogsrv/"],"methods":null,"https_redirect_status_code":426,"preserve_host":true,"service":{"id":"e56c138c-7c1d-5925-8e9e-698913f8e7ab"},"hosts":["kong-pre-release.klook.io"],"destinations":null,"protocols":["http","https"],"regex_priority":0,"path_handling":"v0","id":"bb31773d-8418-5943-b4ae-c88d8eae90a5","updated_at":1659087249,"headers":null,"name":"deploy.frontlogsrv-kong-new-test.00","strip_path":false,"sources":null}],"next":null}
    request pong
    frontlogsrv:01.:8080:active:service/deploy/frontlogsrv

Fri Jul 29 09:34:10 UTC 2022
ingerss controller healthz

  • Trying 10.15.52.171...
  • TCP_NODELAY set
  • Connected to 10.15.52.171 (10.15.52.171) port 10254 (#0)

GET /healthz HTTP/1.1
Host: 10.15.52.171:10254
User-Agent: curl/7.61.1
Accept: /

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 29 Jul 2022 09:34:10 GMT
< Content-Length: 2
<

  • Connection #0 to host 10.15.52.171 left intact
    ok
    kong routes
    {"data":[{"snis":null,"tags":null,"request_buffering":true,"response_buffering":true,"created_at":1659087249,"paths":["/v1/frontlogsrv/"],"methods":null,"https_redirect_status_code":426,"preserve_host":true,"service":{"id":"e56c138c-7c1d-5925-8e9e-698913f8e7ab"},"hosts":["kong-pre-release.klook.io"],"destinations":null,"protocols":["http","https"],"regex_priority":0,"path_handling":"v0","id":"bb31773d-8418-5943-b4ae-c88d8eae90a5","updated_at":1659087249,"headers":null,"name":"deploy.frontlogsrv-kong-new-test.00","strip_path":false,"sources":null}],"next":null}
    request pong
    frontlogsrv:01.:8080:active:service/deploy/frontlogsrv
</details>

@dcarley
Copy link

dcarley commented Jan 24, 2023

I thought that we had run into a startup/sync bug caused by this sleep but it actually turned out that the chart was using the wrong readiness endpoint: Kong/charts#716

@pmalek
Copy link
Member

pmalek commented Jan 24, 2023

Hi @dcarley 👋

The problem you're trying to solve is not related to this bug. What readyz endpoint is doing in KIC as of now is returning true if the first sync (i.e. sending Kong objects - from parsed and translated Kubernetes resources - to Kong instance(s)) has already happened.

Sometimes this sync can send less objects then we'd intend it to because the parser https://github.com/Kong/kubernetes-ingress-controller/blob/91bd96dd47a7d15dcc340d6d2c032ddd1a8ab22d/internal/dataplane/parser/parser.go didn't manage to fill its caches yet. This is what #2250 tried to remediate by using manager's client to get those k8s resources going around the parser cache(s).

You fix in Kong/charts#716 still makes sense but it doesn't solve the issue described here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants