Test Knative in airgapped CKF #140

Closed
kimwnasptd opened this issue Aug 28, 2023 · 6 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@kimwnasptd (Contributor)

Right now we allow users to configure the following images for Knative Serving/Eventing:

custom_images:
  default: |
    activator: ''
    autoscaler: ''
    controller: ''
    webhook: ''
    autoscaler-hpa: ''
    net-istio-controller/controller: ''
    net-istio-webhook/webhook: ''
    queue-proxy: ''

    eventing-controller/eventing-controller: ''
    eventing-webhook/eventing-webhook: ''
    imc-controller/controller: ''
    imc-dispatcher/dispatcher: ''
    broker-controller/eventing-controller: ''

But once I run microk8s ctr images ls, I see the following relevant Knative images:

gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:33ea8a657b974d7bf3d94c0b601a4fc287c1fb33430b3dda028a1a189e3d9526
gcr.io/knative-releases/knative.dev/eventing/cmd/broker/ingress@sha256:f4a9dfce9eec5272c90a19dbdf791fffc98bc5a6649ee85cb8a29bd5145635b1
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:cbc452f35842cc8a78240642adc1ebb11a4c4d7c143c8277edb49012f6cfc5d3
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_controller@sha256:3ced549336c7ccf3bb2adf23a558eb55bd1aec7be17837062d21c749dfce8ce5
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_dispatcher@sha256:e17bbdf951868359424cd0a0465da8ef44c66ba7111292444ce555c83e280f1a
gcr.io/knative-releases/knative.dev/eventing/cmd/mtchannel_broker@sha256:c5d3664780b394f6d3e546eb94c972965fbd9357da5e442c66455db7ca94124c
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:c9c582f530155d22c01b43957ae0dba549b1cc903f77ec6cc1acb9ae9085be62
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate@sha256:59431cf8337532edcd9a4bcd030591866cc867f13bee875d81757c960a53668d
gcr.io/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate@sha256:d0095787bc1687e2d8180b36a66997733a52f8c49c3e7751f067813e3fb54b66
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa@sha256:7003443f0faabbaca12249aa16b73fa171bddf350abd826dd93b06f5080a146d
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae

From the above list of images reported in MicroK8s, it seems a couple of images are not part of the Serving CR. We'll have to make sure that:

  1. We know which images to use in the Serving CR
  2. The Serving CR exposes all the necessary fields for configuring Knative to run in airgapped environments (a quick way to cross-check this is sketched right after this list)
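One way to cross-check both points (a sketch, not something from the thread; it assumes kubectl access to the cluster) is to list the images that the Knative Deployments actually run and compare them with the configurable keys above:

# list each Knative Deployment and the image(s) its containers run
for ns in knative-serving knative-eventing; do
  kubectl get deployments -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].image}{"\n"}{end}'
done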
@kimwnasptd (Contributor, Author)

As part of this effort we found out that the Knative docs don't explain how to:

  1. Set the domain-mapping Deployment image
  2. Set the domain-mapping-webhook Deployment image

We need to find a way to configure these images in the KnativeServing CR:
https://knative.dev/docs/install/operator/configuring-serving-cr/#download-images-individually-without-secrets
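For reference, the linked page configures per-image overrides through spec.registry.override, roughly as below. This is only a sketch: the registry host and tags are placeholders, and the open question remains which keys (if any) apply to the domain-mapping images.

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  registry:
    override:
      # keys covered by the docs, with placeholder values:
      activator: registry.example.com/knative/activator:v1.8.0
      autoscaler: registry.example.com/knative/autoscaler:v1.8.0
      # ...but nothing documented for domain-mapping / domain-mapping-webhook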

@kimwnasptd (Contributor, Author)

Looking a little bit into the Knative Operator code, I found out that it works the following way:

  1. It finds all Deployments that have ownerReferences to the KnativeServing CR https://github.com/knative/operator/blob/b46a2d38c7e60edcbead2337db0e2d108ca97f5b/pkg/reconciler/common/images.go#L59
  2. It gets each Deployment's PodSpec
  3. For each container in the PodSpec, it checks whether there is an override key for that container https://github.com/knative/operator/blob/main/pkg/reconciler/common/images.go#L107
  4. If there is, it uses the spec.registry.override value to replace that container's image

This means that if a key under spec.registry.override in the KnativeServing CR matches the name of a container in any Deployment owned by that CR, then the operator will replace that container's image with the value from the registry override.
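So the practical question becomes what the container names actually are. One way to check them (a sketch, assuming kubectl access to the knative-serving namespace):

kubectl get deployments domain-mapping domain-mapping-webhook -n knative-serving \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].name}{"\n"}{end}'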

@kimwnasptd (Contributor, Author)

So, with the above, we can try setting the container names of the domain-mapping and domain-mapping-webhook Deployments as override keys and override their images.
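If those container names turn out to be simply domain-mapping and domain-mapping-webhook, the override would look roughly like the following (a sketch with placeholder registry and tags, not a verified configuration):

spec:
  registry:
    override:
      domain-mapping: registry.example.com/knative/domain-mapping:v1.8.0
      domain-mapping-webhook: registry.example.com/knative/domain-mapping-webhook:v1.8.0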

@orfeas-k (Contributor)

It looks like this is how the custom images feature has been designed to work, so we can add domain-mapping and domain-mapping-webhook to the charm's custom_images config value (which is a simple dictionary).
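In charm terms that would mean extending the custom_images dictionary with two more keys, e.g.:

domain-mapping: ''
domain-mapping-webhook: ''

and re-applying it with something like juju config knative-serving custom_images="$(cat custom_images.yaml)". The application name and the exact key names here are assumptions, not something verified in this thread.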

@orfeas-k (Contributor)

orfeas-k commented Sep 1, 2023

As mentioned in canonical/bundle-kubeflow#680, we bumped into #147, so for knative-serving we will be configuring it to use 1.8.0 (knative-eventing already uses 1.8.0).

@orfeas-k (Contributor)

orfeas-k commented Sep 5, 2023

Deploying the Knative charms in an airgapped environment works as expected, apart from the activator Deployment in the knative-serving namespace. Although the pod starts running, its container never becomes ready and constantly logs the following:

{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.454200818Z","logger":"activator","caller":"websocket/connection.go:144","message":"Websocket connection could not be established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"dial tcp: lookup autoscaler.knative-serving.svc.cluster.local: i/o timeout","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:144\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:225\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:228\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:423\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:222\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:162"}
{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.787749703Z","logger":"activator","caller":"websocket/connection.go:191","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:191"}
{"severity":"WARNING","timestamp":"2023-09-04T08:41:31.05744278Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f"}

Trying to debug this, we also deployed the above charms in a non-airgapped environment and noticed that the pod has the same logs there too, but its container is able to become ready. Investigating this further inside the airgapped env, we noticed the following in the CoreDNS pod's logs:

[INFO] 10.1.205.153:40339 - 44253 "AAAA IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000241772s
[INFO] 10.1.205.153:56166 - 44510 "A IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000258023s
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. AAAA: read udp 10.1.205.163:34994->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. A: read udp 10.1.205.163:34020->8.8.4.4:53: i/o timeout

Looking at the above, we are starting to believe that this has to do with the way our airgapped environment is set up (more info about the environment here: canonical/bundle-kubeflow#682):

  • Kubernetes pods have an ndots = 5 setting in their /etc/resolv.conf, meaning that query names with at least 5 dots bypass the search list and are resolved as absolute domain names, while shorter names first get the search-list suffixes (including .lxd here) appended. This is probably the reason the above address ends up being forwarded to 8.8.8.8 or 8.8.4.4 (an example resolv.conf is shown right after this list).
  • Curling those addresses fails, since we are in an airgapped environment. This would probably not be a problem on its own (we see these queries fail in a non-airgapped environment too). However, we execed into a pod and tried to hit autoscaler.knative-serving.svc.cluster.local.lxd(:8080), and noticed that although the request fails, it takes some seconds before we get a response. In the non-airgapped environment, we get the response right away.
  • The activator Deployment has a TimeoutThreshold of 1 second. We tried to manipulate this, but we believe it could be the Deployment's Go code that breaks the deployment.
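For illustration, a pod's /etc/resolv.conf in this kind of setup typically looks roughly like the following (example values, not captured from the actual environment); the lxd entry in the search list is what produces the *.lxd queries seen in the CoreDNS logs above:

search knative-serving.svc.cluster.local svc.cluster.local cluster.local lxd
nameserver 10.152.183.10
options ndots:5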

From the above, we are led to believe that the slow responses to the requests towards 8.8.x.x result in a timeout that blocks the container from becoming READY.

Solution

Configure the airgapped environment to immediately reject requests towards addresses outside the cluster.
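One possible way to do that (a sketch only, not necessarily how the environment was actually configured) is to adjust the CoreDNS Corefile so that the stray *.lxd lookups are answered with NXDOMAIN immediately instead of being forwarded to the unreachable 8.8.x.x upstreams, for example with the template plugin:

.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # answer the search-suffix queries locally instead of forwarding them upstream
    template IN ANY lxd {
        rcode NXDOMAIN
    }
    forward . 8.8.8.8 8.8.4.4
    cache 30
    loop
    reload
}

Scoping the template block to the lxd zone keeps cluster.local resolution untouched, while anything under .lxd fails fast.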
