
Policy Server Readiness Probe Fails after some time #239

Closed
floriankoch opened this issue Apr 28, 2022 · 12 comments · Fixed by #276


floriankoch commented Apr 28, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After some time (~140 minutes) the Policy Server readiness probe fails.
Because there is no liveness probe, I have no idea whether the application still works.

We have run the policy server with tracing enabled; there are no errors in the log.

Warning Unhealthy 52s (x92 over 157m) kubelet Readiness probe failed: Get "https://10.244.14.179:8443/readiness": dial tcp 10.244.14.179:8443: connect: connection refused

Latest Log Messages
2022-04-28T07:42:57.434789Z INFO validation{host="policy-server-default-7dbfb95cb8-nwh29" policy_id="clusterwide-allow-pod-privileged-psp-policy" kind="Pod" kind_group="" kind_version="v1" name="redacted-84f9d67d94-ljv5k" namespace="redacted" operation="CREATE" request_uid="7ac29b7c-b2ee-4bb3-84d1-462de6bf612a" resource="pods" resource_group="" resource_version="v1" subresource=""}:policy_eval: policy_server::worker: policy evaluation (monitor mode) policy_id="clusterwide-allow-pod-privileged-psp-policy" allowed_to_mutate=false response="ValidationResponse { uid: "7ac29b7c-b2ee-4bb3-84d1-462de6bf612a", allowed: true, patch_type: None, patch: None, status: Some(ValidationResponseStatus { message: Some(""), code: None }) }"
2022-04-28T07:42:57.434921Z DEBUG validation{host="policy-server-default-7dbfb95cb8-nwh29" policy_id="clusterwide-allow-pod-privileged-psp-policy" kind="Pod" kind_group="" kind_version="v1" name="redacted-84f9d67d94-ljv5k" namespace="redacted" operation="CREATE" request_uid="7ac29b7c-b2ee-4bb3-84d1-462de6bf612a" resource="pods" resource_group="" resource_version="v1" subresource="" allowed=true mutated=false}: policy_server::api: policy evaluated response="{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","response":{"uid":"7ac29b7c-b2ee-4bb3-84d1-462de6bf612a","allowed":true}}"

Expected Behavior

The Policy Server readiness probe does not fail without a reason.

Steps To Reproduce

The Policy Server runs in an Azure AKS cluster.
Default installation from the Kubewarden Helm charts.

Environment

- Azure AKS Cluster
- OS: Linux
- Architecture: amd64

Anything else?

No response

@floriankoch (Author)

Happened again, but now I have an error:

2022-04-28T08:53:38.908603Z DEBUG HTTP{http.method=GET http.url=https://kubernetes.default.svc/apis/networking.k8s.io/v1/ingresses? otel.name="HTTP" otel.kind="client"}: kube_client::client: requesting 2022-04-28T08:53:39.083856Z TRACE want: signal: Want 2022-04-28T08:53:39.083909Z TRACE want: signal found waiting giver, notifying 2022-04-28T08:53:39.083921Z TRACE want: signal: Want 2022-04-28T08:53:39.083948Z TRACE want: poll_want: taker wants! 2022-04-28T08:53:39.085653Z TRACE want: signal: Want 2022-04-28T08:53:41.449159Z DEBUG rustls::conn: Sending warning alert CloseNotify 2022-04-28T08:53:41.449276Z TRACE mio::poll: deregistering event source from poller 2022-04-28T08:53:41.449374Z TRACE want: signal: Closed error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""


flavio commented Apr 28, 2022

Can you share with us which policies are being enforced by the policy server?

@floriankoch (Author)

@flavio
protect mode:
disallow-service-loadbalancer

monitor mode:
disallow-service-nodeport
allow-privilege-escalation-psp-policy
allow-pod-privileged-psp-policy


floriankoch commented Apr 29, 2022

Found something in the controller logs:
2022-04-28T04:25:11.904Z INFO controller.clusteradmissionpolicy Starting workers {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "worker count": 1} 2022-04-28T04:25:11.905Z INFO controller.admissionpolicy Starting workers {"reconciler group": "policies.kubewarden.io", "reconciler kind": "AdmissionPolicy", "worker count": 1} 2022-04-28T05:12:16.151Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "allow-pod-privileged-psp-policy", "namespace": "", "error": "cannot retrieve admission policy: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:26.858Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: error reconciling policy-server CA Secret: Post \"https://172.16.0.1:443/api/v1/namespaces/redacted-system/secrets\": http2: server sent GOAWAY and closed the connection; LastStreamID=353937, ErrCode=NO_ERROR, debug=\"\""} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:35.472Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:36.046Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T13:12:26.613Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "error reconciling validating webhook: cannot patch validating webhook: etcdserver: request timed out", "errorVerbose": "cannot patch validating webhook: etcdserver: request timed out\nerror reconciling validating 
webhook\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:126\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:15.136Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot reconcile policy-server service: Internal error occurred: failed to allocate a serviceIP: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:15.149Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "allow-pod-privileged-psp-policy", "namespace": "", "error": "could not read policy server Deployment: etcdserver: leader changed", "errorVerbose": "etcdserver: leader changed\ncould not read policy server Deployment\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:113\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} 
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:16.748Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T21:12:26.522Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "error reconciling validating webhook: cannot reconcile validating webhook: etcdserver: request timed out", "errorVerbose": "cannot reconcile validating webhook: etcdserver: request timed out\nerror reconciling validating webhook\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:126\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T03:12:26.091Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: error reconciling policy-server deployment: etcdserver: request timed out, possibly due to previous leader failure"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T03:12:26.606Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} 
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T05:12:15.805Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot patch PolicyServer Configmap: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T05:12:18.040Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T09:12:12.699Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "could not read policy server Deployment: etcdserver: leader changed", "errorVerbose": "etcdserver: leader changed\ncould not read policy server Deployment\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:113\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T11:12:12.614Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "cannot retrieve admission policy: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T13:12:12.603Z ERROR controller.clusteradmissionpolicy 
Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-loadbalancer", "namespace": "", "error": "update admission policy status error: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T15:58:25.222Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "update admission policy status error: Operation cannot be fulfilled on clusteradmissionpolicies.policies.kubewarden.io \"disallow-service-nodeport\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227

@floriankoch (Author)

Happened again with this error in the controller; maybe it's a controller problem?

2022-05-03T01:42:56.386Z        ERROR   controller.policyserver Reconciler error        {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot lookup Policy server ConfigMap: etcdserver: leader changed"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227

@floriankoch (Author)

Running with only disallow-service-loadbalancer works, no errors, but also not much "traffic".
When adding a pod (maybe mutating?) policy, it breaks after some time, with much higher "traffic".

@floriankoch (Author)

I think the root cause is located in the controller:

 State:          Running
      Started:      Thu, 05 May 2022 10:25:31 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 05 May 2022 10:24:46 +0200
      Finished:     Thu, 05 May 2022 10:25:16 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:        250m
      memory:     512Mi
    Liveness:     http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:    http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ct654 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
  kube-api-access-ct654:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  53m (x3 over 53m)  kubelet  Readiness probe failed: Get "http://10.244.31.32:8081/readyz": dial tcp 10.244.31.32:8081: connect: connection refused
  Warning  Unhealthy  53m                kubelet  Liveness probe failed: Get "http://10.244.31.32:8081/healthz": dial tcp 10.244.31.32:8081: connect: connection refused
  Warning  BackOff    53m                kubelet  Back-off restarting failed container
  Normal   Pulled     53m (x3 over 35h)  kubelet  Container image "redacted/kubewarden-controller:v0.5.2" already present on machine
  Normal   Created    53m (x3 over 35h)  kubelet  Created container manager
  Normal   Started    53m (x3 over 35h)  kubelet  Started container manager

@floriankoch (Author)

@flavio the error happens when the kubewarden-controller has readiness problems:
Warning Unhealthy 57m (x4 over 175m) kubelet Readiness probe failed: Get "http://10.244.31.32:8081/readyz": dial tcp 10.244.31.32:8081: connect: connection refused

So it's the controller, not the policy-server.


ereslibre commented May 5, 2022

I think we have to work on trying to reproduce this issue.

As an idea: is the machine under high load/pressure? I see several pointers in this direction:

  • The etcd leader changes. This is not necessarily bad, but if the machine is under pressure, the fact that the etcd leader changes very often might be an indicator. Also, errors about the etcd leader having changed during regular operation are "fine" in terms of the controller operation; the action should simply be retried when such an error happens.
  • There are readiness probe failures in policy-server
  • There are readiness probe failures in kubewarden-controller

All this together leads me to think that the machine might be under high load/pressure or have slow disks; in this situation it is common to see things acting in a non-optimal way and to find this kind of error in the logs. Could it be?

However, we should try to reproduce these problems.

@floriankoch (Author)

@ereslibre I will try my best so that you can reproduce this; in my environment I can reproduce it in about ~60 minutes.

The environment is an Azure AKS cluster. I have no insight into the control plane; I do not even see the instances.

@floriankoch (Author)

@ereslibre @flavio I updated to the latest versions and cannot reproduce this anymore, but maybe this is the real cause:
kubewarden/kubewarden-controller#238

I'm closing this one; the discussion is better suited to the controller bug.

flavio added a commit to flavio/policy-server that referenced this issue Jun 20, 2022
Prior to this commit, we used the low-level hyper crate to create our HTTP
server. The code being used was pretty complex due to the low-level
nature of this library.
Moreover, the code wasn't robust enough. In certain cases the HTTP server
could fail and cause the whole policy-server to drop incoming requests.

This kind of failure was hard to reproduce, but some users have run
into it.

I was able to reproduce it too with minikube. Malformed client TLS
requests could cause the server to reject them and then enter an error
state.

Instead of implementing all the possible workarounds for these kinds of
situations, the code now implements the HTTP and HTTPS server using the
warp crate.

Warp is built on top of hyper, but provides a ready-to-consume
high-level API.
Our core business is not implementing HTTP(S) servers, hence by using
this library we make our code more robust and improve its overall
quality.

As a matter of fact, thanks to this change, a lot of obscure and
repetitive code has been dropped.
Also, a lot of top-level dependencies have been removed, because they
are now pulled in via warp.

FIXES kubewarden#239

Signed-off-by: Flavio Castelli <fcastelli@suse.com>
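
For illustration only, here is a minimal sketch of the kind of warp-based HTTPS server with a /readiness endpoint that the commit above describes. This is not the actual policy-server code: the route, port, and certificate paths are assumptions for the example, and it needs the warp crate built with the "tls" feature plus the tokio runtime.

// Minimal sketch of a warp-based HTTPS server (illustrative only,
// not the real policy-server implementation).
use warp::Filter;

#[tokio::main]
async fn main() {
    // Readiness endpoint: answer 200 OK while the server is up; the kubelet
    // readiness probe shown earlier hits an endpoint like this on port 8443.
    let readiness = warp::path("readiness").map(|| warp::reply());

    // warp::serve builds routing, HTTP and TLS handling on top of hyper,
    // which is the robustness improvement the commit message relies on.
    warp::serve(readiness)
        .tls()
        .cert_path("/pki/policy-server-cert.pem") // hypothetical path
        .key_path("/pki/policy-server-key.pem")   // hypothetical path
        .run(([0, 0, 0, 0], 8443))
        .await;
}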

flavio commented Jun 20, 2022

I ran into the issue while testing rc2. I found a fix :)

flavio reopened this Jun 20, 2022
flavio added a commit to flavio/policy-server that referenced this issue Jun 20, 2022