
Policy Server Readiness Probe Fails after some time #239

Closed
floriankoch opened this issue Apr 28, 2022 · 12 comments · Fixed by #276


floriankoch commented Apr 28, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After some time (~140 minutes) the Policy Server readiness probe fails.
Because there is no liveness probe, I have no idea whether the application still works.

We have run the policy server with tracing enabled; there are no errors in the log.

Warning Unhealthy 52s (x92 over 157m) kubelet Readiness probe failed: Get "https://10.244.14.179:8443/readiness": dial tcp 10.244.14.179:8443: connect: connection refused

Latest Log Messages
2022-04-28T07:42:57.434789Z INFO validation{host="policy-server-default-7dbfb95cb8-nwh29" policy_id="clusterwide-allow-pod-privileged-psp-policy" kind="Pod" kind_group="" kind_version="v1" name="redacted-84f9d67d94-ljv5k" namespace="redacted" operation="CREATE" request_uid="7ac29b7c-b2ee-4bb3-84d1-462de6bf612a" resource="pods" resource_group="" resource_version="v1" subresource=""}:policy_eval: policy_server::worker: policy evaluation (monitor mode) policy_id="clusterwide-allow-pod-privileged-psp-policy" allowed_to_mutate=false response="ValidationResponse { uid: "7ac29b7c-b2ee-4bb3-84d1-462de6bf612a", allowed: true, patch_type: None, patch: None, status: Some(ValidationResponseStatus { message: Some(""), code: None }) }"
2022-04-28T07:42:57.434921Z DEBUG validation{host="policy-server-default-7dbfb95cb8-nwh29" policy_id="clusterwide-allow-pod-privileged-psp-policy" kind="Pod" kind_group="" kind_version="v1" name="redacted-84f9d67d94-ljv5k" namespace="redacted" operation="CREATE" request_uid="7ac29b7c-b2ee-4bb3-84d1-462de6bf612a" resource="pods" resource_group="" resource_version="v1" subresource="" allowed=true mutated=false}: policy_server::api: policy evaluated response="{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","response":{"uid":"7ac29b7c-b2ee-4bb3-84d1-462de6bf612a","allowed":true}}"

Expected Behavior

The Policy Server readiness probe does not fail without a reason.

Steps To Reproduce

The Policy Server runs in an Azure AKS cluster.
Default installation from the Kubewarden Helm charts.

Environment

- Azure AKS Cluster
- OS: Linux
- Architecture: amd64

Anything else?

No response

@floriankoch (Author)

Happened again, but now I have an error:

2022-04-28T08:53:38.908603Z DEBUG HTTP{http.method=GET http.url=https://kubernetes.default.svc/apis/networking.k8s.io/v1/ingresses? otel.name="HTTP" otel.kind="client"}: kube_client::client: requesting 2022-04-28T08:53:39.083856Z TRACE want: signal: Want 2022-04-28T08:53:39.083909Z TRACE want: signal found waiting giver, notifying 2022-04-28T08:53:39.083921Z TRACE want: signal: Want 2022-04-28T08:53:39.083948Z TRACE want: poll_want: taker wants! 2022-04-28T08:53:39.085653Z TRACE want: signal: Want 2022-04-28T08:53:41.449159Z DEBUG rustls::conn: Sending warning alert CloseNotify 2022-04-28T08:53:41.449276Z TRACE mio::poll: deregistering event source from poller 2022-04-28T08:53:41.449374Z TRACE want: signal: Closed error: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""


flavio commented Apr 28, 2022

Can you share with us which policies are being enforced by the policy server?

@floriankoch (Author)

@flavio
protect mode:
disallow-service-loadbalancer

monitor mode:
disallow-service-nodeport
allow-privilege-escalation-psp-policy
allow-pod-privileged-psp-policy


floriankoch commented Apr 29, 2022

Found something in the controller logs:
2022-04-28T04:25:11.904Z INFO controller.clusteradmissionpolicy Starting workers {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "worker count": 1} 2022-04-28T04:25:11.905Z INFO controller.admissionpolicy Starting workers {"reconciler group": "policies.kubewarden.io", "reconciler kind": "AdmissionPolicy", "worker count": 1} 2022-04-28T05:12:16.151Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "allow-pod-privileged-psp-policy", "namespace": "", "error": "cannot retrieve admission policy: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:26.858Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: error reconciling policy-server CA Secret: Post \"https://172.16.0.1:443/api/v1/namespaces/redacted-system/secrets\": http2: server sent GOAWAY and closed the connection; LastStreamID=353937, ErrCode=NO_ERROR, debug=\"\""} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:35.472Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T05:44:36.046Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T13:12:26.613Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "error reconciling validating webhook: cannot patch validating webhook: etcdserver: request timed out", "errorVerbose": "cannot patch validating webhook: etcdserver: request timed out\nerror reconciling validating 
webhook\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:126\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:15.136Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot reconcile policy-server service: Internal error occurred: failed to allocate a serviceIP: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:15.149Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "allow-pod-privileged-psp-policy", "namespace": "", "error": "could not read policy server Deployment: etcdserver: leader changed", "errorVerbose": "etcdserver: leader changed\ncould not read policy server Deployment\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:113\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} 
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T17:12:16.748Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-28T21:12:26.522Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "error reconciling validating webhook: cannot reconcile validating webhook: etcdserver: request timed out", "errorVerbose": "cannot reconcile validating webhook: etcdserver: request timed out\nerror reconciling validating webhook\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:126\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T03:12:26.091Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: error reconciling policy-server deployment: etcdserver: request timed out, possibly due to previous leader failure"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T03:12:26.606Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} 
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T05:12:15.805Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot patch PolicyServer Configmap: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T05:12:18.040Z ERROR controller.policyserver Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "update policy server status error: Operation cannot be fulfilled on policyservers.policies.kubewarden.io \"default\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T09:12:12.699Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "could not read policy server Deployment: etcdserver: leader changed", "errorVerbose": "etcdserver: leader changed\ncould not read policy server Deployment\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.reconcilePolicy\n\t/workspace/controllers/policy_utils.go:113\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.startReconciling\n\t/workspace/controllers/policy_utils.go:74\ngit.luolix.top/kubewarden/kubewarden-controller/controllers.(*ClusterAdmissionPolicyReconciler).Reconcile\n\t/workspace/controllers/clusteradmissionpolicy_controller.go:63\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T11:12:12.614Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "cannot retrieve admission policy: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T13:12:12.603Z ERROR controller.clusteradmissionpolicy 
Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-loadbalancer", "namespace": "", "error": "update admission policy status error: etcdserver: leader changed"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227 2022-04-29T15:58:25.222Z ERROR controller.clusteradmissionpolicy Reconciler error {"reconciler group": "policies.kubewarden.io", "reconciler kind": "ClusterAdmissionPolicy", "name": "disallow-service-nodeport", "namespace": "", "error": "update admission policy status error: Operation cannot be fulfilled on clusteradmissionpolicies.policies.kubewarden.io \"disallow-service-nodeport\": the object has been modified; please apply your changes to the latest version and try again"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227

@floriankoch (Author)

Happened again with this error in the controller; maybe it's a controller problem?

2022-05-03T01:42:56.386Z        ERROR   controller.policyserver Reconciler error        {"reconciler group": "policies.kubewarden.io", "reconciler kind": "PolicyServer", "name": "default", "namespace": "", "error": "reconciliation error: cannot lookup Policy server ConfigMap: etcdserver: leader changed"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227

@floriankoch (Author)

Running with only disallow-service-loadbalancer works, no errors, but also not much "traffic".
When adding a pod (maybe mutating?) policy, it breaks after some time, with much higher "traffic".

@floriankoch (Author)

I think the root cause is located in the controller:

 State:          Running
      Started:      Thu, 05 May 2022 10:25:31 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 05 May 2022 10:24:46 +0200
      Finished:     Thu, 05 May 2022 10:25:16 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:        250m
      memory:     512Mi
    Liveness:     http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:    http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ct654 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
  kube-api-access-ct654:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  53m (x3 over 53m)  kubelet  Readiness probe failed: Get "http://10.244.31.32:8081/readyz": dial tcp 10.244.31.32:8081: connect: connection refused
  Warning  Unhealthy  53m                kubelet  Liveness probe failed: Get "http://10.244.31.32:8081/healthz": dial tcp 10.244.31.32:8081: connect: connection refused
  Warning  BackOff    53m                kubelet  Back-off restarting failed container
  Normal   Pulled     53m (x3 over 35h)  kubelet  Container image "redacted/kubewarden-controller:v0.5.2" already present on machine
  Normal   Created    53m (x3 over 35h)  kubelet  Created container manager
  Normal   Started    53m (x3 over 35h)  kubelet  Started container manager

@floriankoch (Author)

@flavio the error happens when the kubewarden-controller has readiness problems:
Warning Unhealthy 57m (x4 over 175m) kubelet Readiness probe failed: Get "http://10.244.31.32:8081/readyz": dial tcp 10.244.31.32:8081: connect: connection refused

So it's the controller, not the policy-server.


ereslibre commented May 5, 2022

I think we have to work on trying to reproduce this issue.

As an idea: is the machine under high load/pressure? I see several pointers in this direction:

  • The etcd leader changes. This is not necessarily bad, but if the machine is under pressure, the fact that the etcd leader changes very often might be an indicator. Also, errors about the etcd leader having changed during regular operation are "fine" in terms of the controller operation; the action should simply be retried when such an error happens.
  • There are readiness probe failures in policy-server
  • There are readiness probe failures in kubewarden-controller

All this together leads me to think that the machine might be under high load/pressure or have slow disks; in this situation it is common to see things acting in a non-optimal way and to find this kind of error in the logs. Could it be?

However, we should try to reproduce these problems.

@floriankoch (Author)

@ereslibre I will try my best so that you can reproduce this; in my environment I can reproduce it in about ~60 minutes.

The environment is an Azure AKS cluster. I have no insight into the control plane; I do not even see the instances.

@floriankoch (Author)

@ereslibre @flavio I updated to the latest versions and cannot reproduce this anymore, but maybe this is the real cause:
kubewarden/kubewarden-controller#238

I'm closing this one; the discussion is better suited to the controller bug.

flavio added a commit to flavio/policy-server that referenced this issue Jun 20, 2022
Prior to this commit, we used the low-level hyper crate to create our HTTP
server. The code being used was pretty complex due to the low-level
nature of this library.
Moreover, the code wasn't robust enough. In certain cases the HTTP server
could fail and cause the whole policy-server to drop incoming requests.

This kind of failure was hard to reproduce, but some users have run
into it.

I was able to reproduce it too with minikube. Malformed client TLS
requests could cause the server to reject them and then enter an error
state.

Instead of implementing all the possible workarounds for these kinds of
situations, the code now implements the HTTP and HTTPS server using the
warp crate.

Warp is built on top of hyper, but provides a ready-to-consume
high-level API.
Our core business is not implementing HTTP(S) servers, hence by using
this library we make our code more robust and improve its overall
quality.

As a matter of fact, thanks to this change, a lot of obscure and
repetitive code has been dropped.
Also, a lot of top-level dependencies have been removed, because they
are now pulled in via warp.

FIXES kubewarden#239

Signed-off-by: Flavio Castelli <fcastelli@suse.com>
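
For illustration only, here is a minimal sketch of the kind of warp-based HTTPS server with a /readiness endpoint that the commit above describes. This is not the actual policy-server code: the route, port, and certificate paths are assumptions for the example, and it needs the warp crate built with the "tls" feature plus the tokio runtime.

// Minimal sketch of a warp-based HTTPS server (illustrative only,
// not the real policy-server implementation).
use warp::Filter;

#[tokio::main]
async fn main() {
    // Readiness endpoint: answer 200 OK while the server is up; the kubelet
    // readiness probe shown earlier hits an endpoint like this on port 8443.
    let readiness = warp::path("readiness").map(|| warp::reply());

    // warp::serve builds routing, HTTP and TLS handling on top of hyper,
    // which is the robustness improvement the commit message relies on.
    warp::serve(readiness)
        .tls()
        .cert_path("/pki/policy-server-cert.pem") // hypothetical path
        .key_path("/pki/policy-server-key.pem")   // hypothetical path
        .run(([0, 0, 0, 0], 8443))
        .await;
}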

flavio commented Jun 20, 2022

I ran into the issue while testing rc2. I found a fix :)

flavio reopened this Jun 20, 2022
flavio added a commit to flavio/policy-server that referenced this issue Jun 20, 2022