
[bug] Crashing ingress and dispatcher pods when connections are disrupted in a 3 node cluster #764

Closed
gab-satchi opened this issue May 13, 2022 · 3 comments · Fixed by #778
Labels: kind/bug

Comments

@gab-satchi (Contributor)

I was performing some chaos engineering exercises after having reliability issues with the broker.

Of course this could be a missing configuration (replicated queues?) on the cluster side; however, 1.4.0 does not seem to recover if a 3-node RabbitMQ cluster is being rescheduled (maintenance or autoscaling + PodDisruptionBudget).

The dispatcher seems to crash and the broker-ingress never reconnects. The situation is only resolved by restarting the broker, at which point successful retry logic kicks in within the broker.

In the meantime I have a custom operator that scrapes the logs of dispatchers and brokers and restarts them when appropriate.

Just wondered if this was of interest.

Originally posted by @andrewwebber in #760 (comment)
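
A minimal Go sketch of what such a log-scraping watchdog could look like, assuming client-go, in-cluster config, the knative-eventing namespace, and a hypothetical label selector; the log line matched is the dispatcher's fatal message shown further below. This is a simplification, not the actual operator from the report:

package main

import (
	"context"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ns := "knative-eventing" // assumption: broker components run here
	tail := int64(50)
	for {
		// List the broker/dispatcher pods; this label selector is hypothetical.
		pods, err := cs.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
			LabelSelector: "eventing.knative.dev/brokerRole",
		})
		if err == nil {
			for _, p := range pods.Items {
				raw, err := cs.CoreV1().Pods(ns).
					GetLogs(p.Name, &corev1.PodLogOptions{TailLines: &tail}).
					DoRaw(context.TODO())
				if err != nil {
					continue
				}
				// Delete the pod (so it gets recreated) when it is wedged
				// on a dead RabbitMQ connection.
				if strings.Contains(string(raw), "Failed to connect to RabbitMQ") {
					_ = cs.CoreV1().Pods(ns).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}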

@gab-satchi (Contributor, Author)

/kind bug

@knative-prow bot added the kind/bug label on May 13, 2022
@andrewwebber commented May 13, 2022

I have not created a script but can try my best to outline the steps I take to reproduce the issue.
I assume the issue is closely related to the fact that my RabbitmqCluster does not have replicated queues. I make this assumption because in the RabbitMQ management console one can observe that the queues are evenly distributed across the nodes in the cluster (i.e. not replicated).

[Screenshot: the oyster-from-otter-dispatcher-queue queue in the RabbitMQ management console]

As shown above, this can result in delivery blockages.
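
One way to avoid a queue going offline with its node would be to declare it as a replicated quorum queue. A minimal amqp091-go sketch (the queue name and connection handling are illustrative, and this is not necessarily how the broker declares its queues):

package queues

import (
	amqp "github.com/rabbitmq/amqp091-go"
)

// declareQuorumQueue declares a queue with the "quorum" type so it is
// replicated across the RabbitMQ cluster nodes instead of living on one.
func declareQuorumQueue(conn *amqp.Connection, name string) error {
	ch, err := conn.Channel()
	if err != nil {
		return err
	}
	defer ch.Close()
	_, err = ch.QueueDeclare(
		name,  // e.g. "oyster-from-otter-dispatcher-queue" (illustrative)
		true,  // durable: quorum queues must be durable
		false, // autoDelete
		false, // exclusive
		false, // noWait
		amqp.Table{"x-queue-type": "quorum"}, // replicate across nodes
	)
	return err
}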

Overview

  1. Create a rabbitmq cluster with 3 nodes

  2. Deploy knative-eventing broker components

  3. Deploy a trigger

  4. Perform a rolling restart of the rabbitmq statefulset / cluster

    kubectl rollout restart statefulset -n knative-eventing rabbitmq-server
  5. Follow the logs of the broker components. You will observe messages similar to the following:

    • broker-ingress does not crash but never reconnects
    {"level":"warn","ts":"2022-05-13T17:54:19.398Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:87","msg":"recreating 
    RabbitMQ resources"}
    
    {"level":"warn","ts":"2022-05-13T19:58:32.417Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:79","msg":"Lost connection to RabbitMQ, reconnecting. Error: %v{error 26 0  Exception (320) Reason: \"CONNECTION_FORCED - Node was put into maintenance mode\"}"}
    • the trigger dispatcher actually falls into a CrashLoopBackOff
    {"level":"info","ts":"2022-05-13T19:58:45.091Z","logger":"rabbitmq-dispatcher","caller":"dispatcher/main.go:84","msg":"Setting BackoffDelay","backoffDelay":0.05}
    {"level":"fatal","ts":"2022-05-13T19:58:45.092Z","logger":"rabbitmq-dispatcher","caller":"dispatcher/main.go:133","msg":"Failed to connect to RabbitMQ: dial tcp 10.43.0.189:5672: connect: connection refused","stacktrace":"main.(*envConfig).setupRabbitMQ\n\tknative.dev/eventing-rabbitmq/cmd/dispatcher/main.go:133\nmain.main\n\tknative.dev/eventing-rabbitmq/cmd/dispatcher/main.go:86\nruntime.main\n\truntime/proc.go:255"}
    Last State:     Terminated
       Reason:       Error
       Exit Code:    1
       Started:      Fri, 13 May 2022 21:58:45 +0200
       Finished:     Fri, 13 May 2022 21:58:45 +0200
  6. Restart the broker-ingress to initiate a reconnect

    kubectl rollout restart deployment default-broker-ingress
  7. Observe that the broker-ingress, upon restart, performs connection retries until the connection to RabbitMQ has been successfully established (see the sketch after this list)

    {"level":"error","ts":"2022-05-13T20:05:10.378Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:47","msg":"failed to connect to RabbitMQ","error":"dial tcp 10.43.0.189:5672: connect: connection refused","stacktrace":"knative.dev/eventing-rabbitmq/pkg/rabbit.(*RabbitMQHelper).SetupRabbitMQ\n\tknative.dev/eventing-rabbitmq/pkg/rabbit/setup.go:47\nmain.(*envConfig).CreateRabbitMQConnections\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:203\nmain.main.func1\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:88"}
    {"level":"warn","ts":"2022-05-13T20:05:10.378Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:54","msg":"retry number 161"}
    {"level":"error","ts":"2022-05-13T20:05:11.378Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:90","msg":"error recreating RabbitMQ connections: dial tcp 10.43.0.189:5672: connect: connection refused, waiting for a retry","stacktrace":"main.main.func1\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:90"}
    {"level":"warn","ts":"2022-05-13T20:05:11.378Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:87","msg":"recreating RabbitMQ resources"}

Resources

  • Broker
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  annotations:
    eventing.knative.dev/broker.class: RabbitMQBroker
spec:
  config:
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    name: rabbitmq
    namespace: knative-eventing
  delivery:
    retry: 5
  • RabbitmqCluster
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: knative-eventing
  annotations:
    rabbitmq.com/topology-allowed-namespaces: "*"
spec:
  replicas: 3
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - rabbitmq
        topologyKey: kubernetes.io/hostname

@gabo1208 (Contributor)

Kk! 🤔 Definitely a scenario we haven't tried so I'll be taking a look next week, thanks @andrewwebber :)!
