
[bug] Crashing ingress and dispatcher pods when connections are disrupted in a 3 node cluster #764

Closed
gab-satchi opened this issue May 13, 2022 · 3 comments · Fixed by #778
Labels: kind/bug

Comments

@gab-satchi (Contributor)

I was performing some chaos engineering exercises after having reliability issues with the broker.

Of course this could be a missing configuration (replicated queues?) on the cluster side; however, 1.4.0 does not seem to recover if a 3-node RabbitMQ cluster is being rescheduled (maintenance or autoscaling + PodDisruptionBudget).

The dispatcher seems to crash and the broker-ingress never reconnects. The situation is only resolved by restarting the broker, at which point successful retry logic kicks in within the broker.

In the meantime I have a custom operator that scrapes the logs of dispatchers and brokers and restarts them when appropriate.

Just wondered if this was of interest.

Originally posted by @andrewwebber in #760 (comment)
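
A minimal Go sketch of what such a log-scraping watchdog could look like, assuming client-go, in-cluster config, the knative-eventing namespace, and a hypothetical label selector; the log line matched is the dispatcher's fatal message shown further below. This is a simplification, not the actual operator from the report:

package main

import (
	"context"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ns := "knative-eventing" // assumption: broker components run here
	tail := int64(50)
	for {
		// List the broker/dispatcher pods; this label selector is hypothetical.
		pods, err := cs.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
			LabelSelector: "eventing.knative.dev/brokerRole",
		})
		if err == nil {
			for _, p := range pods.Items {
				raw, err := cs.CoreV1().Pods(ns).
					GetLogs(p.Name, &corev1.PodLogOptions{TailLines: &tail}).
					DoRaw(context.TODO())
				if err != nil {
					continue
				}
				// Delete the pod (so it gets recreated) when it is wedged
				// on a dead RabbitMQ connection.
				if strings.Contains(string(raw), "Failed to connect to RabbitMQ") {
					_ = cs.CoreV1().Pods(ns).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}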

@gab-satchi (Contributor, Author)

/kind bug

@knative-prow bot added the kind/bug label on May 13, 2022
@andrewwebber commented May 13, 2022

I have not created a script but can try my best to outline the steps I take to reproduce the issue.
I assume the issue is closely related to the fact that my RabbitmqCluster does not have replicated queues. I make this assumption because in the RabbitMQ management console one can observe that the queues are evenly distributed across the nodes in the cluster (i.e. not replicated).

[Screenshot: the oyster-from-otter-dispatcher-queue queue in the RabbitMQ management console]

As shown above, this can result in delivery blockages.
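
One way to avoid a queue going offline with its node would be to declare it as a replicated quorum queue. A minimal amqp091-go sketch (the queue name and connection handling are illustrative, and this is not necessarily how the broker declares its queues):

package queues

import (
	amqp "github.com/rabbitmq/amqp091-go"
)

// declareQuorumQueue declares a queue with the "quorum" type so it is
// replicated across the RabbitMQ cluster nodes instead of living on one.
func declareQuorumQueue(conn *amqp.Connection, name string) error {
	ch, err := conn.Channel()
	if err != nil {
		return err
	}
	defer ch.Close()
	_, err = ch.QueueDeclare(
		name,  // e.g. "oyster-from-otter-dispatcher-queue" (illustrative)
		true,  // durable: quorum queues must be durable
		false, // autoDelete
		false, // exclusive
		false, // noWait
		amqp.Table{"x-queue-type": "quorum"}, // replicate across nodes
	)
	return err
}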

Overview

  1. Create a rabbitmq cluster with 3 nodes

  2. Deploy knative-eventing broker components

  3. Deploy a trigger

  4. Perform a rolling restart of the rabbitmq statefulset / cluster

    kubectl rollout restart statefulset -n knative-eventing rabbitmq-server
  5. Follow the logs of the broker components. You will observe messages similar to the following:

    • broker-ingress does not crash but never reconnects
    {"level":"warn","ts":"2022-05-13T17:54:19.398Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:87","msg":"recreating 
    RabbitMQ resources"}
    
    {"level":"warn","ts":"2022-05-13T19:58:32.417Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:79","msg":"Lost connection to RabbitMQ, reconnecting. Error: %v{error 26 0  Exception (320) Reason: \"CONNECTION_FORCED - Node was put into maintenance mode\"}"}
    • the trigger dispatcher actually falls into a CrashLoopBackOff
    {"level":"info","ts":"2022-05-13T19:58:45.091Z","logger":"rabbitmq-dispatcher","caller":"dispatcher/main.go:84","msg":"Setting BackoffDelay","backoffDelay":0.05}
    {"level":"fatal","ts":"2022-05-13T19:58:45.092Z","logger":"rabbitmq-dispatcher","caller":"dispatcher/main.go:133","msg":"Failed to connect to RabbitMQ: dial tcp 10.43.0.189:5672: connect: connection refused","stacktrace":"main.(*envConfig).setupRabbitMQ\n\tknative.dev/eventing-rabbitmq/cmd/dispatcher/main.go:133\nmain.main\n\tknative.dev/eventing-rabbitmq/cmd/dispatcher/main.go:86\nruntime.main\n\truntime/proc.go:255"}
    Last State:     Terminated
       Reason:       Error
       Exit Code:    1
       Started:      Fri, 13 May 2022 21:58:45 +0200
       Finished:     Fri, 13 May 2022 21:58:45 +0200
  6. Restart the broker-ingress to initiate a reconnect

    kubectl rollout restart deployment default-broker-ingress
  7. Observe that the broker-ingress, upon restart, performs connection retries until the connection to RabbitMQ has been successfully established (see the sketch after this list)

    {"level":"error","ts":"2022-05-13T20:05:10.378Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:47","msg":"failed to connect to RabbitMQ","error":"dial tcp 10.43.0.189:5672: connect: connection refused","stacktrace":"knative.dev/eventing-rabbitmq/pkg/rabbit.(*RabbitMQHelper).SetupRabbitMQ\n\tknative.dev/eventing-rabbitmq/pkg/rabbit/setup.go:47\nmain.(*envConfig).CreateRabbitMQConnections\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:203\nmain.main.func1\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:88"}
    {"level":"warn","ts":"2022-05-13T20:05:10.378Z","logger":"rabbitmq-ingress","caller":"rabbit/setup.go:54","msg":"retry number 161"}
    {"level":"error","ts":"2022-05-13T20:05:11.378Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:90","msg":"error recreating RabbitMQ connections: dial tcp 10.43.0.189:5672: connect: connection refused, waiting for a retry","stacktrace":"main.main.func1\n\tknative.dev/eventing-rabbitmq/cmd/ingress/main.go:90"}
    {"level":"warn","ts":"2022-05-13T20:05:11.378Z","logger":"rabbitmq-ingress","caller":"ingress/main.go:87","msg":"recreating RabbitMQ resources"}

Resources

  • Broker
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  annotations:
    eventing.knative.dev/broker.class: RabbitMQBroker
spec:
  config:
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    name: rabbitmq
    namespace: knative-eventing
  delivery:
    retry: 5
  • RabbitmqCluster
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: knative-eventing
  annotations:
    rabbitmq.com/topology-allowed-namespaces: "*"
spec:
  replicas: 3
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - rabbitmq
        topologyKey: kubernetes.io/hostname

@gabo1208 (Contributor)

Kk! 🤔 Definitely a scenario we haven't tried so I'll be taking a look next week, thanks @andrewwebber :)!
