
Coordinator is not created by druid operator #105

Closed
chn217 opened this issue Sep 8, 2023 · 25 comments

@chn217

chn217 commented Sep 8, 2023

We recently performed an upgrade of the Druid operator from version 1.0.0 to version 1.2.0, and during the process, we encountered an issue when attempting to create a new Druid cluster. It's worth noting that there were no changes made to the cluster manifest.

The specific problem we encountered was the absence of a coordinator created by the Druid operator. Upon inspecting the resource list, we noticed that there was no coordinator statefulset present. Strangely, there were no error messages recorded in the Druid operator log. This issue appears to be intermittent, as we have successfully used the Druid operator to create multiple clusters without encountering this problem, and it was only observed in one particular cluster.

Additionally, we observed that the Druid operator log does not seem to contain particularly useful information, and there is a lack of valuable info in the pod logs.

@AdheipSingh
Contributor

If the cluster was updated, you can check events from the operator: `kubectl describe druid -n namespace`. When performing an upgrade, an event is emitted for each node reconciled, whether it succeeded or failed.

Can you run `kubectl get druid -n namespace -o yaml` and check the status? It should show the coordinator deployment.
The operator won't remove or delete any sts; it only deletes PVCs for the statefulset. Do you see any issue on your sts controller? Also, in case the operator does not find any coordinator, it will re-create it on the next reconcile (if the desired state mentions a coordinator).
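The checks suggested here, collected into one copy-pasteable sketch (namespace and CR name are placeholders; these commands need a live cluster with the Druid CRD installed):

```shell
NS=<namespace>        # placeholder
CR=<druid-cr-name>    # placeholder

# Events emitted per reconciled node (success or failure):
kubectl describe druid "$CR" -n "$NS"

# Full status, including which StatefulSets the operator believes exist:
kubectl get druid "$CR" -n "$NS" -o yaml

# What actually exists in the namespace:
kubectl get statefulsets -n "$NS"
```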

@AdheipSingh
Contributor

IMHO we should not have any breaking change. Please confirm @itamar-marom @cyril-corbon.

@itamar-marom
Collaborator

Might defaults have changed?

@AdheipSingh
Contributor

@itamar-marom which defaults ?

@itamar-marom
Collaborator

#83

RollingUpdate defaulting to true

@itamar-marom
Collaborator

@chn217 how do you deploy a cluster? Is it possible that you're using Terraform?
When you create a Druid object, can you check the revision of the object?

@chn217
Author

chn217 commented Sep 8, 2023

> If the cluster was updated, you can check events from the operator: `kubectl describe druid -n namespace`. When performing an upgrade, an event is emitted for each node reconciled, whether it succeeded or failed.
>
> Can you run `kubectl get druid -n namespace -o yaml` and check the status? It should show the coordinator deployment. The operator won't remove or delete any sts; it only deletes PVCs for the statefulset. Do you see any issue on your sts controller? Also, in case the operator does not find any coordinator, it will re-create it on the next reconcile (if the desired state mentions a coordinator).

The status in the output of `kubectl get druid -n namespace -o yaml` doesn't show anything wrong. I've recreated the druid cluster (recreated the nodes and deleted/re-applied the cluster manifest). The druid operator pods were evicted during this; not sure if that could be the reason. Any idea why the druid operator doesn't output any useful logs?

The druid operator pod logs:

```
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, manager
W0908 02:26:40.299619 1 main.go:165]
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to brancz/kube-rbac-proxy#187

===============================================

I0908 02:26:40.299907 1 main.go:218] Valid token audiences:
I0908 02:26:40.300003 1 main.go:344] Generating self signed cert as no cert is provided
I0908 02:26:41.202299 1 main.go:394] Starting TCP socket on 0.0.0.0:8443
I0908 02:26:41.202550 1 main.go:401] Listening securely on 0.0.0.0:8443
```

Can the operator log the creation of resources?

@chn217
Author

chn217 commented Sep 8, 2023

> @chn217 how do you deploy a cluster? Is it possible that you're using Terraform? When you create a Druid object, can you check the revision of the object?

The cluster is deployed to AWS EKS, and the infrastructure code has been automated via AWS CDK. IMHO, I don't think Terraform or CDK could be the reason; behind the scenes, kubectl is used to deploy the cluster manifest.

As I mentioned, our code has been working for several months. There are no code changes other than the druid operator upgrade.

@chn217
Author

chn217 commented Sep 8, 2023


> #83
>
> RollingUpdate as true

I'm using StatefulSet type for all druid components (router/broker/coordinator/overlord/historical/middleManager).

@AdheipSingh
Contributor

@chn217 you are checking logs for the sidecar proxy that runs alongside the operator; please check the logs of the druid operator container:
`kubectl logs -f <pod> -c druid-operator` (container name).

Is `-o yaml` showing an empty status? Please also check the events in `kubectl describe druid <druidCR>`.

The operator emits event logs. I am sure there is some log printed out. Make sure you check the right container.
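Once you are reading the right container, a quick way to separate reconcile errors from the event noise (a sketch; in practice `operator.log` would be a saved copy of `kubectl logs <pod> -c manager`, here a small inline sample stands in for it):

```shell
# Stand-in for: kubectl logs <operator-pod> -c manager > operator.log
cat > operator.log <<'EOF'
2023-09-25T10:49:18Z INFO setup starting manager
2023-09-25T10:56:30Z ERROR Reconciler error {"error": "StatefulSet.apps \"druid-dev238-brokers\" not found"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.StatefulSet]
EOF

# Keep only ERROR-level reconcile failures.
grep 'Reconciler error' operator.log
```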

@chn217
Author

chn217 commented Sep 8, 2023

Hi @AdheipSingh, sorry, I didn't realise that there is a sidecar container for the operator now.

  • kubectl describe druid

```
Status:
  Config Maps:
    druid-eks-brokers-config
    druid-eks-coordinators-config
    druid-eks-historicals-config
    druid-eks-middlemanagers-config
    druid-eks-overlords-config
    druid-eks-routers-config
    eks-druid-common-config
  Druid Node Status:
    Druid Node:                   All
    Druid Node Condition Status:  True
    Druid Node Condition Type:    DruidClusterReady
    Reason:                       All Druid Nodes are in Ready Condition
  Ingress:
    druid-eks-routers
  Pod Disruption Budgets:
    druid-eks-middlemanagers
  Pods:
    druid-eks-brokers-0
    druid-eks-coordinators-0
    druid-eks-historicals-0
    druid-eks-historicals-1
    druid-eks-historicals-2
    druid-eks-middlemanagers-0
    druid-eks-middlemanagers-1
    druid-eks-middlemanagers-2
    druid-eks-overlords-0
    druid-eks-routers-0
  Services:
    druid-eks-brokers
    druid-eks-coordinators
    druid-eks-historicals
    druid-eks-middlemanagers
    druid-eks-overlords
    druid-eks-routers
  Stateful Sets:
    druid-eks-brokers
    druid-eks-coordinators
    druid-eks-historicals
    druid-eks-middlemanagers
    druid-eks-overlords
    druid-eks-routers
Events: <none>
```

  • operator logs (note: the worker node has been replaced)

```
caof@b0be835a5f1a:~/workplace/Apjsb-druid-swift2/source$ kubectl logs druid-operator-5c998c4c46-s7tff -c manager | grep ERROR
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "40a52702-805c-4dfb-8d1b-e884daf1c227", "error": "StatefulSet.apps \"druid-eks-historicals\" not found"}
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "01ec8e90-1a90-4ffe-baae-43ad1aa307fe", "error": "StatefulSet.apps \"druid-eks-overlords\" not found"}
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "ee9a4940-9f12-4765-8e2c-3c37b48c0157", "error": "StatefulSet.apps \"druid-eks-brokers\" not found"}
```

@AdheipSingh
Contributor

@chn217 statefulset not found: you'll need to audit who deleted the statefulset.
Can you confirm whether you increased the storage configuration (volumeClaimTemplates) of your PVCs at any point during the upgrade?

The operator performs a non-cascading deletion of statefulsets when expanding a druid cluster vertically on storage. Even in that case, logs are emitted for each action.
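For reference, a non-cascading (orphan) deletion removes only the StatefulSet object and leaves its pods running, so the object can be recreated with larger volumeClaimTemplates and re-adopt the pods. A sketch of the equivalent manual operation (the sts and file names are illustrative):

```shell
# Delete only the StatefulSet object; its pods are orphaned, not killed.
kubectl delete statefulset druid-eks-historicals --cascade=orphan

# Recreating the StatefulSet (e.g. with bigger volumeClaimTemplates)
# re-adopts the still-running pods via the label selector.
kubectl apply -f druid-eks-historicals-sts.yaml
```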

@chn217
Author

chn217 commented Sep 11, 2023

@AdheipSingh The issue happened for one of our new deployments (new k8s cluster + new druid cluster). As we were unable to recover it, we went ahead with recreating the node group and redeploying the druid cluster manifest. The coordinator statefulset appeared after that.

We haven't increased the storage configuration.

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh I've come across this issue in a new cluster again:

  • kubectl describe druid

```
Events:
  Type     Reason                      Age                From            Message
  ----     ------                      ----               ----            -------
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [dev238-druid-common-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorUpdateSuccess  13m                druid-operator  Updated [dev238:*v1alpha1.Druid].
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals:*v1.StatefulSet] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords:*v1.StatefulSet] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-middlemanagers-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-middlemanagers:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m (x5 over 13m)  druid-operator  (combined from similar events): Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default]
  Warning  DruidOperatorGetFail        13m                druid-operator  Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found]
```

  • druid operator logs

```
caof@b0be835a5f1a:~/workplace/Apjsb-druid-swift2/source$ kubectl logs druid-operator-5c998c4c46-xxh79 -c manager
2023-09-25T10:49:18Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2023-09-25T10:49:18Z INFO setup starting manager
2023-09-25T10:49:18Z INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
2023-09-25T10:49:18Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0925 10:49:18.782710 1 leaderelection.go:248] attempting to acquire leader lease default/e6946145.apache.org...
I0925 10:49:34.491794 1 leaderelection.go:258] successfully acquired lease default/e6946145.apache.org
2023-09-25T10:49:34Z INFO Starting EventSource {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "source": "kind source: *v1alpha1.Druid"}
2023-09-25T10:49:34Z INFO Starting Controller {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid"}
2023-09-25T10:49:34Z DEBUG events druid-operator-5c998c4c46-xxh79_5a3b57ba-6a96-455e-bdca-2634d57b46b8 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"e6946145.apache.org","uid":"a881a3a4-c2a6-4d18-85f3-dff55e872417","apiVersion":"coordination.k8s.io/v1","resourceVersion":"172731"}, "reason": "LeaderElection"}
2023-09-25T10:49:34Z INFO Starting workers {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "worker count": 1}
2023-09-25T10:56:29Z DEBUG events Successfully created object [dev238-druid-common-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175649"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.nodes.historicals.volumeClaimTemplates[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.nodes.middlemanagers.volumeClaimTemplates[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.services[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z DEBUG events Updated [dev238:*v1alpha1.Druid]. {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorUpdateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.PodDisruptionBudget] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"dev238","namespace":"default"}, "namespace": "default", "name": "dev238", "reconcileID": "30861c9a-9d41-41b9-8d83-0f14d837ab28", "error": "StatefulSet.apps "druid-dev238-brokers" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found] {"type": "Warning", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorGetFail"}
```

  • kubectl get pod

```
NAME                              READY   STATUS             RESTARTS        AGE
druid-dev238-brokers-0            0/1     Running            5 (67s ago)     18m
druid-dev238-historicals-0        0/1     Running            9 (17s ago)     18m
druid-dev238-historicals-1        0/1     Running            9 (16s ago)     18m
druid-dev238-historicals-2        0/1     Running            8 (5m31s ago)   18m
druid-dev238-middlemanagers-0     0/1     CrashLoopBackOff   7 (91s ago)     18m
druid-dev238-middlemanagers-1     0/1     CrashLoopBackOff   7 (87s ago)     18m
druid-dev238-middlemanagers-2     0/1     CrashLoopBackOff   7 (86s ago)     18m
druid-dev238-overlords-0          0/1     CrashLoopBackOff   7 (97s ago)     18m
druid-operator-5c998c4c46-fdb5d   2/2     Running            0               37m
druid-operator-5c998c4c46-rwvdm   2/2     Running            0               37m
druid-operator-5c998c4c46-xxh79   2/2     Running            0               37m
external-dns-769d98f985-trtwb     1/1     Running            0               37m
zookeeper-0                       1/1     Running            0               47m
zookeeper-1                       1/1     Running            0               47m
zookeeper-2                       1/1     Running            0               47m
```

Any idea?

@AdheipSingh
Contributor

@chn217 how come the broker got deleted? There is no log on the operator side. Do you have an audit log? BTW, is this a managed k8s offering?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh I don't really think the broker got deleted.
Based on the events of `kubectl describe druid`:

```
Normal   DruidOperatorCreateSuccess  13m (x5 over 13m)  druid-operator  (combined from similar events): Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default]
Warning  DruidOperatorGetFail        13m                druid-operator  Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found]
```

It looks like a race condition: the first event says that the broker sts was created, but the next event suggests it couldn't be found. `kubectl get pod` also confirmed that the broker sts exists.
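The get-after-create symptom is consistent with the operator reading through a stale cache: the create goes to the API server, but an immediately following get is served from a local cache that has not observed the new object yet. A toy Python simulation of that suspicion (illustrative only, not the operator's actual code; a lagging informer-style cache is the assumed mechanism):

```python
class NotFoundError(Exception):
    """Stands in for Kubernetes' "not found" API error."""


class ApiServer:
    """Authoritative store: writes are visible here immediately."""
    def __init__(self):
        self.objects = {}

    def create(self, name):
        self.objects[name] = {"name": name}


class LaggingCache:
    """Read path: sees objects only after an explicit resync,
    mimicking a cache that lags behind the API server."""
    def __init__(self, server):
        self.server = server
        self.objects = {}

    def get(self, name):
        if name not in self.objects:
            raise NotFoundError(f'StatefulSet.apps "{name}" not found')
        return self.objects[name]

    def resync(self):
        self.objects = dict(self.server.objects)


server = ApiServer()
cache = LaggingCache(server)

server.create("druid-dev238-brokers")   # create succeeds on the API server
try:
    cache.get("druid-dev238-brokers")   # immediate read misses: the race
except NotFoundError as err:
    print(err)

cache.resync()                          # a later reconcile observes the object
print(cache.get("druid-dev238-brokers")["name"])
```

Under this reading the error is transient, which matches the thread: the StatefulSet exists by the time `kubectl get pod` is run.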

The druid operator log was captured in my previous comment. Here was the error message:

```
2023-09-25T10:56:30Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"dev238","namespace":"default"}, "namespace": "default", "name": "dev238", "reconcileID": "30861c9a-9d41-41b9-8d83-0f14d837ab28", "error": "StatefulSet.apps "druid-dev238-brokers" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
```

This is Amazon EKS. Is there any specific command I can use to show the audit log? Thanks
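For EKS specifically, API-server audit logs are a control-plane logging option that lands in CloudWatch Logs once enabled. A sketch of how they could be searched for StatefulSet deletions (cluster name is a placeholder; the filter pattern assumes the standard Kubernetes audit event shape):

```shell
# Enable control-plane audit logging, if not already on:
aws eks update-cluster-config --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["audit"],"enabled":true}]}'

# Search the audit stream for delete calls against StatefulSets:
aws logs filter-log-events \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --filter-pattern '{ $.verb = "delete" && $.objectRef.resource = "statefulsets" }'
```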

@AdheipSingh
Contributor

AdheipSingh commented Sep 25, 2023

Ah ok, so just to confirm: you did not see any broker deletion by the operator?

I agree, race conditions can exist. The operator acts like a state machine (observed state), and because of the abstractions it deals with, the overall system is eventually consistent.

@chn217
Author

chn217 commented Sep 25, 2023

Thanks. I didn't see any broker sts get deleted. The problem here is that the coordinator was never created (not reconciled, unfortunately). All the pods stay in CrashLoopBackOff status, as Druid pods need to talk to the Coordinator for the /status/health APIs to work (readiness probes).
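For context, the readiness check being described is typically configured on a Druid node roughly like this (an illustrative fragment, not the reporter's manifest; port and timings are example values):

```yaml
readinessProbe:
  httpGet:
    path: /status/health
    port: 8088        # example port; varies per Druid node type
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 10
```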

@AdheipSingh
Contributor

@chn217 did the coordinator issue reoccur?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh The symptom looks exactly the same; the coordinator doesn't show up in the pod/sts list.

```
kubectl get statefulsets.apps
NAME                          READY   AGE
druid-dev238-brokers          0/1     101m
druid-dev238-historicals      0/3     101m
druid-dev238-middlemanagers   0/3     101m
druid-dev238-overlords        0/1     101m
zookeeper                     3/3     12h

kubectl get pod
NAME                              READY   STATUS             RESTARTS         AGE
druid-dev238-brokers-0            0/1     CrashLoopBackOff   19 (4m50s ago)   102m
druid-dev238-historicals-0        0/1     CrashLoopBackOff   33 (100s ago)    102m
druid-dev238-historicals-1        0/1     CrashLoopBackOff   33 (89s ago)     102m
druid-dev238-historicals-2        0/1     Running            32 (5m44s ago)   102m
druid-dev238-middlemanagers-0     0/1     CrashLoopBackOff   27 (114s ago)    102m
druid-dev238-middlemanagers-1     0/1     CrashLoopBackOff   27 (110s ago)    102m
druid-dev238-middlemanagers-2     0/1     CrashLoopBackOff   27 (89s ago)     102m
druid-dev238-overlords-0          0/1     CrashLoopBackOff   27 (100s ago)    102m
druid-operator-5c998c4c46-fdb5d   2/2     Running            0                121m
druid-operator-5c998c4c46-rwvdm   2/2     Running            0                121m
druid-operator-5c998c4c46-xxh79   2/2     Running            0                121m
external-dns-769d98f985-trtwb     1/1     Running            0                121m
zookeeper-0                       1/1     Running            0                131m
zookeeper-1                       1/1     Running            0                131m
zookeeper-2                       1/1     Running            0                131m
```

@chn217
Author

chn217 commented Sep 25, 2023

Also, the router sts is not showing up. Not sure whether changing from sts to Deployment for the query/master nodes would help?

@AdheipSingh
Contributor

did this occur when you did an upgrade ?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh No, it happened for a new cluster. Here are the steps that I took to create a new cluster:

  • Create EKS cluster
  • Create Node group
  • Run Druid operator
  • Load the cluster manifest

This issue seems to occur starting from v1.2.0; previously we were on v1.0.0, where we never saw it. BTW, this issue is intermittent (another symptom of a race condition).
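The steps above, sketched as commands (tool, chart, and file names are illustrative; the reporter's actual pipeline is AWS CDK with kubectl under the hood):

```shell
# 1-2. Create the EKS cluster and node group (illustrative eksctl call)
eksctl create cluster --name druid-test --nodes 3

# 3. Run the druid operator (illustrative install; chart/repo names assumed)
helm install druid-operator druid-operator/druid-operator

# 4. Load the cluster manifest
kubectl apply -f druid-cluster.yaml
```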

@AdheipSingh
Contributor

@chn217 the operator will log if it deletes any node.

@AdheipSingh
Contributor

Feel free to re-open and provide sufficient logs showing that the operator deleted the node. You can find them in the operator logs and in the operator events when describing the current CR.
