
Coordinator is not created by druid operator #105

Closed
chn217 opened this issue Sep 8, 2023 · 25 comments

@chn217

chn217 commented Sep 8, 2023

We recently performed an upgrade of the Druid operator from version 1.0.0 to version 1.2.0, and during the process, we encountered an issue when attempting to create a new Druid cluster. It's worth noting that there were no changes made to the cluster manifest.

The specific problem we encountered was the absence of a coordinator created by the Druid operator. Upon inspecting the resource list, we noticed that there was no coordinator statefulset present. Strangely, there were no error messages recorded in the Druid operator log. This issue appears to be intermittent, as we have successfully used the Druid operator to create multiple clusters without encountering this problem, and it was only observed in one particular cluster.

Additionally, we observed that the Druid operator log does not seem to contain particularly useful information, and there is a lack of valuable info in the pod logs.

@AdheipSingh
Contributor

If the cluster was updated, you can check events from the operator: `kubectl describe druid -n namespace`. When performing an upgrade, an event is emitted for each node reconciled, whether it succeeded or failed.

Can you run `kubectl get druid -n namespace -o yaml` and check the status? It should show the coordinator deployment.
The operator won't remove or delete any sts; it only deletes PVCs for the statefulset. Do you see any issue on your sts controller? Also, in case the operator does not find any coordinator, it will re-create it on the next reconcile (if the desired state mentions a coordinator).
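The checks suggested here, collected into one copy-pasteable sketch (namespace and CR name are placeholders; these commands need a live cluster with the Druid CRD installed):

```shell
NS=<namespace>        # placeholder
CR=<druid-cr-name>    # placeholder

# Events emitted per reconciled node (success or failure):
kubectl describe druid "$CR" -n "$NS"

# Full status, including which StatefulSets the operator believes exist:
kubectl get druid "$CR" -n "$NS" -o yaml

# What actually exists in the namespace:
kubectl get statefulsets -n "$NS"
```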

@AdheipSingh
Contributor

IMHO we should not have any breaking change. Please confirm @itamar-marom @cyril-corbon.

@itamar-marom
Collaborator

Might defaults have changed?

@AdheipSingh
Contributor

@itamar-marom which defaults ?

@itamar-marom
Collaborator

#83

RollingUpdate defaulting to true

@itamar-marom
Collaborator

@chn217 how do you deploy a cluster? Is it possible that you're using Terraform?
When you create a Druid object, can you check the revision of the object?

@chn217
Author

chn217 commented Sep 8, 2023

> If the cluster was updated, you can check events from the operator: `kubectl describe druid -n namespace`. When performing an upgrade, an event is emitted for each node reconciled, whether it succeeded or failed.
>
> Can you run `kubectl get druid -n namespace -o yaml` and check the status? It should show the coordinator deployment. The operator won't remove or delete any sts; it only deletes PVCs for the statefulset. Do you see any issue on your sts controller? Also, in case the operator does not find any coordinator, it will re-create it on the next reconcile (if the desired state mentions a coordinator).

The status in the output of `kubectl get druid -n namespace -o yaml` doesn't show anything wrong. I've recreated the druid cluster (recreated the nodes and deleted/re-applied the cluster manifest). The druid operator pods were evicted during this; not sure if that could be the reason. Any idea why the druid operator doesn't output any useful logs?

The druid operator pod logs:

```
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, manager
W0908 02:26:40.299619 1 main.go:165]
==== Deprecation Warning ======================

Insecure listen address will be removed.
Using --insecure-listen-address won't be possible!

The ability to run kube-rbac-proxy without TLS certificates will be removed.
Not using --tls-cert-file and --tls-private-key-file won't be possible!

For more information, please go to brancz/kube-rbac-proxy#187

===============================================

I0908 02:26:40.299907 1 main.go:218] Valid token audiences:
I0908 02:26:40.300003 1 main.go:344] Generating self signed cert as no cert is provided
I0908 02:26:41.202299 1 main.go:394] Starting TCP socket on 0.0.0.0:8443
I0908 02:26:41.202550 1 main.go:401] Listening securely on 0.0.0.0:8443
```

Can the operator log the creation of resources?

@chn217
Author

chn217 commented Sep 8, 2023

> @chn217 how do you deploy a cluster? Is it possible that you're using Terraform? When you create a Druid object, can you check the revision of the object?

The cluster is deployed to AWS EKS, and the infrastructure code has been automated via AWS CDK. IMHO, I don't think Terraform or CDK could be the reason; behind the scenes, kubectl is used to deploy the cluster manifest.

As I mentioned, our code has been working for several months. There are no code changes other than the druid operator upgrade.

@chn217
Author

chn217 commented Sep 8, 2023


> #83
>
> RollingUpdate as true

I'm using StatefulSet type for all druid components (router/broker/coordinator/overlord/historical/middleManager).

@AdheipSingh
Contributor

@chn217 you are checking logs for the sidecar proxy that runs alongside the operator; please check the logs of the druid operator container:
`kubectl logs -f <pod> -c druid-operator` (container name).

Is `-o yaml` showing an empty status? Please also check the events in `kubectl describe druid <druidCR>`.

The operator emits event logs. I am sure there is some log printed out. Make sure you check the right container.
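Once you are reading the right container, a quick way to separate reconcile errors from the event noise (a sketch; in practice `operator.log` would be a saved copy of `kubectl logs <pod> -c manager`, here a small inline sample stands in for it):

```shell
# Stand-in for: kubectl logs <operator-pod> -c manager > operator.log
cat > operator.log <<'EOF'
2023-09-25T10:49:18Z INFO setup starting manager
2023-09-25T10:56:30Z ERROR Reconciler error {"error": "StatefulSet.apps \"druid-dev238-brokers\" not found"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.StatefulSet]
EOF

# Keep only ERROR-level reconcile failures.
grep 'Reconciler error' operator.log
```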

@chn217
Author

chn217 commented Sep 8, 2023

Hi @AdheipSingh, sorry, I didn't realise that there is a sidecar container for the operator now.

  • kubectl describe druid

```
Status:
  Config Maps:
    druid-eks-brokers-config
    druid-eks-coordinators-config
    druid-eks-historicals-config
    druid-eks-middlemanagers-config
    druid-eks-overlords-config
    druid-eks-routers-config
    eks-druid-common-config
  Druid Node Status:
    Druid Node:                   All
    Druid Node Condition Status:  True
    Druid Node Condition Type:    DruidClusterReady
    Reason:                       All Druid Nodes are in Ready Condition
  Ingress:
    druid-eks-routers
  Pod Disruption Budgets:
    druid-eks-middlemanagers
  Pods:
    druid-eks-brokers-0
    druid-eks-coordinators-0
    druid-eks-historicals-0
    druid-eks-historicals-1
    druid-eks-historicals-2
    druid-eks-middlemanagers-0
    druid-eks-middlemanagers-1
    druid-eks-middlemanagers-2
    druid-eks-overlords-0
    druid-eks-routers-0
  Services:
    druid-eks-brokers
    druid-eks-coordinators
    druid-eks-historicals
    druid-eks-middlemanagers
    druid-eks-overlords
    druid-eks-routers
  Stateful Sets:
    druid-eks-brokers
    druid-eks-coordinators
    druid-eks-historicals
    druid-eks-middlemanagers
    druid-eks-overlords
    druid-eks-routers
Events: <none>
```

  • operator logs (note: the worker node has been replaced)

```
caof@b0be835a5f1a:~/workplace/Apjsb-druid-swift2/source$ kubectl logs druid-operator-5c998c4c46-s7tff -c manager | grep ERROR
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "40a52702-805c-4dfb-8d1b-e884daf1c227", "error": "StatefulSet.apps \"druid-eks-historicals\" not found"}
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "01ec8e90-1a90-4ffe-baae-43ad1aa307fe", "error": "StatefulSet.apps \"druid-eks-overlords\" not found"}
2023-09-08T02:27:10Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"eks","namespace":"default"}, "namespace": "default", "name": "eks", "reconcileID": "ee9a4940-9f12-4765-8e2c-3c37b48c0157", "error": "StatefulSet.apps \"druid-eks-brokers\" not found"}
```

@AdheipSingh
Contributor

@chn217 statefulset not found: you'll need to audit who deleted the statefulset.
Can you confirm whether you increased the storage configuration (volumeClaimTemplates) of your PVCs at any point during the upgrade?

The operator performs a non-cascading deletion of statefulsets when expanding a druid cluster vertically on storage. Even in that case, logs are emitted for each action.
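For reference, a non-cascading (orphan) deletion removes only the StatefulSet object and leaves its pods running, so the object can be recreated with larger volumeClaimTemplates and re-adopt the pods. A sketch of the equivalent manual operation (the sts and file names are illustrative):

```shell
# Delete only the StatefulSet object; its pods are orphaned, not killed.
kubectl delete statefulset druid-eks-historicals --cascade=orphan

# Recreating the StatefulSet (e.g. with bigger volumeClaimTemplates)
# re-adopts the still-running pods via the label selector.
kubectl apply -f druid-eks-historicals-sts.yaml
```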

@chn217
Author

chn217 commented Sep 11, 2023

@AdheipSingh The issue happened for one of our new deployments (new k8s cluster + new druid cluster). As we were unable to recover it, we went ahead with recreating the node group and redeploying the druid cluster manifest. The coordinator statefulset appeared after that.

We haven't increased the storage configuration.

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh I've come across this issue in a new cluster again:

  • kubectl describe druid

```
Events:
  Type     Reason                      Age                From            Message
  ----     ------                      ----               ----            -------
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [dev238-druid-common-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorUpdateSuccess  13m                druid-operator  Updated [dev238:*v1alpha1.Druid].
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-historicals:*v1.StatefulSet] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-overlords:*v1.StatefulSet] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-middlemanagers-config:*v1.ConfigMap] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m                druid-operator  Successfully created object [druid-dev238-middlemanagers:*v1.Service] in namespace [default]
  Normal   DruidOperatorCreateSuccess  13m (x5 over 13m)  druid-operator  (combined from similar events): Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default]
  Warning  DruidOperatorGetFail        13m                druid-operator  Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found]
```

  • druid operator logs

```
caof@b0be835a5f1a:~/workplace/Apjsb-druid-swift2/source$ kubectl logs druid-operator-5c998c4c46-xxh79 -c manager
2023-09-25T10:49:18Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2023-09-25T10:49:18Z INFO setup starting manager
2023-09-25T10:49:18Z INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
2023-09-25T10:49:18Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0925 10:49:18.782710 1 leaderelection.go:248] attempting to acquire leader lease default/e6946145.apache.org...
I0925 10:49:34.491794 1 leaderelection.go:258] successfully acquired lease default/e6946145.apache.org
2023-09-25T10:49:34Z INFO Starting EventSource {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "source": "kind source: *v1alpha1.Druid"}
2023-09-25T10:49:34Z INFO Starting Controller {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid"}
2023-09-25T10:49:34Z DEBUG events druid-operator-5c998c4c46-xxh79_5a3b57ba-6a96-455e-bdca-2634d57b46b8 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"e6946145.apache.org","uid":"a881a3a4-c2a6-4d18-85f3-dff55e872417","apiVersion":"coordination.k8s.io/v1","resourceVersion":"172731"}, "reason": "LeaderElection"}
2023-09-25T10:49:34Z INFO Starting workers {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "worker count": 1}
2023-09-25T10:56:29Z DEBUG events Successfully created object [dev238-druid-common-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175649"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.nodes.historicals.volumeClaimTemplates[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.nodes.middlemanagers.volumeClaimTemplates[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z INFO KubeAPIWarningLogger unknown field "spec.services[0].metadata.creationTimestamp"
2023-09-25T10:56:29Z DEBUG events Updated [dev238:*v1alpha1.Druid]. {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorUpdateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-historicals:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-overlords:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:29Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-middlemanagers:*v1.PodDisruptionBudget] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers-config:*v1.ConfigMap] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.Service] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"dev238","namespace":"default"}, "namespace": "default", "name": "dev238", "reconcileID": "30861c9a-9d41-41b9-8d83-0f14d837ab28", "error": "StatefulSet.apps "druid-dev238-brokers" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2023-09-25T10:56:30Z DEBUG events Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default] {"type": "Normal", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorCreateSuccess"}
2023-09-25T10:56:30Z DEBUG events Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found] {"type": "Warning", "object": {"kind":"Druid","namespace":"default","name":"dev238","uid":"e03832b9-18c3-4505-b03e-2bbef52fa4cb","apiVersion":"druid.apache.org/v1alpha1","resourceVersion":"175652"}, "reason": "DruidOperatorGetFail"}
```

  • kubectl get pod

```
NAME                              READY   STATUS             RESTARTS        AGE
druid-dev238-brokers-0            0/1     Running            5 (67s ago)     18m
druid-dev238-historicals-0        0/1     Running            9 (17s ago)     18m
druid-dev238-historicals-1        0/1     Running            9 (16s ago)     18m
druid-dev238-historicals-2        0/1     Running            8 (5m31s ago)   18m
druid-dev238-middlemanagers-0     0/1     CrashLoopBackOff   7 (91s ago)     18m
druid-dev238-middlemanagers-1     0/1     CrashLoopBackOff   7 (87s ago)     18m
druid-dev238-middlemanagers-2     0/1     CrashLoopBackOff   7 (86s ago)     18m
druid-dev238-overlords-0          0/1     CrashLoopBackOff   7 (97s ago)     18m
druid-operator-5c998c4c46-fdb5d   2/2     Running            0               37m
druid-operator-5c998c4c46-rwvdm   2/2     Running            0               37m
druid-operator-5c998c4c46-xxh79   2/2     Running            0               37m
external-dns-769d98f985-trtwb     1/1     Running            0               37m
zookeeper-0                       1/1     Running            0               47m
zookeeper-1                       1/1     Running            0               47m
zookeeper-2                       1/1     Running            0               47m
```

Any idea?

@AdheipSingh
Contributor

@chn217 how come the broker got deleted? There is no log on the operator side. Do you have an audit log? BTW, is this a managed k8s offering?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh I don't really think the broker got deleted.
Based on the events of `kubectl describe druid`:

```
Normal   DruidOperatorCreateSuccess  13m (x5 over 13m)  druid-operator  (combined from similar events): Successfully created object [druid-dev238-brokers:*v1.StatefulSet] in namespace [default]
Warning  DruidOperatorGetFail        13m                druid-operator  Failed to get [Object:] due to [StatefulSet.apps "druid-dev238-brokers" not found]
```

It looks like a race condition: the first event says that the broker sts was created, but the next event suggests it couldn't be found. `kubectl get pod` also confirmed that the broker sts exists.
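The get-after-create symptom is consistent with the operator reading through a stale cache: the create goes to the API server, but an immediately following get is served from a local cache that has not observed the new object yet. A toy Python simulation of that suspicion (illustrative only, not the operator's actual code; a lagging informer-style cache is the assumed mechanism):

```python
class NotFoundError(Exception):
    """Stands in for Kubernetes' "not found" API error."""


class ApiServer:
    """Authoritative store: writes are visible here immediately."""
    def __init__(self):
        self.objects = {}

    def create(self, name):
        self.objects[name] = {"name": name}


class LaggingCache:
    """Read path: sees objects only after an explicit resync,
    mimicking a cache that lags behind the API server."""
    def __init__(self, server):
        self.server = server
        self.objects = {}

    def get(self, name):
        if name not in self.objects:
            raise NotFoundError(f'StatefulSet.apps "{name}" not found')
        return self.objects[name]

    def resync(self):
        self.objects = dict(self.server.objects)


server = ApiServer()
cache = LaggingCache(server)

server.create("druid-dev238-brokers")   # create succeeds on the API server
try:
    cache.get("druid-dev238-brokers")   # immediate read misses: the race
except NotFoundError as err:
    print(err)

cache.resync()                          # a later reconcile observes the object
print(cache.get("druid-dev238-brokers")["name"])
```

Under this reading the error is transient, which matches the thread: the StatefulSet exists by the time `kubectl get pod` is run.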

The druid operator log was captured in my previous comment. Here was the error message:

```
2023-09-25T10:56:30Z ERROR Reconciler error {"controller": "druid", "controllerGroup": "druid.apache.org", "controllerKind": "Druid", "Druid": {"name":"dev238","namespace":"default"}, "namespace": "default", "name": "dev238", "reconcileID": "30861c9a-9d41-41b9-8d83-0f14d837ab28", "error": "StatefulSet.apps "druid-dev238-brokers" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
```

This is Amazon EKS. Is there any specific command I can use to show the audit log? Thanks
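For EKS specifically, API-server audit logs are a control-plane logging option that lands in CloudWatch Logs once enabled. A sketch of how they could be searched for StatefulSet deletions (cluster name is a placeholder; the filter pattern assumes the standard Kubernetes audit event shape):

```shell
# Enable control-plane audit logging, if not already on:
aws eks update-cluster-config --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["audit"],"enabled":true}]}'

# Search the audit stream for delete calls against StatefulSets:
aws logs filter-log-events \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --filter-pattern '{ $.verb = "delete" && $.objectRef.resource = "statefulsets" }'
```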

@AdheipSingh
Contributor

AdheipSingh commented Sep 25, 2023

Ah ok, so just to confirm: you did not see any broker deletion by the operator?

I agree, race conditions can exist. The operator acts like a state machine (observed state), and because of the abstractions it deals with, the overall system is eventually consistent.

@chn217
Author

chn217 commented Sep 25, 2023

Thanks. I didn't see any broker sts get deleted. The problem here is that the coordinator was never created (not reconciled, unfortunately). All the pods stay in CrashLoopBackOff status, as Druid pods need to talk to the Coordinator for the /status/health APIs to work (readiness probes).
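For context, the readiness check being described is typically configured on a Druid node roughly like this (an illustrative fragment, not the reporter's manifest; port and timings are example values):

```yaml
readinessProbe:
  httpGet:
    path: /status/health
    port: 8088        # example port; varies per Druid node type
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 10
```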

@AdheipSingh
Contributor

@chn217 did the coordinator issue reoccur?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh The symptom looks exactly the same; the coordinator doesn't show up in the pod/sts list.

```
kubectl get statefulsets.apps
NAME                          READY   AGE
druid-dev238-brokers          0/1     101m
druid-dev238-historicals      0/3     101m
druid-dev238-middlemanagers   0/3     101m
druid-dev238-overlords        0/1     101m
zookeeper                     3/3     12h

kubectl get pod
NAME                              READY   STATUS             RESTARTS         AGE
druid-dev238-brokers-0            0/1     CrashLoopBackOff   19 (4m50s ago)   102m
druid-dev238-historicals-0        0/1     CrashLoopBackOff   33 (100s ago)    102m
druid-dev238-historicals-1        0/1     CrashLoopBackOff   33 (89s ago)     102m
druid-dev238-historicals-2        0/1     Running            32 (5m44s ago)   102m
druid-dev238-middlemanagers-0     0/1     CrashLoopBackOff   27 (114s ago)    102m
druid-dev238-middlemanagers-1     0/1     CrashLoopBackOff   27 (110s ago)    102m
druid-dev238-middlemanagers-2     0/1     CrashLoopBackOff   27 (89s ago)     102m
druid-dev238-overlords-0          0/1     CrashLoopBackOff   27 (100s ago)    102m
druid-operator-5c998c4c46-fdb5d   2/2     Running            0                121m
druid-operator-5c998c4c46-rwvdm   2/2     Running            0                121m
druid-operator-5c998c4c46-xxh79   2/2     Running            0                121m
external-dns-769d98f985-trtwb     1/1     Running            0                121m
zookeeper-0                       1/1     Running            0                131m
zookeeper-1                       1/1     Running            0                131m
zookeeper-2                       1/1     Running            0                131m
```

@chn217
Author

chn217 commented Sep 25, 2023

Also, the router sts is not showing up. Not sure whether changing from sts to Deployment for the query/master nodes would help?

@AdheipSingh
Contributor

did this occur when you did an upgrade ?

@chn217
Author

chn217 commented Sep 25, 2023

@AdheipSingh No, it happened for a new cluster. Here are the steps that I took to create a new cluster:

  • Create EKS cluster
  • Create Node group
  • Run Druid operator
  • Load the cluster manifest

This issue seems to occur starting from v1.2.0; previously we were on v1.0.0, where we never saw it. BTW, this issue is intermittent (another symptom of a race condition).
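The steps above, sketched as commands (tool, chart, and file names are illustrative; the reporter's actual pipeline is AWS CDK with kubectl under the hood):

```shell
# 1-2. Create the EKS cluster and node group (illustrative eksctl call)
eksctl create cluster --name druid-test --nodes 3

# 3. Run the druid operator (illustrative install; chart/repo names assumed)
helm install druid-operator druid-operator/druid-operator

# 4. Load the cluster manifest
kubectl apply -f druid-cluster.yaml
```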

@AdheipSingh
Contributor

@chn217 the operator will log if it deletes any node.

@AdheipSingh
Contributor

Feel free to re-open and provide sufficient logs showing that the operator deleted the node. You can find them in the operator logs and in the operator events when describing the current CR.
