Add operator election support #3632

Merged (8 commits) Sep 3, 2020

Conversation

@barkbay (Contributor) commented Aug 14, 2020

This PR adds and enables a leader election mechanism between several running operators. It allows the user to scale up the number of operator instances running at the same time.

Fixes #709

@botelastic botelastic bot added the triage label Aug 14, 2020
@barkbay barkbay added the >enhancement (Enhancement of existing functionality) label Aug 14, 2020
@botelastic botelastic bot removed the triage label Aug 14, 2020
@anyasabo (Contributor):

Should we change the defaults to increase the number of replicas, or at least change it for e2e tests to make sure this gets used? It doesn't look like we hardcode references to elastic-operator-0 anywhere outside of one doc.

@anyasabo (Contributor):

I'd also argue this counts as a >feature, not an >enhancement.

@barkbay (Contributor, Author) commented Aug 14, 2020

Should we change the defaults to increase the number of replicas, or at least change it for e2e tests to make sure this gets used?

Regarding the defaults in the all-in-one manifest, I was wondering the same. I would be 👍 to increase it to something like 2 and add a variable in the manifest template.

Regarding our e2e tests, I was wondering whether #2012 would be a better place to do that, because I'm not sure our current e2e tests would give us any insight into this feature.

I'd also argue this counts as a >feature, not an >enhancement.

If you have already tried to increase the number of replicas and wondered why some reconciliations look "weird", then I think it looks like an enhancement 😄. I'm also fine with the >feature label.

@barkbay barkbay added the >feature (Adds or discusses adding a feature to the product) label and removed the >enhancement (Enhancement of existing functionality) label Aug 14, 2020
Co-authored-by: Anya Sabo <anya@sabolicio.us>
@anyasabo (Contributor):

Regarding the defaults in the all-in-one manifest, I was wondering the same. I would be 👍 to increase it to something like 2 and add a variable in the manifest template.

👍 And point taken on the e2e testing bit: it would still eventually work without this, so we would really only be catching the case where they're deadlocked.

@david-kow (Contributor):

I have some questions regarding this change:

  • Why would someone use this feature? IIUC, when the primary fails, a secondary will be promoted and take over. What is the improvement of that behaviour over just starting a new Pod? If it's just startup time, do we know the actual improvement? (I'm asking since it seems to me that it would be minimal: ~0s vs. a few seconds.)
  • Did we get any requests for this?
  • What are the failure scenarios that this mechanism handles on its own? Do the operator Pods communicate directly? What if a network partition happens and both the new and the old primary still have access to the k8s APIs and are still running?

I think that if we want to add this (especially as the default), we should have it extensively tested.

@barkbay (Contributor, Author) commented Aug 19, 2020

  • Why would someone use this feature? IIUC, when the primary fails, a secondary will be promoted and take over. What is the improvement of that behaviour over just starting a new Pod? If it's just startup time, do we know the actual improvement? (I'm asking since it seems to me that it would be minimal: ~0s vs. a few seconds.)

I'm not sure what you mean by "starting a new Pod". If a container inside a Pod is crashing because of a container runtime failure, a configuration error or some network issue, then the kubelet will try to restart the container, but the Pod itself will stay in a CrashLoopBackOff state.

  • Did we get any requests for this?

Not to my knowledge. The idea here is to prevent several operators from reconciling simultaneously when a user increases the number of replicas. I think it should also cover the case where a Pod is in an Unknown state because the kubelet is partitioned and a controller attempts to start a new instance.

  • What are the failure scenarios that this mechanism handles on its own? Do the operator Pods communicate directly? What if a network partition happens and both the new and the old primary still have access to the k8s APIs and are still running?

I think that if we want to add this (especially as the default), we should have it extensively tested.

The leader election mechanism is based on the ResourceLock, which already has its own unit tests. The overall mechanism is used by the Kubernetes project itself for the election of the K8s scheduler and of some built-in controllers. I agree that it should be tested, but as mentioned above I think it should be done as part of #2012.

There is no communication between the Pods; they compete to apply an annotation on a ConfigMap, e.g.:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"fcf2a873-8018-43c0-9196-13293a65d9ef","leaseDurationSeconds":15,"acquireTime":"2020-08-19T09:29:49Z","renewTime":"2020-08-19T09:30:15Z","leaderTransitions":0}'
  creationTimestamp: "2020-08-19T09:29:49Z"
  name: elastic-operator-leader
  namespace: elastic-system
  resourceVersion: "2331024"
  selfLink: /api/v1/namespaces/elastic-system/configmaps/elastic-operator-leader
  uid: 9eddb203-16c1-4751-9990-a3f61894cb5c
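
For reference, here is a minimal sketch of how this is typically wired through the controller-runtime manager options; the values below are illustrative (the PR reads them from the operator flags, as shown in the cmd/manager/main.go excerpt further down), not the exact code of this change:

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// With LeaderElection enabled, the manager relies on the client-go
	// resource lock: it only starts its controllers once this instance
	// holds the lock stored in the named ConfigMap, and keeps renewing it.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "elastic-operator-leader", // name of the lock ConfigMap
		LeaderElectionNamespace: "elastic-system",          // namespace the operator runs in
	})
	if err != nil {
		panic(err)
	}
	// Controllers would be registered on mgr here.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}

Each replica runs the same code; only the instance currently holding the lock runs reconciliations, while the others wait until they manage to acquire it.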

@david-kow (Contributor):

I'm not sure what you mean by "starting a new Pod"

I didn't phrase it well, I meant what you described.

It all sounds convincing, but it seems to be more of a preventive measure for users who add more replicas than a reliability improvement, correct? I initially thought the goal was to increase reliability.

I agree that it should be tested, but as mentioned above I think it should be done as part of #2012.

Should we hold off on this until #2012 is in, then?

@barkbay (Contributor, Author) commented Aug 19, 2020

it seems to be more of a preventive measure for users who add more replicas than a reliability improvement, correct?

It prevents the operator from starting if it is not elected. By doing that, a user can safely run several instances at the same time. It improves availability, so I would say it is an improvement?

Should we hold off on this until #2012 is in, then?

The current e2e tests will ensure that an election happens in normal situations. As mentioned previously, we are relying on a feature of controller-runtime, so my feeling is that this is a nice improvement and there is no urgency to add additional e2e tests.

The only point I'm not sure about is whether we should keep the default number of replicas at 1 for now; it would make it easier for the user to get the logs. If we want to set the default value to 2, I think it makes sense to also do that for e2e tests, as suggested by @anyasabo.

@pebrc (Collaborator) commented Aug 20, 2020

The only point I'm not sure about is whether we should keep the default number of replicas at 1 for now; it would make it easier for the user to get the logs. If we want to set the default value to 2, I think it makes sense to also do that for e2e tests, as suggested by @anyasabo.

I agree that we should probably keep the number of replicas at 1 but exercise the new functionality in an e2e test.

cmd/manager/main.go:
CertDir: viper.GetString(operator.WebhookCertDirFlag),
LeaderElection: viper.GetBool(operator.EnableLeaderElection),
LeaderElectionID: LeaderElectionConfigMapName,
LeaderElectionNamespace: operatorNamespace,
Review comment (Collaborator):

I wonder if it is possible to deploy multiple independent instances of the operator, responsible for different namespaces or sets of namespaces, via OLM. IIRC they would all run in the operators namespace by default, and would then all try to use the same ConfigMap. Is this possible, or are there additional safeguards in place in the leader election algorithm, e.g. some form of unique identifier per operator? If not, should we try to include the operator UUID as an ID to prevent that?

Review comment (Contributor):

👍 I think it's a good argument for making the leader election config map name configurable.

Review comment (Contributor):

I don't think we support multiple operators in the same namespace today, but you're right that there's an argument that we should, and maybe we should start here. Off the top of my head, we would need to make this variable:

	// licensingCfgMapName is the name of the config map used to store licensing information
	licensingCfgMapName = "elastic-licensing"
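
For illustration only, a possible way to make that name configurable, following the same viper flag pattern already used in cmd/manager/main.go; the flag name and helper below are hypothetical and not part of this PR:

package licensing

import "github.com/spf13/viper"

// LicensingCfgMapNameFlag is a hypothetical flag name, not added by this PR.
const LicensingCfgMapNameFlag = "licensing-config-map-name"

// licensingCfgMapName returns the ConfigMap name used to store licensing
// information, falling back to the current hardcoded default.
func licensingCfgMapName() string {
	if name := viper.GetString(LicensingCfgMapNameFlag); name != "" {
		return name
	}
	return "elastic-licensing"
}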

@barkbay (Contributor, Author) commented Aug 26, 2020

Thanks for all the input! I have tried to sum up my thoughts on the different topics raised here:

Making the leader election config map name configurable

IIUC the intention here is to be able to run several operators with distinct responsibilities in a shared namespace.
I don't think it is possible right now:

  • The ConfigMap which contains the operator UUID is generated on a per-namespace basis. I would expect distinct UUIDs for each group of operators, to be consistent with what would happen if they were deployed in different namespaces.
  • I believe that the licensing artifacts are also tied to the namespace in which Pods are running, not to a group of operator instances.

Regarding OLM: I'm not sure it is supported, or at least it does not seem to be the primary intention. The documentation seems to indicate that an Operator group is tied to a namespace:

An Operator is considered a member of an OperatorGroup if the following conditions are true:

  • The Operator’s CSV exists in the same namespace as the OperatorGroup.
  • The Operator’s CSV’s InstallModes support the set of namespaces targeted by the OperatorGroup.

Maybe we should discuss whether we want this feature in a dedicated issue? (Kind of related to #2032.)

Concurrent execution

While it is possible to have an algorithm which ensures that there is at most one leader at a given point in time, it is not possible to enforce this condition at the workload level (i.e. threads/goroutines may still temporarily live in a process which is no longer the leader). Even the "leader-for-life" election mechanism has its own edge cases.

I would consider that:

  • An election mechanism based on leases is enough for now (a minimal sketch follows after this list).
  • The manager will quickly shut down the controllers and the workqueues. I don't think any slow I/O would interfere here.
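
As a rough illustration of the lease-based mechanism (and of why a goroutine may briefly outlive leadership), here is a minimal client-go sketch; the namespace, lock name, identity and timings are assumptions, and the operator gets this behaviour through controller-runtime rather than by calling client-go directly. The ConfigMap lock type matches what was used at the time of this PR; newer client-go versions use Lease objects instead:

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock is the ConfigMap shown earlier; the identity ends up in the
	// holderIdentity field of the leader annotation.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock,
		"elastic-system", "elastic-operator-leader",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
		RenewDeadline: 10 * time.Second, // the leader must renew before this elapses
		RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start controllers and workqueues here
			},
			OnStoppedLeading: func() {
				// in-flight goroutines may still run briefly after this point,
				// which is the "at most one leader" caveat mentioned above
				os.Exit(1)
			},
		},
	})
}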

Testing

I agree that we should test this feature to some extent, but I'm not sure we should expect to have it extensively tested. Election mechanisms are mostly disrupted by slow I/O, high latencies, clock skew/rate inconsistencies, partition conditions on the host or in the process... I'm not sure it is easy to simulate those conditions in an e2e environment (at least without adding some noise to our e2e test suite). Killing Pods randomly would be a first step, but my feeling is that it would just be happy-path testing. Let's follow up on this discussion in #2012, for which I'll try to outline a proposal.

@sebgl (Contributor) left a review:

LGTM once we set the number of replicas back to 1 by default in the manifest. I see this as an additional safeguard in case multiple operators run in parallel, more than a real feature that allows you to achieve HA.

About the points mentioned in the comments:

  • 👍 for considering making the config map name configurable at a later stage, along with other hardcoded resource names.
  • I would vote for not spending too much time writing complicated tests to cover the leader election process, since we just reuse it from client-go and controller-runtime where it is already tested. I think I'm actually OK with no tests at all on our side.

@barkbay (Contributor, Author) commented Sep 3, 2020

I'm merging this one and will quickly follow up with some e2e tests.

@barkbay barkbay merged commit ff40a5c into elastic:master Sep 3, 2020
Labels: >feature (Adds or discusses adding a feature to the product), v1.3.0

Merging this pull request may close: Support leader election to be more flexible about operator deployment

5 participants