Add operator election support #3632

Merged (8 commits) Sep 3, 2020

Conversation

@barkbay (Contributor) commented Aug 14, 2020

This PR adds and enables a leader election mechanism between several running operators. It allows the user to scale up the number of operator instances running at the same time.

Fixes #709

@botelastic botelastic bot added the triage label Aug 14, 2020
@barkbay barkbay added the >enhancement (Enhancement of existing functionality) label Aug 14, 2020
@botelastic botelastic bot removed the triage label Aug 14, 2020
@anyasabo (Contributor):

Should we change the defaults to increase the number of replicas, or at least change it for e2e tests to make sure this gets used? It doesn't look like we hardcode references to elastic-operator-0 anywhere outside of one doc.

@anyasabo (Contributor):

I'd also argue this counts as a >feature, not an >enhancement.

@barkbay (Contributor, Author) commented Aug 14, 2020

Should we change the defaults to increase the number of replicas, or at least change it for e2e tests to make sure this gets used?

Regarding the defaults in the all-in-one manifest, I was wondering the same. I would be 👍 to increase it to something like 2 and add a variable in the manifest template.

Regarding our e2e tests, I was wondering whether #2012 would be a better place to do that, because I'm not sure our current e2e tests would give us any insight into this feature.

I'd also argue this counts as a >feature, not an >enhancement.

If you have already tried to increase the number of replicas and wondered why some reconciliations look "weird", then I think it looks like an enhancement 😄. I'm also fine with the >feature label.

@barkbay barkbay added the >feature (Adds or discusses adding a feature to the product) label and removed the >enhancement (Enhancement of existing functionality) label Aug 14, 2020
Co-authored-by: Anya Sabo <anya@sabolicio.us>
@anyasabo (Contributor):

Regarding the defaults in the all-in-one manifest, I was wondering the same. I would be 👍 to increase it to something like 2 and add a variable in the manifest template.

👍 And point taken on the e2e testing bit: it would still eventually work without this, so we would really only be catching the case where they're deadlocked.

@david-kow (Contributor):

I have some questions regarding this change:

  • Why would someone use this feature? IIUC, when the primary fails, a secondary will be promoted and take over. What is the improvement of that behaviour over just starting a new Pod? If it's just startup time, do we know the actual improvement? (I'm asking since it seems to me that it would be minimal: ~0s vs. a few seconds.)
  • Did we get any requests for this?
  • What are the failure scenarios that this mechanism handles on its own? Do the operator Pods communicate directly? What if a network partition happens and both the new and the old primary still have access to the k8s APIs and are still running?

I think that if we want to add this (especially as the default), we should have it extensively tested.

@barkbay (Contributor, Author) commented Aug 19, 2020

  • Why would someone use this feature? IIUC, when the primary fails, a secondary will be promoted and take over. What is the improvement of that behaviour over just starting a new Pod? If it's just startup time, do we know the actual improvement? (I'm asking since it seems to me that it would be minimal: ~0s vs. a few seconds.)

I'm not sure what you mean by "starting a new Pod". If a container inside a Pod is crashing because of a container runtime failure, a configuration error or some network issue, then the kubelet will try to restart the container, but the Pod itself will stay in a CrashLoopBackOff state.

  • Did we get any requests for this?

Not to my knowledge. The idea here is to prevent several operators from reconciling simultaneously when a user increases the number of replicas. I think it should also cover the case where a Pod is in an Unknown state because the kubelet is partitioned and a controller attempts to start a new instance.

  • What are the failure scenarios that this mechanism handles on its own? Do the operator Pods communicate directly? What if a network partition happens and both the new and the old primary still have access to the k8s APIs and are still running?

I think that if we want to add this (especially as the default), we should have it extensively tested.

The leader election mechanism is based on the ResourceLock, which already has its own unit tests. The overall mechanism is used by the Kubernetes project itself for the election of the K8s scheduler and of some built-in controllers. I agree that it should be tested, but as mentioned above I think it should be done as part of #2012.

There is no communication between the Pods; they compete to apply an annotation on a ConfigMap, e.g.:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"fcf2a873-8018-43c0-9196-13293a65d9ef","leaseDurationSeconds":15,"acquireTime":"2020-08-19T09:29:49Z","renewTime":"2020-08-19T09:30:15Z","leaderTransitions":0}'
  creationTimestamp: "2020-08-19T09:29:49Z"
  name: elastic-operator-leader
  namespace: elastic-system
  resourceVersion: "2331024"
  selfLink: /api/v1/namespaces/elastic-system/configmaps/elastic-operator-leader
  uid: 9eddb203-16c1-4751-9990-a3f61894cb5c
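
For reference, here is a minimal sketch of how this is typically wired through the controller-runtime manager options; the values below are illustrative (the PR reads them from the operator flags, as shown in the cmd/manager/main.go excerpt further down), not the exact code of this change:

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// With LeaderElection enabled, the manager relies on the client-go
	// resource lock: it only starts its controllers once this instance
	// holds the lock stored in the named ConfigMap, and keeps renewing it.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "elastic-operator-leader", // name of the lock ConfigMap
		LeaderElectionNamespace: "elastic-system",          // namespace the operator runs in
	})
	if err != nil {
		panic(err)
	}
	// Controllers would be registered on mgr here.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}

Each replica runs the same code; only the instance currently holding the lock runs reconciliations, while the others wait until they manage to acquire it.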

@david-kow (Contributor):

I'm not sure what you mean by "starting a new Pod"

I didn't phrase it well, I meant what you described.

It all sounds convincing, but it seems to be more of a preventive measure for users who add more replicas than a reliability improvement, correct? I initially thought the goal was to increase reliability.

I agree that it should be tested, but as mentioned above I think it should be done as part of #2012.

Should we hold off on this until #2012 is in, then?

@barkbay (Contributor, Author) commented Aug 19, 2020

it seems to be more of a preventive measure for users who add more replicas than a reliability improvement, correct?

It prevents the operator from starting if it is not elected. By doing that, a user can safely run several instances at the same time. It improves availability, so I would say it is an improvement?

Should we hold off on this until #2012 is in, then?

The current e2e tests will ensure that an election happens in normal situations. As mentioned previously, we are relying on a feature of controller-runtime, so my feeling is that this is a nice improvement and there is no urgency to add additional e2e tests.

The only point I'm not sure about is whether we should keep the default number of replicas at 1 for now; it would make it easier for the user to get the logs. If we want to set the default value to 2, I think it makes sense to also do that for e2e tests, as suggested by @anyasabo.

@pebrc (Collaborator) commented Aug 20, 2020

The only point I'm not sure about is whether we should keep the default number of replicas at 1 for now; it would make it easier for the user to get the logs. If we want to set the default value to 2, I think it makes sense to also do that for e2e tests, as suggested by @anyasabo.

I agree that we should probably keep the number of replicas at 1 but exercise the new functionality in an e2e test.

cmd/manager/main.go:
CertDir: viper.GetString(operator.WebhookCertDirFlag),
LeaderElection: viper.GetBool(operator.EnableLeaderElection),
LeaderElectionID: LeaderElectionConfigMapName,
LeaderElectionNamespace: operatorNamespace,
Review comment (Collaborator):

I wonder if it is possible to deploy multiple independent instances of the operator, responsible for different namespaces or sets of namespaces, via OLM. IIRC they would all run in the operators namespace by default, and would then all try to use the same ConfigMap. Is this possible, or are there additional safeguards in place in the leader election algorithm, e.g. some form of unique identifier per operator? If not, should we try to include the operator UUID as an ID to prevent that?

Review comment (Contributor):

👍 I think it's a good argument for making the leader election config map name configurable.

Review comment (Contributor):

I don't think we support multiple operators in the same namespace today, but you're right that there's an argument that we should, and maybe we should start here. Off the top of my head, we would need to make this variable:

	// licensingCfgMapName is the name of the config map used to store licensing information
	licensingCfgMapName = "elastic-licensing"
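
For illustration only, a possible way to make that name configurable, following the same viper flag pattern already used in cmd/manager/main.go; the flag name and helper below are hypothetical and not part of this PR:

package licensing

import "github.com/spf13/viper"

// LicensingCfgMapNameFlag is a hypothetical flag name, not added by this PR.
const LicensingCfgMapNameFlag = "licensing-config-map-name"

// licensingCfgMapName returns the ConfigMap name used to store licensing
// information, falling back to the current hardcoded default.
func licensingCfgMapName() string {
	if name := viper.GetString(LicensingCfgMapNameFlag); name != "" {
		return name
	}
	return "elastic-licensing"
}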

@barkbay (Contributor, Author) commented Aug 26, 2020

Thanks for all the input! I have tried to sum up my thoughts on the different topics raised here:

Making the leader election config map name configurable

IIUC the intention here is to be able to run several operators with distinct responsibilities in a shared namespace.
I don't think it is possible right now:

  • The ConfigMap which contains the operator UUID is generated on a per-namespace basis. I would expect distinct UUIDs for each group of operators, to be consistent with what would happen if they were deployed in different namespaces.
  • I believe that the licensing artifacts are also tied to the namespace in which Pods are running, not to a group of operator instances.

Regarding OLM: I'm not sure it is supported, or at least it does not seem to be the primary intention. The documentation seems to indicate that an Operator group is tied to a namespace:

An Operator is considered a member of an OperatorGroup if the following conditions are true:

  • The Operator’s CSV exists in the same namespace as the OperatorGroup.
  • The Operator’s CSV’s InstallModes support the set of namespaces targeted by the OperatorGroup.

Maybe we should discuss whether we want this feature in a dedicated issue? (Kind of related to #2032.)

Concurrent execution

While it is possible to have an algorithm which ensures that there is at most one leader at a given point in time, it is not possible to enforce this condition at the workload level (i.e. threads/goroutines may still temporarily live in a process which is no longer the leader). Even the "leader-for-life" election mechanism has its own edge cases.

I would consider that:

  • An election mechanism based on leases is enough for now (a minimal sketch follows after this list).
  • The manager will quickly shut down the controllers and the workqueues. I don't think any slow I/O would interfere here.
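
As a rough illustration of the lease-based mechanism (and of why a goroutine may briefly outlive leadership), here is a minimal client-go sketch; the namespace, lock name, identity and timings are assumptions, and the operator gets this behaviour through controller-runtime rather than by calling client-go directly. The ConfigMap lock type matches what was used at the time of this PR; newer client-go versions use Lease objects instead:

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock is the ConfigMap shown earlier; the identity ends up in the
	// holderIdentity field of the leader annotation.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock,
		"elastic-system", "elastic-operator-leader",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
		RenewDeadline: 10 * time.Second, // the leader must renew before this elapses
		RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start controllers and workqueues here
			},
			OnStoppedLeading: func() {
				// in-flight goroutines may still run briefly after this point,
				// which is the "at most one leader" caveat mentioned above
				os.Exit(1)
			},
		},
	})
}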

Testing

I agree that we should test this feature to some extent, but I'm not sure we should expect to have it extensively tested. Election mechanisms are mostly disrupted by slow I/O, high latencies, clock skew/rate inconsistencies, partition conditions on the host or in the process... I'm not sure it is easy to simulate those conditions in an e2e environment (at least without adding some noise to our e2e test suite). Killing Pods randomly would be a first step, but my feeling is that it would just be happy-path testing. Let's follow up on this discussion in #2012, for which I'll try to outline a proposal.

@sebgl (Contributor) left a review:

LGTM once we set the number of replicas back to 1 by default in the manifest. I see this as an additional safeguard in case multiple operators run in parallel, more than a real feature that allows you to achieve HA.

About the points mentioned in the comments:

  • 👍 for considering making the config map name configurable at a later stage, along with other hardcoded resource names.
  • I would vote for not spending too much time writing complicated tests to cover the leader election process, since we just reuse it from client-go and controller-runtime where it is already tested. I think I'm actually OK with no tests at all on our side.

@barkbay (Contributor, Author) commented Sep 3, 2020

I'm merging this one and will quickly follow up with some e2e tests.

@barkbay barkbay merged commit ff40a5c into elastic:master Sep 3, 2020
Labels: >feature (Adds or discusses adding a feature to the product), v1.3.0

Merging this pull request may close: Support leader election to be more flexible about operator deployment

5 participants