Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turns on leader election by default and adds e2e tests #724

Merged
merged 1 commit into from
Nov 26, 2018

Conversation

mhrivnak
Copy link
Member

@mhrivnak mhrivnak commented Nov 9, 2018

Adds a readiness probe by default that reflects whether a pod is the leader.

fixes #598

Motivation for the change:

see #530

@mhrivnak mhrivnak added kind/feature Categorizes issue or PR as related to a new feature. area/testing Issue related to testing the operator-sdk and subcomponents labels Nov 9, 2018
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 9, 2018
Copy link
Member

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, just couple of nits and questions

command:
- stat
- /tmp/operator-sdk-leader
initialDelaySeconds: 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask about the numbers, what is the reason behind 4 seconds delay?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small enough to not cause a lot of delay, but large enough to give the process a chance to start. Otherwise it's just a feeling. :) It's hard to pick these numbers well. It could probably be 2 seconds or 6 seconds with just as much success. I'm certainly open to any suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern was that the delay might not be enough. What happens then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will try again 10 seconds later, so I don't see an opportunity for it to cause a problem.

test/e2e/memcached_test.go Outdated Show resolved Hide resolved
test/e2e/memcached_test.go Outdated Show resolved Hide resolved
Copy link
Member

@estroz estroz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after nit about sed.

test/e2e/memcached_test.go Outdated Show resolved Hide resolved
Copy link
Member

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@mhrivnak mhrivnak force-pushed the leader-testing branch 2 times, most recently from 947719a to ae4e1da Compare November 19, 2018 20:37
log.Error(err, "")
os.Exit(1)
}
defer os.Remove(f.Name())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhrivnak What do you think about providing a helper to abstract away Setting and Unsetting the Readiness check of an operator?
e.g SetReady() and UnsetReady(). That way the user can unset the Readiness if necessary based on the operator setup logic.
Plus it's a bit cleaner than having to create the file in main.go e.g:
https://github.com/coreos/etcd-operator/blob/master/pkg/util/probe/readyz.go#L31-L35
https://github.com/coreos/etcd-operator/blob/master/pkg/controller/informer.go#L51

And as for the actual probe, do you have any thoughts on using a CMD probe(stat file) vs an HTTP Get probe.
https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L105

I've seen a few other operators use the HTTP probe, although I'm not sure that's a standard for upstream components like the scheduler, apiserver etc.
https://github.com/operator-framework/operator-metering/blob/master/manifests/deploy/generic/helm-operator/metering-operator-deployment.yaml#L86-L89
https://github.com/coreos/prometheus-operator/blob/master/pkg/prometheus/statefulset.go#L512-L516

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the helper idea. I moved that part to a new package; see what you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the probe type, http is useful if you want to dynamically determine readiness at the time a probe runs. For our current use case, I think it's overkill to have an entire web handler, and probably listen on yet another socket, just to return a boolean value that's being stored as a singleton in memory. It's much simpler to save that boolean state to the filesystem (by creating or deleting a file at a known path) and let k8s access that state on-disk through normal POSIX means.

You probably saw this, but for completeness, the main example for liveness/readiness probes uses the same file-existence technique: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-command

That said, I made an interface to set and unset the state so that if a future operator dev has a need for an http-based readiness probe, or any other kind besides the simple file-based probe, that can be facilitated.

@hasbro17
Copy link
Contributor

Overall SGTM. Just a comment on possibly abstracting the readiness probe check as helper to make things cleaner.

Adds a readiness probe by default that reflects whether a pod is the leader.

fixes operator-framework#598
Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing Issue related to testing the operator-sdk and subcomponents kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add e2e test for leader election
5 participants