
Master reboot can cause etcd/etcd-events pods to not start up because of races #5578

Closed
kanantheswaran-splunk opened this issue Aug 3, 2018 · 1 comment

@kanantheswaran-splunk

1. What kops version are you running? The command kops version will display this information.

Version 1.9.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.9.9 (we use this version because we really, really need this fix, without which we encounter data plane outages when rolling masters)

3. What cloud provider are you using?

AWS (CoreOS AMI)

4. What commands did you run? What is the simplest way to reproduce this issue?

Masters rebooted at different times after a successful initial setup. Since this appears to be an initialization race (see below), it cannot be reproduced at will.

5. What happened after the commands executed?

One master was trying to run both etcd pods (main and events), another was running just the main etcd pod and wasn't even trying to set up the events pod, and the third wasn't trying to run either.

When the system was inspected, manifest files were found for both the etcd and etcd-events pods.

6. What did you expect to happen?

The etcd main and events pods should have started up correctly on all masters.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

This is not relevant to the problem at hand.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

This is not relevant to the problem at hand.

9. Anything else we need to know?

We believe there is a race between the actions that mount the etcd volumes on the local filesystem and the kubelet reading and running the etcd manifest files.

The error messages we saw were as follows:

Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.378908    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd-events.manifest": stat /etc/kubernetes/manifests/etcd-events.manifest: no such file or directory
Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.379946    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd.manifest": stat /etc/kubernetes/manifests/etcd.manifest: no such file or directory
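
On an affected master, this state could be confirmed with something like the following (the volume mount point is illustrative; kops derives the real one from the EBS volume):

    ls -l /etc/kubernetes/manifests/       # symlinks exist but their targets don't resolve yet
    mount | grep /mnt/master-vol           # one or both etcd volumes not mounted
    journalctl -u kubelet | grep manifest  # the "Can't get metadata" errors above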

When masters are initially set up for a new cluster, protokube has already mounted the etcd volumes, written the manifests, and symlinked them under /etc/kubernetes/manifests before the kubelet service is started.

On reboot, protokube remounts the volumes, but since the kubelet service is already installed, it can start before the volumes have been mounted. The kubelet then fails to resolve the manifest symlinks and doesn't start one or both of the etcd pods (depending on which volumes have been mounted at that instant).

There also appears to be no filesystem event after that point that causes the kubelet to re-discover the manifests and start the missing etcd pods.

A simple solution might be to configure the kubelet service not to autostart on boot, since protokube will start it anyway.
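
A minimal sketch of that suggestion, assuming the kubelet runs as a standard systemd unit (this is the reporter's proposed mitigation, not something kops ships):

    # Remove the boot-time activation; protokube can still start the unit
    # explicitly (systemctl start kubelet) once the etcd volumes are mounted
    # and the manifest symlinks resolve.
    systemctl disable kubelet.service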

@kanantheswaran-splunk (Author)

OK, kops is actually doing the right thing here: protokube only starts the kubelet service after all the mounts are created. The real issue was that we were installing a systemd service (via hooks) with a Requires=kubelet line, and nodeup was starting all the hooks on startup, which inadvertently started the kubelet service before protokube explicitly started it.
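
For anyone hitting the same thing, the problematic shape looks roughly like this in the cluster spec (the hook name and unit content are hypothetical):

    hooks:
    - name: example-agent.service        # hypothetical hook
      roles:
      - Master
      manifest: |
        [Unit]
        # Requires= makes systemd activate kubelet whenever this hook
        # starts, which is what raced with protokube here:
        Requires=kubelet.service
        [Service]
        ExecStart=/usr/local/bin/example-agent   # hypothetical binary

Replacing Requires= with After=kubelet.service only orders the hook relative to kubelet without activating it, leaving kubelet startup to protokube.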
