
ElasticSearch statefulset broken after 1 hour #192

Closed
tomfotherby opened this issue Jun 8, 2017 · 4 comments

tomfotherby commented Jun 8, 2017

I brought up a Kubernetes cluster with Tack in an existing VPC in us-east-1 and all was good until suddenly the first pod in the ElasticSearch StatefulSet was killed.

I confirmed from CloudTrail and the ASG Activity History that the autoscaler had removed a Worker which, by chance, had an ElasticSearch pod on it. I can see that the 25G EBS volume the StatefulSet volumeClaimTemplates had provisioned is now unattached.
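
For reference, one quick way to confirm the orphaned volume from the AWS CLI (a sketch, not something I ran at the time; it relies on the kubernetes.io/created-for tags the AWS provisioner normally adds to dynamically provisioned volumes):

$ aws ec2 describe-volumes \
    --filters Name=tag:kubernetes.io/created-for/pv/name,Values=pvc-e88bb223-4c34-11e7-bb12-0afa88f15a64 \
    --query 'Volumes[].{Id:VolumeId,State:State,SizeGiB:Size}'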

The second ElasticSearch pod was assigned to a master node, so it is unaffected by scaling events. One solution would be to force both ElasticSearch pods onto master nodes.

Here we see the statefulset is broken:

$ kubectl get statefulset -n kube-system
NAME                    DESIRED   CURRENT   AGE
elasticsearch-logging   2         1         5h

The elasticsearch-logging-1 pod exists but the elasticsearch-logging-0 pod is missing:

$ kubectl get pods -n kube-system -l k8s-app=elasticsearch-logging
NAME                      READY     STATUS    RESTARTS   AGE
elasticsearch-logging-1   1/1       Running   0          5h

This command explains the cause of the failure, i.e. it's trying to attach an EBS volume to a now non-existent node:

$ kubectl get events -n kube-system
LASTSEEN   FIRSTSEEN   COUNT     NAME                      KIND      SUBOBJECT   TYPE      REASON        SOURCE         MESSAGE
3s         6h          195       elasticsearch-logging-0   Pod                   Warning   FailedMount   attachdetach   Failed to attach volume "pvc-e88bb223-4c34-11e7-bb12-0afa88f15a64" on node "ip-10-56-0-138.ec2.internal" with: error finding instance ip-10-56-0-138.ec2.internal: instance not found

This command shows there is some problem deleting the node (even though it does not show up in kubectl get nodes):

$ kubectl get events
LASTSEEN   FIRSTSEEN   COUNT     NAME                          KIND      SUBOBJECT   TYPE      REASON         SOURCE              MESSAGE
3s         4h          3386      ip-10-56-0-138.ec2.internal   Node                  Normal    DeletingNode   controllermanager   Node ip-10-56-0-138.ec2.internal event: Deleting Node ip-10-56-0-138.ec2.internal because it's not present according to cloud provider

(FYI: I think this log spam is a separate issue fixed in kubernetes/kubernetes#45923)

Checking the autoscaler status also shows there are still 6 registered nodes:

$ kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2017-06-08 17:11:20.848329171 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=5 unready=0 notStarted=0 longNotStarted=0 registered=6)
...

I'm not sure how to tell Kubernetes to truly forget the old ip-10-56-0-138 worker node, or how to stop it trying to attach the volume to an instance that doesn't exist.
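
The cleanup I would try next (untested, just a sketch) is to force-delete the stuck pod, delete the stale node object, and look up the underlying EBS volume ID from the PV in case it needs a manual detach:

$ kubectl delete pod elasticsearch-logging-0 -n kube-system --grace-period=0 --force
$ kubectl delete node ip-10-56-0-138.ec2.internal
# The volumeID comes back as aws://<az>/vol-xxxxxxxx; pass only the vol-xxxxxxxx part to the AWS CLI
$ kubectl get pv pvc-e88bb223-4c34-11e7-bb12-0afa88f15a64 \
    -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'
$ aws ec2 detach-volume --force --volume-id <vol-id-from-above>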

tomfotherby commented Jun 8, 2017

My cluster never recovered. There is some deadlock occurring between the EBS persistent volumes and the autoscaler. I ran out of time and energy investigating and had to press on, so I ran make clean and re-created the cluster after changing addons/logging/elasticsearch-logging.yml to include a nodeSelector that forces the pods onto a master node, so the issue can't re-occur:

    spec:
      nodeSelector:
        # Force ES pods onto master nodes because otherwise the autoscaler may
        # shut down the node and the StatefulSet is left unable to function,
        # due to a bug with EBS attachments or something, not sure exactly.
        node-role.kubernetes.io/master: ''
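
After re-creating the cluster, the placement can be double-checked with (a quick sketch):

$ kubectl get pods -n kube-system -l k8s-app=elasticsearch-logging -o wide

The NODE column should show a master node for both pods. (If the masters carry a NoSchedule taint, a matching toleration would also be needed.)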

Feel free to close this issue if you think it's a rare event or a StatefulSet bug.

@tomfotherby

I'm not 100% sure, but I think my problem is fixed in PR kubernetes/kubernetes#46463:

Fix AWS EBS volumes not getting detached from node if routine to verify volumes are attached runs while the node is down.

I found it in the Kubernetes v1.7.0-beta.1 CHANGELOG, so hopefully the fix is coming on 28/Jun/17.

cemo commented Jun 9, 2017

There were some issues regarding the etcd version as well. The current version, 3.0.10, is problematic.

Would you add an entry to update your etcd version as well?

        - name: 10-environment.conf
          content: |
            [Service]
            Environment="ETCD_IMAGE_TAG=v3.0.17"
            Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${ fqdn }:2379"
            Environment="ETCD_CERT_FILE=/etc/ssl/certs/k8s-etcd.pem"
            Environment="ETCD_CLIENT_CERT_AUTH=true"
           ....

like this.
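
For reference, one way to check which etcd version the members are actually running is the /version endpoint (a sketch; only the client cert path comes from the unit above, the CA and key paths are placeholders to adjust):

$ curl --cacert /etc/ssl/certs/k8s-ca.pem \
       --cert /etc/ssl/certs/k8s-etcd.pem \
       --key /etc/ssl/certs/k8s-etcd-key.pem \
       https://127.0.0.1:2379/version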

bruj0 commented Jun 9, 2017

I too have problems with stateful sets, described at #185, but I think that's a Kubernetes problem. Regarding the autoscaler, I had problems with it too, so I just disabled it entirely.
I don't think it's ready for production yet.
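
In case it helps, disabling it was just a matter of scaling the deployment down to zero (assuming it runs as the usual cluster-autoscaler Deployment in kube-system; adjust the name to match your addon):

$ kubectl scale deployment cluster-autoscaler -n kube-system --replicas=0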
