
etcd-operator panics on self-hosted bootkube #851

Closed
janwillies opened this issue Mar 2, 2017 · 4 comments · Fixed by #852

janwillies commented Mar 2, 2017

I'm running a self-hosted bootkube cluster (see kubernetes-retired/bootkube#346) and ran into the following problem when trying to scale etcd.
Scaling etcd:

kubectl --namespace=kube-system get cluster.etcd kube-etcd -o json > etcd.json && \
vim etcd.json && \
curl -H 'Content-Type: application/json' -X PUT --data @etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd

Output:

{
  "apiVersion": "etcd.coreos.com/v1beta1",
  "kind": "Cluster",
  "metadata": {
    "name": "kube-etcd",
    "namespace": "kube-system",
    "selfLink": "/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd",
    "uid": "1b3c4d81-feef-11e6-9fc2-0026558252a6",
    "resourceVersion": "96374",
    "creationTimestamp": "2017-03-02T02:22:38Z"
  },
  "spec": {
    "selfHosted": {
      "bootMemberClientEndpoint": "http://10.7.183.59:12379"
    },
    "size": 3,
    "version": "3.1.0"
  },
  "status": {
    "conditions": null,
    "controlPaused": false,
    "currentVersion": "",
    "phase": "Failed",
    "reason": "cluster failed to be created",
    "size": 0,
    "targetVersion": ""
  }
}

etcd-operator log:

time="2017-03-02T15:07:22Z" level=info msg="etcd-operator Version: 0.2.1"
time="2017-03-02T15:07:22Z" level=info msg="Git SHA: ded9a44"
time="2017-03-02T15:07:22Z" level=info msg="Go Version: go1.7.5"
time="2017-03-02T15:07:22Z" level=info msg="Go OS/Arch: linux/amd64" 
time="2017-03-02T15:07:22Z" level=info msg="finding existing clusters..." pkg=controller 
time="2017-03-02T15:07:22Z" level=info msg="ignore failed cluster kube-etcd" pkg=controller
time="2017-03-02T15:07:22Z" level=info msg="starts running from watch version: 96182" pkg=controller 
time="2017-03-02T15:07:22Z" level=info msg="start watching at 96182" pkg=controller 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xe8 pc=0x9844b7]

goroutine 76 [running]:
panic(0x15975a0, 0xc42000a050)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/coreos/etcd-operator/pkg/cluster.(*Cluster).send(0x0, 0xc4205d8340)
        /home/ubuntu/code/golang/src/github.com/coreos/etcd-operator/pkg/cluster/cluster.go:212 +0x37
github.com/coreos/etcd-operator/pkg/cluster.(*Cluster).Update(0x0, 0xc420136380)
        /home/ubuntu/code/golang/src/github.com/coreos/etcd-operator/pkg/cluster/cluster.go:356 +0x77
github.com/coreos/etcd-operator/pkg/controller.(*Controller).Run.func2(0xc420064300, 0xc4200980a0)
        /home/ubuntu/code/golang/src/github.com/coreos/etcd-operator/pkg/controller/controller.go:167 +0x21e
created by github.com/coreos/etcd-operator/pkg/controller.(*Controller).Run
        /home/ubuntu/code/golang/src/github.com/coreos/etcd-operator/pkg/controller/controller.go:182 +0x315

My guess is that after etcd-operator panics and restarts, it can't find the already-running etcd cluster ("size": 0).
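The stack trace shows cluster.(*Cluster).send and (*Cluster).Update being called with a nil receiver (the 0x0 argument), which matches that guess. Below is a minimal, self-contained Go sketch of this failure mode; the type layout and field names are hypothetical stand-ins, not the operator's actual code:

package main

// Hypothetical stand-in for the operator's Cluster type; the eventCh field
// is an assumption used only to illustrate the nil dereference.
type Cluster struct {
	eventCh chan string
}

// Calling a method on a nil pointer is legal in Go, but a field access
// through the nil receiver faults at the field's offset from address 0,
// which is why the trace reports a small address like addr=0xe8.
func (c *Cluster) send(ev string) {
	c.eventCh <- ev
}

func (c *Cluster) Update(ev string) {
	c.send(ev)
}

func main() {
	var c *Cluster // e.g. a failed cluster that was ignored at startup, so the lookup returned nil
	c.Update("modify kube-etcd") // panic: invalid memory address or nil pointer dereference
}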

@xiang90 @hongchaodeng

xiang90 (Collaborator) commented Mar 2, 2017

> trying to scale etcd I ran into the following problems:

How did you scale etcd? What requests did you send to the etcd operator?

Can you reproduce this issue? If so, what are the steps to reproduce?

janwillies (Author) commented

I've updated my comment to make it clearer.

In the current cluster state I can reproduce this every time. Let me set up a new cluster and see if I can reproduce it there as well.

janwillies (Author) commented

On a new cluster, I killed the operator a few times and scaled up and down more than once, but I can't reproduce it anymore. I'll leave it up to you whether to close this issue.

xiang90 (Collaborator) commented Mar 2, 2017

@janwillies OK. I think we might be hitting a race. I just wanted to make sure it does not happen all the time, which more or less confirms my guess. We will get a fix out for you soon.
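For illustration, here is a hedged sketch of the kind of guard that would avoid this race: before forwarding a watch event, check whether the cluster actually exists in the controller's map and drop the event otherwise. The names below (handleUpdate, clusters, ClusterSpec) are assumptions, not the operator's real identifiers; the actual fix is the one referenced in #852.

package controller

// Hypothetical stubs standing in for the operator's real types.
type ClusterSpec struct{ Size int }

type Cluster struct{}

func (c *Cluster) Update(spec *ClusterSpec) { /* reconcile toward the new spec */ }

type Controller struct {
	clusters map[string]*Cluster
}

// handleUpdate drops events for clusters the controller does not track,
// e.g. a cluster ignored at startup because its phase was "Failed", so
// Update is never called on a nil *Cluster.
func (ctrl *Controller) handleUpdate(name string, spec *ClusterSpec) {
	cl, ok := ctrl.clusters[name]
	if !ok || cl == nil {
		return
	}
	cl.Update(spec)
}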

hongchaodeng added a commit to hongchaodeng/etcd-operator that referenced this issue Mar 2, 2017