
etcd deployment fails with DCOS if framework found in Zookeeper #95

Open
Radek44 opened this issue Feb 22, 2016 · 7 comments

Comments

@Radek44

Radek44 commented Feb 22, 2016

I set up etcd on my cluster using the DCOS CLI a first time and it worked. I then uninstalled it. A couple of days later I decided to reinstall, but since then every installation has failed.
It seems the reason is that the framework is found in Zookeeper but fails to restore. Here is the failure trace I got through the stderr file in Mesos (I only replaced the IPs with x.x.x.x (agent) and y.y.y.y (Mesos master)):

+ /work/bin/etcd-mesos-scheduler -alsologtostderr=true -framework-name=etcd -cluster-size=3 -master=zk://master.mesos:2181/mesos -zk-framework-persist=zk://master.mesos:2181/etcd -v=1 -auto-reseed=true -reseed-timeout=240 -sandbox-disk-limit=4096 -sandbox-cpu-limit=1 -sandbox-mem-limit=2048 -admin-port=3356 -driver-port=3357 -artifact-port=3358 -framework-weburi=http://etcd.marathon.mesos:3356/stats
I0222 04:14:30.573426       7 app.go:218] Found stored framework ID in Zookeeper, attempting to re-use: b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003
I0222 04:14:30.575267       7 scheduler.go:209] found failover_timeout = 168h0m0s
I0222 04:14:30.575363       7 scheduler.go:323] Initializing mesos scheduler driver
I0222 04:14:30.575473       7 scheduler.go:792] Starting the scheduler driver...
I0222 04:14:30.575552       7 http_transporter.go:407] listening on x.x.x.x port 3357
I0222 04:14:30.575588       7 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@10.32.0.4:3357
I0222 04:14:30.575625       7 scheduler.go:821] starting master detector *zoo.MasterDetector: &{client:<nil> leaderNode: bootstrapLock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} bootstrapFunc:0x7991c0 ignoreInstalled:0 minDetectorCyclePeriod:1000000000 done:0xc2080548a0 cancel:0x7991b0}
I0222 04:14:30.575746       7 scheduler.go:999] Scheduler driver running.  Waiting to be stopped.
I0222 04:14:30.575776       7 scheduler.go:663] running instances: 0 desired: 3 offers: 0
I0222 04:14:30.575799       7 scheduler.go:671] PeriodicLaunchRequestor skipping due to Immutable scheduler state.
I0222 04:14:30.575811       7 scheduler.go:1033] Admin HTTP interface Listening on port 3356
I0222 04:14:30.607180       7 scheduler.go:374] New master master@y.y.y.y:5050 detected
I0222 04:14:30.607306       7 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
I0222 04:14:30.607466       7 scheduler.go:922] Reregistering with master: master@172.16.0.7:5050
I0222 04:14:30.607656       7 scheduler.go:881] will retry registration in 1.254807398s if necessary
I0222 04:14:30.610527       7 scheduler.go:769] Handling framework error event.
I0222 04:14:30.610636       7 scheduler.go:1081] Aborting framework [&FrameworkID{Value:*b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003,XXX_unrecognized:[],}]
I0222 04:14:30.610890       7 scheduler.go:1062] stopping messenger
I0222 04:14:30.610985       7 messenger.go:269] stopping messenger..
I0222 04:14:30.611076       7 http_transporter.go:476] stopping HTTP transport
I0222 04:14:30.611168       7 scheduler.go:1065] Stop() complete with status DRIVER_ABORTED error <nil>
I0222 04:14:30.611262       7 scheduler.go:1051] Sending error via withScheduler: Framework has been removed
I0222 04:14:30.611366       7 scheduler.go:298] stopping scheduler event queue..
I0222 04:14:30.611504       7 http_transporter.go:450] HTTP server stopped because of shutdown
I0222 04:14:30.611598       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611687       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611779       7 scheduler.go:250] finished processing scheduler events

Any suggestions on how to fix the deployment?

@jdef
Contributor

jdef commented Feb 22, 2016

Yep, we need better uninstall instructions for etcd on DCOS.

Go to <dcos-hostname>/exhibitor and view the node tree. You should see etcd as a child of the root. Delete it, then try to re-install.
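A command-line alternative to the Exhibitor UI, for anyone without browser access to the master: this is a sketch, assuming the ZooKeeper CLI (zkCli.sh) is reachable from a master node and that the framework was started with -zk-framework-persist=zk://master.mesos:2181/etcd, as in the trace above.

```
# Inspect the node tree; the stale framework state lives under /etcd:
zkCli.sh -server master.mesos:2181 ls /

# Recursively delete the persisted framework state, then re-install:
zkCli.sh -server master.mesos:2181 rmr /etcd
```

Deleting the /etcd znode clears the stored framework ID, so the next install registers as a fresh framework instead of trying to re-use the removed one.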

@jdef
Contributor

jdef commented Feb 22, 2016

#91

@Radek44
Author

Radek44 commented Feb 22, 2016

Brilliant. Thank you @jdef, this worked.

@Radek44
Author

Radek44 commented Feb 22, 2016

Quick addition: it looks like as soon as I try to scale out etcd using Marathon (going from the default 1 instance to the recommended 3), the deployment of the 3 instances fails for the same reason.

@jdef
Contributor

jdef commented Feb 23, 2016

@spacejam is this supported? I was under the impression that cluster size
should be determined at framework startup time, and only then.


@spacejam
Contributor

That's correct, @jdef. Marathon starts the etcd-mesos scheduler, not the etcd instances themselves (the instances are managed by the scheduler that Marathon, or another higher-order supervisor framework, starts). Marathon shows 1 instance running because there is only 1 etcd-mesos framework running with a particular configuration. The number of etcd instances should be determined at initialization time, when submitting the app definition to Marathon, for instance via the CLUSTER_SIZE env var in the provided example Marathon spec:

{
  "id": "etcd",
  "container": {
    "docker": {
      "forcePullImage": true,
      "image": "mesosphere/etcd-mesos:0.1.0-alpha-target-23-24-25"
    },
    "type": "DOCKER"
  },
  "cpus": 0.2,
  "env": {
    "FRAMEWORK_NAME": "etcd",
    "WEBURI": "http://etcd.marathon.mesos:$PORT0/stats",
    "MESOS_MASTER": "zk://master.mesos:2181/mesos",
    "ZK_PERSIST": "zk://master.mesos:2181/etcd",
    "AUTO_RESEED": "true",
    "RESEED_TIMEOUT": "240",
    "CLUSTER_SIZE": "3",
    "CPU_LIMIT": "1",
    "DISK_LIMIT": "4096",
    "MEM_LIMIT": "2048",
    "VERBOSITY": "1"
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 60,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 0,
      "path": "/healthz",
      "portIndex": 0,
      "protocol": "HTTP"
    }
  ],
  "instances": 1,
  "mem": 128.0,
  "ports": [
    0,
    1,
    2
  ]
}
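To make the distinction above concrete, here is a small standalone illustration (not part of etcd-mesos): the etcd cluster size comes from the CLUSTER_SIZE env var inside the app definition, while Marathon's "instances" field only counts copies of the scheduler itself, which is why scaling "instances" in Marathon does not scale etcd.

```python
import json

# A trimmed-down version of the app definition shown above.
app_definition = json.loads("""
{
  "id": "etcd",
  "env": {"FRAMEWORK_NAME": "etcd", "CLUSTER_SIZE": "3"},
  "instances": 1
}
""")

# Number of etcd servers the framework will launch (read at startup):
etcd_cluster_size = int(app_definition["env"]["CLUSTER_SIZE"])

# Number of etcd-mesos scheduler copies Marathon runs (should stay 1):
scheduler_instances = app_definition["instances"]

print(etcd_cluster_size)    # size of the etcd cluster
print(scheduler_instances)  # copies of the scheduler, not of etcd
```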

@spacejam
Contributor

Actually, since you're using DCOS, you can set the "cluster-size" configuration option for etcd to something other than 3, but 3 is the default and recommended unless you are willing to trade slower writes for faster reads and additional availability with a cluster of 5.
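The 3-vs-5 trade-off mentioned above follows from etcd's Raft consensus: a write must be acknowledged by a majority quorum, so a larger cluster tolerates more failed members but needs more acknowledgements per write. A quick sketch of the arithmetic:

```python
def quorum(cluster_size: int) -> int:
    """Majority of members needed for a write to commit."""
    return cluster_size // 2 + 1

def failure_tolerance(cluster_size: int) -> int:
    """How many members can fail while the cluster stays writable."""
    return cluster_size - quorum(cluster_size)

for size in (1, 3, 5):
    print(size, quorum(size), failure_tolerance(size))
# A 3-node cluster needs 2 acks and survives 1 failure;
# a 5-node cluster needs 3 acks but survives 2 failures.
```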
