
etcd deployment fails with DCOS if framework found in Zookeeper #95

Open
Radek44 opened this issue Feb 22, 2016 · 7 comments

Comments

@Radek44

Radek44 commented Feb 22, 2016

I set up etcd on my cluster using the DCOS CLI a first time and it worked. I then uninstalled it. A couple of days later I decided to reinstall, but since then every installation has failed.
It seems the reason is that the framework is found in Zookeeper but fails to restore. Here is the failure trace I got through the stderr file in Mesos (I only replaced the IPs with x.x.x.x (agent) and y.y.y.y (Mesos master)):

+ /work/bin/etcd-mesos-scheduler -alsologtostderr=true -framework-name=etcd -cluster-size=3 -master=zk://master.mesos:2181/mesos -zk-framework-persist=zk://master.mesos:2181/etcd -v=1 -auto-reseed=true -reseed-timeout=240 -sandbox-disk-limit=4096 -sandbox-cpu-limit=1 -sandbox-mem-limit=2048 -admin-port=3356 -driver-port=3357 -artifact-port=3358 -framework-weburi=http://etcd.marathon.mesos:3356/stats
I0222 04:14:30.573426       7 app.go:218] Found stored framework ID in Zookeeper, attempting to re-use: b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003
I0222 04:14:30.575267       7 scheduler.go:209] found failover_timeout = 168h0m0s
I0222 04:14:30.575363       7 scheduler.go:323] Initializing mesos scheduler driver
I0222 04:14:30.575473       7 scheduler.go:792] Starting the scheduler driver...
I0222 04:14:30.575552       7 http_transporter.go:407] listening on x.x.x.x port 3357
I0222 04:14:30.575588       7 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@10.32.0.4:3357
I0222 04:14:30.575625       7 scheduler.go:821] starting master detector *zoo.MasterDetector: &{client:<nil> leaderNode: bootstrapLock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} bootstrapFunc:0x7991c0 ignoreInstalled:0 minDetectorCyclePeriod:1000000000 done:0xc2080548a0 cancel:0x7991b0}
I0222 04:14:30.575746       7 scheduler.go:999] Scheduler driver running.  Waiting to be stopped.
I0222 04:14:30.575776       7 scheduler.go:663] running instances: 0 desired: 3 offers: 0
I0222 04:14:30.575799       7 scheduler.go:671] PeriodicLaunchRequestor skipping due to Immutable scheduler state.
I0222 04:14:30.575811       7 scheduler.go:1033] Admin HTTP interface Listening on port 3356
I0222 04:14:30.607180       7 scheduler.go:374] New master master@y.y.y.y:5050 detected
I0222 04:14:30.607306       7 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
I0222 04:14:30.607466       7 scheduler.go:922] Reregistering with master: master@172.16.0.7:5050
I0222 04:14:30.607656       7 scheduler.go:881] will retry registration in 1.254807398s if necessary
I0222 04:14:30.610527       7 scheduler.go:769] Handling framework error event.
I0222 04:14:30.610636       7 scheduler.go:1081] Aborting framework [&FrameworkID{Value:*b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003,XXX_unrecognized:[],}]
I0222 04:14:30.610890       7 scheduler.go:1062] stopping messenger
I0222 04:14:30.610985       7 messenger.go:269] stopping messenger..
I0222 04:14:30.611076       7 http_transporter.go:476] stopping HTTP transport
I0222 04:14:30.611168       7 scheduler.go:1065] Stop() complete with status DRIVER_ABORTED error <nil>
I0222 04:14:30.611262       7 scheduler.go:1051] Sending error via withScheduler: Framework has been removed
I0222 04:14:30.611366       7 scheduler.go:298] stopping scheduler event queue..
I0222 04:14:30.611504       7 http_transporter.go:450] HTTP server stopped because of shutdown
I0222 04:14:30.611598       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611687       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611779       7 scheduler.go:250] finished processing scheduler events

Any suggestions on how to fix the deployment?

@jdef
Contributor

jdef commented Feb 22, 2016

Yep, we need better uninstall instructions for etcd on DCOS.

Go to <dcos-hostname>/exhibitor and view the node tree. You should see etcd as a child of the root. Delete it, then try to re-install.
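A command-line alternative to the Exhibitor UI, for anyone without browser access to the master: this is a sketch, assuming the ZooKeeper CLI (zkCli.sh) is reachable from a master node and that the framework was started with -zk-framework-persist=zk://master.mesos:2181/etcd, as in the trace above.

```
# Inspect the node tree; the stale framework state lives under /etcd:
zkCli.sh -server master.mesos:2181 ls /

# Recursively delete the persisted framework state, then re-install:
zkCli.sh -server master.mesos:2181 rmr /etcd
```

Deleting the /etcd znode clears the stored framework ID, so the next install registers as a fresh framework instead of trying to re-use the removed one.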

@jdef
Contributor

jdef commented Feb 22, 2016

#91

@Radek44
Author

Radek44 commented Feb 22, 2016

Brilliant. Thank you @jdef, this worked.

@Radek44
Author

Radek44 commented Feb 22, 2016

Quick addition: it looks like as soon as I try to scale out etcd using Marathon (going from the default 1 instance to the recommended 3), the deployment of the 3 instances fails for the same reason.

@jdef
Contributor

jdef commented Feb 23, 2016

@spacejam is this supported? I was under the impression that cluster size
should be determined at framework startup time, and only then.


@spacejam
Contributor

That's correct, @jdef. Marathon starts the etcd-mesos scheduler, not the etcd instances themselves (the instances are managed by the scheduler that Marathon, or another higher-order supervisor framework, starts). Marathon shows 1 instance running because there is only 1 etcd-mesos framework running with a particular configuration. The number of etcd instances should be determined at initialization time, when submitting the app definition to Marathon, for instance via the CLUSTER_SIZE env var in the provided example Marathon spec:

{
  "id": "etcd",
  "container": {
    "docker": {
      "forcePullImage": true,
      "image": "mesosphere/etcd-mesos:0.1.0-alpha-target-23-24-25"
    },
    "type": "DOCKER"
  },
  "cpus": 0.2,
  "env": {
    "FRAMEWORK_NAME": "etcd",
    "WEBURI": "http://etcd.marathon.mesos:$PORT0/stats",
    "MESOS_MASTER": "zk://master.mesos:2181/mesos",
    "ZK_PERSIST": "zk://master.mesos:2181/etcd",
    "AUTO_RESEED": "true",
    "RESEED_TIMEOUT": "240",
    "CLUSTER_SIZE": "3",
    "CPU_LIMIT": "1",
    "DISK_LIMIT": "4096",
    "MEM_LIMIT": "2048",
    "VERBOSITY": "1"
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 60,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 0,
      "path": "/healthz",
      "portIndex": 0,
      "protocol": "HTTP"
    }
  ],
  "instances": 1,
  "mem": 128.0,
  "ports": [
    0,
    1,
    2
  ]
}
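To make the distinction above concrete, here is a small standalone illustration (not part of etcd-mesos): the etcd cluster size comes from the CLUSTER_SIZE env var inside the app definition, while Marathon's "instances" field only counts copies of the scheduler itself, which is why scaling "instances" in Marathon does not scale etcd.

```python
import json

# A trimmed-down version of the app definition shown above.
app_definition = json.loads("""
{
  "id": "etcd",
  "env": {"FRAMEWORK_NAME": "etcd", "CLUSTER_SIZE": "3"},
  "instances": 1
}
""")

# Number of etcd servers the framework will launch (read at startup):
etcd_cluster_size = int(app_definition["env"]["CLUSTER_SIZE"])

# Number of etcd-mesos scheduler copies Marathon runs (should stay 1):
scheduler_instances = app_definition["instances"]

print(etcd_cluster_size)    # size of the etcd cluster
print(scheduler_instances)  # copies of the scheduler, not of etcd
```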

@spacejam
Contributor

Actually, since you're using DCOS, you can set the "cluster-size" configuration option for etcd to something other than 3, but 3 is the default and recommended unless you are willing to trade slower writes for faster reads and additional availability with a cluster of 5.
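The 3-vs-5 trade-off mentioned above follows from etcd's Raft consensus: a write must be acknowledged by a majority quorum, so a larger cluster tolerates more failed members but needs more acknowledgements per write. A quick sketch of the arithmetic:

```python
def quorum(cluster_size: int) -> int:
    """Majority of members needed for a write to commit."""
    return cluster_size // 2 + 1

def failure_tolerance(cluster_size: int) -> int:
    """How many members can fail while the cluster stays writable."""
    return cluster_size - quorum(cluster_size)

for size in (1, 3, 5):
    print(size, quorum(size), failure_tolerance(size))
# A 3-node cluster needs 2 acks and survives 1 failure;
# a 5-node cluster needs 3 acks but survives 2 failures.
```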
