Using multiple CPUs with VirtualBox causes etcd thrashing #1997

Closed
josh-padnick opened this issue Dec 24, 2014 · 5 comments

@josh-padnick

I've been following the official documentation on running a CoreOS cluster with Vagrant and VirtualBox. With the default settings things work fine, but I've discovered that when I make the following change:

# $vb_cpus = 1
$vb_cpus = 4

I start to see the same etcd thrashing previously reported (e.g. #868). Interestingly, the pace of leader changes seems to approximately double as I double the value of $vb_cpus. So, I get no thrashing at all with $vb_cpus=1, some with $vb_cpus=2, twice as much with $vb_cpus=4 and four times as much with $vb_cpus=8.

Dec 24 17:21:28 core-01 etcd[964]: [etcd] Dec 24 17:21:28.802 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'leader' to 'follower'.
Dec 24 17:21:28 core-01 etcd[964]: [etcd] Dec 24 17:21:28.802 INFO      | 5757c02eb212428599af0215ec124cf7: term #1 started.
Dec 24 17:21:28 core-01 etcd[964]: [etcd] Dec 24 17:21:28.802 INFO      | 5757c02eb212428599af0215ec124cf7: leader changed from '5757c02eb212428599af0215ec124cf7' to ''.
Dec 24 17:21:32 core-01 etcd[964]: [etcd] Dec 24 17:21:32.051 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 292.050288ms
Dec 24 17:21:38 core-01 etcd[964]: [etcd] Dec 24 17:21:38.006 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 202.358606ms
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.513 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'follower' to 'candidate'.
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.514 INFO      | 5757c02eb212428599af0215ec124cf7: leader changed from '7b44b21f81e04214a71dc1664a6cf4b3' to ''.
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.516 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'candidate' to 'leader'.
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.516 INFO      | 5757c02eb212428599af0215ec124cf7: leader changed from '' to '5757c02eb212428599af0215ec124cf7'.
Dec 24 17:21:47 core-01 etcd[964]: [etcd] Dec 24 17:21:47.523 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'leader' to 'follower'.
Dec 24 17:21:47 core-01 etcd[964]: [etcd] Dec 24 17:21:47.523 INFO      | 5757c02eb212428599af0215ec124cf7: term #3 started.
Dec 24 17:21:47 core-01 etcd[964]: [etcd] Dec 24 17:21:47.524 INFO      | 5757c02eb212428599af0215ec124cf7: leader changed from '5757c02eb212428599af0215ec124cf7' to ''.
Dec 24 17:21:48 core-01 etcd[964]: [etcd] Dec 24 17:21:48.220 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 194.779302ms
Dec 24 17:21:48 core-01 etcd[964]: [etcd] Dec 24 17:21:48.978 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 202.787912ms
Dec 24 17:21:52 core-01 etcd[964]: [etcd] Dec 24 17:21:52.922 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 193.765179ms
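To put a number on the thrashing rather than eyeballing it, the journal output can be tallied with a short script. This is only a sketch: the sample lines are copied from the log above, while the regular expressions and the `summarize` helper are illustrative names of my own, not part of etcd.

```python
import re

# A few lines copied from the journal output above.
LOG = """\
Dec 24 17:21:28 core-01 etcd[964]: [etcd] Dec 24 17:21:28.802 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'leader' to 'follower'.
Dec 24 17:21:32 core-01 etcd[964]: [etcd] Dec 24 17:21:32.051 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 292.050288ms
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.513 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'follower' to 'candidate'.
Dec 24 17:21:39 core-01 etcd[964]: [etcd] Dec 24 17:21:39.516 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'candidate' to 'leader'.
Dec 24 17:21:47 core-01 etcd[964]: [etcd] Dec 24 17:21:47.523 INFO      | 5757c02eb212428599af0215ec124cf7: state changed from 'leader' to 'follower'.
Dec 24 17:21:48 core-01 etcd[964]: [etcd] Dec 24 17:21:48.220 INFO      | 5757c02eb212428599af0215ec124cf7: warning: heartbeat near election timeout: 194.779302ms
"""

# Patterns assumed from the message format shown in the log above.
STATE_CHANGE = re.compile(r"state changed from '(\w+)' to '(\w+)'")
NEAR_TIMEOUT = re.compile(r"heartbeat near election timeout: ([\d.]+)ms")


def summarize(log_text):
    """Return (state transitions, near-timeout heartbeat durations in ms)."""
    transitions = []
    warnings_ms = []
    for line in log_text.splitlines():
        match = STATE_CHANGE.search(line)
        if match:
            transitions.append((match.group(1), match.group(2)))
            continue
        match = NEAR_TIMEOUT.search(line)
        if match:
            warnings_ms.append(float(match.group(1)))
    return transitions, warnings_ms


transitions, warnings_ms = summarize(LOG)
lost_leadership = transitions.count(("leader", "follower"))
print("state transitions:", len(transitions))
print("times this node lost leadership:", lost_leadership)
print("near-timeout heartbeats (ms):", warnings_ms)
```

Running the same tally over equal-length log windows for each `$vb_cpus` value would give a rough way to compare the rate of leader changes.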

I want to use multiple CPUs because I was thinking about compiling my code within a container, but now I'm thinking I'm better off just sharing a volume with my VM and container and compiling on my host machine.

Is this a bug, or am I missing something? Here are some other stats on my environment:

core@core-01 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=494.5.0
uname -a
Darwin Joshs-MacBook-Pro.local 14.0.0 Darwin Kernel Version 14.0.0: Fri Sep 19 00:26:44 PDT 2014; root:xnu-2782.1.97~2/RELEASE_X86_64 x86_64
josh$ vagrant version
Installed Version: 1.7.1
Latest Version: 1.7.1

Running VirtualBox 4.3.20 r96996

@kelseyhightower
Contributor

@josh-padnick How many CPUs does your physical machine have? If I'm reading this right, you're attempting to use a total of 12 CPUs, with each machine having 4 CPUs.

A quick Google search leads me to believe this may be a VirtualBox bug:

@josh-padnick
Author

@kelseyhightower Thx for looking into this.

I'm on a 15-inch MacBook Pro with Retina display, which has a quad-core Intel Core i7. Since the i7 has hyper-threading, my understanding is that it appears as 8 logical CPUs to the OS, and indeed VirtualBox allows me to select up to 8 CPUs (Screenshot, Multi-Core + Hyper-Threading).

The line $vb_cpus = 4 comes right from https://github.com/coreos/coreos-vagrant/blob/master/Vagrantfile, which in turn is referenced as part of the CoreOS docs Running CoreOS on Vagrant.

This has the effect of setting each of the VMs to 4 CPUs (as you wrote), but VirtualBox manages the parallel processing, so I don't think it would be correct to think of this as a "total" of 12 CPUs.

Inspired by your link, I found a Reddit thread that suggests that VirtualBox has to wait until all requested CPUs (e.g. all 4 CPUs) are available before it will run a process. I haven't officially verified this, but there does seem to be some slowness and overhead associated with using multiple CPUs in VirtualBox.
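For reference, the ratio of requested virtual CPUs to host logical CPUs can be worked out as below. This is a rough sketch, assuming a 3-VM cluster (implied by the "12 CPUs" figure above) and the 8 logical CPUs on my host; the function name is made up for illustration, and the ratio alone does not prove that oversubscription is the cause.

```python
def vcpu_oversubscription(vm_count, vcpus_per_vm, host_logical_cpus):
    """Ratio of vCPUs requested across all VMs to logical CPUs on the host."""
    return (vm_count * vcpus_per_vm) / host_logical_cpus


VM_COUNT = 3            # assumption: three-node cluster
HOST_LOGICAL_CPUS = 8   # quad-core i7 with hyper-threading

for vb_cpus in (1, 2, 4, 8):
    ratio = vcpu_oversubscription(VM_COUNT, vb_cpus, HOST_LOGICAL_CPUS)
    print(f"$vb_cpus={vb_cpus}: {VM_COUNT * vb_cpus} vCPUs requested, ratio {ratio:.3f}")
```

With these numbers the host is asked for more vCPUs than it has logical CPUs once `$vb_cpus` reaches 4, although I also saw some thrashing at `$vb_cpus=2`, so the ratio is at best part of the picture.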

Based on what I've read, my recommendations are to:

  1. Clarify in the docs that multiple VirtualBox CPUs may be slower than a single one, and/or explicitly mention when it would make sense to use more than 1 CPU.
  2. Consider increasing etcd's timeout before it elects a new leader, to account for the slowness with multiple CPUs.

Note that I have not researched the "multiple CPUs on VirtualBox" question in depth.

@kelseyhightower
Contributor

@josh-padnick Thanks for the response. I like your suggestion for the doc improvements and I'll start working on those today.

@yichengq
Contributor

> Clarify in the docs that multiple VirtualBox CPUs may be slower than a single one and/or explicitly mention when it would make sense to use more than 1 CPU

This is not etcd-related, so I don't think we need to document it.

> Consider increasing etcd's timeout before it elects a new leader to account for the multiple CPUs slowness.

You can configure etcd's timeouts through '--heartbeat-interval' and '--election-timeout'.
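As a rough illustration, the two flags could be assembled like this before being passed to etcd. It is a sketch only: the helper name is hypothetical, the values 200 and 1000 are placeholders rather than recommendations, and it assumes both flags take milliseconds. The exact flag spelling may differ between etcd versions, so check the documentation for the version you run.

```python
def etcd_timeout_args(heartbeat_interval_ms, election_timeout_ms):
    """Build the two timeout flags named above (hypothetical helper).

    Assumes both values are in milliseconds. The election timeout must be
    longer than the heartbeat interval, otherwise followers would start an
    election before the next heartbeat could arrive.
    """
    if election_timeout_ms <= heartbeat_interval_ms:
        raise ValueError("election timeout must exceed heartbeat interval")
    return [
        f"--heartbeat-interval={heartbeat_interval_ms}",
        f"--election-timeout={election_timeout_ms}",
    ]


# Placeholder values, chosen only to show the shape of the command line.
args = etcd_timeout_args(200, 1000)
print(" ".join(["etcd"] + args))
```

The warnings in the log above report heartbeats taking roughly 190 to 290 ms, so an election timeout comfortably above that range is the sort of value one would experiment with.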

@yichengq
Contributor

Closing this due to low activity. Feel free to reopen it if you have more thoughts.
