
Use memory storage for etcd #845

Open
aojea opened this issue Sep 7, 2019 · 48 comments
Labels: kind/design, kind/feature, lifecycle/frozen, priority/backlog

Comments

@aojea
Contributor

aojea commented Sep 7, 2019

What would you like to be added:

Configure etcd storage in memory to improve performance.

Why is this needed:

etcd causes very high disk I/O, which can lead to performance issues, especially when several kind clusters run on the same system: you end up with many processes writing to disk, adding latency and affecting the other applications that use the same disk.

Since #779, the /var filesystem no longer lives on the container filesystem, which improved performance; however, the etcd storage is still on disk, as we can see in the pod manifest:

  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate

Ideally, /var/lib/etcd should be in memory, since these clusters are meant to be created and destroyed and the information doesn't need to be persistent.

I have doubts about the best approach:

  • Should this be modified in kind, creating a new tmpfs volume for etcd?
  • Can this be modified in kubeadm so we can mount the etcd-data in memory or in another location of the node that's in memory?
  • ...

** NOTES **

etcd accumulated I/O (iotop -a):

26206 be/4 root          0.00 B    192.00 K  0.00 %  1.04 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26196 be/4 root          0.00 B    224.00 K  0.00 %  0.98 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26288 be/4 root          0.00 B    216.00 K  0.00 %  0.94 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26249 be/4 root          0.00 B    180.00 K  0.00 %  0.88 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26266 be/4 root          0.00 B     52.00 K  0.00 %  0.47 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26187 be/4 root          0.00 B     52.00 K  0.00 %  0.42 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26267 be/4 root          0.00 B     48.00 K  0.00 %  0.37 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26192 be/4 root          0.00 B     60.00 K  0.00 %  0.36 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26263 be/4 root          0.00 B     52.00 K  0.00 %  0.31 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26261 be/4 root          0.00 B     64.00 K  0.00 %  0.28 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
19155 be/4 root          0.00 B      0.00 B  0.00 %  0.19 % [kworker/1:2]
26286 be/4 root          0.00 B     28.00 K  0.00 %  0.18 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26289 be/4 root          0.00 B     32.00 K  0.00 %  0.16 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
  578 be/4 root          0.00 B      2.00 M  0.00 %  0.16 % [btrfs-transacti]
26268 be/4 root          0.00 B     28.00 K  0.00 %  0.11 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
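
For reference, an accumulated view like the one above can be reproduced with something along these lines (the exact invocation isn't shown above; the grep only narrows the output to etcd):

# accumulate I/O counters (-a), only show processes doing I/O (-o),
# batch mode (-b), 30 one-second samples (-n 30)
sudo iotop -a -o -b -n 30 | grep -E 'TID|etcd'
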
@aojea aojea added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 7, 2019
@aojea
Contributor Author

aojea commented Sep 7, 2019

/cc @BenTheElder @neolit123

@neolit123
Member

Can this be modified in kubeadm so we can mount the etcd-data in memory or in another location of the node that's in memory?

kubeadm passes --data-dir=/var/lib/etcd to etcd and mounts this directory using hostPath.
we can just try:

        emptyDir:
          medium: Memory

but this means kubeadm init / join commands need to:

  1. use phases to skip / customize the "manifests" phase
    or
  2. deploy etcd, patch manifest, restart static pod

etcd causes very high disk I/O, which can lead to performance issues, especially when several kind clusters run on the same system: you end up with many processes writing to disk, adding latency and affecting the other applications that use the same disk.

k/k master just moved to etcd 3.3.15, while 1.15 uses an older version.
is this a regression? an IDLE cluster should not have high disk i/o.

if this disk i/o suddenly became a problem, it should be tracked in a k/k issue.

@BenTheElder
Member

Etcd is going to be writing all the constantly updated objects, no? (Eg node status)

It would be trivial to test kind with memory backed etcd by adjusting node creation, but I don't think you'd ever run a real cluster not on disk... 🤔

@aojea
Contributor Author

aojea commented Sep 7, 2019

Etcd is going to be writing all the constantly updated objects, no? (Eg node status)

yeah, data needs to persist to disk to provide consistency

It would be trivial to test kind with memory backed etcd by adjusting node creation, but I don't think you'd ever run a real cluster not on disk... 🤔

Absolutely, real clusters must use disks; this is only meant to be used for testing. My rationale is that these k8s clusters are ephemeral, so the etcd clusters don't need to "persist" data on disk.

Can this be patched with the kind config? It would be enough to pass a different folder than --data-dir=/var/lib/etcd.

@BenTheElder
Member

You can test this more or less with no changes by making a tmpfs on the host and configuring it to mount there on a control plane.

You could also edit the kind control plane creation process to put a tmpfs here on the node

We should experiment, but I think we do eventually want durable etcd for certain classes of testing..

@BenTheElder
Member

Also worth pointing out:

  • our CI is backed by SSD
  • I'm not aware of any other cluster implementation not backing etcd with disk, including eg hack/local-up-cluster

@aojea
Contributor Author

aojea commented Sep 7, 2019

yeah, for the k8s CI it's not a big problem, but for users that run kind locally, it is. It took me a while to understand what was slowing down my system until I found that my kind clusters were causing big latency on one of my disks.
I just want to test and document the differences :)

@aojea
Contributor Author

aojea commented Sep 7, 2019

ok, here is how to run etcd using memory storage, for reference:

  1. Create the memory storage:
sudo mkdir /tmp/etcd
sudo mount -t tmpfs tmpfs /tmp/etcd
  2. Mount it on the control-plane nodes:
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - containerPath: /var/lib/etcd
    hostPath: /tmp/etcd
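
  3. (Optional) Verify the etcd data dir really is backed by tmpfs once the cluster is up; the container name below is the default for a single-node cluster and is only an assumption:
docker exec kind-control-plane findmnt /var/lib/etcd
# FSTYPE should report tmpfs, since a bind mount keeps the filesystem type of its source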

@aojea aojea closed this as completed Sep 7, 2019
@aojea
Contributor Author

aojea commented Oct 4, 2019

/reopen per conversation in slack https://kubernetes.slack.com/archives/CEKK1KTN2/p1570202642295000?thread_ts=1570196798.288800&cid=CEKK1KTN2

I'd like to find a way to make this easier to configure, mainly for people that want to use kind on their laptops and not in CI; etcd constantly writing to disk is not adding any benefit in this particular scenario.

@aojea aojea reopened this Oct 4, 2019
@aojea
Contributor Author

aojea commented Oct 7, 2019

You could also edit the kind control plane creation process to put a tmpfs here on the node

I think this will work

We should experiment, but I think we do eventually want durable etcd for certain classes of testing..

I was thinking more about this, and I can't see the "durability" difference between using a folder inside the container and using a tmpfs volume for the etcd data dir; the data will be available as long as the container is alive, no?

However, etcd writing to a tmpfs volume will be a big performance improvement, at the cost of less available memory, of course.

home/aojeagarcia/docker/volumes/5d2d2cab7dcb7c93b9a8a5f8591462caf4fbca5c332e663aa4628702b3d2dc50/_data/lib/etcd/member # du -sh *
1.5M    snap
245M    wal
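
If you go the tmpfs route, a rough way to keep an eye on how much RAM the data dir actually consumes on the host (reusing the /tmp/etcd mount from the earlier example):

df -h /tmp/etcd
# tmpfs usage is backed by RAM (and swap, if any), so wal/ growth like the
# 245M above shows up as memory pressure on the host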

@neolit123
Member

However, etcd writing to a tmpfs volume will be a big performance improvement, at the cost of less available memory, of course.

i'd be interested in whether this will prevent me from testing 3 CP setups with kind on my machine.
it doesn't have the RAM for 4 CPs :)

@BenTheElder
Member

I was thinking more about this, and I can't see the "durability" difference between using a folder inside the container and using a tmpfs volume for the etcd data dir; the data will be available as long as the container is alive, no?

It's NOT a folder inside the container, it's on a volume.

When we fix kind to survive host reboots (and we will) then this will break it again.

It also will consume more RAM of course.

@aojea
Contributor Author

aojea commented Oct 7, 2019

It's NOT a folder inside the container, it's on a volume.

https://github.com/kubernetes-sigs/kind/blob/master/pkg/internal/cluster/providers/docker/provision.go#L164-L169

I see it now 🤦‍♂️

@aojea
Contributor Author

aojea commented Oct 15, 2019

can this be causing timeouts in the CI with slow disks?

@BenTheElder
Member

#928 (comment)

^^ possibly for istio, doesn't look like Kubernetes CI is seeing timeouts at this point. That's not the pattern with the broken pipe.

Even for istio, I doubt it's "because they aren't doing this", but it could be "because they are otherwise using too many IOPS for the allocated disks". IIRC they are also on GCP PD-SSD, which is quite fast.

@BenTheElder BenTheElder added kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Nov 6, 2019
@BenTheElder
Member

for CI I think the better pattern I want to try is to use a pool of PDs from some storage class to replace the emptyDir.

I've been mulling how we could do this and persist some of the images in a clean and sane way, but imo this is well out of scope for the kind project.

@aojea
Contributor Author

aojea commented Nov 6, 2019

for CI I think the better pattern I want to try is to use a pool of PDs from some storage class to replace the emptyDir.

I've been mulling how we could do this and persist some of the images in a clean and sane way, but imo this is well out of scope for the kind project.

I think this is only an issue for people using kind on their laptops or workstations; totally agree with you on the CI use case.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 5, 2020
@BenTheElder
Member

did we wind up testing this in CI?

@aojea
Contributor Author

aojea commented May 29, 2020

The bootstrapping process with kubeadm is suspiciously long.

If you are not afraid of weakening security, and if it's possible in kubeadm (I really don't know), avoid the certificate generation ... maybe it's possible to include some well-known certificate.

@warmchang

There is an unsafe --unsafe-no-fsync flag added in etcd that disables fsync.

FYI: etcd-io/etcd#11946
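
If/when the kind node images ship an etcd that has the flag, a sketch of how it could be passed through kubeadm's etcd extraArgs (untested, file name illustrative; the flag trades crash safety for speed, so only for throwaway clusters):

cat <<EOF > kind-no-fsync.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      extraArgs:
        unsafe-no-fsync: "true"
EOF
kind create cluster --config kind-no-fsync.yaml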

@BenTheElder
Member

Yeah, we're very interested in that once it's available in kubeadm's etcd.

ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 15, 2020
Currently we encounter bad performance of KIND
clusters on DinD setups; we get 'etcdserver: timeout' errors
that cause jobs to fail often.

In such cases it is recommended [1] to use in-memory etcd.
Running etcd in memory should improve performance and
will make the sriov provider more stable.

[1] kubernetes-sigs/kind#845

Signed-off-by: Or Mergi <ormergi@redhat.com>
ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 16, 2020
@BenTheElder
Member

Circling back because this came up again today: I experimented with tmpfs + the unsafe no-fsync flag late last year and didn't see measurable improvements on my hardware (a couple of different dev machines). YMMV; this still doesn't seem to be a clear win even when persistence is not interesting, it depends on the usage and hardware.
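
If you want to check on your own hardware, a crude comparison is simply timing cluster bring-up with and without the tmpfs config from earlier in this thread (file name illustrative):

time kind create cluster --name etcd-on-disk
time kind create cluster --name etcd-in-memory --config kind-etcd-tmpfs.yaml
# clean up afterwards
kind delete cluster --name etcd-on-disk
kind delete cluster --name etcd-in-memory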

@aojea
Contributor Author

aojea commented Jan 20, 2021

for CIs like github actions there is a measurable difference when running the e2e test suite :)

@dprotaso

for CIs like github actions there is a measurable difference when running the e2e test suite :)

Yeah - it's just another potential failure mode that would be nice to avoid

@cnfatal

cnfatal commented Mar 8, 2022

The config file below works well to run etcd in memory:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd

The /tmp and /run dirs in the kind node are mounted on a tmpfs.

on podman:

"--tmpfs", "/tmp", // various things depend on working /tmp
"--tmpfs", "/run", // systemd wants a writable /run

on docker:

"--tmpfs", "/tmp", // various things depend on working /tmp
"--tmpfs", "/run", // systemd wants a writable /run

@aojea
Contributor Author

aojea commented Sep 28, 2022

as pointed out by Ben, we are going to have a performance hit because of etcd:

All single node v3.x clusters are affected. Fix is expected to come with a 4-10% performance degradation, making single node cluster performance more in line with multi-node clusters. No performance change is expected for multi-node clusters.

kubernetes/kubernetes#112690

@BenTheElder
Member

BenTheElder commented Sep 28, 2022

This should work for all currently supported Kubernetes versions and is slightly terser:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd
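
Usage is the same as any other kind config, e.g. (file name illustrative):

kind create cluster --config kind-etcd-in-memory.yaml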

@BenTheElder
Member

Maybe let's make a page to cover performance @aojea?

We have other related commonly discovered issues that are only in "known issues" currently. We could leave stub entries but move performance considerations to a new docs page that covers this technique + inotify limits etc.

I think the config to enable this is small enough to just document and it's too breaking to e.g. enable by default.

We can also suggest other tunable flags and host configs, some of which kind shouldn't touch itself.
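
For example, the inotify part of such a page would mostly boil down to a couple of host sysctls (the values below are common suggestions, not kind requirements):

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
# add the same keys under /etc/sysctl.d/ to persist across reboots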

@aojea
Contributor Author

aojea commented Sep 29, 2022

agree, these are recurrent questions, better to aggregate this information
