etcdserver: gracefully handle ENOSPC #4744

Closed
heyitsanthony opened this issue Mar 10, 2016 · 8 comments

@heyitsanthony
Contributor

I started etcd V3 on a small tmpfs and issued puts until all space was exhausted. etcd promptly died.

On restart, I get:

13:37:49 etcd1 | Starting etcd1 on port 5000
13:37:49 etcd1 | 2016-03-10 13:37:49.701673 I | etcdmain: etcd Version: 2.3.0-alpha.1+git
13:37:49 etcd1 | 2016-03-10 13:37:49.701730 I | etcdmain: Git SHA: 939af03
13:37:49 etcd1 | 2016-03-10 13:37:49.701735 I | etcdmain: Go Version: go1.5.3
13:37:49 etcd1 | 2016-03-10 13:37:49.701743 I | etcdmain: Go OS/Arch: linux/386
13:37:49 etcd1 | 2016-03-10 13:37:49.701749 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
13:37:49 etcd1 | 2016-03-10 13:37:49.701755 W | etcdmain: no data-dir provided, using default data-dir ./infra1.etcd
13:37:49 etcd1 | 2016-03-10 13:37:49.701780 N | etcdmain: the server is already initialized as member before, starting as etcd member...
13:37:49 etcd1 | 2016-03-10 13:37:49.701826 I | etcdmain: listening for peers on http://127.0.0.1:12380
13:37:49 etcd1 | 2016-03-10 13:37:49.701844 I | etcdmain: listening for client requests on http://127.0.0.1:12379
13:37:49 etcd1 | 2016-03-10 13:37:49.701865 I | etcdmain: listening for client rpc on 127.0.0.1:2378
13:37:49 etcd1 | 2016-03-10 13:37:49.701993 W | snap: skipped unexpected non snapshot file db
13:37:49 etcd1 | 2016-03-10 13:37:49.702002 I | etcdserver: name = infra1
13:37:49 etcd1 | 2016-03-10 13:37:49.702007 I | etcdserver: data dir = infra1.etcd
13:37:49 etcd1 | 2016-03-10 13:37:49.702011 I | etcdserver: member dir = infra1.etcd/member
13:37:49 etcd1 | 2016-03-10 13:37:49.702016 I | etcdserver: heartbeat = 100ms
13:37:49 etcd1 | 2016-03-10 13:37:49.702020 I | etcdserver: election = 1000ms
13:37:49 etcd1 | 2016-03-10 13:37:49.702024 I | etcdserver: snapshot count = 10000
13:37:49 etcd1 | 2016-03-10 13:37:49.702031 I | etcdserver: advertise client URLs = http://127.0.0.1:12379
13:37:50 etcd1 | 2016-03-10 13:37:50.953741 I | etcdserver: restarting member 8211f1d0f64f3269 in cluster 7230e3513973170f at commit index 491
13:37:50 etcd1 | 2016-03-10 13:37:50.953829 I | raft: 8211f1d0f64f3269 became follower at term 2
13:37:50 etcd1 | 2016-03-10 13:37:50.953843 I | raft: newRaft 8211f1d0f64f3269 [peers: [], term: 2, commit: 491, applied: 0, lastindex: 491, lastterm: 2]
13:37:51 etcd1 | 2016-03-10 13:37:51.080931 I | etcdserver: starting server... [version: 2.3.0-alpha.1+git, cluster version: to_be_decided]
13:37:51 etcd1 | 2016-03-10 13:37:51.081159 I | etcdhttp: pprof is enabled under /debug/pprof
13:37:51 etcd1 | 2016-03-10 13:37:51.081247 N | etcdserver: added local member 8211f1d0f64f3269 [http://127.0.0.1:12380] to cluster 7230e3513973170f
13:37:51 etcd1 | 2016-03-10 13:37:51.081325 N | etcdserver: set the initial cluster version to 2.3
13:37:51 etcd1 | 2016-03-10 13:37:51.294675 I | storage: cannot commit tx (write infra1.etcd/member/snap/db: no space left on device)
13:37:51 etcd1 | Terminating etcd1

Possible (partial) solution: introduce a tiny quota layer to storage or backend that checks the db size before submitting a put.
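
A minimal sketch of what such a quota check might look like, for illustration only. This is not etcd's code: `quota`, `sizer`, `fakeBackend`, `put`, and `ErrNoSpace` are hypothetical names, and the per-put cost (key length plus value length) is an assumed simplification of the real on-disk cost.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoSpace is a hypothetical error returned when a put would exceed the quota.
var ErrNoSpace = errors.New("quota: database space exceeded")

// sizer is a hypothetical view of the backend that only reports the db size in bytes.
type sizer interface {
	Size() int64
}

// quota rejects writes that would push the backend past maxBytes.
type quota struct {
	b        sizer
	maxBytes int64
}

// Available reports whether a put of key and val fits under the quota.
// Assumption: cost is estimated as len(key)+len(val); real storage overhead is higher.
func (q *quota) Available(key, val []byte) bool {
	return q.b.Size()+int64(len(key)+len(val)) <= q.maxBytes
}

// fakeBackend is an in-memory stand-in for the storage backend.
type fakeBackend struct {
	kv   map[string][]byte
	size int64
}

func (f *fakeBackend) Size() int64 { return f.size }

// put checks the quota before submitting the write to the backend.
// For simplicity, overwriting an existing key is charged like a new key.
func put(q *quota, f *fakeBackend, key, val []byte) error {
	if !q.Available(key, val) {
		return ErrNoSpace
	}
	f.kv[string(key)] = val
	f.size += int64(len(key) + len(val))
	return nil
}

func main() {
	f := &fakeBackend{kv: map[string][]byte{}}
	q := &quota{b: f, maxBytes: 16}
	fmt.Println(put(q, f, []byte("a"), []byte("12345678"))) // fits under the quota
	fmt.Println(put(q, f, []byte("b"), []byte("12345678"))) // would exceed it, so it is rejected
}
```

The point of checking before the write is that the rejection happens while the db file still has headroom, so the failure surfaces as an error to the client rather than as a failed commit.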

@xiang90
Contributor

xiang90 commented Mar 10, 2016

Duplicate of #2393?

@heyitsanthony
Contributor Author

Similar but not quite the same. This one covers both limiting storage space utilization and being able to start in case of ENOSPC.

@xiang90
Contributor

xiang90 commented Mar 10, 2016

@heyitsanthony

> being able to start in case of ENOSPC.

I do not quite follow this. How? By cleaning up snaps/WALs more aggressively?

@heyitsanthony
Contributor Author

@xiang90 My thinking was some sort of recovery/maintenance mode. It could reject puts but still serve gets, compactions, defrags, and deletes.
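
As a rough illustration of that idea, the sketch below gates operations by kind when a maintenance flag is set. Everything here is hypothetical (`server`, `opKind`, `admit`, `ErrMaintenance`); it only shows the admission rule described above, not how etcd would detect ENOSPC or wire this into its request path.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMaintenance is a hypothetical error for writes refused in maintenance mode.
var ErrMaintenance = errors.New("maintenance mode: puts are rejected")

// opKind enumerates the operation types named in the comment above.
type opKind int

const (
	opGet opKind = iota
	opPut
	opDelete
	opCompact
	opDefrag
)

// server is a hypothetical stand-in holding only the maintenance flag.
type server struct {
	maintenance bool
}

// admit rejects puts while in maintenance mode and lets every other
// operation (gets, deletes, compactions, defrags) through.
func (s *server) admit(op opKind) error {
	if s.maintenance && op == opPut {
		return ErrMaintenance
	}
	return nil
}

func main() {
	s := &server{maintenance: true}
	fmt.Println(s.admit(opPut))    // rejected
	fmt.Println(s.admit(opDelete)) // allowed, so space can be reclaimed
}
```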

@xiang90
Contributor

xiang90 commented Mar 10, 2016

@heyitsanthony Will this node continue participating in raft, or will it just ignore all raft requests? How will this affect the raft layer?

@heyitsanthony
Contributor Author

I don't think the node would be capable of participating in raft. For the multi-node case, the ENOSPC node could throw away its data and reload via raft snapshot once some keys have been deleted (but maybe this should require user intervention since it's sort of risky). For the single-node case, the deletes could go through as normal.

@gyuho
Contributor

gyuho commented Dec 12, 2017

Related #8935 (comment).

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020