-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error instead of panic when disk is full or we cause corruption #10588
Comments
(Note: I'm sending a patch to switch from Fatal to Panic, but the same point applies.) |
purpleidea
added a commit
to purpleidea/etcd
that referenced
this issue
Mar 27, 2019
When using the embed package to embed etcd, sometimes the storage prefix being used might be full. In this case, this code path triggers, causing an: `etcdserver: create wal error: no space left on device` error, which causes a fatal. A fatal differs from a panic in that it also calls os.Exit(1). In this situation, the calling program that embeds the etcd server will be abruptly killed, which prevents it from cleaning up safely, and giving a proper error message. Depending on what the calling program is, this can cause corruption and data loss. This patch switches the fatal to a panic. Ideally this would be a regular error which would get propagated upwards to the StartEtcd command, but in the meantime at least this can be caught with recover(). This fixes the most common fatal that I've experienced, but there are surely more that need looking into. If possible, the errors should be threaded down into the code path so that embedding etcd can be more robust. Fixes: etcd-io#10588
Patch available in #10591 |
I'd like to backport this to v3.3.13 which will probably be the next minor release. Can anyone give me guidelines on where to send such a patch please? |
purpleidea
added a commit
to purpleidea/etcd
that referenced
this issue
Mar 29, 2019
When using the embed package to embed etcd, sometimes the storage prefix being used might be full. In this case, this code path triggers, causing an: `etcdserver: create wal error: no space left on device` error, which causes a fatal. A fatal differs from a panic in that it also calls os.Exit(1). In this situation, the calling program that embeds the etcd server will be abruptly killed, which prevents it from cleaning up safely, and giving a proper error message. Depending on what the calling program is, this can cause corruption and data loss. This patch switches the fatal to a panic. Ideally this would be a regular error which would get propagated upwards to the StartEtcd command, but in the meantime at least this can be caught with recover(). This fixes the most common fatal that I've experienced, but there are surely more that need looking into. If possible, the errors should be threaded down into the code path so that embedding etcd can be more robust. Fixes: etcd-io#10588 This is a cherry-picked version of upstream: 368f70a
jingyih
pushed a commit
to jingyih/etcd
that referenced
this issue
May 10, 2019
When using the embed package to embed etcd, sometimes the storage prefix being used might be full. In this case, this code path triggers, causing an: `etcdserver: create wal error: no space left on device` error, which causes a fatal. A fatal differs from a panic in that it also calls os.Exit(1). In this situation, the calling program that embeds the etcd server will be abruptly killed, which prevents it from cleaning up safely, and giving a proper error message. Depending on what the calling program is, this can cause corruption and data loss. This patch switches the fatal to a panic. Ideally this would be a regular error which would get propagated upwards to the StartEtcd command, but in the meantime at least this can be caught with recover(). This fixes the most common fatal that I've experienced, but there are surely more that need looking into. If possible, the errors should be threaded down into the code path so that embedding etcd can be more robust. Fixes: etcd-io#10588
MartinWeindel
pushed a commit
to MartinWeindel/etcd
that referenced
this issue
Jun 12, 2019
When using the embed package to embed etcd, sometimes the storage prefix being used might be full. In this case, this code path triggers, causing an: `etcdserver: create wal error: no space left on device` error, which causes a fatal. A fatal differs from a panic in that it also calls os.Exit(1). In this situation, the calling program that embeds the etcd server will be abruptly killed, which prevents it from cleaning up safely, and giving a proper error message. Depending on what the calling program is, this can cause corruption and data loss. This patch switches the fatal to a panic. Ideally this would be a regular error which would get propagated upwards to the StartEtcd command, but in the meantime at least this can be caught with recover(). This fixes the most common fatal that I've experienced, but there are surely more that need looking into. If possible, the errors should be threaded down into the code path so that embedding etcd can be more robust. Fixes: etcd-io#10588
hexfusion
pushed a commit
to openshift/etcd
that referenced
this issue
Sep 26, 2019
When using the embed package to embed etcd, sometimes the storage prefix being used might be full. In this case, this code path triggers, causing an: `etcdserver: create wal error: no space left on device` error, which causes a fatal. A fatal differs from a panic in that it also calls os.Exit(1). In this situation, the calling program that embeds the etcd server will be abruptly killed, which prevents it from cleaning up safely, and giving a proper error message. Depending on what the calling program is, this can cause corruption and data loss. This patch switches the fatal to a panic. Ideally this would be a regular error which would get propagated upwards to the StartEtcd command, but in the meantime at least this can be caught with recover(). This fixes the most common fatal that I've experienced, but there are surely more that need looking into. If possible, the errors should be threaded down into the code path so that embedding etcd can be more robust. Fixes: etcd-io#10588 This is a cherry-picked version of upstream: 368f70a
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
If the storage prefix where the etcd server runs is full, it can panic with:
This is very bad because anyone using
embed
can have their program killed, and you can cause corruption and data loss instead of allowing a clean shutdown.This happens here:
etcd/etcdserver/raft.go
Line 432 in 9c2b88d
A simple reproducer is here (just fill your /tmp disk first):
This makes using the
embed
package more difficult, since we'd have to catch the panic. Instead this should causeembed.StartEtcd
to error!I'm sure there are other places where the code could panic. We should never panic, and instead propagate the errors to the top-level where they are returned and/or logged, to have an orderly shutdown without data corruption.
Thanks!
The text was updated successfully, but these errors were encountered: