Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error instead of panic when disk is full or we cause corruption #10588

Closed
purpleidea opened this issue Mar 27, 2019 · 3 comments · Fixed by #10591
Closed

Error instead of panic when disk is full or we cause corruption #10588

purpleidea opened this issue Mar 27, 2019 · 3 comments · Fixed by #10591

Comments

@purpleidea
Copy link
Contributor

If the storage prefix where the etcd server runs is full, it can panic with:

2019-03-26 21:25:16.972621 C | etcdserver: create wal error: no space left on device

This is very bad because anyone using embed can have their program killed, and you can cause corruption and data loss instead of allowing a clean shutdown.

This happens here:

plog.Fatalf("create wal error: %v", err)

A simple reproducer is here (just fill your /tmp disk first):

package main

import (
	"fmt"
	"log"
	"os"
	"time"

	//etcd "github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/embed"
	etcdtypes "github.com/coreos/etcd/pkg/types"
)

func server() error {

	dataDir := "/tmp/etcdserver1"
	if err := os.MkdirAll(dataDir, 0770); err != nil {
		return fmt.Errorf("couldn't mkdir: %s; err: %+v", dataDir, err)
	}

	memberName := "h1"
	aPUrls, _ := etcdtypes.NewURLs([]string{"http://localhost:2380"})
	aCUrls, _ := etcdtypes.NewURLs([]string{"http://localhost:2379"})

	initialPeerURLsMap := make(etcdtypes.URLsMap)
	initialPeerURLsMap[memberName] = aPUrls

	// embed etcd
	cfg := embed.NewConfig()
	cfg.Name = memberName // hostname
	cfg.Dir = dataDir
	cfg.LCUrls = aCUrls
	cfg.LPUrls = aPUrls
	cfg.ACUrls = aCUrls
	cfg.APUrls = aPUrls
	cfg.StrictReconfigCheck = false // XXX: workaround https://github.com/coreos/etcd/issues/6305

	cfg.InitialCluster = initialPeerURLsMap.String() // including myself!
	cfg.ClusterState = embed.ClusterStateFlagNew
	//cfg.ClusterState = embed.ClusterStateFlagExisting

	if err := cfg.Validate(); err != nil {
		return fmt.Errorf("server config is invalid: %+v", err)
	}

	log.Printf("server: starting...")
	server, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Printf("start error")
		return err
	}

	log.Printf("start finished")

	defer server.Close()       // this blocks until server has stopped
	defer server.Server.Stop() // trigger a shutdown

	select {
	case <-server.Server.ReadyNotify(): // we hang here if things are bad
		log.Printf("server: ready") // it didn't hang!

	case <-server.Server.StopNotify(): // it's going down now...
		log.Printf("server: stopped: %v", err)
		return err

	case <-time.After(time.Duration(10) * time.Second):
		log.Printf("server: timeout")
		return err
	}

	serverID := uint64(server.Server.ID()) // store member id for internal use
	log.Printf("server: id: %d", serverID)

	for {
		select {
		case err, ok := <-server.Err():
			if !ok { // server shut down
				log.Printf("server: shutdown")
				return err
			}
		}
	}

	return nil
}

func main() {
	log.Printf("hello")

	err := server()
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("goodbye")
}

This makes using the embed package more difficult, since we'd have to catch the panic. Instead this should cause embed.StartEtcd to error!

I'm sure there are other places where the code could panic. We should never panic, and instead propagate the errors to the top-level where they are returned and/or logged, to have an orderly shutdown without data corruption.

Thanks!

@purpleidea
Copy link
Contributor Author

(Note: I'm sending a patch to switch from Fatal to Panic, but the same point applies.)

purpleidea added a commit to purpleidea/etcd that referenced this issue Mar 27, 2019
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: etcd-io#10588
@purpleidea
Copy link
Contributor Author

Patch available in #10591

@purpleidea
Copy link
Contributor Author

I'd like to backport this to v3.3.13 which will probably be the next minor release. Can anyone give me guidelines on where to send such a patch please?

purpleidea added a commit to purpleidea/etcd that referenced this issue Mar 29, 2019
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: etcd-io#10588

This is a cherry-picked version of upstream: 368f70a
jingyih pushed a commit to jingyih/etcd that referenced this issue May 10, 2019
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: etcd-io#10588
MartinWeindel pushed a commit to MartinWeindel/etcd that referenced this issue Jun 12, 2019
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: etcd-io#10588
hexfusion pushed a commit to openshift/etcd that referenced this issue Sep 26, 2019
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: etcd-io#10588

This is a cherry-picked version of upstream: 368f70a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

1 participant