Possible WAL corruption with loss of un-fsynced writes:`max entry size limit exceeded` #14098

aphyr · 2022-06-07T19:52:47Z

What happened?

The lazyfs filesystem lets us simulate the effects of a power failure by losing writes which were not explicitly fsync'ed to disk. When we run etcd 3.5.3 on lazyfs, killing etcd and then losing un-fsynced writes can reliably put etcd into an unbootable state. Every time we try to start the node, it complains:

{"level":"fatal","ts":"2022-06-07T15:08:26.505-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

We're still sanding bugs off of lazyfs, so it's possible this might be an issue in the filesystem itself. That said, this might also point to a problem with how etcd writes WAL files, so I'd like to check and see if this looks plausible to y'all. I know there have been some issues with data file corruption on process crash in the past; this approach might help find more bugs like that!

I've attached a full test run from Jepsen, which includes tarballs of the data directories for each node. Take a look at n1/ as an example: 20220607T150758.000-0400.zip.

This happens both with and without --experimental-initial-corrupt-check.

What did you expect to happen?

I expect that etcd ought to start up without crashing, even if we lose un-fsynced writes.
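
For readers unfamiliar with the distinction, here is a minimal Go sketch of what an "fsynced" write looks like. It is only an illustration: `appendDurably` and `roundTrip` are hypothetical names, not etcd functions, and the sketch ignores directory fsync, which a newly created file may also need.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// appendDurably appends data to path and calls Sync (fsync) before returning.
// Bytes written without the Sync call may still sit in volatile caches and are
// the kind of data a lazyfs-style fault is allowed to drop.
func appendDurably(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}

// roundTrip appends two records to a file in a fresh temp directory and
// returns the file content.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "fsync-demo-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "wal.log")
	for _, rec := range []string{"entry-1\n", "entry-2\n"} {
		if err := appendDurably(path, []byte(rec)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Print(got)
}
```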

How can we reproduce it (as minimally and precisely as possible)?

Check out https://github.com/jepsen-io/etcd at adfc820826a947625c94d836b4017b4eaac7064d, and run:

lein run test -w append --concurrency 2n --time-limit 300 --rate 500 --lazyfs --nemesis kill

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started |   n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started |   n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started |   n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started |   n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started |   n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 |   3.5.3 |  4.2 MB |      true |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 |   3.5.3 |  4.2 MB |     false |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

{"level":"fatal","ts":"2022-06-07T15:13:18.464-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}


ahrtr · 2022-06-12T04:53:42Z

Is this issue the same as #14102?

aphyr · 2022-06-13T12:11:58Z

Possibly! However, since they're different kinds of faults and yield different etcd crashes, I suspect they're different issues. I also wouldn't be surprised if #14102 turns out to encompass a half-dozen different issues, just based on the number of distinct ways I've seen it fail so far.

ahrtr · 2022-06-14T07:03:59Z

This is also related to #14025.

ahrtr · 2022-06-17T08:18:10Z

One WAL entry's size is 13563782407139376 bytes; see the log below. That is about 13,563 TB, which is obviously incorrect.

2022/06/17 16:12:16 Failed reading WAL: wal: max entry size limit exceeded, recBytes: 13563782407139376, fileSize(64000000) - offset(196120) - padBytes(0) = entryLimit(63803880)

There are two possible reasons:

1. It's a filesystem bug, and the filesystem persisted a wrong value for the WAL entry size;
2. It's caused by a bit flip somehow.
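
The bound printed in the log line above can be reproduced with a short Go sketch. It assumes the reader rejects any record whose declared size exceeds the bytes remaining in the file after the last valid offset; the names `entryLimit` and `checkRecordSize` are hypothetical and not taken from etcd's source.

```go
package main

import (
	"errors"
	"fmt"
)

// errEntryTooLarge carries the message seen in the log; the variable name is
// made up for this sketch.
var errEntryTooLarge = errors.New("wal: max entry size limit exceeded")

// entryLimit returns the largest record that could still fit in the file,
// using the formula from the log line: fileSize - offset - padBytes.
func entryLimit(fileSize, offset, padBytes int64) int64 {
	return fileSize - offset - padBytes
}

// checkRecordSize rejects a declared record size that cannot fit in the file.
func checkRecordSize(recBytes, fileSize, offset, padBytes int64) error {
	if recBytes > entryLimit(fileSize, offset, padBytes) {
		return errEntryTooLarge
	}
	return nil
}

func main() {
	// Values copied from the log line in this comment.
	const fileSize, offset, padBytes = 64000000, 196120, 0
	const recBytes = 13563782407139376

	fmt.Println(entryLimit(fileSize, offset, padBytes))                // 63803880
	fmt.Println(checkRecordSize(recBytes, fileSize, offset, padBytes)) // wal: max entry size limit exceeded
}
```

A check of this kind only detects that the length field is implausible; it cannot tell which of the two causes above produced the bad value.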

ahrtr · 2022-06-17T08:24:29Z

I think the best thing to do for now is to let etcd fail to start in this situation (corrupted data files, including the WAL file), which is exactly the current behavior.

In the future, we may deliver a solution to recover the data file from a point in time.

aphyr · 2022-06-21T17:59:57Z

So we've traced this behavior to (we think) an issue with lazyfs: truncation filled with ASCII '0' characters (0x30), rather than 0x00. Etcd's WAL reader scanned for 0x00 to determine the end of the file, and in this case got 0x30 and... maybe interpreted those as a part of the size field?
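
One way to test that guess is the Go sketch below. It assumes the WAL frame header is an 8-byte little-endian length field whose low 56 bits hold the record size, with padding flagged by the top bit. That layout and the function name are assumptions made for illustration, not code quoted from etcd.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeFrameSize splits a 64-bit length field into record size and padding.
// Assumed layout: the low 56 bits are the record size; if the top bit is set
// (negative value), the low 3 bits of the top byte are the padding length.
func decodeFrameSize(lenField int64) (recBytes, padBytes int64) {
	recBytes = int64(uint64(lenField) &^ (uint64(0xff) << 56))
	if lenField < 0 {
		padBytes = int64((uint64(lenField) >> 56) & 0x7)
	}
	return recBytes, padBytes
}

func main() {
	// Eight bytes of ASCII '0', as written by the lazyfs truncation bug.
	filler := []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}
	rec, pad := decodeFrameSize(int64(binary.LittleEndian.Uint64(filler)))
	fmt.Println(rec, pad) // 13563782407139376 0

	// Eight zero bytes, as a correctly extended file would contain.
	zeros := make([]byte, 8)
	rec, pad = decodeFrameSize(int64(binary.LittleEndian.Uint64(zeros)))
	fmt.Println(rec, pad) // 0 0
}
```

Under these assumptions, the 0x30 filler yields recBytes 13563782407139376 and padBytes 0, the same values as in the "Failed reading WAL" log line earlier in this thread, which is consistent with the filler being read as a size field.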

I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

Instead, we get a new kind of crash! Here's an example:

{"level":"panic","ts":"2022-06-21T13:24:29.359-0400","logger":"raft","caller":"etcdserver/zap_raft.go:101","msg":"tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101\ngo.etcd.io/etcd/raft/v3.(*raftLog).commitTo\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237\ngo.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508\ngo.etcd.io/etcd/raft/v3.stepFollower\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434\ngo.etcd.io/etcd/raft/v3.(*raft).Step\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975\ngo.etcd.io/etcd/raft/v3.(*node).run\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356"}
{"level":"info","ts":"2022-06-21T13:24:29.359-0400","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"a1ffd5acd6a88a6a"}
panic: tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?

goroutine 167 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00021e480, 0x0, 0x0, 0x0)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc00012c028, 0x4, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2, 0x0, 0x0, 0x0)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Panicf(...)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc0000dc090, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101 +0x7d
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0001f6000, 0xdc7c)
	/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508 +0x54
go.etcd.io/etcd/raft/v3.stepFollower(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434 +0x478
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc000238180)
	/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
	/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:244 +0x330
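
The invariant behind this panic can be sketched in a few lines of Go. This is a simplified stand-in, not the raft package's code: `miniLog` is a hypothetical type that tracks only indexes, and it returns an error where the real log panics.

```go
package main

import "fmt"

// miniLog is a hypothetical stand-in for a Raft log; it tracks only indexes.
type miniLog struct {
	lastIndex uint64 // index of the last entry present in the local log
	committed uint64 // highest index known to be committed
}

// commitTo advances the commit index, refusing to commit past the end of the
// local log. A leader only asks a follower to commit entries it believes the
// follower has already persisted, so a failure here means entries went missing.
func (l *miniLog) commitTo(tocommit uint64) error {
	if l.committed >= tocommit {
		return nil // never decrease the commit index
	}
	if l.lastIndex < tocommit {
		return fmt.Errorf("tocommit(%d) is out of range [lastIndex(%d)]", tocommit, l.lastIndex)
	}
	l.committed = tocommit
	return nil
}

func main() {
	// Indexes taken from the panic message above.
	l := &miniLog{lastIndex: 2894}
	fmt.Println(l.commitTo(2000))  // <nil>
	fmt.Println(l.commitTo(56444)) // tocommit(56444) is out of range [lastIndex(2894)]
}
```

In the trace above, the argument to commitTo is 0xdc7c, which is 56444 in decimal, while the node's own log ends at index 2894.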

devzizu · 2022-06-22T07:55:10Z

Hey!

> I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

That's right: any filesystem should return null bytes (0x00) when reading the region added by a truncate that increases the file size. My apologies for the LazyFS bug. As @aphyr said, I was writing 0x30 (ASCII '0') instead of 0x00, because that helped me debug at the time. I also thought it wouldn't be a big deal, because I assumed applications relied on some kind of maximum readable offset.
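
The expected behavior is easy to check against any filesystem with a small Go program. This is a sketch: `extendAndRead` is a hypothetical helper, and the temp file is created wherever `os.CreateTemp` puts it, so point the temp directory at the mount you want to test.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// extendAndRead writes data to a fresh temp file, grows the file to size
// bytes with Truncate, and returns the whole file content.
func extendAndRead(data []byte, size int64) ([]byte, error) {
	f, err := os.CreateTemp("", "truncate-demo-*")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write(data); err != nil {
		return nil, err
	}
	if err := f.Truncate(size); err != nil {
		return nil, err
	}
	return os.ReadFile(f.Name())
}

func main() {
	got, err := extendAndRead([]byte("wal"), 16)
	if err != nil {
		panic(err)
	}
	tail := got[3:]
	// On a conforming filesystem the extended region reads back as zeros.
	fmt.Println(len(got), bytes.Equal(tail, make([]byte, len(tail))))
}
```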

Feel free to report bugs or ask me anything about LazyFS, it will be a pleasure to help!

stale · 2022-09-21T02:36:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

serathius · 2022-09-21T09:13:42Z

Closing per #14098 (comment)

SELVAKUMARU · 2023-10-24T06:58:11Z

I'm hitting this error when starting etcd in sensu-go: `error starting etcd: wal: max entry size limit`

What is the solution for this? Thank you.

aphyr added the type/bug label Jun 7, 2022

aphyr mentioned this issue Jun 14, 2022

Introduce etcd linearizability tests #14045

Closed

33 tasks

ahrtr added type/feature and removed type/bug labels Jun 17, 2022

ahrtr mentioned this issue Jun 22, 2022

inconsistent data in etcd 3.5.4 #14139

Closed

stale bot added the stale label Sep 21, 2022

serathius closed this as completed Sep 21, 2022
