Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Etcd should be resiliant to packet manipulation #15704

Open
ptabor opened this issue Apr 12, 2023 · 6 comments
Open

Etcd should be resiliant to packet manipulation #15704

ptabor opened this issue Apr 12, 2023 · 6 comments

Comments

@ptabor
Copy link
Contributor

ptabor commented Apr 12, 2023

What happened?

etcd v3.4.21 crashed when was a follower in 3 nodes cluster, soon after some networking hiccup:

panic: runtime error: index out of range [0] with length 0
goroutine 229 [running]:
   go.etcd.io/etcd/raft.(*MemoryStorage).firstIndex(...)
   go/src/go.etcd.io/etcd/raft/storage.go:162
   go.etcd.io/etcd/raft.(*MemoryStorage).Append(0xc0007e2780, 0xc003f5bae0, 0x1, 0x1, 0x0, 0x0)
   go/src/go.etcd.io/etcd/raft/storage.go:249 +0x63a
   go.etcd.io/etcd/etcdserver.(*raftNode).start.func1(0xc000300630, 0xc000042740, 0x3b9aca00)
   go/src/go.etcd.io/etcd/etcdserver/raft.go:310 +0x6c5

What did you expect to happen?

Reconnect and keep working.

How can we reproduce it (as minimally and precisely as possible)?

I don't have repro. It happened once.

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.21

Etcd configuration (command line flags or environment variables)

No response

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

DEFAULT 2023-04-03T13:56:15.665952143Z | mvcc: finished scheduled compaction at 509345669 (took 31.263387ms)
DEFAULT 2023-04-03T13:57:19.878587391Z | rafthttp: stopping peer 8fb74f91fa5fb582...
DEFAULT 2023-04-03T13:57:19.878591031Z 2023-04-03 13:57:19.878365 I | rafthttp: closed the TCP streaming connection with peer 8fb74f91fa5fb582 (stream MsgApp v2 writer)
DEFAULT 2023-04-03T13:57:19.878591391Z 2023-04-03 13:57:19.878387 I | rafthttp: stopped streaming with peer 8fb74f91fa5fb582 (writer)
DEFAULT 2023-04-03T13:57:19.879559241Z 2023-04-03 13:57:19.879192 I | rafthttp: closed the TCP streaming connection with peer 8fb74f91fa5fb582 (stream Message writer)
DEFAULT 2023-04-03T13:57:19.879560541Z 2023-04-03 13:57:19.879220 I | rafthttp: stopped streaming with peer 8fb74f91fa5fb582 (writer)
DEFAULT 2023-04-03T13:57:19.879885631Z  2023-04-03 13:57:19.879633 I | rafthttp: stopped HTTP pipelining with peer 8fb74f91fa5fb582
DEFAULT 2023-04-03T13:57:19.880165801Z 2023-04-03 13:57:19.880113 W | rafthttp: lost the TCP streaming connection with peer 8fb74f91fa5fb582 (stream MsgApp v2 reader)
DEFAULT 2023-04-03T13:57:19.880272901Z 2023-04-03 13:57:19.880194 E | rafthttp: failed to read 8fb74f91fa5fb582 on stream MsgApp v2 (context canceled)
DEFAULT 2023-04-03T13:57:19.880308521Z 2023-04-03 13:57:19.880206 I | rafthttp: peer 8fb74f91fa5fb582 became inactive (message send to peer failed)
DEFAULT 2023-04-03T13:57:19.881324261Z 2023-04-03 13:57:19.880447 I | rafthttp: stopped streaming with peer 8fb74f91fa5fb582 (stream MsgApp v2 reader)
DEFAULT 2023-04-03T13:57:19.881325561Z   2023-04-03 13:57:19.880639 W | rafthttp: lost the TCP streaming connection with peer 8fb74f91fa5fb582 (stream Message reader)
DEFAULT 2023-04-03T13:57:19.881325891Z 2023-04-03 13:57:19.880677 I | rafthttp: stopped streaming with peer 8fb74f91fa5fb582 (stream Message reader)
DEFAULT 2023-04-03T13:57:19.881326171Z 2023-04-03 13:57:19.880723 I | rafthttp: stopped peer 8fb74f91fa5fb582
DEFAULT 2023-04-03T13:57:19.881326491Z 2023-04-03 13:57:19.880755 I | rafthttp: stopping peer a708f991c86e646e...
DEFAULT 2023-04-03T13:57:19.882414421Z  | rafthttp: closed the TCP streaming connection with peer a708f991c86e646e (stream MsgApp v2 writer)
DEFAULT 2023-04-03T13:57:19.882415590Z  | rafthttp: stopped streaming with peer a708f991c86e646e (writer)
DEFAULT 2023-04-03T13:57:19.883056701Z  | rafthttp: closed the TCP streaming connection with peer a708f991c86e646e (stream Message writer)
DEFAULT 2023-04-03T13:57:19.883057970ZI | rafthttp: stopped streaming with peer a708f991c86e646e (writer)
DEFAULT 2023-04-03T13:57:19.887332830Z  | rafthttp: stopped HTTP pipelining with peer a708f991c86e646e
DEFAULT 2023-04-03T13:57:19.887335330Z  | rafthttp: lost the TCP streaming connection with peer a708f991c86e646e (stream MsgApp v2 reader)
DEFAULT 2023-04-03T13:57:19.887335640Z | rafthttp: failed to read a708f991c86e646e on stream MsgApp v2 (context canceled)
DEFAULT 2023-04-03T13:57:19.887335910Z | rafthttp: peer a708f991c86e646e became inactive (message send to peer failed)
DEFAULT 2023-04-03T13:57:19.887336200Z | rafthttp: stopped streaming with peer a708f991c86e646e (stream MsgApp v2 reader)
DEFAULT 2023-04-03T13:57:19.887336470Z | rafthttp: lost the TCP streaming connection with peer a708f991c86e646e (stream Message reader)
DEFAULT 2023-04-03T13:57:19.887336760Z | rafthttp: stopped streaming with peer a708f991c86e646e (stream Message reader)
DEFAULT 2023-04-03T13:57:19.887337070Z  | rafthttp: stopped peer a708f991c86e646e

DEFAULT 2023-04-03T13:57:19.893996129Z panic: runtime error: index out of range [0] with length 0
DEFAULT 2023-04-03T13:57:19.893999729Z goroutine 229 [running]:
DEFAULT 2023-04-03T13:57:19.894000099Z  go.etcd.io/etcd/raft.(*MemoryStorage).firstIndex(...)
DEFAULT 2023-04-03T13:57:19.894000369Z  go/src/go.etcd.io/etcd/raft/storage.go:162
DEFAULT 2023-04-03T13:57:19.894000639Z  go.etcd.io/etcd/raft.(*MemoryStorage).Append(0xc0007e2780, 0xc003f5bae0, 0x1, 0x1, 0x0, 0x0)
DEFAULT 2023-04-03T13:57:19.894000919Z go/src/go.etcd.io/etcd/raft/storage.go:249 +0x63a
DEFAULT 2023-04-03T13:57:19.894001159Z go.etcd.io/etcd/etcdserver.(*raftNode).start.func1(0xc000300630, 0xc000042740, 0x3b9aca00)
DEFAULT 2023-04-03T13:57:19.894001409Z go/src/go.etcd.io/etcd/etcdserver/raft.go:310 +0x6c5
DEFAULT 2023-04-03T13:57:19.894001689Z  created by go.etcd.io/etcd/etcdserver.(*raftNode).start
DEFAULT 2023-04-03T13:57:19.894001959Z  /src/go.etcd.io/etcd/etcdserver/raft.go:165 +0x52
DEFAULT 2023-04-03T13:57:19.919440817Z  + error_code=2
@ptabor
Copy link
Contributor Author

ptabor commented Apr 12, 2023

The crash is in:

etcd/raft/storage.go

Lines 161 to 163 in 85b640c

func (ms *MemoryStorage) firstIndex() uint64 {
return ms.ents[0].Index + 1
}

It seems the mc.ents always should have at least 1 item.
The only place where there is theoretical risks of going to 0 is:

etcd/raft/storage.go

Lines 263 to 265 in 85b640c

case uint64(len(ms.ents)) > offset:
ms.ents = append([]pb.Entry{}, ms.ents[:offset]...)
ms.ents = append(ms.ents, entries...)

The surrounding networking problems suggests that the entries might have been truncated/corrupted.
The server runs on mTLS so we can expect truncation, but not a corruption in the middle.

@ptabor
Copy link
Contributor Author

ptabor commented Apr 12, 2023

But actually entries should never be empty. If they were, this would fail:

offset := entries[0].Index - ms.ents[0].Index

@ahrtr
Copy link
Member

ahrtr commented Apr 13, 2023

The surrounding networking problems suggests that the entries might have been truncated/corrupted.

It's the only possible reason based on the implementation. cc @tbg @pavelkalinnikov @nvanbenschoten PTAL, thx

@ahrtr
Copy link
Member

ahrtr commented Apr 13, 2023

Or somehow the entries are not continuous.

@pav-kv
Copy link
Contributor

pav-kv commented Apr 13, 2023

Looks puzzling. The ms.ents field can't be empty just by construction. The only questionable case indeed should panic earlier (#15704 (comment)) if the appended entries slice is empty. Since it's not then the result of append shouldn't be empty.

A meta-comment. This isn't the first panic in raft when some kind of message corruption is suspect (and a recent one was indeed due to corruption: #15595). It looks like some sanity/invariant checks in raft would be beneficial, e.g. like the ones around entry indices that I hacked up in the linked issue (0c85e6ec).

@serathius
Copy link
Member

serathius commented Apr 13, 2023

Agree that we should consider adding invariant checks, however I would like to us to look holistically at the problem of "etcd being resilient to some packet manipulation", where we also work on defining "some".

Protecting in raft is only one level of safeguards we could introduce. The issue reported in etcd cluster with peer mTLS enabled, which means that we cannot even trust TCP checksumming in more extreme situations.

@etcd-io etcd-io deleted a comment from stale bot Oct 27, 2023
@serathius serathius changed the title index out of range [0] in raft.(*MemoryStorage).Append Etcd is vulnerable to packet manipulation Oct 27, 2023
@serathius serathius changed the title Etcd is vulnerable to packet manipulation Etcd should be resiliant to packet manipulation Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants