Etcd panic in robustness tests blackhole failpoint #15595
Comments
Looks like an issue with raft.
cc @tbg @pavelkalinnikov
Could be an issue with raft, though I would instead expect it to be some issue with how etcd integrates with raft. Hard to know without further investigation. Let me talk it through with Pavel.
The issue is pretty hard to reproduce, only once or twice per hundred runs. Please let me know what data you need to confirm where the issue is.
tests/robustness: Disable testing network blackhole until #15595 is fixed
I've investigated this a bit. I can see that this is running on raft @ eaa6808e1f7a. The panic occurs here; e.g., the one I'm looking at reports:
From the stack trace we're seeing that the panic fires in `raftLog.maybeAppend` while a follower handles an append from the leader. We're also seeing that there are two panics happening, at around the same time, with almost matching indexes. Presumably the cluster has a leader and 2 followers. The leader sends a weird `MsgApp` whose first entry index (290) doesn't follow the position the message claims to append after (284, so the follower expects 285). The question is how we get a `MsgApp` with such a gap in it.
I think we can add some ad-hoc logging or assertions (both on the sending and receiving end, to localize where it occurs) to confirm where the `MsgApp` invariant gets broken.
@serathius I've created a branch in my fork on top of the commit you're running against: https://github.com/pavelkalinnikov/raft/commits/msg-app-invariant. If you could run the test with this commit included (or cherry-picked, whatever works for you), this might help localize where the issue occurs.
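For intuition, the invariant in question looks roughly like this. A minimal sketch of the kind of assertion such a branch might add; the function name and placement are illustrative, not the actual branch code:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// verifyMsgApp is a hypothetical assertion: in a well-formed MsgApp, the
// entries must be contiguous and start right after m.Index (the log position
// the leader claims the follower already has).
func verifyMsgApp(m raftpb.Message) error {
	if m.Type != raftpb.MsgApp || len(m.Entries) == 0 {
		return nil
	}
	if m.Entries[0].Index != m.Index+1 {
		// This is exactly the shape of the panic seen in the logs:
		// "entry[0].Index is 290, want 285".
		return fmt.Errorf("entries invariant broken: entry[0].Index is %d, want %d",
			m.Entries[0].Index, m.Index+1)
	}
	for i := 1; i < len(m.Entries); i++ {
		if m.Entries[i].Index != m.Entries[i-1].Index+1 {
			return fmt.Errorf("entries not contiguous at position %d", i)
		}
	}
	return nil
}

func main() {
	m := raftpb.Message{Type: raftpb.MsgApp, Index: 284,
		Entries: []raftpb.Entry{{Index: 290, Term: 2}}}
	fmt.Println(verifyMsgApp(m)) // entries invariant broken: entry[0].Index is 290, want 285
}
```

Running such a check on both the sending and receiving side would tell us whether the malformed message leaves the leader in that state or gets corrupted in flight.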
Awesome! On it.
Updated #15600
Got it, there are two stacktraces written at once from two separate members; you can distinguish them by name (TestRobustnessClusterOfSize3LowTraffic-test-2, TestRobustnessClusterOfSize3LowTraffic-test-0) or pid (714482, 714483):
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"warn","ts":"2023-03-31T16:42:20.590682+0200","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6108155669721489885,"retry-timeout":"500ms"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"info","ts":"2023-03-31T16:42:20.59113+0200","caller":"traceutil/trace.go:171","msg":"trace[530800350] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:284; }","duration":"1.002163677s","start":"2023-03-31T16:42:19.588942+0200","end":"2023-03-31T16:42:20.591106+0200","steps":["trace[530800350] 'read index received' (duration: 1.002157415s)","trace[530800350] 'applied index is now lower than readState.Index' (duration: 5.2µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"info","ts":"2023-03-31T16:42:20.59132+0200","caller":"traceutil/trace.go:171","msg":"trace[1978863180] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:282; }","duration":"1.020670431s","start":"2023-03-31T16:42:19.570632+0200","end":"2023-03-31T16:42:20.591302+0200","steps":["trace[1978863180] 'read index received' (duration: 1.020666434s)","trace[1978863180] 'applied index is now lower than readState.Index' (duration: 2.895µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"info","ts":"2023-03-31T16:42:20.591933+0200","caller":"traceutil/trace.go:171","msg":"trace[858852315] linearizableReadLoop","detail":"{readStateIndex:284; appliedIndex:283; }","duration":"1.022533199s","start":"2023-03-31T16:42:19.569387+0200","end":"2023-03-31T16:42:20.59192+0200","steps":["trace[858852315] 'read index received' (duration: 1.022344014s)","trace[858852315] 'applied index is now lower than readState.Index' (duration: 188.474µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"panic","ts":"2023-03-31T16:42:20.592092+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"panic","ts":"2023-03-31T16:42:20.592137+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): goroutine 193 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): goroutine 47 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00017ca00?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00033a6e0?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000c1a00, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004ec1a0, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).log(0xc00024c080, 0x4, {0x108b574?, 0x40dc27?}, {0xc0001f86a0?, 0xece6c0?, 0xc000295401?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).log(0xc00011c318, 0x4, {0x108b574?, 0x40dc27?}, {0xc001a94db0?, 0xece6c0?, 0xc0002de501?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x10368e0?, {0x108b574?, 0x2?}, {0xc0001f86a0?, 0x0?, 0x7f2caa95c878?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x0?, {0x108b574?, 0x2?}, {0xc001a94db0?, 0x0?, 0x1d45d40?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004d45b0, 0x11c, 0x2, 0x11c, {0xc001ca5c20?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004e3f10, 0x11c, 0x2, 0x11c, {0xc001a30f30?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.stepFollower(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.stepFollower(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).Step(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).Step(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*node).run(0xc0006942a0)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*node).run(0xc0006ae840)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
The strange part is that this fires at the latest stage: on the follower, upon receiving the `MsgApp`.
@serathius What does the robustness "blackhole" mode do? Does it fiddle with the messages' content, or drop them, or something else?
We run a proxy in front of each member's peer port and blackhole the traffic (see lines 103 to 115 in 6a995d2).
It should just drop data.
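For context, the shape of such a proxy is roughly the following. A toy sketch, not etcd's actual test proxy; the `blackholed` flag and forwarding loop are illustrative assumptions:

```go
package main

import (
	"net"
	"sync/atomic"
)

// transparentProxy forwards raw bytes between a downstream peer connection
// and the real member. When blackholed is set, bytes read from one side are
// silently discarded instead of being written to the other. Note that this
// operates on the byte stream, with no awareness of message boundaries.
type transparentProxy struct {
	blackholed atomic.Bool
}

func (p *transparentProxy) pipe(dst, src net.Conn) {
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if n > 0 && !p.blackholed.Load() {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return
			}
		}
		// When blackholed, buf[:n] is dropped on the floor.
		if err != nil {
			return
		}
	}
}

func (p *transparentProxy) serve(listenAddr, memberAddr string) error {
	l, err := net.Listen("tcp", listenAddr)
	if err != nil {
		return err
	}
	for {
		in, err := l.Accept()
		if err != nil {
			return err
		}
		out, err := net.Dial("tcp", memberAddr)
		if err != nil {
			in.Close()
			continue
		}
		go p.pipe(out, in)
		go p.pipe(in, out)
	}
}
```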
Looking at the implementation here, I see that the modifiers are applied to a byte stream. Is it possible that, as a result of turning this thing on/off, some message is assembled from non-contiguous parts of the byte stream? Say, could this splice the tail of one message onto the head of another?
@serathius Also, is it possible that this stream drops only a part of the message (which represents a prefix of `m.Entries`)?
The problem with such an application-level "dropping" is that it doesn't look like a stream corruption from the TCP point of view. The stream is tampered with, but to the TCP receiver this is still a valid stream, so it just feeds a modified stream to the application layer. All sorts of application-level corruption can happen this way. The real TCP-level packet loss would have been spotted by the receiver, and it wouldn't be able to proceed from the middle of the stream when it recovers.

UPD: It might be HTTP that gRPC relies on, not necessarily directly TCP (not sure what the actual stack is). The proxy that you linked tampers with the stream above HTTP, right? And this gets retransmitted over HTTP, and fed into gRPC at the receiver end.
I think you might be right. I have tested peer communication with TLS and I'm unable to reproduce. Need to investigate how the proxy manages to execute a man-in-the-middle attack (I don't provide certs to it). As for the proxy layer, it looks like an L4 proxy (TCP) and not L7 (HTTP, gRPC). However, one thing I noticed is that packets are buffered, and the code that drops traffic runs on bytes and not whole packets, which means they could be split in the middle. That would be the incorrect behavior you mentioned that results in a non-natural network failure.
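A toy demonstration of why byte-level dropping corrupts framing. This sketch assumes a simple length-prefixed wire format (etcd's actual peer stream encoding may differ); dropping a middle chunk of bytes leaves a stream that still parses, but the decoded message is a splice of two real ones:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// frame encodes a payload with a 4-byte big-endian length prefix,
// standing in for any length-prefixed wire format.
func frame(payload string) []byte {
	buf := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(buf, uint32(len(payload)))
	copy(buf[4:], payload)
	return buf
}

func main() {
	// Two messages back to back on the wire.
	stream := append(frame("entries 281..284"), frame("entries 285..290")...)

	// A byte-level blackhole drops an arbitrary middle chunk, the way a proxy
	// operating on raw bytes (not whole messages) might when toggled on/off.
	spliced := append(append([]byte{}, stream[:10]...), stream[25:]...)

	// The receiver still sees a syntactically valid stream: the first length
	// prefix survived intact, so it happily decodes a "message" stitched from
	// the tail of message 1 and the head of message 2.
	n := int(binary.BigEndian.Uint32(spliced))
	fmt.Printf("decoded length %d, payload %q\n", n, spliced[4:4+n])
}
```

With TLS in the path, such a splice fails the record MAC check and the connection is torn down instead, which is consistent with the panic disappearing once peer TLS is enabled.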
Hmm, …
…d-io#15595 is fixed" This reverts commit 013e25f. Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Closing based on the fact that TLS fixes the issue. Replacing …
Based on Fu Wei's idea, we employ blocking on L7 but without using external tools.

[Main ideas]
The main idea is to 1) utilize the `X-PeerURLs` field from the header, as we know all traffic to a peer will contain this header field, and 2) since all nodes create direct connections with their peers, the traffic blocking has to happen at all nodes' proxies, contrary to the current design, where only the proxy of the peer being blackholed is involved. Based on these ideas, we introduce an SSL termination proxy so we can obtain the `X-PeerURLs` field from the header.

[Issues]
There are 2 known issues with this approach: 1) it still leaks some traffic, but the leaked traffic (as discussed later) won't affect the blackhole behavior we would like to achieve (as stream and pipeline traffic between raft nodes is now properly terminated); 2) we would need to employ an SSL termination proxy, which might lead to a small performance hit.

For 1), blocking by the `X-PeerURLs` header will miss certain types of traffic to certain endpoints, as the traffic to some endpoints doesn't have the `X-PeerURLs` field. Currently, the known ones are /members, /version, and /raft/probing. As you can see from the log, their headers don't contain the `X-PeerURLs` field, but only the following fields:
- map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]] /members
- map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]] /version
- map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]] /raft/probing

For 2), in order to read out `X-PeerURLs` from the header, we need to terminate the SSL connection, as we can't drop cleartext traffic (ref [1]). Thus, a new option `e2e.WithSSLTerminationProxy(true)` is introduced, which changes the network flow into

```
A -- B's SSL termination proxy - B's transparent proxy - B
     ^ newly introduced          ^ in the original codebase
```

[Known improvements required before turning RFC into PR]
The prototype needs to be further improved for code review after fixing the following issues:
- blocking only RX or TX traffic (currently a call to `blackholeTX` or `blackholeRX` will block both TX and RX traffic instead of just the specified one)
- slowness when performing test cleanup (I think this is related to the SSL timeout setting, but I haven't looked into it yet)
- coding style improvements

References:
[1] etcd-io#15595

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
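A minimal sketch of the header-based filter this describes, assuming a Go reverse proxy in front of the member; the handler name and addresses are illustrative, not the actual PoC code:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// blackholePeerTraffic drops any request that identifies itself as peer
// traffic via the X-PeerURLs header. Requests without the header (e.g. to
// /members, /version, /raft/probing) leak through, as noted above.
func blackholePeerTraffic(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-PeerURLs") != "" {
			// Abort the connection without writing a response, so the
			// sender observes a blackhole rather than an HTTP error.
			panic(http.ErrAbortHandler)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	member, _ := url.Parse("http://127.0.0.1:2380") // assumed member peer URL
	proxy := httputil.NewSingleHostReverseProxy(member)
	// In the real setup this listener would sit behind the SSL termination
	// proxy described above, so the headers are readable at this point.
	_ = http.ListenAndServe("127.0.0.1:12380", blackholePeerTraffic(proxy))
}
```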
What happened?
Robustness nightly runs are failing due to an etcd panic.
https://github.com/etcd-io/etcd/actions/runs/4562802681
What did you expect to happen?
Etcd should not panic due to network issues.
How can we reproduce it (as minimally and precisely as possible)?
Run the robustness blackhole tests enough times.
See https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml
Anything else we need to know?
No response
Etcd version (please run commands below)
N/A
Etcd configuration (command line flags or environment variables)
N/A
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
N/A
Relevant log output