Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connectivity: Add pod-to-pod-no-frag to check MTU misconfigurations #2610

Merged
merged 3 commits into from
Jun 25, 2024

Conversation

brb
Copy link
Member

@brb brb commented Jun 20, 2024

CI run - cilium/cilium#33286. Only ci-ipsec and ci-eks failures are relevant. I will fix them in a PR for cilium/cilium, which bumps CLI vsn.

To include iproute2 [1].

[1]: cilium/alpine-curl#123

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb changed the title WIP: Check MTU connectivity: Add pod-to-pod-no-frag to check MTU misconfigurations Jun 21, 2024
@brb brb added the area/CI Continuous Integration testing issue or flake label Jun 21, 2024
@brb brb mentioned this pull request Jun 21, 2024
@brb brb marked this pull request as ready for review June 21, 2024 10:00
@brb brb requested review from a team as code owners June 21, 2024 10:00
Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of nits

connectivity/check/context.go Outdated Show resolved Hide resolved
connectivity/tests/pod.go Show resolved Hide resolved
connectivity/tests/pod.go Outdated Show resolved Hide resolved
connectivity/tests/pod.go Outdated Show resolved Hide resolved
To pass "do not fragment" and to specify ping payload size in a
subsequent commit.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My ci-structure review:

I'm going to repost a previous review comment from another PR because I am seeing the same pattern being repeated here: #2568 (comment)

Is this really the best place to have this kind of test? I don't think I understand the point of moving away from Ginkgo-style tests if we are going to port the same logic and testing pattern into the CLI.

connectivity/tests/pod.go Outdated Show resolved Hide resolved
@brb
Copy link
Member Author

brb commented Jun 24, 2024

Is this really the best place to have this kind of test? I don't think I understand the point of moving away from Ginkgo-style tests if we are going to port the same logic and testing pattern into the CLI.

@christarazi I get your concerns regarding flakiness. Unfortunately, with Ginkgo deprecation we don't have any other place to add such e2e tests. There was an effort to create a E2E framework during the hyperjump, but it turned to be a massive investment. Maybe this topic can be picked again by some working group.

At the same time, IMO such tests are helpful for end users to diagnose non-obvious Cilium connectivity issues (which misconfigured MTU can lead to). And what I hear from users that they extensively use cilium-cli connectivity test to rely on infra deployment correctness.

(Ginkgo/gomega as a testing framework is / was a terrible foundation for Cilium's E2E tests. The argumentation https://www.reddit.com/r/golang/comments/1azj63h/comment/ks1srp2/ resonates well with us).

The new test case is intended to check whether MTU is properly set in
pod netns.

The check sends an ICMP Echo with DF set and payload. The payload size
is derived from pod MTU. No reply means that MTU is misconfigured
somewhere in the packet path.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brb

I get your concerns regarding flakiness. Unfortunately, with Ginkgo deprecation we don't have any other place to add such e2e tests. There was an effort to create a E2E framework during the hyperjump, but it turned to be a massive investment. Maybe this topic can be picked again by some working group.

I am mostly worried about lines 198-230 in connectivity/tests/pod.go because it seems to me that checking for the MTU on the node itself for the pod doesn't necessarily need to be done at this level. Couldn't it be validated at endpoint creation time that the appropriate MTU value was passed to the container?

The lines beyond the above lines mentioned is fine because IMO that is testing potential fragmentation issues with MTU. Ideally, that should be done in a BPF program unit test IIUC, but this is better than issuing ad-hoc commands to validate a parameter value (i.e. MTU).

It would be nice to have an issue to track the work to migrate to a better testing style, otherwise this code will just sit here until it causes us so many flakes that we then have to look at it again. (I realize I didn't mention this in the previous PR I posted a review comment like this to; will do that there as well).

For now I will approve to unblock because it seems the tradeoff of not having this sort of coverage is worse than not having it at all while we build better tests. Ideally we aren't adding more tests with this pattern going forward.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 25, 2024
@brb
Copy link
Member Author

brb commented Jun 25, 2024

Couldn't it be validated at endpoint creation time that the appropriate MTU value was passed to the container?

Unfortunately, noup. In-between pod-to-pod there are multiple network devices involved (some managed by Cilium), which MTU misconfiguration might not be reflected in an endpoint's MTU.

Ideally, that should be done in a BPF program unit test IIUC

The BPF unit tests do not involve netdevs. Perhaps, we should start thinking about introducing integration tests, in which cilium-agent loads BPF progs via pkg/datapath and then we inject a packet (e.g., via libpcap). That could be a replacement for the test cases from this PR. Maybe a topic for the next hive time (if yes, which WG?)?

@brb brb merged commit 9636a3e into main Jun 25, 2024
15 checks passed
@brb brb deleted the pr/brb/ci-wg-mtu branch June 25, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants