connectivity: Add upgrade tests #1683
Conversation
Can we also have instructions on the order in which we should run the commands?
It's in the last commit msg 😎
for pod, count := range restartCount {
	if prevCount, found := prevRestartCount[pod]; !found {
		t.Fatalf("Could not find restart count for Pod %s", pod)
	} else if prevCount != count {
		// A changed restart count means the Pod's long-lived flow was interrupted.
		t.Fatalf("Pod %s restart count changed (%d -> %d)", pod, prevCount, count)
	}
}
What is it about Pod restarts here that implies that a connection was reset? Is there a one-shot connection which, if it fails, fails a livenessProbe for the Pod?
A link to the manifest or documentation about this would be helpful here.
Yep, a Pod restart means that a long-lived connection was interrupted. I put some docs at the beginning of this file. Please let me know if more details are needed.
For now I reused the same migrate-svc deployments from the ginkgo test suite. In the future, we should at least rename them, and document and extend their use cases (e.g., to test N/S traffic interruptions).
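For reference, the restart counters can be gathered from the Pod status via client-go. A minimal sketch, not the PR's actual code; the package name, namespace argument, and the app=migrate-svc label selector are assumptions:

package upgrade

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// restartCounts collects the per-Pod container restart counters.
func restartCounts(ctx context.Context, client kubernetes.Interface, namespace string) (map[string]int32, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=migrate-svc", // assumed label of the migrate-svc Pods
	})
	if err != nil {
		return nil, fmt.Errorf("listing pods: %w", err)
	}
	counts := make(map[string]int32, len(pods.Items))
	for _, pod := range pods.Items {
		// Sum the restart counts of all containers; the kubelet bumps a
		// container's RestartCount on every restart, so any increase means
		// the container's long-lived connection was interrupted at least once.
		var total int32
		for _, cs := range pod.Status.ContainerStatuses {
			total += cs.RestartCount
		}
		counts[pod.Name] = total
	}
	return counts, nil
}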
Looks good for the most part. Commented with some suggested improvements.
Flush CT of each Cilium node after running all connectivity tests. This is needed when running connectivity tests multiple times on the same cluster, as the L7 netpol tests might otherwise interfere with other tests [1].

[1]: cilium/cilium#17459

Signed-off-by: Martynas Pumputis <m@lambda.lt>
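A minimal sketch of what flushing CT could look like, assuming kubectl is on PATH, the agents run in kube-system in a container named cilium-agent, and the agent CLI's "cilium bpf ct flush global" command; this is illustrative, not the PR's implementation:

package upgrade

import (
	"context"
	"fmt"
	"os/exec"
)

// flushCT shells out to kubectl to flush the conntrack table of every
// Cilium agent Pod passed in.
func flushCT(ctx context.Context, agentPods []string) error {
	for _, pod := range agentPods {
		cmd := exec.CommandContext(ctx, "kubectl", "-n", "kube-system",
			"exec", pod, "-c", "cilium-agent", "--",
			"cilium", "bpf", "ct", "flush", "global")
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("flushing CT on %s: %v (output: %s)", pod, err, out)
		}
	}
	return nil
}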
The new test case checks that there are no interruptions in long-lived E/W LB flows. The upgrade test consists of the following steps:

* "cli connectivity test --include-upgrade-test --upgrade-test-setup": deploys the migrate-svc (to be renamed) pods and stores their restart counters.
* Perform the Cilium upgrade.
* "cli connectivity test --include-upgrade-test --test post-upgrade": checks the restart counters of the migrate-svc pods and compares them against the previously stored counters (a counter mismatch means the flow was interrupted).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
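Roughly, the setup step persists the counters to the --upgrade-test-result-path file and the post-upgrade step reads them back for comparison. A minimal sketch, assuming a JSON file format; storeCounts/loadCounts are hypothetical helper names, not the PR's actual functions:

package upgrade

import (
	"encoding/json"
	"os"
)

// storeCounts persists the restart counters gathered during the setup step.
func storeCounts(path string, counts map[string]int32) error {
	data, err := json.Marshal(counts)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// loadCounts reads the counters back during the post-upgrade step.
func loadCounts(path string) (map[string]int32, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int32)
	err = json.Unmarshal(data, &counts)
	return counts, err
}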
cmd.Flags().BoolVar(&params.IncludeUpgradeTest, "include-upgrade-test", false, "Include upgrade test")
cmd.Flags().BoolVar(&params.UpgradeTestSetup, "upgrade-test-setup", false, "Set up upgrade test dependencies")
cmd.Flags().StringVar(&params.UpgradeTestResultPath, "upgrade-test-result-path", "/tmp/cilium-upgrade-test-restart-counts", "Upgrade test temporary result file (used internally)")
cmd.Flags().BoolVar(&params.FlushCT, "flush-ct", false, "Flush conntrack of Cilium on each node")
Should we hide this flag, given that connectivity tests are supposed to be usable in production clusters without side effects? The blast radius of misuse here is quite high.
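If hiding the flag is the preferred route, spf13/pflag (which cobra exposes) already supports it; a one-line suggestion sketch, not code from this PR:

// Hide the flag from --help output while keeping it functional.
if err := cmd.Flags().MarkHidden("flush-ct"); err != nil {
	panic(err)
}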
Oh I see, yes, this is a known issue: cilium/cilium#17459 🙏
Please see each individual commit msg when reviewing.
Testing in cilium/cilium#25790 (red builds can be ignored, as they are due to Cilium issues which the upgrade tests have caught).
Related: cilium/cilium#25037.