Make e2e test repairs more robust #1164
Made several debuggability and functional improvements:

* Added timestamps to all output to correlate with logs. Timestamps are of the form "seconds since Unix epoch" which isn't human friendly but is identical to the timestamps produced by HNC logs.
* Fully deleted the HNC deployment in RecoverHNC since HNC doesn't really seem to recover well if various things are changed without restarting the pod.
* Make the post-recovery test far more robust by not ignoring failures to delete the test namespaces.
* Stop force-deleting pods in the rolebinding test and instead just wait a few moments. I found that we don't actually need to wait for the pod object to be fully deleted on the server for it to stop getting in the way; the container in the pod appears to stop running ~instantly while the pod can occasionally hang around for over 60s.

All of these were added as I saw failures in the affected code.

Tested: Ran a set of five flaky tests with and without these changes (while also including PR kubernetes-retired#1163). Without, at least one of them failed virtually every time; with this change, they passed 5/5 times on GKE plus 2/2 times on Kind.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin
This change fixes todos in the cherrypick for kubernetes-retired#1150 (see issue kubernetes-retired#1149). It simplifies and restructures a lot of the logic to make it easier to follow while looking at less data (e.g. a lot more focus on anchor.Status.State). It also adds a lot more documentation. Tested: all e2e tests pass on GKE 1.18 when combined with fixes to the e2e tests (PRs kubernetes-retired#1160, kubernetes-retired#1162, kubernetes-retired#1163 and kubernetes-retired#1164).
/assign @yiqigao217
// Do NOT use CleanupNamespaces because that just assumes that if it can't delete a namespace that
// everything's fine, but this is a poor assumption if HNC has just been repaired.
//
// TODO: if CleanupNamespaces ever starts using labels to select namespaces to delete, then get
// rid of this hack.
if err := TryRunQuietly("kubectl get ns", a); err == nil {
	MustRunWithTimeout(30, "kubectl hns set", a, "-a")
	MustRunWithTimeout(30, "kubectl delete ns", a)
}
if err := TryRunQuietly("kubectl get ns", b); err == nil {
	MustRunWithTimeout(30, "kubectl annotate ns", b, "hnc.x-k8s.io/subnamespaceOf-")
	MustRunWithTimeout(30, "kubectl delete ns", b)
}
Hmm, this looked the same to me as CleanupNamespaces() except for skipping de-annotating a and not setting AC for b. What makes the difference?
CleanupNamespaces uses TryRunQuietly; this uses MustRunWithTimeout.
Ah I see.
/lgtm
/hold
/assign @rjbez17
Test-only changes don't need double-approval, though @rjbez17 please lmk if you have any suggestions and I'll follow them up.
/hold cancel