This repository has been archived by the owner on Jun 26, 2023. It is now read-only.

Make e2e test repairs more robust #1164

Merged
merged 1 commit into from
Oct 1, 2020

Conversation

adrianludwin
Contributor

Made several debuggability and functional improvements:

  • Added timestamps to all output so it can be correlated with logs.
    Timestamps are in seconds since the Unix epoch, which isn't human
    friendly but is identical to the timestamps produced by HNC logs.
  • RecoverHNC now fully deletes the HNC deployment, since HNC doesn't
    seem to recover well if various things are changed without
    restarting the pod.
  • Made the post-recovery test far more robust by no longer ignoring
    failures to delete the test namespaces.
  • Stopped force-deleting pods in the rolebinding test; it now just waits
    a few moments instead. I found that we don't actually need to wait for
    the pod object to be fully deleted on the server for it to stop getting
    in the way; the container in the pod appears to stop running almost
    instantly, while the pod can occasionally hang around for over 60s.

All of these were added as I saw failures in the affected code.

Tested: Ran a set of five flaky tests with and without these changes
(while also including PR #1163). Without, at least one of them failed
virtually every time; with this change, they passed 5/5 times on GKE
plus 2/2 times on Kind.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianludwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2020
adrianludwin added a commit to adrianludwin/multi-tenancy that referenced this pull request Sep 30, 2020
This change fixes TODOs in the cherrypick for kubernetes-retired#1150 (see issue kubernetes-retired#1149).
It simplifies and restructures a lot of the logic to make it easier to
follow while looking at less data (e.g. a lot more focus on
anchor.Status.State). It also adds a lot more documentation.

Tested: all e2e tests pass on GKE 1.18 when combined with fixes to the
e2e tests (PRs kubernetes-retired#1160, kubernetes-retired#1162, kubernetes-retired#1163 and kubernetes-retired#1164).
@adrianludwin
Contributor Author

/assign @yiqigao217
/assign @rjbez17

Comment on lines +273 to +285
// Do NOT use CleanupNamespaces because that just assumes that if it can't delete a namespace that
// everything's fine, but this is a poor assumption if HNC has just been repaired.
//
// TODO: if CleanupNamespaces ever starts using labels to select namespaces to delete, then get
// rid of this hack.
if err := TryRunQuietly("kubectl get ns", a); err == nil {
	MustRunWithTimeout(30, "kubectl hns set", a, "-a")
	MustRunWithTimeout(30, "kubectl delete ns", a)
}
if err := TryRunQuietly("kubectl get ns", b); err == nil {
	MustRunWithTimeout(30, "kubectl annotate ns", b, "hnc.x-k8s.io/subnamespaceOf-")
	MustRunWithTimeout(30, "kubectl delete ns", b)
}
Contributor

Hmm, this looks the same to me as CleanupNamespaces(), except that it skips de-annotating a and doesn't set AC for b. What's the difference?

Contributor Author

CleanupNamespaces uses TryRunQuietly, so failures are ignored; this uses MustRunWithTimeout, so they aren't.

Contributor

@yiqigao217 yiqigao217 left a comment

Ah I see.
/lgtm
/hold
/assign @rjbez17

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Oct 1, 2020
@adrianludwin
Contributor Author

Test-only changes don't need double-approval, though @rjbez17 please lmk if you have any suggestions and I'll follow them up.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@k8s-ci-robot k8s-ci-robot merged commit 26e92d8 into kubernetes-retired:master Oct 1, 2020
@adrianludwin adrianludwin deleted the repair-e2e branch October 5, 2020 15:33