Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advanced Audit tests flaking #60719

Closed
liggitt opened this issue Mar 2, 2018 · 18 comments · Fixed by #60739
Closed

Advanced Audit tests flaking #60719

liggitt opened this issue Mar 2, 2018 · 18 comments · Fixed by #60739
Assignees
Labels
area/audit kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. milestone/needs-attention priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/auth Categorizes an issue or PR as relevant to SIG Auth.
Milestone

Comments

@liggitt
Copy link
Member

liggitt commented Mar 2, 2018

Started flaking yesterday: https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Advanced%20Audit

Seems to coincide with #60237

/assign @tallclair @crassirostris

need to triage this ASAP to know if we need to roll back that PR

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 2, 2018
@liggitt liggitt added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/auth Categorizes an issue or PR as relevant to SIG Auth. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. status/approved-for-milestone area/audit labels Mar 2, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 2, 2018
@liggitt liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Mar 2, 2018
@liggitt liggitt added this to the v1.10 milestone Mar 2, 2018
@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Mar 2, 2018
@liggitt
Copy link
Member Author

liggitt commented Mar 2, 2018

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.10 milestone Mar 2, 2018
@liggitt
Copy link
Member Author

liggitt commented Mar 2, 2018

/milestone v1.10

@BenTheElder
Copy link
Member

/milestone v1.10
(should not work, lack of perms)

@k8s-ci-robot
Copy link
Contributor

@BenTheElder: You must be a member of the kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.10
(should not work, lack of perms)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tallclair
Copy link
Member

From apiserver log for run ci-kubernetes-e2e-gci-gce/22334:

I0302 17:08:26.869897       1 flags.go:27] FLAG: --audit-log-batch-buffer-size="10000"
I0302 17:08:26.869904       1 flags.go:27] FLAG: --audit-log-batch-max-size="400"
I0302 17:08:26.869910       1 flags.go:27] FLAG: --audit-log-batch-max-wait="30s"
I0302 17:08:26.869919       1 flags.go:27] FLAG: --audit-log-batch-throttle-burst="15"
I0302 17:08:26.869925       1 flags.go:27] FLAG: --audit-log-batch-throttle-enable="true"
I0302 17:08:26.869931       1 flags.go:27] FLAG: --audit-log-batch-throttle-qps="10"
I0302 17:08:26.869941       1 flags.go:27] FLAG: --audit-log-format="json"
I0302 17:08:26.869947       1 flags.go:27] FLAG: --audit-log-maxage="0"
I0302 17:08:26.869953       1 flags.go:27] FLAG: --audit-log-maxbackup="0"
I0302 17:08:26.869960       1 flags.go:27] FLAG: --audit-log-maxsize="2000000000"
I0302 17:08:26.869966       1 flags.go:27] FLAG: --audit-log-mode="batch"
I0302 17:08:26.869972       1 flags.go:27] FLAG: --audit-log-path="/var/log/kube-apiserver-audit.log"
I0302 17:08:26.869979       1 flags.go:27] FLAG: --audit-policy-file="/etc/audit_policy.config"
I0302 17:08:26.869986       1 flags.go:27] FLAG: --audit-webhook-batch-buffer-size="10000"
I0302 17:08:26.869992       1 flags.go:27] FLAG: --audit-webhook-batch-initial-backoff="10s"
I0302 17:08:26.869999       1 flags.go:27] FLAG: --audit-webhook-batch-max-size="400"
I0302 17:08:26.870005       1 flags.go:27] FLAG: --audit-webhook-batch-max-wait="30s"
I0302 17:08:26.870012       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-burst="15"
I0302 17:08:26.870018       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-enable="false"
I0302 17:08:26.870024       1 flags.go:27] FLAG: --audit-webhook-batch-throttle-qps="10"
I0302 17:08:26.870031       1 flags.go:27] FLAG: --audit-webhook-config-file=""
I0302 17:08:26.870037       1 flags.go:27] FLAG: --audit-webhook-initial-backoff="10s"
I0302 17:08:26.870044       1 flags.go:27] FLAG: --audit-webhook-mode="batch"

I'm pretty sure this is what's happening: the default audit mode changed to batch for the logging backend, but the audit test expects the logs to be there immediately.

Recommended Action:

  1. Revert the defaults on the audit flags. I don't think we can change the default behavior for backwards compatibility.
  2. Add retry logic to the audit test so that it can handle async logs.
  3. (optional) update the e2e configuration to use batch mode with a very short timeout.

If we rollback #60237 it will cause problems with the scale tests that were re-enabled in kubernetes/test-infra#7000

@krzyzacy
Copy link
Member

krzyzacy commented Mar 2, 2018

http://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce
starts to look pretty bad
priority/critical-urgent
cc @jberkus

@tallclair
Copy link
Member

Actually, the forward fix would break scalability anyway. I'll send a rollback.

@liggitt
Copy link
Member Author

liggitt commented Mar 5, 2018

/status in-progress

@crassirostris
Copy link

@tallclair

Actually, the forward fix would break scalability anyway. I'll send a rollback.

No, because audit logging tests are not enabled in scalability tests, there's DisabledForLargeClusters in the test name

@crassirostris
Copy link

I think the e2e test fix is better in this case

@crassirostris
Copy link

Filed #60794 to fix the problem

I still believe that using buffered audit logging is a better default, since slowing down all api requests is a pretty serious drawback and should be enabled consciously

@crassirostris
Copy link

@liggitt @tallclair If you're fine with forward-fixing the problem in e2es, I'd prefer this way, it's the easiest. Fix is in #60794

@tallclair
Copy link
Member

I still believe that using buffered audit logging is a better default, since slowing down all api requests is a pretty serious drawback and should be enabled consciously

If we make batch mode the default, I think we should adjust the default parameters to include:

  1. Very low timeout (1s?) - I don't think there's any reason to collect large batches.
  2. Only a single go-routine (do we have the right params for this?) - This is important to reduce write contention, and to keep log lines in order.

I'm OK moving forward with the test fix, but I'd like to fix those defaults before 1.10 is cut, if possible.

@crassirostris
Copy link

crassirostris commented Mar 6, 2018

I'm OK moving forward with the test fix, but I'd like to fix those defaults before 1.10 is cut, if possible.

I agree that defaults should change. What about changing the batch size to 1, so that each log entry is written asynchronously in its own goroutine? Anyways, this or very low timeout SGTM, since I don't think that the second option is feasible right now.

Anyways, let's continue the discussion in #60739, since it's unrelated to the flaking.

k8s-github-robot pushed a commit that referenced this issue Mar 6, 2018
Automatic merge from submit-queue (batch tested with PRs 60630, 60794). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Add retrying to audit logging e2e tests

Fixes #60719

Adds retrying to the audit logging e2e tests so it can work when audit logging is in batch mode and actual writing is delayed.

```release-note
NONE
```

/cc @tallclair @liggitt @sttts
@liggitt
Copy link
Member Author

liggitt commented Mar 13, 2018

still failing in upgrade test... e2e fix likely needs backport to 1.9 e2es
http://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-1.9-master-upgrade-master

@crassirostris
Copy link

Good point! Created #61134 to address it

@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@crassirostris @liggitt @tallclair @kubernetes/sig-auth-misc

Action Required: This issue has not been updated since Mar 13. Please provide an update.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/auth: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

k8s-github-robot pushed a commit that referenced this issue Mar 15, 2018
…-#60794-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #60794: Add retrying to audit logging e2e tests

Cherry pick of #60794 on release-1.9.

Fixes #60719, since audit logging behavior has changed in 1.10. Purely e2e change, so no release note

#60794: Add retrying to audit logging e2e tests

```release-note
NONE
```

/cc @sttts @liggitt
@liggitt
Copy link
Member Author

liggitt commented Mar 15, 2018

fixed in upgrade test by #61134

@liggitt liggitt closed this as completed Mar 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/audit kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. milestone/needs-attention priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/auth Categorizes an issue or PR as relevant to SIG Auth.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants