
Fix to avoid logging REST API calls at log level 2 #40933

Merged

Conversation

@aveshagarwal (Member) commented Feb 3, 2017

Logging REST API calls at log level 2 is too verbose (see below), especially since log level 2 is often the default. This change raises the verbosity level for these messages from 2 to 3.

I0203 12:37:27.059555   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.358954ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:27.780029   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (819.444µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844846   11483 wrap.go:75] POST /api/v1/namespaces/default/events: (11.337447ms) 201 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844851   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (9.998981ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:28.942863   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (923.711µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:28.944556   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.188942ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:29.061483   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (937.549µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:29.063068   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.111312ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:30.947922   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (935.198µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:30.950150   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.703438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:31.064883   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.127992ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:31.066503   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.098029ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:32.951691   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (945.295µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:32.953580   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.331822ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:33.068221   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (912.121µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:33.069787   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.120666ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:34.955546   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.02279ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:34.957812   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.661017ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:35.071528   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (935.155µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.073087   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.027371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.580075   11483 wrap.go:75] GET /apis/extensions/v1beta1/thirdpartyresources: (943.453µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.611659   11483 wrap.go:75] GET /api/v1/namespaces/default: (753.781µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.612516   11483 wrap.go:75] GET /api/v1/namespaces/default/services/kubernetes: (495.105µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.613167   11483 wrap.go:75] GET /api/v1/namespaces/default/endpoints/kubernetes: (379.568µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.960131   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.431137ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:36.963470   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (2.190438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:37.028185   11483 wrap.go:75] GET /api/v1/nodes: (1.34149ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/pod-garbage-collector] [::1]:35156]
I0203 12:37:37.074666   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (928.261µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.076314   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.240852ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.847163   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (725.021µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:37.901326   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (2.377445ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:38.968028   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (3.777083ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:38.970313   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.655815ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:39.077913   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (919.65µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:39.079617   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.196855ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:40.972171   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (936.07µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:40.973886   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.200048ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:41.084158   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (3.842758ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:41.085722   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.101371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1414813

@kubernetes/rh-cluster-infra

Fix #47916
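
For context, a minimal Go sketch (illustrative only, not the actual Kubernetes source) of the kind of one-line verbosity change this PR makes, assuming a glog-guarded request-logging wrapper similar in spirit to the apiserver's wrap.go/httplog path:

```go
package main

import (
	"flag"
	"net/http"
	"time"

	"github.com/golang/glog"
)

// logRequests wraps an http.Handler and logs each request, roughly in the
// style of the apiserver lines shown above. The change discussed in this PR
// amounts to moving the verbosity guard from glog.V(2) to glog.V(3).
func logRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		// Before: glog.V(2).Infof(...), emitted at the common default of --v=2.
		// After:  glog.V(3).Infof(...), only emitted when verbosity is raised to 3 or higher.
		glog.V(3).Infof("%s %s: (%v) [%s]", r.Method, r.URL.Path, time.Since(start), r.RemoteAddr)
	})
}

func main() {
	flag.Parse() // glog registers -v, -vmodule, -logtostderr, etc. as flags
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	glog.Fatal(http.ListenAndServe(":8080", logRequests(ok)))
}
```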

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed labels Feb 3, 2017
@eparis eparis added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Feb 3, 2017
@eparis (Contributor) commented Feb 3, 2017

Was this a change since 1.4?

@eparis (Contributor) commented Feb 3, 2017

@smarterclayton looks like you originally picked v(2) way back in 4e56daf

#1359

Who would need to approve moving to v(3)?

@smarterclayton (Contributor)

@lavalamp because we talked about this being an operational tool for normal function, and I suspect people rely on it in GKE and others at v2 today to get logging. Changing to v3 would break anyone who scrapes at v2.

That said, I agree this is very verbose.

@eparis (Contributor) commented Feb 3, 2017

On a cluster of any size doing real work it is overwhelmingly verbose :) whereas the other v(2) output isn't so bad...

@smarterclayton (Contributor)

My gut says it should be easy to opt out of this, and that most people, most of the time, don't need it (which is what level 3 was for). But because it has operational impact...

@gmarek (Contributor) commented Feb 6, 2017

cc @wojtek-t

I'm leaning against this PR. Those are probably the most useful logs in the whole system - I look at them pretty much every time I'm debugging anything. I agree that they are verbose, but they're not for direct human consumption without the power of grep.

@aveshagarwal - why do you want to make this change? Is the size of the logs the problem?

@aveshagarwal (Member, Author)

@gmarek I agree these are useful logs, but the question is: do we want them at --v=2?

@gmarek (Contributor) commented Feb 6, 2017

I think that we do. Logs are there to help figure out what's wrong, and those logs are extremely important for that. If someone wants no logging (aka 'I don't care'), there's always --v=0.

@smarterclayton (Contributor) commented Feb 6, 2017 via email

@gmarek (Contributor) commented Feb 7, 2017

I do think that 'every (sane) user wants them', but that's not really a formal definition :)

@gmarek (Contributor) commented Feb 8, 2017

In other words - I think we should close this PR.

@mffiedler

Level 2 usually provides enough info for devs to start the debugging process. Having that level sane really helps when hard-to-reproduce problems are encountered.

@smarterclayton (Contributor) commented Feb 8, 2017 via email

@gmarek (Contributor) commented Feb 23, 2017

I'm closing this PR. Ping me if you disagree.

@gmarek gmarek closed this Feb 23, 2017
@ncdc (Member) commented Apr 12, 2017

@gmarek we have been discussing this again today. I agree with you that these logs are useful when debugging issues. However, they have a negative performance impact when running large clusters. Do you think it's reasonable to proceed with this, with the expectation that for production clusters, admins would increase the log level from 2 to 3 or 4 if they need to debug issues? Otherwise, 2 would be the expected default for informational log messages but not spammy ones. Also, for PR jobs, we should be running at 3 or 4 if we're not already. WDYT?

@ncdc ncdc mentioned this pull request Apr 12, 2017
@gmarek (Contributor) commented Apr 12, 2017

No, level 3 is already too verbose. I understand the performance implications, but maybe we could get past them by dealing with the --logtostderr flag, which is most likely the main culprit (unbuffered writes)? I don't know what, if anything, the golang team is doing with the glog package, so I don't know whether that's possible...

Also, this is just my (admittedly very strong) opinion, so if enough important people agree that we should do this, then I won't be able to block it (though honestly - I'm pretty convinced that it'll cost you more debugging time than you'll save on CPU).

Can't you run the cluster with --v=1?

@wojtek-t @smarterclayton @thockin @lavalamp
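
As an aside on the --logtostderr point above: below is a minimal Go sketch of the general idea of buffering and periodically flushing stderr writes. This is not what glog does today; the types and names here are illustrative only.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"sync"
	"time"
)

// bufferedWriter coalesces many small log writes into fewer syscalls by
// buffering them and flushing on a timer, instead of writing straight to
// an unbuffered destination such as os.Stderr.
type bufferedWriter struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func newBufferedWriter(f *os.File, flushEvery time.Duration) *bufferedWriter {
	b := &bufferedWriter{w: bufio.NewWriterSize(f, 256*1024)}
	go func() {
		for range time.Tick(flushEvery) {
			b.mu.Lock()
			_ = b.w.Flush()
			b.mu.Unlock()
		}
	}()
	return b
}

func (b *bufferedWriter) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.w.Write(p)
}

func main() {
	// Route the standard library logger through the buffered stderr writer.
	log.SetOutput(newBufferedWriter(os.Stderr, time.Second))
	for i := 0; i < 1000; i++ {
		log.Printf("GET /api/v1/nodes: (sample %d) 200", i)
	}
	time.Sleep(2 * time.Second) // give the ticker a chance to flush before exit
}
```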

@ncdc (Member) commented Apr 12, 2017

@gmarek if we run the cluster at --v=1 and run into an issue, we almost always have to bump up to --v=2 (or higher) to truly debug. I suppose another option is to move the non-spammy messages we care about that are currently at 2 down to 1. But from what I've heard from @smarterclayton, we currently want to run at 2 by default. We are trying to eliminate overly spammy messages from 2.

My suggestion is that we move this to 3, run clusters at 2, and if there is an issue that needs debugging, you restart the components at 3 and try to reproduce the issue, if that's acceptable.

@gmarek (Contributor) commented Apr 12, 2017

The thing is that you never know when an issue will surface, so you never know when to do this.

@eparis (Contributor) commented Apr 12, 2017

For the most part I feel like v(2) has informational messages about a cluster that an admin cares about, while v(3) has engineering debug stuff. This is CLEARLY engineering debug stuff. Practically no admin wants to be spammed with these. Real-life example: on a small cluster (7 nodes with 190 pods) we see over 82,000 messages per hour from this, and fewer than 4,000 messages per hour from everything else.

This one source of spam accounts for roughly 20x more messages (82,000 vs. 4,000 per hour) than the rest of v(2) put together.

@eparis eparis reopened this Apr 12, 2017
@ncdc (Member) commented Apr 12, 2017

@gmarek it's a balance, and it's one that's hard to win :-) You either spam the cluster with logs that you don't need 99% of the time, or you don't, and it's harder to debug issues after the fact.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed keep-open labels Dec 15, 2017
@brendandburns (Contributor)

@gmarek @eparis do we have a conclusion here?

@cblecker (Member)

This appears to be somewhat active. Removing the lifecycle/frozen state.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 10, 2018
@liggitt (Member) commented Mar 26, 2018

> But IIUC it'll contain exactly the same information you're complaining about, so the problem will not go away, it'll just change the name.

Audit lets you select some resources and not others, and allows for async/batch processing to avoid bottlenecks.

I'm in favor of this change.

@soltysh (Contributor) commented Apr 6, 2018

/refresh

@cblecker (Member) commented Apr 6, 2018

/test all

@sjenning (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aveshagarwal, brendandburns, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubernetes kubernetes deleted a comment from k8s-github-robot Apr 11, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot (Contributor) commented Apr 12, 2018

@aveshagarwal: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-kubemark-e2e-gce
Commit: 6a5c248
Details: link
Rerun command: /test pull-kubernetes-kubemark-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 40933, 62188). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 2b854df into kubernetes:master Apr 12, 2018
@lavalamp (Member) commented Apr 12, 2018

...I was writing a comment, but the bot merged this, apparently on a phantom approval.

I think a change like this deserved an announcement to the SIGs for people who run clusters. [edit: assuming this didn't happen already]

@wojtek-t (Member)

@shyamjvs @kubernetes/sig-scalability-bugs

I don't think we should have merged that - this will significantly affect our debugging ability (at least in the context of performance tests, where we can't really run tests at higher verbosity, since the amount of logs overloads the apiserver).

So I think we may want to consider reverting it and making the verbosity for those logs a flag (I'm fine with keeping the default at 3, but I would like to be able to modify it for the purpose of scalability tests).

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. kind/bug Categorizes issue or PR as related to a bug. labels Apr 12, 2018
@wojtek-t (Member)

I've opened #62468 so we don't forget about it.

@shyamjvs (Member)

Sorry, I missed this earlier. I definitely agree with @wojtek-t that this can greatly hurt debuggability, especially wrt performance/scalability issues.
I'd vote for 'revert + introduce flag' too.

@ncdc (Member) commented Apr 12, 2018

@shyamjvs please see the -vmodule comments in #62468. We probably don't need to revert this if it's acceptable to adjust the scalability logging flags.

@eparis (Contributor) commented Apr 12, 2018

I think Andy is right and this PR is right: -vmodule can make these logs available, but it cannot be used to hide them. vmodule only goes in one direction.
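
To make the -vmodule behavior concrete, here is a small illustrative Go example of the glog -v / -vmodule interaction (file names and messages are hypothetical): with -v=2, the V(2) line always fires, and -vmodule can only raise a matching file's threshold (e.g. -v=2 -vmodule=main=3 enables the V(3) line); it cannot lower it below the global -v.

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	flag.Parse() // glog registers -v, -vmodule, -logtostderr, ...

	// Fires when the effective verbosity for this file is >= 3,
	// e.g. with -v=3, or with -v=2 -vmodule=main=3.
	glog.V(3).Info("request log line, now behind V(3)")

	// With -v=2 this always fires: glog checks the global -v first, so a
	// lower per-file setting such as -vmodule=main=1 cannot silence it.
	// That is the "one direction" described above.
	glog.V(2).Info("informational line kept at V(2)")

	glog.Flush()
}
```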

@shyamjvs (Member) commented Apr 13, 2018

I've sent a PR enabling these logs through -vmodule for tests.
However, there's still a discussion under #62468 (comment) about whether it should be a prod default.

@aveshagarwal aveshagarwal deleted the master-rhbz-1414813 branch August 3, 2018 14:15
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
release-note-none Denotes a PR that doesn't merit a release note.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.