
Fix to avoid logging REST API calls at log level 2 #40933

Merged

Conversation

@aveshagarwal (Member) commented Feb 3, 2017

Logging REST API calls at log level 2 is too verbose (see below), especially since log level 2 is often the default. This change raises the verbosity level for these messages from 2 to 3.

I0203 12:37:27.059555   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.358954ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:27.780029   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (819.444µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844846   11483 wrap.go:75] POST /api/v1/namespaces/default/events: (11.337447ms) 201 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844851   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (9.998981ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:28.942863   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (923.711µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:28.944556   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.188942ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:29.061483   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (937.549µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:29.063068   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.111312ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:30.947922   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (935.198µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:30.950150   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.703438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:31.064883   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.127992ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:31.066503   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.098029ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:32.951691   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (945.295µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:32.953580   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.331822ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:33.068221   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (912.121µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:33.069787   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.120666ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:34.955546   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.02279ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:34.957812   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.661017ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:35.071528   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (935.155µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.073087   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.027371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.580075   11483 wrap.go:75] GET /apis/extensions/v1beta1/thirdpartyresources: (943.453µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.611659   11483 wrap.go:75] GET /api/v1/namespaces/default: (753.781µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.612516   11483 wrap.go:75] GET /api/v1/namespaces/default/services/kubernetes: (495.105µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.613167   11483 wrap.go:75] GET /api/v1/namespaces/default/endpoints/kubernetes: (379.568µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.960131   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.431137ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:36.963470   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (2.190438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:37.028185   11483 wrap.go:75] GET /api/v1/nodes: (1.34149ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/pod-garbage-collector] [::1]:35156]
I0203 12:37:37.074666   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (928.261µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.076314   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.240852ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.847163   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (725.021µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:37.901326   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (2.377445ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:38.968028   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (3.777083ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:38.970313   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.655815ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:39.077913   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (919.65µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:39.079617   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.196855ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:40.972171   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (936.07µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:40.973886   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.200048ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:41.084158   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (3.842758ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:41.085722   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.101371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1414813

@kubernetes/rh-cluster-infra

Fix #47916
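
For context, a minimal Go sketch (illustrative only, not the actual Kubernetes source) of the kind of one-line verbosity change this PR makes, assuming a glog-guarded request-logging wrapper similar in spirit to the apiserver's wrap.go/httplog path:

```go
package main

import (
	"flag"
	"net/http"
	"time"

	"github.com/golang/glog"
)

// logRequests wraps an http.Handler and logs each request, roughly in the
// style of the apiserver lines shown above. The change discussed in this PR
// amounts to moving the verbosity guard from glog.V(2) to glog.V(3).
func logRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		// Before: glog.V(2).Infof(...), emitted at the common default of --v=2.
		// After:  glog.V(3).Infof(...), only emitted when verbosity is raised to 3 or higher.
		glog.V(3).Infof("%s %s: (%v) [%s]", r.Method, r.URL.Path, time.Since(start), r.RemoteAddr)
	})
}

func main() {
	flag.Parse() // glog registers -v, -vmodule, -logtostderr, etc. as flags
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	glog.Fatal(http.ListenAndServe(":8080", logRequests(ok)))
}
```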

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed labels Feb 3, 2017
@eparis eparis added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Feb 3, 2017
@eparis (Contributor) commented Feb 3, 2017

Was this a change since 1.4?

@eparis (Contributor) commented Feb 3, 2017

@smarterclayton looks like you originally picked v(2) way back in 4e56daf

#1359

Who would need to approve moving to v(3)?

@smarterclayton (Contributor)

@lavalamp because we talked about this being an operational tool for normal function, and I suspect people rely on it in GKE and others at v2 today to get logging. Changing to v3 would break anyone who scrapes at v2.

That said, I agree this is very verbose.

@eparis (Contributor) commented Feb 3, 2017

On a cluster of any size doing real work it is overwhelmingly verbose :) whereas the other v(2) output isn't so bad...

@smarterclayton (Contributor)

My gut says it should be easy to opt out of this, and that most people, most of the time, don't need it (which is what level 3 was for). But because it has operational impact...

@gmarek (Contributor) commented Feb 6, 2017

cc @wojtek-t

I'm leaning against this PR. Those are probably the most useful logs in the whole system - I look at them pretty much every time I'm debugging anything. I agree that they are verbose, but they're not for direct human consumption without the power of grep.

@aveshagarwal - why do you want to make this change? Is the size of the logs the problem?

@aveshagarwal (Member, Author)

@gmarek I agree these are useful logs, but the question is: do we want them at --v=2?

@gmarek (Contributor) commented Feb 6, 2017

I think that we do. Logs are there to help figure out what's wrong, and those logs are extremely important for that. If someone wants no logging (aka 'I don't care'), there's always --v=0.

@smarterclayton (Contributor) commented Feb 6, 2017 via email

@gmarek (Contributor) commented Feb 7, 2017

I do think that 'every (sane) user wants them', but that's not really a formal definition :)

@gmarek (Contributor) commented Feb 8, 2017

In other words - I think we should close this PR.

@mffiedler

Level 2 usually provides enough info for devs to start the debugging process. Having that level sane really helps when hard-to-reproduce problems are encountered.

@smarterclayton (Contributor) commented Feb 8, 2017 via email

@gmarek (Contributor) commented Feb 23, 2017

I'm closing this PR. Ping me if you disagree.

@gmarek gmarek closed this Feb 23, 2017
@ncdc (Member) commented Apr 12, 2017

@gmarek we have been discussing this again today. I agree with you that these logs are useful when debugging issues. However, they have a negative performance impact when running large clusters. Do you think it's reasonable to proceed with this, with the expectation that for production clusters, admins would increase the log level from 2 to 3 or 4 if they need to debug issues? Otherwise, 2 would be the expected default for informational log messages but not spammy ones. Also, for PR jobs, we should be running at 3 or 4 if we're not already. WDYT?

@ncdc ncdc mentioned this pull request Apr 12, 2017
@gmarek (Contributor) commented Apr 12, 2017

No, level 3 is already too verbose. I understand the performance implications, but maybe we could get past them by dealing with the --logtostderr flag, which is most likely the main culprit (unbuffered writes)? I don't know what, if anything, the golang team is doing with the glog package, so I don't know whether that's possible...

Also, this is just my (admittedly very strong) opinion, so if enough important people agree that we should do this, then I won't be able to block it (though honestly - I'm pretty convinced that it'll cost you more debugging time than you'll save on CPU).

Can't you run the cluster with --v=1?

@wojtek-t @smarterclayton @thockin @lavalamp
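
As an aside on the --logtostderr point above: below is a minimal Go sketch of the general idea of buffering and periodically flushing stderr writes. This is not what glog does today; the types and names here are illustrative only.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"sync"
	"time"
)

// bufferedWriter coalesces many small log writes into fewer syscalls by
// buffering them and flushing on a timer, instead of writing straight to
// an unbuffered destination such as os.Stderr.
type bufferedWriter struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func newBufferedWriter(f *os.File, flushEvery time.Duration) *bufferedWriter {
	b := &bufferedWriter{w: bufio.NewWriterSize(f, 256*1024)}
	go func() {
		for range time.Tick(flushEvery) {
			b.mu.Lock()
			_ = b.w.Flush()
			b.mu.Unlock()
		}
	}()
	return b
}

func (b *bufferedWriter) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.w.Write(p)
}

func main() {
	// Route the standard library logger through the buffered stderr writer.
	log.SetOutput(newBufferedWriter(os.Stderr, time.Second))
	for i := 0; i < 1000; i++ {
		log.Printf("GET /api/v1/nodes: (sample %d) 200", i)
	}
	time.Sleep(2 * time.Second) // give the ticker a chance to flush before exit
}
```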

@ncdc (Member) commented Apr 12, 2017

@gmarek if we run the cluster at --v=1 and run into an issue, we almost always have to bump up to --v=2 (or higher) to truly debug. I suppose another option is to move the non-spammy messages we care about that are currently at 2 down to 1. But from what I've heard from @smarterclayton, we currently want to run at 2 by default. We are trying to eliminate overly spammy messages from 2.

My suggestion is that we move this to 3, run clusters at 2, and if there is an issue that needs debugging, you restart the components at 3 and try to reproduce the issue, if that's acceptable.

@gmarek (Contributor) commented Apr 12, 2017

The thing is that you never know when an issue will surface, so you never know when to do this.

@eparis (Contributor) commented Apr 12, 2017

For the most part I feel like v(2) has informational messages about a cluster that an admin cares about, while v(3) has engineering debug stuff. This is CLEARLY engineering debug stuff. Practically no admin wants to be spammed with these. Real-life example: on a small cluster (7 nodes with 190 pods) we see over 82,000 messages per hour from this, and fewer than 4,000 messages per hour from everything else.

This one source of spam accounts for roughly 20x more messages (82,000 vs. 4,000 per hour) than the rest of v(2) put together.

@eparis eparis reopened this Apr 12, 2017
@ncdc (Member) commented Apr 12, 2017

@gmarek it's a balance, and it's one that's hard to win :-) You either spam the cluster with logs that you don't need 99% of the time, or you don't, and it's harder to debug issues after the fact.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed keep-open labels Dec 15, 2017
@brendandburns (Contributor)

@gmarek @eparis do we have a conclusion here?

@cblecker (Member)

This appears to be somewhat active. Removing the lifecycle/frozen state.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 10, 2018
@liggitt (Member) commented Mar 26, 2018

> But IIUC it'll contain exactly the same information you're complaining about, so the problem will not go away, it'll just change the name.

Audit lets you select some resources and not others, and allows for async/batch processing to avoid bottlenecks.

I'm in favor of this change.

@soltysh (Contributor) commented Apr 6, 2018

/refresh

@cblecker (Member) commented Apr 6, 2018

/test all

@sjenning (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aveshagarwal, brendandburns, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubernetes kubernetes deleted a comment from k8s-github-robot Apr 11, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot (Contributor) commented Apr 12, 2018

@aveshagarwal: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-kubemark-e2e-gce
Commit: 6a5c248
Details: link
Rerun command: /test pull-kubernetes-kubemark-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 40933, 62188). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 2b854df into kubernetes:master Apr 12, 2018
@lavalamp (Member) commented Apr 12, 2018

...I was writing a comment, but the bot merged this, apparently on a phantom approval.

I think a change like this deserved an announcement to the SIGs for people who run clusters. [edit: assuming this didn't happen already]

@wojtek-t (Member)

@shyamjvs @kubernetes/sig-scalability-bugs

I don't think we should have merged that - this will significantly affect our debugging ability (at least in the context of performance tests, where we can't really run tests at higher verbosity, since the amount of logs overloads the apiserver).

So I think we may want to consider reverting it and making the verbosity for those logs a flag (I'm fine with keeping the default at 3, but I would like to be able to modify it for the purpose of scalability tests).

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. kind/bug Categorizes issue or PR as related to a bug. labels Apr 12, 2018
@wojtek-t (Member)

I've opened #62468 so we don't forget about it.

@shyamjvs (Member)

Sorry, I missed this earlier. I definitely agree with @wojtek-t that this can greatly hurt debuggability, especially wrt performance/scalability issues.
I'd vote for 'revert + introduce flag' too.

@ncdc (Member) commented Apr 12, 2018

@shyamjvs please see the -vmodule comments in #62468. We probably don't need to revert this if it's acceptable to adjust the scalability logging flags.

@eparis (Contributor) commented Apr 12, 2018

I think Andy is right and this PR is right: -vmodule can make these logs available, but it cannot be used to hide them. vmodule only goes in one direction.
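
To make the -vmodule behavior concrete, here is a small illustrative Go example of the glog -v / -vmodule interaction (file names and messages are hypothetical): with -v=2, the V(2) line always fires, and -vmodule can only raise a matching file's threshold (e.g. -v=2 -vmodule=main=3 enables the V(3) line); it cannot lower it below the global -v.

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	flag.Parse() // glog registers -v, -vmodule, -logtostderr, ...

	// Fires when the effective verbosity for this file is >= 3,
	// e.g. with -v=3, or with -v=2 -vmodule=main=3.
	glog.V(3).Info("request log line, now behind V(3)")

	// With -v=2 this always fires: glog checks the global -v first, so a
	// lower per-file setting such as -vmodule=main=1 cannot silence it.
	// That is the "one direction" described above.
	glog.V(2).Info("informational line kept at V(2)")

	glog.Flush()
}
```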

@shyamjvs (Member) commented Apr 13, 2018

I've sent a PR enabling these logs through -vmodule for tests.
However, there's still a discussion under #62468 (comment) about whether it should be a prod default.

@aveshagarwal aveshagarwal deleted the master-rhbz-1414813 branch August 3, 2018 14:15
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
release-note-none Denotes a PR that doesn't merit a release note.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.