
Blocking ksvc's metrics port in a user namespace makes autoscaler logs overflow #5295

Closed
nak3 opened this issue Aug 27, 2019 · 4 comments
Labels: area/autoscale, kind/bug

@nak3
Contributor

nak3 commented Aug 27, 2019

In what area(s)?

/area autoscale

What version of Knative?

Built on top of c49080b

Steps to Reproduce the Problem

These steps are just one example of how to reproduce it.

1. Deploy a ksvc in your namespace with Istio sidecar injection disabled.

$ kn service create hello-example --image=gcr.io/knative-samples/helloworld-go
$ kubectl edit ksvc hello-example
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"

2. Block the autoscaler's access to the *.local 9090 port by forcing mTLS.

$ export YOUR_NAME_SPACE=default

$ cat <<EOF | kubectl apply -f -
apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "mtls-services"
  namespace: "$YOUR_NAME_SPACE"
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
EOF

This makes the autoscaler fail to access the ksvc's 9090 port, because hello-example has no Istio sidecar and therefore cannot handle the mTLS handshake.
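For illustration, here is a rough sketch of the kind of request the stats scraper makes, assuming it runs from a pod without a sidecar; the hostname is the one that appears in the error logs below, so adjust it for your own revision and namespace:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Hypothetical in-cluster check: plain HTTP GET against the revision's
	// metrics port, mirroring what the autoscaler's scraper does.
	resp, err := http.Get("http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics")
	if err != nil {
		// With the DestinationRule above forcing ISTIO_MUTUAL and no sidecar
		// on the revision pod, this plain-HTTP request is expected to fail
		// with "connection reset by peer".
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("scrape succeeded, %d bytes of metrics\n", len(body))
}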

Actual Behavior

  • In the autoscaler pod, error logs are produced every second.
$ kubectl -n knative-serving logs autoscaler-5654d98b98-l44j8 -c autoscaler
...
{"level":"error","ts":"2019-08-27T01:27:11.595Z","logger":"autoscaler","caller":"autoscaler/collector.go:266","msg":"Failed to scrape metrics","commit":"0b9eb27","error":"unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41492->172.20.17.213:9090: read: connection reset by peer","errorVerbose":"Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41492->172.20.17.213:9090: read: connection reset by peer\nunsuccessful scrape, sampleSize=1\nknative.dev/serving/pkg/autoscaler.(*ServiceScraper).Scrape\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/stats_scraper.go:165\nknative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:263\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1337","stacktrace":"knative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:266"}
{"level":"error","ts":"2019-08-27T01:27:12.588Z","logger":"autoscaler","caller":"autoscaler/collector.go:266","msg":"Failed to scrape metrics","commit":"0b9eb27","error":"unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41536->172.20.17.213:9090: read: connection reset by peer","errorVerbose":"Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41536->172.20.17.213:9090: read: connection reset by peer\nunsuccessful scrape, sampleSize=1\nknative.dev/serving/pkg/autoscaler.(*ServiceScraper).Scrape\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/stats_scraper.go:165\nknative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:263\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1337","stacktrace":"knative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:266"}
  • Note: this error is produced every second, multiplied by the number of ksvc revisions.

Root cause

  • The following loop runs every second and logs the error on every failed scrape, without any throttling.

for {
	select {
	case <-c.stopCh:
		scrapeTicker.Stop()
		return
	case <-scrapeTicker.C:
		message, err := c.getScraper().Scrape()
		if err != nil {
			logger.Errorw("Failed to scrape metrics", zap.Error(err))
		}
		if message != nil {
			c.record(message.Stat)
		}
	}
}

@nak3 added the kind/bug label Aug 27, 2019
@vagababov
Contributor

So do you suggest removing the logging statement? Or some real solution? If we can't read metrics we can't scale (i.e. the system doesn't work), so logging is the least of our concerns.

@nak3
Contributor Author

nak3 commented Aug 27, 2019

Yes, I agree that logging is the least of our concerns.

I was thinking about zap's SamplingConfig. Currently the fixed default values:

Sampling: &SamplingConfig{
	Initial:    100,
	Thereafter: 100,
},

are used, but I wonder whether we want to make it configurable.
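A minimal sketch of what a configurable setup could look like, assuming the logger is built from zap.Config; buildSampledLogger is a hypothetical helper and the Initial/Thereafter values are only illustrative:

package main

import (
	"go.uber.org/zap"
)

// buildSampledLogger builds a production zap logger whose sampling
// thresholds are supplied by the caller instead of the fixed
// Initial/Thereafter = 100 defaults.
func buildSampledLogger(initial, thereafter int) (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	// Per second and per message: log the first `initial` entries,
	// then only every `thereafter`-th entry after that.
	cfg.Sampling = &zap.SamplingConfig{
		Initial:    initial,
		Thereafter: thereafter,
	}
	return cfg.Build()
}

func main() {
	// Example: keep the first 10 identical errors each second,
	// then only every 1000th one after that.
	logger, err := buildSampledLogger(10, 1000)
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Error("Failed to scrape metrics") // repeated entries would be sampled
}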

@nak3
Contributor Author

nak3 commented Aug 27, 2019

Umm... I misunderstood Thereafter: 100. With the default setting, zap logs the first 100 identical entries per second and only every 100th one after that, so the autoscaler produces roughly 100 log lines per second per message in the worst case. That is probably acceptable even with the unconfigurable SamplingConfig(?)

@nak3
Contributor Author

nak3 commented Sep 3, 2019

I was concerned about CPU/memory consumption when the log overflows, so I quickly collected profile data with 100 services. (Sorry, the sample may be too small, but my personal cluster is limited 😰)

Result

Memory: 22MiB to 33MiB
CPU: 0.02 to 0.08

[autoscaler profiling graph]

Based on the results above, it does not look critical, so I am closing this issue.

Thank you @vagababov and sorry for bothering you.

@nak3 closed this as completed Sep 3, 2019