
Blocking ksvc's metrics port in a user namespace makes autoscaler logs overflow #5295

Closed
nak3 opened this issue Aug 27, 2019 · 4 comments
Labels: area/autoscale, kind/bug

@nak3
Contributor

nak3 commented Aug 27, 2019

In what area(s)?

/area autoscale

What version of Knative?

Built on top of c49080b

Steps to Reproduce the Problem

These steps are just one example of how to reproduce it.

1. Deploy a ksvc in your namespace with Istio sidecar injection disabled.

$ kn service create hello-example --image=gcr.io/knative-samples/helloworld-go
$ kubectl edit ksvc hello-example
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"

2. Block the autoscaler's access to the *.local 9090 port by forcing mTLS.

$ export YOUR_NAME_SPACE=default

$ cat <<EOF | kubectl apply -f -
apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "mtls-services"
  namespace: "$YOUR_NAME_SPACE"
spec:
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
EOF

This makes the autoscaler fail to access the ksvc's 9090 port, because hello-example has no Istio sidecar and therefore cannot handle the mTLS handshake.
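For illustration, here is a rough sketch of the kind of request the stats scraper makes, assuming it runs from a pod without a sidecar; the hostname is the one that appears in the error logs below, so adjust it for your own revision and namespace:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Hypothetical in-cluster check: plain HTTP GET against the revision's
	// metrics port, mirroring what the autoscaler's scraper does.
	resp, err := http.Get("http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics")
	if err != nil {
		// With the DestinationRule above forcing ISTIO_MUTUAL and no sidecar
		// on the revision pod, this plain-HTTP request is expected to fail
		// with "connection reset by peer".
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("scrape succeeded, %d bytes of metrics\n", len(body))
}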

Actual Behavior

  • In the autoscaler pod, error logs are produced every second.
$ kubectl -n knative-serving logs autoscaler-5654d98b98-l44j8 -c autoscaler
...
{"level":"error","ts":"2019-08-27T01:27:11.595Z","logger":"autoscaler","caller":"autoscaler/collector.go:266","msg":"Failed to scrape metrics","commit":"0b9eb27","error":"unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41492->172.20.17.213:9090: read: connection reset by peer","errorVerbose":"Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41492->172.20.17.213:9090: read: connection reset by peer\nunsuccessful scrape, sampleSize=1\nknative.dev/serving/pkg/autoscaler.(*ServiceScraper).Scrape\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/stats_scraper.go:165\nknative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:263\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1337","stacktrace":"knative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:266"}
{"level":"error","ts":"2019-08-27T01:27:12.588Z","logger":"autoscaler","caller":"autoscaler/collector.go:266","msg":"Failed to scrape metrics","commit":"0b9eb27","error":"unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41536->172.20.17.213:9090: read: connection reset by peer","errorVerbose":"Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics: read tcp 172.20.89.26:41536->172.20.17.213:9090: read: connection reset by peer\nunsuccessful scrape, sampleSize=1\nknative.dev/serving/pkg/autoscaler.(*ServiceScraper).Scrape\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/stats_scraper.go:165\nknative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:263\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1337","stacktrace":"knative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/knakayam/.go/src/knative.dev/serving/pkg/autoscaler/collector.go:266"}
  • Note: this error is produced every second, multiplied by the number of ksvc revisions.

Root cause

  • The following loop runs every second and logs the error on every failed scrape, without any throttling.

for {
	select {
	case <-c.stopCh:
		scrapeTicker.Stop()
		return
	case <-scrapeTicker.C:
		message, err := c.getScraper().Scrape()
		if err != nil {
			logger.Errorw("Failed to scrape metrics", zap.Error(err))
		}
		if message != nil {
			c.record(message.Stat)
		}
	}
}

@nak3 added the kind/bug label Aug 27, 2019
@vagababov
Contributor

So do you suggest removing the logging statement? Or some real solution? If we can't read metrics we can't scale (i.e. the system doesn't work), so logging is the least of our concerns.

@nak3
Contributor Author

nak3 commented Aug 27, 2019

Yes, I agree that logging is the least of our concerns.

I was thinking about zap's SamplingConfig. Currently the fixed default values:

Sampling: &SamplingConfig{
	Initial:    100,
	Thereafter: 100,
},

are used, but I wonder whether we want to make it configurable.
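A minimal sketch of what a configurable setup could look like, assuming the logger is built from zap.Config; buildSampledLogger is a hypothetical helper and the Initial/Thereafter values are only illustrative:

package main

import (
	"go.uber.org/zap"
)

// buildSampledLogger builds a production zap logger whose sampling
// thresholds are supplied by the caller instead of the fixed
// Initial/Thereafter = 100 defaults.
func buildSampledLogger(initial, thereafter int) (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	// Per second and per message: log the first `initial` entries,
	// then only every `thereafter`-th entry after that.
	cfg.Sampling = &zap.SamplingConfig{
		Initial:    initial,
		Thereafter: thereafter,
	}
	return cfg.Build()
}

func main() {
	// Example: keep the first 10 identical errors each second,
	// then only every 1000th one after that.
	logger, err := buildSampledLogger(10, 1000)
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Error("Failed to scrape metrics") // repeated entries would be sampled
}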

@nak3
Contributor Author

nak3 commented Aug 27, 2019

Umm... I misunderstood Thereafter: 100. With the default setting, zap logs the first 100 identical entries per second and only every 100th one after that, so the autoscaler produces roughly 100 log lines per second per message in the worst case. That is probably acceptable even with the unconfigurable SamplingConfig(?)

@nak3
Contributor Author

nak3 commented Sep 3, 2019

I was concerned about CPU/memory consumption when the log overflows, so I quickly collected profile data with 100 services. (Sorry, the sample may be too small, but my personal cluster is limited 😰)

Result

Memory: 22MiB to 33MiB
CPU: 0.02 to 0.08

[autoscaler profiling graph]

Based on the results above, it does not look critical, so I am closing this issue.

Thank you @vagababov and sorry for bothering you.

@nak3 closed this as completed Sep 3, 2019