
RFC: Add status field to metrics object #5304

Closed
nak3 opened this issue Aug 27, 2019 · 15 comments · Fixed by #5395
Assignees
Labels
area/autoscale
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Well-understood/specified features, ready for coding.
kind/spec: Discussion of how a feature should be exposed to customers.
Milestone

Comments

@nak3
Contributor

nak3 commented Aug 27, 2019

In what area(s)?

/area autoscale
/kind spec
/kind proposal

Current issue

  • When the autoscaler fails to read metrics from a user's service, users cannot access the service.
  • The issue I want to report here is that there are neither error logs nor any clue in the user's namespace about what happened. (Only the autoscaler container in the knative-serving namespace has the log.)
  • The issue can be caused not only by an autoscaler bug but also by a user mistake (see the example steps I wrote in "Blocking ksvc's metrics port by user namespace makes autoscaler logs overflow" #5295).

Example steps & Actual result

The ksvc shows a READY status (revision, route, and configuration are all READY).

$ kubectl get ksvc -n serving-tests
NAME                                   URL                                                                     LATESTCREATED                                LATESTREADY           READY   REASON
hello-example                          http://hello-example.serving-tests.example.com                          hello-example-bcwzf                          hello-example-bcwzf   True

There are no event logs, and the user-container/queue-proxy logs do not contain any errors either.
But we fail to access the service, with the following error:

$ curl -H "Host: hello-example.serving-tests.example.com" http://${HOST_NAME}
upstream connect error or disconnect/reset before headers. reset reason: connection failure

So it is difficult to debug and find out what happened from the user's namespace.

Proposed change

  • The current metrics object does not have a status field:

(current output)

$ kubectl get metrics.autoscaling.internal.knative.dev -n serving-tests
NAME                                         READY   REASON
hello-example-bcwzf
  • I would like the metrics object to have a status field that shows an error message about the failure of metrics collection, as in the expected output below.

(expected output)

$ kubectl get metrics.autoscaling.internal.knative.dev -n serving-tests
NAME                                         READY   REASON
hello-example-bcwzf                          False

$ kubectl get metrics.autoscaling.internal.knative.dev hello-example-bcwzf -n serving-tests -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-08-27T13:12:43Z"
    message: 'unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics:
      read tcp 172.20.43.170:35584->172.20.17.213:9090: read: connection reset by
      peer'
    reason: test
    status: "False"
    type: Ready
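
For illustration, here is a minimal sketch of how such a Ready condition could be set, using the condition-set helpers from knative.dev/pkg that other Knative objects already follow. The MetricStatus shape and the Mark* helper names are assumptions for the sake of the example, not an existing API:

package v1alpha1

import (
	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

// MetricStatus is assumed to embed the standard duck-typed status,
// like other Knative resources.
type MetricStatus struct {
	duckv1.Status `json:",inline"`
}

// MetricConditionReady is the condition surfaced in the READY column.
const MetricConditionReady = apis.ConditionReady

var metricCondSet = apis.NewLivingConditionSet(MetricConditionReady)

// MarkMetricNotReady flips Ready to False with a machine-readable reason
// and a human-readable message, e.g. the scrape error shown above.
func (ms *MetricStatus) MarkMetricNotReady(reason, message string) {
	metricCondSet.Manage(ms).MarkFalse(MetricConditionReady, reason, message)
}

// MarkMetricReady flips Ready back to True once scraping succeeds again.
func (ms *MetricStatus) MarkMetricReady() {
	metricCondSet.Manage(ms).MarkTrue(MetricConditionReady)
}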

Event logs could be another solution, but they would produce too many entries, for the same reason as #5295.
I think this status field would be useful for the scenario above, but is there any different approach or suggestion?

@nak3 nak3 added the kind/feature Well-understood/specified features, ready for coding. label Aug 27, 2019
@knative-prow-robot knative-prow-robot added area/autoscale kind/spec Discussion of how a feature should be exposed to customers. labels Aug 27, 2019
@jonjohnsonjr
Contributor

It would be great for this to propagate up through our conditions, as per #5076.

@taragu
Contributor

taragu commented Sep 3, 2019

@nak3 @jonjohnsonjr can I work on this issue?

@nak3
Contributor Author

nak3 commented Sep 4, 2019

Sure, no problem & thank you @taragu

Just in case: I think the status update within this goroutine may need to be handled carefully, as it runs every second for every single revision in the cluster, and outside of the reconciler. This behavior is different from other objects.
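
A rough sketch of the kind of guard this implies (all wiring here is hypothetical): compare the desired status against what is already stored and skip the API-server write when nothing changed, so a once-per-second scrape tick does not become a once-per-second status update.

package metrics

import (
	"k8s.io/apimachinery/pkg/api/equality"

	"knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
)

// maybeUpdateStatus persists the desired status only when it differs from
// the stored one. `get` and `update` stand in for the generated Metric
// lister/clientset calls (hypothetical wiring).
func maybeUpdateStatus(
	desired *v1alpha1.Metric,
	get func(namespace, name string) (*v1alpha1.Metric, error),
	update func(*v1alpha1.Metric) (*v1alpha1.Metric, error),
) error {
	existing, err := get(desired.Namespace, desired.Name)
	if err != nil {
		return err
	}
	if equality.Semantic.DeepEqual(existing.Status, desired.Status) {
		return nil // No change: skip the write for this tick.
	}
	want := existing.DeepCopy()
	want.Status = desired.Status
	_, err = update(want)
	return err
}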

@taragu
Contributor

taragu commented Sep 4, 2019

I see. Thanks for the pointers @nak3 !
/assign

@vagababov
Contributor

/reopen
this is mostly not done.

@knative-prow-robot
Contributor

@vagababov: Reopened this issue.

In response to this:

/reopen
this is mostly not done.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vagababov added a commit to vagababov/serving that referenced this issue Sep 7, 2019
We should not pass logger to the scrape and other fixes.

/assign mattmoor @nak3
Part of knative#5304
knative-prow-robot pushed a commit that referenced this issue Sep 8, 2019
We should not pass logger to the scrape and other fixes.

/assign mattmoor @nak3
Part of #5304
@eallred-google eallred-google added this to the Needs Triage milestone Oct 23, 2019
@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2020
@markusthoemmes
Contributor

/kind good-first-issue
/good-first-issue

This is not super straightforward but I'm happy to mentor somebody who is willing to pick this up. We effectively need to surface changes to the status, which are done here:

stat, err := c.getScraper().Scrape(c.currentMetric().Spec.StableWindow)
if err != nil {
	copy := metric.DeepCopy()
	switch {
	case err == ErrFailedGetEndpoints:
		copy.Status.MarkMetricNotReady("NoEndpoints", ErrFailedGetEndpoints.Error())
	case err == ErrDidNotReceiveStat:
		copy.Status.MarkMetricFailed("DidNotReceiveStat", ErrDidNotReceiveStat.Error())
	default:
		copy.Status.MarkMetricNotReady("CreateOrUpdateFailed", "Collector has failed.")
	}
	logger.Errorw("Failed to scrape metrics", zap.Error(err))
	c.updateMetric(copy)
}
to the actual reconciler, so that it puts this status onto the object stored in the API server itself.
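
For concreteness, here is a sketch of what that reconciler-side write could look like. Collector.LatestError, MarkMetricReady, and the lister/clientset fields on the reconciler are assumptions for illustration, not the code as merged:

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/cache"
)

// Reconcile translates the collector's latest scrape result into a status
// condition and persists it, so status writes happen in the reconciler
// rather than in the per-second scrape goroutine.
func (r *reconciler) Reconcile(ctx context.Context, key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}

	original, err := r.metricLister.Metrics(namespace).Get(name)
	if apierrs.IsNotFound(err) {
		return nil // The Metric was deleted; nothing to reconcile.
	} else if err != nil {
		return err
	}

	metric := original.DeepCopy()
	if scrapeErr := r.collector.LatestError(types.NamespacedName{Namespace: namespace, Name: name}); scrapeErr != nil {
		metric.Status.MarkMetricNotReady("CollectionFailed", scrapeErr.Error())
	} else {
		metric.Status.MarkMetricReady()
	}

	// Only hit the API server when the status actually changed.
	if !equality.Semantic.DeepEqual(original.Status, metric.Status) {
		_, err = r.client.AutoscalingV1alpha1().Metrics(namespace).UpdateStatus(metric)
	}
	return err
}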

@knative-prow-robot
Contributor

@markusthoemmes:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/kind good-first-issue
/good-first-issue


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added kind/good-first-issue good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 6, 2020
@markusthoemmes
Contributor

/unassign @taragu

AFAIK you've been working on the first PR for this but you're no longer looking into this?

@taragu
Contributor

taragu commented Feb 6, 2020

@markusthoemmes yep I'm no longer working on this

@knative-housekeeping-robot

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

@knative-prow-robot knative-prow-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 8, 2020
@markusthoemmes
Contributor

/remove-lifecycle rotten

I wanted to pick this up for 0.14

@knative-prow-robot knative-prow-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 9, 2020
@markusthoemmes
Contributor

/assign

@markusthoemmes
Contributor

This is done as of d1c1e34 and was part of the 0.14 release.
