
RFC: Add status field to metrics object #5304

Closed
nak3 opened this issue Aug 27, 2019 · 15 comments · Fixed by #5395
Assignees
Labels
area/autoscale
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Well-understood/specified features, ready for coding.
kind/spec: Discussion of how a feature should be exposed to customers.
Milestone

Comments

@nak3
Contributor

nak3 commented Aug 27, 2019

In what area(s)?

/area autoscale
/kind spec
/kind proposal

Current issue

  • When the autoscaler fails to read metrics from a user's service, users cannot access the service.
  • The issue I want to report here is that there are neither error logs nor any clue in the user's namespace about what happened. (Only the autoscaler container in the knative-serving namespace has the log.)
  • The issue can be caused not only by an autoscaler bug but also by a user mistake (see the example steps I wrote in "Blocking ksvc's metrics port by user namespace makes autoscaler logs overflow" #5295).

Example steps & Actual result

The ksvc shows a READY status (revision, route, and configuration are all READY).

$ kubectl get ksvc -n serving-tests
NAME                                   URL                                                                     LATESTCREATED                                LATESTREADY           READY   REASON
hello-example                          http://hello-example.serving-tests.example.com                          hello-example-bcwzf                          hello-example-bcwzf   True

There are no event logs, and the user-container/queue-proxy logs do not contain any errors either.
But we fail to access the service, with the following error:

$ curl -H "Host: hello-example.serving-tests.example.com" http://${HOST_NAME}
upstream connect error or disconnect/reset before headers. reset reason: connection failure

So it is difficult to debug and find out what happened from the user's namespace.

Proposed change

  • The current metrics object does not have a status field:

(current output)

$ kubectl get metrics.autoscaling.internal.knative.dev -n serving-tests
NAME                                         READY   REASON
hello-example-bcwzf
  • I would like the metrics object to have a status field that shows an error message about the failure of metrics collection, as in the expected output below.

(expected output)

$ kubectl get metrics.autoscaling.internal.knative.dev -n serving-tests
NAME                                         READY   REASON
hello-example-bcwzf                          False

$ kubectl get metrics.autoscaling.internal.knative.dev hello-example-bcwzf -n serving-tests -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-08-27T13:12:43Z"
    message: 'unsuccessful scrape, sampleSize=1: Get http://hello-example-bcwzf-rzzpm.serving-tests:9090/metrics:
      read tcp 172.20.43.170:35584->172.20.17.213:9090: read: connection reset by
      peer'
    reason: test
    status: "False"
    type: Ready
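
For illustration, here is a minimal sketch of how such a Ready condition could be set, using the condition-set helpers from knative.dev/pkg that other Knative objects already follow. The MetricStatus shape and the Mark* helper names are assumptions for the sake of the example, not an existing API:

package v1alpha1

import (
	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

// MetricStatus is assumed to embed the standard duck-typed status,
// like other Knative resources.
type MetricStatus struct {
	duckv1.Status `json:",inline"`
}

// MetricConditionReady is the condition surfaced in the READY column.
const MetricConditionReady = apis.ConditionReady

var metricCondSet = apis.NewLivingConditionSet(MetricConditionReady)

// MarkMetricNotReady flips Ready to False with a machine-readable reason
// and a human-readable message, e.g. the scrape error shown above.
func (ms *MetricStatus) MarkMetricNotReady(reason, message string) {
	metricCondSet.Manage(ms).MarkFalse(MetricConditionReady, reason, message)
}

// MarkMetricReady flips Ready back to True once scraping succeeds again.
func (ms *MetricStatus) MarkMetricReady() {
	metricCondSet.Manage(ms).MarkTrue(MetricConditionReady)
}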

Event logs could be another solution, but they would produce too many entries, for the same reason as #5295.
I think this status field would be useful for the scenario above, but is there any different approach or suggestion?

@nak3 nak3 added the kind/feature Well-understood/specified features, ready for coding. label Aug 27, 2019
@knative-prow-robot knative-prow-robot added area/autoscale kind/spec Discussion of how a feature should be exposed to customers. labels Aug 27, 2019
@jonjohnsonjr
Contributor

It would be great for this to propagate up through our conditions, as per #5076.

@taragu
Contributor

taragu commented Sep 3, 2019

@nak3 @jonjohnsonjr can I work on this issue?

@nak3
Contributor Author

nak3 commented Sep 4, 2019

Sure, no problem & thank you @taragu

Just in case: I think the status update within this goroutine may need to be handled carefully, as it runs every second for every single revision in the cluster, and outside of the reconciler. This behavior is different from other objects.
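
A rough sketch of the kind of guard this implies (all wiring here is hypothetical): compare the desired status against what is already stored and skip the API-server write when nothing changed, so a once-per-second scrape tick does not become a once-per-second status update.

package metrics

import (
	"k8s.io/apimachinery/pkg/api/equality"

	"knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
)

// maybeUpdateStatus persists the desired status only when it differs from
// the stored one. `get` and `update` stand in for the generated Metric
// lister/clientset calls (hypothetical wiring).
func maybeUpdateStatus(
	desired *v1alpha1.Metric,
	get func(namespace, name string) (*v1alpha1.Metric, error),
	update func(*v1alpha1.Metric) (*v1alpha1.Metric, error),
) error {
	existing, err := get(desired.Namespace, desired.Name)
	if err != nil {
		return err
	}
	if equality.Semantic.DeepEqual(existing.Status, desired.Status) {
		return nil // No change: skip the write for this tick.
	}
	want := existing.DeepCopy()
	want.Status = desired.Status
	_, err = update(want)
	return err
}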

@taragu
Contributor

taragu commented Sep 4, 2019

I see. Thanks for the pointers @nak3 !
/assign

@vagababov
Contributor

/reopen
this is mostly not done.

@knative-prow-robot
Contributor

@vagababov: Reopened this issue.

In response to this:

/reopen
this is mostly not done.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vagababov added a commit to vagababov/serving that referenced this issue Sep 7, 2019
We should not pass logger to the scrape and other fixes.

/assign mattmoor @nak3
Part of knative#5304
knative-prow-robot pushed a commit that referenced this issue Sep 8, 2019
We should not pass logger to the scrape and other fixes.

/assign mattmoor @nak3
Part of #5304
@eallred-google eallred-google added this to the Needs Triage milestone Oct 23, 2019
@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2020
@markusthoemmes
Contributor

/kind good-first-issue
/good-first-issue

This is not super straightforward but I'm happy to mentor somebody who is willing to pick this up. We effectively need to surface changes to the status, which are done here:

stat, err := c.getScraper().Scrape(c.currentMetric().Spec.StableWindow)
if err != nil {
	copy := metric.DeepCopy()
	switch {
	case err == ErrFailedGetEndpoints:
		copy.Status.MarkMetricNotReady("NoEndpoints", ErrFailedGetEndpoints.Error())
	case err == ErrDidNotReceiveStat:
		copy.Status.MarkMetricFailed("DidNotReceiveStat", ErrDidNotReceiveStat.Error())
	default:
		copy.Status.MarkMetricNotReady("CreateOrUpdateFailed", "Collector has failed.")
	}
	logger.Errorw("Failed to scrape metrics", zap.Error(err))
	c.updateMetric(copy)
}
to the actual reconciler, so that it puts this status onto the object stored in the API server itself.
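
For concreteness, here is a sketch of what that reconciler-side write could look like. Collector.LatestError, MarkMetricReady, and the lister/clientset fields on the reconciler are assumptions for illustration, not the code as merged:

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/cache"
)

// Reconcile translates the collector's latest scrape result into a status
// condition and persists it, so status writes happen in the reconciler
// rather than in the per-second scrape goroutine.
func (r *reconciler) Reconcile(ctx context.Context, key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}

	original, err := r.metricLister.Metrics(namespace).Get(name)
	if apierrs.IsNotFound(err) {
		return nil // The Metric was deleted; nothing to reconcile.
	} else if err != nil {
		return err
	}

	metric := original.DeepCopy()
	if scrapeErr := r.collector.LatestError(types.NamespacedName{Namespace: namespace, Name: name}); scrapeErr != nil {
		metric.Status.MarkMetricNotReady("CollectionFailed", scrapeErr.Error())
	} else {
		metric.Status.MarkMetricReady()
	}

	// Only hit the API server when the status actually changed.
	if !equality.Semantic.DeepEqual(original.Status, metric.Status) {
		_, err = r.client.AutoscalingV1alpha1().Metrics(namespace).UpdateStatus(metric)
	}
	return err
}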

@knative-prow-robot
Contributor

@markusthoemmes:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/kind good-first-issue
/good-first-issue


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added kind/good-first-issue good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 6, 2020
@markusthoemmes
Contributor

/unassign @taragu

AFAIK you've been working on the first PR for this but you're no longer looking into this?

@taragu
Contributor

taragu commented Feb 6, 2020

@markusthoemmes yep I'm no longer working on this

@knative-housekeeping-robot

Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle rotten

@knative-prow-robot knative-prow-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 8, 2020
@markusthoemmes
Contributor

/remove-lifecycle rotten

I wanted to pick this up for 0.14

@knative-prow-robot knative-prow-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 9, 2020
@markusthoemmes
Contributor

/assign

@markusthoemmes
Contributor

This is done as of d1c1e34 and was part of the 0.14 release.
