Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve rack middlewares #32

Merged
merged 2 commits into from
Feb 25, 2017
Merged

Improve rack middlewares #32

merged 2 commits into from
Feb 25, 2017

Conversation

grobie
Copy link
Member

@grobie grobie commented Sep 9, 2016

  • Move from prometheus/client/rack/* to prometheus/middleware/*
  • Use histogram instead of summary in collector middleware
  • Spepicfy collector metrics by changing the prefix to http_server_*
  • Remove HTTP Host header from default label builder

Fixes #27.

@tmc and @mikeurbach as users of the library.
@brian-brazil for comments on the default metrics names and help strings.

# TYPE http_server_requests_total counter
# HELP http_server_requests_total The total number of HTTP requests handled by the Rack application.
http_server_requests_total{method="get",path="/",code="200"} 74
http_server_requests_total{method="post",path="/",code="200"} 64
http_server_requests_total{method="delete",path="/",code="404"} 18
http_server_requests_total{method="delete",path="/",code="200"} 61
http_server_requests_total{method="get",path="/",code="404"} 8
http_server_requests_total{method="post",path="/",code="404"} 15
# TYPE http_server_request_latency_seconds histogram
# HELP http_server_request_latency_seconds The HTTP response latency of the Rack application.
http_server_request_latency_seconds{method="get",path="/",code="200",le="0.005"} 74
http_server_request_latency_seconds{method="get",path="/",code="200",le="0.01"} 74
http_server_request_latency_seconds{method="get",path="/",code="200",le="0.025"} 74
...
# TYPE http_server_exceptions_total counter
# HELP http_server_exceptions_total The total number of exceptions raised by the Rack application.
http_server_exceptions_total{exception="NoMethodError"} 10

I was thinking about removing http_server_requests_total as it's included in the histogram, but I've seen beginners to be heavily confused using http_server_request_latency_seconds_count as counter. I guess it's not too harmful.

@coveralls
Copy link

coveralls commented Sep 9, 2016

Coverage Status

Coverage decreased (-0.002%) to 99.087% when pulling 271d793 on grobie/middleware-rewrite into 5e16541 on master.

def init_request_metrics
@requests = @registry.counter(
:http_server_requests_total,
'The total number of HTTP requests handled by the Rack application.',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's from rack, then the metric name should contain "rack"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rack is the de facto standard interface for ruby HTTP servers. It can be implemented by many different servers (unicorn, thin, puma, ...). I thought about including it, but concluded after some discussions that including it in the metric name is neither expected nor helpful.

@olleolleolle
Copy link
Contributor

Thanks a lot for improving the Rack middleware.

As a novice Prometheus user providing data from a Rack app, it was nice to see a middleware.

As I got it up and presenting data, I later learned that the "summary" metric was wrong for what we'd tried to use it for. We now find ourselves in the wrong column of table "Histogram vs Summary".

Can we move this forward in any way? Are all of the understanding-improving/renaming comments blocking a merge?

@grobie
Copy link
Member Author

grobie commented Dec 20, 2016 via email

@perlun
Copy link

perlun commented Jan 19, 2017

@grobie - any updates on this one?

@grobie
Copy link
Member Author

grobie commented Jan 30, 2017

I've been traveling in the Amazon jungle for the last few weeks, but will have more internet connections during the next couple of days. I'll try to pick this up again.

@grobie grobie force-pushed the grobie/middleware-rewrite branch from 271d793 to 26b229e Compare February 14, 2017 03:58
@coveralls
Copy link

coveralls commented Feb 14, 2017

Coverage Status

Coverage remained the same at 99.09% when pulling 26b229e on grobie/middleware-rewrite into 1b458a0 on master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage remained the same at 99.09% when pulling 26b229e on grobie/middleware-rewrite into 1b458a0 on master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 99.09% when pulling 26b229e on grobie/middleware-rewrite into 1b458a0 on master.

@grobie grobie force-pushed the grobie/middleware-rewrite branch from 26b229e to 2caa1bf Compare February 14, 2017 04:38
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling 2caa1bf on grobie/middleware-rewrite into 1b458a0 on master.

3 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling 2caa1bf on grobie/middleware-rewrite into 1b458a0 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling 2caa1bf on grobie/middleware-rewrite into 1b458a0 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling 2caa1bf on grobie/middleware-rewrite into 1b458a0 on master.

@grobie grobie force-pushed the grobie/middleware-rewrite branch from 2caa1bf to ac14e38 Compare February 14, 2017 04:49
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling ac14e38 on grobie/middleware-rewrite into 1b458a0 on master.

3 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling ac14e38 on grobie/middleware-rewrite into 1b458a0 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling ac14e38 on grobie/middleware-rewrite into 1b458a0 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling ac14e38 on grobie/middleware-rewrite into 1b458a0 on master.

Copy link
Member Author

@grobie grobie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I finally addressed almost all of the feedback. The only outstanding issues is how to handle the response code label.

The term request is pretty overloaded and I always understood it to describe here the full cycle of a HTTP request and response. For example the latency metric measures the full round-trip and not just the time until the request reaches a server (which would be very hard to define anyway).

It's a very common need to have a metric broken out by request method, path and response code. Neither http_request_... nor http_response_... metrics would include all three. I'm open to suggestions, but the current approach seems to be the best option IMHO.

As these metrics were implemented based on client_golang, I'd like to get @beorn7's and @juliusv's input as well.

@grobie grobie force-pushed the grobie/middleware-rewrite branch from ac14e38 to 9e03c52 Compare February 14, 2017 05:11
@grobie
Copy link
Member Author

grobie commented Feb 14, 2017

The metrics generated by the included HTTP tracer would look like this:

# TYPE http_server_requests_total counter
# HELP http_server_requests_total The total number of HTTP requests handled by the Rack application.
http_server_requests_total{code="200",method="get",path="/"} 76.0
http_server_requests_total{code="200",method="post",path="/"} 67.0
http_server_requests_total{code="404",method="delete",path="/"} 16.0
http_server_requests_total{code="200",method="delete",path="/"} 62.0
http_server_requests_total{code="404",method="get",path="/"} 7.0
http_server_requests_total{code="404",method="post",path="/"} 10.0
# TYPE http_server_request_latency_seconds histogram
# HELP http_server_request_latency_seconds The HTTP response latency of the Rack application.
http_server_request_latency_seconds{method="get",path="/",le="0.005"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.01"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.025"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.05"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.1"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.25"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.5"} 83
http_server_request_latency_seconds{method="get",path="/",le="1"} 83
http_server_request_latency_seconds{method="get",path="/",le="2.5"} 83
...
http_server_request_latency_seconds{method="delete",path="/",le="5"} 78
http_server_request_latency_seconds{method="delete",path="/",le="10"} 78
http_server_request_latency_seconds{method="delete",path="/",le="+Inf"} 78
http_server_request_latency_seconds_sum{method="delete",path="/"} 0.003424651000000001
http_server_request_latency_seconds_count{method="delete",path="/"} 78
# TYPE http_server_exceptions_total counter
# HELP http_server_exceptions_total The total number of exceptions raised by the Rack application.
http_server_exceptions_total{exception="NoMethodError"} 12.0

@coveralls
Copy link

coveralls commented Feb 14, 2017

Coverage Status

Coverage decreased (-0.1%) to 98.965% when pulling 9e03c52 on grobie/middleware-rewrite into 1b458a0 on master.

@grobie
Copy link
Member Author

grobie commented Feb 14, 2017

Request means request, response means response.

And still it's commonly understood that a statement "The request latency was 400ms" includes both the time to sent a HTTP request and receive a HTTP response without the need to say something like "The request and response latency was ...".

I'd suggest switching the request counter to being called response, and dropping the code label from the latency.

As commented earlier, I had already dropped the code label from the latency histogram.

Following your argumentation, a http_server_responses_total metric wouldn't be allowed to include method and path labels as a HTTP response doesn't have such fields by definition.

I understand the naming isn't perfect and remember the same discussion with Matt years ago. Considering the colloquial use of the term request, the similar problems with labels on a http_responses_total metric, the benefit of having the same prefix on all HTTP metrics and the existing usage of http_requests_total metrics in various client libraries, I'm for keeping the current naming pattern.

Are there other committers voting to change the naming? @beorn7 @juliusv

@beorn7
Copy link
Member

beorn7 commented Feb 14, 2017

The naming used by https://godoc.org/github.com/prometheus/client_golang/prometheus#InstrumentHandler in client_golang should not serve as a role model in any way. It's strongly deprecated (and coming up with a replacement that doesn't immediately run into similar or other problems turns out to be hard).

The problem with the metric name http_requests_total, as I understood it so far, is the one described in https://prometheus.io/docs/instrumenting/writing_exporters/#naming :

Generally metric names should allow someone who’s familiar with Prometheus but not a particular system to make a good guess as to what a metric means. A metric named http_requests_total is not extremely useful - are these being measured as they come in, in some filter or when they get to the user’s code?

Thus, the name should include some kind of subsystem. (In this case, perhaps rack_http_requests_total? But I'm not too familiar with the Ruby code here.)

The discussion around request vs. response is new to me. I certainly want to see how many of my DELETE requests got a 200 response. If I separated those two from each other, how would I do this?

@juliusv
Copy link
Member

juliusv commented Feb 14, 2017

While I've heard about the request vs. response argument before, I agree that it's often useful to have both the method (or path) and the status code in one metric, and that the colloquial use of "request" is not only about the request, but also the response. @brian-brazil or would there be a better metric name wording?

@grobie
Copy link
Member Author

grobie commented Feb 14, 2017

Regarding the inclusion of rack in metric names.

The problem with the metric name http_requests_total, as I understood it so far, is the one described in https://prometheus.io/docs/instrumenting/writing_exporters/#naming :

That's the reason this PR changes the metric names from http_requests_... to http_server_requests_.... I considered including rack in the name as well, but after some thinking and discussions with others opted against it. Copy&pasting my answer to Brian from above:

Rack is the de facto standard interface for ruby HTTP servers. It can be implemented by many different servers (unicorn, thin, puma, ...). I thought about including it, but concluded after some discussions that including it in the metric name is neither expected nor helpful.

Rack is really just an interface definition and not an implementation in its own. Literally every ruby server uses rack these days, so that servers (unicorn, thin, ...), frameworks (sinatra, rails, ...) and plugins (rack middlewares) can be freely used together. It does not provide any additional information to include rack in the metric name, as it's still unclear where the prometheus middlewares are added in the chain of middlewares. For reference, a typical Rails application uses 20-30 middlewares.

@beorn7
Copy link
Member

beorn7 commented Feb 15, 2017

The reasoning about the name appears sound to me. However, it also proves how difficult generic naming is. My current plan for the HTTP instrumentation helpers in client_golang is to always let the user provide a name, without default. But as said, we are still in exploration state there.

@matthiasr
Copy link

Request means request, response means response.

They are not independent. The response means nothing without the request, and the request has no result without the response.

In an ideal nanoservice world, where each binary does exactly one thing, the responses may make sense without further context, but out here in the real world we need the response breakdown by (method,path,status). We actively graph and alert this, as otherwise high-volume low-value endpoints swamp out signal from low-volume high-value ones. Consider a CRUD service – if the R request rate is 1000x the CUD rate (not uncommon in consumer websites cough), and a 0.1% error rate on reads is acceptable, it is impossible to detect a completely failing C/U/D endpoint without method and path.

@grobie
Copy link
Member Author

grobie commented Feb 20, 2017

In order to unblock this PR, I'd ask for a vote how to proceed here. Arguments against and for http_server_requests_total/http_server_responses_total have been written above.

Summary of the changes in this PR:

  • changed prefix of all HTTP tracer metrics from http_... to http_server_...
  • replace summary with histogram for http_server_request_latency metric
  • removed code label from histogram (formerly a summary)
  • provided two separate options to change the labels added to http_server_requests_total and http_server_request_latency (formerly both label sets were changed simultaneously)
  • renamed the middlewares from Prometheus::Client::Rack::XXX to Prometheus::Middleware::XXX to make the breaking changes clear

The newly generated metrics look like this:

# TYPE http_server_requests_total counter
# HELP http_server_requests_total The total number of HTTP requests handled by the Rack application.
http_server_requests_total{code="200",method="get",path="/"} 76.0
http_server_requests_total{code="200",method="post",path="/"} 67.0
http_server_requests_total{code="404",method="delete",path="/"} 16.0
http_server_requests_total{code="200",method="delete",path="/"} 62.0
http_server_requests_total{code="404",method="get",path="/"} 7.0
http_server_requests_total{code="404",method="post",path="/"} 10.0
# TYPE http_server_request_latency_seconds histogram
# HELP http_server_request_latency_seconds The HTTP response latency of the Rack application.
http_server_request_latency_seconds{method="get",path="/",le="0.005"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.01"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.025"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.05"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.1"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.25"} 83
http_server_request_latency_seconds{method="get",path="/",le="0.5"} 83
http_server_request_latency_seconds{method="get",path="/",le="1"} 83
http_server_request_latency_seconds{method="get",path="/",le="2.5"} 83
...
http_server_request_latency_seconds{method="delete",path="/",le="5"} 78
http_server_request_latency_seconds{method="delete",path="/",le="10"} 78
http_server_request_latency_seconds{method="delete",path="/",le="+Inf"} 78
http_server_request_latency_seconds_sum{method="delete",path="/"} 0.003424651000000001
http_server_request_latency_seconds_count{method="delete",path="/"} 78
# TYPE http_server_exceptions_total counter
# HELP http_server_exceptions_total The total number of exceptions raised by the Rack application.
http_server_exceptions_total{exception="NoMethodError"} 12.0

Please vote with 👍 / 👎 if you agree with these changes. @brian-brazil @beorn7 @juliusv @matthiasr

@matthiasr
Copy link

👍

@beorn7
Copy link
Member

beorn7 commented Feb 21, 2017

Slightly different note: We often use …_duration_seconds instead of …_latency_seconds. I would prefer if one was used consistently (or there are rules when to use what). Does anybody have thoughts about this?

@grobie
Copy link
Member Author

grobie commented Feb 21, 2017 via email

@beorn7
Copy link
Member

beorn7 commented Feb 21, 2017

But is …_duration_seconds the more common? I always thought so, but I'm not sure anymore…

@brian-brazil
Copy link

I don't think there's a clear convention there, I'd say follow whatever the Go client has.

@beorn7
Copy link
Member

beorn7 commented Feb 21, 2017

As of now, the Go client has …_duration_seconds (actually …_duration_microseconds, but changing that to seconds is one of the easy and obvious changes that will happen).

@grobie grobie force-pushed the grobie/middleware-rewrite branch from 9e03c52 to 43f01b1 Compare February 24, 2017 00:23
@grobie
Copy link
Member Author

grobie commented Feb 24, 2017

Ok, I changed latency to duration.

* Move from prometheus/client/rack/* to rack/middleware/*
* Use histogram instead of summary in collector middleware
* Spepicfy collector metrics by changing the prefix to http_server_*
* Remove HTTP Host header from default label builder
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.987% when pulling 43f01b1 on grobie/middleware-rewrite into 6606b59 on master.

3 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.987% when pulling 43f01b1 on grobie/middleware-rewrite into 6606b59 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.987% when pulling 43f01b1 on grobie/middleware-rewrite into 6606b59 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.987% when pulling 43f01b1 on grobie/middleware-rewrite into 6606b59 on master.

@beorn7
Copy link
Member

beorn7 commented Feb 24, 2017

👍 from a pure "how the metrics look like" perspective.


def call(env)
if env['PATH_INFO'] == @path
format = negotiate(env['HTTP_ACCEPT'], @acceptable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be more efficient to fetch a key from env, like env.fetch('HTTP_ACCEPT', '') and assume that in negotiate method we will always get a string in accept argument.

Then we can improve negotiate method and replace accept.to_s.empty? on accept.empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea to set default value here, but it can also be a strategy – always provide accept argument, then it could be env.fetch('HTTP_ACCEPT', '*/*') or something like env.fetch('HTTP_ACCEPT', DEFAULT_HTTP_ACCEPT), but I'm afraid then it will shift responsibilities of negotiation a bit here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Strech! I incorporated your suggestions and simplified the code. Note, most of that was necessary when the ruby client library supported more than one exposition format (json and text). It might become necessary again in the future if #10 gets implemented.

Copy link
Contributor

@Strech Strech Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grobie copy, I hope that I will have a chance to contribute more in next few months 👍

end

def parse(header)
header.to_s.split(/\s*,\s*/).map do |type|
Copy link
Contributor

@Strech Strech Feb 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used only by negotiate and we will never pass header as non-string value, so we can omit to_s here

@grobie grobie force-pushed the grobie/middleware-rewrite branch from 43f01b1 to 5c38163 Compare February 25, 2017 01:37
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.986% when pulling 5c38163 on grobie/middleware-rewrite into 6606b59 on master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 98.986% when pulling 5c38163 on grobie/middleware-rewrite into 6606b59 on master.

@coveralls
Copy link

coveralls commented Feb 25, 2017

Coverage Status

Coverage decreased (-0.1%) to 98.986% when pulling 5c38163 on grobie/middleware-rewrite into 6606b59 on master.

@grobie grobie merged commit 9e28f0d into master Feb 25, 2017
@grobie grobie deleted the grobie/middleware-rewrite branch February 25, 2017 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use histogram in rack collector.
10 participants