-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce concurrency limit for GET requests and a general timeout for HTTP #1743
Conversation
@mxinden I'm banging my head against a wall when investigating the Travis errors: |
I have a lead: the |
That's definitely the reason. Now trying to find a solution. (BTW: This would fail in master in the same way.) |
Still banging my head against a wall. The way to reproduce: set GOPATH to something else, and then run |
Hmm, I just pushed a whitespace change to a branch freshly branched off master, and there the |
Interesting. Ignoring Travis build link for reference: https://travis-ci.org/prometheus/alertmanager/builds/489679923 |
I think a bunch of things in our build are pulled in with current versions. That might explain changing behavior (and is at odds with the go-modules approach of reproducible builds). |
I'll get to the bottom of this tomorrow with a clear head… Very mysterious. 🤔 |
0318737
to
5374774
Compare
Regarding the |
Rebased on top of master with @simonpasquier 's and my fixes. Let's see what's happening now. |
Tadaaa! 🎉 |
api/api.go
Outdated
) | ||
|
||
func init() { | ||
prometheus.MustRegister(requestsInFlight) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a big fan of the global metric registry. It might be less verbose but also implicit and more difficult when it comes to vendoring. I am trying the steer this project away from it. What do you think in this regard? Would passing a Registerer
to API.New()
also be an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of the global registry, either. I just wanted to stay consistent with the way it is and then have a separate refactoring to make it local.
But now I have seen what you did in #1741, which is already merged. Thus, a incremental migration to local Registerers is already ongoing. Happy to follow that path here.
Nothing much to add to @mxinden's comments from my side. |
Responded to the comments, but haven't yet pushed the commit to actually address them. |
The default concurrency limit is max(GOMAXPROCS, 8). That should not imply that each GET requests eats a whole CPU. It's more to get some reasonable heuristics for the processing power of the hosting machine (while allowing at least 8 concurrent requests even on the smallest machines). As GET requests can easily overload the Alertmanager, rendering it incapable of doing its main task, namely sending alert notifications, we need to limit GET requests by default. In contrast, no timeout is set by default. The http.TimeoutHandler inovkes quite a bit of machinery behind the scenes, in particular an additional layer of buffering. Thus, we should first get a bit of experience with it before we consider enforcing a timeout by default, even if setting a timeout is in general the safer setting for resiliency. Signed-off-by: beorn7 <beorn@soundcloud.com>
The context is created by the http.TimeoutHandler we use to set the timeout. I believe this is the only endpoint where propagating the timeout is feasible and needed. Signed-off-by: beorn7 <beorn@soundcloud.com>
While the newly added in-flight instrumentation works for all GET requests, the existing HTTP instrumentation omits api/v2 calls. This commit adds a TODO note about that. Signed-off-by: beorn7 <beorn@soundcloud.com>
Sorry for the delay. Got sidetracked all the time. Finally pushed that new commit. @stuartnelson3 is gone for vacation again. @mxinden please have another look. |
Most importantly, `api.New` now takes an `Options` struct as an argument, which allows some other things done here as well: - Timout and concurrency limit are now in the options, streamlining the registration and the implementation of the limiting middleware. - A local registry is used for metrics, and the metrics used so far inside any of the api packages are using it now. The 'in flight' metric now contains the 'get' as a method label. I have also added a TODO to instrument other methods in the same way (otherwise, the label doesn't reall make sense, semantically). I have also added an explicit error counter for requests rejected because of the concurrency limit. (They also show up as 503s in the generic HTTP instrumentation (or they would, if v2 were instrumented, too), but those 503s might have a number of reasons, while users might want to alert on concurrency limit problems explicitly). Signed-off-by: beorn7 <beorn@soundcloud.com>
@mxinden Do you think you could have a look at this any time soon? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one small thing. Otherwise this looks good to me.
requestsInFlight prometheus.Gauge | ||
concurrencyLimitExceeded prometheus.Counter | ||
timeout time.Duration | ||
inFlightSem chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used in the handler, right? How about initializing it in the handler constructor closure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we call limitHandler
twice, once to handle the router handler, once to handle the apiv2 handler. If we moved the inFlightSem
into limitHandler
, we would limit the GET requests separately for each one. But we want one limit for all of the HTTP handling. (That was the reason why I had a function that returns a function (the middleware) that returns a function (the HandlerFunc) in the previous version. I could avoid that confusion by having inFlightSem as a member of API
and limitHandler a method of API
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxinden did my comment clarify the issue? Or do you think this should be changed in another way?
inFlightSem chan struct{} | ||
} | ||
|
||
// Options for the creation of an API object. Alerts, Silences, and StatusFunc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding detailed comments here.
Thanks @simonpasquier . @mxinden I assume things are clarified now. We can still rework this if needed. I just would to like to get this merged as #1733 needs to address some merge conflicts with it and I have yet another change on top of all that in my pipeline. Thus, I'm merging this now to get all those ends tied up. It also unblocks your work on #1744. |
Fixes #1723 .