use Contexts properly to stop executing when a request is canceled #728
Conversation
Force-pushed from 88878c8 to b791d24
i'm not done reviewing yet.
an overall comment at this point is that we should be consistent in where we call cancel.
e.g. it's inconsistent that findSeries and getTargets both call cancel, and so do getTargetsRemote and getTargetsLocal, but s.findSeriesLocal and s.findSeriesRemote don't.
perhaps we should have all helper functions like getTargetsRemote etc bubble up all errors as fast as they can, and leave all the canceling up to the macaron Handlers only, which would cancel when errors bubble up into them.
not sure yet if that's the best approach, i just want us to be consistent
api/dataprocessor.go
Outdated
wg.Add(len(reqs))
limiter := make(chan struct{}, getTargetsConcurrency)
this would be a bit cleaner if the limiter was factored out like in https://github.com/go-graphite/carbonapi/blob/master/limiter.go. btw, this is probably my favorite 10 lines of Go code ever.
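For reference, the linked limiter is essentially a buffered channel used as a counting semaphore. A from-memory sketch of the pattern, not the verbatim carbonapi file:

```go
package main

import "fmt"

// limiter caps how many units of work may be in flight at once.
type limiter chan struct{}

func newLimiter(l int) limiter { return make(limiter, l) }

// enter blocks until a slot is free.
func (l limiter) enter() { l <- struct{}{} }

// leave frees up a slot.
func (l limiter) leave() { <-l }

func main() {
	lim := newLimiter(2)
	lim.enter()
	lim.enter() // two slots in use; a third enter() would block until a leave()
	lim.leave()
	fmt.Println("slots in use:", len(lim))
}
```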
api/config.go
Outdated
@@ -45,6 +47,7 @@ func ConfigSetup() {
apiCfg.BoolVar(&multiTenant, "multi-tenant", true, "require x-org-id authentication to auth as a specific org. otherwise orgId 1 is assumed")
apiCfg.StringVar(&fallbackGraphite, "fallback-graphite-addr", "http://localhost:8080", "in case our /render endpoint does not support the requested processing, proxy the request to this graphite")
apiCfg.StringVar(&timeZoneStr, "time-zone", "local", "timezone for interpreting from/until values when needed, specified using [zoneinfo name](https://en.wikipedia.org/wiki/Tz_database#Names_of_time_zones) e.g. 'America/New_York', 'UTC' or 'local' to use local server timezone")
+ apiCfg.IntVar(&getTargetsConcurrency, "get-targets-concurrency", 20, "maximum number of concurrent threads for fetching data on the local node")
this comment is not clear enough. in particular we need to clarify how this ties in to the store and the store's own tunables.
perhaps "maximum number of concurrent timeseries reads issued to the store (note that a read of an avg-aggregated series counts as 1 but issues 2 series reads)"
This is completely independent of the store. The behaviour is exactly as described: it is a cap on the number of threads used for fetching data on the local node, whether that data comes from the numerous caches or from cassandra.
maybe i just need to add that each execution thread handles only 1 series.
that works
actually let's clarify that this limit is per-request
responses := make(chan getTargetsResp, 1)
getCtx, cancel := context.WithCancel(ctx)
defer cancel()
why this deferred cancel? in particular, triggering cancel() when there was no error. is this a safeguard against programming errors (e.g. that we forgot that there may still be workers running that should have exited)?
https://godoc.org/golang.org/x/net/context#WithCancel
cancel should always be called, otherwise resources will leak. go vet will warn if you don't call cancel, with something like:
api/dataprocessor.go:135: the cancel function is not used on all paths (possible context leak)
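A minimal illustration of that rule (the fetch helper here is hypothetical): cancel is deferred so it also runs on the success path, which is exactly what the vet check above enforces.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetch stands in for work that honors cancellation.
func fetch(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Millisecond):
		return nil // work finished normally
	case <-ctx.Done():
		return ctx.Err() // context canceled first
	}
}

func main() {
	getCtx, cancel := context.WithCancel(context.Background())
	// always release the derived context, even when there was no error;
	// omitting this is the "possible context leak" that go vet warns about
	defer cancel()
	fmt.Println(fetch(getCtx))
}
```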
mdata/store_cassandra.go
Outdated
if o.omitted {
tracing.Failure(span)
tracing.Error(span, errReadTooOld)
return nil, errReadTooOld
is this error still accurate? seems like we can hit this condition when ctx was canceled during processReadQueue, not due to the CRR being too old.
well it is too old in the sense that the CRR outlived the originating "/render" request.
It should also be noted that with the default cassandra-omit-read-timeout config, unless the client waits longer than 60 seconds for a request to complete (not possible in hosted-metrics, as the ingress-controller will time out after 60 seconds), o.omitted will only ever be set for requests that are cancelled (due to timeout or other error).
right. the reason i brought this up is because it seemed useful to separate these 2 cases in the error being logged in the trace, but actually it doesn't seem to matter much so this looks fine.
mdata/store_cassandra.go
Outdated
@@ -459,8 +468,8 @@ func (c *CassandraStore) SearchTable(ctx context.Context, key, table string, sta

crrs := make([]*ChunkReadRequest, 0)

- query := func(month, sortKey uint32, q string, p ...interface{}) {
crrs = append(crrs, &ChunkReadRequest{month, sortKey, q, p, pre, nil})
+ query := func(ctx context.Context, month, sortKey uint32, q string, p ...interface{}) {
since `ctx` is the same for all invocations of query, we don't need to pass it in each time. that'll clean up the diff a bit. (please rebase -i and fixup into b791d24, so that we maintain clean history)
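In a self-contained sketch (identifiers are illustrative, not the real SearchTable code), the suggestion amounts to letting the closure capture ctx from its enclosing scope instead of taking it as a parameter:

```go
package main

import (
	"context"
	"fmt"
)

func buildQueries(ctx context.Context, months []uint32) []string {
	var queries []string
	// query captures ctx from the enclosing scope; since ctx is the same
	// for every invocation, there is no need to pass it in each time
	query := func(month uint32, q string) {
		if ctx.Err() != nil {
			return // request already canceled, stop collecting work
		}
		queries = append(queries, fmt.Sprintf(q, month))
	}
	for _, m := range months {
		query(m, "SELECT ts, data FROM metric WHERE month=%d")
	}
	return queries
}

func main() {
	fmt.Println(buildQueries(context.Background(), []uint32{201710, 201711}))
}
```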
mdata/store_cassandra.go
Outdated
for {
select {
case <-ctx.Done():
// request has been canceled, so no need to process the results
correction: "no need to continue processing the results"
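The pattern under discussion, as a runnable sketch (the channel and result types are illustrative): each loop iteration races the canceled context against the next result.

```go
package main

import (
	"context"
	"fmt"
)

// collect reads n results, but returns early if the request is canceled.
func collect(ctx context.Context, results <-chan int, n int) ([]int, error) {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		select {
		case <-ctx.Done():
			// request has been canceled, so no need to continue processing the results
			return nil, ctx.Err()
		case r := <-results:
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	results := make(chan int, 3)
	for i := 0; i < 3; i++ {
		results <- i * 10
	}
	got, err := collect(context.Background(), results, 3)
	fmt.Println(got, err)
}
```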
}

func handleResp(rsp *http.Response) ([]byte, error) {
defer rsp.Body.Close()
if rsp.StatusCode != 200 {
ioutil.ReadAll(rsp.Body)
is this an unrelated fix to allow for connection reuse? did you just notice in the code or did you observe a symptom that led you to fixing this?
this is just correct behaviour for allowing connection re-use. I noticed it missing so added it.
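For context on why this matters: Go's http.Transport only returns a keep-alive connection to the pool once the response body has been read to EOF and closed. A sketch of the idiom (io.Copy to ioutil.Discard is the more common spelling of the same drain the diff does with ReadAll):

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
)

func handleResp(rsp *http.Response) ([]byte, error) {
	defer rsp.Body.Close()
	if rsp.StatusCode != 200 {
		// drain the body even though we don't care about its contents,
		// otherwise the underlying connection cannot be reused
		io.Copy(ioutil.Discard, rsp.Body)
		return nil, fmt.Errorf("http %d", rsp.StatusCode)
	}
	return ioutil.ReadAll(rsp.Body)
}

func main() {
	rsp, err := http.Get("http://example.com/")
	if err != nil {
		fmt.Println(err)
		return
	}
	body, err := handleResp(rsp)
	fmt.Println(len(body), err)
}
```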
Force-pushed from fd5c709 to ee4c050
That is not possible.
Force-pushed from ee4c050 to 7658866
api/graphite.go
Outdated
seenDefs := make(map[string]struct{})
var mu sync.Mutex
reqCtx := ctx.Req.Context()
responses := make(chan struct {
afaik types can have function scope, so it might be worth declaring a type for that response struct instead of redefining it 3 times
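A quick demonstration of that suggestion (the struct fields here are made up): the type is declared once at function scope and reused for the channel and every send.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// declared once at function scope, instead of re-spelling the
	// anonymous struct at every make() and send site
	type findResp struct {
		series []string
		err    error
	}

	responses := make(chan findResp, 2)
	responses <- findResp{series: []string{"a.b.c"}}
	responses <- findResp{err: errors.New("peer unreachable")}
	close(responses)

	for r := range responses {
		fmt.Println(r.series, r.err)
	}
}
```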
but we can change getTargetsRemote and getTargets to simply return the err as soon as they hit one, and the caller then calls cancel, canceling all other pending routines.
update: here's one of the go devs sharing a pattern where it is deemed OK for functions to leak goroutines, cleaning them up by calling cancel in the caller: https://rakyll.org/leakingctx/
That's cool
@Dieterbe I refactored so that getTargetsLocal and getTargetsRemote both return errors to their callers as soon as they happen, allowing the caller to cancel the context.
getTargetsLocal still calls cancel in 2 places. wouldn't it make sense to remove those, and add a cancel in Server.getData() after calling getTargetsLocal?
getTargetsLocal uses its own contextWithCancel, derived from the passed-in context. If we don't call cancel() inside the goroutines that are processing a req, then every req will be dispatched before we even start to read the responses channel for any errors. If there are 1000 reqs and the first one returns an error, getTarget() will be called for the other 999 reqs before we notice that the first one failed. Because of the limiting, working through those 999 dispatches could take a long time.
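A self-contained sketch of the flow being described (all identifiers are stand-ins for the real metrictank code): the worker that fails calls cancel itself, so the dispatch loop stops handing out limiter slots instead of pushing every remaining req through first.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func getTarget(ctx context.Context, req int) error {
	if req == 3 {
		return fmt.Errorf("req %d failed", req)
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(5 * time.Millisecond):
		return nil
	}
}

func getTargetsLocal(ctx context.Context, reqs []int) error {
	getCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	limiter := make(chan struct{}, 2) // getTargetsConcurrency
	errs := make(chan error, len(reqs))
	var wg sync.WaitGroup

	for _, req := range reqs {
		select {
		case <-getCtx.Done():
			// a worker already failed and called cancel: stop dispatching
			errs <- getCtx.Err()
			continue
		case limiter <- struct{}{}:
		}
		wg.Add(1)
		go func(req int) {
			defer wg.Done()
			defer func() { <-limiter }()
			if err := getTarget(getCtx, req); err != nil {
				cancel() // abort in-flight work and any pending dispatches
			}
			errs <- getTarget(getCtx, req)
		}(req)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err // a sketch: real code would prefer the root-cause error
		}
	}
	return nil
}

func main() {
	reqs := make([]int, 10)
	for i := range reqs {
		reqs[i] = i
	}
	fmt.Println(getTargetsLocal(context.Background(), reqs))
}
```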
i have added the withCancel contexts to metricsIndex and metricsDelete so that requests to peers are immediately canceled if any peer returns an error. here is a more detailed diagram of the context flow. Each colour represents a different context.
@@ -476,46 +545,72 @@ func (s *Server) metricsDelete(ctx *middleware.Context, req models.MetricsDelete
peers := cluster.Manager.MemberList()
peers = append(peers, cluster.Manager.ThisNode())
log.Debug("HTTP metricsDelete for %v across %d instances", req.Query, len(peers))
errors := make([]error, 0)

reqCtx, cancel := context.WithCancel(ctx.Req.Context())
is there a specific naming scheme at play here, e.g. why not call this `ctx`?
never mind. i see now that `ctx` is already used
if err != nil {
tags.Error.Set(span, true)
- errorsChan <- err
+ cancel() // cancel all other requests.
+ responses <- getTargetsResp{nil, err}
} else {
getTargetDuration.Value(time.Now().Sub(pre))
if getTarget() returned early due to a cancel, then we will report an incorrect (and possibly unrealistically low) getTargetDuration here
@woodsaj ^^ ?
The metric will correctly reflect the time spent running getTarget().
Is getTargetDuration meant to represent something else?
yes. the metric represents the getting of a target. if getTarget is cancelled due to whatever reason and is not actually getting targets, then we shouldn't report this duration as it does not apply.
generally, throughout this new code, cancellations caused by client disconnect result in non-erroneous returns (typically the end result is that executePlan returns an error and an http 500 is reported instead of a 499; executePlan logs the "error" after calling getTargets).
- The ctx was already being passed between the different methods within the query pipeline, but the context was never being inspected to see if it had been canceled. This commit fixes that.
- Now when an error occurs, whether that error is local or on a remote peer, the request is immediately canceled.
- if the client disconnects before a response is returned (whether that is due to a timeout or the client canceling the request), that cancelation will propagate through and abort any work.
limit the number of getTargetsLocal threads that can be running at once. This ensures that if a large query that spans 1000s of series is executed, 1000s of goroutines won't be run at once to hit the caches and cassandra. This also ensures that a single query can't fill the cassandra read queue. MT only executes cassandra-read-concurrency requests at a time, so requests will always need to be queued. With this new approach, multiple render requests will share the read queue and their read requests will be interleaved in the queue, rather than the old behaviour where a render request would have to complete all of its reads before the next render request would be processed. Overall this will lead to a more consistent performance profile for render requests.
- the functions that spawn background work in new goroutines should be responsible for signalling to them that they should shut down. So these functions should create a context.WithCancel and call cancel when the first error is encountered.
Force-pushed from bbe2410 to 705d97a
Force-pushed from 705d97a to e08a3ca
This is ready for review again. @Dieterbe i have removed the panics and doRecover calls in favour of passing errors. So now, the first error will cause the context to be cancelled, triggering all other goroutines to quickly return (without errors).
Force-pushed from 4da9cc7 to 591a709
I don't think it is necessary to remove the panic-doRecover mechanism to achieve that: whether you bubble up an error through a callchain, or take the fast path of panicking and turning the panic into an error, the result is the same: getTarget returns an error to getTargetsLocal, which can then cancel the goroutines.
there's many, many more. in particular the ones in api/dataprocessor.go and in mdata stand out, because many of those are in the code called by getTarget.
I'll re-add the doRecover back in to catch the remaining panics that can be thrown inside the getTarget() goroutine. But the refactor to use errors instead of calling panic is going to stay. It is well established that panics should be avoided whenever possible. https://golang.org/doc/effective_go.html#errors
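For readers following along, the mechanism under discussion looks roughly like this (a sketch, not the verbatim metrictank code): a deep callee panics with an error, and a single deferred doRecover at the goroutine boundary turns it back into an ordinary return value.

```go
package main

import (
	"errors"
	"fmt"
)

// doRecover converts a panic back into an ordinary error.
func doRecover(errp *error) {
	if r := recover(); r != nil {
		if err, ok := r.(error); ok {
			*errp = err
		} else {
			*errp = fmt.Errorf("%v", r)
		}
	}
}

func getTarget() (series string, err error) {
	defer doRecover(&err)
	// the fast path out of a deep call chain: panic instead of
	// threading the error through every intermediate return
	panic(errors.New("chunk not found"))
}

func main() {
	_, err := getTarget()
	fmt.Println(err) // chunk not found
}
```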
- Instead of calling panic and having to recover the panic later, just pass errors through the data fetch (getTargets) pipeline. This allows us to correctly cancel requests via the request context.
Force-pushed from 591a709 to 7d3e803
* no need to wrap an if err clause around s.getSeriesFixed(), as the output is always correct (in particular, when it returns an error, the series is nil)
* add consolidation.ConsolidateContext() wrapper so that the caller doesn't need to check the context prior to a call to consolidation.Consolidate()
* ditto for divideContext() -> divide()
* ditto for getSeriesFixed, except it doesn't need a wrapper, it was already checking the ctx.Done for most of its work (but not all)
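A sketch of the ConsolidateContext() wrapper idea from the commit above (the real functions operate on schema.Point slices and a consolidation.Consolidator; plain float64 averaging stands in here): the wrapper checks the context once so callers don't have to.

```go
package main

import (
	"context"
	"fmt"
)

// consolidate is a stand-in for consolidation.Consolidate: it averages
// each group of aggNum points down to one point.
func consolidate(in []float64, aggNum int) []float64 {
	out := make([]float64, 0, (len(in)+aggNum-1)/aggNum)
	for i := 0; i < len(in); i += aggNum {
		sum, n := 0.0, 0
		for j := i; j < len(in) && j < i+aggNum; j++ {
			sum += in[j]
			n++
		}
		out = append(out, sum/float64(n))
	}
	return out
}

// consolidateContext bails out early if the request was already canceled,
// so the caller no longer needs its own ctx check before every call.
func consolidateContext(ctx context.Context, in []float64, aggNum int) ([]float64, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	default:
	}
	return consolidate(in, aggNum), nil
}

func main() {
	out, err := consolidateContext(context.Background(), []float64{1, 2, 3, 4}, 2)
	fmt.Println(out, err) // [1.5 3.5] <nil>
}
```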
and yet, the doRecover mechanism is also an established practice.
that is exactly the use case we use it for. If anything, the thing we're doing improperly here, according to this document, is that we use it across package boundaries. I may not agree with your reasons, but I am OK with migrating away from this approach though.
I have just pushed 3 commits to clean things up a bit, please let me know what you think.
If you approve of the 3 commits I've pushed, and you can address the one remark that's still pending, then I think this is good to go. (don't worry about the conflicts, i can resolve them when i merge)
LGTM
But what if there is a problem that causes requests to spend too long in getTargets when a context is canceled? We would lose visibility into the problem. getTargets does a lot of things; I don't think it is a good measure of anything other than how long getTarget runs for.
ok that works for me