Add timeout to bulker flush, add default case in failQueue #3986
base: main
Conversation
This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
internal/pkg/api/handleCheckin.go
Outdated
@@ -337,15 +337,20 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
	actions, ackToken = convertActions(zlog, agent.Id, pendingActions)

	span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
	// ctx, cancel := context.WithTimeout(ctx, pollDuration)
	// defer cancel()

	if len(actions) == 0 {
	LOOP:
		for {
			select {
			case <-ctx.Done():
Currently, the only time this case would be hit is if the client closes their connection to Fleet Server, since span, ctx := apm.StartSpan(r.Context(), ...) is being used as the context here, and this section then writes the response. Looking at this code, it shouldn't even write the response: if the context is cancelled, the client is no longer connected.
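A minimal sketch of what returning early could look like in that case, reusing the variable names from the diff above (the log message and the exact return value are assumptions, not the PR's code):

	case <-ctx.Done():
		// The request context is cancelled, which here means the client has
		// disconnected, so there is no point writing a CheckinResponse.
		// Sketch only; the real handling may differ.
		zlog.Debug().Str(logger.AgentID, agent.Id).Msg("checkin context cancelled; skipping response")
		return ctx.Err()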
The logic of returning CheckinResponse here was added in this PR: https://github.com/elastic/fleet-server/pull/3165/files#diff-e0c02bac8d151e9941eedd5ef643441665ee3d2f78baf42c121edd45dee08ded
Can the context be cancelled only by the client here? I'm wondering: if the writeResponse is successful, is it correct to return the AckToken without actions? I'm trying to find where the AckToken is persisted back to ES.
I think it's persisted in the action_seq_no field on a successful checkin here:
fleet-server/internal/pkg/checkin/bulk.go
Line 263 in 3fd2a48
	fields[dl.FieldActionSeqNo] = pendingData.extra.seqNo
I'm wondering if there is any retry if the agent's /acks request fails or fleet-server fails to persist the action result. I've seen some stuck upgrades where the ack failed and was never retried.
Though when testing this locally with a simulated error instead of writing the action result, I do see the retries happening.
internal/pkg/api/handleCheckin.go
Outdated
@@ -360,6 +365,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, acs...)
				break LOOP
			case policy := <-sub.Output():
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("SCALEDEBUG new policy")
				actionResp, err := processPolicy(ctx, zlog, ct.bulker, agent.Id, policy)
It's possible that it gets stuck here processing the policy, as it doesn't create its own context with a timeout; it is still using the context of the request connection.
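As an illustration, a hedged sketch of giving that call its own deadline; processPolicyTimeout is a hypothetical constant and the error handling is only indicative, not what the PR does:

	case policy := <-sub.Output():
		// Sketch: bound policy processing with its own deadline instead of
		// relying solely on the request connection's context.
		// processPolicyTimeout is hypothetical, e.g. 2 * time.Minute.
		procCtx, procCancel := context.WithTimeout(ctx, processPolicyTimeout)
		actionResp, err := processPolicy(procCtx, zlog, ct.bulker, agent.Id, policy)
		procCancel()
		if err != nil {
			return fmt.Errorf("processPolicy: %w", err) // or the existing error handling
		}
		actions = append(actions, *actionResp)
		break LOOP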
internal/pkg/api/handleCheckin.go
Outdated
@@ -368,7 +374,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, *actionResp)
				break LOOP
			case <-longPoll.C:
				zlog.Trace().Msg("fire long poll")
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("fire long poll")
				break LOOP
			case <-tick.C:
				err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)
It doesn't seem likely that it gets stuck here, as ct.bc.CheckIn just grabs a lock and adds to a map. But it could be possible that there is a deadlock here and that lock is held and never freed; it shouldn't be ruled out.
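For readers following along, the shape being described is roughly the pattern below; this is a simplified illustration, not fleet-server's actual CheckIn code:

	// Simplified illustration (not the real implementation): CheckIn records
	// the agent's state under a mutex, and a separate flush loop later
	// bulk-writes the accumulated map to Elasticsearch. A deadlock would
	// require this lock to be held somewhere else and never released.
	type checkinState struct {
		mut     sync.Mutex
		pending map[string]string // agent ID -> latest reported status
	}

	func (c *checkinState) CheckIn(agentID, status string) {
		c.mut.Lock()
		defer c.mut.Unlock()
		c.pending[agentID] = status
	}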
The long poll issue is reproduced again here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2429301669
Force-pushed from d3d3f2f to 7352efe
internal/pkg/bulk/engine.go
Outdated
@@ -336,6 +336,9 @@ func (b *Bulker) Run(ctx context.Context) error {

	w := semaphore.NewWeighted(int64(b.opts.maxPending))

	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
This causes each call to the bulker's Run() function to return unconditionally after 5 minutes, including the one in the fleet server main loop.
You can see this happening in the logs. Every 5 minutes the bulker aborts everything it is doing.
To avoid this unconditional exit you would create the context in doFlush: https://github.com/juliaElastic/fleet-server/blob/a9a82eaae87b7c2d4b54d714d34943a970c8f3cd/internal/pkg/bulk/engine.go#L352
That would bound each execution of doFlush to 5 minutes, which should have the same effect but without the constant logging about the bulker exiting.
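A rough sketch of that suggestion, assuming doFlush is essentially "walk the queues and call flushQueue for anything pending"; the closure shape and variable names here are assumptions, not the actual code:

	// Sketch only: bound each flush pass rather than the whole Run loop.
	doFlush := func() error {
		flushCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
		// cancel runs when this pass returns, not when Run returns
		defer cancel()

		for i := range queues {
			q := &queues[i]
			if q.pending > 0 {
				if err := b.flushQueue(flushCtx, w, *q); err != nil {
					return err
				}
			}
		}
		return nil
	}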
I tried moving the timeout to doFlush, but it doesn't work; I think the doFlush function exits too quickly and cancels the context. Fleet Server doesn't come online at all.
See the logs from a 10k run: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/S0Wla
I moved the timeout outside of doFlush, see this comment: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2459697129
Strangely, I don't really see the flush timing out anymore; mostly I'm seeing the upgrade step stuck at 75k runs, while 10k runs almost always pass.
I'm seeing this checkin error in the logs.
internal/pkg/checkin/bulk.go
Outdated
@@ -153,6 +153,7 @@ LOOP:
	case <-tick.C:
		if err = bc.flush(ctx); err != nil {
			zerolog.Ctx(ctx).Error().Err(err).Msg("Eat bulk checkin error; Keep on truckin'")
			break LOOP
Breaking the loop here makes the bulker exit and be restarted. It seems to help with the upgrade step; tested with a 51k run: https://buildkite.com/elastic/observability-perf/builds/3670#01930bb4-fb8d-47c2-8699-94c0cc471699
Logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/NiMhb
Hmm, though the upgrade seems to be stuck in this 10k run and I don't see any related errors in the logs.
https://buildkite.com/elastic/observability-perf/builds/3673#01930c05-078e-4e75-9453-e3792ab5294e
This change doesn't seem to help overall; another 75k run was stuck in upgrading for the last 500 drones. Logs here.
This pull request is now in conflicts. Could you fix it @juliaElastic? 🙏
@juliaElastic What is the status of this PR? I see you requested another review; are you seeing success with the current state of the PR now? Can you provide a summary of those results?
I summarized here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2467506043
internal/pkg/bulk/engine.go
Outdated
	// deadline prevents bulker being blocked on flush
	flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
	defer cancel()
These contexts are created in a loop, but the cancels are deferred until the function (Bulker.Run) returns. Should we move context creation to within the doFlush function to prevent this?
I originally added it to doFlush but it didn't work; Fleet Server didn't come up successfully. I think doFlush finishes too quickly and would cancel the context.
Should cancel be called after doFlush is called instead?
It doesn't work because most of the work happens asynchronously in a goroutine, so once you unblock from acquiring the semaphore, cancelling immediately will exit the goroutine.
fleet-server/internal/pkg/bulk/engine.go
Lines 458 to 461 in 1a23838
	go func() {
		start := time.Now()
		if b.tracer != nil {
There probably needs to be two separate deadlines created in doFlush() (see the sketch after this list):
- On the semaphore acquire call:
fleet-server/internal/pkg/bulk/engine.go
Lines 446 to 449 in 1a23838
	if err := w.Acquire(ctx, 1); err != nil {
		return err
	}
- Inside the goroutine doing the work:
fleet-server/internal/pkg/bulk/engine.go
Lines 458 to 472 in 1a23838
	go func() {
		start := time.Now()
		if b.tracer != nil {
			trans := b.tracer.StartTransaction(fmt.Sprintf("Flush queue %s", queue.Type()), "bulker")
			trans.Context.SetLabel("queue.size", queue.cnt)
			trans.Context.SetLabel("queue.pending", queue.pending)
			ctx = apm.ContextWithTransaction(ctx, trans)
			defer trans.End()
		}
		defer w.Release(1)
		var err error
		switch queue.ty {
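A hedged sketch of how those two deadlines could look, using the names quoted above. defaultFlushContextTimeout mirrors the earlier diff; the queueT parameter type and the flushWithContext helper are hypothetical placeholders standing in for the existing switch on queue.ty:

	func (b *Bulker) flushQueue(ctx context.Context, w *semaphore.Weighted, queue queueT) error {
		// Deadline 1: don't wait forever for a flush slot.
		acquireCtx, acquireCancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
		defer acquireCancel()
		if err := w.Acquire(acquireCtx, 1); err != nil {
			return err
		}

		go func() {
			defer w.Release(1)

			// Deadline 2: bound the asynchronous flush work itself, derived from
			// the original ctx so it is independent of the acquire deadline.
			flushCtx, flushCancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
			defer flushCancel()

			// flushWithContext is a hypothetical placeholder for the existing
			// per-queue flush switch on queue.ty.
			if err := b.flushWithContext(flushCtx, queue); err != nil {
				zerolog.Ctx(flushCtx).Error().Err(err).Str("queue", queue.Type()).Msg("flush failed")
			}
		}()
		return nil
	}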
I tried adding a deadline in doFlush in those 2 places, but it doesn't seem to work.
logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/L3Nkr
internal/pkg/bulk/engine.go
Outdated
@@ -458,6 +464,10 @@ func (b *Bulker) flushQueue(ctx context.Context, w *semaphore.Weighted, queue qu
	go func() {
		start := time.Now()

		// deadline prevents bulker being blocked on flush
		ctx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
The context we are wrapping here has a timeout associated with it from line 448.
Is that what we want in this case?
You're right, it should probably be a separate timeout.
Made the change; I had a passing 10k and 50k run: https://buildkite.com/elastic/observability-perf/builds/3745
The 75k run was stuck in upgrading: https://buildkite.com/elastic/observability-perf/builds/3746#019377e3-7719-494b-a952-86c06b748960
I'm only seeing deadline errors with the message find action, which comes from handleAck, not the bulker:
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/4Xwhw
I don't know, though, how much this timeout helps, since it is hard to reproduce it expiring.
In the last few weeks the scale tests on main (without the changes in this PR) were quite stable, and only failed on the upgrade step for 30k+ runs: https://buildkite.com/elastic/observability-perf/builds?branch=main
Change looks good. Great to hear that the 10k and 50k scale tests pass.
What is the problem this PR solves?
Scale tests are often blocked at flushing the bulker queue.
How does this PR solve the problem?
Adding a context timeout to the bulker flush, so that it times out if it takes longer than the deadline.
How to test this PR locally
Ran scale tests here: https://buildkite.com/elastic/observability-perf/builds?branch=increase-poll-action-retries-for-agent-upgrades
Design Checklist
Checklist
./changelog/fragments using the changelog tool
Related issues
Relates https://github.com/elastic/ingest-dev/issues/3783