Add timeout to bulker flush, add default case in failQueue #3986

Open

juliaElastic wants to merge 22 commits into main from es-timeout-long-poll

Conversation

@juliaElastic (Contributor) commented Oct 8, 2024

What is the problem this PR solves?

Scale tests are often blocked at flushing the bulker queue.

How does this PR solve the problem?

Add a context timeout to the bulker flush so that it aborts if flushing takes longer than the deadline.
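
For illustration, a minimal self-contained sketch of the pattern, assuming a stand-in doFlush and an example 5-minute deadline (not the actual fleet-server code):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doFlush stands in for the real bulker flush; it blocks until the
// (simulated) Elasticsearch call finishes or the context expires.
func doFlush(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Minute): // simulated stuck flush
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	if err := doFlush(flushCtx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("flush aborted by the deadline instead of blocking forever")
	}
}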

How to test this PR locally

Ran scale tests here: https://buildkite.com/elastic/observability-perf/builds?branch=increase-poll-action-retries-for-agent-upgrades

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/3783

mergify bot commented Oct 8, 2024

This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Oct 8, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 8, 2024
@@ -337,15 +337,20 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
	actions, ackToken = convertActions(zlog, agent.Id, pendingActions)

	span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
	// ctx, cancel := context.WithTimeout(ctx, pollDuration)
	// defer cancel()

	if len(actions) == 0 {
	LOOP:
		for {
			select {
			case <-ctx.Done():

Contributor:

Looking at this, the only time this case would be hit is if the client closes its connection to Fleet Server, since span, ctx := apm.StartSpan(r.Context(), ...) is being used as the context here; this section then writes the response. Looking at this code, it shouldn't even write the response: if the context is cancelled, the client is no longer connected.
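
A minimal sketch of that behaviour with a plain net/http handler (not the actual ProcessRequest code): r.Context() is cancelled when the client disconnects, so a response written in that branch has no reader.

package main

import (
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-r.Context().Done():
		// The client disconnected (or the server is shutting down);
		// there is nobody left to read a response body.
		return
	case <-time.After(30 * time.Second): // stand-in for the long-poll wait
		_, _ = w.Write([]byte("ok"))
	}
}

func main() {
	http.HandleFunc("/checkin", handler)
	_ = http.ListenAndServe(":8080", nil)
}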

@juliaElastic (Author) commented Oct 25, 2024:

The logic of returning a CheckinResponse here was added in this PR: https://github.com/elastic/fleet-server/pull/3165/files#diff-e0c02bac8d151e9941eedd5ef643441665ee3d2f78baf42c121edd45dee08ded

Can the context be cancelled only by the client here? If the writeResponse is successful, is it correct to return the AckToken without actions? I'm trying to find where the AckToken is persisted back to ES.
I think it's persisted in the action_seq_no field on a successful checkin here:

fields[dl.FieldActionSeqNo] = pendingData.extra.seqNo

I'm also wondering whether there is any retry if the agent's /acks request fails or fleet-server fails to persist the action result. I've seen some stuck upgrades where the ack failed and was never retried.
Though when testing this locally with a simulated error instead of writing the action result, I do see retries happening.

@@ -360,6 +365,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, acs...)
				break LOOP
			case policy := <-sub.Output():
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("SCALEDEBUG new policy")
				actionResp, err := processPolicy(ctx, zlog, ct.bulker, agent.Id, policy)

Contributor:

It's possible that it gets stuck here processing the policy, since processPolicy doesn't create its own context with a timeout; it is still using the request connection's context.
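
One possible mitigation, sketched here only as an idea (the 2-minute value and the policyCtx name are illustrative; the processPolicy call is copied from the diff above):

// Hypothetical: bound policy processing separately from the request context.
policyCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
actionResp, err := processPolicy(policyCtx, zlog, ct.bulker, agent.Id, policy)
cancel()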

@@ -368,7 +374,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, *actionResp)
				break LOOP
			case <-longPoll.C:
				zlog.Trace().Msg("fire long poll")
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("fire long poll")
				break LOOP
			case <-tick.C:
				err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)

Contributor:

It doesn't seem likely that it gets stuck here, as ct.bc.CheckIn just grabs a lock and adds to a map. But it's still possible that there is a deadlock and that lock is held and never freed, so it shouldn't be ruled out.
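
For context, a self-contained sketch of the kind of operation being described (the type and field names are made up, not the actual BulkCheckin code): a mutex-guarded map write can only block if another path holds the lock and never releases it.

package main

import "sync"

// checkinBuffer is a hypothetical stand-in for the bulk checkin state.
type checkinBuffer struct {
	mu      sync.Mutex
	pending map[string]string
}

// CheckIn grabs the lock and records the latest status for an agent.
// It cannot block unless the lock is held elsewhere and never freed.
func (b *checkinBuffer) CheckIn(agentID, status string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[agentID] = status
}

func main() {
	b := &checkinBuffer{pending: make(map[string]string)}
	b.CheckIn("agent-1", "online")
}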

@juliaElastic (Author) commented:

The long poll issue is reproduced again here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2429301669
I'm not seeing any of the added logs show up, and it still looks like the issue is with the long-running ES request.

@@ -336,6 +336,9 @@ func (b *Bulker) Run(ctx context.Context) error {

w := semaphore.NewWeighted(int64(b.opts.maxPending))

ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)

Member:

This causes each call to the bulker's Run() function to return unconditionally after 5 minutes, including the one in the fleet-server main loop.

You can see this happening in the logs: every 5 minutes the bulker aborts everything it is doing.

To avoid this unconditional exit, you could create the context in doFlush instead: https://github.com/juliaElastic/fleet-server/blob/a9a82eaae87b7c2d4b54d714d34943a970c8f3cd/internal/pkg/bulk/engine.go#L352

That would bound each execution of doFlush to 5 minutes, which should have the same effect but without the constant logging about the bulker exiting.
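
A hypothetical, simplified sketch of that failure mode (the run function and work channel are made up, not the real bulker loop): with WithTimeout at the top, the loop's own ctx.Done() fires after 5 minutes regardless of whether anything is stuck.

package main

import (
	"context"
	"fmt"
	"time"
)

// run is a made-up stand-in for the bulker's Run loop.
func run(ctx context.Context, work <-chan func()) error {
	// The problematic pattern: a deadline on the loop's own context.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	for {
		select {
		case <-ctx.Done():
			// Hit unconditionally 5 minutes after run starts,
			// aborting everything in flight.
			return ctx.Err()
		case job := <-work:
			job()
		}
	}
}

func main() {
	work := make(chan func())
	fmt.Println(run(context.Background(), work)) // prints "context deadline exceeded" after 5 minutes
}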

@juliaElastic (Author) commented Nov 6, 2024:

I tried moving the timeout to doFlush, but it doesn't work; I think the doFlush function exits too quickly and cancels the context. Fleet Server doesn't come online at all.
See the logs from a 10k run: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/S0Wla

@juliaElastic (Author) commented Nov 8, 2024:

I moved the timeout outside of doFlush, see this comment: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2459697129
Strangely, I don't really see the flush timing out anymore; mostly I'm seeing the upgrade step stuck in 75k runs, while 10k runs almost always pass.
I'm seeing this checkin error in the logs.

@juliaElastic juliaElastic marked this pull request as ready for review November 8, 2024 11:57
@juliaElastic juliaElastic requested a review from a team as a code owner November 8, 2024 11:57
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Nov 8, 2024
@@ -153,6 +153,7 @@ LOOP:
	case <-tick.C:
		if err = bc.flush(ctx); err != nil {
			zerolog.Ctx(ctx).Error().Err(err).Msg("Eat bulk checkin error; Keep on truckin'")
			break LOOP

@juliaElastic (Author) commented Nov 8, 2024:

Breaking the loop here makes the bulker exit and be restarted.
It seems to help with the upgrade step; tested with a 51k run: https://buildkite.com/elastic/observability-perf/builds/3670#01930bb4-fb8d-47c2-8699-94c0cc471699

Logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/NiMhb

Hmm, though the upgrade seems to be stuck in this 10k run and I don't see any related errors in the logs:
https://buildkite.com/elastic/observability-perf/builds/3673#01930c05-078e-4e75-9453-e3792ab5294e

This change doesn't seem to help overall; another 75k run got stuck in upgrading for the last 500 drones. Logs here.

@juliaElastic juliaElastic changed the title use poll timeout in es ctx Add timeout to bulker flush, add default case in failQueue Nov 8, 2024

mergify bot commented Nov 8, 2024

This pull request is now in conflicts. Could you fix it @juliaElastic? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b es-timeout-long-poll upstream/es-timeout-long-poll
git merge upstream/main
git push upstream es-timeout-long-poll

@swiatekm swiatekm removed their request for review November 12, 2024 12:36
@blakerouse (Contributor) commented:

@juliaElastic What is the status of this PR? I see you requested another review; are you seeing success with the current state of the PR now? Can you provide a summary of those results?

@juliaElastic (Author) commented:

> @juliaElastic What is the status of this PR? I see you requested another review; are you seeing success with the current state of the PR now? Can you provide a summary of those results?

I summarized here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2467506043
I couldn't reproduce the issues with the policy change steps anymore with this PR; only the upgrade seems to fail sometimes, which looks like a different issue.
I would like to merge this if possible, unless we want to wait until the upgrade issue is tracked down too.

Comment on lines 409 to 411
// deadline prevents bulker being blocked on flush
flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
defer cancel()

Contributor:

These contexts are created in a loop, but the cancels are deferred until the function (Bulker.Run) returns. Should we move context creation into the doFlush function to prevent this?
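
To make the concern concrete, a schematic of the pattern being questioned (the loop body is simplified; names follow the diff above): each iteration creates a timeout whose cancel only runs when Run returns, so the timers accumulate.

for {
	// ...
	flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
	defer cancel() // only runs when Bulker.Run returns, not at the end of each iteration

	_ = doFlush(flushCtx) // simplified; the real loop selects over several queues
	// ...
}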

@juliaElastic (Author) commented:

I originally added it to doFlush, but it didn't work: Fleet Server didn't come up successfully. I think doFlush finishes too quickly and cancels the context.

Contributor:

Should cancel be called right after doFlush is called instead?

Member:

It doesn't work because most of the work happens asynchronously in a goroutine, so once you unblock from acquiring the semaphore, cancelling immediately will exit the goroutine.

go func() {
	start := time.Now()
	if b.tracer != nil {

There probably needs to be two separate deadlines created in doFlush():

  1. On the semaphore acquire call:

	if err := w.Acquire(ctx, 1); err != nil {
		return err
	}

  2. Inside the goroutine doing the work:

	go func() {
		start := time.Now()
		if b.tracer != nil {
			trans := b.tracer.StartTransaction(fmt.Sprintf("Flush queue %s", queue.Type()), "bulker")
			trans.Context.SetLabel("queue.size", queue.cnt)
			trans.Context.SetLabel("queue.pending", queue.pending)
			ctx = apm.ContextWithTransaction(ctx, trans)
			defer trans.End()
		}
		defer w.Release(1)
		var err error
		switch queue.ty {
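
Pulling that together, a rough sketch of the two-deadline idea (the function shape, the queueT type, and the shared timeout constant are schematic, not the actual fleet-server code):

// Hypothetical shape only: one deadline bounds the semaphore acquire,
// a second, independent one bounds the asynchronous flush work.
func (b *Bulker) doFlush(ctx context.Context, w *semaphore.Weighted, queue queueT) error {
	acquireCtx, cancelAcquire := context.WithTimeout(ctx, defaultFlushContextTimeout)
	defer cancelAcquire()

	if err := w.Acquire(acquireCtx, 1); err != nil {
		return err
	}

	go func() {
		// Separate deadline so cancelling the acquire context does not
		// immediately kill the in-flight goroutine.
		flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
		defer cancel()
		defer w.Release(1)

		_ = b.flushQueue(flushCtx, queue) // stand-in for the per-queue-type switch in the real code
	}()

	return nil
}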

@juliaElastic (Author) commented Nov 26, 2024:

I tried adding a deadline in doFlush in those 2 places, but it doesn't seem to work.
Logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/L3Nkr

@@ -458,6 +464,10 @@ func (b *Bulker) flushQueue(ctx context.Context, w *semaphore.Weighted, queue qu
	go func() {
		start := time.Now()

		// deadline prevents bulker being blocked on flush
		ctx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)

Contributor:

The context we are wrapping here has a timeout associated with it from line 448.
Is that what we want in this case?

@juliaElastic (Author) commented:

You're right, it should probably be a separate timeout.
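
To illustrate the distinction with a self-contained example (not fleet-server code): a timeout derived from an already-deadlined context keeps whichever deadline is earlier, while one derived from the parent gets its own full budget.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent := context.Background()

	outer, cancelOuter := context.WithTimeout(parent, 1*time.Second)
	defer cancelOuter()

	// Wrapping the already-deadlined context: the effective deadline stays ~1s,
	// because the earlier of the two deadlines wins.
	nested, cancelNested := context.WithTimeout(outer, 5*time.Second)
	defer cancelNested()

	// A separate timeout derived from the parent gets the full 5s.
	separate, cancelSeparate := context.WithTimeout(parent, 5*time.Second)
	defer cancelSeparate()

	d1, _ := nested.Deadline()
	d2, _ := separate.Deadline()
	fmt.Println(time.Until(d1).Round(time.Second)) // ~1s
	fmt.Println(time.Until(d2).Round(time.Second)) // ~5s
}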

@juliaElastic (Author) commented Nov 29, 2024:

I made the change, and I had a passing 10k and 50k run: https://buildkite.com/elastic/observability-perf/builds/3745

A 75k run got stuck in upgrading: https://buildkite.com/elastic/observability-perf/builds/3746#019377e3-7719-494b-a952-86c06b748960
I'm only seeing deadline errors with the message "find action", which comes from handleAck, not the bulker:
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/4Xwhw

I don't know how much this timeout helps, though, since it is hard to reproduce it expiring.

In the last few weeks the scale tests on main (without the changes in this PR) have been quite stable, and only failed for 30k+ runs on the upgrade step: https://buildkite.com/elastic/observability-perf/builds?branch=main

@blakerouse (Contributor) left a comment:

The change looks good. Great to hear that the 10k and 50k scale tests pass.

Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team