Add timeout to bulker flush, add default case in failQueue #3986

Open

juliaElastic wants to merge 22 commits into main from es-timeout-long-poll

Conversation

@juliaElastic (Contributor) commented Oct 8, 2024

What is the problem this PR solves?

Scale tests are often blocked at flushing the bulker queue.

How does this PR solve the problem?

Add a context timeout to the bulker flush so that it aborts if flushing takes longer than the deadline.
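
For illustration, a minimal self-contained sketch of the pattern, assuming a stand-in doFlush and an example 5-minute deadline (not the actual fleet-server code):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doFlush stands in for the real bulker flush; it blocks until the
// (simulated) Elasticsearch call finishes or the context expires.
func doFlush(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Minute): // simulated stuck flush
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	if err := doFlush(flushCtx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("flush aborted by the deadline instead of blocking forever")
	}
}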

How to test this PR locally

Ran scale tests here: https://buildkite.com/elastic/observability-perf/builds?branch=increase-poll-action-retries-for-agent-upgrades

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/3783

mergify bot commented Oct 8, 2024

This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot commented Oct 8, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 8, 2024
@@ -337,15 +337,20 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
	actions, ackToken = convertActions(zlog, agent.Id, pendingActions)

	span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
	// ctx, cancel := context.WithTimeout(ctx, pollDuration)
	// defer cancel()

	if len(actions) == 0 {
	LOOP:
		for {
			select {
			case <-ctx.Done():

Contributor:

Looking at this, the only time this case would be hit is if the client closes its connection to Fleet Server, since span, ctx := apm.StartSpan(r.Context(), ...) is being used as the context here; this section then writes the response. Looking at this code, it shouldn't even write the response: if the context is cancelled, the client is no longer connected.
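
A minimal sketch of that behaviour with a plain net/http handler (not the actual ProcessRequest code): r.Context() is cancelled when the client disconnects, so a response written in that branch has no reader.

package main

import (
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-r.Context().Done():
		// The client disconnected (or the server is shutting down);
		// there is nobody left to read a response body.
		return
	case <-time.After(30 * time.Second): // stand-in for the long-poll wait
		_, _ = w.Write([]byte("ok"))
	}
}

func main() {
	http.HandleFunc("/checkin", handler)
	_ = http.ListenAndServe(":8080", nil)
}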

@juliaElastic (Author) commented Oct 25, 2024:

The logic of returning a CheckinResponse here was added in this PR: https://github.com/elastic/fleet-server/pull/3165/files#diff-e0c02bac8d151e9941eedd5ef643441665ee3d2f78baf42c121edd45dee08ded

Can the context be cancelled only by the client here? If the writeResponse is successful, is it correct to return the AckToken without actions? I'm trying to find where the AckToken is persisted back to ES.
I think it's persisted in the action_seq_no field on a successful checkin here:

fields[dl.FieldActionSeqNo] = pendingData.extra.seqNo

I'm also wondering whether there is any retry if the agent's /acks request fails or fleet-server fails to persist the action result. I've seen some stuck upgrades where the ack failed and was never retried.
Though when testing this locally with a simulated error instead of writing the action result, I do see retries happening.

@@ -360,6 +365,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, acs...)
				break LOOP
			case policy := <-sub.Output():
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("SCALEDEBUG new policy")
				actionResp, err := processPolicy(ctx, zlog, ct.bulker, agent.Id, policy)

Contributor:

It's possible that it gets stuck here processing the policy, since processPolicy doesn't create its own context with a timeout; it is still using the request connection's context.
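
One possible mitigation, sketched here only as an idea (the 2-minute value and the policyCtx name are illustrative; the processPolicy call is copied from the diff above):

// Hypothetical: bound policy processing separately from the request context.
policyCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
actionResp, err := processPolicy(policyCtx, zlog, ct.bulker, agent.Id, policy)
cancel()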

@@ -368,7 +374,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
				actions = append(actions, *actionResp)
				break LOOP
			case <-longPoll.C:
				zlog.Trace().Msg("fire long poll")
				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("fire long poll")
				break LOOP
			case <-tick.C:
				err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)

Contributor:

It doesn't seem likely that it gets stuck here, as ct.bc.CheckIn just grabs a lock and adds to a map. But it's still possible that there is a deadlock and that lock is held and never freed, so it shouldn't be ruled out.
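
For context, a self-contained sketch of the kind of operation being described (the type and field names are made up, not the actual BulkCheckin code): a mutex-guarded map write can only block if another path holds the lock and never releases it.

package main

import "sync"

// checkinBuffer is a hypothetical stand-in for the bulk checkin state.
type checkinBuffer struct {
	mu      sync.Mutex
	pending map[string]string
}

// CheckIn grabs the lock and records the latest status for an agent.
// It cannot block unless the lock is held elsewhere and never freed.
func (b *checkinBuffer) CheckIn(agentID, status string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[agentID] = status
}

func main() {
	b := &checkinBuffer{pending: make(map[string]string)}
	b.CheckIn("agent-1", "online")
}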

@juliaElastic (Author) commented:

The long poll issue is reproduced again here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2429301669
I'm not seeing any of the added logs show up, and it still looks like the issue is with the long-running ES request.

@@ -336,6 +336,9 @@ func (b *Bulker) Run(ctx context.Context) error {

w := semaphore.NewWeighted(int64(b.opts.maxPending))

ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)

Member:

This causes each call to the bulker's Run() function to return unconditionally after 5 minutes, including the one in the fleet-server main loop.

You can see this happening in the logs: every 5 minutes the bulker aborts everything it is doing.

To avoid this unconditional exit, you could create the context in doFlush instead: https://github.com/juliaElastic/fleet-server/blob/a9a82eaae87b7c2d4b54d714d34943a970c8f3cd/internal/pkg/bulk/engine.go#L352

That would bound each execution of doFlush to 5 minutes, which should have the same effect but without the constant logging about the bulker exiting.
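
A hypothetical, simplified sketch of that failure mode (the run function and work channel are made up, not the real bulker loop): with WithTimeout at the top, the loop's own ctx.Done() fires after 5 minutes regardless of whether anything is stuck.

package main

import (
	"context"
	"fmt"
	"time"
)

// run is a made-up stand-in for the bulker's Run loop.
func run(ctx context.Context, work <-chan func()) error {
	// The problematic pattern: a deadline on the loop's own context.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	for {
		select {
		case <-ctx.Done():
			// Hit unconditionally 5 minutes after run starts,
			// aborting everything in flight.
			return ctx.Err()
		case job := <-work:
			job()
		}
	}
}

func main() {
	work := make(chan func())
	fmt.Println(run(context.Background(), work)) // prints "context deadline exceeded" after 5 minutes
}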

@juliaElastic (Author) commented Nov 6, 2024:

I tried moving the timeout to doFlush, but it doesn't work; I think the doFlush function exits too quickly and cancels the context. Fleet Server doesn't come online at all.
See the logs from a 10k run: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/S0Wla

@juliaElastic (Author) commented Nov 8, 2024:

I moved the timeout outside of doFlush, see this comment: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2459697129
Strangely, I don't really see the flush timing out anymore; mostly I'm seeing the upgrade step stuck in 75k runs, while 10k runs almost always pass.
I'm seeing this checkin error in the logs.

@juliaElastic juliaElastic marked this pull request as ready for review November 8, 2024 11:57
@juliaElastic juliaElastic requested a review from a team as a code owner November 8, 2024 11:57
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Nov 8, 2024
@@ -153,6 +153,7 @@ LOOP:
	case <-tick.C:
		if err = bc.flush(ctx); err != nil {
			zerolog.Ctx(ctx).Error().Err(err).Msg("Eat bulk checkin error; Keep on truckin'")
			break LOOP

@juliaElastic (Author) commented Nov 8, 2024:

Breaking the loop here makes the bulker exit and be restarted.
It seems to help with the upgrade step; tested with a 51k run: https://buildkite.com/elastic/observability-perf/builds/3670#01930bb4-fb8d-47c2-8699-94c0cc471699

Logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/NiMhb

Hmm, though the upgrade seems to be stuck in this 10k run and I don't see any related errors in the logs:
https://buildkite.com/elastic/observability-perf/builds/3673#01930c05-078e-4e75-9453-e3792ab5294e

This change doesn't seem to help overall; another 75k run got stuck in upgrading for the last 500 drones. Logs here.

@juliaElastic juliaElastic changed the title use poll timeout in es ctx Add timeout to bulker flush, add default case in failQueue Nov 8, 2024

mergify bot commented Nov 8, 2024

This pull request is now in conflicts. Could you fix it @juliaElastic? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b es-timeout-long-poll upstream/es-timeout-long-poll
git merge upstream/main
git push upstream es-timeout-long-poll

@swiatekm swiatekm removed their request for review November 12, 2024 12:36
@blakerouse (Contributor) commented:

@juliaElastic What is the status of this PR? I see you requested another review; are you seeing success with the current state of the PR now? Can you provide a summary of those results?

@juliaElastic (Author) commented:

> @juliaElastic What is the status of this PR? I see you requested another review; are you seeing success with the current state of the PR now? Can you provide a summary of those results?

I summarized here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2467506043
I couldn't reproduce the issues with the policy change steps anymore with this PR; only the upgrade seems to fail sometimes, which looks like a different issue.
I would like to merge this if possible, unless we want to wait until the upgrade issue is tracked down too.

Comment on lines 409 to 411
// deadline prevents bulker being blocked on flush
flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
defer cancel()

Contributor:

These contexts are created in a loop, but the cancels are deferred until the function (Bulker.Run) returns. Should we move context creation into the doFlush function to prevent this?
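
To make the concern concrete, a schematic of the pattern being questioned (the loop body is simplified; names follow the diff above): each iteration creates a timeout whose cancel only runs when Run returns, so the timers accumulate.

for {
	// ...
	flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
	defer cancel() // only runs when Bulker.Run returns, not at the end of each iteration

	_ = doFlush(flushCtx) // simplified; the real loop selects over several queues
	// ...
}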

@juliaElastic (Author) commented:

I originally added it to doFlush, but it didn't work: Fleet Server didn't come up successfully. I think doFlush finishes too quickly and cancels the context.

Contributor:

Should cancel be called right after doFlush is called instead?

Member:

It doesn't work because most of the work happens asynchronously in a goroutine, so once you unblock from acquiring the semaphore, cancelling immediately will exit the goroutine.

go func() {
	start := time.Now()
	if b.tracer != nil {

There probably needs to be two separate deadlines created in doFlush():

  1. On the semaphore acquire call:

	if err := w.Acquire(ctx, 1); err != nil {
		return err
	}

  2. Inside the goroutine doing the work:

	go func() {
		start := time.Now()
		if b.tracer != nil {
			trans := b.tracer.StartTransaction(fmt.Sprintf("Flush queue %s", queue.Type()), "bulker")
			trans.Context.SetLabel("queue.size", queue.cnt)
			trans.Context.SetLabel("queue.pending", queue.pending)
			ctx = apm.ContextWithTransaction(ctx, trans)
			defer trans.End()
		}
		defer w.Release(1)
		var err error
		switch queue.ty {
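
Pulling that together, a rough sketch of the two-deadline idea (the function shape, the queueT type, and the shared timeout constant are schematic, not the actual fleet-server code):

// Hypothetical shape only: one deadline bounds the semaphore acquire,
// a second, independent one bounds the asynchronous flush work.
func (b *Bulker) doFlush(ctx context.Context, w *semaphore.Weighted, queue queueT) error {
	acquireCtx, cancelAcquire := context.WithTimeout(ctx, defaultFlushContextTimeout)
	defer cancelAcquire()

	if err := w.Acquire(acquireCtx, 1); err != nil {
		return err
	}

	go func() {
		// Separate deadline so cancelling the acquire context does not
		// immediately kill the in-flight goroutine.
		flushCtx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)
		defer cancel()
		defer w.Release(1)

		_ = b.flushQueue(flushCtx, queue) // stand-in for the per-queue-type switch in the real code
	}()

	return nil
}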

@juliaElastic (Author) commented Nov 26, 2024:

I tried adding a deadline in doFlush in those 2 places, but it doesn't seem to work.
Logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/L3Nkr

@@ -458,6 +464,10 @@ func (b *Bulker) flushQueue(ctx context.Context, w *semaphore.Weighted, queue qu
	go func() {
		start := time.Now()

		// deadline prevents bulker being blocked on flush
		ctx, cancel := context.WithTimeout(ctx, defaultFlushContextTimeout)

Contributor:

The context we are wrapping here has a timeout associated with it from line 448.
Is that what we want in this case?

@juliaElastic (Author) commented:

You're right, it should probably be a separate timeout.
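
To illustrate the distinction with a self-contained example (not fleet-server code): a timeout derived from an already-deadlined context keeps whichever deadline is earlier, while one derived from the parent gets its own full budget.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent := context.Background()

	outer, cancelOuter := context.WithTimeout(parent, 1*time.Second)
	defer cancelOuter()

	// Wrapping the already-deadlined context: the effective deadline stays ~1s,
	// because the earlier of the two deadlines wins.
	nested, cancelNested := context.WithTimeout(outer, 5*time.Second)
	defer cancelNested()

	// A separate timeout derived from the parent gets the full 5s.
	separate, cancelSeparate := context.WithTimeout(parent, 5*time.Second)
	defer cancelSeparate()

	d1, _ := nested.Deadline()
	d2, _ := separate.Deadline()
	fmt.Println(time.Until(d1).Round(time.Second)) // ~1s
	fmt.Println(time.Until(d2).Round(time.Second)) // ~5s
}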

@juliaElastic (Author) commented Nov 29, 2024:

I made the change, and I had a passing 10k and 50k run: https://buildkite.com/elastic/observability-perf/builds/3745

A 75k run got stuck in upgrading: https://buildkite.com/elastic/observability-perf/builds/3746#019377e3-7719-494b-a952-86c06b748960
I'm only seeing deadline errors with the message "find action", which comes from handleAck, not the bulker:
https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/4Xwhw

I don't know how much this timeout helps, though, since it is hard to reproduce it expiring.

In the last few weeks the scale tests on main (without the changes in this PR) have been quite stable, and only failed for 30k+ runs on the upgrade step: https://buildkite.com/elastic/observability-perf/builds?branch=main

@blakerouse (Contributor) left a comment:

The change looks good. Great to hear that the 10k and 50k scale tests pass.

Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team