Process HTTP send operations from the dedicated task in deferred queue #3207

rouming · 2023-05-12T09:53:20Z

The zedagent main event loop is responsible for handling pubsub updates along with sending messages to the controller. An HTTP send is a synchronous call and can stall the whole main event loop for up to 6 minutes (the overall send timeout).
This badly affects system responsiveness, which can be observed in an air-gapped environment (no connectivity to the controller): the queue is stuck waiting for the send to complete and no other pubsub updates are handled.

Three things are done in this PR:

Introduced a connect() ("dial" in Go terms) timeout, which is much shorter than the send() timeout (this helps to fail faster).
All direct HTTP send() calls are moved away from the main event loop and executed in dedicated deferred goroutines.
Also made sure that VMs (such as Local Profile Server or Local Operator Console) are not delayed by NIM testing DPC. In offline mode connectivity probes will take longer (until the dial timeout elapses).
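
For illustration, a separate dial timeout can be set on a plain `net/http` client roughly as follows. This is a minimal sketch, not the zedcloud code: `newSendClient`, `dialTimeout`, `sendTimeout` and the 10-second value are assumptions made for the example; the PR only states that the dial timeout is much shorter than the send timeout.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Hypothetical values: only the "up to 6 minutes" send timeout comes from
// the PR description, the dial timeout value is an assumption.
const (
	dialTimeout = 10 * time.Second
	sendTimeout = 6 * time.Minute
)

// newSendClient returns an HTTP client whose TCP connect ("dial") phase is
// bounded by dial, while the request as a whole is bounded by send.
func newSendClient(dial, send time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: dial}
	transport := &http.Transport{
		// Every new connection is established through this dialer,
		// so an unreachable controller fails after dial, not after send.
		DialContext: dialer.DialContext,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   send,
	}
}

func main() {
	client := newSendClient(dialTimeout, sendTimeout)
	fmt.Println("overall send timeout:", client.Timeout)
}
```

With this split, a device without connectivity gives up on each connection attempt after the dial timeout instead of holding the caller for the full send timeout.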

This PR borrows a few commits from the original #3169 PR from @milan-zededa: the general idea is the same (don't call HTTP send directly), but the implementation is different.

CC: @milan-zededa

PS. Milan, please feel free to ping me if you do not agree with the missing Signed-off-by on some of my patches. This is heavily based on what you did, although the implementation is a bit different, and I'm absolutely fine with putting you as a co-author.

milan-zededa · 2023-05-12T09:57:39Z

> PS. Milan, please feel free to ping me if you do not agree with the missing Signed-off-by on some of my patches. This is heavily based on what you did, although the implementation is a bit different, and I'm absolutely fine with putting you as a co-author.

I don't really care about authorship :)

milan-zededa

I believe it would be very helpful to have a paragraph under pillar/docs explaining:

what a deferred context is and how it works (in terms of parallelism, queues, tickers, task priorities, etc.)
how many deferred contexts are created in zedagent and what the purpose of each is

Documenting interaction with LPS, LOC and controller can be done later, but at least deferred stuff could be explained by this PR. I still do not quite get it :)

milan-zededa · 2023-05-12T13:26:04Z

pkg/pillar/cmd/zedagent/zedagent.go

@@ -2084,10 +2024,6 @@ func handleDNSImpl(ctxArg interface{}, key string,
 		ctx.DNSinitialized = true
 		return
 	}
-	// if status changed to DPCStateSuccess try to send deferred objects


With this removed, how is the transition to DPCStateSuccess handled now? Shouldn't we send a kick to the tickers of the deferred contexts?

Everything added to the deferred queue through SetDeferred will be run to completion by the following queue kick, see commit 8c54d14.
So I assume that once we reach this point, everything should already be kicked and processing should have started.
So we do not do any explicit queue processing.

I am trying to understand what happens when EVE is without network connectivity for some time and then the connectivity is suddenly restored. This was previously handled explicitly by this if statement. I'm not sure whether anything needs to happen when restored connectivity is detected.

This is a valid point. Once something enters the queue, the timer is set to a value in the range of 1-15 minutes. So the case you've described can lead to a delay, but not to a lost wakeup. I can add an explicit kick for DNS processing, just to be on the safe side.
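
The "delay, but not a lost wakeup" argument can be illustrated with a small sketch. It assumes a worker goroutine that waits on both a periodic ticker and a kick channel of capacity one; `worker`, `kick`, `add` and `retryInterval` are illustrative names for this example and are not the actual zedcloud API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// retryInterval stands in for the 1-15 minute retry timer discussed above.
const retryInterval = time.Hour

// worker processes queued items from its own goroutine. It wakes up when
// the retry ticker fires or when it is kicked explicitly.
type worker struct {
	mu      sync.Mutex
	pending []string
	kickCh  chan struct{} // capacity 1: kicks coalesce and never block
	send    func(string)  // possibly slow operation, e.g. an HTTP send
}

func newWorker(send func(string)) *worker {
	return &worker{kickCh: make(chan struct{}, 1), send: send}
}

// kick wakes the worker without blocking the caller.
func (w *worker) kick() {
	select {
	case w.kickCh <- struct{}{}:
	default: // a kick is already pending
	}
}

// add queues an item and kicks the worker; the caller never sends itself.
func (w *worker) add(item string) {
	w.mu.Lock()
	w.pending = append(w.pending, item)
	w.mu.Unlock()
	w.kick()
}

// drain sends everything queued so far; the lock is not held while sending.
func (w *worker) drain() {
	for {
		w.mu.Lock()
		if len(w.pending) == 0 {
			w.mu.Unlock()
			return
		}
		item := w.pending[0]
		w.pending = w.pending[1:]
		w.mu.Unlock()
		w.send(item)
	}
}

// run is the worker loop; it returns when stop is closed.
func (w *worker) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-w.kickCh: // explicit kick: no need to wait for the ticker
		case <-ticker.C: // periodic retry: bounds the delay without a kick
		}
		w.drain()
	}
}

func main() {
	sent := make(chan string, 1)
	w := newWorker(func(item string) { sent <- item })
	stop := make(chan struct{})
	go w.run(retryInterval, stop)

	w.add("info message") // returns immediately, the worker does the send
	fmt.Println("sent:", <-sent)
	close(stop)
}
```

Because the item is appended before the kick is attempted, a coalesced kick still leads to a drain that sees the item. A handler that detects restored connectivity can call `kick()` to avoid waiting for the next tick.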

rouming · 2023-05-12T15:03:19Z

Difference to the previous version:

Kick the deferred queue right away in case of DNS processing, don't let any delays happen

eriknordmark

Some suggestions plus one mustfix for the startTimer call placement.

eriknordmark · 2023-05-15T09:14:40Z

You also have a comment to update:
// queueInfoToDest - queues "info" requests according to the specified
//
// destination. Deferred event queue runs to a completion
// from this context, but deferred periodic queue will

That function no longer runs to completion.

eriknordmark

LGTM; some comment suggestions. Can you rebase on master so the eden tests are more likely to pass?

I want the PR to stay not-approved so that the tests don't run until it has been rebased on master and approved explicitly.

eriknordmark · 2023-05-15T17:48:59Z

@milan-zededa @rouming do we have any tests in Eden which run with a network outage which could be updated/extended to cover the case of link-down and link-up but unreachable controller?

eriknordmark

Kick off tests even though it is not yet rebased on master.

rouming · 2023-05-16T08:15:27Z

> @milan-zededa @rouming do we have any tests in Eden which run with a network outage which could be updated/extended to cover the case of link-down and link-up but unreachable controller?

@eriknordmark I reproduced the connectivity problems manually by applying firewall rules, but I do not know whether there are automated tests that do the same.

TCP connect has to be covered by a separate timeout value for all HTTP send operations. This patch introduces a timeout for TCP connect calls ("dial" in Go terms), which is much shorter than the send() operation timeout (this helps to fail faster). Signed-off-by: Milan Lenco <milan@zededa.com> Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

Fix a wrongly copy-pasted name. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

Processing of all deferred requests should be handled by the deferred queue implementation, and not by the caller. In the next patches calling `HandleDeferred` will be forbidden and zedagent should rely only on the internal processing. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

Each and every `SetDeferred` call leads to a kick of the internal goroutine, which starts processing the deferred queue. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

When the deferred queue is populated by calling `SetDeferred`, the internal processing task is kicked and the deferred queue is processed from the dedicated goroutine. There is no need for direct processing calls, which can stall for a significant amount of time because of the send timeout. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

The deferred queue is processed by the internal goroutine, so don't expose the direct call as a public API. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

Now that `handleDeferred` is called from a dedicated goroutine, there is no need to process only one request, or to measure time and break early or sleep. The dedicated goroutine starts processing the queue once it is kicked and does not break out of the loop unless an error happens. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

No functional changes, just comment updates. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>

Make sure that VMs (such as Local Profile Server or Local Operator Console) are not delayed by NIM testing DPC. In offline mode connectivity probes will take longer (until the dial timeout elapses). Signed-off-by: Milan Lenco <milan@zededa.com>
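
The processing loop described in the `handleDeferred` commit message above (process everything, stop only on error) could be sketched as follows. This is an illustration under assumptions, not the code from deferred.go: `request`, `processAll` and `errUnreachable` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// request is a hypothetical queued item; the real deferred item type differs.
type request struct {
	key  string
	body []byte
}

var errUnreachable = errors.New("controller unreachable")

// processAll sends the queued requests in order and stops at the first
// error. It returns the requests that still have to be retried, starting
// with the one that failed, together with that error.
func processAll(queue []request, send func(request) error) ([]request, error) {
	for i, req := range queue {
		if err := send(req); err != nil {
			return queue[i:], err
		}
	}
	return nil, nil
}

func main() {
	queue := []request{{key: "info"}, {key: "metrics"}, {key: "attest"}}
	// Hypothetical send function that fails on the second request.
	send := func(r request) error {
		if r.key == "metrics" {
			return errUnreachable
		}
		return nil
	}
	remaining, err := processAll(queue, send)
	fmt.Println(len(remaining), err) // prints: 2 controller unreachable
}
```

Since the loop runs in its own goroutine, it does not need a per-call time budget; whatever is left after an error simply waits for the next kick or timer tick.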

rouming · 2023-05-16T08:21:55Z

Difference to the previous version:

Comments tweaks

eriknordmark

Kick eden again

rouming requested review from eriknordmark and rvs as code owners May 12, 2023 09:53

rouming mentioned this pull request May 12, 2023

Move HTTP send operations away from the zedagent main event loop #3169

Closed

rouming added the stable Should be backported to stable release(s) label May 12, 2023

milan-zededa reviewed May 12, 2023

rouming force-pushed the async-send branch from 1d0747a to d8a066d Compare May 12, 2023 15:01

eriknordmark reviewed May 15, 2023

eriknordmark reviewed May 15, 2023

eriknordmark reviewed May 15, 2023

eriknordmark previously requested changes May 15, 2023

eriknordmark reviewed May 15, 2023

eriknordmark approved these changes May 15, 2023

rouming and others added 9 commits May 16, 2023 10:21

zedagent: fix watchdog name for the deferred processing task

c0edc87

deferred: kick the processing goroutine from the SetDeferred

2f95b8b

deferred: make HandleDeferred private

a9d13c0

deferred,send: comments tweaks

cdd5de1

Do not block domainmgr on DPC testing

c79e119

rouming force-pushed the async-send branch from d8a066d to c79e119 Compare May 16, 2023 08:21

eriknordmark approved these changes May 16, 2023

eriknordmark merged commit 22de1c8 into lf-edge:master May 17, 2023
