Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process HTTP send operations from the dedicated task in deferred queue #3207

Merged
merged 9 commits into from
May 17, 2023

Conversation

rouming
Copy link
Contributor

@rouming rouming commented May 12, 2023

Zedagent main event loop is responsible for handling pubsub updates along with sending messages to the controller. HTTP send is synchronous call and can stall the whole main event loop for up to 6 minutes (overall send timeout).
This badly affects the system responsiveness which can be observed in air-gaped environment (no connectivity to the controller), when the queue is stuck waiting for send to be completed and no other pubsub updates are handled.

Three things are done in this PR:

  • Introduced connect() ("dial" in terms of Go) timeout, which is much less than the send() timeout (helps to fail faster).
  • All direct HTTP send() calls are moved away from the main event loop and executed in dedicated deferred goroutines.
  • Also make sure that VMs (such as Local Profile Server or Local Operator Console) are not delayed by NIM testing DPC. In offline mode connectivity probes will take longer (until dial timeout elapsed).

This PR borrows a few commits from the original #3169 PR from @milan-zededa: in general the idea is the same (don't call HTTP send directly), implementation is different.

CC: @milan-zededa

PS. Milan, please fill free to ping me if you do not agree on the missing signed-off of some of my patches. This heavily based on what you did, but implementation is a bit different, but I'm absolutely fine to put you as a co-author.

@rouming rouming requested review from eriknordmark and rvs as code owners May 12, 2023 09:53
@rouming rouming added the stable Should be backported to stable release(s) label May 12, 2023
@milan-zededa
Copy link
Contributor

PS. Milan, please fill free to ping me if you do not agree on the missing signed-off of some of my patches. This heavily based on what you did, but implementation is a bit different, but I'm absolutely fine to put you as a co-author.

I don't really care about authorship :)

Copy link
Contributor

@milan-zededa milan-zededa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be very helpful to have a paragraph under pillar/docs explaining:

  • what is deferred context and how it works (in terms of parallelism, queues, tickers, task priorities, etc.)
  • how many deferred context are created in zedagent and what is the purpose of each

Documenting interaction with LPS, LOC and controller can be done later, but at least deferred stuff could be explained by this PR. I still do not quite get it :)

pkg/pillar/zedcloud/deferred.go Show resolved Hide resolved
@@ -2084,10 +2024,6 @@ func handleDNSImpl(ctxArg interface{}, key string,
ctx.DNSinitialized = true
return
}
// if status changed to DPCStateSuccess try to send deferred objects
Copy link
Contributor

@milan-zededa milan-zededa May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this removed, how is transition to DPCStateSuccess handled now? Shouldn't we send kick to tickers of deferred contexts?

Copy link
Contributor Author

@rouming rouming May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything added to the deferred queue through the SetDeferred will be executed to a completion through the following queue kick, see the commit: 8c54d14
So I assume once we reach this point, everything should be kicked and processing should be started.
So we do not do any explicit queue processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to understand what happens when EVE is without network connectivity for some time and then suddenly the connectivity is restored. This was previously explicitly handled by this if statement. I'm not sure if anything needs to happen when restored connectivity is detected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid point. Once something enters the queue the timer will be set to a range 1-15m. So the case you've described can lead to a delay, but not to a lost wakeup. I can add an explicit kick for dns processing, just to be on the safe side.

@rouming
Copy link
Contributor Author

rouming commented May 12, 2023

Difference to the previous version:

  • Kick the deferred queue right away in case of DNS processing, don't let any delays happen

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions plus one mustfix for the startTimer call placement.

@eriknordmark
Copy link
Contributor

You also have some comment to update:
// queueInfoToDest - queues "info" requests according to the specified
//
// destination. Deferred event queue runs to a completion
// from this context, but deferred periodic queue will

That function no longer runs to completion.

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; some comment suggestions. Can you rebase on master so the eden tests are more likely to pass?

@eriknordmark eriknordmark dismissed their stale review May 15, 2023 12:54

I want the PR to stay not-approved so that the tests don't run until it has been rebased on master and approved explicitly.

@eriknordmark
Copy link
Contributor

@milan-zededa @rouming do we have any tests in Eden which run with a network outage which could be updated/extended to cover the case of link-down and link-up but unreachable controller?

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kick off tests even though it is not yet rebased on master.

@rouming
Copy link
Contributor Author

rouming commented May 16, 2023

@milan-zededa @rouming do we have any tests in Eden which run with a network outage which could be updated/extended to cover the case of link-down and link-up but unreachable controller?

@eriknordmark I reproduced connectivity problems manually by applying firewall rules, but I do not know is there automated tests of doing the same.

rouming and others added 9 commits May 16, 2023 10:21
TCP connect has to be covered by a separate timeout value
for all HTTP send operations. This patches introduces
the timeout for TCP connect calls ("dial" in terms of Go),
which is much less than the send() operation timeout
(this helps to fail faster).

Signed-off-by: Milan Lenco <milan@zededa.com>
Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Fix copy-paste wrong name.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Processing task of all deferred requests should be handled by
the deferred queue implementation, and not by the caller.

In the next patches calling of the `HandleDeferred` will be
forbidden and zedagent should rely only on the internal
processing.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Each and every `SetDeferred` call leads to a kick of the internal
goroutine, which starts processing the deferred queue.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
When deferred queue is populated by calling the `SetDeferred` the
internal processing task is kicked and deferred queue  will be
processed from the dedicated goroutine. There is no need in direct
processing calls, which can stall for quite significant amount of
time because of the send timeout.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
deferred queue is processed by the internal goroutine,
don't expose the direct call is a public API.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Now the `handleDeferred` is called from a dedicated goroutine,
so no need to process only one request or to measure time and
break earlier or sleep. Dedicated goroutine starts processing
the queue once is kicked and does not break the loop unless
error happens.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
No functional changes, just comments updates.

Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Make sure that VMs (such as Local Profile Server or Local
Operator Console) are not delayed by NIM testing DPC.
In offline mode connectivity probes will take longer
(until dial timeout elapsed).

Signed-off-by: Milan Lenco <milan@zededa.com>
@rouming
Copy link
Contributor Author

rouming commented May 16, 2023

Difference to the previous version:

  • Comments tweaks

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kick eden again

@eriknordmark eriknordmark merged commit 22de1c8 into lf-edge:master May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stable Should be backported to stable release(s)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants