-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Process HTTP send operations from the dedicated task in deferred queue #3207
Conversation
I don't really care about authorship :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it would be very helpful to have a paragraph under pillar/docs explaining:
- what is deferred context and how it works (in terms of parallelism, queues, tickers, task priorities, etc.)
- how many deferred context are created in zedagent and what is the purpose of each
Documenting interaction with LPS, LOC and controller can be done later, but at least deferred stuff could be explained by this PR. I still do not quite get it :)
pkg/pillar/cmd/zedagent/zedagent.go
Outdated
@@ -2084,10 +2024,6 @@ func handleDNSImpl(ctxArg interface{}, key string, | |||
ctx.DNSinitialized = true | |||
return | |||
} | |||
// if status changed to DPCStateSuccess try to send deferred objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this removed, how is transition to DPCStateSuccess
handled now? Shouldn't we send kick to tickers of deferred contexts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything added to the deferred queue through the SetDeferred
will be executed to a completion through the following queue kick, see the commit: 8c54d14
So I assume once we reach this point, everything should be kicked and processing should be started.
So we do not do any explicit queue processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to understand what happens when EVE is without network connectivity for some time and then suddenly the connectivity is restored. This was previously explicitly handled by this if statement. I'm not sure if anything needs to happen when restored connectivity is detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a valid point. Once something enters the queue the timer will be set to a range 1-15m. So the case you've described can lead to a delay, but not to a lost wakeup. I can add an explicit kick for dns processing, just to be on the safe side.
Difference to the previous version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions plus one mustfix for the startTimer call placement.
You also have some comment to update: That function no longer runs to completion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; some comment suggestions. Can you rebase on master so the eden tests are more likely to pass?
I want the PR to stay not-approved so that the tests don't run until it has been rebased on master and approved explicitly.
@milan-zededa @rouming do we have any tests in Eden which run with a network outage which could be updated/extended to cover the case of link-down and link-up but unreachable controller? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kick off tests even though it is not yet rebased on master.
@eriknordmark I reproduced connectivity problems manually by applying firewall rules, but I do not know is there automated tests of doing the same. |
TCP connect has to be covered by a separate timeout value for all HTTP send operations. This patches introduces the timeout for TCP connect calls ("dial" in terms of Go), which is much less than the send() operation timeout (this helps to fail faster). Signed-off-by: Milan Lenco <milan@zededa.com> Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Fix copy-paste wrong name. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Processing task of all deferred requests should be handled by the deferred queue implementation, and not by the caller. In the next patches calling of the `HandleDeferred` will be forbidden and zedagent should rely only on the internal processing. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Each and every `SetDeferred` call leads to a kick of the internal goroutine, which starts processing the deferred queue. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
When deferred queue is populated by calling the `SetDeferred` the internal processing task is kicked and deferred queue will be processed from the dedicated goroutine. There is no need in direct processing calls, which can stall for quite significant amount of time because of the send timeout. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
deferred queue is processed by the internal goroutine, don't expose the direct call is a public API. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Now the `handleDeferred` is called from a dedicated goroutine, so no need to process only one request or to measure time and break earlier or sleep. Dedicated goroutine starts processing the queue once is kicked and does not break the loop unless error happens. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
No functional changes, just comments updates. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Make sure that VMs (such as Local Profile Server or Local Operator Console) are not delayed by NIM testing DPC. In offline mode connectivity probes will take longer (until dial timeout elapsed). Signed-off-by: Milan Lenco <milan@zededa.com>
Difference to the previous version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kick eden again
Zedagent main event loop is responsible for handling pubsub updates along with sending messages to the controller. HTTP send is synchronous call and can stall the whole main event loop for up to 6 minutes (overall send timeout).
This badly affects the system responsiveness which can be observed in air-gaped environment (no connectivity to the controller), when the queue is stuck waiting for send to be completed and no other pubsub updates are handled.
Three things are done in this PR:
This PR borrows a few commits from the original #3169 PR from @milan-zededa: in general the idea is the same (don't call HTTP send directly), implementation is different.
CC: @milan-zededa
PS. Milan, please fill free to ping me if you do not agree on the missing signed-off of some of my patches. This heavily based on what you did, but implementation is a bit different, but I'm absolutely fine to put you as a co-author.