Implement NTP source info messages to the controller #3968
Conversation
Force-pushed from 6f78709 to 3513093
Force-pushed from 3513093 to e5455c5
Could you please mention in the EVE docs:
- that we use chrony
- how we pick the NTP server (the first from the list obtained via DHCP or static config, unless we improve this to apply all of them)
- what the default NTP server is if none is configured
- what NTP status information EVE publishes (what info is available, how often it is published, and how to change the interval)
I have started something here, but a separate md file for NTP could also make sense.
Force-pushed from e5455c5 to 70ad87c
Force-pushed from 4582cf4 to b8578cd
Mentioned.
Those two are already covered by you.
Done.
Meh, a separate file is too much; I do not know what to write there except those four lines I added.
Force-pushed from 089d4f2 to 4117601
Force-pushed from e16425e to 0119979
A periodic message makes sense for a LOC/LPS (and I don't understand why that isn't included; we shouldn't have that provide less info than the controller).
But for the info sent to the controller I don't think periodic updates make sense. That should happen when there is a significant change. Ten minutes is too slow when there are changes, and too frequent when there are no significant changes.
Not sure what you mean by "that isn't included"; LOC is updated periodically by the
My main question is how many bytes we need to save here. E.g. we do periodic updates for the location (though with a bigger interval). Is it that bad to send ~100 bytes every 10 minutes to update NTP sources? The other observation is that NTP sources update very frequently (the server set changes, as do reachability and other fields and metrics). So a simple byte comparison between the previous buffer and the one we retrieve from chrony will likely always differ, meaning that with this simple algorithm we would likely update the controller within 1 minute. A smarter algorithm (like you said, a "significant change") confuses me: what exactly is a "significant change" for the final user? The set of servers? A status/mode change?
The point is that when there is an issue with connectivity (Ethernet not connected, firewall not letting outbound NTP through) and the user wants to fix that and check that the fix worked, then a 10 minute wait to see whether it worked is not acceptable. And I would argue that even a 1 minute wait is painful. But 10,000 devices reporting periodically every 1 minute or less would be a huge load on the controller. Hence the suggestion that the reporting should happen when there is a significant change, and not periodically.
That is why I said "significant change": when the set of synchronized servers changes. Most important is when the set of synchronized servers goes between zero and non-zero, but even if the count of synced servers stays the same (and only an IP changes), or the count changes in general, it would make sense to send an update. I don't know if there are some high-level status/mode changes which the user would care about. Perhaps we can ask for feedback on that later?
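Sketching what such a detector might look like, with a hypothetical `Source` type standing in for chrony's per-source data (a minimal illustration of the idea discussed above, not the PR's code):

```go
// Source is a simplified, hypothetical stand-in for one NTP source
// entry as reported by chrony.
type Source struct {
	Addr         string // server address
	Synchronized bool   // true when chrony has selected/synced this source
}

// significantChange reports whether an update is worth publishing:
// true when the set of synchronized server addresses differs from
// the previously reported set; churn in other fields (reachability
// bits, metrics) is deliberately ignored.
func significantChange(prev, curr []Source) bool {
	toSet := func(srcs []Source) map[string]struct{} {
		set := make(map[string]struct{})
		for _, s := range srcs {
			if s.Synchronized {
				set[s.Addr] = struct{}{}
			}
		}
		return set
	}
	prevSet, currSet := toSet(prev), toSet(curr)
	if len(prevSet) != len(currSet) {
		return true // count changed, including zero <-> non-zero
	}
	for addr := range currSet {
		if _, ok := prevSet[addr]; !ok {
			return true // same count, but an address changed
		}
	}
	return false
}
```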
Force-pushed from 0119979 to e43ca92
Force-pushed from e43ca92 to b00279b
Difference to the previous version:
@eriknordmark could you please take a look
@rouming Will this also trigger publishing of NTP info: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/zedagent.go#L2182 ? (this code executes when network connectivity that previously didn't work is restored)
If I understand correctly, this will eventually lead to this timer kick, which kicks the stalled queue (stalled because of missing connectivity, for example) to repeat the send. This does not kick any of the higher-level calls, like the NTP sources send; it only kicks the queue, which is already populated with stalled requests.
Which will include NTP info messages if there was any NTP change during broken controller connectivity, correct?
"recurrently"? I can't quite understand what you mean by stalling the queue. I assume the same key is used with the deferred logic meaning that there is at most one NTP info message queued (and subsequent messages replace a message which has not yet been sent.) And note that if the controller is down the the deferred logic will retry (until it gets a 200 response) so strictly speaking you shouldn't need any more periodic message here than for e.g., device, app, or volume info messages. But no harm in sending occaionally so folks can see the skew/delta type NTP info.
|
Understanding is correct, the request will be the only one in the queue from ntp in case of lost connectivity. We need to update ntp sources recurrently because we need to reflect updates of other fields in the ntp peer structure (we can't guarantee other fields are up-to-date with only checking the change in set of peers or states). So we need to do that recurrently with low frequency, putting this request into the recurrent queue, not the event queue. |
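For illustration, a minimal sketch of the key-replacement semantics assumed above, with simplified hypothetical types (not pillar's actual deferred API): queuing under an existing key replaces the not-yet-sent message, so repeated NTP changes during an outage collapse into a single pending send.

```go
// deferredQueue holds at most one pending message per key; this
// mirrors the "subsequent messages replace a message which has not
// yet been sent" behavior discussed above (illustrative only).
type deferredQueue struct {
	items map[string][]byte
}

// setDeferred queues msg under key, replacing any stalled message
// that was queued under the same key but never sent.
func (q *deferredQueue) setDeferred(key string, msg []byte) {
	if q.items == nil {
		q.items = make(map[string][]byte)
	}
	q.items[key] = msg
}
```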
pkg/pillar/cmd/zedagent/handlentp.go
Outdated
// Even for the controller destination we can't stall the queue on error,
// set @forcePeriodic to true and let the timer repeat the send in a while
Doesn't this mean that if the network is not up (or there is a timeout), you will wait for the full 10 minutes until you try again?
Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?
How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?
(You might want to think about the relative priority of the new NTP INFO message if you didn't already.)
you will wait for the full 10 minutes until you try again?
Yes, if the network is not up then we have to wait another 10 minutes, exactly as all other info messages function (though with a shorter interval). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.
Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?
We don't have any callback mechanism or any error checks and retries. My impression is that this particular problem is not worth any over-complication; at least that is what I understood from discussing this with Daniel.
How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?
Not upsetting at all, but it abuses the semantics chosen a long time ago of having two queues: one for events and another for recurring calls. If you are satisfied with making NTP status exceptional, pushing it to the event queue and not the recurring queue (although we are forced to call it recurrently every 10 min), so that eventually we can merge this PR, I can cover the code with a fat comment explaining this and change the boolean flag.
Yes, if the network is not up then we have to wait another 10 minutes, exactly as all other info messages function (though with a shorter interval). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.
I see you changed this to pass forcePeriodic=false like for the other info messages, which is what I was asking for.
I see that zedagent calls zedcloudCtx.DeferredEventCtx.KickTimer() when it thinks it makes sense to try to send from the queue. In addition, there is a binary exponential timer (from 1 to 15 minutes) in deferred.go. So this would normally be retried relatively quickly if the network was not connected or there was a short outage.
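As an illustration of the retry behavior described above, here is a minimal sketch of a binary exponential backoff between the 1 and 15 minute bounds mentioned for deferred.go (names and structure are illustrative, not the actual pillar code):

```go
import "time"

// nextRetryInterval doubles the retry interval after each failed
// send, starting at 1 minute and capping at 15 minutes, mirroring
// the binary exponential timer described for deferred.go.
func nextRetryInterval(current time.Duration) time.Duration {
	const (
		minInterval = 1 * time.Minute
		maxInterval = 15 * time.Minute
	)
	if current < minInterval {
		return minInterval
	}
	if next := 2 * current; next < maxInterval {
		return next
	}
	return maxInterval
}
```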
Use chronyd for time synchronization instead of busybox ntpd because of its rich feature set and the Go implementation of its monitoring API, which we can use to get all the useful information for sharing NTP status with the controller. The only difference between how ntpd and chronyd start is the way an NTP server/peer is passed to the daemon: in the case of chronyd, the chrony client (chronyc) should be used. Also, the default pool.ntp.org is now explicitly passed to chronyd as a 'pool', while the other NTP servers, which come from nim, are passed as 'server'. In this change all common code was moved into functions, namely single_ntp_sync(), start_ntp_daemon() and restart_ntp_daemon(), which reduces code duplication. Also, chrony requires parsing the configuration and passing it as a single line to a chronyd running in `-q` client mode (see `single_ntp_sync()`). Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
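The actual implementation is shell in EVE's start scripts; as a rough Go sketch of the same one-shot pattern, chronyd run with -q sets the clock once and exits, and accepts configuration directives as command-line arguments (the function name and server values here are illustrative assumptions):

```go
package main

import (
	"log"
	"os/exec"
)

// oneShotNTPSync mirrors what single_ntp_sync() does: run chronyd in
// -q ("set clock and exit") mode, passing each configuration
// directive as its own command-line argument. The default pool is
// passed as a 'pool' directive, servers from nim as 'server'.
func oneShotNTPSync(pool string, servers []string) error {
	args := []string{"-q", "pool " + pool + " iburst"}
	for _, s := range servers {
		args = append(args, "server "+s+" iburst")
	}
	return exec.Command("chronyd", args...).Run()
}

func main() {
	if err := oneShotNTPSync("pool.ntp.org", nil); err != nil {
		log.Fatal(err)
	}
}
```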
The ZInfoNTPSources info message will be published by EVE. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing fancy in this PR, just vendor updates. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
The chrony API will be used later in this series to get NTP sources data from chronyd. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing to look at, just vendor files. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
…rred For a while now we have supported a few destinations: Controller, LOC, LPS. There are a few calls which send messages to all destinations simultaneously. The semantics of the `SetDeferred` function are to replace a deferred item if the key is found. The `queueInfoToDest()` can add two items with different URLs to the same `DeferredPeriodicCtx` context. The original behavior of the `SetDeferred` call is to replace the first item with the second item by the key, which will be the same. This leads to a missing packet to one of the URLs if Controller and LOC are both enabled. This patch adds a `URL` check along with the `key` check on deferred item add, making items with different URLs co-exist in the same queue. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
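A sketch of the fixed matching rule with simplified, hypothetical types (not the actual pillar structures): an item is replaced only when both the key and the destination URL match, so Controller and LOC entries queued under the same key can coexist.

```go
// deferredItem is a simplified stand-in for a queued request.
type deferredItem struct {
	key string
	url string
	buf []byte
}

// setDeferred replaces an existing item only when both the key and
// the URL match; previously the key alone was compared, so the LOC
// item would silently overwrite the Controller item (or vice versa).
func setDeferred(queue []deferredItem, item deferredItem) []deferredItem {
	for i := range queue {
		if queue[i].key == item.key && queue[i].url == item.url {
			queue[i] = item
			return queue
		}
	}
	return append(queue, item)
}
```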
The flag acts as a hint for request handlers and tells them that the request should be sent in any case. In the next patches NTP sources reports will be added; they will be sent to the controller only if the NTP peers have changed, so a comparison will take place. The LOC destination is special and should bypass any checks: the request should be forcibly sent. This scenario is the main user of the `ForceSend` flag. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
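A minimal sketch of the flag's effect (the names are illustrative assumptions, not pillar's actual API): the controller path compares against the previously sent report, while a LOC request sets ForceSend and skips the comparison.

```go
import "bytes"

// shouldSend sketches the ForceSend semantics: LOC requests bypass
// the peer-set comparison, while controller requests are sent only
// when the serialized NTP sources report actually changed.
func shouldSend(forceSend bool, prevSent, current []byte) bool {
	if forceSend {
		return true // LOC destination: always send
	}
	return !bytes.Equal(prevSent, current)
}
```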
This implements the NTP sources eve-api/proto/info/ntpsources.proto info messages, which are periodically (at least once per 10 min) sent to the controller. NTP sources data is fetched from the chronyd running on EVE over the unix domain socket with the help of the vendored API from the github.com/facebook/time/ntp/chrony package. Currently this API supports only monitoring (RO), not control. In the future this can be extended, so `nim` can have full control over chronyd and update its servers and pools. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
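A hedged sketch of pulling source data with that package: EVE talks to chronyd over its unix domain socket, while this example assumes chrony's read-only cmdmon UDP port (323) on localhost for brevity.

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/facebook/time/ntp/chrony"
)

func main() {
	// Monitoring-only requests are accepted on chrony's cmdmon port;
	// EVE itself uses the unix domain socket instead (assumption for
	// this example: chronyd runs locally with cmdmon enabled).
	conn, err := net.Dial("udp", "127.0.0.1:323")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := chrony.Client{Connection: conn}

	// Ask how many sources chronyd currently tracks.
	resp, err := client.Communicate(chrony.NewSourcesPacket())
	if err != nil {
		log.Fatal(err)
	}
	nSources := resp.(*chrony.ReplySources).NSources

	// Fetch per-source data, one request per source index.
	for i := 0; i < nSources; i++ {
		resp, err := client.Communicate(chrony.NewSourceDataPacket(int32(i)))
		if err != nil {
			log.Fatal(err)
		}
		data := resp.(*chrony.ReplySourceData)
		fmt.Printf("source %s state=%v reach=%o\n",
			data.IPAddr, data.State, data.Reachability)
	}
}
```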
Documentation bits. Never harms. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Force-pushed from b00279b to f826a78
Difference to the previous version:
LGTM
I also have these failures in my PR:
and
Something merged to master very recently must have caused this. I will investigate.
@milan-zededa I will proceed to merge this and your PR today, but can you please let us know whether you think these test failures indicate product issues (as opposed to test issues)?
Yes, it is some regression in the product. I can reproduce these two test failures locally and was able to narrow it down to this commit: bd68e04
@milan-zededa I'll check it out
It panics here: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/msrv/pubsub.go#L46
@milan-zededa see #3998
This implements the NTP sources eve-api/proto/info/ntpsources.proto info messages, which are periodically (once per 10 min) sent to the controller. NTP sources data is fetched from the chronyd running on EVE over the unix domain socket with the help of the vendored API from the github.com/facebook/time/ntp/chrony package. Currently this API supports only monitoring (RO), not control. In the future this can be extended, so nim can have full control over chronyd and update its servers and pools.

CC: @rene