Implement NTP source info messages to the controller #3968
Conversation
Force-pushed from 6f78709 to 3513093
Force-pushed from 3513093 to e5455c5
Could you please mention in the EVE docs:
- that we use chrony
- how we pick the NTP server (the first from the list obtained via DHCP or static config, unless we improve this to apply all of them)
- what the default NTP server is if none is configured
- what NTP status information EVE publishes (what info is available, how often it is published, and how to change the interval)
I have started something here, but a separate md file for NTP could also make sense.
Force-pushed from e5455c5 to 70ad87c
Force-pushed from 4582cf4 to b8578cd
Mentioned.
Those two are already covered by you.
Done.
Meh, a separate file is too much; I do not know what to write there except those four lines I added.
Force-pushed from 089d4f2 to 4117601
Force-pushed from e16425e to 0119979
A periodic message makes sense for a LOC/LPS (and I don't understand why that isn't included; we shouldn't have that provide less info than the controller).
But for the info sent to the controller I don't think periodic updates make sense. That should happen when there is a significant change. Ten minutes is too slow when there are changes, and too frequent when there are no significant changes.
Not sure what you mean by "that isn't included"; LOC is updated periodically by the
My main question is how many bytes we need to save here. E.g. we do periodic updates for the location (though with a bigger interval). Is it that bad to send ~100 bytes every 10 minutes to update NTP sources? The other observation is that NTP sources update very frequently (the server set changes, as do reachability and other fields and metrics). So a simple byte comparison between the previous buffer and the one we retrieve from chrony will likely always differ, meaning that with this simple algorithm we would likely update the controller within 1 minute. A smarter algorithm (like you said, a "significant change") confuses me: what exactly is a "significant change" for the final user? The set of servers? A status/mode change?
The point is that when there is an issue with connectivity (Ethernet not connected, firewall not letting outbound NTP through) and the user wants to fix that and check that the fix worked, then a 10 minute wait to see whether it worked is not acceptable. And I would argue that even a 1 minute wait is painful. But 10,000 devices reporting periodically every 1 minute or less would be a huge load on the controller. Hence the suggestion that the reporting should happen when there is a significant change, and not periodically.
That is why I said "significant change": when the set of synchronized servers changes. Most important is when the set of synchronized servers goes between zero and non-zero, but even if the count of synced servers stays the same (and only an IP changes), or the count changes in general, it would make sense to send an update. I don't know if there are some high-level status/mode changes which the user would care about. Perhaps we can ask for feedback on that later?
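Sketching what such a detector might look like, with a hypothetical `Source` type standing in for chrony's per-source data (a minimal illustration of the idea discussed above, not the PR's code):

```go
// Source is a simplified, hypothetical stand-in for one NTP source
// entry as reported by chrony.
type Source struct {
	Addr         string // server address
	Synchronized bool   // true when chrony has selected/synced this source
}

// significantChange reports whether an update is worth publishing:
// true when the set of synchronized server addresses differs from
// the previously reported set; churn in other fields (reachability
// bits, metrics) is deliberately ignored.
func significantChange(prev, curr []Source) bool {
	toSet := func(srcs []Source) map[string]struct{} {
		set := make(map[string]struct{})
		for _, s := range srcs {
			if s.Synchronized {
				set[s.Addr] = struct{}{}
			}
		}
		return set
	}
	prevSet, currSet := toSet(prev), toSet(curr)
	if len(prevSet) != len(currSet) {
		return true // count changed, including zero <-> non-zero
	}
	for addr := range currSet {
		if _, ok := prevSet[addr]; !ok {
			return true // same count, but an address changed
		}
	}
	return false
}
```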
Force-pushed from 0119979 to e43ca92
Force-pushed from e43ca92 to b00279b
Difference to the previous version:
@eriknordmark could you please take a look
@rouming Will this also trigger publishing of NTP info: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/zedagent.go#L2182 ? (this code executes when network connectivity that previously didn't work is restored)
If I understand correctly, this will eventually lead to this timer kick, which kicks the stalled queue (stalled because of missing connectivity, for example) to repeat the send. This does not kick any of the higher-level calls, like the NTP sources send; it only kicks the queue, which is already populated with stalled requests.
Which will include NTP info messages if there was any NTP change during broken controller connectivity, correct?
"recurrently"? I can't quite understand what you mean by stalling the queue. I assume the same key is used with the deferred logic meaning that there is at most one NTP info message queued (and subsequent messages replace a message which has not yet been sent.) And note that if the controller is down the the deferred logic will retry (until it gets a 200 response) so strictly speaking you shouldn't need any more periodic message here than for e.g., device, app, or volume info messages. But no harm in sending occaionally so folks can see the skew/delta type NTP info.
|
Understanding is correct, the request will be the only one in the queue from ntp in case of lost connectivity. We need to update ntp sources recurrently because we need to reflect updates of other fields in the ntp peer structure (we can't guarantee other fields are up-to-date with only checking the change in set of peers or states). So we need to do that recurrently with low frequency, putting this request into the recurrent queue, not the event queue. |
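For illustration, a minimal sketch of the key-replacement semantics assumed above, with simplified hypothetical types (not pillar's actual deferred API): queuing under an existing key replaces the not-yet-sent message, so repeated NTP changes during an outage collapse into a single pending send.

```go
// deferredQueue holds at most one pending message per key; this
// mirrors the "subsequent messages replace a message which has not
// yet been sent" behavior discussed above (illustrative only).
type deferredQueue struct {
	items map[string][]byte
}

// setDeferred queues msg under key, replacing any stalled message
// that was queued under the same key but never sent.
func (q *deferredQueue) setDeferred(key string, msg []byte) {
	if q.items == nil {
		q.items = make(map[string][]byte)
	}
	q.items[key] = msg
}
```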
pkg/pillar/cmd/zedagent/handlentp.go
Outdated
// Even for the controller destination we can't stall the queue on error,
// set @forcePeriodic to true and let the timer repeat the send in a while
Doesn't this mean that if the network is not up (or there is a timeout), you will wait for the full 10 minutes until you try again?
Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?
How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?
(You might want to think about the relative priority of the new NTP INFO message if you didn't already.)
you will wait for the full 10 minutes until you try again?
Yes, if the network is not up then we have to wait another 10 minutes, exactly as all other info messages function (though with a shorter interval). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.
Why can't you have the queue retry on error for the triggered updates (and have the periodic updates not retry)?
We don't have any callback mechanism or any error checks and retries. My impression is that this particular problem is not worth any over-complication; at least that is what I understood from discussing this with Daniel.
How will the queue "stall" in a way which will be more upsetting than e.g., a device INFO message being queued and retried?
Not upsetting at all, but it abuses the semantics chosen a long time ago of having two queues: one for events and another for recurring calls. If you are satisfied with making NTP status exceptional, pushing it to the event queue and not the recurring queue (although we are forced to call it recurrently every 10 min), so that eventually we can merge this PR, I can cover the code with a fat comment explaining this and change the boolean flag.
Yes, if the network is not up then we have to wait another 10 minutes, exactly as all other info messages function (though with a shorter interval). So I do not quite understand why NTP status (which has a purely informative role on the controller) should be an exception here.
I see you changed this to pass forcePeriodic=false like for the other info messages, which is what I was asking for.
I see that zedagent calls zedcloudCtx.DeferredEventCtx.KickTimer() when it thinks it makes sense to try to send from the queue. In addition, there is a binary exponential timer (from 1 to 15 minutes) in deferred.go. So this would normally be retried relatively quickly if the network was not connected or there was a short outage.
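As an illustration of the retry behavior described above, here is a minimal sketch of a binary exponential backoff between the 1 and 15 minute bounds mentioned for deferred.go (names and structure are illustrative, not the actual pillar code):

```go
import "time"

// nextRetryInterval doubles the retry interval after each failed
// send, starting at 1 minute and capping at 15 minutes, mirroring
// the binary exponential timer described for deferred.go.
func nextRetryInterval(current time.Duration) time.Duration {
	const (
		minInterval = 1 * time.Minute
		maxInterval = 15 * time.Minute
	)
	if current < minInterval {
		return minInterval
	}
	if next := 2 * current; next < maxInterval {
		return next
	}
	return maxInterval
}
```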
Use chronyd for time synchronization instead of busybox ntpd because of its rich feature set and the Go implementation of its monitoring API, which we can use to get all the useful information for sharing NTP status with the controller. The only difference between how ntpd and chronyd start is the way an NTP server/peer is passed to the daemon: in the case of chronyd, the chrony client (chronyc) should be used. Also, the default pool.ntp.org is now explicitly passed to chronyd as a 'pool', while the other NTP servers, which come from nim, are passed as 'server'. In this change all common code was moved into functions, namely single_ntp_sync(), start_ntp_daemon() and restart_ntp_daemon(), which reduces code duplication. Also, chrony requires parsing the configuration and passing it as a single line to a chronyd running in `-q` client mode (see `single_ntp_sync()`). Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
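The actual implementation is shell in EVE's start scripts; as a rough Go sketch of the same one-shot pattern, chronyd run with -q sets the clock once and exits, and accepts configuration directives as command-line arguments (the function name and server values here are illustrative assumptions):

```go
package main

import (
	"log"
	"os/exec"
)

// oneShotNTPSync mirrors what single_ntp_sync() does: run chronyd in
// -q ("set clock and exit") mode, passing each configuration
// directive as its own command-line argument. The default pool is
// passed as a 'pool' directive, servers from nim as 'server'.
func oneShotNTPSync(pool string, servers []string) error {
	args := []string{"-q", "pool " + pool + " iburst"}
	for _, s := range servers {
		args = append(args, "server "+s+" iburst")
	}
	return exec.Command("chronyd", args...).Run()
}

func main() {
	if err := oneShotNTPSync("pool.ntp.org", nil); err != nil {
		log.Fatal(err)
	}
}
```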
The ZInfoNTPSources info message will be published by EVE. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing fancy in this PR, just vendor updates. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
The chrony API will be used later in this series to get NTP sources data from chronyd. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Nothing to look at, just vendor files. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
…rred For a while now we have supported a few destinations: Controller, LOC, LPS. There are a few calls which send messages to all destinations simultaneously. The semantics of the `SetDeferred` function are to replace a deferred item if the key is found. The `queueInfoToDest()` can add two items with different URLs to the same `DeferredPeriodicCtx` context. The original behavior of the `SetDeferred` call is to replace the first item with the second item by the key, which will be the same. This leads to a missing packet to one of the URLs if Controller and LOC are both enabled. This patch adds a `URL` check along with the `key` check on deferred item add, making items with different URLs co-exist in the same queue. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
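A sketch of the fixed matching rule with simplified, hypothetical types (not the actual pillar structures): an item is replaced only when both the key and the destination URL match, so Controller and LOC entries queued under the same key can coexist.

```go
// deferredItem is a simplified stand-in for a queued request.
type deferredItem struct {
	key string
	url string
	buf []byte
}

// setDeferred replaces an existing item only when both the key and
// the URL match; previously the key alone was compared, so the LOC
// item would silently overwrite the Controller item (or vice versa).
func setDeferred(queue []deferredItem, item deferredItem) []deferredItem {
	for i := range queue {
		if queue[i].key == item.key && queue[i].url == item.url {
			queue[i] = item
			return queue
		}
	}
	return append(queue, item)
}
```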
The flag acts as a hint for request handlers and tells them that the request should be sent in any case. In the next patches NTP sources reports will be added; they will be sent to the controller only if the NTP peers have changed, so a comparison will take place. The LOC destination is special and should bypass any checks: the request should be forcibly sent. This scenario is the main user of the `ForceSend` flag. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
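A minimal sketch of the flag's effect (the names are illustrative assumptions, not pillar's actual API): the controller path compares against the previously sent report, while a LOC request sets ForceSend and skips the comparison.

```go
import "bytes"

// shouldSend sketches the ForceSend semantics: LOC requests bypass
// the peer-set comparison, while controller requests are sent only
// when the serialized NTP sources report actually changed.
func shouldSend(forceSend bool, prevSent, current []byte) bool {
	if forceSend {
		return true // LOC destination: always send
	}
	return !bytes.Equal(prevSent, current)
}
```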
This implements the NTP sources eve-api/proto/info/ntpsources.proto info messages, which are periodically (at least once per 10 min) sent to the controller. NTP sources data is fetched from the chronyd running on EVE over the unix domain socket with the help of the vendored API from the github.com/facebook/time/ntp/chrony package. Currently this API supports only monitoring (RO), not control. In the future this can be extended, so `nim` can have full control over chronyd and update its servers and pools. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
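A hedged sketch of pulling source data with that package: EVE talks to chronyd over its unix domain socket, while this example assumes chrony's read-only cmdmon UDP port (323) on localhost for brevity.

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/facebook/time/ntp/chrony"
)

func main() {
	// Monitoring-only requests are accepted on chrony's cmdmon port;
	// EVE itself uses the unix domain socket instead (assumption for
	// this example: chronyd runs locally with cmdmon enabled).
	conn, err := net.Dial("udp", "127.0.0.1:323")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := chrony.Client{Connection: conn}

	// Ask how many sources chronyd currently tracks.
	resp, err := client.Communicate(chrony.NewSourcesPacket())
	if err != nil {
		log.Fatal(err)
	}
	nSources := resp.(*chrony.ReplySources).NSources

	// Fetch per-source data, one request per source index.
	for i := 0; i < nSources; i++ {
		resp, err := client.Communicate(chrony.NewSourceDataPacket(int32(i)))
		if err != nil {
			log.Fatal(err)
		}
		data := resp.(*chrony.ReplySourceData)
		fmt.Printf("source %s state=%v reach=%o\n",
			data.IPAddr, data.State, data.Reachability)
	}
}
```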
Documentation bits. Never harms. Signed-off-by: Roman Penyaev <r.peniaev@gmail.com>
Force-pushed from b00279b to f826a78
Difference to the previous version:
LGTM
I also have these failures in my PR:
and
Something merged to master very recently must have caused this. I will investigate.
@milan-zededa I will proceed to merge this and your PR today, but can you please let us know whether you think these test failures indicate product issues (as opposed to test issues)?
Yes, it is some regression in the product. I can reproduce these two test failures locally and was able to narrow it down to this commit: bd68e04
@milan-zededa I'll check it out
It panics here: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/msrv/pubsub.go#L46
@milan-zededa see #3998
This implements the NTP sources eve-api/proto/info/ntpsources.proto info messages, which are periodically (once per 10 min) sent to the controller. NTP sources data is fetched from the chronyd running on EVE over the unix domain socket with the help of the vendored API from the github.com/facebook/time/ntp/chrony package. Currently this API supports only monitoring (RO), not control. In the future this can be extended, so nim can have full control over chronyd and update its servers and pools.

CC: @rene