Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check cert dates #2952

Closed
wants to merge 126 commits into from
Closed

Check cert dates #2952

wants to merge 126 commits into from

Conversation

famleebob
Copy link
Contributor

This is a draft pull request, for review and discussion. Needs to be re-based, and interface work added.

@famleebob famleebob changed the title Check cert dates DRAFT: Check cert dates Dec 2, 2022
@famleebob famleebob marked this pull request as draft December 2, 2022 07:33
@famleebob famleebob changed the title DRAFT: Check cert dates Check cert dates Dec 2, 2022
@@ -511,6 +512,20 @@ func VerifySigningCertChain(log *base.LogObject, certs []*zcert.ZCert) ([]byte,
}
}
}

// verify date range of certificates
for _, cert := range certs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function gets called very time we receives a message from the cloud, to check the validity on every time is not necessary. have a way to check say once a day would be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I don't know if it is better having a separate timer for this, or just have this function track the last time it checked the certificates.

@giggsoff
Copy link
Contributor

I try to understand the consumer of the novel information you try to show. You set check inside the handler of /api/v2/edgeDevice/certs. Is it controller's responsibility to keep its certs valid across the time? Should the controller receive and handle something from device (which may have broken ntp source) instead of having internal checks for certs validity?

@eriknordmark
Copy link
Contributor

I try to understand the consumer of the novel information you try to show. You set check inside the handler of /api/v2/edgeDevice/certs. Is it controller's responsibility to keep its certs valid across the time? Should the controller receive and handle something from device (which may have broken ntp source) instead of having internal checks for certs validity?

@giggsoff it is useful to have some warnings from the devices, since currently when the controller certificates are about to expire there is no indication until 1) they have expired and 2) a device is installed or rebooted, since there are no periodic checks at all in running devices.

certName,
-validDuration)
certValid = false
CurrCertStatus.CertStatus = aCertIsINVALID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function which checks this should also 1) stop accepting new configuration from the controller 2) periodically fetch from /certs to get updated and valid certificate(s), and 3) put the device in maintenance mode until valid certs are received.

We can do 3) as a separate PR now or later; requires definiting a new MaintenanceModeReason value similar to the vaultlocked, and have EVE-OS set that new value and then clear it later. #3 makes sure that things are clearly flagged should be controller send unusable invalid certificate(s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function posts logs based on the particular threshold. (Default is 60 days prior a Notice log, escalating to Warning at 30 days -- both deprived from the config, and thus can be administratively set). Obviously, it posts a Failing log when out of date.

I can refactor this into a function to test the cert against a given date, returning the cert state based on the date specified, and a higher level function that evaluates the cert with the current date/time, and do the logging in a caller to that function. Thought about that from the start, but opted for what seemed to be the most straight forward approach.

It's about an hour or so to do the refactor. Let me know if you prefer it that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already is a function in the x509 package to to the validation based on a given date.
And I swear I added the name of that function in a comment in this PR a few weeks back; not the first time I find the gh looses comments.

Set CurrentTime with and call https://pkg.go.dev/crypto/x509#Certificate.Verify just like we do elsewhere in EVE-OS code.

certInfoThreshold = 30
certWarnThreshold = 15
)

func verifySignature(log *base.LogObject, certByte []byte, interm *x509.CertPool) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this function does not get called until a new certificate chain is downloaded from the controller. We want a periodic verification that the chain is still valid, which means having zedagent run a timer to verify the chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look, but I thought it was executed with roughly the same frequency of where I originally had the check code. I obviously need to take a better look.

Comment on lines +126 to +165
CertExpireInfo GlobalSettingKey = "timer.certcheck.infolog"
CertExpireWarn GlobalSettingKey = "timer.certcheck.warnlog"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, these are not used.


var (
lastCertTimeCheck time.Time
CurCertStatus = allCertsAreValid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be a global variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best way I could think to expose a warning. Also, a way to clear once any problems have been cleared.

Comment on lines +534 to +535
certInfoThreshold = 30
certWarnThreshold = 15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these removed and use the variables you added to types/global.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a place holder until I understand and create the appropriate accessors.

@@ -559,6 +576,43 @@ func verifySignature(log *base.LogObject, certByte []byte, interm *x509.CertPool
log.Errorln("verifySignature: " + errStr)
return errors.New(errStr)
}

// verify cert validity dates against thresholds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is both simpler and more correct to do not this here, but instead add a "when" argument to verifySignature() and call verifySignature() with different values of when (not+30 days etc)

And the feed what when into x509.VerifyOptions so that the leafcert.Verify above will do all the checks.

Copy link
Contributor Author

@famleebob famleebob Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds better, I assume that I t would be better to implement a timer callout that fetches the thresholds, locates the cert chain (or chains) to process, and then call verifySignature with each of the threshold values.

Sound more what you'd like to see?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, should we return a failure, or just log on the threshold events?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now let's get the logging in place (with the periodic check) and in a future PR once we've gone through a certificate roll in production, we can add the hard fail.

In terms of placement of the periodic check, after you started this work we've made controllerCertsTask() periodically try to re-fetch the certificates. It probably makes sense having that function also re-verify the validity of the certificates both each time when it fetched new ones, and infrequently (once a day?) when it did not fetch new certificates.

check cert valid for date, info at longest threshold, warn
at the shortest threshold, and error when  out of date.

Attempt to include these thresholds within the global configuration
data.

This commit for review and help integrating service into the rest
of EVE.

SEE: EVE-125
Correct use of threshold inputs to days.
Make default log a debug/trace message
Config name change to better clarify its use.
Use the `Verify` interface to a x509 Certificate to validate the
Certificate for the next fifteen or thirty days.  Events are logged
and currently return an error to the caller.  Also reflects the
most urgent state seen in `CurCertStatus`.  A WARN log is issued
for a cert expiring in the next fifteen days, and INFO if within
the next thirty days.

Had to pull in the `logrus` package for an INFO log.

Signed-off-by: Gerald (Bob) Lee <gerald@zededa.com>
eriknordmark and others added 8 commits August 4, 2023 11:11
This changes the TAG variable from being the branch to the actual
tag e.g., 10.1 vs. 10.1.0, which is needed so we can pull the correct
container from docker hub.

Signed-off-by: eriknordmark <erik@zededa.com>
The format changes in the new version we pulled in PR lf-edge#3293 resulted in
producing log entries in /persist/newlog/collect/ which are effectively
empty. The change in memlogd as seen by logread is a common vs a a
semicolon but the underlying change is to produce json which we need to
parse in newlogd.

Signed-off-by: eriknordmark <erik@zededa.com>
The old getSourceFromMsg() is no longer needed, and the repaireMsg() has
become unnecessary.

Signed-off-by: eriknordmark <erik@zededa.com>
The messages from containerd are not in json format but with time="X"
level=Y msg="Z". This adds back the ability to parse those (without
trying to repair some json format).

Signed-off-by: eriknordmark <erik@zededa.com>
Signed-off-by: eriknordmark <erik@zededa.com>
Signed-off-by: eriknordmark <erik@zededa.com>
when grub queries for available disks it doesn't take into account that
the disk can be removable e.g. USB stick. The disk can appear in front of regular
HDDs and the numbering will be different e.g. hd0 become hd1 when the
USB stick is plugged in. It is not a problem for GRUB to find a correct
partition in this case and the system can be booted just fine. However
every command from grub.cfg is measured into PCR8 while being executed
and HDD names appear in those commands e.g. 'set root=(hd2,gpt5)'. If
any key is sealed into TPM using PCR8 then that key cannot be unsealed when a
random USB stick is inserted (or removed if it was inserted when the key
was sealed)

the original issue should not affect PC BIOS case because USB devices
are usually emulated as either CD or floppy drives and have their unique
numbering

Signed-off-by: Mikhail Malyshev <mikem@zededa.com>
When a cloned ./grub repository exists in the pkg/grub
it is taken instead of tar file. autogen.sh must be run in this case
and gawk util is required

Signed-off-by: Mikhail Malyshev <mikem@zededa.com>
OhmSpectator and others added 25 commits August 4, 2023 11:11
…nstanceStatus.

This commit introduces robust serialization and deserialization functionality
for the snapshot instance status, ensuring continuity of snapshot lists after
system reboots.

Key changes:

Addition of the serializeSnapshotInstanceStatus function, which serializes the
SnapshotInstanceStatus into a JSON-formatted file.

Addition of the deserializeSnapshotInstanceStatus function. During
deserialization, comprehensive field checks are performed. These checks ensure
that all expected fields are present, and unknown fields are appropriately
flagged. This ensures a safe deserialization process even when the struct
format changes over time.

We've defined a list of critical fields essential for snapshot restoration. If
any of these critical fields are missing in the serialized data, the
deserialization will report an error, avoiding potential runtime issues.

The moveSnapshotToAvailable function was updated to call the serialization
function.

The restoreAvailableSnapshots function was added to the Run function in
zedmanager.go. It handles the restoration of the snapshot status for each
snapshot present in the snapshots directory after a system reboot.

With these changes, our snapshot management becomes more resilient and
flexible, easily accommodating changes in the SnapshotInstanceStatus structure.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces several utility functions in the zedmanagertypes.go
file. These functions are designed to facilitate the access of snapshot-related
files across different modules.

Each of these functions constructs the appropriate file path using the provided
snapshot ID, enabling easy access to the corresponding snapshot-related files.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Enhanced `lookupAppInstanceConfig` function to include an additional boolean
parameter that checks local AppInstanceConfig when set to true. Also, added
new function `lookupLocalAppInstanceConfig` to retrieve local
AppInstanceConfig.

Added subscription and publication for local AppInstanceConfig in the
zedmanager context. This serves as a placeholder for a config copy that Eve
reverts to when necessary. So, this subscriptions is checked first now. It
includes logic to use local AppInstanceConfig in `checkDelayedStartApps` and
`handleModify`.

Updated all calls to `lookupAppInstanceConfig` in multiple files to include the
new boolean parameter.

Implemented `unpublishLocalAppInstanceConfig` and
`publishLocalAppInstanceConfig` for managing local AppInstanceConfig
publication.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…unt.

The increment in the reference count for VolumeRefConfig was removed. This
operation was intended to prevent a volume from being deleted when it's
temporarily not in use by any application. However, this extra layer of
protection is not necessary as VolumeConfig is sufficient to maintain the
corresponding file.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Remove serialization of VolumeRefStatus to avoid persistent subscription issues
during Eve upgrades.

This commit removes the serialization and deserialization of VolumeRefStatus to
avoid potential problems with persistent subscriptions during Eve upgrades.
Changes in struct formats during an upgrade could lead to issues, hence this
preventative measure.

Specifically, the serialization and deserialization methods for VolumeRefStatus
have been removed, along with the related helper methods such as
getVolumeRefStatusFilename. Consequently, all error handling tied to these
operations has also been discarded.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…n names for clarity.

This commit refactors the names of several serialization and deserialization
functions for improved clarity and code readability. The changes include:

- `restoreConfigFromSnapshot` is now `restoreAppInstanceConfigFromSnapshot`
- `addFixupsIntoSnappedConfig` is now `fixupAppInstanceConfig`
- `deserializeConfigFromSnapshot` is now `deserializeAppInstanceConfigFromSnapshot`
- `saveConfigForSnapshots` is now `saveAppInstanceConfigForSnapshot`
- `serializeConfigToSnapshot` is now `serializeAppInstanceConfigToSnapshot`

The function argument lists and bodies remain unchanged; only the function
names have been modified. The new names make it clear that these functions
pertain to AppInstanceConfigs, improving their self-documentation and making
the codebase easier to understand and maintain.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit replaces the use of locally defined functions and string
manipulations for deriving snapshot-related directory paths with newly defined,
centralized functions.

Replaced `getSnapshotDir` function and subsequent filename derivation with
`types.GetSnapshotDir` and `types.GetSnapshotAppInstanceConfigFile`
respectively.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Refactoring of handlers for VolumesSnapshotStatus: The functions
handleVolumesSnapshotStatusCreate, handleVolumesSnapshotStatusModify, and
handleVolumesSnapshotStatusDelete have been refactored. Now, they call a new
function handleVolumesSnapshotStatusImpl which takes in the context and
volumesSnapshotStatus as arguments. Depending on the ResultOfAction in the
volumesSnapshotStatus, it calls one of the new functions reactToSnapshotCreate,
reactToSnapshotRollback, or reactToSnapshotDelete.

Separate handlers for different snapshot actions: The new functions
reactToSnapshotCreate, reactToSnapshotRollback, and reactToSnapshotDelete
handle the respective snapshot actions. Each of these functions fetches the
appInstanceStatus and performs specific operations based on the action and
whether there is an error in the volumesSnapshotStatus.

Snapshot deletion handled separately: A new function DeleteSnapshotFiles has
been added to delete snapshot files. It checks if a directory exists and then
removes it. This is used by reactToSnapshotDelete to remove snapshot files.

Changes in error handling: The error handling has been adjusted to consider the
error severity in reactToSnapshotCreate.

Adjustments in handleVolumesSnapshotStatusDelete: Now, it sets the
ResultOfAction in volumesSnapshotStatus to VolumesSnapshotDelete before calling
handleVolumesSnapshotStatusImpl.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
In updatestatus.go, the check for RollbackInProgress has been removed from the
updateAIStatusUUID and doUpdate functions, thus not skipping the update process
during a rollback. We should not skip now, as the config version is taken from
a special local subscription that provides us with the proper version. The
check for RollbackInProgress has also been removed from the doInstall function.
Instead, a new check for RollbackInProgress is added after checking if an app
purge is in progress. This means the system will not activate an application if
a rollback is still in progress.

In zedmanager.go, the triggering of the rollback has been moved to the
handleModify function from the doUpdate function. This means a rollback request
can now be handled directly when the app configuration is modified. The
initiation of the rollback now sets HasRollbackRequest and RollbackInProgress
to true. The rollback also publishes the application instance status and
updates the configuration to the one from the snapshot.

The handleModify function also has changes to the purge handling. The needPurge
condition now has one more option: if HasRollbackRequest is true. This is to
ensure that a purge occurs during a rollback (we need it to upgrade app
instance volumes).

The rollback handling that was previously in handleModify has been removed, as
it's now handled earlier in the same function.

In zedmanagertypes.go, the comments for HasRollbackRequest and
RollbackInProgress in the SnapshottingStatus struct have been updated to
reflect the new handling of rollbacks. The ConfigBeforeRollback field has been
removed because the rollback flow is now controlled by the RollbackInProgress
field, and the version of the configuration available in the module is the
correct one, not the previous one. Hence, it doesn't need to store the previous
configuration version.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…, synced one.

Substituted os.WriteFile with fileutils.WriteRename for a snapshot-related file
operation in zedmanager.go. This alteration ensures atomicity and synchronicity
in disk writes, preventing the creation of partially written files during
snapshot serialization.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces a function to clean up and remove unused
VolumeRefStatuses. The function iterates through the VolumeRefStatusList and
checks if the volume is still referenced in the current configuration. If not,
the volume status is removed from the list and its associated VolumeRefConfig
is unpublished. This ensures that any volume statuses that are no longer in use
are properly cleaned up.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit significantly revises the triggerRollback function within the
Zedmanager's updatestatus module. Instead of restoring VolumeRefStatuses and
the AppInstanceConfig from a snapshot, the function now creates or updates a
VolumesSnapshotConfig to indicate a rollback action. If a VolumesSnapshotConfig
is found for the snapshot in question, its action is set to rollback. If no
matching VolumesSnapshotConfig is found, a new one is created with the relevant
VolumeIDs from the current AppInstanceStatus, and the action set to rollback.
This change simplifies the rollback process and makes it more resilient to
system reboots.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces snapshot cleanup during the uninstallation process of an
application. Now, when an application is being uninstalled, all related
snapshot files will be deleted.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
The comment about the ConfigID being the ID of the config that created the
snapshot is no longer relevant or accurate, so it has been removed to avoid
confusion.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces significant changes to the handling of volume snapshots
in our system. Key changes include:

In volumemgr.go, the handlers for creating, modifying, and deleting volume
snapshots have been reworked. The new handlers are
`handleVolumesSnapshotConfigCreate`, `handleVolumesSnapshotConfigModify`, and
`handleVolumesSnapshotConfigDelete`. These handlers now call a common
implementation function `handleVolumesSnapshotConfigImpl`.

In handlesnapshot.go, the functions `handleVolumesSnapshotCreate`,
`handleVolumesSnapshotModify`, and `handleVolumesSnapshotDelete` have been
significantly reworked. They now call `handleVolumesSnapshotConfigImpl` with
the appropriate action set in the snapshot config. The function
`handleVolumesSnapshotConfigImpl` handles the creation, rollback, and deletion
of snapshots based on the action in the config.

New functions `createSnapshot`, `rollbackSnapshot`, and `deleteSnapshot` have
been introduced in handlesnapshot.go to handle the respective actions.

The functions `publishVolumesSnapshotStatus` and
`unpublishVolumesSnapshotStatus` remain largely unchanged, but the function
`lookupVolumesSnapshotStatus` has been updated to check if the snapshot status
exists in the store if it is not found in the published state. It's necessary
to be able to do rollback after the system reboot. For that a new function
`deserializeVolumesSnapshotStatus` has been introduced to read the snapshot
status from a file.

In updatestatus.go, the function `triggerSnapshotDeletion` has been updated to
set the `Action` of `volumesSnapshotConfig` to `types.VolumesSnapshotDelete`
before unpublishing the snapshot config. It's an optional change.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
… in handleDeferredVolumeCreate.

After reworking the snapshot serialization, volume status is no longer stored.
As a result, it becomes an unexpected situation when a status exists at the
moment of calling this function. Therefore, the log level has been changed from
Warning to Fatal to reflect the seriousness of this unexpected state.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit refactors the deserialization logic used in volumemgr and
zedmanager packages into a new utility function, utils.DeserializeToStruct, to
reduce code duplication and improve readability. The redundant code in the
handleSnapshot and handlevolumemgr files is replaced with this new utility
function. In addition, a new utility file deserialize.go is introduced to
contain the shared deserialization logic. The related imports are also updated
accordingly.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Refactor the DeserializeToStruct function to accept a pointer to an empty
structure instead of requiring the caller to provide a reflect.Type and
separately manage critical fields.

Remove the usage of reflect package from snapshot handling methods. Reflected
code has been replaced by direct deserialization into the structs.

Replaced the critical fields maps in VolumesSnapshotStatus and
SnapshotInstanceStatus structures with struct tags. Introduced 'mandatory' tag
to mark critical fields directly in the struct declaration. This approach is
more idiomatic and makes it easier to understand which fields are critical.

Updated extractFields function to identify the mandatory fields based on the
new 'mandatory' struct tag.

Simplified error handling in DeserializeToStruct function by directly returning
formatted error messages.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces a suite of tests for the DeserializeToStruct function in
the utils package. The tests cover various scenarios including simple struct,
error conditions, mandatory field logic, extra field in file, missing
non-mandatory field in file, struct with anonymous field, nested structs, array
of structs, and anonymous structs. These tests ensure the robustness and
reliability of the DeserializeToStruct function.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Signed-off-by: Avi Deitcher <avi@deitcher.net>
Some pillar/types on which Edgeview depends were changed during
the Zedrouter refactoring. Please note that these types
are not persisted, so we do not concern ourselves with backward
compatibility (as it would significantly limit development).
The Edgeview changes done in this commit only affects the server
side, meaning that users will not have to update the client-side
container.

Signed-off-by: Milan Lenco <milan@zededa.com>
Add GNU tar to the container, otherwise running collect-info.sh script
while the device is offline fails to gzip the collected logs, because
the BusyBox tar lacks some features compared to GNU version.

Signed-off-by: Shahriyar Jalayeri <shahriyar@zededa.com>
Signed-off-by: Shahriyar Jalayeri <shahriyar@zededa.com>
Signed-off-by: Shahriyar Jalayeri <shahriyar@zededa.com>
check cert valid for date, info at longest threshold, warn
at the shortest threshold, and error when  out of date.

Attempt to include these thresholds within the global configuration
data.

This commit for review and help integrating service into the rest
of EVE.

SEE: EVE-125
@eriknordmark
Copy link
Contributor

@famleebob looks like the PR got messed up since it is showing there are 126 commits.
Did you to a git pull + merge? You need to do a git fetch + rebase to place your commits on top of master (and avoid any merge commits.)

@rouming
Copy link
Contributor

rouming commented Dec 11, 2023

@famleebob are you still keen to continue work on this? or we can close?

@rouming rouming added the stalled PR has stalled and was closed. Might make sense to revisit later. label Dec 14, 2023
@rouming
Copy link
Contributor

rouming commented Dec 14, 2023

Labeled as stalled. Closing this because no activity.

@rouming rouming closed this Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stalled PR has stalled and was closed. Might make sense to revisit later.
Projects
None yet
Development

Successfully merging this pull request may close these issues.