Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve log during restore PVs and PVS for VolRep to reflect the core error #1460

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ELENAGER
Copy link
Member

During restore PVs and PVCs for VolRep we are not reflecting the real errors, but printing more general ones. This PR is fixing the problem, now we are able to see, for example, if failure is due to error in restoring k8s resources or due to error in restoring PV/PVCs.
Fixes: #1452

@ELENAGER ELENAGER marked this pull request as ready for review June 18, 2024 07:12
@@ -1877,6 +1878,8 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error)
// Restore all PVs found in the s3 store. If any failure, the next profile will be retried
pvCount, err = v.restorePVsFromObjectStore(objectStore, s3ProfileName)
if err != nil {
v.log.Error(err, fmt.Sprintf("error fetching PV cluster data from S3 profile %s", s3ProfileName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the failing function name, this should be: "cannot restore PVs from s3 store $q"

Since we retry using the next profile, is this really an error or we can use INFO level message?

(We should really use WARNING log level but the logger we used does not have that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to warning, added new context

@@ -1889,7 +1892,13 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error)
// Ignoring PVC restore errors helps with the upgrade from ODF-4.12.x to 4.13
pvcCount, err = v.restorePVCsFromObjectStore(objectStore, s3ProfileName)

if err != nil || pvCount != pvcCount {
if err != nil {
v.log.Error(err, fmt.Sprintf("error fetching PVC cluster data from S3 profile %s", s3ProfileName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the message and log level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to warning, added new context

continue
}

if pvCount != pvcCount {
v.log.Info(fmt.Sprintf("Warning: Mismatch in PV/PVC count %d/%d (%v)",
pvCount, pvcCount, err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the same style to do WARNING level messages for pv/pvc restore failures - this is the same use case - we try to use one profile and something failed, so we move to next profile.

Using log.Error() we get a traceback and I'm not sure it is helpful when expected errors and the error message is not enough.

@@ -1847,6 +1847,7 @@ func (v *VRGInstance) restorePVsAndPVCsForVolRep(result *ctrl.Result) (int, erro
return count, nil
}

//nolint:funlen
func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error) {
err := errors.New("s3Profiles empty")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This err will contain the last error calling one of:

  • v.reconciler.ObjStoreGetter.ObjectStore()
  • v.restorePVsFromObjectStore()
  • v.restorePVCsFromObjectStore()

If we fail to restore pvs, pvcs or pvcount/pvccount do not match, we return this err at line 1919.

Since we log all errors during the loop, this change does not fix the issue of logging the same error twice.

I think this function need a rewrite - it is too long and it is too hard to get error handling right with some much code.

I think extracting a helper for restoring from single s3 profile will make it much easier to handle errors.

The helper for restoring from single s3 store can return error early:

func (v *VRGInstance) restorePVsAndPVCsFromS3Profile(profileName string) error {
get object store...
if failed:
return err

restore pvs...
if failed:
    return err

restore pvcs...
if failed:
    return err

check pvs/pvcs counts...
if failed:
    return err

return nil

}

No logs are needed in this function - it just return an error. The function for restoring from all s3 stores can call it twice, logging error for each call.

If we failed to restore from both profiles, we need to return a new error about failing to restore from both profiles, since we already logged the error for every s3 profile.

@@ -1913,19 +1922,13 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error)
func (v *VRGInstance) restorePVsFromObjectStore(objectStore ObjectStorer, s3ProfileName string) (int, error) {
pvList, err := downloadPVs(objectStore, v.s3KeyPrefix())
if err != nil {
v.log.Error(err, fmt.Sprintf("error fetching PV cluster data from S3 profile %s", s3ProfileName))

return 0, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add context to the error?

return 0, fmt.Errorf("error restoring PVs from profile %q: %w", s3ProfileName, err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirs I am adding context in the caller function. It is now the same in both restorePVsFromObjectStore and restorePVCsFromObjectStore - these functions are returning core error.

v.log.Error(err, fmt.Sprintf("Resolve PV conflict in the S3 store %s to deploy the application", s3ProfileName))

return 0, fmt.Errorf("%s: %w", errMsg, err)
return 0, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add context?

return 0, fmt.Errorf("error restoring PVs from profile %q: %w", s3ProfileName, err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added context in caller function

@@ -1934,8 +1937,6 @@ func (v *VRGInstance) restorePVsFromObjectStore(objectStore ObjectStorer, s3Prof
func (v *VRGInstance) restorePVCsFromObjectStore(objectStore ObjectStorer, s3ProfileName string) (int, error) {
pvcList, err := downloadPVCs(objectStore, v.s3KeyPrefix())
if err != nil {
v.log.Error(err, fmt.Sprintf("error fetching PVC cluster data from S3 profile %s", s3ProfileName))

return 0, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, context for the error may help.

Maybe we get good enough errors - the best way to tell is to simulate failure and look at the logs with the errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding context in the caller function

Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share logs showing real failures with this change?

We can simulate issues by deleting pvs/pvcs from s3 store (e.g using mc) or making the s3 storage inaccessible (e.g. scale down minio deployment).

@@ -1877,6 +1878,8 @@ func (v *VRGInstance) restorePVsAndPVCsFromS3(result *ctrl.Result) (int, error)
// Restore all PVs found in the s3 store. If any failure, the next profile will be retried
pvCount, err = v.restorePVsFromObjectStore(objectStore, s3ProfileName)
if err != nil {
v.log.Info(fmt.Sprintf("Warning: cannot restore PVs from s3 profile %s (%v)", s3ProfileName, err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! new logs are consistent and provide good context for the actual error.

@ELENAGER
Copy link
Member Author

@nirs the real problem is not in PV/PVC restore, but in underline failure (for example, Kube objects group recover error), but inside function restorePVsAndPVCsFromS3 we were suppressed the real error and printed general error. My fix is adding real error message to be propagated to the top

… error

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Errors from restorePVsAndPVCsForVolRep are not specific about PV/PVC restores or from k8s object protection
3 participants