Resolving a Ramen Validation Issue in Failover Cluster Post Hub Recovery #1431

Open
BenamarMk wants to merge 3 commits into main from fix-ramen-validation-issue-post-hub-recovery

Conversation

BenamarMk
Member

@BenamarMk BenamarMk commented May 31, 2024

This PR fixes a day-one issue in the post-hub-recovery process that was recently uncovered by a new validation check added to Ramen. The validation was designed to verify the failover cluster before initiating the failover procedure.

Before this validation was introduced, sync and failover operations functioned correctly post hub recovery despite the underlying bug, because the ReplicationDestination already existed on the failover cluster and the missing RDSpec had no effect.

Note that this issue occurs only when the primary cluster is inaccessible. In that situation Ramen cannot retrieve the VRG from the primary cluster, so when it regenerates the ManifestWork for the VRG, the RDSpecs are excluded. These RDSpecs are created for each ProtectedPVC object in the primary VRG.

The solution in this PR adds a check for the existence of the ManifestWork before creating it, to prevent accidentally overwriting a valid VRG spec: if a ManifestWork already exists on the destination cluster, its creation is skipped. Additionally, a utility function has been added that returns a map of VRGs retrieved using MCV. If a primary VRG is not found, the function retrieves it from the S3 store and adds it to the map. This new function replaces all occurrences of d.vrgs throughout the codebase, ensuring that DRPC maintains an accurate view of all VRGs in the managed clusters.
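
As a rough, self-contained sketch of the ManifestWork existence check (the function name, the ManifestWork name, and the "namespace = cluster name" convention are assumptions for illustration, not necessarily the helpers Ramen actually uses):

```go
package util

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ocmworkv1 "open-cluster-management.io/api/work/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// vrgManifestWorkExists reports whether a VRG ManifestWork already exists on the
// hub for the given managed cluster (assuming it lives in the cluster's namespace
// on the hub). If it exists, the caller skips creation so a regenerated VRG spec
// with missing RDSpecs never overwrites the valid one.
func vrgManifestWorkExists(ctx context.Context, hub client.Client, mwName, cluster string) (bool, error) {
	mw := &ocmworkv1.ManifestWork{}

	err := hub.Get(ctx, types.NamespacedName{Name: mwName, Namespace: cluster}, mw)
	if err == nil {
		return true, nil
	}

	if k8serrors.IsNotFound(err) {
		return false, nil
	}

	return false, fmt.Errorf("failed to look up ManifestWork %s/%s: %w", cluster, mwName, err)
}
```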

https://bugzilla.redhat.com/show_bug.cgi?id=2284021

TODO:

  • Test the fix end-to-end

…ter Inaccessibility

A new validation check in Ramen revealed an issue with failover clusters after hub recovery
when the primary cluster is inaccessible. The problem arises because Ramen cannot retrieve
the VRG from the primary cluster, leading to the omission of RDSpecs in the regenerated
ManifestWork. The fix involves checking for the existence of the ManifestWork before creating
it. Skip creation if a ManifestWork already exists for the target cluster.

https://bugzilla.redhat.com/show_bug.cgi?id=2284021

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from abc5380 to 60374a7 on June 6, 2024 at 23:08
Created a utility function that returns a map of VRGs retrieved using MCV.
If a primary VRG is not found, the function retrieves it from the S3 store
and adds it to the map. This function replaces every occurrence of d.vrgs
throughout the codebase, ensuring that DRPC has a near-accurate view of
all VRGs in the managed clusters.

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
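
A minimal, self-contained sketch of the fallback this commit describes (the standalone function shape, the VRGGetter type, and the import alias are assumptions for illustration; in the PR the helper is a DRPCInstance method named fetchVRGsFromMClusterOrS3, and the Ramen API field names used here are from my reading of the API and may differ):

```go
package util

import (
	rmn "github.com/ramendr/ramen/api/v1alpha1"
)

// VRGGetter fetches the VRG for one managed cluster; one implementation would be
// backed by a ManagedClusterView (MCV), the other by the S3 metadata store.
type VRGGetter func(cluster string) (*rmn.VolumeReplicationGroup, error)

// fetchVRGs builds a per-cluster map of VRGs, preferring the MCV copy and falling
// back to the S3 copy only when no primary VRG could be read via MCV (for example,
// when the primary cluster is unreachable after hub recovery).
func fetchVRGs(clusters []string, fromMCV, fromS3 VRGGetter) map[string]*rmn.VolumeReplicationGroup {
	vrgs := map[string]*rmn.VolumeReplicationGroup{}

	for _, cluster := range clusters {
		if vrg, err := fromMCV(cluster); err == nil {
			vrgs[cluster] = vrg
		}
	}

	if hasPrimary(vrgs) {
		return vrgs
	}

	// No primary found via MCV: fill the gaps from the S3 store so DRPC still has a
	// near-accurate view of the primary VRG and its RDSpecs.
	for _, cluster := range clusters {
		if _, ok := vrgs[cluster]; ok {
			continue
		}

		if vrg, err := fromS3(cluster); err == nil {
			vrgs[cluster] = vrg
		}
	}

	return vrgs
}

func hasPrimary(vrgs map[string]*rmn.VolumeReplicationGroup) bool {
	for _, vrg := range vrgs {
		if vrg.Spec.ReplicationState == rmn.Primary {
			return true
		}
	}

	return false
}
```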
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from 60374a7 to 00d3f10 on June 6, 2024 at 23:26
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from b7e0305 to 2cc42db on June 8, 2024 at 13:21
@BenamarMk
Member Author

@ShyamsundarR unit tests and e2e are complete. Ready for review and merge by Monday, if possible.

Member

@ShyamsundarR ShyamsundarR left a comment

Posting some initial comments, still looking at other callers to see where we should potentially use the s3 version of VRG.

@@ -912,7 +970,7 @@ func (d *DRPCInstance) ensureCleanupAndVolSyncReplicationSetup(srcCluster string

// Check if the reset has already been applied. ResetVolSyncRDOnPrimary resets the VRG
// in the MW, but the VRGs in the vrgs slice are fetched using MCV.
- vrg, ok := d.vrgs[srcCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[srcCluster]
Member

This is a check to ensure the reset has reached the managed cluster, so it should use the MCV VRG rather than the S3 VRG (if one is found)?

@@ -1531,7 +1591,7 @@ func (d *DRPCInstance) ensureVRGManifestWork(homeCluster string) error {
d.log.Info("Ensure VRG ManifestWork",
"Last State:", d.getLastDRState(), "cluster", homeCluster)

- cachedVrg := d.vrgs[homeCluster]
+ cachedVrg := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

I think we should add safety here to fetch the MW, and only create if one does not exist. Otherwise createVRGManifestWork would just create/update the VRG MW without any rdSpec etc.

@@ -984,7 +1042,7 @@ func (d *DRPCInstance) prepareForFinalSync(homeCluster string) (bool, error) {

const done = true

- vrg, ok := d.vrgs[homeCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Should we trust the VRG from S3 to tell us vrg.Status.PrepareForFinalSyncComplete, or should we wait for the MCV to report the status?

Member

@ShyamsundarR ShyamsundarR left a comment

@BenamarMk let's discuss the changes. Overall we should rely on the S3 version of VRG when MCV version is not found, but I am not sure we can make this the default in all/most cases.

@@ -1042,7 +1100,8 @@ func (d *DRPCInstance) runFinalSync(homeCluster string) (bool, error) {
func (d *DRPCInstance) areMultipleVRGsPrimary() bool {
numOfPrimaries := 0

- for _, vrg := range d.vrgs {
+ vrgs := d.fetchVRGsFromMClusterOrS3()
Member

This path is from Relocate, where both clusters should be active. So checking for the VRG using MCV would be sufficient.

@@ -1014,7 +1072,7 @@ func (d *DRPCInstance) runFinalSync(homeCluster string) (bool, error) {

const done = true

- vrg, ok := d.vrgs[homeCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Same as prepareForFinalSync?

@@ -1067,7 +1126,8 @@ func (d *DRPCInstance) selectCurrentPrimaryAndSecondaries() (string, []string) {

primaryVRG := ""

- for cn, vrg := range d.vrgs {
+ vrgs := d.fetchVRGsFromMClusterOrS3()
Member

Same as areMultipleVRGsPrimary

@@ -1129,7 +1189,7 @@ func (d *DRPCInstance) readyToSwitchOver(homeCluster string, preferredCluster st
}

func (d *DRPCInstance) checkReadinessAfterFailover(homeCluster string) bool {
- vrg := d.vrgs[homeCluster]
+ vrg := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Readiness check should rely on MCV.

@BenamarMk
Member Author

BenamarMk commented Jun 13, 2024

@BenamarMk let's discuss the changes. Overall we should rely on the S3 version of VRG when MCV version is not found, but I am not sure we can make this the default in all/most cases.

One thing to note: we rely on S3 only when there is no primary.
