Resolving a Ramen Validation Issue in Failover Cluster Post Hub Recovery #1431

Open
BenamarMk wants to merge 3 commits into main from fix-ramen-validation-issue-post-hub-recovery

Conversation

BenamarMk
Member

@BenamarMk BenamarMk commented May 31, 2024

This PR fixes a day-one issue in the post-hub-recovery process that was recently uncovered by a new validation check added to Ramen. The validation was designed to verify the failover cluster before initiating the failover procedure.

Before this validation was introduced, sync and failover operations functioned correctly post hub recovery despite the underlying bug, because the ReplicationDestination already existed on the failover cluster and the missing RDSpec had no effect.

Note that this issue occurs only when the primary cluster is inaccessible. In that situation Ramen cannot retrieve the VRG from the primary cluster, so when it regenerates the ManifestWork for the VRG, the RDSpecs are excluded. These RDSpecs are created for each ProtectedPVC object in the primary VRG.

The solution in this PR adds a check for the existence of the ManifestWork before creating it, to prevent accidentally overwriting a valid VRG spec: if a ManifestWork already exists on the destination cluster, its creation is skipped. Additionally, a utility function has been added that returns a map of VRGs retrieved using MCV. If a primary VRG is not found, the function retrieves it from the S3 store and adds it to the map. This new function replaces all occurrences of d.vrgs throughout the codebase, ensuring that DRPC maintains an accurate view of all VRGs in the managed clusters.
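
As a rough, self-contained sketch of the ManifestWork existence check (the function name, the ManifestWork name, and the "namespace = cluster name" convention are assumptions for illustration, not necessarily the helpers Ramen actually uses):

```go
package util

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ocmworkv1 "open-cluster-management.io/api/work/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// vrgManifestWorkExists reports whether a VRG ManifestWork already exists on the
// hub for the given managed cluster (assuming it lives in the cluster's namespace
// on the hub). If it exists, the caller skips creation so a regenerated VRG spec
// with missing RDSpecs never overwrites the valid one.
func vrgManifestWorkExists(ctx context.Context, hub client.Client, mwName, cluster string) (bool, error) {
	mw := &ocmworkv1.ManifestWork{}

	err := hub.Get(ctx, types.NamespacedName{Name: mwName, Namespace: cluster}, mw)
	if err == nil {
		return true, nil
	}

	if k8serrors.IsNotFound(err) {
		return false, nil
	}

	return false, fmt.Errorf("failed to look up ManifestWork %s/%s: %w", cluster, mwName, err)
}
```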

https://bugzilla.redhat.com/show_bug.cgi?id=2284021

TODO:

  • Test the fix end-to-end

…ter Inaccessibility

A new validation check in Ramen revealed an issue with failover clusters after hub recovery
when the primary cluster is inaccessible. The problem arises because Ramen cannot retrieve
the VRG from the primary cluster, leading to the omission of RDSpecs in the regenerated
ManifestWork. The fix involves checking for the existence of the ManifestWork before creating
it. Skip creation if a ManifestWork already exists for the target cluster.

https://bugzilla.redhat.com/show_bug.cgi?id=2284021

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from abc5380 to 60374a7 on June 6, 2024 at 23:08
Created a utility function that returns a map of VRGs retrieved using MCV.
If a primary VRG is not found, the function retrieves it from the S3 store
and adds it to the map. This function replaces every occurrence of d.vrgs
throughout the codebase, ensuring that DRPC has a near-accurate view of
all VRGs in the managed clusters.

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
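
A minimal, self-contained sketch of the fallback this commit describes (the standalone function shape, the VRGGetter type, and the import alias are assumptions for illustration; in the PR the helper is a DRPCInstance method named fetchVRGsFromMClusterOrS3, and the Ramen API field names used here are from my reading of the API and may differ):

```go
package util

import (
	rmn "github.com/ramendr/ramen/api/v1alpha1"
)

// VRGGetter fetches the VRG for one managed cluster; one implementation would be
// backed by a ManagedClusterView (MCV), the other by the S3 metadata store.
type VRGGetter func(cluster string) (*rmn.VolumeReplicationGroup, error)

// fetchVRGs builds a per-cluster map of VRGs, preferring the MCV copy and falling
// back to the S3 copy only when no primary VRG could be read via MCV (for example,
// when the primary cluster is unreachable after hub recovery).
func fetchVRGs(clusters []string, fromMCV, fromS3 VRGGetter) map[string]*rmn.VolumeReplicationGroup {
	vrgs := map[string]*rmn.VolumeReplicationGroup{}

	for _, cluster := range clusters {
		if vrg, err := fromMCV(cluster); err == nil {
			vrgs[cluster] = vrg
		}
	}

	if hasPrimary(vrgs) {
		return vrgs
	}

	// No primary found via MCV: fill the gaps from the S3 store so DRPC still has a
	// near-accurate view of the primary VRG and its RDSpecs.
	for _, cluster := range clusters {
		if _, ok := vrgs[cluster]; ok {
			continue
		}

		if vrg, err := fromS3(cluster); err == nil {
			vrgs[cluster] = vrg
		}
	}

	return vrgs
}

func hasPrimary(vrgs map[string]*rmn.VolumeReplicationGroup) bool {
	for _, vrg := range vrgs {
		if vrg.Spec.ReplicationState == rmn.Primary {
			return true
		}
	}

	return false
}
```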
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from 60374a7 to 00d3f10 on June 6, 2024 at 23:26
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-ramen-validation-issue-post-hub-recovery branch from b7e0305 to 2cc42db on June 8, 2024 at 13:21
@BenamarMk
Member Author

@ShyamsundarR unit tests and e2e are complete. Ready for review and merge by Monday, if possible.

Member

@ShyamsundarR ShyamsundarR left a comment

Posting some initial comments, still looking at other callers to see where we should potentially use the s3 version of VRG.

@@ -912,7 +970,7 @@ func (d *DRPCInstance) ensureCleanupAndVolSyncReplicationSetup(srcCluster string

// Check if the reset has already been applied. ResetVolSyncRDOnPrimary resets the VRG
// in the MW, but the VRGs in the vrgs slice are fetched using MCV.
- vrg, ok := d.vrgs[srcCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[srcCluster]
Member

This is a check to ensure the reset has reached the managed cluster, so it should use the MCV VRG rather than the S3 VRG (if one is found)?

@@ -1531,7 +1591,7 @@ func (d *DRPCInstance) ensureVRGManifestWork(homeCluster string) error {
d.log.Info("Ensure VRG ManifestWork",
"Last State:", d.getLastDRState(), "cluster", homeCluster)

- cachedVrg := d.vrgs[homeCluster]
+ cachedVrg := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

I think we should add safety here to fetch the MW, and only create if one does not exist. Otherwise createVRGManifestWork would just create/update the VRG MW without any rdSpec etc.

@@ -984,7 +1042,7 @@ func (d *DRPCInstance) prepareForFinalSync(homeCluster string) (bool, error) {

const done = true

- vrg, ok := d.vrgs[homeCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Should we trust the VRG from S3 to tell us vrg.Status.PrepareForFinalSyncComplete, or should we wait for the MCV to report the status?

Member

@ShyamsundarR ShyamsundarR left a comment

@BenamarMk let's discuss the changes. Overall we should rely on the S3 version of VRG when MCV version is not found, but I am not sure we can make this the default in all/most cases.

@@ -1042,7 +1100,8 @@ func (d *DRPCInstance) runFinalSync(homeCluster string) (bool, error) {
func (d *DRPCInstance) areMultipleVRGsPrimary() bool {
numOfPrimaries := 0

- for _, vrg := range d.vrgs {
+ vrgs := d.fetchVRGsFromMClusterOrS3()
Member

This path is from Relocate, where both clusters should be active. So checking for the VRG using MCV would be sufficient.

@@ -1014,7 +1072,7 @@ func (d *DRPCInstance) runFinalSync(homeCluster string) (bool, error) {

const done = true

- vrg, ok := d.vrgs[homeCluster]
+ vrg, ok := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Same as prepareForFinalSync?

@@ -1067,7 +1126,8 @@ func (d *DRPCInstance) selectCurrentPrimaryAndSecondaries() (string, []string) {

primaryVRG := ""

- for cn, vrg := range d.vrgs {
+ vrgs := d.fetchVRGsFromMClusterOrS3()
Member

Same as areMultipleVRGsPrimary

@@ -1129,7 +1189,7 @@ func (d *DRPCInstance) readyToSwitchOver(homeCluster string, preferredCluster st
}

func (d *DRPCInstance) checkReadinessAfterFailover(homeCluster string) bool {
- vrg := d.vrgs[homeCluster]
+ vrg := d.fetchVRGsFromMClusterOrS3()[homeCluster]
Member

Readiness check should rely on MCV.

@BenamarMk
Member Author

BenamarMk commented Jun 13, 2024

@BenamarMk let's discuss the changes. Overall we should rely on the S3 version of VRG when MCV version is not found, but I am not sure we can make this the default in all/most cases.

One thing to note: we rely on S3 only when there is no primary.
