Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce replaydb feature #398

Merged
merged 10 commits into from
Oct 11, 2024
Merged

Conversation

mardim91
Copy link
Contributor

In order to move the evpn-gw-bridge into a more resilient path we have implemented the replaydb feature.

When a module (e.g. FRR module) confronts a permanent error then stops retrying and the replay procedure kicks in. The replay procedure will clean up all the system configuration (e.g. FRR daemon) that is related to the failed module and will replay the relevant objects from the database in order to bring the system configuration back on track again.

The replay feature currently is only supported for the FRR module and the FRR daemon. Support for all the modules will be added in the future

Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
@mardim91 mardim91 requested a review from a team as a code owner September 11, 2024 12:23
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
pkg/frr/frr.go Outdated Show resolved Hide resolved
pkg/frr/frr.go Outdated Show resolved Hide resolved
pkg/frr/frr.go Outdated
@@ -218,6 +322,12 @@ func handlevrf(objectData *eventbus.ObjectData) {
comp.CompStatus = common.ComponentStatusError
}
log.Printf("%+v\n", comp)

// Checking the timer to decide if we need to replay or not
if comp.Timer > replayThreshold {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we perform this check for each device type, can we have only one check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly you mean by saying "only one check" ? Like creating a common function and calling it from both device types related functions ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a look on that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do many impovements here and we can do them in the experiment branch that I have created. The reason for this is:

  1. This code snippet that you have posted above is common in all the modules and not just FRR. So changes in that code will touch all the components and exceeds the scope of this PR so let's do it better on the experiment branch to do it properly
  2. The replayThreshold variable is decided by each module itself and that is why we cannot pass it to the comp structure that easily. If we want to pass it in that Comp structure then we need to find a clever way to do so. That means that it needs a bit more thinking and changes that will affect other part of the code that are outside of this scope. So let's do it to the experiment branch

The rest of your comments I have fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code snippet that you have posted above is common in all the modules and not just FRR. So changes in that code will touch all the components and exceeds the scope of this PR so let's do it better on the experiment branch to do it properly

You just added it in this PR, why do you need to rework the rest?

The replayThreshold variable is decided by each module itself and that is why we cannot pass it to the comp structure that easily.

It is the corresponding handle*() call which creates com, and can pass the threshold at construction?

I just do not want we continue adding debt. If it does not require to rework other components, so I'd prefer it is fixed in place
This change looks pretty straighforward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just added it in this PR, why do you need to rework the rest?

I didnt added this code snippet to this PR that was allready there.

			comp.CompStatus = common.ComponentStatusSuccess
			comp.Timer = 0
		} else {
			if comp.Timer == 0 {
				comp.Timer = 2 * time.Second
			} else {
				comp.Timer *= 2
			}
			comp.CompStatus = common.ComponentStatusError
		}


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about this comp here ?

var comp common.Component

This is not the same comp object which we use in order to call the CheckReplayThreshold function

The object that we use to call this function is created here:

comp = svi.Status.Components[i]

an is a different object so even if we add the replay threshold to the object in the line 163 will not be there in the line 199 because we call a function with a different object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I am saying that we need to find a better smarter way to handle this in the refactoring. Also the modules in general have duplicate code that can be written better so let's handle this also in the refactoring. WDYT ?

pkg/frr/frr.go Outdated Show resolved Hide resolved
pkg/frr/frr.go Show resolved Hide resolved
pkg/infradb/utils.go Outdated Show resolved Hide resolved
pkg/infradb/utils.go Outdated Show resolved Hide resolved
pkg/infradb/utils.go Outdated Show resolved Hide resolved
pkg/infradb/utils.go Outdated Show resolved Hide resolved
pkg/utils/helpers.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@mardim91 mardim91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addresed

Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
@mardim91
Copy link
Contributor Author

mardim91 commented Oct 2, 2024

@artek-koltun I have addressed your comments. Can you please review again ?

Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Signed-off-by: Dimitrios Markou <dimitrios.markou@ericsson.com>
Copy link
Contributor

@artek-koltun artek-koltun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debt is growing

@artek-koltun artek-koltun merged commit 37653fb into opiproject:main Oct 11, 2024
15 checks passed
@mardim91
Copy link
Contributor Author

#405

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants