Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unblock misconfigured subnets #2679

Merged
merged 6 commits into from
Jan 30, 2024
Merged

Conversation

StephenButtolph
Copy link
Contributor

Why this should be merged

Resolves #2594. While this is an unsafe/undesirable configuration. This error has occurred enough that we should probably make it at least recoverable. Especially for users that have no control over the validator set (which can occur on permissioned subnets)

How this works

If there a blocks processing, we now trigger repolls during the periodic Gossip call. This gets the node out of the (generally unexpected) state where there are blocks processing but there are no outstanding polls.

How this was tested

  • Locally on an intentionally misconfigured subnet
  • Added regression unit test
  • CI

@StephenButtolph StephenButtolph added the enhancement New feature or request label Jan 30, 2024
@StephenButtolph StephenButtolph added this to the v1.10.20 milestone Jan 30, 2024
@StephenButtolph StephenButtolph self-assigned this Jan 30, 2024
@@ -171,7 +171,7 @@ func (t *Transitive) Gossip(ctx context.Context) error {
// nodes with a large amount of stake weight.
vdrID, ok := t.ConnectedValidators.SampleValidator()
if !ok {
t.Ctx.Log.Error("skipping block gossip",
t.Ctx.Log.Warn("skipping block gossip",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are no longer Errors as the node can recover.

@@ -932,7 +937,7 @@ func (t *Transitive) sendQuery(

vdrIDs, err := t.Validators.Sample(t.Ctx.SubnetID, t.Params.K)
if err != nil {
t.Ctx.Log.Error("dropped query for block",
t.Ctx.Log.Warn("dropped query for block",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are no longer Errors as the node can recover.

@@ -2921,3 +2921,108 @@ func TestEngineApplyAcceptedFrontierInQueryFailed(t *testing.T) {

require.Equal(choices.Accepted, blk.Status())
}

func TestEngineRepollsMisconfiguredSubnet(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) I was able to grok this test, but have some suggestions for increasing maintainability:

  • separate out setup (either to helper function(s) or delineated with e.g. comments) from test
  • use comments to highlight important parts of the test that might not be immediately obvious to the uninitiated e.g.
    • create subnet that is misconfigured (missing validator?)
    • check that a tx won't be committed
    • fix misconfiguration
    • check that new tx can be commited (what about previous tx?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments. I'm hoping to do a large refactor of both this code and the tests... When we do that hopefully all of this will be able to be cleaned up.

@StephenButtolph StephenButtolph merged commit db9e96a into master Jan 30, 2024
1 check passed
@StephenButtolph StephenButtolph deleted the unblock-misconfigured-subnets branch January 30, 2024 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Stuck subnet node unable to recover after validator re-stake
3 participants