ShouldSendMessage decider #344

Conversation
Generally makes sense to me. I left a few comments.
main/main.go
Outdated

@@ -216,6 +225,33 @@ func main() {

// Create listeners for each of the subnets configured as a source
Looks like this comment is duplicated on line 254. Was that intended?
I moved it, but I think the original crept back in as a merge conflict. Thanks for pointing it out. Addressed in 5c6422e
main/main.go
Outdated

		panic(err)
	}
	runtime.SetFinalizer(grpcClient, func(c *grpc.ClientConn) { c.Close() })
	grpcClient.WaitForStateChange(ctx, connectivity.Ready)
I believe WaitForStateChange is currently marked as experimental:
https://pkg.go.dev/google.golang.org/grpc#ClientConn.WaitForStateChange
Not sure if this is an issue, just wanted to point it out.
thanks for pointing that out, I removed it in 584556e#diff-327181d0a8c5e6b164561d7910f4eeffd41442d55b2a2788fda2aa2692f17ec0
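For reference, a stable (non-experimental) way to get similar wait-until-ready behavior is a blocking dial with a bounded timeout. A minimal sketch, assuming an insecure local connection; the helper name and five-second timeout are illustrative, not code from this PR:

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialDecider(target string) (*grpc.ClientConn, error) {
	// Bound how long we wait for the connection to become ready.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// grpc.WithBlock makes DialContext wait until the connection is
	// ready (or fail when the context expires) instead of returning
	// immediately like the default non-blocking dial.
	return grpc.DialContext(ctx, target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
}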
@@ -167,7 +183,47 @@ func (m *messageHandler) ShouldSendMessage(destinationClient vms.DestinationClie
		return false, nil
	}

-	return true, nil
+	var decision bool = true
Should this be initialized to false? It looks like the error cases are just returning decision without modifying it, and I assume we should return false in those cases.
The intended control flow here is to separate the existing ShouldSendMessage decision logic from the Decider logic that is dispatched to the service. If the Decider fails, then we should return the prior decision from the existing logic.
I think this could be made clearer by adding a new helper, and calling it like so:

func (m *messageHandler) ShouldSendMessage(...) (bool, error) {
	// Existing decision logic ...

	decision, err := m.dispatchToDecider()
	if err != nil {
		// Log the error, but do not return it; fall back to the
		// prior decision from the existing logic above.
		log.Warn("Decider returned error")
		return true, nil
	}
	return decision, nil
}
I extracted the logic into a new helper, and adapted my changes to ShouldSendMessage to fit the existing flow logic (if __ return false; if ___ return false; ...; return true), in 584556e
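For context, the extracted helper might look something like the following sketch. It assumes the generated DeciderServiceClient exposes a ShouldSendMessage RPC; the request type and its fields are hypothetical, not necessarily what landed in 584556e:

func (m *messageHandler) dispatchToDecider(ctx context.Context) (bool, error) {
	if m.deciderClient == nil {
		// No decider configured; keep the default decision.
		return true, nil
	}
	// Hypothetical RPC and request shape.
	resp, err := m.deciderClient.ShouldSendMessage(ctx, &pbDecider.ShouldSendMessageRequest{})
	if err != nil {
		// The caller logs this and falls back to its own decision.
		return true, err
	}
	return resp.ShouldSendMessage, nil
}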
tests/e2e_test.go
Outdated

var ctx context.Context
ctx, cancelDecider = context.WithCancel(context.Background())
// we'll call cancelDecider in AfterSuite, but also call it if this
// process is killed, because AfterSuite won't always run then:
signalChan := make(chan os.Signal, 2)
signal.Notify(signalChan, os.Interrupt, syscall.SIGTERM)
go func() {
	<-signalChan
	cancelDecider()
}()
decider = exec.CommandContext(ctx, "./tests/cmd/decider/decider")
decider.Start()
log.Info("Started decider service")
I don't think this is necessary. If ginkgo exits for whatever reason, everything that it was running should be cleaned up; we don't currently look for OS signals for the other processes we run.
What we do need is something that watches the decider process, and if it exits abnormally, we should panic. Example here: https://github.com/ava-labs/awm-relayer/blob/main/tests/utils/utils.go#L91
Are you sure it's not necessary? A little while ago I tried interrupting a test with Ctrl+C, and from the stdout it didn't look like it ran the AfterSuite hook, and the decider process was still running. And just now I tried again and saw that it did execute that hook, and the decider process was not running. So I think the behavior is inconsistent.
If the decider is left running/orphaned/defunct after the tests have finished, then it will fail to start in the next test run, because the port it uses is always the same, and that port will be in use if it's already running from a previous test run. And if you're working on some changes totally unrelated to the decider, it could be pretty confusing and unclear what's going on.
This uncertainty and fear of confusion is why I decided to put this code into place. I'm happy to defer to the team's opinion though.
Watching the decider for abnormal exit is a great idea, thanks; I'll do that.
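A sketch of that watcher, assuming `decider` is the *exec.Cmd started above; the real helper in the linked utils.go may differ:

go func() {
	err := decider.Wait()
	// cancelDecider() tears the process down via the context, which
	// surfaces here as an error while ctx is canceled; any other
	// error means the decider died underneath the running suite.
	if err != nil && ctx.Err() == nil {
		panic(fmt.Errorf("decider exited abnormally: %w", err))
	}
}()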
Yeah, I think we should move to running our e2e tests in a docker container (if feasible), that way we get the cleanup for free, and wouldn't need to check for OS signals, but until then, this is helpful.
- name: Install buf
  uses: bufbuild/buf-setup-action@v1.31.0
  with:
    github_token: ${{ github.token }}
Is the github token necessary? Does the repo even have access to a gh token?
Apparently not. I removed this in ed3f4e7 and everything still ran fine. I had just copied this from avalanchego. Thanks for noticing it.
git update-index --really-refresh >> /dev/null
git diff-index HEAD # to show the differences
git diff-index --quiet HEAD || (echo 'protobuf generated code changes have not all been checked in' && exit 1)
What's the difference between this method of checking that generated files are checked in versus the one we use in teleporter? https://github.com/ava-labs/teleporter/blob/main/.github/workflows/abi_bindings_checker.yml#L41
not sure to be honest. I pulled this method from https://github.com/ava-labs/avalanchego/blob/49299868db0bdf5b3755c58993cc728173b3eedd/.github/workflows/check-clean-branch.sh#L8-L9
@@ -101,7 +101,7 @@ awm-relayer --help Display awm-relayer usag

### Building

-Before building, be sure to install Go, which is required even if you're just building the Docker image.
+Before building, be sure to install Go, which is required even if you're just building the Docker image. You'll also need to install [buf](github.com/bufbuild/buf/).
Is buf required even if the Decider is unused (i.e. the relevant config options are omitted)?
In order to build, yes: the relayer binary is built to be aware of the protobuf interfaces, so that the same binary can be used later if you decide you do want to enable those config options.
Let's add descriptions of the new config options.
config/config.go
Outdated

DeciderHost string  `mapstructure:"decider-host" json:"decider-host"`
DeciderPort *uint16 `mapstructure:"decider-port" json:"decider-port"`
Can you add some validation on these new fields? If they're provided, they should be able to construct a valid URI.
addressed in a9c374c
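A rough sketch of what that validation could look like; the function name and error messages are hypothetical, and the actual check in a9c374c may differ:

import (
	"fmt"
	"net/url"
)

func validateDeciderConfig(host string, port *uint16) error {
	if host == "" && port == nil {
		return nil // decider not configured; nothing to validate
	}
	if host == "" || port == nil {
		return fmt.Errorf("decider-host and decider-port must be provided together")
	}
	// If both are provided, they must combine into a parseable URI.
	uri := fmt.Sprintf("grpc://%s:%d", host, *port)
	if _, err := url.ParseRequestURI(uri); err != nil {
		return fmt.Errorf("invalid decider address %q: %w", uri, err)
	}
	return nil
}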
var deciderClient pbDecider.DeciderServiceClient
if grpcClient == nil {
	deciderClient = nil
} else {
	deciderClient = pbDecider.NewDeciderServiceClient(grpcClient)
}
Can we move this to main so that grpcClient can be limited in scope to main? When we get to #365 we'll have to handle dialing different grpcClients per message type, so some refactoring will be necessary anyway.
addressed in b01b14b
@@ -167,7 +183,47 @@ func (m *messageHandler) ShouldSendMessage(destinationClient vms.DestinationClie
	return decision, err
}

// TODO: add a timeout to the context
Let's include timeouts in this PR
addressed in f89bd35
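For reference, the shape such a timeout typically takes; the five-second duration and surrounding names are illustrative, not necessarily what f89bd35 uses:

// Bound the decider RPC so a hung service can't stall message handling.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
decision, err := m.dispatchToDecider(ctx)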
proto/decider/v1/decider.proto
Outdated

@@ -0,0 +1,21 @@
syntax = "proto3";

package decider.v1;
Is the intent behind v1 to map awm-relayer versions to Decider versions? If so, I'd prefer if we followed the pattern used by Avalanchego / subnet-evm, where protocol version compatibility is provided out-of-band. Otherwise, the codebase would continue to grow with each new version.
The intent was just to follow the standard buf+proto conventions, at least as our starting point here. I stripped all of the custom lint rules out of the buf config file when copying it from avalanchego, and one of those rules was about needing to include the version in the service name.
In 3631b7b I restored that linter rule and changed things to not include the version in the service/package names.
config/config.go
Outdated

DeciderHost string  `mapstructure:"decider-host" json:"decider-host"`
DeciderPort *uint16 `mapstructure:"decider-port" json:"decider-port"`
Just calling this out because I wasn't aware how it worked exactly myself:
AvalancheGo communicates with VM plugins via grpc, but instead of assuming the VMs are running and providing their host/port in the configuration, it starts the configured VMs itself on the first port available on 127.0.0.1.
That type of pattern may be less error prone for relayers to set up. i.e.: the relayer config would specify the decider plugin(s) (if any) to run for a given source chain, and the application would handle initializing/configuring them itself.
Not suggesting we do this now, but could be worth considering going forward.
I think that what Michael described is the generally preferred pattern for colocated services, and I agree that less config is better and less error prone.
A counterexample for why we would want to keep it is that this allows for using remote deciders.
Cam and I also discussed this. For this iteration I was going for the simplest thing possible, and sub-process management proved non-trivial, so I went with this host/port approach instead, since we said that ultimately we want to be able to support both local and remote deciders.
Going forward, I'm open to doing the sub-process/plugin approach; we may be able to leverage the sub-process modules available in avalanchego, but if not then the complexity added to the code base for this would be a lot, at least as compared to the host/port approach.
Also, at the moment I'm leaning towards just the host/port approach so that there's one single config interface, rather than one for a local decider (the plugin path) and a separate/different one for a remote decider (the host/port).
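For future reference, a rough sketch of the plugin-style launch pattern Michael described: bind port 0 so the OS picks a free port, hand that address to a decider subprocess, then dial it. The binary path and flag here are hypothetical, and real implementations usually have the child report its chosen port back (e.g. over stdout) instead of racing to rebind a freed one:

lis, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
	panic(err)
}
addr := lis.Addr().String()
lis.Close() // free the port so the child can bind it (small race window)

decider := exec.Command("./decider-plugin", "--listen-addr", addr)
if err := decider.Start(); err != nil {
	panic(err)
}
// ... then dial addr with grpc and construct the DeciderServiceClient ...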
addresses review comment #344 (comment)
LGTM :)
LGTM
addresses review comment ava-labs/icm-services#344 (comment)
follow the avalanchego convention of excluding the version number from the name of the service. addresses review comment ava-labs/icm-services#344 (comment)
addresses review comment ava-labs/icm-services#344 (comment) Signed-off-by: F. Eugene Aumson <feuGeneA@users.noreply.github.com>
addresses review comment ava-labs/icm-services#344 (comment)
addresses review comments ava-labs/icm-services#344 (comment) and ava-labs/icm-services#344 (comment) and ava-labs/icm-services#344 (comment)
addresses review comment ava-labs/icm-services#344 (comment)
addresses review comments ava-labs/icm-services#344 (comment) and ava-labs/icm-services#344 (comment)
addresses review comment ava-labs/icm-services#344 (comment)
Why this should be merged
For #363.
How this works
The relayer is informed of a decider service via config values that point to its host and port; it communicates with that service process via grpc to make the send decision.
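A minimal sketch of that wiring, using the NewDeciderServiceClient constructor from the diff above; the request/response shapes are hypothetical:

target := fmt.Sprintf("%s:%d", cfg.DeciderHost, *cfg.DeciderPort)
grpcClient, err := grpc.DialContext(ctx, target,
	grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
	return err
}
defer grpcClient.Close()

deciderClient := pbDecider.NewDeciderServiceClient(grpcClient)
// Hypothetical request shape; the returned decision feeds the
// existing ShouldSendMessage flow.
resp, err := deciderClient.ShouldSendMessage(ctx, &pbDecider.ShouldSendMessageRequest{})
if err != nil {
	return err
}
decision := resp.ShouldSendMessage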
How this was tested
A simple decider implementation is included and incorporated into the e2e tests.