[DNM] agent: Properly stop the gRPC server #448
Conversation
agent.go
Outdated
case <-done:
	return
case <-time.After(timeout):
	fieldLogger.Warnf("Could not gracefully stop the server after %v", timeout)
fieldLogger.WithField("timeout", timeout)....
done
done := make(chan struct{})
go func() {
	s.gracefulStopGRPC()
	close(done)
uhmmm this smells like a race condition: `gracefulStopGRPC` sets `s.server` to nil and `stopGRPC` can use it
Yes, let me do that a little bit better.
done
Oh that’s what that smell was....
This commit attempts to cleanly close the gRPC server so that tracing will be ended properly.

Fixes kata-containers#445

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
/test

CI failures about the license check should be fixed by kata-containers/tests#1101

/test

The semantics around the agent are that it should be a passive component, hence it should not implicitly shut down the VM. Instead, we expect the kata-runtime to be responsible for this, using the appropriate VM interface to stop it.

Fixes kata-containers#449
Depends-on: github.com/kata-containers/tests#1101

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>

/test
Ok, so I did a bit of testing and unfortunately, this PR breaks Kata Containers... The problem is that by closing the gRPC server, we end up closing the connection with the proxy, since the serial connection is managed through Yamux, which is managed through gRPC... The server is properly stopped after all gRPC connections have returned, but this means we have a race where the proxy connection is being closed while the answer is being returned to the runtime. That being said, the initial purpose of this PR was to be able to stop the tracing span, and I think it can be done by using the same Go channel mechanism to stop the span from the main agent goroutine, instead of from the …
Hi @sboeuf - thanks for looking into this.
Alas, that won't work. It could wait for the end of any spans it creates, but what about its own span? One of the main reasons for adding tracing to the agent is to be able to trace the gRPC calls. Hence, … The tracing support assumes vsock, which implies no proxy. So we could potentially do the full agent shutdown in the non-proxy case. It isn't ideal having two code paths. However, the proxy has to be considered "legacy" at some point, and there were always going to be conditions on allowing agent tracing. Shutting down the agent cleanly is the right thing to do imho: it makes perfect sense, whereas the current design is much less clear.
Testing with my tracing code shows there are a few problems here:
True, and that's why we should make sure the agent gets the chance to spit out its logs before it exits.
I don't like this as it would make the agent actively responsible for the shutdown of the VM, which should be done by the runtime IMO. And as you mentioned, this leads to race conditions. I think that we should simply consider … Now, if you don't like the fact that we might miss a few logs from …, WDYT about this proposal?
@sboeuf - yes, funnily enough I used to have just such a … I'll take another look at this design Monday with a clearer head and more coffee on hand...! ;)
No, but you don't want a span in this call, that's the point. Or if you want one, just start it to note that we reached this function, but then stop it right before we signal the main agent thread. One more thing: if we don't want to over-complicate this, let's leave the fact that we don't stop the agent out of the picture here.
Just to mention, this is a test/enhancement to handle proper agent termination, but it is not ready yet and is being investigated by @jodh-intel and myself.
Incorporated into #415.

@sboeuf @jodh-intel - as this is now in #415 - should we close this PR?

@grahamwhaley I think so! But I'll leave that to @jodh-intel's decision.

Thanks @sboeuf. wfm - closing...