This repository has been archived by the owner on May 12, 2021. It is now read-only.

[DNM] agent: Properly stop the gRPC server #448

Closed
sboeuf wants to merge 2 commits

Conversation

@sboeuf commented Jan 29, 2019

This commit attempts to cleanly close the gRPC server so that tracing
will be ended properly.

Fixes #445

Signed-off-by: Sebastien Boeuf sebastien.boeuf@intel.com

@sboeuf mentioned this pull request Jan 29, 2019
agent.go Outdated
case <-done:
    return
case <-time.After(timeout):
    fieldLogger.Warnf("Could not gracefully stop the server after %v", timeout)
Reviewer: fieldLogger.WithField("timeout", timeout)....

Author: done

done := make(chan struct{})
go func() {
    s.gracefulStopGRPC()
    close(done)

Reviewer: uhmmm this smells like a race condition

gracefulStopGRPC sets s.server to nil while stopGRPC can still use it

Author: Yes, let me do that a little bit better

Author: done

Member: Oh that’s what that smell was....
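
For context, the pattern under discussion looks roughly like the sketch below. This is a minimal illustration, not the PR's actual code: the agentServer type, its field names and the mutex are assumptions added to show one way of avoiding the s.server race flagged above, while grpc.Server's GracefulStop() and Stop() are the real methods involved.

package agent

import (
    "sync"
    "time"

    "github.com/sirupsen/logrus"
    "google.golang.org/grpc"
)

// agentServer is a stand-in for the agent's server object; the real agent
// stores its gRPC server differently.
type agentServer struct {
    sync.Mutex
    grpcServer *grpc.Server
}

// gracefulStopGRPC waits for in-flight RPCs to complete before stopping.
func (s *agentServer) gracefulStopGRPC() {
    s.Lock()
    srv := s.grpcServer
    s.Unlock()
    if srv != nil {
        srv.GracefulStop()
    }
}

// stopGRPC forces the server down; the server pointer is only touched under
// the lock, which addresses the race pointed out above.
func (s *agentServer) stopGRPC() {
    s.Lock()
    defer s.Unlock()
    if s.grpcServer != nil {
        s.grpcServer.Stop()
        s.grpcServer = nil
    }
}

// stopServer is the select/timeout pattern from the diff: try a graceful
// stop first and fall back to a hard stop once the timeout expires.
func (s *agentServer) stopServer(timeout time.Duration) {
    done := make(chan struct{})
    go func() {
        s.gracefulStopGRPC()
        close(done)
    }()

    select {
    case <-done:
        // All in-flight gRPC calls returned in time.
    case <-time.After(timeout):
        logrus.WithField("timeout", timeout).Warn("Could not gracefully stop the server")
        s.stopGRPC()
    }
}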

This commit attempts to close cleanly the gRPC server so that tracing
will be ended properly.

Fixes kata-containers#445

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
@sboeuf (Author) commented Jan 29, 2019

/test

@sboeuf (Author) commented Jan 30, 2019

The CI failures about the license check should be fixed by kata-containers/tests#1101

@sboeuf (Author) commented Jan 30, 2019

/test

The semantics of the agent are that it should be a passive
component; hence, it should not implicitly shut down the VM.

Instead, we expect the kata-runtime to be responsible for this,
using the appropriate VM interface to stop it.

Fixes kata-containers#449

Depends-on: github.com/kata-containers/tests#1101

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
@sboeuf (Author) commented Jan 30, 2019

/test

@sboeuf (Author) commented Jan 30, 2019

@jodh-intel

Ok so I did a bit of testing and unfortunately this PR breaks Kata Containers... The problem is that by closing the gRPC server, we end up closing the connection with the proxy, since the serial connection is handled through Yamux, which carries the gRPC traffic... The server is only stopped once all in-flight gRPC calls have returned, but this means we have a race where the proxy connection is being closed while the reply is still being returned to the runtime.
So I think we should keep the current behavior of not trying to stop the server, and instead let the agent run.

That being said, the initial purpose of this PR was to be able to stop the tracing span, and I think that can be done with the same Go channel mechanism, stopping the span from the main agent goroutine instead of from the DestroySandbox() one.
You need to make sure DestroySandbox() waits for the end of the span before it returns; otherwise the VM might be killed while the whole span is still being sent out of the VM through vsock.
This means we need to be careful with the use of defer statements: the first defer registered (which runs last, since defers are LIFO) should be the wait for the end of the span, as sketched below.
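
A rough sketch of that handshake, under stated assumptions: the sandbox struct, the channel names and the waitForTracing helper are hypothetical, not the agent's real fields, and opentracing is used only to show where the span fits.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// Hypothetical wiring; the real agent's types and names differ.
type sandbox struct {
    stopTracing chan struct{} // DestroySandbox -> main goroutine: "finish the spans"
    tracingDone chan struct{} // main goroutine -> DestroySandbox: "spans flushed"
}

func (s *sandbox) DestroySandbox(ctx context.Context) error {
    // Registered first, so it runs last (defers run LIFO): by the time the
    // gRPC handler actually returns, the main goroutine has confirmed that
    // the spans were finished and sent out of the VM.
    defer func() {
        s.stopTracing <- struct{}{}
        <-s.tracingDone
    }()

    span, _ := opentracing.StartSpanFromContext(ctx, "DestroySandbox")
    // Runs before the handshake above, so this call's own span is finished
    // at the cost of missing whatever happens after that point.
    defer span.Finish()

    // ... sandbox teardown ...
    return nil
}

// Run from the agent's main goroutine: wait for the signal, flush the tracer
// (closeTracer is a placeholder), then unblock DestroySandbox.
func (s *sandbox) waitForTracing(closeTracer func()) {
    <-s.stopTracing
    closeTracer()
    close(s.tracingDone)
}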

@jodh-intel (Contributor):

Hi @sboeuf - thanks for looking into this.

> You need to make sure DestroySandbox() waits for the end of the span before it returns; otherwise the VM might be killed while the whole span is still being sent out of the VM through vsock.

Alas, that won't work. It could wait for the end of any spans it creates, but what about its own span? One of the main reasons for adding tracing to the agent is to be able to trace the gRPC calls. Hence DestroySandbox(), being a gRPC API, actually needs to fully return before its span can be completed. See the problem? :)

The tracing support assumes vsock, which implies no proxy. So we could potentially do the full agent shutdown in the non-proxy case. It isn't ideal having the two code paths. However, the proxy has to be considered "legacy" at some point and there were always going to be conditions on allowing agent tracing.

Shutting down the agent cleanly is the right thing to do imho: it makes perfect sense, whereas the current design is much less clear.
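
To make the ordering concrete: with a typical tracing interceptor (a hand-rolled sketch below, not necessarily how the agent creates its per-call spans), the span wrapping a gRPC call is only finished after the handler returns, so nothing the handler does internally can wait for its own span.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
    "google.golang.org/grpc"
)

// tracingInterceptor wraps every unary gRPC call in a span. span.Finish()
// only runs once handler() has returned, which is why DestroySandbox()
// cannot flush its own span from inside its own body.
func tracingInterceptor(tracer opentracing.Tracer) grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
        handler grpc.UnaryHandler) (interface{}, error) {
        span := tracer.StartSpan(info.FullMethod)
        defer span.Finish()

        ctx = opentracing.ContextWithSpan(ctx, span)
        return handler(ctx, req)
    }
}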

@jodh-intel (Contributor):

Testing with my tracing code shows there are a few problems here:

  • The runtime kills the agent too early (before the agent has shut down and finalised the trace spans).

    That can be resolved by allowing the VM to shut itself down and removing the QMP quit from the runtime.

  • Once that's done, the behaviour is racy:

    • if the workload runs "quickly" (busybox true), the trace span generally completes (since the agent shuts down correctly after a graceful gRPC server stop).
    • if the workload takes a little longer, the agent hits the timeout, then randomly either the forced stop works, or it gets stuck waiting on the gRPC-internal WaitGroup issue I've seen.

@sboeuf (Author) commented Feb 1, 2019

@jodh-intel

> The runtime kills the agent too early (before the agent has shut down and finalised the trace spans).

True, and that's why we should make sure the agent gets the chance to flush the logs before it ends.

> That can be resolved by allowing the VM to shut itself down and removing the QMP quit from the runtime.

I don't like this as it would make the agent actively responsible for the shutdown of the VM, which should be done by the runtime IMO. And as you mentioned, this leads to race conditions.

I think we should simply consider DestroySandbox() as the last call that can be received by the agent. With this in mind, we would do a few things before actually returning from this request.
Concretely, this means we have to send a signal from the DestroySandbox() implementation to the agent's main goroutine to handle tracing span finalization. When it receives this signal, the agent would close all the spans and send them out. Afterwards, it would simply signal the DestroySandbox() goroutine back, using another channel, and that call would return.

Now, if you don't like the fact that we might miss a few logs from DestroySandbox() (because we have to do a span.Finish before sending the first signal), then maybe we need to introduce an extra gRPC call, StopAgent() or StopTracing(), that would be in charge of doing this. That would assure us that the agent actually had the time to send all the traces before it returns.
Actually, the more I think about this, the more StopTracing() seems appropriate, since it could be called from the runtime only when we have vsock.

WDYT about this proposal?

@jodh-intel (Contributor):

@sboeuf - yes, funnily enough I used to have just such a StopAgent() call in a local branch for this reason. I can add it back but it actually just "moves the problem"... because StopAgent() is yet another gRPC call which has an associated trace span.

I'll take another look at this design Monday with a clearer head and more coffee on hand...! ;)

@sboeuf (Author) commented Feb 1, 2019

@jodh-intel

> I can add it back but it actually just "moves the problem"... because StopAgent() is yet another gRPC call which has an associated trace span.

No, but you don't want a span in this call - that's the point. Or if you want one, just start it to note that we reached this function, but then stop it right before we signal the main agent thread (see the sketch below).
We have to be okay with losing a little bit of tracing for the sake of having proper closure here.

One more thing: if we don't want to over-complicate this, let's keep the fact that we don't stop the agent out of the picture here.
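
For completeness, a sketch of what such a StopTracing() handler could look like. This RPC does not exist in the agent's protocol; it reuses the hypothetical sandbox type and channels from the DestroySandbox() sketch above.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// StopTracing is a proposed RPC, not part of the agent's gRPC API.
func (s *sandbox) StopTracing(ctx context.Context) error {
    // Optionally note that we reached this function, but finish the span
    // before signalling, accepting that this call's own trace is cut short.
    span, _ := opentracing.StartSpanFromContext(ctx, "StopTracing")
    span.Finish()

    s.stopTracing <- struct{}{} // ask the main goroutine to flush all spans
    <-s.tracingDone             // only return once everything went out over vsock
    return nil
}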

@sboeuf changed the title from "agent: Properly stop the gRPC server" to "[DNM] agent: Properly stop the gRPC server" on Feb 19, 2019
@sboeuf (Author) commented Feb 19, 2019

Just to mention, this is a test/enhancement to handle proper agent termination, but it is not ready yet and is being investigated by @jodh-intel and myself.

@jodh-intel (Contributor):

Incorporated into #415.

@grahamwhaley (Contributor):

@sboeuf @jodh-intel - as this is now in #415 - should we close this PR?

@sboeuf (Author) commented Mar 4, 2019

@grahamwhaley I think so! But I'll leave that to @jodh-intel's decision.

@jodh-intel (Contributor):

Thanks @sboeuf. wfm - closing...

@jodh-intel closed this Mar 4, 2019