This repository has been archived by the owner on May 12, 2021. It is now read-only.

[DNM] agent: Properly stop the gRPC server #448

Closed
sboeuf wants to merge 2 commits

Conversation

@sboeuf commented Jan 29, 2019

This commit attempts to cleanly close the gRPC server so that tracing
will be ended properly.

Fixes #445

Signed-off-by: Sebastien Boeuf sebastien.boeuf@intel.com

@sboeuf mentioned this pull request Jan 29, 2019
agent.go Outdated
case <-done:
    return
case <-time.After(timeout):
    fieldLogger.Warnf("Could not gracefully stop the server after %v", timeout)
Reviewer: fieldLogger.WithField("timeout", timeout)....

Author: done

done := make(chan struct{})
go func() {
    s.gracefulStopGRPC()
    close(done)

Reviewer: uhmmm this smells like a race condition

gracefulStopGRPC sets s.server to nil while stopGRPC can still use it

Author: Yes, let me do that a little bit better

Author: done

Member: Oh that’s what that smell was....
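
For context, the pattern under discussion looks roughly like the sketch below. This is a minimal illustration, not the PR's actual code: the agentServer type, its field names and the mutex are assumptions added to show one way of avoiding the s.server race flagged above, while grpc.Server's GracefulStop() and Stop() are the real methods involved.

package agent

import (
    "sync"
    "time"

    "github.com/sirupsen/logrus"
    "google.golang.org/grpc"
)

// agentServer is a stand-in for the agent's server object; the real agent
// stores its gRPC server differently.
type agentServer struct {
    sync.Mutex
    grpcServer *grpc.Server
}

// gracefulStopGRPC waits for in-flight RPCs to complete before stopping.
func (s *agentServer) gracefulStopGRPC() {
    s.Lock()
    srv := s.grpcServer
    s.Unlock()
    if srv != nil {
        srv.GracefulStop()
    }
}

// stopGRPC forces the server down; the server pointer is only touched under
// the lock, which addresses the race pointed out above.
func (s *agentServer) stopGRPC() {
    s.Lock()
    defer s.Unlock()
    if s.grpcServer != nil {
        s.grpcServer.Stop()
        s.grpcServer = nil
    }
}

// stopServer is the select/timeout pattern from the diff: try a graceful
// stop first and fall back to a hard stop once the timeout expires.
func (s *agentServer) stopServer(timeout time.Duration) {
    done := make(chan struct{})
    go func() {
        s.gracefulStopGRPC()
        close(done)
    }()

    select {
    case <-done:
        // All in-flight gRPC calls returned in time.
    case <-time.After(timeout):
        logrus.WithField("timeout", timeout).Warn("Could not gracefully stop the server")
        s.stopGRPC()
    }
}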

This commit attempts to close cleanly the gRPC server so that tracing
will be ended properly.

Fixes kata-containers#445

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
@sboeuf (Author) commented Jan 29, 2019

/test

@sboeuf (Author) commented Jan 30, 2019

The CI failures about the license check should be fixed by kata-containers/tests#1101

@sboeuf (Author) commented Jan 30, 2019

/test

The semantics of the agent are that it should be a passive
component; hence, it should not implicitly shut down the VM.

Instead, we expect the kata-runtime to be responsible for this,
using the appropriate VM interface to stop it.

Fixes kata-containers#449

Depends-on: github.com/kata-containers/tests#1101

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
@sboeuf (Author) commented Jan 30, 2019

/test

@sboeuf (Author) commented Jan 30, 2019

@jodh-intel

Ok so I did a bit of testing and unfortunately this PR breaks Kata Containers... The problem is that by closing the gRPC server, we end up closing the connection with the proxy, since the serial connection is handled through Yamux, which carries the gRPC traffic... The server is only stopped once all in-flight gRPC calls have returned, but this means we have a race where the proxy connection is being closed while the reply is still being returned to the runtime.
So I think we should keep the current behavior of not trying to stop the server, and instead let the agent run.

That being said, the initial purpose of this PR was to be able to stop the tracing span, and I think that can be done with the same Go channel mechanism, stopping the span from the main agent goroutine instead of from the DestroySandbox() one.
You need to make sure DestroySandbox() waits for the end of the span before it returns; otherwise the VM might be killed while the whole span is still being sent out of the VM through vsock.
This means we need to be careful with the use of defer statements: the first defer registered (which runs last, since defers are LIFO) should be the wait for the end of the span, as sketched below.
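
A rough sketch of that handshake, under stated assumptions: the sandbox struct, the channel names and the waitForTracing helper are hypothetical, not the agent's real fields, and opentracing is used only to show where the span fits.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// Hypothetical wiring; the real agent's types and names differ.
type sandbox struct {
    stopTracing chan struct{} // DestroySandbox -> main goroutine: "finish the spans"
    tracingDone chan struct{} // main goroutine -> DestroySandbox: "spans flushed"
}

func (s *sandbox) DestroySandbox(ctx context.Context) error {
    // Registered first, so it runs last (defers run LIFO): by the time the
    // gRPC handler actually returns, the main goroutine has confirmed that
    // the spans were finished and sent out of the VM.
    defer func() {
        s.stopTracing <- struct{}{}
        <-s.tracingDone
    }()

    span, _ := opentracing.StartSpanFromContext(ctx, "DestroySandbox")
    // Runs before the handshake above, so this call's own span is finished
    // at the cost of missing whatever happens after that point.
    defer span.Finish()

    // ... sandbox teardown ...
    return nil
}

// Run from the agent's main goroutine: wait for the signal, flush the tracer
// (closeTracer is a placeholder), then unblock DestroySandbox.
func (s *sandbox) waitForTracing(closeTracer func()) {
    <-s.stopTracing
    closeTracer()
    close(s.tracingDone)
}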

@jodh-intel (Contributor):

Hi @sboeuf - thanks for looking into this.

> You need to make sure DestroySandbox() waits for the end of the span before it returns; otherwise the VM might be killed while the whole span is still being sent out of the VM through vsock.

Alas, that won't work. It could wait for the end of any spans it creates, but what about its own span? One of the main reasons for adding tracing to the agent is to be able to trace the gRPC calls. Hence DestroySandbox(), being a gRPC API, actually needs to fully return before its span can be completed. See the problem? :)

The tracing support assumes vsock, which implies no proxy. So we could potentially do the full agent shutdown in the non-proxy case. It isn't ideal having the two code paths. However, the proxy has to be considered "legacy" at some point and there were always going to be conditions on allowing agent tracing.

Shutting down the agent cleanly is the right thing to do imho: it makes perfect sense, whereas the current design is much less clear.
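
To make the ordering concrete: with a typical tracing interceptor (a hand-rolled sketch below, not necessarily how the agent creates its per-call spans), the span wrapping a gRPC call is only finished after the handler returns, so nothing the handler does internally can wait for its own span.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
    "google.golang.org/grpc"
)

// tracingInterceptor wraps every unary gRPC call in a span. span.Finish()
// only runs once handler() has returned, which is why DestroySandbox()
// cannot flush its own span from inside its own body.
func tracingInterceptor(tracer opentracing.Tracer) grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
        handler grpc.UnaryHandler) (interface{}, error) {
        span := tracer.StartSpan(info.FullMethod)
        defer span.Finish()

        ctx = opentracing.ContextWithSpan(ctx, span)
        return handler(ctx, req)
    }
}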

@jodh-intel (Contributor):

Testing with my tracing code shows there are a few problems here:

  • The runtime kills the agent too early (before the agent has shut down and finalised the trace spans).

    That can be resolved by allowing the VM to shut itself down and removing the QMP quit from the runtime.

  • Once that's done, the behaviour is racy:

    • if the workload runs "quickly" (busybox true), the trace span generally completes (since the agent shuts down correctly after a graceful gRPC server stop).
    • if the workload takes a little longer, the agent hits the timeout, then randomly either the forced stop works, or it gets stuck waiting on the gRPC-internal WaitGroup issue I've seen.

@sboeuf (Author) commented Feb 1, 2019

@jodh-intel

> The runtime kills the agent too early (before the agent has shut down and finalised the trace spans).

True, and that's why we should make sure the agent gets the chance to flush the logs before it ends.

> That can be resolved by allowing the VM to shut itself down and removing the QMP quit from the runtime.

I don't like this as it would make the agent actively responsible for the shutdown of the VM, which should be done by the runtime IMO. And as you mentioned, this leads to race conditions.

I think we should simply consider DestroySandbox() as the last call that can be received by the agent. With this in mind, we would do a few things before actually returning from this request.
Concretely, this means we have to send a signal from the DestroySandbox() implementation to the agent's main goroutine to handle tracing span finalization. When it receives this signal, the agent would close all the spans and send them out. Afterwards, it would simply signal the DestroySandbox() goroutine back, using another channel, and that call would return.

Now, if you don't like the fact that we might miss a few logs from DestroySandbox() (because we have to do a span.Finish before sending the first signal), then maybe we need to introduce an extra gRPC call, StopAgent() or StopTracing(), that would be in charge of doing this. That would assure us that the agent actually had the time to send all the traces before it returns.
Actually, the more I think about this, the more StopTracing() seems appropriate, since it could be called from the runtime only when we have vsock.

WDYT about this proposal?

@jodh-intel (Contributor):

@sboeuf - yes, funnily enough I used to have just such a StopAgent() call in a local branch for this reason. I can add it back but it actually just "moves the problem"... because StopAgent() is yet another gRPC call which has an associated trace span.

I'll take another look at this design Monday with a clearer head and more coffee on hand...! ;)

@sboeuf (Author) commented Feb 1, 2019

@jodh-intel

> I can add it back but it actually just "moves the problem"... because StopAgent() is yet another gRPC call which has an associated trace span.

No, but you don't want a span in this call - that's the point. Or if you want one, just start it to note that we reached this function, but then stop it right before we signal the main agent thread (see the sketch below).
We have to be okay with losing a little bit of tracing for the sake of having proper closure here.

One more thing: if we don't want to over-complicate this, let's keep the fact that we don't stop the agent out of the picture here.
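
For completeness, a sketch of what such a StopTracing() handler could look like. This RPC does not exist in the agent's protocol; it reuses the hypothetical sandbox type and channels from the DestroySandbox() sketch above.

package agent

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// StopTracing is a proposed RPC, not part of the agent's gRPC API.
func (s *sandbox) StopTracing(ctx context.Context) error {
    // Optionally note that we reached this function, but finish the span
    // before signalling, accepting that this call's own trace is cut short.
    span, _ := opentracing.StartSpanFromContext(ctx, "StopTracing")
    span.Finish()

    s.stopTracing <- struct{}{} // ask the main goroutine to flush all spans
    <-s.tracingDone             // only return once everything went out over vsock
    return nil
}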

@sboeuf changed the title from "agent: Properly stop the gRPC server" to "[DNM] agent: Properly stop the gRPC server" on Feb 19, 2019
@sboeuf (Author) commented Feb 19, 2019

Just to mention, this is a test/enhancement to handle proper agent termination, but it is not ready yet and is being investigated by @jodh-intel and myself.

@jodh-intel (Contributor):

Incorporated into #415.

@grahamwhaley (Contributor):

@sboeuf @jodh-intel - as this is now in #415 - should we close this PR?

@sboeuf (Author) commented Mar 4, 2019

@grahamwhaley I think so! But I'll leave that to @jodh-intel's decision.

@jodh-intel (Contributor):

Thanks @sboeuf. wfm - closing...

@jodh-intel closed this Mar 4, 2019