Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logmon: recover from shutting down call locally #5616

Merged
merged 7 commits into from
Apr 26, 2019

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Apr 25, 2019

Alternative to #5615 - where we detect logmon plugin shutting down during logmon.Start call and retry it after killing the process.

Here, we try 4 times with 1-second backoff. I opted to use a very simple backoff strategy, and punt on a more robust strategy for logmon failures to 0.9.2.

The condition is very hard to test, as plugin needs to be exiting but not exited yet prior to Start call. Open for suggestions.

Mahmood Ali added 2 commits April 25, 2019 14:32
Retry if we detect shutting down during Start() api call is started,
locally.
}

func (h *logmonHook) prestartOneLoop(ctx context.Context, req *interfaces.TaskPrestartRequest) error {
// attach to a running logmon if state indicates one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted not to change the logic here as the logic is somewhat brittle and I don't want to make it worse. But the idea is that if grpc call fails with shutting down, h.logmonPluginClient would be non-nill and would be marked as Exited, so we'll create a new logmon instance on retries.

@@ -73,5 +73,8 @@ func (p *Plugin) GRPCServer(broker *plugin.GRPCBroker, s *grpc.Server) error {
}

func (p *Plugin) GRPCClient(ctx context.Context, broker *plugin.GRPCBroker, c *grpc.ClientConn) (interface{}, error) {
return &logmonClient{client: proto.NewLogMonClient(c)}, nil
return &logmonClient{
doneCtx: ctx,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not one 100% percent following the plugin ctx and how it's set during the initialization phase - but this follows the pattern in other driver plugins , e.g. https://github.com/hashicorp/nomad/blob/v0.9.0/plugins/drivers/plugin.go#L39-L49 .

Interestingly the logmon plugin client doesn't embed BasePluginClient, so might want to follow up on that with 0.9.2.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this code is very difficult to test, but it'd be nice to get some sort of coverage on it even if it's just small unit tests that rely on mocks.

@notnoop notnoop force-pushed the b-retry-logmon-start-errs-2 branch from c0114c7 to 94c9c57 Compare April 25, 2019 20:31
@notnoop notnoop force-pushed the b-retry-logmon-start-errs-2 branch from 94c9c57 to 978fc65 Compare April 25, 2019 20:41
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phenomenal job on the test! We should probably try synthesizing arbitrary pauses with SIGSTOP more!

// then we kill process while Start call is running
require.NoError(t, proc.Signal(syscall.SIGSTOP))
// sleep for the signal to take effect
time.Sleep(1 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you guarantee that the sleep is sufficient? Wondering if this will be flaky in CI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of a good way to test if a process is stopped - unlike killing. If it's flaky in CI, we can react - but considering that we will overhaul logmon plugin method to be inline with other plugin clients and this code would probably change, I'd avoid overengineering and react to CI problems as they come.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I take that back :). ended up using gopsutil.Process utility to verify process sleep status in 658a734!

h.logger.Warn("logmon shutdown while making request", "error", err)

if tries > 3 {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe log here that its out of retries before returning error?

Copy link
Member

@preetapan preetapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit and one question, otherwise LGTM

@notnoop notnoop merged commit 1497b8e into master Apr 26, 2019
notnoop pushed a commit that referenced this pull request Apr 26, 2019
logmon: recover from shutting down call locally
notnoop pushed a commit that referenced this pull request Apr 26, 2019
@notnoop notnoop deleted the b-retry-logmon-start-errs-2 branch May 1, 2019 00:40
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants