WIP - logmon: recover from Start failures #5615

Closed
wants to merge 1 commit into from
9 changes: 8 additions & 1 deletion client/allocrunner/taskrunner/logmon_hook.go
@@ -111,6 +111,10 @@ func (h *logmonHook) Prestart(ctx context.Context,
 	}

 	// We did not reattach to a plugin and one is still not running.
+	// Note that the Exited check is racy and serves only as a fast check: if the
+	// plugin is shutting down but its process hasn't fully exited, the Start
+	// call fails with an UNAVAILABLE gRPC error, so some Start call failures
+	// need to be retried too.
 	if h.logmonPluginClient == nil || h.logmonPluginClient.Exited() {
 		if err := h.launchLogMon(nil); err != nil {
 			// Retry errors launching logmon as logmon may have crashed on start and
@@ -131,7 +135,10 @@ func (h *logmonHook) Prestart(ctx context.Context,
 	})
 	if err != nil {
 		h.logger.Error("failed to start logmon", "error", err)
-		return err
+		// Treat a logmon start failure as recoverable so it is attempted again,
+		// especially when it is a gRPC error.
+		// TODO: do we know of permanent errors here that aren't worth retrying?
+		return structs.NewRecoverableError(err, true)
 	}

 	rCfg := pstructs.ReattachConfigFromGoPlugin(h.logmonPluginClient.ReattachConfig())