Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watchdog: use abort action as a default if killing is enabled. #13523

Merged
merged 8 commits into from
Oct 19, 2020

Conversation

KBaichoo
Copy link
Contributor

@KBaichoo KBaichoo commented Oct 12, 2020

Signed-off-by: Kevin Baichoo kbaichoo@google.com

Commit Message: Watchdog: use abort action as a default if killing is enabled and we're on a supported platform.
Additional Description:
Risk Level: low
Testing: unit tests
Docs Changes: Included
Release Notes: Included

See PR #13208 for context as the reason it's part of core envoy and not an extension.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13523 was opened by KBaichoo.

see: more, trace.

@antoniovicente antoniovicente self-assigned this Oct 12, 2020
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only repokitteh-read-only bot added the deps Approval required for changes to Envoy's external dependencies label Oct 13, 2020
@repokitteh-read-only
Copy link

CC @envoyproxy/dependency-shepherds: Your approval is needed for changes made to (bazel/.*repos.*\.bzl)|(bazel/dependency_imports\.bzl)|(api/bazel/.*\.bzl)|(.*/requirements\.txt).

🐱

Caused by: #13523 was synchronize by KBaichoo.

see: more, trace.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for generalizing this action and moving it to envoy core. Change looks good, just a few nits.

source/common/thread/terminate_thread.h Show resolved Hide resolved
],
)

envoy_cc_extension(
# TODO(kbaichoo): is there a more appropriate build unit for this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems about right. I see some similar examples outside the extensions directory like udp_default_writer_config.cc which contain the extension name as part of the filename since the extensions don't have their own directories like in your example. you could consider making this action live in the watchdog directory instead and possibly have the config object link in directly to the watchdog library.

Copy link
Contributor Author

@KBaichoo KBaichoo Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved them into the watchdog directory, and linked it directly into the guarddog_lib. Given this, I removed the AbortAction namespace since it originally was for being consistent with directory structure.

source/common/watchdog/abort_action/abort_action.cc Outdated Show resolved Hide resolved

namespace Envoy {
namespace Extensions {
namespace Watchdog {
namespace AbortAction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbortAction namespace doesn't seem necessary. I assume it exists for consistency with the directory structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep it was for consistency with directory structure. Given my response to your comment about removing the abort_action directory, I removed the AbortAction namespace since it originally was for being consistent with directory structure.

Done.

// Successfully signaled to thread to terminate, sleep for wait_duration.
absl::SleepFor(absl::Milliseconds(PROTOBUF_GET_MS_OR_DEFAULT(config_, wait_duration, 0)));
} else {
// Failed to send the signal, abort?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd comment since we effectively panic in the next line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it. The LOG along with the comment on the PANIC seem to supply enough context there.

// Add abort_action if killing is enabled.
envoy::watchdog::abort_action::v3alpha::AbortActionConfig abort_config;
// Wait one second for the aborted thread to abort.
abort_config.mutable_wait_duration()->set_seconds(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rely on the default wait_duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


if (Thread::terminateThread(thread_id)) {
// Successfully signaled to thread to terminate, sleep for wait_duration.
absl::SleepFor(absl::Milliseconds(PROTOBUF_GET_MS_OR_DEFAULT(config_, wait_duration, 0)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default sleep duration be non-zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep it's now the default that wait_duration is (5s).

test/server/configuration_impl_test.cc Outdated Show resolved Hide resolved

// We should have the abort action added to both KILL and MULTIKILL events.
EXPECT_EQ(config.workerWatchdogConfig().actions().size(), 2);
EXPECT_EQ(config.mainThreadWatchdogConfig().actions().size(), 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth checking the contents of these action configs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Abort from the action since the signaled thread hasn't yet crashed the process.
// Panicing in the action gives flexibility since it doesn't depend on
// external code to kill the process if the signal fails.
PANIC(fmt::format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PANICs in guarddog_impl.cc are no longer reachable now that we are adding these actions by default. Should they be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're still reachable in the guarddog unit tests, which doesn't install this action, but perhaps they should. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may make sense to install these default actions in the guarddog itself instead of doing so by modifying config earlier in the process. Doing so would allow us to get more consistent behavior in smaller tests and avoid potentially showing the config modifications when someone accesses the proxy config via the admin handler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a great suggestion -- having this logic captured within just the guarddog simplifies a lot of the design.

// MULTIKILL events if those are enabled.
message AbortActionConfig {
// How long to wait for the thread to respond to the thread kill function
// before killing the process from this action. This is a blocking action.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a default wait interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense. Will go with 5 seconds by default -- this is somewhat arbitrary, but it should be sufficient time for the failure handlers to have finished running and the process exiting.

…er core extensions, set a default wait time of 5 second, minor changes.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
source/server/configuration_impl.cc Outdated Show resolved Hide resolved
auto actions = watchdog.actions();

// Add abort_action if killing is enabled.
envoy::watchdog::v3alpha::AbortActionConfig abort_config;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move just before the 2 PackFrom calls below.

Although, would it work if you don't specify typed config since the proto in question has no fields set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still works as we end up setting the type url which IIUC is necessary for finding the right factory for the typed config.

… guarddog.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Oct 14, 2020
source/server/guarddog_impl.cc Outdated Show resolved Hide resolved
* only works on platforms that support SIGABRT.
*
* Returns the result from the platform specific function (i.e. kill) to terminate
* the thread. If the platform is currently unsupported, this will return false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not return the result of the platform specific function, it returns true if the platform specific function succeeded. See implementation, return is: kill() == 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch I've made it:

Returns true if the platform specific function to terminate the thread succeeded (i.e. kill() == 0). If the platform is currently unsupported, this will return false.

@@ -1,7 +0,0 @@
watchdogs:
main_thread_watchdog:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this config file was never used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it was added in an intermediate commit, when we were splitting watchdog into multiple watchdogs. But it didn't get used in the end, and I missed removing it (until now).

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

@envoyproxy/windows-dev

antoniovicente
antoniovicente previously approved these changes Oct 14, 2020
docs/root/version_history/v1.16.0.rst Show resolved Hide resolved
test/per_file_coverage.sh Outdated Show resolved Hide resolved
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…ult-cross-platform

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13523 (comment) was created by @KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Contributor Author

PTAL @envoyproxy/api-shepherds @envoyproxy/dependency-shepherds

@mattklein123 mattklein123 self-assigned this Oct 19, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this LGTM. @htuch any more concerns on this one?

api/envoy/watchdog/v3alpha/abort_action.proto Show resolved Hide resolved
@repokitteh-read-only repokitteh-read-only bot removed api deps Approval required for changes to Envoy's external dependencies labels Oct 19, 2020
@htuch
Copy link
Member

htuch commented Oct 19, 2020

/lgtm api

@htuch htuch merged commit 415af04 into envoyproxy:master Oct 19, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 21, 2020
* master: (22 commits)
  ci: various improvements (envoyproxy#13660)
  dns: fix defunct fd bug in apple resolver (envoyproxy#13641)
  build: support ppc64le with wasm (envoyproxy#13657)
  [fuzz] Added random load balancer fuzz (envoyproxy#13400)
  dependencies: compute and check release dates via GitHub API. (envoyproxy#13582)
  mac ci: try ignoring update failure (envoyproxy#13658)
  watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103)
  typos: fix a couple 'enovy' mispellings (envoyproxy#13645)
  lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536)
  tap: fix upstream streamed transport socket taps (envoyproxy#13638)
  Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639)
  Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523)
  [fuzz] Fixed divide by zero bug (envoyproxy#13545)
  wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621)
  fix: record recovered local address (envoyproxy#13581)
  docs: fix incorrect compressor filter doc (envoyproxy#13611)
  docs: clean up docs for azp migration (envoyproxy#13558)
  wasm: fix building Wasm example. (envoyproxy#13619)
  test: Refactor flood tests into a separate test file (envoyproxy#13556)
  wasm: re-enable tests with precompiled modules. (envoyproxy#13583)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants