Watchdog: use abort action as a default if killing is enabled. #13523
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Thanks for generalizing this action and moving it to envoy core. Change looks good, just a few nits.
    ],
)

envoy_cc_extension(
    # TODO(kbaichoo): is there a more appropriate build unit for this?
This seems about right. I see some similar examples outside the extensions directory, like udp_default_writer_config.cc, which contain the extension name as part of the filename since those extensions don't have their own directories like in your example. You could consider making this action live in the watchdog directory instead, and possibly have the config object link directly into the watchdog library.
I moved them into the watchdog directory, and linked it directly into the guarddog_lib. Given this, I removed the AbortAction namespace since it originally was for being consistent with directory structure.
namespace Envoy {
namespace Extensions {
namespace Watchdog {
namespace AbortAction {
The AbortAction namespace doesn't seem necessary. I assume it exists for consistency with the directory structure.
Yep, it was for consistency with the directory structure. Since I removed the abort_action directory in response to your other comment, I removed the AbortAction namespace as well.
Done.
  // Successfully signaled to thread to terminate, sleep for wait_duration.
  absl::SleepFor(absl::Milliseconds(PROTOBUF_GET_MS_OR_DEFAULT(config_, wait_duration, 0)));
} else {
  // Failed to send the signal, abort?
Odd comment since we effectively panic in the next line.
Good point. I removed it. The LOG along with the comment on the PANIC seem to supply enough context there.
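To make the control flow under discussion concrete, here is a minimal sketch of the abort action as the quoted snippets describe it: signal the thread, sleep for wait_duration if the signal was sent, then panic either way. The function name `runAbortAction`, the `AbortOutcome` enum, and the injected callbacks are hypothetical stand-ins (for `Thread::terminateThread`, `absl::SleepFor`, and `PANIC`) so the sketch can run without Envoy; it is not the PR's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>

// Which path led to the panic. The real action would PANIC instead of returning.
enum class AbortOutcome { PanickedAfterWait, PanickedSignalFailed };

// Hypothetical sketch of the abort action's flow.
// terminate_thread: returns true if the signal was sent to the stuck thread.
// sleep_for: blocks for the given duration (injected so tests do not block).
// panic: receives the panic message (injected so tests do not abort).
AbortOutcome runAbortAction(const std::function<bool()>& terminate_thread,
                            std::chrono::milliseconds wait_duration,
                            const std::function<void(std::chrono::milliseconds)>& sleep_for,
                            const std::function<void(const std::string&)>& panic) {
  assert(wait_duration.count() >= 0);
  if (terminate_thread()) {
    // Signal sent: give the signaled thread time to crash the process itself.
    // In a real process, a successful crash means the lines below never run.
    sleep_for(wait_duration);
    panic("signaled thread has not yet crashed the process");
    return AbortOutcome::PanickedAfterWait;
  }
  // Signal failed: no point in waiting, fall through to the panic.
  panic("failed to signal thread");
  return AbortOutcome::PanickedSignalFailed;
}
```

Injecting the sleep and panic callbacks is only a testing convenience for this sketch; the point is that both branches end in the panic, which is why a "should we abort?" comment in the else branch was redundant.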
source/server/configuration_impl.cc
// Add abort_action if killing is enabled.
envoy::watchdog::abort_action::v3alpha::AbortActionConfig abort_config;
// Wait one second for the aborted thread to abort.
abort_config.mutable_wait_duration()->set_seconds(1);
Can we rely on the default wait_duration?
Done.
if (Thread::terminateThread(thread_id)) {
  // Successfully signaled to thread to terminate, sleep for wait_duration.
  absl::SleepFor(absl::Milliseconds(PROTOBUF_GET_MS_OR_DEFAULT(config_, wait_duration, 0)));
Should the default sleep duration be non-zero?
Yep, wait_duration now defaults to 5s.
// We should have the abort action added to both KILL and MULTIKILL events.
EXPECT_EQ(config.workerWatchdogConfig().actions().size(), 2);
EXPECT_EQ(config.mainThreadWatchdogConfig().actions().size(), 2);
Worth checking the contents of these action configs?
Done.
// Abort from the action since the signaled thread hasn't yet crashed the process.
// Panicking in the action gives flexibility since it doesn't depend on
// external code to kill the process if the signal fails.
PANIC(fmt::format(
The PANICs in guarddog_impl.cc are no longer reachable now that we are adding these actions by default. Should they be removed?
They're still reachable in the guarddog unit tests, which don't install this action, but perhaps they should. Thoughts?
It may make sense to install these default actions in the guarddog itself instead of doing so by modifying config earlier in the process. Doing so would allow us to get more consistent behavior in smaller tests and avoid potentially showing the config modifications when someone accesses the proxy config via the admin handler.
That was a great suggestion -- having this logic captured within just the guarddog simplifies a lot of the design.
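A rough sketch of the design agreed on here: the guarddog appends the default abort action to the KILL and MULTIKILL events itself, leaving the user-supplied config untouched. Everything in it is hypothetical (`GuardDogSketch`, `ActionMap`, the string action names), and it assumes a timeout of zero means the corresponding kill behavior is disabled.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class WatchdogEvent { Kill, MultiKill };

// Actions to run per event, in order. Strings stand in for real action objects.
using ActionMap = std::map<WatchdogEvent, std::vector<std::string>>;

// Hypothetical name for the built-in action.
const char kAbortActionName[] = "abort_action";

class GuardDogSketch {
public:
  // Assumption: a timeout of 0 disables that kill behavior.
  GuardDogSketch(std::chrono::milliseconds kill_timeout,
                 std::chrono::milliseconds multikill_timeout, ActionMap configured_actions)
      : actions_(std::move(configured_actions)) {
    assert(kill_timeout.count() >= 0 && multikill_timeout.count() >= 0);
    // The default is added here, inside the guarddog, so the caller's config
    // (and anything that dumps it) never sees the extra entries.
    if (kill_timeout.count() > 0) {
      actions_[WatchdogEvent::Kill].push_back(kAbortActionName);
    }
    if (multikill_timeout.count() > 0) {
      actions_[WatchdogEvent::MultiKill].push_back(kAbortActionName);
    }
  }

  // Returns the actions for an event (an empty list if none are installed).
  const std::vector<std::string>& actionsFor(WatchdogEvent event) { return actions_[event]; }

private:
  ActionMap actions_;
};
```

Because the constructor takes the configured actions by value, the caller's copy is unchanged, which is the property that keeps the admin-visible config free of the injected defaults. Small tests that build a guarddog directly also get the same default behavior as a full server.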
// MULTIKILL events if those are enabled.
message AbortActionConfig {
  // How long to wait for the thread to respond to the thread kill function
  // before killing the process from this action. This is a blocking action.
Should there be a default wait interval?
Yes, that makes sense. Will go with 5 seconds by default -- this is somewhat arbitrary, but it should be sufficient time for the failure handlers to finish running and for the process to exit.
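For illustration, the "use the configured value, else 5 seconds" behavior can be sketched without protobuf as below. The struct, the `has_wait_duration` flag, and `effectiveWaitDuration` are hypothetical stand-ins for the generated proto accessors and for a macro like `PROTOBUF_GET_MS_OR_DEFAULT`; only the 5 second default comes from the discussion.

```cpp
#include <cassert>
#include <chrono>

// Stand-in for the AbortActionConfig proto: an optional wait_duration field.
struct AbortActionConfigSketch {
  bool has_wait_duration = false;
  std::chrono::milliseconds wait_duration{0};
};

// Default agreed on in the review thread.
const std::chrono::milliseconds kDefaultWaitDuration{5000};

// Returns the configured wait, or the default when the field is unset.
std::chrono::milliseconds effectiveWaitDuration(const AbortActionConfigSketch& config) {
  assert(config.wait_duration.count() >= 0);
  return config.has_wait_duration ? config.wait_duration : kDefaultWaitDuration;
}
```

Tracking presence separately from the value matters here: an explicitly configured wait of zero is honored rather than being replaced by the default.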
…er core extensions, set a default wait time of 5 second, minor changes. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
source/server/configuration_impl.cc
auto actions = watchdog.actions();

// Add abort_action if killing is enabled.
envoy::watchdog::v3alpha::AbortActionConfig abort_config;
nit: Move just before the 2 PackFrom calls below.
Although, would it work if you don't specify typed config since the proto in question has no fields set?
It still works as we end up setting the type url which IIUC is necessary for finding the right factory for the typed config.
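A small sketch of why the typed config is still needed when the proto has no fields set: packing records the message's type URL, and the factory lookup keys on that URL rather than on the payload. `AnySketch`, `packFrom`, `FactoryRegistry`, and `findFactory` are hypothetical stand-ins for `google::protobuf::Any` and for the factory lookup; the only protobuf behavior relied on is that `PackFrom` uses the `type.googleapis.com/` prefix by default.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal stand-in for google.protobuf.Any: a type URL plus serialized bytes.
struct AnySketch {
  std::string type_url;
  std::string value;
};

// Mimics Any::PackFrom: the type URL is set even if the payload is empty.
AnySketch packFrom(const std::string& full_name, const std::string& serialized) {
  assert(!full_name.empty());
  return {"type.googleapis.com/" + full_name, serialized};
}

// Hypothetical registry: type URL -> factory name.
using FactoryRegistry = std::map<std::string, std::string>;

// Returns the factory name, or an empty string if no factory matches the type URL.
std::string findFactory(const FactoryRegistry& registry, const AnySketch& any) {
  const auto it = registry.find(any.type_url);
  return it == registry.end() ? std::string() : it->second;
}
```

In this sketch an `AnySketch` with an empty payload still resolves to its factory, while a default-constructed one (no type URL) does not, which mirrors the reply above.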
… guarddog. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
 * only works on platforms that support SIGABRT.
 *
 * Returns the result from the platform specific function (i.e. kill) to terminate
 * the thread. If the platform is currently unsupported, this will return false.
This does not return the result of the platform specific function, it returns true if the platform specific function succeeded. See implementation, return is: kill() == 0;
Good catch, I've changed it to:
Returns true if the platform specific function to terminate the thread succeeded (i.e. kill() == 0). If the platform is currently unsupported, this will return false.
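The distinction in the corrected comment (success as a boolean, not the raw return code) can be sketched with POSIX `kill`. The helper name `signalSucceeded` is hypothetical and it targets a process id rather than a thread, so treat it as an illustration of the return-value contract only, not of how `Thread::terminateThread` is implemented.

```cpp
#include <cassert>
#if defined(__unix__) || defined(__APPLE__)
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#endif

// Returns true if the platform call succeeded (kill() == 0).
// Returns false on failure, and always false on unsupported platforms.
bool signalSucceeded(long pid, int signal_number) {
#if defined(__unix__) || defined(__APPLE__)
  // pid <= 0 has process-group or broadcast semantics for kill(); reject it here.
  assert(pid > 0);
  return ::kill(static_cast<pid_t>(pid), signal_number) == 0;
#else
  (void)pid;
  (void)signal_number;
  return false;
#endif
}
```

On POSIX, passing signal number 0 performs only the error checking without sending a signal, which makes it a safe way to exercise the helper: `signalSucceeded(getpid(), 0)` is true, while an invalid signal number makes `kill` fail and the helper return false.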
@@ -1,7 +0,0 @@
watchdogs:
  main_thread_watchdog:
I guess this config file was never used?
Yep, it was added in an intermediate commit, when we were splitting watchdog into multiple watchdogs. But it didn't get used in the end, and I missed removing it (until now).
@envoyproxy/windows-dev
…ult-cross-platform Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/retest
Retrying Azure Pipelines.
PTAL @envoyproxy/api-shepherds @envoyproxy/dependency-shepherds
Thanks, this LGTM. @htuch any more concerns on this one?
/lgtm api
* master: (22 commits)
  ci: various improvements (envoyproxy#13660)
  dns: fix defunct fd bug in apple resolver (envoyproxy#13641)
  build: support ppc64le with wasm (envoyproxy#13657)
  [fuzz] Added random load balancer fuzz (envoyproxy#13400)
  dependencies: compute and check release dates via GitHub API. (envoyproxy#13582)
  mac ci: try ignoring update failure (envoyproxy#13658)
  watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103)
  typos: fix a couple 'enovy' mispellings (envoyproxy#13645)
  lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536)
  tap: fix upstream streamed transport socket taps (envoyproxy#13638)
  Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639)
  Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523)
  [fuzz] Fixed divide by zero bug (envoyproxy#13545)
  wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621)
  fix: record recovered local address (envoyproxy#13581)
  docs: fix incorrect compressor filter doc (envoyproxy#13611)
  docs: clean up docs for azp migration (envoyproxy#13558)
  wasm: fix building Wasm example. (envoyproxy#13619)
  test: Refactor flood tests into a separate test file (envoyproxy#13556)
  wasm: re-enable tests with precompiled modules. (envoyproxy#13583)
  ...
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Kevin Baichoo kbaichoo@google.com
Commit Message: Watchdog: use abort action as a default if killing is enabled and we're on a supported platform.
Additional Description:
Risk Level: low
Testing: unit tests
Docs Changes: Included
Release Notes: Included
See PR #13208 for context as the reason it's part of core envoy and not an extension.