-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Static/dynamic analysis of data plane exceptions on worker threads #14320
Comments
I like it, but wouldn't we need an integration test to catch the failure? Maybe swap it for ENVOY_BUG so if there are any slip-ups they're caught (without crashing) in prod? If someone is willing to doc up the current use cases, we could maybe run a fixit, or encourage new Envoy devs to pick them up - things like fixing Utility::getResponseStatus would be good intro projects to Envoy Do you think it would also be worth splitting out data plane code from control plane code by location or naming convention or some such? That way we could also catch it with fix format scripts, and avoid my integration test concerns above. |
Why wouldn't existing integration tests catch many of the failures? The only issue is that coverage is often in unit tests vs. integration tests, but it's probably not in scope to increase integration test coverage to 100%. Agreed on the ENVOY_BUG instead of ASSERT. |
We only throw on unexpected/invalid behavior and those corner cases are frequently unit tested rather than integration tested. I'd be pleasantly surprised if even with ENVOY_BUG it caught even half the issues in CI, especially after a quick scan of where we throw exceptions. |
The aim of wrapping the |
oh yeah, you're right, I totally misread this. So to do this, we'd have to remove the catch we still have in the codec dispatch. I think that's pretty dangerous unless we have plenty more safeguards making sure utilities (like the example I posted) don't throw. Alternately we could leave in that catchall, but I think it mostly defeats the purpose if we still have a catchall catch since I think most throws depend on that? |
After some offline discussion with @alyssawilk the goal here is not to remove top-level Another suggestion here is to write a CodeQL scanner that ensures that every |
An example of this class of data plane exception use is
We should also |
Seems like a good direction. Can we also have a clang-tidy check for use of try? |
I'm not sure how we'd do that if we continue allow try to be used for control plane code (hence why I suggested splitting files out) Or maybe it would be possible to just say new data plane code should avoid try/catch as well, and blacklist new instances, but I think that'd be more controversial. |
Yeah without some delineation between "this code is ok to run on the data plane" it's hard to prevent accidental usage of throwing utility functions. It's very easy to throw in something like a status code conversion function that appear innocent in code review but that was actually intended for use within a try-catch. Splitting the code base between data/control plane seems a bit heavy handed (what about all the code that is safely shared?), but it is a fairly low tech solution. Conceptually I would imagine you could make use of static analysis and annotations would be good (e.g. Having the first step being to try to find existing violations makes sense to me, then we can work on prevention in the future. |
Control plane uses happen on the main thread (at least those that should be allowed to throw), data plane on worker threads. So, if we assert based on TID we can catch this. Admin endpoint is probably an exception that arguably shouldn't be on main thread (we don't want the admin endpoint to lock-up on large config updates for example). I agree that new data plane code should avoid try/catch, the main issue is around utilities and nested stacks. Better structure as well would help here, but it will be a lot of work to get to the point that we could use things like build visibility rules to enforce (but arguably a good north star). |
" So, if we assert based on TID we can catch this." where "this" is new try blocks being added, not new exceptions being added, right?. I don't object, but I'm convinced it'll catch enough to be worth the churn. If we agree that new data plane code should avoid try/catch, I think check_format or clang tidy checks which disallow try/catch by default but could be overridden for legit reasons would result in more consistent future-proofing. We can skip the check for refactor PRs, but by default just disallow new additions. |
How do we know if a utility is safe to use on data plane or control plane? I think this is the crux of the problem. |
How many utilities have local try-catch blocks, and wouldn't be caught by the audit you'd need to do to land that macro in the first place? I'm guessing not many, even without restricting what we look at to not-obviously based utilities.. Based on my knowledge of what's integration tested I'm dubious there'd be value add beyond the audit and fix_format checks. I'm not going to block the PR if you find someone to do the work, I just don't think it's going to get us the most exception-saftey-bang for our proverbial buck. |
I think 100% of them would be caught, that is the plan outlined today. The idea is to provide a regression framework for new ones. I think it's becoming clear that a combination of things would make sense here. Worker thread ID-based ENVOY_BUG for both try/throw uses. CodeQL checks to verify that all throws are in fact caught (and not by the top-level dispatch). Annotations at a function level indicating exception safety or data/control plane compatibility. I've updated the title to make clear the more general set of options. I think I'd defer to the person doing the actual work to prioritize amongst these. |
@chaoqin-li1123 has kindly volunteered to do some initial work on this issue. |
Thanks! I will start with the thread id assertion macros and try to replace the raw try catch block one by one. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
The next step will be adding the try macros into the codebase and also some examples to demonstrate its use. |
We can also remove exception from parseInternetAddressAndPort( envoy/source/common/network/utility.cc Line 206 in 659f2ea
envoy/source/common/network/utility.cc Line 140 in 659f2ea
|
IMO you should avoid `ASSERT(Thread::MainThread::isMainThread())` as a
pattern, precisely because it creates non-obvious dependencies on
initializing the threading system before running the tests.
Are you implying that we need to have this thread-checking in `
Utility::hostFromUdpUrl`? Why do we need checking for threads in what looks
like an algorithmic function?
If that algorithmic function is really depending on some hidden state,
let's make the state be explicit and then when we construct it we can
provide the threading/dispatching context needed to do the checi.
…On Sun, Jan 10, 2021 at 11:06 PM chaoqin-li1123 ***@***.***> wrote:
A tls instance has to be constructed for the assertion to pass. I agree
that we can inject something in this case, I am still thinking about it.
—
You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
<#14320 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAO2IPLGPHFY2AS44XYRF7LSZJ2KPANCNFSM4USBM25A>
.
|
This may not be a perfect example because a worker thread is not going to parse a udpaurl. I guess we may want to check the thread id when threading is on and an exception will be thrown. We want to remove exception from the data plane, that's why we want to do the checking. This is an algorithmic function, but some algorithmic functions can be involved and throw exceptions in worker thread, which is what we want to avoid. |
To provide some context, we want to replace |
Actually do you need to assert that you are on the main thread at all the places exceptions are thrown? That might be hard because library functions can throw. Instead maybe you could assert main-thread at all places excepts are caught. There are probably be fewer of them, and maybe they'll have enough context to access the dispatcher. WDYT? |
I see your point. Many library functions don't have enough threading context, but there may be enough context where the exceptions are caught. Actually I don't want to assert in all the throw, I plan to assert in all the try, I guess that is almost equivalent to "assert main-thread at all places excepts are caught". I will do some investigation to see whether we have the necessary context in all the places where exceptions are caught. Good night! |
I have a proposal somehow related to this issue. Currently, the constructor of EnvoyException envoy/include/envoy/common/exception.h Line 12 in 1d1b708
takes a const string reference as argument, which means that when we pass in a string literal, like EnvoyException("error message"), a tempory string object will be constructed, and this string will be copied by the constructor of std::runtime_error, the super class of EnvoyException. That means we are allocating the memory twice. Do you think it reasonable to avoid redundant string creation by adding another constructor EnvoyException(const char * message) : std::runtime_error(message) {} |
@chaoqin-li1123 probably not worth micro-optimizing, since this is the unhappy path. That said, I think we're trying to move to |
Make sense. The cost of creating a tempory string is small when compared to all the stack unwinding in an exceptional scenario. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
The next PR will change the error propagation of existing envoy code. For codes executed on worker thread, replace try catch with other error propagation mechanism. When all the try catch in existing code has be removed or replaced with try with main thread assertion, we can add format checking to disallow raw try. |
… caught on main thread (#15251) Signed-off-by: chaoqin-li1123 <chaoqinli@google.com> Commit Message: Define a macros that wrap try with an assertion that exceptions are only caught in main thread. Currently, envoy use c++ exception for error propagation. This PR is one of the steps to address #14320. The long term goal is to disallow raw try in envoy core code and eliminate c++ exception from data plane, which can improve exception safety. The try in the PR happen on main thread and can be wrapped in main thread assertion without breaking any existing test. In the following PR, raw try that can not be replaced by TRY_ASSERT_MAIN_THREAD will be removed from core codebase with other error propagation. Additional Description: Risk Level: Testing: none Docs Changes: none Release Notes: none Platform Specific Features: none
…me try catch pattern (#16122) Commit Message:This is part of the effort to remove C++ exception from data plane by adding assertion that the code is executed in main thread when an exception is caught.(#14320) By making the constructor of instances(PipeInstance, Ipv6Instance, Ipv4Instance) no throw, remove some try catch code from envoy. Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
…me try catch pattern (envoyproxy#16122) Commit Message:This is part of the effort to remove C++ exception from data plane by adding assertion that the code is executed in main thread when an exception is caught.(envoyproxy#14320) By making the constructor of instances(PipeInstance, Ipv6Instance, Ipv4Instance) no throw, remove some try catch code from envoy. Signed-off-by: chaoqin-li1123 <chaoqinli@google.com> Signed-off-by: chris.xin <xinchuantao@qq.com>
…me try catch pattern (envoyproxy#16122) Commit Message:This is part of the effort to remove C++ exception from data plane by adding assertion that the code is executed in main thread when an exception is caught.(envoyproxy#14320) By making the constructor of instances(PipeInstance, Ipv6Instance, Ipv4Instance) no throw, remove some try catch code from envoy. Signed-off-by: chaoqin-li1123 <chaoqinli@google.com>
In general, there should be no exceptions on worker threads, and they should only happen on the main thread. To validate this at runtime (and during test runs which should catch most instances), I propose we replace all:
in Envoy with
where
envoy_try
is something like:This bug tracks this proposal and implementation work. There's probably a number of data plane exceptions which still happen on worker threads, which need to be fixed before the
ASSERT
can be merged, but we can convert to the new macro to facilitate this. We would also augmentcheck_format
to catch any rawtry
statements.@envoyproxy/maintainers WDYT?
CC @chaoqin-li1123 @asraa
The text was updated successfully, but these errors were encountered: