Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: Treat stack overflows as an unrecoverable error #18448

Merged
merged 1 commit into from
Sep 21, 2022

Conversation

nmeum
Copy link
Member

@nmeum nmeum commented Aug 12, 2022

Contribution description

Presently, RIOT just emits a warning when a stack overflow is encountered but still resumes execution. In my view, execution should be aborted as the detection of a stack overflows via the heuristic provided by the scheduler is an unrecoverable error.

I ran into this while performing automated tests of a RIOT application where a stack overflow occurred but I only noticed this after inspecting the application output more closely.

Similar to SSP failures, I added crash_code for stack overflows and also bumped the log level from warning to error.

Testing procedure

I don't think there is a test case for the SCHED_TEST_STACK heuristic but basically the testing procedure would boil down to ensuring that execution does not continue after a stack overflow is detected by the scheduler.

Issues/PRs references

None.

Presently, RIOT just emits a warning when a stack overflow is
encountered but still resumes execution. In my view, execution should be
aborted as the detection of a stack overflows via the heuristic provided
by the scheduler is an unrecoverable error.

I ran into this while performing automated tests of a RIOT application
where a stack overflow occurred but I only noticed this after inspecting
the application output more closely.

Similar to SSP failures, I added crash_code for stack overflows.
@nmeum nmeum requested a review from kaspar030 as a code owner August 12, 2022 14:09
@github-actions github-actions bot added the Area: core Area: RIOT kernel. Handle PRs marked with this with care! label Aug 12, 2022
@nmeum nmeum force-pushed the pr/stack-overflow-panic branch from 0dade34 to 95b6f85 Compare August 12, 2022 14:14
Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Note that the heuristic may detect a stack overflow a tad too early (e.g. it is not possible to tell apart if the stack was fully exhausted up to the last byte but not overflown, or it did overflow indeed). But IMO wasting up to 4 bytes of stack space is very much worth it.

Also note that with the changed behavior we should get rid of THREAD_CREATE_STACKTEST and instead always create the stack overflow detection pattern in the stack. (That is, define THREAD_CREATE_STACKTEST to 0 for compatibility and deprecate that #define, but drop all in-tree users.) Otherwise a thread created without a stack test would result in an immediate core panic.

Do you have time to do this? Otherwise I could create a PR to remove THREAD_CREATE_STACKTEST and deprecated the backward compatibility define. This could then rebased on top.

It would IMO also be nice to use the new PANIC_STACK_OVERFLOW for the MPU stack protector. I think when mpu_noexec_ram is not used, one could just use that in the mem manage handler without additional effort to narrow the cause down, right?

@miri64
Copy link
Member

miri64 commented Aug 15, 2022

Also note that with the changed behavior we should get rid of THREAD_CREATE_STACKTEST and instead always create the stack overflow detection pattern in the stack.

Isn't THREAD_CREATE_STACKTEST also used to generate canary that is used to determine the used / free output for ps? If so I would be very much against its removal, as this is a very useful analysis feature for RAM optimization of apps.

See also

RIOT/core/thread.c

Lines 237 to 247 in 25a5269

if (flags & THREAD_CREATE_STACKTEST) {
/* assign each int of the stack the value of it's address. Alignment
* has been handled above, so silence -Wcast-align */
uintptr_t *stackmax = (uintptr_t *)(uintptr_t)(stack + stacksize);
uintptr_t *stackp = (uintptr_t *)(uintptr_t)stack;
while (stackp < stackmax) {
*stackp = (uintptr_t)stackp;
stackp++;
}
}

and

RIOT/core/thread.c

Lines 174 to 189 in 25a5269

uintptr_t thread_measure_stack_free(const char *stack)
{
/* Alignment of stack has been fixed (if needed) by thread_create(), so
* we can silence -Wcast-align here */
uintptr_t *stackp = (uintptr_t *)(uintptr_t)stack;
/* assume that the comparison fails before or after end of stack */
/* assume that the stack grows "downwards" */
while (*stackp == (uintptr_t)stackp) {
stackp++;
}
uintptr_t space_free = (uintptr_t)stackp - (uintptr_t)stack;
return space_free;
}

@maribu
Copy link
Member

maribu commented Aug 15, 2022

Yes. The point is, with this change RIOT will no longer work at all without THREAD_CREATE_STACKTEST. And if it becomes mandatory anyway, we should just deprecate the flag and make the stack test mandatory.

@nmeum
Copy link
Member Author

nmeum commented Aug 16, 2022

I think enabling the stack overflow detection heuristic by default is a good idea. However, maybe it makes sense to retain THREAD_CREATE_STACKTEST in cases where mpu_stack_guard is used, i.e. allow disabling the heuristic when an MPU is available for thread stack overflow detection?

Do you have time to do this? Otherwise I could create a PR to remove THREAD_CREATE_STACKTEST and deprecated the backward compatibility define. This could then rebased on top.

I don't have the time to implement this at the moment, so feel free to move forward with this :)

It would IMO also be nice to use the new PANIC_STACK_OVERFLOW for the MPU stack protector.

Yes, this is a good idea! Not sure how to best implement this though, currently the mem_manage_handler is invoked for all violations of MPU region attributes (i.e. for both mpu_stack_guard and mpu_noexec_ram):

void mem_manage_default(void)
{
core_panic(PANIC_MEM_MANAGE, "MEM MANAGE HANDLER");
}

I am not much of an ARM person myself so I am not sure how one would determine the cause for the invocation of the mem_manage_handler handler.

@kaspar030
Copy link
Contributor

The point is, with this change RIOT will no longer work at all without THREAD_CREATE_STACKTEST. And if it becomes mandatory anyway, we should just deprecate the flag and make the stack test mandatory.

SCHED_TEST_STACK is pretty weak for finding stack overflows. I don't think it is worth the extra check on every context switch. Anyhow, it doesn't need the whole stack full with canary words, only *stack_start.
THREAD_CREATE_STACKTEST is optional as it is quite expensive. For long-running threads that doesn't matter, but enabling this by default for short-running threads makes these prohibitively expensive.
If SCHED_TEST_STACK is enabled, we should always set *stack_start to the correct value, but IMO THREAD_CREATE_STACKTEST should stay optional.

@nmeum nmeum force-pushed the pr/stack-overflow-panic branch from 95b6f85 to 8011665 Compare September 21, 2022 07:58
@nmeum
Copy link
Member Author

nmeum commented Sep 21, 2022

I just rebased this PR and resolved the merge conflict.

Apart from enabling the stack overflow check by default (which is likely a separate issue) do we at least agree that it makes sense to trigger a panic on stack overflow? And if so: What changes are necessary in order to get this merged then?

@maribu
Copy link
Member

maribu commented Sep 21, 2022

OK, it looks that my fear that without THREAD_CREATE_STACKTEST a stack overflow would be falsely detected is baseless:

RIOT/core/thread.c

Lines 237 to 252 in 3876f38

if (flags & THREAD_CREATE_STACKTEST) {
/* assign each int of the stack the value of it's address. Alignment
* has been handled above, so silence -Wcast-align */
uintptr_t *stackmax = (uintptr_t *)(uintptr_t)(stack + stacksize);
uintptr_t *stackp = (uintptr_t *)(uintptr_t)stack;
while (stackp < stackmax) {
*stackp = (uintptr_t)stackp;
stackp++;
}
}
else {
/* create stack guard. Alignment has been handled above, so silence
* -Wcast-align */
*(uintptr_t *)(uintptr_t)stack = (uintptr_t)stack;
}

(In the else branch still a canary is added to test against.) So I think this is safe to go in as is.

@maribu maribu added Process: needs >1 ACK Integration Process: This PR requires more than one ACK CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Sep 21, 2022
Copy link
Contributor

@kaspar030 kaspar030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

@kaspar030 kaspar030 enabled auto-merge September 21, 2022 11:58
@miri64 miri64 added Process: needs >1 ACK Integration Process: This PR requires more than one ACK and removed Process: needs >1 ACK Integration Process: This PR requires more than one ACK labels Sep 21, 2022
@kaspar030 kaspar030 merged commit 6db9960 into RIOT-OS:master Sep 21, 2022
@maribu maribu added this to the Release 2022.10 milestone Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Process: needs >1 ACK Integration Process: This PR requires more than one ACK
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants