-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
riotboot: ECC faults (eg. in STM32L5 or STM32WB) not handled gracefully #17874
Comments
fixed that for you |
So, the header is only written if the rest of the image has been verified. and the header itself is checksum protected. Is it possible to disable ECC only while reading the header? |
As far as I understand the data sheet, the only way to "disable" ECC is to do something inside the NMI. |
seems like the do something is:
maybe one also needs to have a Flash Interupt handler that checks for ECCC |
Continuing needs a lot of care:
I guess that "mark as faulty and reboot" might be an acceptable result for way less effort. To avoid the dichotomy between "full NMI clearing" and "start wiping memory on ECC fault", there may be a third option like storing the address in reboot-safe memory, rebooting, and relying on the caller of "careful read" to look there before attempting access. (Not through-through, just to be sure we keep our options open). |
If the NMI cannot be identified -> do something safe (e.g. reboot/reset or stop) the document talks about three sources
While in NMI nothing but reset can happen (or hardfault which will halt the whole system until reset)
If i am not misreading the document, it is either an:
If one memory cell yields an ECCC and another one also ECCC it is still ECCC. No double fault upgrade of ECCC to ECCD each read is either correct, correctable or faulty.
What happens when an interrupt returns? |
```
if flash eccd -> handle that and return
else if sram parity -> handle that and return
else if hse css -> handle that and return
else reset / reboot / stop
```
This implies that they are mutually exclusive. Maybe an ECCD and HSE CSS
fault can occur at at the same time. If NMIs are level triggered by the
presence of any of these bits, then we can implement such logic easily,
but I haven't found the right documentation that states that that is so.
- ECCC (one bit gone in that memory cell and that will be corrected (possibly issuing a Flash-Int and setting ECCC) or
yes, that's a separate interrupt. (Might make sense to record it so that
the device can ask for a firmware update for its other slot at the
earliest convenience).
- ECCD (two bit error in that memory cell, can only be detected and leads to NMI and ECCD set).
If I read that correctly:
There can never be ECCC and ECCD set (just ECCC will continue silently and if not handled and ECCD might happen without setting the FLAG ("While ECCC or ECCD is set, FLASH_ECCR is not updated if a new ECC error occurs. FLASH_ECCR is updated only when ECC flags are cleared") but the NMI will be called -> Flash Int needs to handle ECCC if you want to know if ECCD was the NMI-Source.
If one memory cell yields an ECCC and another one also ECCC it is still ECCC. No double fault upgrade of ECCC to ECCD each read is either correct, correctable or faulty.
I'm only considering the ECCD here.
Between the time the first ECCD occurs and some unrelated ECCD, the
second ECCD would pass silently -- something our "clear and continue"
handler would introduce.
The example is contrived (especially it doesn't happen in riotboot, but
the larger issue affects anyone who writes to flash, and riotboot
happens to be the only full application we ship that does), but should
still illustrate what I'm saying here:
Assume we do a "careful read" while some DMA operation is in progress
that writes data from flash to UART (maybe part of stdout). The "careful
read" operation reads out the faulty ROM entry, sets FLASH_ECCR with address and ECCD, and trips the NMI. While the NMI is being entered, the stdout DMA
also finds a faulty cell, but does not set ECCD because it already is
set.
The second ECCD fault would thus go unnoticed.
Granted, it is *highly* unlikely that two faults happen in the same chip
in the first place (one can be part of an aborted flash operation, but
only one flash operation fails per surprise power-off), and even more
unlikely that they race.
What happens when an interrupt returns?
Control flow continues where it was before -- but all the interrupts
I've ever handled were "regular" ones that effectively came to pass
between two instructions (i.e. instruction is processed, PC is
increased, and state is saved as part of the interrupt procedure); here
the interrupt happens during processing of the instruction when the
instruction can not complete, and I wouldn't know how to tell (except
exerimentally) whether the faulting instruction completes, if so with
which value, or not (in which case the load would ... be retried and
fail again unless fixed?).
…-----
So if all this can be managed and understood -- sure, would be nice. But
if any of that causes trouble, I'd personally go for the low-tech
solution of clearing the memory cell (users of that extra read
functionwould need to be aware that any write unit may be cleared out)
and starting over.
(Clever users might want to provide mechanisms to re-obtain and re-store
the very values that should have gone into that flash area, but that too
is a 20% use case).
|
Follow-up post because GitHub's new quote detection completely ignores how the signature separator looks like, and interprets my markdown <hr/> as something it is not -- repeating what is hidden in the above post and irrecoverable because editing a mailed comment breaks formatting worse yet. So if all this can be managed and understood -- sure, would be nice. But (Clever users might want to provide mechanisms to re-obtain and re-store |
A HardFault triggers if the BusFault NMI is not enabled in the SCB, so with the BusFault enabled and the |
Description
Flash in STM32L5 and STM32WB, which works in 64bit words with 8bit ECC, can produce NMIs when reading faulty flash -- in particular I expect (sic!; didn't verify) that power loss during writing would create such a situation.
(And I am firmly on the stand point that "Please do not power off the device while saving" is a user request befitting only to 20th century game consoles).
Steps to reproduce the issue
Note that this is pure speculation, but founded in the reference manuals and code paths.
(If riotboot ever grew verification of the full image, this would become more of a problem; with the header it's really a very short race).
Expected results
"Actual" results
Possible resolution
Sadly, the flash read errors are really non-maskable, even though riotboot might manage to perform a "careful read" on that flash area.
A possible solution would be to define, when "careful reads" are enabled, to define
(possibly under mutex guard, so that no two concurrent careful read operations ever happen, or as a linked list of
struct careful_read
areas). The NMI would in that case be extended such that if the NMI cause was a flash error (i.e. the ECCD bit is set), it checks whether the faulty address (ADDR_ECC) is in the (or any) "careful read" slices.From there, I'm not too sure: If an NMI can return safely (can it at all? If so, how can it be sure that only this one bit triggered the NMI, and not something else as well?), it just does (which probably returns garbage to the read; possibly also setting some flag around careful reads that the bootloader later checks). If it can not, then the NMI handler would write zeros into that address or erase the page (we'd need a different term for "careful read" ... "nuclear read"? "read but on error take the whole thing to a safer state and reboot" reads?) and reset.
Either way, the outcome would be that the one good image still comes up.
The text was updated successfully, but these errors were encountered: