Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

riotboot: ECC faults (eg. in STM32L5 or STM32WB) not handled gracefully #17874

Open
chrysn opened this issue Mar 28, 2022 · 9 comments
Open

riotboot: ECC faults (eg. in STM32L5 or STM32WB) not handled gracefully #17874

chrysn opened this issue Mar 28, 2022 · 9 comments
Assignees
Labels
Area: cpu Area: CPU/MCU ports Area: OTA Area: Over-the-air updates Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@chrysn
Copy link
Member

chrysn commented Mar 28, 2022

Description

Flash in STM32L5 and STM32WB, which works in 64bit words with 8bit ECC, can produce NMIs when reading faulty flash -- in particular I expect (sic!; didn't verify) that power loss during writing would create such a situation.

(And I am firmly on the stand point that "Please do not power off the device while saving" is a user request befitting only to 20th century game consoles).

Steps to reproduce the issue

Note that this is pure speculation, but founded in the reference manuals and code paths.

  • Use riotboot in an A/B setup, write image B, and cut power while writing the header.

(If riotboot ever grew verification of the full image, this would become more of a problem; with the header it's really a very short race).

Expected results

  • On next startup, image B is detected as invalid because its header doesn't check out.

"Actual" results

  • riotboot reads image B's header, encounters a flash read error due to the ECC violation, runs into the default NMI handler and reboots ad infinitum.

Possible resolution

Sadly, the flash read errors are really non-maskable, even though riotboot might manage to perform a "careful read" on that flash area.

A possible solution would be to define, when "careful reads" are enabled, to define

static uint8_t *careful_read_start;
size_t careful_read_length;

(possibly under mutex guard, so that no two concurrent careful read operations ever happen, or as a linked list of struct careful_read areas). The NMI would in that case be extended such that if the NMI cause was a flash error (i.e. the ECCD bit is set), it checks whether the faulty address (ADDR_ECC) is in the (or any) "careful read" slices.

From there, I'm not too sure: If an NMI can return safely (can it at all? If so, how can it be sure that only this one bit triggered the NMI, and not something else as well?), it just does (which probably returns garbage to the read; possibly also setting some flag around careful reads that the bootloader later checks). If it can not, then the NMI handler would write zeros into that address or erase the page (we'd need a different term for "careful read" ... "nuclear read"? "read but on error take the whole thing to a safer state and reboot" reads?) and reset.

Either way, the outcome would be that the one good image still comes up.

@chrysn chrysn added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: cpu Area: CPU/MCU ports labels Mar 28, 2022
@chrysn chrysn self-assigned this Mar 28, 2022
@chrysn chrysn added the Area: OTA Area: Over-the-air updates label Mar 28, 2022
@kaspar030
Copy link
Contributor

(And I am firmly on the stand point that "Please do not power off the device while saving" is a user request befitting only to 20th century game consoles and Windows).

fixed that for you

@kaspar030
Copy link
Contributor

So, the header is only written if the rest of the image has been verified. and the header itself is checksum protected. Is it possible to disable ECC only while reading the header?

@chrysn
Copy link
Member Author

chrysn commented Mar 28, 2022

As far as I understand the data sheet, the only way to "disable" ECC is to do something inside the NMI.

@kfessel
Copy link
Contributor

kfessel commented Mar 28, 2022

https://www.st.com/resource/en/reference_manual/rm0434-multiprotocol-wireless-32bit-mcu-armbased-cortexm4-with-fpu-bluetooth-lowenergy-and-802154-radio-solution-stmicroelectronics.pdf 3.3.3

seems like the do something is:

  • check which of the three sources for the nmi was triggered
  • if flash ecc (FLASH_ECCR has ECCD set) reset that ECCD-bit and continue (maybe react to that bad ecc)

maybe one also needs to have a Flash Interupt handler that checks for ECCC

@chrysn
Copy link
Member Author

chrysn commented Mar 28, 2022

Continuing needs a lot of care:

  • We'd need to be sure that there are really only the known sources of NMIs (the cited document lists them on page 364, others may not).
  • We need to be sure that no race conditions occur (when clearing one cause and the other happens)
  • We need to be sure not to miss unrelated faults. (For example, a second flash ECC fault happening outside the currently-being-read area would go unnoticed, briefly disabling all ECC effectively; but the priority of the NMI might help here, maybe it'd double-fault right away?).
  • (And I don't know yet what happens if NMI even returns).

I guess that "mark as faulty and reboot" might be an acceptable result for way less effort. To avoid the dichotomy between "full NMI clearing" and "start wiping memory on ECC fault", there may be a third option like storing the address in reboot-safe memory, rebooting, and relying on the caller of "careful read" to look there before attempting access. (Not through-through, just to be sure we keep our options open).

@kfessel
Copy link
Contributor

kfessel commented Mar 28, 2022

Continuing needs a lot of care:

* We'd need to be sure that there are really only the known sources of NMIs (the cited document lists them on page 364, others may not).

If the NMI cannot be identified -> do something safe (e.g. reboot/reset or stop) the document talks about three sources
eg:

if flash eccd -> handle that and return
else if sram parity -> handle that and return
else if hse css ->  handle that and return
else reset / reboot / stop
* We need to be sure that no race conditions occur (when clearing one cause and the other happens)

While in NMI nothing but reset can happen (or hardfault which will halt the whole system until reset)

* We need to be sure not to miss _unrelated_ faults. (For example, a second flash ECC fault happening outside the currently-being-read area would go unnoticed, briefly disabling _all_ ECC effectively; but the priority of the NMI might help here, maybe it'd double-fault right away?).

If i am not misreading the document, it is either an:

  • ECCC (one bit gone in that memory cell and that will be corrected (possibly issuing a Flash-Int and setting ECCC) or
  • ECCD (two bit error in that memory cell, can only be detected and leads to NMI and ECCD set).
    If I read that correctly:
    There can never be ECCC and ECCD set (just ECCC will continue silently and if not handled and ECCD might happen without setting the FLAG ("While ECCC or ECCD is set, FLASH_ECCR is not updated if a new ECC error occurs. FLASH_ECCR is updated only when ECC flags are cleared") but the NMI will be called -> Flash Int needs to handle ECCC if you want to know if ECCD was the NMI-Source.

If one memory cell yields an ECCC and another one also ECCC it is still ECCC. No double fault upgrade of ECCC to ECCD each read is either correct, correctable or faulty.

* (And I don't know yet what happens if NMI even returns).

What happens when an interrupt returns?

@chrysn
Copy link
Member Author

chrysn commented Mar 28, 2022 via email

@chrysn
Copy link
Member Author

chrysn commented Mar 28, 2022

Follow-up post because GitHub's new quote detection completely ignores how the signature separator looks like, and interprets my markdown <hr/> as something it is not -- repeating what is hidden in the above post and irrecoverable because editing a mailed comment breaks formatting worse yet.


So if all this can be managed and understood -- sure, would be nice. But
if any of that causes trouble, I'd personally go for the low-tech
solution of clearing the memory cell (users of that extra read
functionwould need to be aware that any write unit may be cleared out)
and starting over.

(Clever users might want to provide mechanisms to re-obtain and re-store
the very values that should have gone into that flash area, but that too
is a 20% use case).

@jeandudey
Copy link
Contributor

can it at all? If so, how can it be sure that only this one bit triggered the NMI, and not something else as well?

A HardFault triggers if the BusFault NMI is not enabled in the SCB, so with the BusFault enabled and the SCB->BFSR and SCB->BFAR can be used to identify where the error happened. So one can identify the page/sector by comparing the BFAR register and the areas that riotboot might write (e.g.: the firmware header), and then decide how to handle the situation (e.g.: deleting the sector to remove the ECC error, write a 0 on the address, etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: cpu Area: CPU/MCU ports Area: OTA Area: Over-the-air updates Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

4 participants