Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC for handling Erase WiFi Setting after OTA #6965

Draft
wants to merge 64 commits into
base: master
Choose a base branch
from

Conversation

mhightower83
Copy link
Contributor

@mhightower83 mhightower83 commented Dec 31, 2019

Status: FLASH_MAP_SUPPORT has not been properly incorporated.

Issue: Sometimes when an ESP8266 is reflashed/upgraded the WiFi does not work.
Then a serial flash with Erase Flash with WiFi setting is recommended. I have
seen this more often when changing SDK by OTA. We don't have an erase WiFi for
OTA.

This PR should not be merged. It presents three Proof of Concept solutions. The intent is that one of these solutions would be chosen, developed further, and incorporate in a new PR.

There are 3 cases to consider when the firmware is updated by OTA. The new firmware:

  1. has the same Flash Configuration as the old.
  2. has a larger Flash Configuration than the old.
  3. has a smaller Flash Configuration than the old.

In theory after an OTA and before a restart, the flash could be erased for
case 1. Case 2 is a problem because the size exceeds the size specified in
flashchip. That size is used by SPIEraseSector to validate the callers request
and fails when too large. We have to wait for a restart, after which the SDK
will have updated the values in flashchip. Case 3 is potentially unsafe
because we could be erasing the code that is running.

At app_entry() flashchip properties appear to be reset to a default 4MByte
flash. Even an ESP8285 reported 4MByte. The value of flashchip->chip_size was
changed at the 2nd SPIRead by the SDK. The 1st read was to address 0, 4 bytes.

To erase the flash WiFi area for methods 1 and 2, I choose to wait until the SDK
has finished its adjustments to the flashchip structure. Then to begin erasing
WiFi sectors. Method 3 runs before the SDK starts. It temporarily updates
flashchip->chip_size. Then does the erase and puts everything back before
starting the SDK.

Summary of the three methods:

  1. The first runs after the SDK calls user_init() and flash code execution is
    available (No IRAM needed), but restarts to be sure the SDK does not get
    confused about its sectors being erased.

  2. The 2nd method runs as early as possible requires IRAM. The sectors are
    erased before the 2nd read is processed. This allows the SDK to start off
    thinking the sectors were blank at boot.

  3. Similar to method 2 runs as early as possible, turns on flash code execution
    so more of the initialization code can be moved to flash. Directly modifies the
    size element in flashchip structure in ROM data, dRAM. This allows the flash
    erase to succeed.

    The original flash size is restored before starting the SDK with the assumption
    that the SDK will handle the size change properly. It should be noted that
    only changing the size value in the flashchip structure, is equivalent to what the
    esptool.py is doing, when it flashes the firmware.

I also added an example/test sketch for exercising the feature. It is
OTAEraseConfig. It also gathers some WiFi signal/connection statistics.

Edited: Added Method 3, corrected some grammar.

Issue: Sometimes when an ESP8266 is reflashed/upgraded the WiFi does not work.
Then a  serial flash with Erase Flash with WiFi setting is recommended. I have
seen this more often when changing SDK by OTA. We don't have an erase WiFi for
OTA.

The PR tries to present a proof of concept solution. Actually it describes two
different methods.

There are 3 cases to consider when the firmware is updated by OTA. The new
firmware:
1. has the same Flash Configuration as the old.
2. has a larger Flash Configuration than the old.
3. has a smaller Flash Configuration then the old.

In theory after an OTA and before a restart, the flash could be erased for
_case 1_. _Case 2_ is a problem because the size exceeds what the SPIEraseSector
expects and fails. We have to wait for a restart, after which the SDK will have
updated the values in `flashchip`. _Case 3_ is potentially unsafe because we
could be erasing the code that is running.

At app_entry() `flashchip` properties appear to be reset to a default 4MByte
flash. Even an ESP8285 reported 4MByte. The value of `flashchip->chip_size` was
changed at the 2nd SPIRead. The 1st read was to address 0, 4 bytes.

To erase the flash WiFi area, I choose to wait until the SDK has finished its
adjustments to the `flashchip` structure. Then to begin erasing WiFi sectors.

I implemented two methods:
1. The first runs after the SDK calls `user_init()` and flash code execution is
available (No IRAM needed), but restarts to be sure the SDK  does not get
confused about its sectors being erased.
2. The 2nd method runs as early as possible required IRAM. The sectors are
erased before that 2nd read is processed. The SDK allowing the SDK to start off
thinking the sectors were blank at boot.

I also added an example/test sketch for exercising the feature. It is
OTAEraseConfig.
@mcspr
Copy link
Collaborator

mcspr commented Dec 31, 2019

I have seen this more often when changing SDK

Just to clarify, that only affects OTA to or from SDK 3.x.x? Or some settings (rfcal or credentials format) are incompatible even between 2.2.x versions?

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 31, 2019

@mhightower83
In #6690, I wonder myself whether the first bytes in flash are used by FW or only used by and for ourselves.
https://github.com/d-a-v/Arduino/blob/nosizeconf/cores/esp8266/Esp.cpp#L298

In the latter case, we'd better rely on getFlashChipRealSize() = FW's spi_flash_get_id()
That would at least fix incoherence like what you've seen with esp8285.

@mcspr
Also between 2.x versions, but I don't have a clear view on that.
What I know is when everything fails, a full erase can fix the issue.

@mhightower83
Copy link
Contributor Author

mhightower83 commented Dec 31, 2019

@mcspr
Definitely the latter. Too often I would forget which 2.x FW version I had on a remote device and flash it via OTA, only to have it stop working. I think 191105 might be the worst it often took an RF CAL erase and a power cycle before it would respond remotely. (related to a unit with a weak signal)
Updated: Added clarification.

@mhightower83
Copy link
Contributor Author

The problem with getFlashChipRealSize() = FW's spi_flash_get_id() is, it defines an upper bound value that is not followed by SPIRead, SPIWrite, or SPIEraseSector. These ROM functions use the values provided in the flashchip structure to validate each call request. Looking at one of the ROM disassemblies out there, SPIRead compares flash offset + size of the read with flashchip.chip_size and fails the request if it exceeds flashchip.chip_size.

In effect, as the SDK starts up, by the time of the 2nd SPIRead call, the values in flashchip have been updated to reflect the information read at flash offset 0. This is the IDE configured value which may be smaller than getFlashChipRealSize().

That said, I think I noticed a download result, that implied the serial download tool was changing the IDE configured value to the real chip size on the fly.

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 1, 2020

According to your findings, flashchip structure is initialized by FW/SDK after the first call of SPIRead().
Do you think it reads first flash byte to initialize it, or does it get flash size from flash chip ?
Does spi_flash_get_id() gets its result from flashchip() structure ?

My concern is about the API calls: Esp.getFlashChipSize() vs Esp.getFlashChipRealSize():
What's the use of the first, and can we get rid of it ?

@mhightower83
Copy link
Contributor Author

It looks like the flashchip structure update is based on the 1st byte read. The updated structured appears to track the IDE configured value.

My take or interpretation so far is that Esp.getFlashChipSize() gives us:

  • the IDE configured size
  • Which the SDK adopts and uses as THE size for all operations by updating flashchip.
  • Since flashchip is referenced by the SPIRead, ... type API calls for an upper limit check. (at least the ones I have looked at.)
  • I hesitate to suggest, that maybe we could refer to this as a virtual chip size.

I see Esp.getFlashChipRealSize() as providing insight to the actual hardware chip size.

Does spi_flash_get_id() gets its result from flashchip() structure ?

Using the RTOS SDK for inspiration, if I am reading this right, it would appear the command issues a flash chip command to read an ID from the actual flash chip.
https://github.com/espressif/ESP8266_RTOS_SDK/blob/b02ad1477b8657e9224b768f73f9aa9ee5d950ff/components/spi_flash/src/spi_flash_raw.c#L46-L50

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 6, 2020

What do you think of reading the first byte of flash at boot and then, if it does not reflects what Esp.getFlashChipRealSize() gives, rewriting it ?
(edit: that, only if changing flashchip value in ram on-the-fly won't work)

We would have

  • no more "virtual" chip size
  • no more misconfiguration that would potentially poison the FS
  • and (possibly) FS configuration selected by sketch (like what's proposed in flash-size agnostic builds #6690)

@mhightower83
Copy link
Contributor Author

Currently, I see the following when building and downloading the example sketch with the generic option selected with Flash Size:"1MB(FS:64KB OTA:~470KB)"

Flash information reported after using esptool

Flash Size as reported by:
  flashchip->chip_size:     0x01000000, 16777216
  ESP.getFlashChipSize:     0x01000000, 16777216
  ESP.getFlashChipRealSize: 0x01000000, 16777216

Flash information reported after using OTA

Flash Size as reported by:
  flashchip->chip_size:     0x0100000, 1048576
  ESP.getFlashChipSize:     0x0100000, 1048576
  ESP.getFlashChipRealSize: 0x01000000, 16777216

So currently it appears the esptool today is doing what you are suggesting we do with OTA. It sounds like a good idea and I like it a lot; however, I also think it needs deeper thought. My initial concern is for legacy devices. On too many occasions, I have mismatched build and existing configured sizes.

I am concerned about the sealed-up devices that cannot be reflashed serially. Maybe there is a need to respect the original value of legacy devices. An old device may not expect the upgrade in size. The location of the EEPROM will move. We could copy it; however, if there are objects inside, pointing to places in flash, that could create an issue. And there are some more fancy EEPROM libraries out there in use that may wonder where their data went.

If we could somehow increase the size and keep the old pointer offsets until a sketch could handle the migration to a larger size. Migration, referring to moving of EEPROM and SPIFFs, etc. to their new home.

@mhightower83
Copy link
Contributor Author

In case anyone is trying this, there appears to be a problem with the linker not using the erase_config module. It appears in my effort to reduce the code down to what was needed, I took away the one direct reference to this module that caused it to be linked in. ☹️

There is also an issue of Soft WDT when doing a system_restart() with method 1. It appears I have to keep the call to __real_system_restart_local() instead.

I'll try and sprinkle in some printf's I think I can make those work now to a degree. Before, all the code reduction, I was using a hack together deferred print library to get a peek of what was going on during this garbled print period.

And code to ensure the UART speed stayed at 74880 for a more complete
viewing of messages. Well at least, it stays at 74880 for a longer time than it was.
Added pinMode filter to prevent UART TX pin from being disabled during user_init().
This is just a PoC an alternate for preinit() can be created later.
int8_t rssi_max;
int8_t rssi_min;
int8_t rssi;
} wifi_health_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't these members initialized by default?
Some of them need to be initialized at values other than 0 (e.g. the "min" values and the "rssi_max"), or else it is quite easy to make mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing!
C++ is still new to me. My experience lies with C. I didn't realize you could now initialize inside a structure under C++. I know classes and structures have strong similarities. I have never seen a structure initializing elements directly before. I searched and found some examples. It certainly makes complex initialization like this easier. Thanks for the pointer.

int32_t rssi = WiFi.RSSI();
if (rssi < 10) {
wifiHealth.rssi_max = max((int8_t)rssi, wifiHealth.rssi_max);
wifiHealth.rssi_min = min((int8_t)rssi, wifiHealth.rssi_min);
Copy link
Contributor

@TD-er TD-er Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here an example of things that may to wrong if the wifiHealth members aren't properly initialized.
In this example the struct is initialized, but that's easy to overlook when working with examples.

Copy link
Collaborator

@earlephilhower earlephilhower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little puzzled overall with this. It seems to go to a lot of trouble to overwrite a sector in flash with new data after an OTA, and provides 3 ways of doing it only configurable via core file change/rebuild.

Wouldn't it be faster/safer/easier to just have that as part of the eboot command/work? eboot already can erase/write the entire flash, so it feels like another bit in the command there and maybe an address passed in eboot_cmd[xx] might be enough to handle this, no?

@@ -325,6 +325,36 @@ FlashMode_t EspClass::getFlashChipMode(void)
return mode;
}

#ifdef ERASE_CONFIG_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the Arduino.h change, ERASE_CONFIG_H seems like it is always defined. Therefore, can we remove the #if/#else/#endif blocking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Back when I did this, I was asked to not add it to eboot because there was very little space left and other activities that needed the space. So, I tried to get the task done before the SDK started. There are three methods because I don't like any of them. Since it is a PoC I kept all the 3 ideas as methods to pick from. I was hoping for some indication of which one might be better.

RE: ERASE_CONFIG_H is more for development. I left it in since this is a PoC and it makes it clear where code is changing. Since the Arduino IDE does not have a flexible way of setting a global, it also makes it easy to remove the feature by editing Arduino.h.

@@ -62,6 +62,11 @@ void UpdaterClass::_reset() {
}

bool UpdaterClass::begin(size_t size, int command, int ledPin, uint8_t ledOn) {
#if defined(ERASE_CONFIG_H) && !defined(HOST_MOCK)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERASE_CONFIG_H always defined, drop the condition (leaving host_mock)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@@ -5,6 +5,7 @@
#include <flash_utils.h>
#include <MD5Builder.h>
#include <functional>
#include <erase_config.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arduino.h already includes this file always, now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@@ -65,6 +66,13 @@ class UpdaterClass {
*/
bool begin(size_t size, int command = U_FLASH, int ledPin = -1, uint8_t ledOn = LOW);

#ifdef ERASE_CONFIG_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same #if comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@@ -199,6 +207,9 @@ class UpdaterClass {
int _ledPin = -1;
uint8_t _ledOn;

#ifdef ERASE_CONFIG_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

#define ERASE_CFG__FIX_DIVIDER() do {} while(0)
#endif

#ifdef ERASE_CONFIG_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always true #if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay


*/

#define ERASE_CONFIG_METHOD 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in general seems like this is not the right way to handle it because ytou need to edit core files and rebuild on a case-by-case basis.

if I understand correctly, this is equivalent to 1, 2, or 3 described above. Right now it's always doing 3). If 3) is always good, why support the other 2 methods (i.e. not seeing value in the extra code to do the same thing 2 other ways).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained above

*/
if (0U == ebcmd->magic &&
0U == ebcmd->crc32 &&
ACTION_COPY_RAW == ebcmd->action &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the RTC and (presently) unused bits of the eboot_cmd the right spot for these options? We're not in eboot anymore. when this code runs. Also, we're moving to eboot-cmd in flash which would end up wiping the whole cmd anyway (flash erase once update done). I don't have the right answer yet...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot has changed in eboot since this PoC was submitted. I have been waiting for the dust to settle on the recent eboot activities. I agree eboot is the right place for this if you have the room to accommodate. These ugly options were the best alternative I could come up with.

ERASE_CFG__ETS_FLUSH(0);
}

#if 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constant ifdef

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not sure I get why you're overloading pinMode as I don't see it used anywhere in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug code to preserve UART operation during the SDK/Core startup so a few more debug messages could print out. Without it, the UART was killed and some messages garbled.

extern cont_t* g_pcont;
extern "C" void call_user_start();

extern "C" void IRAM_ATTR app_entry_redefinable(void) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the other methods probably need the "noextra4k" variants as well.

…anges.

Fixed missed edit (rom_uart_div_modify) for debug build.
@mhightower83
Copy link
Contributor Author

It sounds like the PoC has outlived its usefulness if it had any.
Shall I Close?

@devyte
Copy link
Collaborator

devyte commented Apr 4, 2021

@mhightower83 To be clear, the principle of what you accomplished here is absolutely necessary. What isn't clear is the approach.
About putting this in eboot, the idea is interesting, but it must be weighed against the last necessary ota enhancement (discussed in #905 and the resulting proposal in #6538).
At this point, I don't know what fits and what doesn't fit in what remains of the eboot sector. I think the tiny bit of logic needed in eboot to handle this would be small enough so that we can fit both, but that needs to be tried.

@davisonja
Copy link

I'm keen to finalise #6538 is it worth my working through this PR to see how they might mesh?

@devyte
Copy link
Collaborator

devyte commented Apr 6, 2021

@davisonja yes, I think you're the person best suited to assess whether #6538 and this eboot idea would work together. I think we would need:

  1. Ota commands in flash #6538 updated
  2. numbers for remaining space in the eboot sector with and without Ota commands in flash #6538
  3. a new PR that implements the eboot alternative to this PR

Points 1 and 2 should be easy enough. With those we will know how much space is left in the eboot sector.

I think that the eboot alternative to this PR would need:
A. absolute minimal new code that goes in eboot, something along the lines of what @earlephilhower said: maybe a new command that has an address/size to erase.
B. all of the address/size calculations are outside of eboot, and done after the OTA image has been received and written in the empty area, but before the reboot actually happens.

I'm not sure who should implement points A or B, I think that's up to you both 😛

@d-a-v d-a-v removed the alpha included in alpha release label Apr 10, 2021
@mhightower83 mhightower83 marked this pull request as draft February 17, 2023 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants