WiFi cannot connect until a power cycle. #5527
Comments
I'm glad I'm not the only one who has seen this with 2.4.2. I have 2 devices installed a long way away, and after anywhere from 1-10 days, like you mention, they will not reconnect to the WiFi AP. In this application they go through a cellular modem, and they seem to lose communication fairly regularly (multiple times a day) even though the signal strength has been pretty good. I was thinking that maybe it had something to do with the lwIP changes. I haven't been able to diagnose them because of their location. In the meantime I rolled back to 2.4.0 (I've had serious problems with 2.4.1) and the reliability seems to have improved. I've also implemented some logic that restarts the MCU if it has not been able to reconnect after a certain period of time (e.g. 5 minutes). I was hoping to see if things improve with 2.5.0 when it comes out of beta. |
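The restart-after-timeout logic described above could look roughly like the following. This is only a sketch of the timing decision in plain C++, not the commenter's actual code; `ReconnectWatchdog` is a hypothetical helper and is not part of the ESP8266 core.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the ESP8266 core): tracks how long the
// station has been offline and reports when a restart is due.
class ReconnectWatchdog {
public:
    explicit ReconnectWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call periodically with the current millis() value and the link state.
    // Returns true once the link has been down for at least timeoutMs.
    bool update(uint32_t nowMs, bool connected) {
        if (connected) {
            offline_ = false;
            return false;
        }
        if (!offline_) {
            // First call after the link dropped: remember when it happened.
            offline_ = true;
            offlineSinceMs_ = nowMs;
        }
        // Unsigned subtraction stays correct across millis() wraparound.
        return static_cast<uint32_t>(nowMs - offlineSinceMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t offlineSinceMs_ = 0;
    bool offline_ = false;
};
```

On the device this would presumably be driven from loop(), along the lines of: if (watchdog.update(millis(), WiFi.status() == WL_CONNECTED)) ESP.restart();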
We need to reproduce to have a chance to fix. Fixes will be made to 2.5.0-beta2, not to 2.4.x. I've been running this sample for several hours with core-2.5.0-beta2 with no issue so far.
An update on the progress of tracking down this bug in the new version: Prepared a new version using 2.5.0-beta2 and lwIP v2. We seem to have some other issues from the switch from 2.4.2 to 2.5.0-beta2: the heap is much lower, and it seems we picked up a memory leak in the switch, as the heap is in steady decline, causing a reboot roughly every 2 hours. 4 devices did not make it through the night. All of them gave up ~50-60 min after the last boot (after which they end up in a state where they cannot establish a connection). Restarting 2 of the 4 devices made them able to reconnect on the first try. I am leaving the other 2 to keep trying to connect for now, to see if they recover. The other 6 are still running and reconnect successfully every 10 seconds (but are also rebooting a lot). The next step will be to find and at least reduce the memory leak so that we can conduct a more meaningful test. Will increase the reconnect timer to 15 seconds to give the chip more time to reconnect. Will also model the disconnect / reconnect after the example d-a-v posted (using force sleep / wake), but will also wait 1.2 seconds between disconnect and reconnect. Since all but one of these devices are not connected via serial, will also save the last "failed wifi" status to SPIFFS for more information. I'll post my findings here soon... OK, will have 4 running with the new version, which is a bit more stable and uses the changes mentioned above. Will update if the issue happens on any of the 4 devices. I will also run 6 devices on version 2.4.0 with the changes above, reconnecting every 15 seconds, to see if the issue occurs on this version as well. Update: The issue also happened once on the devices running 2.4.0 and lwIP 1.4. |
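The timing plan in the comment above (pause 1.2 seconds after the disconnect, then allow 15 seconds for the connection attempt) can be expressed as a small non-blocking state machine. The sketch below is plain C++ and only models the timing; `ReconnectCycle`, `Phase` and `Action` are hypothetical names, and the caller is assumed to map `Action::Disconnect` and `Action::Begin` onto the real WiFi calls.

```cpp
#include <cstdint>

enum class Phase { Connected, WaitBeforeBegin, Connecting };
enum class Action { None, Disconnect, Begin };

// Hypothetical timing model of a disconnect / pause / begin cycle.
class ReconnectCycle {
public:
    ReconnectCycle(uint32_t pauseMs, uint32_t attemptMs)
        : pauseMs_(pauseMs), attemptMs_(attemptMs) {}

    // Start a cycle; the caller should now disconnect the station.
    Action start(uint32_t nowMs) {
        phase_ = Phase::WaitBeforeBegin;
        sinceMs_ = nowMs;
        return Action::Disconnect;
    }

    // Call regularly; returns what the caller should do next, if anything.
    Action tick(uint32_t nowMs, bool connected) {
        const uint32_t elapsed = nowMs - sinceMs_;  // wraparound-safe
        switch (phase_) {
        case Phase::WaitBeforeBegin:
            if (elapsed >= pauseMs_) {
                phase_ = Phase::Connecting;
                sinceMs_ = nowMs;
                return Action::Begin;
            }
            return Action::None;
        case Phase::Connecting:
            if (connected) {
                phase_ = Phase::Connected;
                return Action::None;
            }
            if (elapsed >= attemptMs_) {
                // Attempt timed out: count it and start over.
                ++failedAttempts_;
                phase_ = Phase::WaitBeforeBegin;
                sinceMs_ = nowMs;
                return Action::Disconnect;
            }
            return Action::None;
        case Phase::Connected:
            return Action::None;
        }
        return Action::None;
    }

    Phase phase() const { return phase_; }
    uint32_t failedAttempts() const { return failedAttempts_; }

private:
    uint32_t pauseMs_;
    uint32_t attemptMs_;
    uint32_t sinceMs_ = 0;
    uint32_t failedAttempts_ = 0;
    Phase phase_ = Phase::Connected;
};
```

With ReconnectCycle cycle(1200, 15000), the failed-attempt counter is also a natural value to persist for later diagnosis on devices without a serial connection.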
At ESPeasy we're getting similar reports on this for many months now. |
The sketch I posted above had been working for two full weeks (I was away and had let it run). In order to help debugging, we need to be able to understand where the cause is. |
@d-a-v It seems there are several issues at hand which overlap in observed symptoms. This one ESPeasy - DeepSleep Problem to send Data has some nice Wireshark dumps/screenshots. But like I said, it seems to be more than one issue, or at least there is not one single remedy (apart from reboot) to solve it. Some reported fixes (e.g. reboot AP or disconnect ESP from within AP management tools) do not seem to work for others. One of our users is now testing if this suggested fix will work (in his setup): #2186 (comment) Also this closed issue (WiFi Reconnect only after power cycle #2235) does seem to be related to the one we're discussing right now. |
Please do your tests and reports using the latest "git" version of this core. |
You mean the 'stage' branch? |
We have only one branch. So yes. |
The error has now occurred again since we implemented a bit more exception logging. This is on 2.4.2 and lwIP 1.4. |
Another short update: We have this running on ~50 devices, with another ~400 coming online soon. Here are the logs: LastWiFiDisconnectReason=8,2,8,201,2,201,2,8,2,8,;TotalFoundWiFis=7;RSSI=66%(-67) The 8 (ASSOC_LEAVE) is mostly triggered when we try to reconnect (disconnect is called). |
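For reading logs like the one above, a small decoder for the LastWiFiDisconnectReason list can help. The sketch below is plain C++; `parseReasons` and `reasonName` are hypothetical helpers. Only 8 (ASSOC_LEAVE) is named in the thread; the names for 2 and 201 are given as I recall them from the SDK's disconnect-reason enum and should be checked against the core version in use.

```cpp
#include <string>
#include <vector>

// Maps the codes seen in the log to short names. Only the three codes
// from the log are covered; everything else is reported as "OTHER".
const char* reasonName(int code) {
    switch (code) {
    case 2:   return "AUTH_EXPIRE";   // assumption: SDK enum name
    case 8:   return "ASSOC_LEAVE";   // named in the thread
    case 201: return "NO_AP_FOUND";   // assumption: SDK enum name
    default:  return "OTHER";
    }
}

// Parses a comma-separated list such as "8,2,8,201," into integers.
// Empty fields (e.g. after the trailing comma) are skipped.
std::vector<int> parseReasons(const std::string& csv) {
    std::vector<int> out;
    int value = 0;
    bool haveDigits = false;
    for (char c : csv) {
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            haveDigits = true;
        } else {
            if (haveDigits) out.push_back(value);
            value = 0;
            haveDigits = false;
        }
    }
    if (haveDigits) out.push_back(value);
    return out;
}
```

Applied to the list in the log, this yields ten entries that alternate between ASSOC_LEAVE, AUTH_EXPIRE and NO_AP_FOUND.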
Short update on what we are currently examining....
We have been running this test for the last 10 days on 40+ Devices and the issue has never happened since. This on 1.4.2 and LwIp 1.4. |
@Swedish-Coder I'm looking at those classes.
That should be no issue, if the classes it inherits from do have a virtual destructor defined. All these classes have static variables defined, but maybe these are only reset in the constructor? |
Hi all, I would like to reopen or reconfirm the issue and can hopefully give relevant information! I have seen that the reconnect was reported internally really fast, but later on it was not working. So when I do two cycles in the startup, I get a workaround for this reconnection issue, also for my bad ESPs. So in setup I do the following: This slows down the startup cycle but solved the reset issue for me. It seems there is something stored in flash from older ESP versions which I can't reset, and which still seems to influence WiFi startup even when not explicitly stated. |
There is one thing I noticed when I enabled WIFI debugging, maybe it is of interest. In my case I was waking up from deep sleep and I had auto-connect on (the default). When I was looking at the log, I saw the order of events was:
Now I noticed there was no second "dhcp client start..." in the log. I guessed that maybe this was the reason that the WiFi connection didn't work sometimes. I now disable the auto-connect in
if (WiFi.SSID() != "") {
println(F("WiFi credentials set in flash, wiping them"));
WiFi.disconnect();
}
if (WiFi.getAutoConnect()) {
println(F("Disabling auto-connect"));
WiFi.setAutoConnect(false);
}
WiFi.persistent(false);
WiFi.mode(WIFI_STA);
WiFi.begin(...);
And suddenly the connection works every time. |
this is a lighter workaround for esp8266#5527 and may require better understanding of the issue
I have not tested that function directly, but I do get the |
Just to give some debugging context here:
And at the time where the connect if successful:
The last line is also triggered when the WiFi.status() reports And some debug output when there was a connection and I kicked it from the AP:
There are phy updates in #6257, can you try with it ? |
In PlatformIO I must then use this define? |
Yes, but I didn't try it myself |
Since I'm doing this every now and then, I have a feature request.... |
What kind of label ?
which gives:
(this script works with any repository on github) |
I did those steps in the terminal by hand and the result is that PlatformIO is detecting it is another core lib and thus renames the folder and start fetching the one it thinks it needs. So it renames No idea where it gets this Edit: It is in package.json. Edit2: |
Maybe @ivankravets could help with the platformio process ? |
OK, now it has been built with the ..SDK22y flag So far I have not been able to get it to crash after a disconnect. Edit:
But good news is it is a lot less often |
Just an observation here. N.B. automatic reconnect is enabled. Edit: |
Something fishy is going on here and not just regarding this new SDK version. Problem is that one build seems to be able to connect to the AP just fine and a next one is just not able to connect to the AP. Only once in 10's or maybe 100 attempts (and thus crash/reboots) Yesterday I saw it when changing delay values to get a feeling for their impact, but now I saw it again.
if (loglevelActiveFor(LOG_LEVEL_INFO)) {
String log;
log.reserve(60);
log = F("WD : Uptime ");
log += wdcounter / 2;
log += F(" ConnectFailures ");
log += connectionFailures;
log += F(" FreeMem ");
log += FreeMem();
log += F(" WiFiStatus ");
log += wifiStatus;
addLog(LOG_LEVEL_INFO, log);
}
And then replacing With just this single change the resulting executable was not able to connect at all. This is also behavior we're seeing for a long time in ESPeasy builds. This does "smell" like some buffer allocation or initialization which is not right. Is there some compiler flag possible to force any allocated array to be initialized with 0 (or some predefined value) ? Edit: |
Do we have any problems with PlatformIO? |
@ivankravets Well I am not 100% sure. :) About whether or not we have a problem with PIO... |
It is the default. A special attribute is needed to not initialize globals or allocated variables (to 0).
In doubt, can you run your tests with |
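On the initialization question, as far as standard C++ goes (independent of any ESP8266-specific attribute or linker section), objects with static storage duration are zero-initialized, while arrays can be zeroed explicitly with value-initialization. The sketch below only illustrates those language rules; `g_buffer`, `allZero` and `makeZeroed` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Static storage duration: zero-initialized before main() runs.
static uint8_t g_buffer[16];

// Returns true if all n bytes starting at p are zero.
bool allZero(const uint8_t* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] != 0) return false;
    }
    return true;
}

// The trailing "()" value-initializes the array, so every byte is zero.
// Without it (new uint8_t[n]) the contents would be indeterminate.
std::unique_ptr<uint8_t[]> makeZeroed(std::size_t n) {
    return std::unique_ptr<uint8_t[]>(new uint8_t[n]());
}

// A local array is only zeroed when an initializer such as "= {}" is given.
bool localArrayIsZeroed() {
    uint8_t local[8] = {};
    return allZero(local, sizeof(local));
}
```

Locals without an initializer and buffers obtained from malloc are the cases the language does not zero, so those are the places to look when a build-to-build difference smells like uninitialized memory.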
If I remember correctly, we now use the same build workflow as Arduino IDE, even the same linker scripts. You can take a look at https://github.com/esp8266/Arduino/blob/master/tools/platformio-build.py Or you mean when a library has |
As far as I know, we don't use those in this project, so that should not be an issue here. @d-a-v That one was already present in my code, wrapped in N.B. for my tests I went back to the core 2.5.2 SDK 2.2.1 It does seem a bit faster to me in rendering web pages, but maybe that's too early to tell right now. (SDK 2.2.1 vs. your PR version) |
The ESP node appears to reboot at the reconnect sequence when the wifi connection was lost. See also esp8266/Arduino#5527 (comment)
@TD-er @Swedish-Coder Is this issue still relevant in git master ? |
I'm not entirely sure if it is still relevant. I have now 2 nodes running with the patch you mentioned (previous version of it).
So it seems it was very well capable of reconnecting (some of these reconnects were me kicking the node from the AP, only during the first hour after boot) |
The pseudo modes are merged. |
Basic Infos
Platform
-Hardware: ESP8266
-Core Version: 2.4.2
-Development Env: Visual Micro
-Operating System: Windows
Settings in IDE
-Module: Generic ESP8266 Module
-Flash Mode: qio
-Flash Size: 2MB (256k Spiffs)
-lwip Variant: v1.4
-Reset Method: ck
-Flash Frequency: 40Mhz
-CPU Frequency: 80Mhz
-Upload Using: Serial
-Upload Speed: 921600
Problem Description
Using the reconnect from below works fine 99% of the time. I am running this test on ~50 ESP8266 devices. Reconnect is called every 30 seconds.
After some undefined time (ranging from 1 hour to 10 days) the WiFi will not be able to reconnect.
The status stays like this until the device is reset (hardware reboot). I have tried letting the devices stay in this mode for days, but they never recover. Rebooting the AP does not help.
The signal strength is also not the issue (I ran a test with the AP about 2 m in front of ~42 ESP8266 devices; they all reported about -65 dBm).
Have tried different disconnect / reconnects, all resulting in the same stale state. Currently trying
WiFi.disconnect(true);
WiFi.forceSleepBegin(35000);
//…do other tasks for 25ms
WiFi.forceSleepWake();
WiFi.persistent(false);
WiFi.begin(ssid.c_str(), psk.c_str());
This sometimes causes crashes on the WiFi.begin(..) line.
Is there a way to “properly” do a reconnect on WiFi, or a way to reset the WiFi to “hardware boot” status (delete all DHCP information, known AP, etc.)?
MCVE Sketch