[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects #1987
Comments
Sounds like I have the same issue. Sometimes my ESPs lose their WiFi connection and stay off the network for hours. I thought they would reboot and reconnect to WiFi thanks to the watchdog, but they don't. A cold boot solves the issue. |
That last thing described by @wolverinevn is something I have seen happening here too. |
As we discussed in #1957, I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I had seen all kinds of strange behaviour before changing my AP to increased timings... |
Hope you will find the solution. One of my NodeMCUs hung and has been gone from the router for 5 hours now, for no apparent reason. I hoped the watchdog would recover it, but nothing happened. I have to reboot it manually now. Very annoying! |
@wolverinevn when you have access to the node again, go to Tools => Advanced and set the "Connection Failure Threshold" to something other than 0 (I suggest something between 50 and 100, depending on the number of tasks you have). This does not actually fix the problem, but it significantly increases the chances that the node reboots and reconnects! The other workaround would be to tweak some parameters in your access point, depending on what type of AP you have and whether it can be tweaked... |
@clumsy-stefan Should we set the default in ESPEasy to this level too? And maybe we should also display this value on the sysinfo page and make it available for rules? |
@TD-er setting it to some level by default is probably not a bad thing, as long as it's not too low (a connection can always fail). When debugging the issue I thought about how this is done: currently every unsuccessful connection increases the counter and every successful connection decreases it. I wondered whether it would be more logical to reset it to 0 as soon as a successful connection happens, but I guess it's a bit of an ideological question which makes more sense.
The issue with that number is: if you have 10 tasks, each with a retry count of 10 and a resend delay of 100 ms, reboots happen quite quickly when there is a real comms problem (100 retries within about 10 sec). Now if, for example, you always have 5 comms failing and 1 successful, the connection failure count increases continuously. If this happens all the time, the node will reboot sooner or later even though all data could be delivered.
The main issue I'm seeing though is that somehow the node does not realize that the connection on layer 2 is actually gone and continues to send data (I guess).
Besides this, something I realized tonight: what happens to syslog (and other comms like NTP etc.) if there is no WiFi connection? Is this also stopped? This could explain why my nodes suddenly jump to 100% CPU when layer 2 is gone. Probably no more task data is sent, but it tries to get rid of the UDP syslog packets and can't... just a guess though...
Sorry, long text for two simple questions... in short: |
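The two counter policies discussed in the comment above (decrement on success versus reset to 0 on success) can be compared with a small simulation. This is only a sketch: the type and function names are hypothetical and not taken from the ESPEasy source, and it assumes that a threshold of 0 means "never reboot".

```cpp
// Hypothetical model of the "Connection Failure Threshold" counter.
// None of these names come from the ESPEasy source.
enum class CounterPolicy { DecrementOnSuccess, ResetOnSuccess };

struct ConnectionFailureCounter {
    CounterPolicy policy;
    unsigned threshold;    // assumption: 0 disables the reboot
    unsigned failures;

    ConnectionFailureCounter(CounterPolicy p, unsigned t)
        : policy(p), threshold(t), failures(0) {}

    // Record one connection attempt.
    // Returns true once the failure count has reached the threshold.
    bool record(bool success) {
        if (!success) {
            ++failures;
        } else if (policy == CounterPolicy::ResetOnSuccess) {
            failures = 0;
        } else if (failures > 0) {
            --failures;
        }
        return threshold != 0 && failures >= threshold;
    }
};

// Replays a repeating pattern of `failsPerCycle` failures followed by one
// success. Returns the attempt number at which a reboot would trigger,
// or 0 if it never triggers within `maxAttempts`.
unsigned attemptsUntilReboot(CounterPolicy policy, unsigned threshold,
                             unsigned failsPerCycle, unsigned maxAttempts) {
    ConnectionFailureCounter counter(policy, threshold);
    for (unsigned attempt = 1; attempt <= maxAttempts; ++attempt) {
        const bool success = (attempt % (failsPerCycle + 1)) == 0;
        if (counter.record(success)) {
            return attempt;
        }
    }
    return 0;
}
```

With the "5 failing, 1 successful" pattern and a threshold of 50, the decrement policy gains a net 4 failures per 6 attempts, so the reboot triggers at attempt 74. Under the reset policy the counter never rises above 5, so the same pattern never triggers a reboot.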
@clumsy-stefan I've already set it to 50. Let's wait. ;) |
@wolverinevn hmm... 50 should be reached quite quickly, depending on the number of tasks, the interval and the retries you have (5-15 min)... If this does not help, I think the node is not actually frozen, it just can't reconnect to the network. I had this too, even after a WD reboot. Can you see whether the node tries to go into AP mode? Do you see the AP-WLAN of the node? |
I meant to be inspected in rules using a system variable like |
Ah, yes, agreed, that would make sense! That's also a bit related to issue #1993. Having a plugin that regularly sends a number of system/performance variables to the controller (without wasting the limited available tasks) would be really great! |
I have 9 tasks, 3 of them are Dummy and MQTT_import. I think the rules are a little busy with computing and reading sensors; I tried to limit mqtt_publish by calling it from rules every few minutes. Load is around 29%. Uptime is 7 hrs and 20 mins, RSSI is -71 dBm, and there are a few WiFi networks around me. |
@wolverinevn the problem with this issue is that it happens completely at random. I have ~30 nodes running; some of them hit the issue, some did not, some rebooted, some went to AP mode... It really seems to be a combination of how busy the node is, how busy the air is (e.g. the number of WiFi devices) and how your AP actually handles certain conditions (missing layer 2 ACKs etc.)... So I guess until we find a way within the application (ESPEasy) to reliably detect this condition and act on it, there is no "real" solution... |
@wolverinevn PS: you're not using MikroTik APs by chance? |
@wolverinevn About the number of reconnects (in your edit)
|
No. I'm using a router running Padavan firmware (kind of ASUS). @TD-er I knew it. I'm investigating the reason; it may be noise from a buck module nearby. Another one has 0 reconnects after 2 hours. |
Unfortunately I don't know this FW at all... Any chance to tweak layer 2 parameters? Something like frame ACK timeouts or similar? Some kind of "distance" setting? |
@clumsy-stefan Unfortunately, I don’t see anything like that. |
@clumsy-stefan The unit rebooted 2 times last night with the failure threshold set to 50. The good news is that it no longer freezes. Today I will try to improve the WiFi connection with some minor changes in the hardware setup. |
@Domosapiens & @wolverinevn one more thing you can try is increasing the group-key-timeout on your AP (if you have such an option). Normally that's around 5 min. You can try increasing it to 30 min or even 1 h and see if it improves (as long as it's not a super-high-security network, which I assume it isn't if you have IoT devices in it)... |
I also currently have units that have run for over 3 days now and others that rebooted within a day... I did see some issues with the rekeying of the group key. It somehow seems that in newer versions of the core the rekeying can run into a timeout... However, the application should act on this and not go into some high-load, unresponsive mode... but I'm not sure where it's failing... |
What do you think about the next scenario then. Edit: |
In other words, the fallback remains active only until the first successful connection to the AP, then it is removed. In this case it would be helpful to see the WiFi mode in the syspage. Another question: in the current implementation, in which scenario will the unit become an AP? |
Nope, the AP mode remains active while still trying to connect to the given APs. And we should keep the focus on the "easy" part of the project. I agree it should be made clearer which connection setting is actively used. |
Understood. Please also consider this scenario with Force B/G set: power failure. I don't want to insist, but these HW and freeze issues have really been going on for a year. Now that you, with the help of the community, have found one working solution, I strongly advise making sure it remains applied. The risk of an inexperienced user setting "Force B/G mode" is smaller than that of him setting the wrong SSID, with the same result: no access to the access point. SUGGESTION: In this case less experienced users will know what they are doing and more experienced users will be sure that when Force B/G mode is set it cannot fall back to N. What do you think? You could |
Not only if it boots, but the fallback option will then be disabled in the settings and saved. |
OK. Then a manual check is even more appropriate than an automatic check. Don't you think? |
Yep, I will first add a simple checkmark to disable overrides. |
ok |
Version 2019_02_16 had been running for 9 days without a reboot (forced to B/G). Yesterday it started to reboot again. The hardware watchdog forced this two times. I have absolutely no idea what happened. |
@kischde What kind of access point are you using? Note that the node was not rebooted to achieve this. One thing that is set "incorrectly" on that AP, compared to another MikroTik I have, is that it has the "Distance" setting set to "indoors". |
I am using a Swisscom AP (Swiss telecom provider AP), but I had all the same issues described here in the various places by the guys with the MikroTik. So maybe it has the same chipset, but I can't change much in the settings, for example those extended settings. |
Did you recently change WiFi settings? I will also try to see in the NonOS docs how these can be effectively cleared when we change WiFi settings. |
No, as I have only one AP, this was IMO not necessary |
OK. Good to know. |
No, even less |
@TD-er did you see that there is an option in the new esp8266 core called also support for IP fragmentation is set with this: |
Hmm, not sure what the defaults are. |
I don't know... I just know that in the Arduino IDE, in the platform definition file, you can select whether you want it included and defined or not... |
I always thought the LWIP parts were included as pre-compiled libraries in the core distribution. |
yes, but it depends which one you link against (from boards.txt):
|
So they're using these flags:
So all v2 versions have |
yes, but if you look at the |
Summary of the problem/feature request
When there is a lot of traffic on the WiFi network or the node is too busy, it seems that some send/ACK frames on layer 2 get lost and are not resent by the ESP, or not in time. The access point therefore drops the connection on layer 2.
The ESP does not seem to handle this situation correctly and still tries to send data to the controller/server. This increases the load on the node to 100%, and a renegotiation of the WiFi handshake fails (possibly because the WiFi core does not get enough time to do the handshake).
After some time (1-2 min) the ESP runs into an exception (mostly 3 or 29) and reboots. Depending on the state of the WiFi and the AP, the connection to the AP is never re-established.
See also discussion here with detailed information about the issue and possible workaround
Expected behavior
The ESP should check for that condition and reinitiate a handshake/connection to the AP before continuing to send data to the controller.
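A minimal sketch of that expected behavior, under stated assumptions: the three hooks below are hypothetical placeholders and do not correspond to functions in ESPEasy or the esp8266 core. How to reliably detect a dropped layer 2 connection is exactly the open question in this issue, so `linkIsUp` only marks where such a check would go.

```cpp
#include <functional>

// Hypothetical hooks; none of these names come from ESPEasy or the esp8266 core.
struct LinkHooks {
    std::function<bool()> linkIsUp;   // placeholder for a layer 2 health check
    std::function<bool()> reconnect;  // placeholder for a fresh handshake with the AP
    std::function<bool()> sendData;   // placeholder for sending to the controller
};

enum class SendResult { Sent, SendFailed, SkippedNoLink };

// Check the link first and only send once the link is usable (again).
SendResult guardedSend(const LinkHooks& hooks) {
    if (!hooks.linkIsUp() && !hooks.reconnect()) {
        // Do not keep pushing data into a connection the AP has already dropped.
        return SendResult::SkippedNoLink;
    }
    return hooks.sendData() ? SendResult::Sent : SendResult::SendFailed;
}
```

The point of the early return is that a failed reconnect costs no send attempt, so a dead link would not add retry load on the node while the handshake is still pending.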
Actual behavior
The ESP keeps sending data to the controller until it raises an exception.
Steps to reproduce
The problem persists after a power cycle as well as after normal reboots.
The current workaround is increasing the time for frame ACKs to a higher value (e.g. on MikroTiks, set the "distance" value of the interface to 50 (km)).
System configuration
Hardware: wemos D1 mini, Sonoff Basic, Sonoff Pow, Wemos Pro, others
ESP Easy version: SELF COMPILED!! Latest GIT version! esp8266 core 2.4.2 LWIP 2.0.1 low memory
Rules or log data
All debug logs and other information documented in #1957
See also PR #1979 for additional debug feature and basic check of sending data.