[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects #1987

Closed
clumsy-stefan opened this issue Oct 31, 2018 · 195 comments · Fixed by #2559
Labels
Category: Controller (Related to interaction with other platforms)
Category: Stabiliy (Things that work, but not as long as desired)
Category: Wifi (Related to the network connectivity)
Type: Bug (Considered a bug)

Comments

@clumsy-stefan
Contributor

clumsy-stefan commented Oct 31, 2018

Summary of the problem/feature request

When there is a lot of traffic on the WiFi network, or the node is too busy, it seems that some send/ack frames on layer 2 get lost and are not resent by the ESP, or not resent in time. The access point therefore drops the connection on layer 2.

The ESP does not seem to handle this situation correctly and still tries to send data to the controller/server. This increases the load on the node to 100% and a renegotiation of the WiFi handshake fails (possibly due to not enough time in the WiFi core to do the handshake).

After some time (1-2 min) the ESP runs into an exception (mostly 3 or 29) and reboots. Depending on the state of the WiFi and AP, the connection to the AP is sometimes never re-established.

See also discussion here with detailed information about the issue and possible workaround

Expected behavior

The ESP should check for that condition and reinitiate a handshake/connection to the AP before continuing to send data to the controller.
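A minimal sketch of the expected behavior, written as a plain Python simulation rather than ESPEasy code. The `Link` class, its `associated` flag and `send_to_controller` are hypothetical names used only for illustration; they are not part of ESPEasy or the ESP8266 SDK.

```python
class Link:
    """Hypothetical stand-in for the WiFi layer (not an ESPEasy or SDK API)."""

    def __init__(self):
        self.associated = True   # True while the layer 2 connection is up
        self.reconnects = 0

    def reconnect(self):
        # Assumption: the handshake with the access point succeeds.
        self.reconnects += 1
        self.associated = True
        return True


def send_to_controller(link, payload, sent):
    """Send only when layer 2 is up; otherwise re-initiate the connection first."""
    if not link.associated:
        if not link.reconnect():
            return False  # give up on this sample instead of busy-looping
    sent.append(payload)
    return True


link = Link()
sent = []
send_to_controller(link, "temp=21.5", sent)
link.associated = False              # the access point dropped the station
send_to_controller(link, "temp=21.6", sent)
print(link.reconnects, sent)         # 1 ['temp=21.5', 'temp=21.6']
```

The point of the sketch is only the ordering: the link state is checked, and repaired if needed, before any data is handed to the controller.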

Actual behavior

The ESP sends data to the controller until it raises an exception

Steps to reproduce

  1. Reduce the time to wait for a frame ack on the router (e.g. on a MikroTik, set the distance to "indoors" or below 5 km).
  2. Make a lot of ESPs (~20) send regular data to a controller.
  3. Wait for it to crash.

Problem persists after powercycle as well as normal reboots.

The current workaround is to increase the frame ack timeout (e.g. on MikroTik routers, set the "distance" value of the interface to 50 km).

System configuration

Hardware: wemos D1 mini, Sonoff Basic, Sonoff Pow, Wemos Pro, others

ESP Easy version: SELF COMPILED!! Latest GIT version! esp8266 core 2.4.2 LWIP 2.0.1 low memory

Rules or log data

All debug logs and other information documented in #1957
See also PR #1979 for additional debug feature and basic check of sending data.

@clumsy-stefan changed the title from "[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnect" to "[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects" on Oct 31, 2018
@TD-er added the labels "Type: Bug", "Category: Stabiliy" and "Category: Wifi" on Oct 31, 2018
@TD-er added the label "Category: Controller" on Oct 31, 2018
@wolverinevn

Sounds like I have the same issue. Sometimes my ESP loses its WiFi connection and stays “lost” for hours. I thought it would reboot and reconnect to WiFi thanks to the watchdog, but it doesn't. A cold boot solves the issue.

@TD-er
Member

TD-er commented Nov 1, 2018

That last thing described by @wolverinevn is something I have seen happening here too.

@clumsy-stefan
Contributor Author

as we discussed in #1957 I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I've seen all kind of strange behaviour before changing my AP to some increased timings...

@wolverinevn

wolverinevn commented Nov 1, 2018

> as we discussed in #1957 I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I've seen all kind of strange behaviour before changing my AP to some increased timings...

Hope you will find the solution. One of my NodeMCUs hung and has been gone from the router for 5 hours now, for no apparent reason. I hoped the watchdog would recover it, but nothing happens. I have to reboot it manually now. Very annoying!

@clumsy-stefan
Contributor Author

@wolverinevn when you have access to the node again, go to Tools => Advanced and set the "Connection Failure Threshold" to something other than 0 (I suggest something between 50 and 100, depending on the number of tasks you have). This does not actually fix the problem, but it significantly increases the chances that the node reboots and reconnects!

The other workaround would be to tweak some parameters in your access point, if your type of AP allows that...

@TD-er
Member

TD-er commented Nov 2, 2018

@clumsy-stefan Should we set the default in ESPeasy to this level too?

And maybe we should also display this value in the sysinfo page and make it available for rules?

@clumsy-stefan
Contributor Author

@TD-er setting it to some level by default is probably not a bad thing if it's not too low (it can always happen that a connection fails).

While debugging the issue I thought about how this is done: currently every unsuccessful connection increases the counter and every successful connection decreases it. I wondered whether it would be more logical to reset it to 0 as soon as a successful connection happens, but I guess it's a bit of an ideological question which makes more sense.

The issue with that number is, if you have 10 tasks, each of them with a retry count of 10 and a resend delay of 100 ms, the reboots happen quite quickly if there is a real comms problem (100 retries within about 10 sec.).

Now if, for example, you always have 5 comms failing and 1 successful, you'll be continuously increasing the connection failures. If this happens all the time you will reboot the node sooner or later, even though all data could be delivered.
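The two counter policies discussed above can be compared with a small simulation. This is only a sketch: the function name, the floor at 0 for the decrementing counter, and the "reboot when the counter reaches the threshold" rule are assumptions for illustration, not taken from the ESPEasy source.

```python
def attempts_until_reboot(pattern, threshold, reset_on_success, max_attempts=10_000):
    """Return the attempt number at which the failure counter reaches the
    threshold, or None if that never happens within max_attempts.

    pattern is a repeating sequence of booleans (True = successful connection).
    """
    failures = 0
    for attempt in range(1, max_attempts + 1):
        ok = pattern[(attempt - 1) % len(pattern)]
        if ok:
            # Policy A: decrement by one (assumed not to go below 0).
            # Policy B: reset to zero on any success.
            failures = 0 if reset_on_success else max(failures - 1, 0)
        else:
            failures += 1
        if failures >= threshold:
            return attempt
    return None


# 5 failing comms followed by 1 successful one, repeated forever.
pattern = [False] * 5 + [True]
decrement = attempts_until_reboot(pattern, 100, reset_on_success=False)
reset = attempts_until_reboot(pattern, 100, reset_on_success=True)
print(decrement, reset)  # 148 None

# Worst case from the text: 10 tasks, 10 retries each, 100 ms resend delay.
tasks, retries, delay_ms = 10, 10, 100
total_retries = tasks * retries
print(total_retries, total_retries * delay_ms / 1000)  # 100 retries, about 10 seconds
```

With the decrementing counter every cycle of 5 failures and 1 success adds a net 4, so a threshold of 100 is reached after 148 attempts. With reset-on-success the counter never exceeds 5 for this pattern, so the node would never reboot.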

The main issue I'm seeing, though, is that the node somehow does not realize that the layer 2 connection is actually gone and continues to send data (I guess). Besides this, something I realized tonight: what happens to syslog (and other comms like NTP etc.) if there is no WiFi connection? Is this also stopped? This could explain why my nodes suddenly jump to 100% CPU when layer 2 is gone. Probably no more task data is sent, but the node tries to get rid of the UDP syslog packets and can't... Just a guess though...

Sorry, long text for two simple questions... In short:
default level: yes, I'd set it to the max (100) or so by default... If everything is OK it does no harm; if not, the unit becomes accessible again...
sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

@wolverinevn

@clumsy-stefan I've already set it to 50. Lets wait. ;)

@clumsy-stefan
Contributor Author

clumsy-stefan commented Nov 2, 2018

@wolverinevn hmm... 50 should happen quite quickly depending on the number of tasks interval and retries you have (5-15min.)... if this does not help I think the node is actually not frozen, but it just can't reconnect to the network. I had this also even after a WD reboot. can you see if the node tries going to AP mode? do you see the AP-WLAN of the node?

@TD-er
Member

TD-er commented Nov 2, 2018

> sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects.
After all, it is a performance statistics value

@clumsy-stefan
Contributor Author

> I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects. After all, it is a performance statistics value

ah, yes, agree, that would make sense! that's also a bit related to the issue #1993. Having a plugin that sends a number of system/performance variables regularly to the controller (without wasting the limited available tasks) would be really great!

@wolverinevn

wolverinevn commented Nov 2, 2018

> @wolverinevn hmm... 50 should happen quite quickly depending on the number of tasks interval and retries you have (5-15min.)... if this does not help I think the node is actually not frozen, but it just can't reconnect to the network. I had this also even after a WD reboot. can you see if the node tries going to AP mode? do you see the AP-WLAN of the node?

I have 9 tasks, 3 of them are Dummy and MQTT_import. I think the rules are a little bit busy with computing and reading sensors; I tried to limit mqtt_publish by calling it from rules every few minutes. Load is around 29%.
As far as I remember, the last time it froze, this morning, I couldn't find the AP of ESPEasy (if by AP-WLAN you mean it operating in AP mode).
My setup (network, location of the ESP) was working great with another NodeMCU running 2.3 or 2.4, which was released in March.

Uptime is 7 hrs 20 mins, RSSI is -71 dBm, and there are a few WiFi networks around me.
Last Disconnect Reason: (200) Beacon timeout
Number reconnects: 35

@clumsy-stefan
Contributor Author

@wolverinevn the problem with this issue is that it happens completely randomly. I have ~30 nodes running; some of them faced the issue, some did not, some rebooted, some went to AP mode...

It really seems to be a combination of how busy the node is, how busy the air is (e.g. number of WiFi devices) and how your AP actually handles certain conditions (missing layer 2 acks etc.)...

so I guess until we find a way within the application (ESPEasy) to reliably detect this condition and act on it, there is no "real" solution....

@clumsy-stefan
Contributor Author

@wolverinevn PS: you're not using mikrotik AP's by chance?

@TD-er
Member

TD-er commented Nov 2, 2018

@wolverinevn About the number of reconnects (in your edit)
35 reconnects in about 8 hours is a lot.
I have nodes here running for days which only have a handful of reconnects.
The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

Connected 19d22h33m
Last Disconnect Reason (202) Auth fail
Number reconnects 1

@wolverinevn

> @wolverinevn PS: you're not using mikrotik AP's by chance?

No. I'm using router running Padavan firmware (kind of ASUS).

@TD-er I knew it. I'm investigating the reason; it may be noise from a buck module nearby. Another one has 0 reconnects after 2 hours.

@clumsy-stefan
Contributor Author

> No. I'm using router running Padavan firmware (kind of ASUS).

Unfortunately I don't know this FW at all... Any chance to tweak layer 2 parameters? Something like frame ack timeouts or similar? Some kind of "distance" setting?

@wolverinevn

@clumsy-stefan Unfortunately, I don’t see anything like that.

@wolverinevn

@clumsy-stefan The unit rebooted 2 times last night with the failure threshold set to 50. The good news is that it no longer freezes. Today I will try to improve the WiFi connection with some minor changes to the hardware setup.

@Domosapiens

Domosapiens commented Nov 4, 2018

3 Wemos units in the same room, connected to the same AP.
Reconnects in the last 16 hours or so,
With Rule: On WiFi#Connected ....

26 WD reboots and 104 re-connections:
muc21_capture

9 WD reboots and 32 re-connections
muc19_capture

2 WD reboots and 40 re-connections
muc14_capture

All have 50 failure threshold set

@clumsy-stefan
Contributor Author

@Domosapiens & @wolverinevn one more thing you can try is increasing the group-key-timeout on your AP (if you have such an option). Normally that's around 5 min. You can try to increase it to 30 min or even 1 h and see if it improves (as long as it's not in a super high security network, which I don't assume if you have IoT devices in it)...

@clumsy-stefan
Contributor Author

@TD-er

> The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

I also currently have units that have run for over 3 days now and others that rebooted within a day...

I did see some issues with the rekeying of the group key. It somehow seems that in newer versions of the core the rekeying can run into a timeout... However, the application should act on this and not go into some high-load, non-responsive mode... but I'm not sure where it's failing...

@TD-er
Member

TD-er commented Mar 8, 2019

What do you think about the next scenario then.
As soon as it is possible to connect to the given AP's using B/G only, an extra flag will be set to provide no fallback anymore.
If some of these settings (B/G only, or other AP settings) change, the fall-back will be enabled again until it was successful to connect to the given AP.

Edit:
With fallback I mean those extra settings, not the "fallback SSID"

@giig1967g
Contributor

In other words, the fallback remains active only until the first successful connection to the AP, then it is removed. In this case it would be helpful to see the WiFi mode on the sysinfo page.
It's a possible solution, even if I still prefer to avoid fallbacks to N if I set Force B/G.
I understand that the user may make mistakes, but then the user could also type the SSID wrong and not get access to the unit in any case...

Another question: in the current implementation, in which scenario will the unit become an AP?
Because it seems to me that the unit will try B/G 10 times, then try N, but will it eventually give up and become an AP?

@TD-er
Member

TD-er commented Mar 8, 2019

Nope, the AP-mode remains active while still testing to connect to the given APs.
Maybe we should also add an optional check for uptime and only allow to start the AP mode in the first 10 minutes the node is booted.
Just for extra security and also to exclude the possibility the AP mode may have an effect on the ESP not being able to reach the given wifi networks.

And we should keep focus on the "easy" part of the project.
This means proper defaults and no overwhelming amount of settings offered, but give the option to the expert to do it all.
This also means there should be a proper fall-back for the less experienced user.
Especially for B/G only settings. I guess 90+ percent of the people starting with ESPeasy are not aware of the differences between 802.11B/G/N. So if they experience issues, which can be handled very well by using a fallback, it may cause them to look for other projects.
I also understand why this fallback should not give a false sense of 'stability', so I really understand why the current implementation has room for improvement. But if the "first connect attempt success => disable fallback" is made automatic, then it is also perfect for the more experienced user. (who also makes stupid mistakes, as I know from experience ;) )

I agree it should be made more clear what connection setting is actively used.

@giig1967g
Contributor

Understood.
So, the proposal is:
The unit boots.
The N-fallback flag is set to true.
If FORCE B/G is set, it tries 10 times to connect to the WiFi AP in B/G.
If it can connect, it sets the N-fallback flag to false.
If it can't connect, it tries in N-mode.

Please consider also this scenario with Force B/G set: power failure.
ESP and Wifi router are powered off.
Then power returns, and ESP is up quicker than the wifi router.
In this case it could happen that after 10 attempts the WiFi AP is still not listening.
So the unit will try N-mode and eventually will succeed.
The (experienced) user will not know that the connection was made in N-mode and will think it is in B/G mode while seeing a lack of stability.
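The proposed boot sequence and the power-failure case can be sketched as a small simulation. All names here are hypothetical, and the number of N-mode attempts (`n_attempts`) is an assumption, since the thread only states the 10 B/G attempts.

```python
def connect(force_bg, ap_up_from_attempt, fallback_enabled=True,
            bg_attempts=10, n_attempts=10):
    """Simulate one boot and return (mode, fallback_enabled).

    ap_up_from_attempt is the first overall attempt at which the AP answers,
    e.g. 12 models a router that boots more slowly than the ESP.
    mode is "B/G", "N", or None when no connection was made.
    """
    attempt = 0
    if force_bg:
        for _ in range(bg_attempts):
            attempt += 1
            if attempt >= ap_up_from_attempt:
                # A successful B/G connection clears the N-fallback flag.
                return "B/G", False
        if not fallback_enabled:
            return None, False      # strict Force B/G: never fall back to N
    for _ in range(n_attempts):     # n_attempts is an assumed value
        attempt += 1
        if attempt >= ap_up_from_attempt:
            return "N", fallback_enabled
    return None, fallback_enabled


print(connect(True, 1))                           # ('B/G', False)
print(connect(True, 12))                          # ('N', True): power-failure case
print(connect(True, 12, fallback_enabled=False))  # (None, False)
```

The second call shows the concern raised above: with the fallback still enabled, a slow router leads to a silent N-mode connection even though Force B/G is set.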

I don't want to insist, but these HW and freeze issues have really been going on for 1 year. Now that you, with the help of the community, have found one working solution, I strongly advise making sure it remains applied. The risk of an inexperienced user setting "Force B/G mode" is smaller than that of setting the wrong SSID, with the same result: no access to the access point.

SUGGESTION:
In order to make sure the less experienced user does not make mistakes, why don't you add a button to "Test B/G mode"? If it succeeds, it enables "FORCE B/G mode"; if not, it remains disabled.

In this case less experienced users will know what they are doing, and more experienced users will be sure that when Force B/G mode is set it cannot fall back to N.

What do you think?

You could

@TD-er
Member

TD-er commented Mar 8, 2019

Not only if it boots, but the fallback option will then be disabled in the settings and saved.

@giig1967g
Contributor

OK. Then a manual check is even more appropriate than an automatic check, don't you think?

@TD-er
Member

TD-er commented Mar 8, 2019

Yep, I will first add a simple checkmark to disable overrides.
But later it should be made more dynamic in this to also help us, the "experts" to make less mistakes :)

@giig1967g
Contributor

ok

@kischde

kischde commented Mar 10, 2019

Version 2019_02_16 had been running for 9 days without a reboot (forced to B/G). Yesterday it started to reboot again. The hardware watchdog forced this two times.
After a while the unit was not reachable any more. Switching off the WLAN router did not help (tried it 3 times, up to 30 minutes). Restarting the ESP by switching off the power fuse did not help either.
I had to shut down the WLAN again and then connect directly to the ESP via 192.168.4.1 to get access to it.

I have absolutely no idea what happened

@TD-er
Member

TD-er commented Mar 10, 2019

@kischde What kind of accesspoint are you using?
Last night I experienced something similar myself while experimenting with installing a new AP.
I was trying to install a MikroTik AP and all was working fine until the ESP needed to reconnect.
Through the web interface of the MikroTik, I could see the node was connected to the WiFi, but it didn't receive an IP address.
Even when I tried to connect it to my phone's hotspot, I could reboot the ESP but it repeated this behavior.
Only when I set the main AP config to another AP it started to connect like it should.

Note that the node was not rebooted to achieve this.

One thing that is set "incorrect" on that AP, compared to another MikroTik I have, is that it has the "Distance" setting set to "indoors".
I will perform more tests to see what's the difference here, but like discussed before, it seems to be some timeout setting for a reply on a packet.
And I can imagine the ESP does take some time to respond to DHCP requests, which may be just too short for this setting.

@kischde

kischde commented Mar 10, 2019

I am using a Swisscom AP (Swiss telecom provider AP), but I had all the same issues described here in the different places by the guys with the MikroTik. So it may have the same chipset, but I can't change much in the settings, for example those extended settings.
Before re-powering I also tried to connect with my mobile phone, like you did, with the same behaviour as you.
I use a fixed IP, no DHCP.
I have now switched to the current 2019_03_05; will see what happens...

@TD-er
Member

TD-er commented Mar 10, 2019

Did you recently change WiFi settings?
For example, an access point with MAC address AA:AA:AA:AA:AA:AA on position 1 of the SSIDs, another AP with MAC address BB:BB:BB:BB:BB:BB?
I have got the feeling there are some settings left in some place in the ESP where we do not store them.

I will also try to see in the NonOS docs how these can be effectively cleared when we change WiFi settings.
Also in my tests, it appears to be really useful to have more than 2 SSID's to be used.
I will also try to add more field for this, or maybe even allow to store some encrypted file in SPIFFS to have a near unlimited amount of WiFi AP's stored.

@kischde

kischde commented Mar 10, 2019

> Did you recently change WiFi settings?

No, as I have only one AP, this was IMO not necessary

@TD-er
Member

TD-er commented Mar 10, 2019

OK. Good to know.
Since I don't know yet what is causing this, I'm trying to eliminate as much unknowns as possible.
So the only thing you needed to do to make it work again was to add the WiFi settings again?

@kischde

kischde commented Mar 10, 2019

No, even less.
I forced a direct connection to the ESP via 192.168.4.1 (disabled the AP).
Then I checked everything, but did not find anything abnormal... So I gave it a try and restarted my AP, then reset the ESP, and it ran again. So IMO I just had to force it to do a "normal/other" WLAN connect.
BTW, I also saw it connected at the AP, but also with no IP address assigned, even though it's static.

@TD-er
Member

TD-er commented Mar 10, 2019

Hmm, I've been playing with it a bit, with 5 nodes connected to the MikroTik I'm playing with.

As soon as I set "Hw. Fragmentation Threshold" (just 'unfold' the option), the ESP nodes are no longer capable of receiving an IP address.
The default value of this setting is 256. If I set it to 1600 (will fit a full MTU), all nodes will receive an IP address and continue to work.
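A rough sketch of why the two threshold values behave so differently. It simply divides the frame size by the threshold and ignores MAC header and FCS overhead, so the numbers are illustrative only. The 350-byte frame is a made-up example size, and the thread does not confirm that fragmentation itself is what the ESP fails to handle.

```python
import math


def fragment_count(frame_bytes, threshold_bytes):
    """Approximate number of layer 2 fragments for one frame.

    Simplification: header and FCS overhead per fragment are ignored.
    """
    return max(1, math.ceil(frame_bytes / threshold_bytes))


for threshold in (256, 1600):
    # 1500 bytes is a full-size Ethernet MTU; 350 bytes is a hypothetical
    # small packet used only for illustration.
    print(threshold, fragment_count(1500, threshold), fragment_count(350, threshold))
# 256 6 2
# 1600 1 1
```

With a threshold of 1600 nothing up to a full MTU is split, which matches the observation that the nodes work again with that setting.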

This is shown in the MikroTik UI when the nodes are able to communicate:
[Screenshot: MikroTik UI]

And this when they are not able to send/receive any data (but are connected to the WiFi layer)
[Screenshot: MikroTik UI]

@clumsy-stefan
Contributor Author

clumsy-stefan commented Mar 10, 2019

@TD-er did you see that there is an option in the new esp8266 core called LWIP_FEATURES which I think will activate IP reassembly... Probably that's not defined in your build?

see here: https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/tools/sdk/lwip2/include/lwipopts.h#L748-L754

also support for IP fragmentation is set with this:
https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/tools/sdk/lwip2/include/lwipopts.h#L756-L763

@TD-er
Member

TD-er commented Mar 10, 2019

Hmm, not sure what the defaults are.
Are the numbers given on the first line in the Doxygen documentation the defaults?

@clumsy-stefan
Contributor Author

I don't know... I just know that in the Arduino IDE, in the platform definition file, you can select whether you want it included and defined or not...

@TD-er
Member

TD-er commented Mar 10, 2019

I always thought the LWIP parts were included as pre-compiled libraries in the core distribution.
So then it is quite hard to make sure LWIP is rebuilt using the correct flags.

@clumsy-stefan
Contributor Author

clumsy-stefan commented Mar 11, 2019

yes, but it depends which one you link against (from boards.txt):

-llwip2-1460-feat
-llwip2-536-feat
-llwip2-536
-llwip2-1460
-llwip2

@TD-er
Member

TD-er commented Mar 11, 2019

So they're using these flags:

https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/boards.txt#L357-L387

| Label | build.lwip_lib | build.lwip_flags |
|---|---|---|
| v2 Lower Memory | -llwip2-536-feat | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=0 |
| v2 Higher Bandwidth | -llwip2-1460-feat | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=0 |
| v2 Lower Memory (no features) | -llwip2-536 | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=0 -DLWIP_IPV6=0 |
| v2 Higher Bandwidth (no features) | -llwip2-1460 | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=0 -DLWIP_IPV6=0 |
| v2 IPv6 Lower Memory | -llwip6-536-feat | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=1 |
| v2 IPv6 Higher Bandwidth | -llwip6-1460-feat | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=1 |
| v1.4 Higher Bandwidth | -llwip_gcc | -DLWIP_OPEN_SRC |
| v1.4 Compile from source | -llwip_src | -DLWIP_OPEN_SRC |

So all v2 versions have LWIP_FEATURES=1 except for the ones labelled as "no features"

@clumsy-stefan
Contributor Author

clumsy-stefan commented Mar 11, 2019

Yes, but if you look at the -l statements, you need to use the libraries with -feat at the end (contrary to the label, a bit counter-intuitive)...
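As a quick cross-check of the two statements above, the v2 rows quoted from boards.txt can be put into a small table and compared. The data is copied from the listing in this thread; the helper name `features_enabled` is made up for this sketch.

```python
# label -> (build.lwip_lib, build.lwip_flags), as quoted in this thread
V2_VARIANTS = {
    "v2 Lower Memory": ("-llwip2-536-feat", "-DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=0"),
    "v2 Higher Bandwidth": ("-llwip2-1460-feat", "-DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=0"),
    "v2 Lower Memory (no features)": ("-llwip2-536", "-DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=0 -DLWIP_IPV6=0"),
    "v2 Higher Bandwidth (no features)": ("-llwip2-1460", "-DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=0 -DLWIP_IPV6=0"),
    "v2 IPv6 Lower Memory": ("-llwip6-536-feat", "-DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=1"),
    "v2 IPv6 Higher Bandwidth": ("-llwip6-1460-feat", "-DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=1"),
}


def features_enabled(flags):
    """Return True when the flag string contains -DLWIP_FEATURES=1."""
    for token in flags.split():
        if token.startswith("-DLWIP_FEATURES="):
            return token.split("=", 1)[1] == "1"
    return False


# A row is consistent when the "-feat" library suffix, the LWIP_FEATURES
# flag and the absence of "(no features)" in the label all agree.
mismatches = [
    label
    for label, (lib, flags) in V2_VARIANTS.items()
    if not (lib.endswith("-feat") == features_enabled(flags) == ("(no features)" not in label))
]
print(mismatches)  # []
```

For the rows quoted here, the `-feat` library is always paired with `-DLWIP_FEATURES=1`, and only the variants labelled "(no features)" link against the library without the suffix.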
