EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times #15
I repeated the steps with some TCP debugging options enabled and caught this. Line 665 in c0b33ce
|
Thanks for this report. There's a limited number of total connections. Are you certain the other connections are closed? |
Thinking some more (I'm not able to test just yet): What happens after you wait 2 minutes after you see this error? Technically, those connections are still active even if the Ethernet plug gets unplugged. There may be a 2-minute timeout; have you tried waiting this long after the last connection made and after you unplug the cable? |
In the example I am using to test this issue, you call client.stop() after 5 seconds of no input from the client. In my program I call client.close() on link state change, if false, and both programs have this issue. Is that what you mean by connection closed? I did what you suggested and waited to see if the client would ever connect, and it did after ~32 minutes. I repeated this twice with the same results. |
Thanks for the extra info. Are you saying things are repaired after that ~32 minutes? After it’s “repaired”, does the same thing happen after unplugging and re-plugging the cable again a few times? |
Correct, after 32 minutes my client connects to the Teensy and everything seems to operate as normal. I am then able to repeat the steps in my original post, unplugging and plugging back in the ethernet cable ~7 times before the issue repeats itself. |
Here is a more complete log after connecting the ethernet cable for the last time and failing to connect.
Line 1847 in c0b33ce
|
Correct me if I am wrong on any of this. I learned that "~7" times corresponds to the maximum number of TCP PCBs, which is set to 8. It looks like I am hitting the max and receiving an error when memp tries to make space for a new pcb but can't.
Looks like this is where the issue is showing itself, but I assume the actual issue lies somewhere else, not sure if I can trace much further without digging deep. Line 1841 in c0b33ce
There must be an issue freeing up TCP pcb when the ethernet cable gets pulled? But something is clearly forcing them to free up after some time ~30mins? Maybe there is a way to force them to free up? Even though it looks like it tries to free up space after the line shown above and fails to do so. |
Log when the pcb finally gets purged and everything goes back to normal. (The 30ish minute wait)
corresponding to Line 1234 in e01aefa
and Line 2152 in e01aefa
|
Question: Are you using DHCP, and if so, do you see the address change ever, whenever plugging the Ethernet back in? |
I'm having trouble reproducing this. Can you tell me more about the client you're using to test connections? I'm using a browser and reloading manually around every second while unplugging and plugging in the cable. (Static IP.) Could you also tell me more about how you modified the ServerWithListeners example? |
I have the Teensy directly connected to a computer both with a static IP. Not sure if it matters, but I am using a USB to ethernet adapter for this. The only thing I changed in ServerWithListeners is the following:
Here is a C# client I threw together to send "hi" every second and reconnect on error. The original client I discovered this issue with was in LabVIEW, but this one has the same issue.
|
I've not been able to reproduce the problem. My Java code (Java 17):
For my testing procedure, I plugged and unplugged the cable multiple times with various timings:
My Java program can always reconnect. There's a case where I leave the cable unplugged and it takes about a minute for the socket to realize there's nothing connected and then it properly complains of a "broken pipe". Also, if I leave the cable unplugged while the Java program is running, wait for a timeout on the Teensy side, and then reconnect the cable, the Java program shortly realizes there's a "broken pipe" and then restarts the connection. What versions of QNEthernet and Teensyduino are you using? Additionally, I'm running these tests on a Mac. On what hardware are you running your test program? I wonder if there's a difference in the client-side TCP/IP stacks we're using. |
I am using Windows 10 PCs, Teensyduino 1.56 and Arduino 1.8.16. I just ran a few tests plugging directly into the Ethernet jack on the PC vs using the USB adapter and the issue was still there. I tried a different PC to make sure no software was causing the issue and the issue was still there. Then I tried connecting the Teensy to a modem/router, changing the IP/gateway and ran the test again. This method allows me to unplug/re-plug the ethernet cable many times without issue, but I believe it does not behave the same way as a direct connection, which is my current configuration. I don't have a Mac that I can use to test this, unfortunately; I would be interested to know if that makes the difference. |
Could you re-try with the latest Teensyduino 1.56? Just to be sure you have the latest of everything. Also, what version of QNEthernet do you have? I'll re-try my tests by plugging the Ethernet into my laptop (via one of those Belkin USB-C Ethernet adapters). |
I just did similar tests, but with the Teensy connected with Ethernet directly to my computer via that USB-C Ethernet adapter. The only problem I saw was the Java program not being able to connect—I saw this once. I simply restarted the Java program and things returned to normal. Does that sound like what you're seeing sometimes? Which lwIP debugging options did you turn on when you built the project? |
I am now using Teensyduino 1.56 (previously 1.55) with Arduino 1.8.16 and QNEthernet 0.14.0. I ran the test again and the issue is still there. Restarting the client program does not fix the issue in my case. At the moment I do not have any debugging options enabled. I think the ones I enabled previously were MEMP_DEBUG, TCP_DEBUG and a few others. |
I wonder how often the That function is where all the network processing is done (as opposed to from ISRs). |
I believe I know what is happening now. I previously found out that the PCBs were not freeing up; today I checked their state and they are all stuck in fin_wait_1. After doing some searching, this is known behavior. What is happening is that when I unplug the ethernet cable, the Teensy sends out a FIN and waits in state fin_wait_1 until it receives an ACK, but it never receives one since the link was disconnected. The only other thing that will free them up is hitting max retransmissions, which takes a while. Here is a post I found explaining what I believe to be the problem. http://savannah.nongnu.org/bugs/?func=detailitem&item_id=44092 I'm still trying to figure out what to do to circumvent this. It seems your Mac handles this differently, similarly if I plug into a router instead of a direct connection/network switch. |
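A small desktop model may make the failure mode above easier to see. It is not lwIP or QNEthernet code: `PcbPool`, `SlotState`, and every method name are made up for illustration, and the pool size of 8 is taken from the limit mentioned earlier in this thread. The model only captures the bookkeeping: a graceful close leaves the slot occupied until an ACK arrives or retries run out, while an abort frees it at once.

```cpp
#include <array>
#include <cstddef>

// Desktop model only; none of these names are lwIP or QNEthernet APIs.
enum class SlotState { Free, Established, FinWait1 };

struct PcbPool {
  static constexpr std::size_t kMaxPcbs = 8;  // Limit of 8 mentioned in the thread
  std::array<SlotState, kMaxPcbs> slots{};    // Value-initialized: all Free

  // Returns the slot index for a new connection, or -1 if the pool is full.
  int accept() {
    for (std::size_t i = 0; i < slots.size(); ++i) {
      if (slots[i] == SlotState::Free) {
        slots[i] = SlotState::Established;
        return static_cast<int>(i);
      }
    }
    return -1;
  }

  // Graceful close: a FIN is sent and the slot stays occupied until the
  // peer ACKs it. With the cable unplugged, that ACK never arrives.
  void close(int i) { slots[i] = SlotState::FinWait1; }

  // Abort: the slot is released immediately, with no handshake.
  void abort(int i) { slots[i] = SlotState::Free; }

  // Models the retransmission limit finally being reached.
  void retriesExhausted() {
    for (auto& s : slots) {
      if (s == SlotState::FinWait1) {
        s = SlotState::Free;
      }
    }
  }
};
```

In this model, eight accept-then-close cycles with the link down leave every slot in `FinWait1`, so the next `accept()` returns -1 until `retriesExhausted()` runs. Using `abort()` in place of `close()` keeps the pool usable.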
Definitely a tricky problem. Enabling the keepalive option seems busy, but are there good TCP_MAXRTX values that work for you? (Adding this here for future readers of the thread; I'm imagining you're already exploring these options.) I certainly wonder how the Mac and router handle these differently. |
@SpenceV1 Would it be useful to you if I added a way to call I might couple this with inactive polling callbacks (see |
Yes, I was exploring my options the other day. Keepalive, TCP_MAXRTX, and rebooting seem to be the simple options. I don't think keepalive is sufficient for me, it didn't seem to speed up the time at which the pcbs were freed. Setting TCP_MAXRTX lower definitely helps, I think this is the best solution without going against the TCP protocol. My application has pretty quick timings so I don't think I will ever want a packet being resent 1+ minute later anyways. |
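For future readers, lowering the retransmission limit might look like the fragment below in `lwipopts.h`. This is a sketch under assumptions: stock lwIP defaults `TCP_MAXRTX` to 12, the value 6 is only illustrative and not a recommendation, and the `lwipopts.h` bundled with your QNEthernet version may already define this option, in which case edit the existing line instead of adding a second one.

```cpp
// lwipopts.h (fragment)

// Maximum number of retransmissions of a data or FIN segment before lwIP
// gives up on the connection and frees its PCB. A smaller value means stuck
// connections are reclaimed sooner, at the cost of less tolerance for a
// flaky link. The value below is an illustrative assumption.
#define TCP_MAXRTX 6
```

The right value depends on how long the application can tolerate a connection retrying before it is dropped.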
Sorry, I was probably updating my above comment after you responded. What do you think about my You could also possibly call “Abort” instead of “Close” in the link-off detection. |
Right, calling close() is the start of the issue. Close tries to send out a FIN and never seems to receive an ACK, so the pcb is stuck in fin_wait_1, although I'm not sure why it would never get anything back once the cable is re-plugged and there are still retries. It could help to have a function to call tcp_abort(). What happens naturally if I wait for the max retries is that tcp_slowtmr hits this line, which seems to clear out the unacked packets and free up the pcb. Line 1232 in e01aefa
I'm not really sure what would be best, it sounds like tcp_poll might be appropriate but anything we do might be breaking the flow of the TCP protocol. If I used tcp_abort() I would probably use it when the link gets disconnected since I know this is a problem. |
Since this issue probably won't come up often and most of the time would be solved by waiting, it is probably sufficient for me to lower TCP_MAXRTX. If you provide the ability for me to call tcp_abort() I will probably use it to be sure, but my ideal flow would be to only abort the oldest connection when a new one comes in and we don't have the space for another pcb, similar to how tcp_alloc() in tcp.c handles connection states such as TIME-WAIT and LAST-ACK. |
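The "abort the oldest connection when there is no room" idea could be sketched on the application side roughly as follows. Everything here is hypothetical: `ClientTable` is not part of QNEthernet, and `Client` stands for any type that is default-constructible, copy-assignable, and has an `abort()` member (the thread describes `EthernetClient::abort()`; whether `EthernetClient` meets the other requirements is an assumption to verify). Note the limitation: this only helps if the table is smaller than lwIP's PCB pool, so that a new connection can still be accepted. Evicting at allocation time, as described above, would have to happen inside lwIP itself.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical application-side table of accepted clients.
template <typename Client, std::size_t N>
class ClientTable {
  static_assert(N > 0, "table needs at least one slot");

 public:
  // Stores `c` in a free slot. If none is free, the oldest stored client is
  // aborted and its slot is reused. Returns the slot index that was used.
  std::size_t add(const Client& c) {
    std::size_t target = 0;
    bool foundFree = false;
    for (std::size_t i = 0; i < N; ++i) {
      if (!used_[i]) {
        target = i;
        foundFree = true;
        break;
      }
      if (stamp_[i] < stamp_[target]) {
        target = i;  // Oldest occupied slot seen so far
      }
    }
    if (!foundFree) {
      clients_[target].abort();  // Drop immediately, no FIN handshake
    }
    clients_[target] = c;
    used_[target] = true;
    stamp_[target] = next_++;  // Counter wraparound is ignored in this sketch
    return target;
  }

  // Marks a slot as free once the application is done with that client.
  void release(std::size_t i) { used_[i] = false; }

 private:
  std::array<Client, N> clients_{};
  std::array<bool, N> used_{};
  std::array<std::uint32_t, N> stamp_{};
  std::uint32_t next_ = 0;
};
```

A sketch would call `add()` with each client returned by `accept()` and `release()` after a connection has been closed normally.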
Maybe the best option is to add a section to the README describing how to address this with either keepalive or by lowering TCP_MAXRTX. I might save “Abort” for another day unless it’s really needed. What do you think of this plan? |
I'm going back and forth on this one. If someone connects 5 clients to one Teensy and unplugs, then re-plugs the ethernet cable one time, they will not be able to reconnect all of the clients until the original connections time out. It would be great to have someone verify that this is a Windows-specific issue. I fired up Wireshark and I noticed FIN flags (some retransmissions) but I didn't see any ACKs going back to the Teensy. Windows may be killing the connection as soon as it no longer sees a cable connected; I tried looking this up and this is the only thing I could find. https://stackoverflow.com/a/438212 This would explain why adding a router between the Teensy and PC would potentially solve this issue, since the Windows PC would still see a cable connected on its end. If Windows is aborting connections on ethernet unplug, I feel it would be appropriate to do the same. I will try and see if a network switch behaves the same or not as this is how the actual application will run, although it may still depend on which cable you disconnect. |
I tested this using my actual setup of one Windows 10 PC connected to multiple Teensy 4.1 with an unmanaged switch. I am able to disconnect and reconnect the ethernet cable going between a Teensy and the switch many times with no problem. I believe the Windows TCP stack is most likely the issue as described previously. |
Thanks for diagnosing. Good to know of this issue. I’ll have an |
@SpenceV1 I've added an
## On connections that hang around after cable disconnect
Ref: [EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times](https://github.com/ssilverman/QNEthernet/issues/15)
TCP uses various mechanisms to maintain connections, even when the physical
connection is unreliable. This includes such things as timeouts, retries, and
exponential backoff.
It turns out that some systems drop and forget a connection when the physical
link is disconnected. This means that the other side may still be waiting to
continue the connection.
The above link contains a discussion where a user of this library couldn't
accept any new connections until all the current connections timed out after
about a half hour. What happened was this: connections were being made, the
Ethernet cable was disconnected and reconnected, and then more connections were
being made. The Teensy side still maintained connection state for all the
connections, choosing to do what TCP does: make a best effort to maintain those
connections. Once all the available sockets had been exhausted, no more
connections could be accepted.
Those connections couldn't be cleared and sockets made available until all the
TCP retries had elapsed. The main problem was that the other side simply dropped
the connections when it detected a link disconnect. If the other system had
maintained connection state, the connections would have continued as normal when
the Ethernet cable was reconnected. That's why tests on my system couldn't
reproduce the issue. The IP stack on the Mac maintained state across cable
disconnects/reconnects. The issue reporter was using Windows, and the IP stack
there apparently drops a connection if the link disconnects. This left the
Teensy side waiting for replies and retrying, and the Windows side no longer
sending traffic.
To mitigate this problem, there are a few possible solutions, including:
1. Reduce the number of retransmission attempts by changing the `TCP_MAXRTX`
setting in `lwipopts.h`, or
2. Abort connections upon link disconnect.
To accomplish #2, there is an `EthernetClient::abort()` function that simply
drops a TCP connection without going through the normal TCP close process. This
could be called on connections when the link has been disconnected. (See
`Ethernet.onLinkState(cb)`.)
Fun links:
* [Removing Exponential Backoff from TCP - acm sigcomm](http://www.sigcomm.org/node/2736)
* [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) |
Here's my latest revision:
## On connections that hang around after cable disconnect
Ref: [EthernetServer accept no longer connects clients after unplugging/plugging ethernet cable ~7 times](https://github.com/ssilverman/QNEthernet/issues/15)
TCP tries its best to maintain reliable communication between two endpoints,
even when the physical link is unreliable. It uses techniques such as timeouts,
retries, and exponential backoff. For example, if a cable is disconnected and
then reconnected, there may be some packet loss during the disconnect time, so
TCP will try to resend any lost packets by retrying at successively larger
intervals.
The TCP close process uses some two-way communication to properly shut down a
connection, and therefore is also subject to physical link reliability. If the
physical link is interrupted or the other side doesn't participate in the close
process then the connection may appear to become "stuck", even when told to
close. The TCP stack won't consider the connection closed until all timeouts and
retries have elapsed.
It turns out that some systems drop and forget a connection when the physical
link is disconnected. This means that the other side may still be waiting to
continue or close the connection, timing out and retrying until all attempts
have failed. This can be as long as a half hour, or maybe more, depending on how
the stack is configured.
The above link contains a discussion where a user of this library couldn't
accept any new connections, even when all the connections had been closed, until
all the existing connections timed out after about a half hour. What happened
was this: connections were being made, the Ethernet cable was disconnected and
reconnected, and then more connections were made. When the cable was
disconnected, all connections were closed using the `close()` function. The
Teensy side still maintained connection state for all the connections, choosing
to do what TCP does: make a best effort to maintain or properly close those
connections. Once all the available sockets had been exhausted, no more
connections could be accepted.
Those connections couldn't be cleared and sockets made available until all the
TCP retries had elapsed. The main problem was that the other side simply dropped
the connections when it detected a link disconnect. If the other system had
maintained those connections, it would have continued the close processes as
normal when the Ethernet cable was reconnected. That's why tests on my system
couldn't reproduce the issue: the IP stack on the Mac maintained state across
cable disconnects/reconnects. The issue reporter was using Windows, and the IP
stack there apparently drops a connection if the link disconnects. This left the
Teensy side waiting for replies and retrying, and the Windows side no longer
sending traffic.
To mitigate this problem, there are a few possible solutions, including:
1. Reduce the number of retransmission attempts by changing the `TCP_MAXRTX`
setting in `lwipopts.h`, or
2. Abort connections upon link disconnect.
To accomplish #2, there's an `EthernetClient::abort()` function that simply
drops a TCP connection without going through the normal TCP close process. This
could be called on connections when the link has been disconnected. (See also
`Ethernet.onLinkState(cb)` or `Ethernet.linkState()`.)
Fun links:
* [Removing Exponential Backoff from TCP - acm sigcomm](http://www.sigcomm.org/node/2736)
* [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) |
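The "abort connections upon link disconnect" option could be sketched as a small helper like the one below. The helper name `abortConnected` is hypothetical, and `Client` stands for any type with a boolean conversion and an `abort()` member, which is how `EthernetClient` is described above. The hookup shown in the trailing comment assumes the application keeps its clients in an array named `clients` of size `kMaxClients`; both names are made up, and that part is untested.

```cpp
#include <cstddef>

// Aborts every client in clients[0..count) that reports itself as connected
// and returns how many were aborted.
template <typename Client>
std::size_t abortConnected(Client* clients, std::size_t count) {
  std::size_t aborted = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (clients[i]) {
      clients[i].abort();  // Drop the connection without the close handshake
      ++aborted;
    }
  }
  return aborted;
}

// On a Teensy this might be wired up roughly as follows (untested; `clients`
// and `kMaxClients` are hypothetical application variables):
//
//   Ethernet.onLinkState([](bool up) {
//     if (!up) {
//       abortConnected(clients, kMaxClients);
//     }
//   });
```

Aborting only when the link goes down keeps the normal close handshake for connections that end while the cable is still plugged in.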
I've pushed some new changes, including |
Thanks for the addition to the library. Your explanation is good, I hope it helps others who may run into this issue. Since I am working strictly with Windows 10 PCs and I assume they will all behave in a similar way, I added an abort on link disconnect to be safe. I appreciate your help with this issue. |
EthernetServer accept function will no longer return new clients after doing the following steps.
This would affect any long-running application which may occasionally be unplugged from the network. I have poked around in some of the TCP files with a debugger and I see that the Teensy is receiving the data I send every second, but the library does not seem to accept the connection.