weird behaviour with --bidir on bsd/pfsense #1836
Comments
here is some debug output. mind the hang after the send statements.
|
I have no idea what the problem may be, but the following are some questions/tests that may help to understand the issue:
|
hi, thanks for the feedback regarding
|
this is debug output for the server side ( iperf -s -d )
|
and, interestingly: not only does "truss" on the sender side (iperf3 -c . ...) heal the problem, it also heals the problem when i run "truss iperf3 -s" on the other end. so i would say it's a race condition or some timing issue, as truss should have an influence on the application's execution time/path.
this is what AI tells about this when asking for the effect of "truss" when healing a problem:
The issue you're experiencing, where your program hangs when sending network packets but operates correctly when run under "truss" (the BSD equivalent of strace), suggests that "truss" may be influencing the program's execution in a way that prevents the hang. This behavior could be due to several factors:
Blocking System Calls: Your program might be encountering a blocking system call, such as waiting for network resources or file operations. "Truss" could be affecting the timing or order of these calls, thereby preventing the hang.
Synchronization Issues: If your program uses multiple threads or processes, there might be synchronization problems causing the hang. "Truss" might be altering the execution flow, which could prevent these issues from manifesting.
Signal Handling: Your program might rely on signals to manage certain events. "Truss" could be affecting signal delivery or handling, which might prevent the hang.
To diagnose and resolve the issue, consider the following steps:
Examine System Calls: Use "truss" to monitor the system calls your program makes and identify where it hangs. This can help pinpoint the exact location of the issue.
Analyze Synchronization: Review your code for potential synchronization problems, especially if you're using threads or processes. Look for race conditions or deadlocks that could cause the hang.
Inspect Signal Handling: Ensure that your program handles signals correctly and that there are no issues with signal delivery or handling that could lead to the hang.
Test Without "Truss": Run your program without "truss" in a controlled environment to see if the issue persists. This can help determine if "truss" is masking the problem or if the issue is inherent in the program itself.
By systematically investigating these areas, you should be able to identify the root cause of the hang and implement an appropriate solution. |
setting "-w 512k" also reliably cures the problem:
iperf3 -c 192.168.1.3 --bidir -w 512k |
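For readers unfamiliar with the option: iperf3's "-w" sets the socket send/receive buffer sizes. The Python sketch below is not iperf3 code. It only illustrates the same kind of request on a plain TCP socket and reads back what the kernel actually granted. The helper name `effective_buffer_sizes` is hypothetical, and treating "512k" as 512 KiB is an assumption.

```python
import socket

# Assumption: "512k" is read as 512 KiB.
REQUESTED = 512 * 1024


def effective_buffer_sizes(requested=REQUESTED):
    """Request send/receive buffers of `requested` bytes on a TCP socket
    and return the sizes the kernel reports afterwards.

    The kernel may clamp or round the request, and some systems reject
    an oversized request with OSError, so the result can differ from
    what was asked for.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv


if __name__ == "__main__":
    snd, rcv = effective_buffer_sizes()
    print(f"requested {REQUESTED}, got SO_SNDBUF={snd} SO_RCVBUF={rcv}")
```

Comparing the reported values on both pfsense boxes with and without the larger request may show whether the default buffers are unusually small there.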
iperf is reporting "Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0". out of curiosity: why is it then printing "sent X bytes of 131072, pending 65372, total 65700" 10 times in a row?
|
Hi, interesting information, although I can't tell whether it helps in understanding the root of the problem... Can you try running the 10 seconds test with 1 second report interval, i.e. adding
Some notes:
What iperf3 command did you use for the tcpdump? I am asking since it seems that about 140KB/160KB were transferred before the 10 seconds "halt". Also, I assume that the "wakeup" after the 10 seconds is the interrupt for the report (or the end of the test). This is why I asked above to try using
Also, I see that the sending is halted after packets with the TCP "Push" flag (the "[P.]") are sent. I am not sure what that may mean.
It is trying to send the first 128KB message, or more accurately to push the message into the TCP buffers. The first time, 64KB are sent (the "sent 65700 bytes"). Then it tries again and again to send the rest of the message, but nothing is sent (the "sent 0 bytes"). What is causing that is the main issue, and it is the reason for the "slow" throughput. |
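The partial-send behaviour described above can be sketched in Python. This is not iperf3's implementation. `push_block` and `make_stalling_send` are hypothetical names, and the stalling sender is a stand-in for a socket whose buffer accepts 65700 bytes and then never drains. It only shows how a 131072-byte block can produce one "sent 65700" line followed by repeated "sent 0" lines with the same pending and total counters.

```python
from typing import Callable, List, Tuple


def push_block(send: Callable[[bytes], int], block: bytes,
               attempts: int) -> List[Tuple[int, int, int, int]]:
    """Try up to `attempts` times to hand `block` to a non-blocking send().

    Returns one (sent, block_size, pending, total) tuple per attempt.
    """
    total = 0
    log = []
    for _ in range(attempts):
        if total == len(block):
            break  # the whole block has been accepted
        try:
            sent = send(block[total:])
        except BlockingIOError:
            sent = 0  # buffer full: nothing accepted on this attempt
        total += sent
        log.append((sent, len(block), len(block) - total, total))
    return log


def make_stalling_send(capacity: int) -> Callable[[bytes], int]:
    """Hypothetical socket stand-in: accepts `capacity` bytes, then stalls."""
    free = capacity

    def send(data: bytes) -> int:
        nonlocal free
        accepted = min(free, len(data))
        if accepted == 0:
            raise BlockingIOError
        free -= accepted
        return accepted

    return send


if __name__ == "__main__":
    attempts = push_block(make_stalling_send(65700), bytes(131072), 4)
    for sent, size, pending, total in attempts:
        print(f"sent {sent} bytes of {size}, pending {pending}, total {total}")
```

With a real socket, the stalled attempts would start succeeding again once the peer reads data and the buffer drains. The open question in this thread is why that does not happen.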
that does not make a difference. iperf does not print anything before reaching 10s test end.
simply -c ... --bidir with no further options.
i will try investigating
when this is about "retrying" - i assume you meant tcp retransmit/retry - why does iperf -s -d on the server IMMEDIATELY print all these lines? mostly, i only get those ten "sent X bytes of 131072" lines printed, where the first X is 65700 bytes - but sometimes i get the following output IMMEDIATELY on start of the send, with no delay between any of the lines printed. i have attached such output below. and i just found out that in about 1 of 20 tries, the hang does not happen and i get normal performance reported, i.e. nothing gets stuck. so this looks like a race condition / timing issue, which also explains why it's getting healed by running under control of "truss".
|
from the debug output, we would assume that 65700 bytes have been sent and then things get stuck.
Congestion algorithm is cubic
when i look at the tcpdump and count each occurrence of full-sized 1448 byte packets, i count:
cat tcpdump-before-hang.log | grep "192.168.1.2.34071 > 192.168.1.3.5201" | grep "length 1448" | wc -l
= 97 x 1448 = 140456 bytes sent from client to iperf server.
cat tcpdump-before-hang.log | grep "192.168.1.3.5201 > 192.168.1.2.56471" | grep "length 1448" | wc -l
= 85 x 1448 = 123080 bytes received from iperf server to the client.
despite no received bytes being logged, i also wonder why it tells it did send 65700 bytes when it sent 140456. even if we subtract tcp overhead, that at least seems to be more data than what has been logged on the sending side. |
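The grep/wc counting above can also be written as a small Python sketch, which makes the arithmetic explicit. The function name `count_full_segments` is hypothetical. The matching is a plain substring test, like the grep pipeline, and it assumes each packet occupies one line of the tcpdump text output.

```python
PAYLOAD = 1448  # payload length of a full-sized segment in the capture


def count_full_segments(lines, flow, payload=PAYLOAD):
    """Count lines that contain both `flow` and "length <payload>".

    Returns (packet_count, payload_bytes), mirroring
    grep "<flow>" | grep "length 1448" | wc -l, times the payload size.
    """
    needle = f"length {payload}"
    packets = sum(1 for line in lines if flow in line and needle in line)
    return packets, packets * payload


if __name__ == "__main__":
    # File name taken from the commands above.
    with open("tcpdump-before-hang.log") as capture:
        lines = capture.readlines()
    for flow in ("192.168.1.2.34071 > 192.168.1.3.5201",
                 "192.168.1.3.5201 > 192.168.1.2.56471"):
        packets, nbytes = count_full_segments(lines, flow)
        print(f"{flow}: {packets} x {PAYLOAD} = {nbytes} bytes")
```

Note that this only counts full-sized segments, so shorter segments (for example the last part of a block) are not included in either total.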
Actually these are not "retries". iperf3 is trying to send as much as it can (unless bandwidth is limited by From the new output you sent it seems that the problem is in the system for some reason, as the TCP buffers are "slowly" getting free so additional data can be sent. E.g.:
This is also what confused me in the previous debug output you sent. Can it be that this debug is from the case when additional data was sent, as in the above output snapshot? You may try looking at the times of the "length 1448" lines to see if there are time gaps at the "sent 0" times. One more point: can you try building from version 3.18, or even from the master branch of iperf3, and see if the problem still happens? Since version 3.16, iperf3 runs the sending part in a separate thread, and it would be interesting to know if the problem also happens in this case. |
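One way to look for such time gaps is sketched below in Python. It assumes the capture uses tcpdump's default text timestamps (HH:MM:SS.ffffff at the start of each line) and that the capture does not cross midnight. The function name `find_gaps` and the one-second threshold are illustrative choices, not something taken from the thread.

```python
from datetime import datetime


def find_gaps(lines, min_gap_s=1.0, needle="length 1448"):
    """Return (previous_stamp, stamp, seconds) for each pause of at least
    `min_gap_s` seconds between consecutive lines containing `needle`.

    Assumes every matching line starts with an HH:MM:SS.ffffff timestamp.
    """
    gaps = []
    prev_time = None
    prev_stamp = None
    for line in lines:
        if needle not in line:
            continue
        stamp = line.split(maxsplit=1)[0]
        when = datetime.strptime(stamp, "%H:%M:%S.%f")
        if prev_time is not None:
            delta = (when - prev_time).total_seconds()
            if delta >= min_gap_s:
                gaps.append((prev_stamp, stamp, delta))
        prev_time, prev_stamp = when, stamp
    return gaps


if __name__ == "__main__":
    # File name taken from the earlier comment in this thread.
    with open("tcpdump-before-hang.log") as capture:
        for before, after, seconds in find_gaps(capture):
            print(f"{seconds:.3f}s without full-sized segments: {before} -> {after}")
```

Filtering the lines by flow first, as in the grep commands earlier in the thread, would show whether one direction stalls while the other keeps sending.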
as i cannot build software on our production system, i managed to install iperf v3.18 on pfsense by manually downloading the package and installing it via https://forums.freebsd.org/threads/how-to-download-packages-for-offline-installation-of-a-newer-version-of-freebsd.92361/ (which is quite tricky), and indeed the architectural change in iperf changed its behaviour and i cannot reproduce it with 3.18. so, i'm very sorry for the noise of hunting for a bug which does not exist in the recent version anymore. it's unfortunate that our pfsense is using such an outdated version. i will investigate why it doesn't use a more recent version. i think we can consider this resolved and not worth further investigation. thank you for your help! |
No problem. I believe that this was a real issue, either in your system or in iperf3, and the multi-thread version just happened to "fix" it, as I don't see any other change in iperf3 that may have fixed such a problem (but probably we will never know ...). |
thanks. at least it created a lot of hassle, because i never thought that this would be an iperf problem; i really thought we had a network problem, as we also had issues with carp/vrrp between those machines... one of those "weird" things you come across in admin life... |
i was analyzing a firewall problem for the last days (pfsense) which appeared to be caused by weak hardware.
while analyzing, i intensively used iperf3 to test performance between 2 pfsense boxes and got weird results when using --bidir
i thought that was part of the problem which i thought would be resolved after changing the hardware - but while all our problems are gone now, the weirdness with iperf3 remained
iperf3 box1 -> box2 is fast
iperf3 box1 <- box2 is fast
iperf3 box1 <-> box2 is dead slow
i also tested with sending two concurrent tcp datastreams via ssh in two directions, and i get also fast transfer
only iperf3 with --bidir is slow, as shown here:
ok.
now the weirdness:
and - even more weird:
when i run "iperf3 -c 192.168.1.2 --bidir" via "truss" (bsd strace), it's fast again (i.e. truss iperf3....)
what i also find is that using iperf3 with "-l 32k" is always fast, -l 64k gets stuck/slow sometimes and -l 128k always gets stuck/is slow.
i can only reproduce this with iperf3 between two pfsense boxes; it does not happen when doing iperf3 from another linux box to pfsense.
any clue, why --bidir does not work out of the box as expected ?