802.3ad bonding mode issue #49

Open
galuha opened this issue Apr 1, 2016 · 13 comments

@galuha

galuha commented Apr 1, 2016

Greetings.
CentOS 6.7. ConnectX-3 EN. OFED 3.2-2.0.0.0
I am running the sockperf TCP/IP ping-pong test,
trying to use VMA with a VLAN over 802.3ad bonded interfaces.
It works when the server and client are connected directly with two cables (both are ConnectX-3 Pro).
It fails when server (vlan over bond)<-> router 802.3ad configured<-> internet <-> client. This works on kernel stack, hence net config is fine.

sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: No messages were received from the server. Is the server down?

I have also written a test ping-pong app very similar to sockperf. Its log shows that the TCP connection is established, but no message reaches the server.

Here are the logs with VMA_TRACELEVEL=4:

vmalog_failed_802.3ad_vlan_ping-pong.txt
vmalog_failed_802.3ad_vlan_sockperf.txt

Is this a bug, or am I doing something wrong?

The network is configured through /etc/sysconfig/network-scripts, not manually.

Turning off 802.3ad on the server (keeping the VLAN over eth0) makes it work.
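
The ping-pong test described above can be approximated with a short script. The following is a minimal sketch, not the reporter's actual application and not sockperf: the function names, message size, and round count are illustrative assumptions. It sends a fixed-size message over TCP, waits for the echo, and counts completed round trips, which helps separate "connection established but no data passes" from a connect that fails outright.

```python
import socket
import threading


def recv_exact(sock, size):
    """Read exactly `size` bytes, or return fewer if the peer closes."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # peer closed the connection
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def serve_once(listener, msg_size):
    """Accept one client and echo fixed-size messages until it disconnects."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = recv_exact(conn, msg_size)
            if len(data) < msg_size:
                break
            conn.sendall(data)


def ping_pong(host, port, rounds=10, msg_size=64, timeout=5.0):
    """Return the number of round trips completed before a timeout or close."""
    completed = 0
    payload = b"x" * msg_size
    # create_connection raises if the connect itself fails or times out.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for _ in range(rounds):
            try:
                sock.sendall(payload)
                reply = recv_exact(sock, msg_size)
            except socket.timeout:
                break
            if reply != payload:
                break
            completed += 1
    return completed
```

Running the same script once on the kernel stack and once with `LD_PRELOAD=libvma.so` set would give a comparable pair of results. A return value of 0 after a successful connect would correspond to the symptom reported here.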

@galuha changed the title from "802.3ad bonding mode doesn't work on production server" to "802.3ad bonding mode issue" on Apr 1, 2016
@rosenbaumalex
Contributor

We'll have to take a deeper look at this and will update you.
Thanks for reporting.

@MohammadQurt
Contributor

Hi,
Regarding the issue you reported, we have not yet been able to reproduce it.
Based on the description and the logs you attached, we did the following to try to reproduce it (please tell us if we missed anything):

· Server-Client Configurations:

  1. Used OFED 3.2-2.0.0.0 with ConnectX3 on both server and client.
  2. Configured the server with vlan over 802.3ad bonded interface.
  3. Set the server and client on different networks.
  4. Configured a server to behave as a router to route packets between server and client.

· 802.3ad bonding configurations (tried both immediate and permanent configurations with the following parameter combinations):

  1. mode=4 miimon=100
  2. mode=4 miimon=100 fail_over_mac=0
  3. mode=4 miimon=100 fail_over_mac=1
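
Since these attempts vary only the bonding module options, it may also help to compare what the bonding driver reports at runtime on the failing server and on the lab setup. The sketch below parses text in the format the Linux bonding driver exposes under /proc/net/bonding/ and summarizes the mode and the per-slave aggregator IDs. The function names and the sample text are illustrative assumptions; the sample is not output from the reporter's machine.

```python
def parse_bonding_status(text):
    """Summarize bonding mode and per-slave state from /proc/net/bonding text."""
    summary = {"mode": None, "slaves": {}}
    current = None  # name of the slave section being parsed, if any
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = {}
        elif current is not None and key in ("MII Status", "Aggregator ID"):
            summary["slaves"][current][key] = value
    return summary


def same_aggregator(summary):
    """True if at least two slaves exist and all share one aggregator ID."""
    ids = {s.get("Aggregator ID") for s in summary["slaves"].values()}
    return len(summary["slaves"]) >= 2 and len(ids) == 1 and None not in ids


# Hypothetical sample in the driver's format; NOT taken from the issue.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Aggregator ID: 1
"""

if __name__ == "__main__":
    info = parse_bonding_status(SAMPLE)
    print(info["mode"], same_aggregator(info))
```

On a real host the input would come from reading the bond's file under /proc/net/bonding/. Slaves that report different aggregator IDs are not in the same link aggregation group, which would be worth noting in a report.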
    

· Server-Client Commands:

  1. Server: sockperf TCP, 1 socket, non-blocking.
  2. Client: sockperf TCP, ping-pong, 1 socket (non-blocking or blocking).

Also, could you please provide the following:

  1. Both logs seem to come from the server, so can you attach VMA_TRACELEVEL=4 logs for the client side?
  2. Can you please run sysinfo (https://mellanox.my.salesforce.com/sfc/p/#500000007heg/a/50000000Xab4/VneS.zpLith9.GWZ.XthrStGzhuRBH9SZS_DmBENbfI) on every VMA machine and send us the output?
  3. You said: **It fails when server (vlan over bond)<-> router 802.3ad configured<-> internet <-> client. This works on kernel stack, hence net config is fine.** What are the commands you used for router 802.3 configurations?
  4. Is the issue reproducible every time?

Thanks in advance

@galuha
Author

galuha commented May 5, 2016

Hello, here is the sysinfo file (remove the trailing .zip from the name):
sysinfo-snapshot-v3.1.7-ofed-20160504-1945.tgz.zip

I will be able to get the client-side log later.

"· Server-Client Configurations:

  1. Used OFED 3.2-2.0.0.0 with ConnectX3 on both server and client.
  2. Configured the server with vlan over 802.3ad bonded interface.
  3. Set the server and client on different networks.
  4. Configured a server to behave as a router to route packets between server and client."

This is not correct: I have an actual router between the server and the client, configured for LACP (802.3ad).

"What are the commands you used for router 802.3 configurations":
network-scripts.zip
route add default gw 172.25.24.1 device bond0.380

@MohammadQurt
Contributor

Hi galuha,

Thanks for the update.

Can you please retry with the latest VMA version here on GitHub: https://github.com/Mellanox/libvma

Thanks

@mellanoxer

I have the same issue. VMA 7.0.14 does not work correctly for TCP connections if VLAN over bonding is used. Without libvma, the TCP connection succeeds.

@mellanoxer

mellanoxer commented May 26, 2016

I've checked VMA 8.0.3. TCP connect() does not work when using VLAN over bond.
The created socket is marked as offloaded, but the program (for instance "telnet" or my own program) never returns from connect() and freezes. All routes are correct. Without VMA everything is OK and the connection is established.
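
A connect() that never returns is hard to describe precisely in a report. One way to bound it is a probe with a timeout, as in the minimal sketch below. The function name and the timeout value are illustrative assumptions, and whether the timeout fires the same way under VMA as on the kernel stack has not been verified here.

```python
import socket


def probe_connect(host, port, timeout=5.0):
    """Return 'connected', 'timeout', or 'error: ...' for one TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # Must be caught before OSError, since socket.timeout is a subclass of it.
        return "timeout"
    except OSError as exc:
        return "error: " + exc.__class__.__name__
```

Comparing the result with and without libvma preloaded would show whether the handshake completes only on the kernel path.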

@mellanoxer

mellanoxer commented Jun 1, 2016

Is there any hope of solving this issue?
What additional tests should I perform to help the libvma developers?

@galuha
Author

galuha commented Jun 1, 2016

I guess there is no hope unless the devs can reproduce the problem.
MohammadQurt, I am offering a TeamViewer session on the server with the issue. Please let me know if you would like to take a look.

@mellanoxer

mellanoxer commented Jun 1, 2016

@rosenbaumalex, to get more details, maybe you could add some extra log lines to the sources, guided by the logs I previously sent you. I will run again with your patch and send back more detailed logs.

My logs show that the interfaces are offloadable, the socket is offloadable, and the SYN was sent. As I said previously, a SYN-ACK came back in reply, but it was visible only in tcpdump, not inside libvma.
So we need more detailed logging at the point where the queues are drained and packets are received.

@rosenbaumalex
Contributor

We're still having trouble reproducing this.
We're looking at the updated log we received from your setup.
We still have no root cause.

PS: it would be best to open a ticket with support@mellanox.com to get full tracking of this incident.

@OphirMunk
Contributor

@galuha We are not able to reproduce your case. Several engineers and a Field Application Engineer have tried.
Is your TeamViewer session offer still available? Please let us know.

@galuha
Author

galuha commented Jul 18, 2016

Yes, it is. Skype me for more details. I have just sent you my Skype ID via email.

@OphirMunk
Contributor

@galuha An FAE will get back to you within 1-2 weeks.
