retest OpenMPI, jucx, infinileap and disni when SoftiWarp or SoftRoCe module is loaded #47

Closed
subes opened this issue Feb 25, 2023 · 6 comments

Comments

@subes
Contributor

subes commented Feb 25, 2023

https://github.com/zrlio/softiwarp

SoftiWARP is a software-based RDMA implementation (iWARP over TCP, so it behaves like a reliable stream transport, similar to TCP/SCTP).

subes commented Feb 25, 2023

Also retry the RMA put/get tests in JUCX.

@subes subes changed the title retest OpenMPI, jucx and disni when SoftiWarp module is loaded retest OpenMPI, jucx, infinileap and disni when SoftiWarp module is loaded Mar 2, 2023

subes commented Mar 4, 2023

https://www.reflectionsofthevoid.com/2020/07/software-rdma-revisited-setting-up.html

Reflections Of The Void Software RDMA revisited setting up SoftiWARP on Ubuntu 20.04.pdf

SoftiWARP must be attached to a network interface that is actually connected:

sudo modprobe siw
ifconfig # find a connected Ethernet or Wi-Fi interface; "lo" did not work
sudo rdma link add siw0 type siw netdev wlp112s0
rdma link # should list the new device
ifconfig # find the IP address of wlp112s0
rping -s -a 192.168.0.20 -v # server
rping -c -a 192.168.0.20 -v # client
sudo rdma link delete siw0 # run this during a test to verify that the interface is in use; the test should crash
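
The manual "look at ifconfig and pick a connected interface" step can be scripted. The following is a minimal sketch, not a tested setup script: the helper names `first_up_netdev` and `soft_rdma_setup_cmds` are hypothetical, the first helper assumes the one-line-per-interface format printed by `ip -o link show`, and the second only prints the commands from the steps above instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not part of any tool.

# Read `ip -o link show` output on stdin and print the first
# non-loopback interface whose operational state is UP.
first_up_netdev() {
  awk -F': ' '$2 != "lo" && / state UP / { print $2; exit }'
}

# Print (do not run) the setup commands for a software RDMA device.
#   $1 = kernel module (siw or rdma_rxe)
#   $2 = rdma link type (siw or rxe)
#   $3 = RDMA device name to create (e.g. siw0)
#   $4 = network interface to attach to (e.g. wlp112s0)
soft_rdma_setup_cmds() {
  printf 'sudo modprobe %s\n' "$1"
  printf 'sudo rdma link add %s type %s netdev %s\n' "$3" "$2" "$4"
  printf 'rdma link\n'
}

# Example usage on a real machine:
#   netdev="$(ip -o link show | first_up_netdev)"
#   soft_rdma_setup_cmds siw siw siw0 "$netdev"
```

Printing the commands rather than executing them keeps the sketch safe to try on a machine without the `siw` module; review the output and paste it into a terminal once it looks right.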

subes commented Mar 4, 2023

UCX does not seem to support iWARP: openucx/ucx#2507

There are some commits for it, but they say it has been untested since 2017?
At least I cannot get it to work with SoftiWARP.

The code also seems not to support iWARP because it only checks for InfiniBand?
https://github.com/openucx/ucx/blob/eadd74f9fe5b0edc081ba1ce589fb850d6809934/src/uct/ib/base/ib_md.c
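
One way to see what a transport check like that would act on is to look at what the device itself reports. The sketch below classifies a device from `ibv_devinfo` output; the helper name `rdma_flavor` is hypothetical, and it assumes the usual `transport:` and `link_layer:` lines. As far as I know, Soft-RoCE devices report an InfiniBand transport with an Ethernet link layer while SoftiWARP devices report iWARP, which would be consistent with UCX accepting the former and ignoring the latter, but that is an assumption and not something confirmed in this thread.

```shell
# Sketch only: `rdma_flavor` is a hypothetical helper name.
# Read `ibv_devinfo -d <device>` output on stdin and print one of:
#   iwarp, roce, infiniband, unknown
rdma_flavor() {
  awk '
    $1 == "transport:"  { transport = $2 }   # e.g. "iWARP" or "InfiniBand"
    $1 == "link_layer:" { link = $2 }        # e.g. "Ethernet" or "InfiniBand"
    END {
      if (transport == "iWARP") print "iwarp"
      else if (transport == "InfiniBand" && link == "Ethernet") print "roce"
      else if (transport == "InfiniBand") print "infiniband"
      else print "unknown"
    }'
}

# Example usage on a real machine:
#   ibv_devinfo -d siw0 | rdma_flavor
#   ibv_devinfo -d rxe0 | rdma_flavor
```

Pass a single device with `-d`; if several devices are printed at once, the helper only reports the last one it sees.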

subes commented Mar 4, 2023

The alternative is rdma_rxe, i.e. Soft-RoCE (similar to UDP, though it seems to preserve packet order?): https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce (that guide is outdated, see linux-rdma/rdma-core@0d2ff0e)

sudo modprobe rdma_rxe
ifconfig # find a connected Ethernet or Wi-Fi interface; "lo" did not work
sudo rdma link add rxe0 type rxe netdev wlp112s0
rdma link # should list the new device
ifconfig # find the IP address of wlp112s0
rping -s -a 192.168.0.20 -v # server
rping -c -a 192.168.0.20 -v # client
sudo rdma link delete rxe0 # run this during a test to verify that the interface is in use; the test should crash
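
The "rdma link should list the new device" check can also be automated. This is a minimal sketch with a hypothetical helper name, `rdma_link_netdev`. It assumes `rdma link` prints lines of the form `link rxe0/1 state ACTIVE physical_state LINK_UP netdev wlp112s0`; other iproute2 versions may format the line differently, in which case the parsing needs adjusting.

```shell
# Sketch only: `rdma_link_netdev` is a hypothetical helper name.
# Read `rdma link` output on stdin and print the network interface that
# the RDMA device named in $1 is bound to. Prints nothing if the device
# is not listed (for example after `rdma link delete`).
rdma_link_netdev() {
  awk -v dev="$1" '
    $1 == "link" {
      split($2, name, "/")              # "rxe0/1" -> name[1] = "rxe0"
      if (name[1] != dev) next
      for (i = 3; i < NF; i++)
        if ($i == "netdev") { print $(i + 1); exit }
    }'
}

# Example usage on a real machine:
#   rdma link | rdma_link_netdev rxe0
```

Running it before and after `sudo rdma link delete rxe0` gives a quick way to confirm that the device really disappeared while a test was running.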

HadroNIO works with Soft-RoCE. Our JUCX integration requires that the listener is not closed (which is now the default).

This post suggests that Soft-RoCE can also improve the performance of ordinary network cards: https://www.reflectionsofthevoid.com/2011/08/soft-roce-alternative-to-soft-iwarp.html
https://www.lanl.gov/projects/national-security-education-center/information-science-technology/_assets/docs/2010-si-docs/Team_CYAN_Implementation_and_Comparison_of_RDMA_Over_Ethernet_Presentation.pdf

subes added a commit that referenced this issue Mar 4, 2023
where stream is used during establish connection and tag is used
afterwards
@subes subes changed the title retest OpenMPI, jucx, infinileap and disni when SoftiWarp module is loaded retest OpenMPI, jucx, infinileap and disni when SoftiWarp or SoftRoCe module is loaded Mar 4, 2023

subes commented Mar 4, 2023

Soft-RoCe Checklist:

  • BlockingHadroNIO (does not automatically pick up RoCe)
  • HadroNIO
  • NettyHadroNIO (hangs without Bidi after ~500k records)
  • JUCX (hangs with PeerErrorHandling enabled after ~5 million records)
  • OpenMPI (does not automatically pick up RoCE; I don't know the required parameters)
  • Infinileap (impossible due to frequent JVM crashes)
  • Disni (seems to be the fastest library that works both on iWarp and RoCE)
  • Neutrino (can't get it to work; it reports "bad argument" when exchanging connection details internally)

The RoCE hangs might be due to the unreliability of the protocol: zrlio/disni#37 (comment)
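
For the OpenMPI entry in the checklist above, one untested candidate is to force Open MPI onto its UCX layer and restrict UCX to the Soft-RoCE device. The sketch below only prints such a command line; the helper name `mpirun_ucx_cmd` is hypothetical. `--mca pml ucx` and the `UCX_NET_DEVICES` environment variable are existing Open MPI and UCX options, but whether they are enough to make OpenMPI pick up `rxe0` was not verified in this issue, and it assumes an Open MPI build with UCX support.

```shell
# Sketch only: `mpirun_ucx_cmd` is a hypothetical helper name.
# Print (do not run) an mpirun command line that selects the UCX PML
# and limits UCX to port 1 of one RDMA device.
#   $1 = RDMA device (e.g. rxe0)
#   $2 = number of processes
#   remaining arguments = program and its arguments
mpirun_ucx_cmd() {
  dev="$1"
  np="$2"
  shift 2
  printf 'mpirun -np %s --mca pml ucx -x UCX_NET_DEVICES=%s:1 %s\n' "$np" "$dev" "$*"
}

# Example usage:
#   mpirun_ucx_cmd rxe0 2 ./benchmark
```

This only applies to Soft-RoCE: since UCX does not seem to support iWARP, the same approach is unlikely to help with SoftiWARP.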

SoftiWarp Checklist:

  • BlockingHadroNIO (ucx does not support iWarp)
  • HadroNIO (ucx does not support iWarp)
  • NettyHadroNIO (ucx does not support iWarp)
  • JUCX (ucx does not support iWarp)
  • OpenMPI (does not automatically pick up iWARP; I don't know the required parameters)
  • Infinileap (impossible due to frequent JVM crashes)
  • Disni (seems to be the fastest library that works both on iWarp and RoCE)
  • Neutrino (can't get it to work; it reports "bad argument" when exchanging connection details internally)

subes commented Apr 9, 2023

finished

@subes subes closed this as completed Apr 9, 2023