prov/verbs: Allow RDMACM to connect using GIDs #5605
We need to continue to use struct sockaddr_ib. That's a commonly defined structure that contains the GID in sib_addr (struct ib_addr), plus other needed fields, like the pkey and scope_id. A GID by IBTA spec definition is an ipv6 address (except that in reality it's not), but it's safe to use the inet functions to process it. I was suggesting we could construct the sockaddr_ib ourselves without calling into the librdmacm functions. As it is, we're letting the rdmacm pick the other fields in the sockaddr_ib without help. A GID could be associated with multiple pkeys, and despite G in GID meaning 'global', two devices could be assigned the same GID if they are on separate subnets.
@shefty, your comment raises several questions/concerns to me. The
Regarding your statement below:
Does it mean that the GID is not sufficient for connection establishment and it needs to be associated with a pkey?
To establish a connection, we need all of the fields in sockaddr_ib, except flowinfo. A pkey is required. The scope_id is used to select a device when two ports have the same GID. The latter can occur if the ports are on separate subnets, and the SMs assign the same GID. Honestly, I think that will be a very rare occurrence, but it's just an index. The SID is the equivalent to the tcp port number. That is something the rdmacm knows how to fill in. I guess we could copy that formatting code. It just writes a 16-bit port number into a 64-bit value, with the other bits fixed based on the port space. There are requirements or changes on the CM private data when native IB addressing is used, but I don't remember what they are.
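For reference, the SID formatting described above can be sketched as follows. This is a minimal illustration, not the rdmacm code: the helper name example_make_sid is hypothetical, and the port-space values 0x0106 (TCP) and 0x013F (IB) are assumed to match the RDMA_PS_* definitions. The result is in host byte order; sockaddr_ib stores sib_sid in network byte order, so a caller would still have to convert it.

```c
#include <stdint.h>

/* Port-space values, assumed to match RDMA_PS_TCP and RDMA_PS_IB. */
#define EXAMPLE_PS_TCP 0x0106
#define EXAMPLE_PS_IB  0x013F

/*
 * Hypothetical helper: place the 16-bit port in the low 16 bits of the
 * 64-bit service ID and the port space in the 16 bits above it.
 * All remaining upper bits stay zero.
 */
static uint64_t example_make_sid(uint16_t port_space, uint16_t port)
{
	return ((uint64_t)port_space << 16) | port;
}
```

With port 0 and the IB port space, the port-space part of the result is 0x13f, which is the same value that shows up as <port_space> in the address strings discussed in this PR.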
Guys, I have updated the patch but it doesn't include all the changes you requested so far.
@shefty @rajachan, I have resurrected the patch with a new version that includes all your review comments. The new format for the
For example:
Please let me know what you think of this new version. And thanks for your time and your help to get this feature approved in Libfabric!
Please break the first patch up into multiple patches. E.g. changes to formatting the sockaddr_ib string, converting the string to sib, changes to selecting the port space, getting the gids during device init, etc.
@shefty Thanks for the review! I have addressed all of your comments and split up the changes into multiple patches as requested.
0128aee to 81f5bee
bf2b439 to 5e5402d
Thanks - splitting up the patches helps a lot with the review.
src/common.c
Outdated
@@ -318,7 +320,20 @@ const char *ofi_straddr(char *buf, size_t *len,
			str, *((uint16_t *)addr + 8), *((uint32_t *)addr + 5));
		break;
	case FI_SOCKADDR_IB:
		size = snprintf(buf, *len, "fi_sockaddr_ib://%p", addr);
		memset(str, 0, sizeof(str));
		if (!inet_ntop(AF_INET6, ((uint64_t *)addr + 1), str, INET6_ADDRSTRLEN))
We need to cast address to some defined structure, rather than reading a bunch of byte offsets.
I have created a structure ofi_sockaddr_ib similar to sockaddr_ib, which avoids byte offsets.
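To illustrate the point about structures versus byte offsets, here is a minimal sketch of formatting the address through named fields. The struct example_sockaddr_ib below is a hypothetical mirror whose field order is assumed to follow the Linux struct sockaddr_ib; it is not the ofi_sockaddr_ib from the patch. The helper names and the choice to print scope_id in hex are assumptions as well.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of struct sockaddr_ib (field order assumed). */
struct example_sockaddr_ib {
	uint16_t sib_family;
	uint16_t sib_pkey;       /* network byte order */
	uint32_t sib_flowinfo;
	uint8_t  sib_addr[16];   /* GID, at byte offset 8 */
	uint64_t sib_sid;        /* network byte order */
	uint64_t sib_sid_mask;
	uint64_t sib_scope_id;   /* host byte order */
};

/* Decode a big-endian 64-bit value without relying on <endian.h>. */
static uint64_t example_be64_to_host(uint64_t be)
{
	uint8_t b[8];
	uint64_t v = 0;
	int i;

	memcpy(b, &be, sizeof(b));
	for (i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}

/* Format as fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id>. */
static int example_format_sib(char *buf, size_t len,
			      const struct example_sockaddr_ib *sib)
{
	char gid[INET6_ADDRSTRLEN];
	uint64_t sid = example_be64_to_host(sib->sib_sid);

	if (!inet_ntop(AF_INET6, sib->sib_addr, gid, sizeof(gid)))
		return -1;

	return snprintf(buf, len, "fi_sockaddr_ib://[%s]:%x:%x:%llx", gid,
			(unsigned int)ntohs(sib->sib_pkey),
			(unsigned int)((sid >> 16) & 0xffff),
			(unsigned long long)sib->sib_scope_id);
}
```

Every access goes through a named member, so the GID no longer has to be located with ((uint64_t *)addr + 1), and the pkey and SID byte-order conversions are visible at the point of use.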
Unrelated to this PR but FI_ADDR_IB_UD also uses a bunch of byte offsets, which should be converted into a known structure as well:
	case FI_ADDR_IB_UD:
		memset(str, 0, sizeof(str));
		if (!inet_ntop(AF_INET6, addr, str, INET6_ADDRSTRLEN))
			return NULL;
		size = snprintf(buf, *len, "fi_addr_ib_ud://"
				"%s" /* GID */ ":%" PRIx32 /* QPN */
				"/%" PRIx16 /* LID */ "/%" PRIx16 /* P_Key */
				"/%" PRIx8 /* SL */,
				str, *((uint32_t *)addr + 4),
				*((uint16_t *)addr + 10),
				*((uint16_t *)addr + 11),
				*((uint8_t *)addr + 26));
Yeah, that shouldn't be defined like that.
Ok, I'll fix that code in a separate patch.
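As a rough idea of what that follow-up could look like, the byte offsets in the FI_ADDR_IB_UD case map onto a structure like the one below. This is only a sketch: struct example_ib_ud_addr and example_format_ib_ud are hypothetical names, and the layout is inferred purely from the offsets in the snippet above (GID at 0, QPN at 16, LID at 20, P_Key at 22, SL at 26). Bytes 24 and 25 are not read by that snippet, so they are treated as opaque here.

```c
#include <arpa/inet.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout inferred from the offsets used in the snippet. */
struct example_ib_ud_addr {
	uint8_t  gid[16];    /* offset 0 */
	uint32_t qpn;        /* offset 16: *((uint32_t *)addr + 4) */
	uint16_t lid;        /* offset 20: *((uint16_t *)addr + 10) */
	uint16_t pkey;       /* offset 22: *((uint16_t *)addr + 11) */
	uint16_t unused;     /* offset 24: not read by the snippet */
	uint8_t  sl;         /* offset 26: *((uint8_t *)addr + 26) */
	uint8_t  padding[5];
};

/* Same output format as the snippet, but using named fields. */
static int example_format_ib_ud(char *buf, size_t len,
				const struct example_ib_ud_addr *a)
{
	char gid[INET6_ADDRSTRLEN];

	if (!inet_ntop(AF_INET6, a->gid, gid, sizeof(gid)))
		return -1;

	return snprintf(buf, len, "fi_addr_ib_ud://%s:%" PRIx32
			"/%" PRIx16 "/%" PRIx16 "/%" PRIx8,
			gid, a->qpn, a->lid, a->pkey, a->sl);
}
```

Checking the member offsets with offsetof() against the values used today would confirm that such a conversion does not change the string output.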
f06c9e1 to 0e08392
@shefty Thanks for the review! I have updated the patch series with most of the changes you requested. There are still some open questions/points that need to be addressed (please see my comments above).
4dab702 to 62d6a66
This patch defines 2 new formats for fi_sockaddr_ib addresses:
fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id>
and:
fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id>:<port>

Change-Id: If7900b71e01adbed1510f35fbdd298800ca75758
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
…rs ... ... in the system, independent of whether ipoib is enabled on that pair or not. That change allows fi_getinfo() to retrieve IB interfaces in the case ipoib is not available on the system. Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
@shefty I have updated the patch series with all your review comments. I also fixed a (bad) issue where OFI_RDMA_PS_IB was mistakenly set to the wrong value.
This is the final patch of the series; it adds support for GID-based connection establishment. The Verbs provider can now connect directly to the network adapters using the GID. In other words, the patch allows Libfabric to be used even if no IP address is set for the InfiniBand interfaces.

Relying on IP addresses for connection establishment has significant drawbacks:
- IP addresses must be set up and maintained for every IB interface.
- With multirail (multiple local interfaces that belong to the same network subnet), specific IP routes are required to prevent one interface from replying on behalf of another. Connection establishment would fail otherwise.

The GID can be read from the src_addr field returned by "fi_info -p verbs -v". Example of output:
    src_addr: fi_sockaddr_ib://[fe80::248a:703:1c:dc0c]:ffff:13f:0

The patch also modifies fabtest so anybody can start testing this new feature. A new option -F specifies the address format used for the source/destination addresses. After finding the GID of the interface that will be used for the server, one can run the following commands with fabtest:
Server:
    fi_msg_bw -s [fe80::248a:703:1c:dc0c]:ffff:13f:0 -e msg \
        -p verbs -F fi_sockaddr_ib
Client:
    fi_msg_bw -e msg -p verbs \
        -F fi_sockaddr_ib [fe80::248a:703:1c:dc0c]:ffff:13f:0

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
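For readers who want to experiment with the address string outside of Libfabric, the following is a minimal sketch of splitting fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id> back into its fields. The function example_parse_sib is hypothetical and is not the parser added by this series; it assumes all numeric fields are hexadecimal, as in the example output above, and it does not handle the variant with a trailing <port>.

```c
#include <arpa/inet.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical parser for the 4-field form of the string.
 * Returns 0 on success, -1 if the string does not match.
 */
static int example_parse_sib(const char *str, uint8_t gid[16],
			     uint16_t *pkey, uint16_t *port_space,
			     uint64_t *scope_id)
{
	char gid_str[INET6_ADDRSTRLEN];
	int n;

	/* "%45[^]]" reads up to 45 characters that are not ']'. */
	n = sscanf(str, "fi_sockaddr_ib://[%45[^]]]:%" SCNx16 ":%" SCNx16
		   ":%" SCNx64, gid_str, pkey, port_space, scope_id);
	if (n != 4)
		return -1;

	/* The GID is written like an IPv6 address, so inet_pton() works. */
	if (inet_pton(AF_INET6, gid_str, gid) != 1)
		return -1;

	return 0;
}
```

The brackets around the GID are what make the string unambiguous: the GID itself contains colons, so the remaining fields can only be split reliably after the closing bracket.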
With AF_INET, user space should fill out the RDMA CM header and pass that to the kernel.

References:
- RDMA CM header format: https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/cma.h#L105
- https://www.spinics.net/lists/linux-rdma/msg22381.html
- IBTA Architecture Specification Vol 1. Annex A11: RDMA IP CM Service.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
Remove verbs specific functions to manipulate sockaddr addresses and use the ofi functions provided by common code. Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
@shefty Would you please let me know the failure reported by the Intel CI pipeline? I ran fabtest locally with the verbs and it didn't report any error.
Intel CI failure is unrelated
Only a couple minor changes needed, which could even be added as separate patches.
prov/verbs/src/verbs_cm.c
Outdated
{
	struct vrb_rdma_cm_hdr *rdma_cm_hdr = priv_data;

	rdma_cm_hdr->ip_version = 0;
This should be 6 << 4
I created a new patch to fix the port and the ip_version.
prov/verbs/src/verbs_cm.c
Outdated
	struct vrb_rdma_cm_hdr *rdma_cm_hdr = priv_data;

	rdma_cm_hdr->ip_version = 0;
	rdma_cm_hdr->port = 0;
This is the source port. Lower 16-bits of src_addr->sib_sid.
Thanks. I created a new patch to fix the port and the ip_version.
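Putting the two review comments together, the header setup could look roughly like the sketch below. It is not the code from the follow-up patch: struct example_rdma_cm_hdr is a hypothetical stand-in whose layout is assumed to follow the RDMA CM header in librdmacm's cma.h, example_fill_cm_hdr is a made-up name, and storing the port in network byte order is an assumption. The SID is taken in host byte order for simplicity.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the RDMA CM header (layout assumed). */
struct example_rdma_cm_hdr {
	uint8_t  cma_version;
	uint8_t  ip_version;   /* IP version in bits 7:4 */
	uint16_t port;         /* source port, network byte order (assumed) */
	uint32_t src_addr[4];
	uint32_t dst_addr[4];
};

/*
 * Fill the header for native IB addressing:
 * - ip_version is 6 << 4, since GIDs are carried like IPv6 addresses;
 * - port is the lower 16 bits of the source SID (host byte order here).
 */
static void example_fill_cm_hdr(struct example_rdma_cm_hdr *hdr,
				const uint8_t src_gid[16],
				const uint8_t dst_gid[16],
				uint64_t src_sid_host)
{
	memset(hdr, 0, sizeof(*hdr));
	hdr->cma_version = 0;
	hdr->ip_version = 6 << 4;
	hdr->port = htons((uint16_t)(src_sid_host & 0xffff));
	memcpy(hdr->src_addr, src_gid, 16);
	memcpy(hdr->dst_addr, dst_gid, 16);
}
```

In the provider, the SID would come from src_addr->sib_sid, which is stored in network byte order, so the real code has to account for that conversion.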
Can you fix the commit message? I'm not sure 'based on Sean's review' will be particularly useful when reading the git log. :)
I fixed the commit message. Sorry for the rush.
The patch also simplifies an 'if' statement. Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
bot:aws:retest
CI failure was unrelated, but restarted it anyway. Waiting for CI to finish, but this looks good to merge.
That's good news! Thank you!
@shefty - I'm assuming with these changes the intent is not supporting GIDs using the IB Verbs XRC transport at this time?
This only updates the rdma_cm path. I think Sylvain, who submitted the patches, is only concerned with RC QPs and MSG EPs.
Thanks, sounds good. I had taken a look at the changes and just wanted to make sure he knew there would be additional work required to extend this to XRC if that transport was desired. Since XRC is intended to be used with RxM, it makes sense to me to not worry about XRC support at this time.
@swelch. I'm sorry for my late reply - I didn't notice there was a discussion on-going in the PR. I am not interested in XRC for the moment, but that might change in the future :)
Hi,
I recently wrote an email to the libfabric-users mailing list to ask if there was a way to run the verbs provider without RDMACM. I haven't received any response so far.
I initially assumed that RDMACM required an IP address to be set for every IB interface in order to work. I was wrong: RDMACM can actually deal with GIDs directly for connection establishment.
The patch in this PR allows the Verbs provider to connect directly to the network adapters using the GID. In other words, the patch allows Libfabric to be used even if no IP address is set for the InfiniBand interfaces.
Relying on IP addresses for connection establishment has significant drawbacks:
- IP addresses must be set up and maintained for every IB interface.
- With multirail (multiple local interfaces that belong to the same network subnet), specific IP routes are required to prevent one interface from replying on behalf of another. Connection establishment would fail otherwise.
The GID can be read from the src_addr field returned by "fi_info -p verbs -v". Example of output:
src_addr: fi_sockaddr_ib://fe80:0000:0000:0000:248a:0703:003f:1f6a
The patch also modifies fabtest so anybody can start testing this new feature. A new option -F specifies the address format used for the source/destination addresses.
After finding the GID of the interface that will be used for the server, one can run the following commands with fabtest:
Server:
fi_msg_bw -s fe80:0000:0000:0000:248a:0703:003f:1f6a -e msg -p verbs -F FI_SOCKADDR_IB
Client:
fi_msg_bw -e msg -p verbs -F FI_SOCKADDR_IB fe80:0000:0000:0000:248a:0703:003f:1f6a
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>