Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: Allow RDMACM to connect using GIDs #5605

Merged
merged 13 commits into from
Jul 27, 2020

Conversation

sydidelot
Copy link
Member

Hi,

I recently wrote an email to the libfabric-users mailing list to ask if there was a way to run the verbs provider without RDMACM. I didn't get any response so far.

I initially supposed that RDMACM required to have an IP address set for every IB interfaces in order to work. I was wrong, and RDMACM can actually deal with GIDs directly for connection establishment.

The patch in this PR allows the Verbs provider to directly connect to the network adapters using the GID. In other words, the patch allows to use Libfabric even if there is no IP address set for the Infiniband interfaces.

There are significant issues of issues IP addresses for connection establishment:

  • It requires to set up/maintain IP addresses for every IB interfaces.
  • In the context of multirail (multiple local interfaces that belong to the same network subnet), it requires specific IP routes to prevent an interface to reply for another one. Connection establishment would fail otherwise.

The GID can be accessed by looking at the field src_addr returned by fi_info -p verbs -v.

Example of output:
src_addr: fi_sockaddr_ib://fe80:0000:0000:0000:248a:0703:003f:1f6a

The patch also modifies fabtest so anybody can start testing this new feature. A new option -F allows to specify the address format that is use for the source/destination addresses.

After figuring out the GID of interface that will be used for the server, one can run the following commands with fabtest:

Server:
fi_msg_bw -s fe80:0000:0000:0000:248a:0703:003f:1f6a -e msg -p verbs -F FI_SOCKADDR_IB

Client:
fi_msg_bw -e msg -p verbs -F FI_SOCKADDR_IB fe80:0000:0000:0000:248a:0703:003f:1f6a

Signed-off-by: Sylvain Didelot sdidelot@ddn.com

@sydidelot sydidelot requested review from shefty and arn314 February 5, 2020 14:10
fabtests/common/shared.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Show resolved Hide resolved
src/common.c Outdated Show resolved Hide resolved
src/common.c Outdated Show resolved Hide resolved
fabtests/benchmarks/msg_bw.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
src/common.c Outdated Show resolved Hide resolved
fabtests/benchmarks/msg_bw.c Show resolved Hide resolved
@shefty
Copy link
Member

shefty commented Feb 6, 2020

We need to continue to use struct sockaddr_ib. That's a commonly defined structure that contains the GID in sib_addr (struct ib_addr), plus other needed fields, like the pkey and scope_id. A GID by IBTA spec definition is an ipv6 address (except that in reality it's not), but it's safe to use the inet functions to process it.

I was suggesting we could construct the sockaddr_ib ourselves without calling into the librdmacm functions. As it is, we're letting the rdmacm pick the other fields in the sockaddr_ib without help. A GID could be associated with multiple pkeys, and despite G in GID meaning 'global', two devices could be assigned the same GID if they are on separate subnets.

@sydidelot
Copy link
Member Author

@shefty, your comment raises several questions/concerns to me.

The sockaddr_ib strcuture is defined as below:

    unsigned short int  sib_family; /* AF_IB */                
    __be16          sib_pkey;                                  
    __be32          sib_flowinfo;                              
    struct ib_addr      sib_addr;                              
    __be64          sib_sid;                                   
    __be64          sib_sid_mask;                              
    __u64           sib_scope_id;                              
};                                                             

sib_addr can be set using ibv_query_gid() and sib_pkey with ibv_query_pkey(). I'm not exactly sure how to retrieve the other members of the structure without librdmacm.

Regarding your statement below:

A GID could be associated with multiple pkeys, and despite G in GID meaning 'global', two devices could be assigned the same GID if they are on separate subnets.

Does it mean that the GID is not sufficient for connection establishment and it needs to be associated with a pkey?
As for now, the patch assumes that the GID is the only required data for connection establishment. We will need to change this as well.

@shefty
Copy link
Member

shefty commented Feb 7, 2020

To establish a connection, we need all of the fields in sockaddr_ib, except flowinfo. A pkey is required. The scope_id is used to select a device when two ports have the same GID. The latter can occur if the ports are on separate subnets, and the SMs assign the same GID. Honestly, I think that will be a very rare occurrence, but it's just an index. The SID is the equivalent to the tcp port number. That is something the rdmacm knows how to fill in. I guess we could copy that formatting code. It just writes a 16-bit port number into a 64-bit value, with the other bits fixed based on the port space. There are requirements or changes on the CM private data when native IB addressing is used, but I don't remember what they are.

@sydidelot
Copy link
Member Author

Guys, I have updated the patch but it doesn't include all the changes you requested so far.
Sorry, this PR was inactive for a while but I'm still very interested in using GIDs to establish connections.
@shefty would you please have a look at the function vrb_get_rdma_addrinfo() to verify if the construction of the input argument struct rdma_addrinfo looks correct to you? (I took some inspiration from rdmacm). I will address the remaining comments after I'm sure I'm going to the right direction :-). Thank you!

@sydidelot
Copy link
Member Author

@shefty @rajachan, I have resurrected the patch with a new version that includes all your review comments. The new format for the fi_sockaddr_ib is now the following:

fi_sockaddr_ib://[guid]:pkey:port_space:scope_id

For example:

fi_sockaddr_ib://[fe80::248a:703:1c:dc0c]:ffff:13f:0

Please let me know what you think of this new version. And thanks for your time and your help to get this feature approved in Libfabric!

@sydidelot sydidelot requested review from shefty and rajachan May 30, 2020 14:31
Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please break the first patch up into multiple patches. E.g. changes to formatting the sockaddr_ib string, converting the string to sib, changes to selecting the port space, getting the gids during device init, etc.

prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
@sydidelot
Copy link
Member Author

sydidelot commented Jun 3, 2020

Please break the first patch up into multiple patches. E.g. changes to formatting the sockaddr_ib string, converting the string to sib, changes to selecting the port space, getting the gids during device init, etc.

@shefty Thanks for the review! I have addressed all of your comments and split up the changes into multiple patches as requested.

@sydidelot sydidelot requested a review from shefty June 3, 2020 16:59
@sydidelot sydidelot force-pushed the rdmacm_gid branch 2 times, most recently from 0128aee to 81f5bee Compare June 8, 2020 17:14
@sydidelot sydidelot force-pushed the rdmacm_gid branch 2 times, most recently from bf2b439 to 5e5402d Compare June 15, 2020 13:06
Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - splitting up the patches helps a lot with the review.

fabtests/common/shared.c Show resolved Hide resolved
src/common.c Outdated
@@ -318,7 +320,20 @@ const char *ofi_straddr(char *buf, size_t *len,
str, *((uint16_t *)addr + 8), *((uint32_t *)addr + 5));
break;
case FI_SOCKADDR_IB:
size = snprintf(buf, *len, "fi_sockaddr_ib://%p", addr);
memset(str, 0, sizeof(str));
if (!inet_ntop(AF_INET6, ((uint64_t *)addr + 1), str, INET6_ADDRSTRLEN))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to cast address to some defined structure, rather than reading a bunch of byte offsets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a structure ofi_sockaddr_ib similar to sockaddr_ib, which avoids byte offsets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR but FI_ADDR_IB_UD also uses a bunch of byte offsets, which should be converted into a known structure as well:

case FI_ADDR_IB_UD:                                              
    memset(str, 0, sizeof(str));                                 
    if (!inet_ntop(AF_INET6, addr, str, INET6_ADDRSTRLEN))       
        return NULL;                                             
    size = snprintf(buf, *len, "fi_addr_ib_ud://"                
            "%s" /* GID */ ":%" PRIx32 /* QPN */                 
            "/%" PRIx16 /* LID */ "/%" PRIx16 /* P_Key */        
            "/%" PRIx8 /* SL */,                                 
            str, *((uint32_t *)addr + 4),                        
            *((uint16_t *)addr + 10),                            
            *((uint16_t *)addr + 11),                            
            *((uint8_t *)addr + 26));                            

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that shouldn't be defined like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll fix that code in a separate patch.

src/common.c Outdated Show resolved Hide resolved
src/common.c Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.h Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
prov/verbs/src/fi_verbs.c Outdated Show resolved Hide resolved
@sydidelot sydidelot force-pushed the rdmacm_gid branch 2 times, most recently from f06c9e1 to 0e08392 Compare June 29, 2020 15:21
@sydidelot
Copy link
Member Author

@shefty Thanks for the review! I have updated the patch series with most of the changes you requested. There are still some open questions/points that need to be addressed (please see my comments above).

@sydidelot sydidelot force-pushed the rdmacm_gid branch 2 times, most recently from 4dab702 to 62d6a66 Compare June 30, 2020 10:09
@sydidelot sydidelot requested a review from shefty June 30, 2020 11:40
This patch defines 2 new formats for fi_sockaddr_if addresses:
fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id>
and:
fi_sockaddr_ib://[<gid>]:<pkey>:<port_space>:<scope_id>:<port>

Change-Id: If7900b71e01adbed1510f35fbdd298800ca75758
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
…rs ...

... in the system, independent of whether ipoib is enabled on that pair
or not. That change allows fi_getinfo() to retrieve IB interfaces
in the case ipoib is not available on the system.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
@sydidelot
Copy link
Member Author

@shefty I have updated the patch series with all your review comments. I also fixed a (bad) issue where OFI_RDMA_PS_IB was mistakenly set to the wrong value.

This is the final patch of the series that adds the support
of GID-base connection establishment.

The Verbs provider now can directly connect to the network
adapters using the GID. In other words, the patch allows
to use Libfabric even if there is no IP address set for the
Infiniband interfaces.

There are significant issues of issues IP addresses for connection
establishment:
- It requires to set up/maintain IP addresses for every IB interfaces.
- In the context of multirail (multiple local interfaces that belong
  to the same network subnet), it requires specific IP routes to
  prevent an interface to reply for another one. Connection
  establishment would fail otherwise.

The GID can be accessed  by looking at the field src_addr returned
by "fi_info -p verbs -v".

Example of output:
src_addr: fi_sockaddr_ib://[fe80::248a:703:1c:dc0c]:ffff:13f:0

The patch also modifies fabtest so anybody can start testing this
new feature. A new option -F allows to specify the address format
that is use for the source/destination addresses.

After figuring out the GID of interface that will be used for the
server, one can run the following commands with fabtest:

Server:
fi_msg_bw -s [fe80::248a:703:1c:dc0c]:ffff:13f:0 -e msg \
-p verbs -F fi_sockaddr_ib

Client:
fi_msg_bw -e msg -p verbs \
-F fi_sockaddr_ib [fe80::248a:703:1c:dc0c]:ffff:13f:0

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
With AF_INET, user space should fill out the RDMA CM header
and pass that to the kernel.

References:
- RDMA CM header format:
  https://github.com/linux-rdma/rdma-core/blob/master/\
  librdmacm/cma.h#L105
- https://www.spinics.net/lists/linux-rdma/msg22381.html
- IBTA Architecture Specification Vol 1. Annex A11: RDMA
  IP CM Service.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
Remove verbs specific functions to manipulate sockaddr
addresses and use the ofi functions provided by common
code.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
@sydidelot
Copy link
Member Author

@shefty Would you please let me know the failure reported by the Intel CI pipeline? I ran fabtest locally with the verbs and it didn't report any error.

@shefty
Copy link
Member

shefty commented Jul 23, 2020

Intel CI failure is unrelated

name:   fi_multinode -C rma -n 3 -p "psm2"
07:15:42    result: Fail

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a couple minor changes needed, which could even be added as separate patches.

prov/verbs/src/fi_verbs.c Show resolved Hide resolved
{
struct vrb_rdma_cm_hdr *rdma_cm_hdr = priv_data;

rdma_cm_hdr->ip_version = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 6 << 4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new patch to fix the port and the ip_version.

struct vrb_rdma_cm_hdr *rdma_cm_hdr = priv_data;

rdma_cm_hdr->ip_version = 0;
rdma_cm_hdr->port = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source port. Lower 16-bits of src_addr->sib_sid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I created a new patch to fix the port and the ip_version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the commit message? I'm not sure 'based on Sean's review' will be particularly useful when reading the git log. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the commit message. Sorry for the rush.

@sydidelot sydidelot requested a review from shefty July 23, 2020 18:39
prov/verbs/src/verbs_cm.c Outdated Show resolved Hide resolved
The patch also simplifies a 'if' statement.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
@shefty
Copy link
Member

shefty commented Jul 27, 2020

bot:aws:retest

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI failure was unrelated, but restarted it anyway. Waiting for CI to finish, but this looks good to merge.

@sydidelot
Copy link
Member Author

That's good news! Thank you!

@swelch
Copy link
Contributor

swelch commented Jul 27, 2020

@shefty - I'm assuming with these changes the intent is not supporting GIDs using the IB Verbs XRC transport at this time?

@shefty
Copy link
Member

shefty commented Jul 27, 2020

This only updates the rdma_cm path. I think Sylvain, who submitted the patches, is only concerned with RC QPs and MSG EPs.

@swelch
Copy link
Contributor

swelch commented Jul 27, 2020

This only updates the rdma_cm path. I think Sylvain, who submitted the patches, is only concerned with RC QPs and MSG EPs.

Thanks, sounds good. I had taken a look at the changes and just wanted to make sure he knew there would be additional work required to extend this to XRC if that transport was desired. Since XRC is intended to be used with RxM, it makes sense to me to not worry about XRC support at this time.

@shefty shefty merged commit e4b3367 into ofiwg:master Jul 27, 2020
@sydidelot
Copy link
Member Author

@swelch. I'm sorry for my late reply - I didn't notice there was a discussion on-going in the PR. I am not interested in XRC for the moment, but that might change in the future :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants