layout | title | date | categories | tags | excerpt |
---|---|---|---|---|---|
post | Understanding the RoCE network protocol | 2017-11-09 07:20:30 -0800 | Network | RDMA RoCE | Understanding the RoCE network protocol |
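RoCE is short for RDMA over Converged Ethernet; it makes RDMA possible over an Ethernet network. The other approach is RDMA over InfiniBand. RoCE (strictly speaking, RoCEv1) is therefore a link-layer protocol that plays the same role as the InfiniBand link layer.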
There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed.
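The difference between the two versions shows up directly on the wire: RoCE v1 frames use the dedicated Ethertype 0x8915, while RoCE v2 packets are ordinary UDP datagrams sent to destination port 4791. The following is a minimal sketch of how a capture tool might tell them apart. The function name `classify_roce` is made up for illustration, only a single 802.1Q tag is skipped, and IPv6 extension headers are not handled.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype used by RoCE v1
ROCE_V2_UDP_PORT = 4791     # UDP destination port used by RoCE v2
ETHERTYPE_VLAN = 0x8100
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
IPPROTO_UDP = 17


def classify_roce(frame: bytes):
    """Return 'RoCEv1', 'RoCEv2', or None for a raw Ethernet frame (hypothetical helper)."""
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset = 14
    if ethertype == ETHERTYPE_VLAN:  # skip one 802.1Q tag if present
        if len(frame) < 18:
            return None
        (ethertype,) = struct.unpack_from("!H", frame, 16)
        offset = 18

    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCEv1"

    if ethertype == ETHERTYPE_IPV4:
        if len(frame) < offset + 20:
            return None
        header_len = (frame[offset] & 0x0F) * 4  # IHL is counted in 32-bit words
        protocol = frame[offset + 9]
        l4_offset = offset + header_len
    elif ethertype == ETHERTYPE_IPV6:
        if len(frame) < offset + 40:
            return None
        protocol = frame[offset + 6]  # next header; extension headers not handled
        l4_offset = offset + 40
    else:
        return None

    if protocol != IPPROTO_UDP or len(frame) < l4_offset + 8:
        return None
    (dst_port,) = struct.unpack_from("!H", frame, l4_offset + 2)
    return "RoCEv2" if dst_port == ROCE_V2_UDP_PORT else None
```

Because RoCE v2 is identified only by its UDP port, any router that forwards UDP can carry it, whereas a RoCE v1 frame never leaves its broadcast domain.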
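For a RoCE interconnect, the hardware side requires an L2 Ethernet switch that supports IEEE DCB, and the compute nodes require RoCE-capable network adapters: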
On the hardware side, basically you need an L2 Ethernet switch with IEEE DCB (Data Center Bridging, aka Converged Enhanced Ethernet) with support for priority flow control.
On the compute or storage server end, you need an RoCE-capable network adapter.
The corresponding frame format is as follows:
For the protocol specification, see InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE.
Example:
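As a rough illustration of the RoCEv1 layout, the sketch below walks a raw frame: a 14-byte Ethernet header, the 40-byte GRH (laid out like an IPv6 header, with the source and destination GIDs in its last 32 bytes), then the 12-byte BTH (Base Transport Header). The helper names and dictionary keys are my own, VLAN tags are not handled, and the ICRC is not verified.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915
ETH_HLEN = 14  # destination MAC, source MAC, Ethertype
GRH_LEN = 40   # Global Routing Header
BTH_LEN = 12   # Base Transport Header


def parse_bth(bth: bytes) -> dict:
    """Decode the 12-byte Base Transport Header (hypothetical helper)."""
    opcode, flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", bth[:BTH_LEN])
    return {
        "opcode": opcode,
        "solicited_event": bool(flags & 0x80),
        "pad_count": (flags >> 4) & 0x3,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,             # low 24 bits
        "ack_request": bool(psn_word & 0x80000000),
        "psn": psn_word & 0xFFFFFF,                # low 24 bits
    }


def parse_rocev1(frame: bytes) -> dict:
    """Extract the GIDs and BTH fields from an untagged RoCEv1 frame."""
    if len(frame) < ETH_HLEN + GRH_LEN + BTH_LEN:
        raise ValueError("frame too short for Ethernet + GRH + BTH")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ROCE_V1_ETHERTYPE:
        raise ValueError("not a RoCEv1 frame")
    grh = frame[ETH_HLEN:ETH_HLEN + GRH_LEN]
    fields = parse_bth(frame[ETH_HLEN + GRH_LEN:])
    fields["sgid"] = grh[8:24].hex()   # source GID
    fields["dgid"] = grh[24:40].hex()  # destination GID
    return fields
```

Note that nothing in this layout is an IP header, which is why a router has no way to forward the frame.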
Because RoCEv1 frames carry no IP header, they can only be exchanged within a single L2 subnet. RoCEv2 therefore extends RoCEv1 by replacing the GRH (Global Routing Header) with a UDP header + IP header:
RoCEv2 is a straightforward extension of the RoCE protocol that involves a simple modification of the RoCE packet format.
Instead of the GRH, RoCEv2 packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.
The frame format is as follows:
Example:
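To make the encapsulation concrete, here is a minimal sketch that wraps an InfiniBand transport packet (BTH + payload + ICRC, assumed to be built elsewhere) in UDP and IPv4 headers. The function names are invented for illustration. Leaving the UDP checksum at 0 and setting the Don't Fragment flag are simplifying assumptions of this sketch, and it does not compute the ICRC.

```python
import socket
import struct

ROCE_V2_UDP_PORT = 4791  # UDP destination port used by RoCE v2


def ipv4_checksum(header: bytes) -> int:
    """Standard Internet checksum over an IPv4 header of even length."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rocev2_encapsulate(src_ip: str, dst_ip: str, src_port: int, ib_packet: bytes) -> bytes:
    """Prepend IPv4 and UDP headers to BTH + payload + ICRC (hypothetical helper)."""
    udp_len = 8 + len(ib_packet)
    # UDP checksum left at 0, i.e. unused over IPv4 (assumption of this sketch)
    udp = struct.pack("!HHHH", src_port, ROCE_V2_UDP_PORT, udp_len, 0)

    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                 # version 4, header length 5 words
        0,                    # DSCP/ECN
        20 + udp_len,         # total length
        0,                    # identification
        0x4000,               # Don't Fragment
        64,                   # TTL
        socket.IPPROTO_UDP,
        0,                    # checksum placeholder
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
    )
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    return ip + udp + ib_packet
```

For example, `rocev2_encapsulate("10.0.0.1", "10.0.0.2", 49152, ib_packet)` yields a datagram that any IP router can forward; the addresses and source port here are placeholders.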
It is also worth noting that the Linux kernel, as of 4.9, provides a software implementation of RoCEv2, known as Soft-RoCE.