Does UCP multi-rail support auto rail fail over? #9809
ucx_info on the server side:
ucx_info on the client side:
By analyzing the related code I got some idea; please correct me if I'm wrong. According to Line 436 in ae47af5, which eventually calls Line 1488 in ae47af5, it seems that no matter which rail fails in multi-rail mode, all rails are flushed/disconnected and all pending requests are purged. As patch #7672 explained, it seems we cannot do multi-rail failover because a single RNDV request can utilize all rails. If that is the case, how about binding a request to a single rail (different requests can still use different rails, of course)? Then rail error handling could be done inside a single UCT ep while keeping the UCP ep working without interruption (i.e. without calling the upper-level user error handler...).
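If that reading is right, then from the application's point of view the purge just shows up as every in-flight request completing through its callback with an error status; something like the following sketch (my own simplified code, not UCP internals), where queue_for_resend() is a hypothetical application hook:

```c
#include <stdio.h>
#include <ucp/api/ucp.h>

/* Called by UCP when the send request completes. After a lane failure the
 * request is purged and this fires with a non-OK status. */
static void send_done_cb(void *request, ucs_status_t status, void *user_data)
{
    if (status != UCS_OK) {
        fprintf(stderr, "send purged: %s\n", ucs_status_string(status));
        /* queue_for_resend(user_data);  -- hypothetical application hook
         * to re-post the message once a new ep exists */
    }
    ucp_request_free(request);
}

static ucs_status_ptr_t post_send(ucp_ep_h ep, const void *buf, size_t len,
                                  ucp_tag_t tag, void *app_ctx)
{
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
                        UCP_OP_ATTR_FIELD_USER_DATA,
        .cb.send      = send_done_cb,
        .user_data    = app_ctx
    };

    return ucp_tag_send_nbx(ep, buf, len, tag, &param);
}
```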
Hi, when testing multi-rail I can load-balance across both lanes and the bandwidth doubles, which is great! But I found that if one of the two lanes is unplugged, the UCP EP's error callback gets called.
My env setup:
Both nodes use two RoCE HCAs as the two rails and a third NIC for the client/server TCP connection-manager setup; messages are sent over this client/server ucp ep pair (which of course uses the RoCE uct resources).
Client: ---------------> Server:
mlx5_0/1 (192.168.100.2/24) -------wire1--------> mlx5_0/1 (192.168.100.1/24)
mlx5_1/1 (192.168.200.2/24) -------wire2--------> mlx5_2/1 (192.168.200.1/24)
enp1s0 (192.168.1.133/24) -------wire3--------> ext (192.168.1.199/24)
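For reference, the client ep is created roughly like the sketch below (simplified, error checks omitted): a sockaddr-based ep through the TCP connection manager on wire3, with an error handler installed. My understanding is that with UCP_ERR_HANDLING_MODE_PEER, a failure on any lane of the multi-rail ep is reported through this single callback.

```c
#include <ucp/api/ucp.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

/* Invoked once the whole UCP ep is marked failed, after its lanes
 * (UCT eps) have been discarded. */
static void ep_err_cb(void *arg, ucp_ep_h ep, ucs_status_t status)
{
    fprintf(stderr,
            "error handling callback was invoked with status %d (%s)\n",
            status, ucs_status_string(status));
}

ucs_status_t connect_client(ucp_worker_h worker, ucp_ep_h *ep_p)
{
    struct sockaddr_in sa   = {0};
    ucp_ep_params_t params  = {0};

    sa.sin_family = AF_INET;
    sa.sin_port   = htons(13337);
    inet_pton(AF_INET, "192.168.1.199", &sa.sin_addr);  /* server, wire3 */

    params.field_mask = UCP_EP_PARAM_FIELD_FLAGS |
                        UCP_EP_PARAM_FIELD_SOCK_ADDR |
                        UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                        UCP_EP_PARAM_FIELD_ERR_HANDLER;
    params.flags            = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
    params.sockaddr.addr    = (const struct sockaddr*)&sa;
    params.sockaddr.addrlen = sizeof(sa);
    params.err_mode         = UCP_ERR_HANDLING_MODE_PEER;
    params.err_handler.cb   = ep_err_cb;
    params.err_handler.arg  = NULL;

    return ucp_ep_create(worker, &params, ep_p);
}
```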
The following is the client-side debug log after unplugging wire2:
[1712838247.230173] [promote:19173:a] ib_device.c:468 UCX DIAG IB Async event on mlx5_1: port error on port 1
[1712838259.039340] [promote:19173:0] ib_mlx5_log.c:177 UCX DEBUG Transport retry count exceeded on mlx5_1:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[1712838259.039340] [promote:19173:0] ib_mlx5_log.c:177 UCX DEBUG RC QP 0x223 wqe[10906]: opcode SEND
[1712838259.039408] [promote:19173:0] ucp_worker.c:531 UCX DEBUG worker 0x1362190: error handler called for UCT EP 0x14afe90: Endpoint timeout
[1712838259.039424] [promote:19173:0] ucp_ep.c:1467 UCX DEBUG ep 0x7fbcdaffd000: set_ep_failed status Endpoint timeout on lane[2]=0x14afe90
[1712838259.039446] [promote:19173:0] tcp_sockcm_ep.c:122 UCX DEBUG ep 0x1424860 (fd=27 state=526058) disconnecting from peer: 192.168.1.199:13337
[1712838259.039526] [promote:19173:0] ucp_ep.c:1428 UCX DEBUG ep 0x7fbcdaffd000: discarding lanes
[1712838259.039539] [promote:19173:0] ucp_ep.c:1436 UCX DEBUG ep 0x7fbcdaffd000: discard uct_ep[0]=0x1424860
[1712838259.039550] [promote:19173:0] ucp_ep.c:1436 UCX DEBUG ep 0x7fbcdaffd000: discard uct_ep[1]=0x1353b90
[1712838259.039571] [promote:19173:0] ib_mlx5.c:913 UCX DEBUG device uverbs0: modify QP 0x1353be8 num 0x228 to state 6
[1712838259.040188] [promote:19173:0] mpool.c:282 UCX DEBUG mpool send-ops-mpool: allocated chunk 0x148e9a0 of 16472 bytes with 257 elements
[1712838259.040211] [promote:19173:0] ucp_ep.c:1436 UCX DEBUG ep 0x7fbcdaffd000: discard uct_ep[2]=0x14afe90
[1712838259.040232] [promote:19173:0] ib_mlx5.c:913 UCX DEBUG device uverbs1: modify QP 0x14afee8 num 0x223 to state 6
[1712838259.042469] [promote:19173:0] mpool.c:282 UCX DEBUG mpool send-ops-mpool: allocated chunk 0x149eae0 of 16472 bytes with 257 elements
[1712838259.042500] [promote:19173:0] ucp_ep.c:1436 UCX DEBUG ep 0x7fbcdaffd000: discard uct_ep[3]=0x1448df0
[1712838259.042523] [promote:19173:0] ucp_ep.c:3408 UCX DEBUG ep 0x7fbcdaffd000: calling user error callback 0x40212f with arg (nil) and status Endpoint timeout
error handling callback was invoked with status -80 (Endpoint timeout)
[1712838259.042558] [promote:19173:a] tcp_sockcm_ep.c:360 UCX DEBUG ep 0x1424860 (fd=27 state=528106): remote peer () disconnected/rejected (Endpoint is not connected)
[1712838259.042585] [promote:19173:a] async.c:157 UCX DEBUG removed async handler 0x13dea60 [id=27 ref 2] uct_tcp_sa_data_handler() from hash
[1712838259.042594] [promote:19173:a] async.c:547 UCX DEBUG removing async handler 0x13dea60 [id=27 ref 2] uct_tcp_sa_data_handler()
[1712838259.042629] [promote:19173:a] async.c:172 UCX DEBUG release async handler 0x13dea60 [id=27 ref 0] uct_tcp_sa_data_handler()
[1712838259.042651] [promote:19173:a] ib_device.c:468 UCX DEBUG IB Async event on mlx5_0: SRQ-attached QP 0x228 was flushed
[1712838259.042709] [promote:19173:a] ib_device.c:468 UCX DEBUG IB Async event on mlx5_1: SRQ-attached QP 0x223 was flushed
[1712838259.042772] [promote:19173:0] ib_mlx5_log.c:177 UCX DIAG Transport retry count exceeded on mlx5_1:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[1712838259.042772] [promote:19173:0] ib_mlx5_log.c:177 UCX DIAG RC QP 0x223 wqe[10906]: SEND --e [va 0x7fbccf283f40 len 8256 lkey 0x1bebeb] [rqpn 0x157 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.200.1 sgid_index=3 traffic_class=0]
[1712838259.042806] [promote:19173:0] async.c:150 UCX DEBUG async handler [id=27] not found in hash table
[1712838259.042849] [promote:19173:0] ucp_ep.c:1344 UCX DEBUG ep 0x7fbcdaffd000: unprogress iface 0x139b420 ud_mlx5/mlx5_0:1
[1712838259.042862] [promote:19173:0] async.c:157 UCX DEBUG removed async handler 0x14e1010 [id=1000017 ref 1] ???() from hash
[1712838259.042866] [promote:19173:0] async.c:547 UCX DEBUG removing async handler 0x14e1010 [id=1000017 ref 1] ???()
[1712838259.042882] [promote:19173:0] async.c:172 UCX DEBUG release async handler 0x14e1010 [id=1000017 ref 0] ???()
[1712838259.042898] [promote:19173:0] ud_ep.c:1786 UCX DEBUG ep 0x1448df0: disconnect
I tried and found that no matter which rail's wire I unplug, the CM always kicks in, discards both lanes, and eventually calls the ucp ep's user error callback. My questions are: 1) Is that behavior by design? 2) Does that mean I have to re-create the UCX context to filter out the failed HCA resource (which I cannot tell apart), and then re-create the worker and ep all over again?
I initially thought UCP would behave like NIC bonding, which can fail over and even recover. Does UCP multi-rail support the same functionality, or will it in the future?
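To make question 2) concrete, the recovery I am assuming would look roughly like the sketch below (assuming the same context and worker can be reused; please correct me if they also have to be rebuilt): force-close the failed ep and connect again over the same worker.

```c
#include <ucp/api/ucp.h>

/* connect_client() is the ep-creation sketch from earlier in this post. */
ucs_status_t connect_client(ucp_worker_h worker, ucp_ep_h *ep_p);

/* Force-close the failed ep (it is already broken, so no graceful flush)
 * and create a fresh one over the same worker. */
static void reconnect(ucp_worker_h worker, ucp_ep_h *ep_p)
{
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_FLAGS,
        .flags        = UCP_EP_CLOSE_FLAG_FORCE
    };
    ucs_status_ptr_t req = ucp_ep_close_nbx(*ep_p, &param);

    if (UCS_PTR_IS_PTR(req)) {
        /* Progress the worker until the close request completes. */
        while (ucp_request_check_status(req) == UCS_INPROGRESS) {
            ucp_worker_progress(worker);
        }
        ucp_request_free(req);
    }

    connect_client(worker, ep_p);
}
```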