Error handling
Aurelien Bouteiller edited this page Jan 12, 2016 · 2 revisions
- All UCT/UCP functions should return a status code, which carries an error code when relevant.
- An alternative or additional design is a callback for error cases, triggered from within the failing function. We are not going this way for now: the loss of context makes it harder to use in low-level programming. For example, if the error is captured only by a callback, without a return code, the caller needs setjmp/longjmp to exit the erroneous code path, which is unappealing.
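The return-code design can be sketched as follows. This is a minimal illustration, not the UCX API: `rc_status_t`, `rc_ep_t`, `rc_put`, and `rc_send_pair` are hypothetical names, and the `reachable` flag stands in for real transport state. The point is that the caller still has its local context when the error surfaces and can leave the code path with a plain `return`.

```c
#include <stddef.h>

/* Hypothetical status type; names and values are illustrative only. */
typedef enum {
    RC_OK              = 0,
    RC_ERR_UNREACHABLE = -1
} rc_status_t;

/* Hypothetical endpoint: 'reachable' stands in for real transport state. */
typedef struct {
    int reachable;
} rc_ep_t;

/* Hypothetical put: failure is reported through the return value. */
static rc_status_t rc_put(rc_ep_t *ep, const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return ep->reachable ? RC_OK : RC_ERR_UNREACHABLE;
}

/* Posts two puts and stops at the first failure.
 * Returns the number of puts that succeeded. */
static int rc_send_pair(rc_ep_t *ep, const void *a, const void *b, size_t len)
{
    int done = 0;

    if (rc_put(ep, a, len) != RC_OK) {
        return done; /* local context (done, a, b) is still available here */
    }
    done++;

    if (rc_put(ep, b, len) != RC_OK) {
        return done;
    }
    done++;

    return done;
}
```

With a callback-only design, `rc_send_pair` would have no direct way to learn that the first put failed before posting the second one.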
- Errors are reported per-operation.
- When an error is reported on an endpoint, that endpoint can be marked as problematic and should then be disposed of.
- Other endpoints are unaffected.
- It may or may not be possible to reconnect to the target process, or to use another transport/endpoint to reach that process (UCT should not do failover, UCP may).
- When an operation reports an error, the contents of the destination buffer are undefined (that is, the local buffer in a get, or the remote buffer in a put/amo).
- UCP may fail over and try other UCT transports to complete the operation (we probably want to report this as a performance error, and to have a way to disable failover altogether from userland).
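A possible shape for UCP-level failover with a userland switch is sketched below. Everything here is an assumption made for illustration: `fo_ep_t`, `fo_send`, the `failover_enabled` flag, and the two stub transports are hypothetical and do not correspond to existing UCX functions. Transports are tried in order of preference; the returned index tells the caller whether a slower fallback path was used.

```c
#include <stddef.h>

/* Hypothetical transport send function: returns 0 on success. */
typedef int (*fo_send_fn)(const void *buf, size_t len);

/* Hypothetical UCP-level endpoint holding several UCT-like transports. */
typedef struct {
    fo_send_fn *transports;       /* ordered by preference (fastest first) */
    size_t      count;
    int         failover_enabled; /* userland switch to disable failover */
} fo_ep_t;

/* Tries transports in order. Returns the index of the transport that
 * completed the operation, or -1 if none did. An index greater than 0
 * means a fallback transport was used, which the caller may want to
 * report as a performance problem. */
static int fo_send(const fo_ep_t *ep, const void *buf, size_t len)
{
    size_t limit = ep->count;
    size_t i;

    if (!ep->failover_enabled && limit > 1) {
        limit = 1; /* failover disabled: only the preferred transport */
    }

    for (i = 0; i < limit; i++) {
        /* After a failed attempt the destination buffer is undefined,
         * so each retry resends the whole buffer. */
        if (ep->transports[i](buf, len) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Stub transports used only to exercise the sketch. */
static int fo_fast_broken(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -1;
}

static int fo_slow_working(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}
```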
- When a UCP error is returned, the endpoint is in the error state.
- When an endpoint is in the error state, matching should stop.
- Add a UCP function to resume matching: the upper layer has to decide whether the currently pending matching order still makes sense (if an ANY_SOURCE operation is in the matching queue, we may have to interrupt everything).
- If it does not, add a UCP function to shut down the endpoint and purge the pending/matching queue and the unexpected fragments of all messages relating to that endpoint.
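The error-state behaviour described above could look roughly like this. It is a sketch under stated assumptions: `match_ep_t` and the four functions are hypothetical names, and the queues are reduced to counters so that only the state transitions are shown (error stops matching, the upper layer then either resumes or shuts down and purges).

```c
#include <stddef.h>

/* Hypothetical endpoint states. */
typedef enum {
    MATCH_EP_OK,
    MATCH_EP_ERROR,
    MATCH_EP_CLOSED
} match_ep_state_t;

/* Hypothetical endpoint; queues are reduced to simple counters. */
typedef struct {
    match_ep_state_t state;
    size_t           pending;    /* entries in the pending/matching queue */
    size_t           unexpected; /* unexpected fragments already received */
} match_ep_t;

/* An operation reported an error: the endpoint enters the error state. */
static void match_ep_report_error(match_ep_t *ep)
{
    ep->state = MATCH_EP_ERROR;
}

/* Matches one pending entry. Returns 1 if something was matched,
 * 0 if matching is stopped or there is nothing to match. */
static int match_ep_try_match(match_ep_t *ep)
{
    if (ep->state != MATCH_EP_OK || ep->pending == 0) {
        return 0;
    }
    ep->pending--;
    return 1;
}

/* Upper layer decided the pending order still makes sense.
 * Returns 0 on success, -1 if the endpoint was not in error state. */
static int match_ep_resume(match_ep_t *ep)
{
    if (ep->state != MATCH_EP_ERROR) {
        return -1;
    }
    ep->state = MATCH_EP_OK;
    return 0;
}

/* Upper layer gave up on the endpoint: purge everything related to it.
 * Returns the number of purged entries. */
static size_t match_ep_shutdown(match_ep_t *ep)
{
    size_t purged = ep->pending + ep->unexpected;

    ep->pending    = 0;
    ep->unexpected = 0;
    ep->state      = MATCH_EP_CLOSED;
    return purged;
}
```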
- Generally, the upper layer is responsible for determining whether a UCP error is a link error or a process error. However, if the transport provides some introspection capabilities, more precise errors can be generated.
- In general, it is hard to determine whether a remote peer has actually failed or has just become disconnected (HCA error, out of credits, link/wire/switch issue, ...). UCT functions are therefore generally expected to return an "UNREACHABLE" error code.
- Some UCT errors are temporary and may be corrected by other means, such as rebooting the HCA (errors from UCP should be only of the non-correctable kind). Those errors should have separate codes indicating the intended remediation.
- Error code mockup list
- UCT_ERR_UNREACHABLE: generic code, the target is unreachable
- UCT_ERR_LNIC_FAILED: the local NIC has failed (uncorrectable)
- UCT_ERR_LNIC_REBOOT: the local NIC has failed (correctable, need to re-init the transport)
- UCT_ERR_RNIC_FAILED: the remote NIC has failed (uncorrectable)
- UCT_ERR_RNIC_REBOOT: the remote NIC has failed (correctable, need to re-init the transport)
- UCT_ERR_ROUTE_LOST: the switching infrastructure cannot route to the target
- UCT_ERR_PROC_FAILED: the target process has failed (may never be returned for some transports)
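The mockup list could be expressed as a C enum together with a helper that maps each code to its intended remediation. The code names come from the list above; the numeric values, the `uct_err_mock_t` and `mock_remedy_t` types, and the `mock_remedy` helper are assumptions made for this sketch. Codes for which the list states no remediation are reported as unspecified rather than guessed.

```c
/* Mockup error codes from the list above; numeric values are arbitrary. */
typedef enum {
    UCT_ERR_UNREACHABLE = 1, /* generic: the target is unreachable */
    UCT_ERR_LNIC_FAILED,     /* local NIC failed (uncorrectable) */
    UCT_ERR_LNIC_REBOOT,     /* local NIC failed (re-init the transport) */
    UCT_ERR_RNIC_FAILED,     /* remote NIC failed (uncorrectable) */
    UCT_ERR_RNIC_REBOOT,     /* remote NIC failed (re-init the transport) */
    UCT_ERR_ROUTE_LOST,      /* switching infrastructure cannot route */
    UCT_ERR_PROC_FAILED      /* target process failed */
} uct_err_mock_t;

/* Hypothetical remediation classes. */
typedef enum {
    MOCK_CORRECTABLE,   /* re-initialize the transport and retry */
    MOCK_UNCORRECTABLE, /* give up on this NIC */
    MOCK_UNSPECIFIED    /* the list does not state a remediation */
} mock_remedy_t;

/* Maps an error code to the remediation stated in the mockup list. */
static mock_remedy_t mock_remedy(uct_err_mock_t code)
{
    switch (code) {
    case UCT_ERR_LNIC_REBOOT:
    case UCT_ERR_RNIC_REBOOT:
        return MOCK_CORRECTABLE;
    case UCT_ERR_LNIC_FAILED:
    case UCT_ERR_RNIC_FAILED:
        return MOCK_UNCORRECTABLE;
    default:
        return MOCK_UNSPECIFIED;
    }
}
```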