Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7

thomas-bouvier · 2023-01-03T16:49:16Z

Following #6, I'm trying to create a bulk to transfer CUDA variables over RDMA.

#include <torch/extension.h>
#include <iostream>
#include <thallium.hpp>

#define __DEBUG
#include "debug.hpp"

namespace tl = thallium;

int main(int argc, char** argv) {
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;
    tl::engine myEngine("tcp", MARGO_CLIENT_MODE, true, 1, &hii);

    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup("tcp://127.0.0.1:1234");

    auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
    torch::Tensor aug_samples = torch::zeros({3, 224, 224}, options);
    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(aug_samples.data_ptr(), aug_samples.nbytes());

    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    if (aug_samples.is_cuda()) {
        DBG("Samples are in CUDA memory!");
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
        attr.device = 0;
    } else {
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_HOST;
    }

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    return 0;
}

When running the code above, I get the following error :

[DEBUG 0] [client.cpp:37:main] Samples are in CUDA memory!
Function returned HG_OPNOTSUPPORTED
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/opt/software/linux-ubuntu20.04-skylake_avx512/gcc-9.4.0/mochi-thallium-main-6blapl4zngvcquy5lciliiogjzydzupr/include/thallium/engine.hpp:1132][margo_bulk_create] Function returned HG_OPNOTSUPPORTED
Aborted (core dumped)

As suggested by @carns:

I initialized Mercury with device memory support:

struct hg_init_info hii;
memset(&hii, 0, sizeof(hii));
hii.na_init_info.request_mem_device = true;
tl::engine myEngine("tcp", MARGO_CLIENT_MODE, true, 1, &hii);

I built libfabric using the +cuda variant https://github.com/mochi-hpc/mochi-spack-packages/blob/e222ad18083171a2e6806a0d363621f9c142e45e/packages/libfabric/package.py#L75. However, this failed as I am working with Spack in a Docker container, and the build process is checking for runtime CUDA availability:

1 error found in build log:
     132    checking cuda_runtime.h presence... yes
     133    checking for cuda_runtime.h... yes
     134    configure: looking for library in lib64
     135    checking for cudaMemcpy in -lcudart... no
     136    configure: looking for library in lib
     137    checking for cudaMemcpy in -lcudart... no
  >> 138    configure: error: CUDA support requested but CUDA runtime not avail
            able.

I used the --enable-cuda-dlopen flag as suggested in ofiwg/libfabric#7790 (comment) to overcome this issue, and opened a PR mochi-hpc/mochi-spack-packages#16 to add the corresponding variant in the mochi-spack-packages repo. I should probably make sure that libfabric is supporting CUDA independently from Thallium first.

Any pointers to debug that issue? Thanks!

The text was updated successfully, but these errors were encountered:

carns · 2023-01-06T21:08:45Z

Hi @thomas-bouvier . It's a little hard to tell if the error is coming from Mercury or libfabric. Can you repeat running your example with more logging enabled?

Specifically export FI_LOG_LEVEL=debug and HG_LOG_LEVEL=debug will turn on just about everything at each level, I think.

Tagging @soumagne in case he has any insight. I think the problem might be more obvious with libfabric debug messages, though.

carns · 2023-01-06T21:14:41Z

Somewhat orthogonal to debugging the problem at hand, but I'll leave this thought here anyway: would it be helpful in the long run to have a standalone margo utility (similar to the margo-info tool) that can validate a given software stack's ability to register a CUDA region for RDMA on libfabric? That's the portion that's failing here, rather than the RDMA transfer itself. That means it could be validated with a single command-line process, I think.

thomas-bouvier · 2023-01-09T21:24:50Z

Hi @carns, thank you very much for your feedback. Here are the complete logs:

root@gemini-1:~/bug# FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug LD_LIBRARY_PATH="/opt/view/lib;/opt/view/lib64;/opt/view/lib/python3.10/site-packages/torch/lib" ./client
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable perf_cntr=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable hook=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable hmem_cuda_use_gdrcopy=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable hmem_cuda_enable_xfer=<not set>
libfabric:1073237:1673298660::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1073237:1673298660::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1073237:1673298660::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:1073237:1673298660::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable hmem_disable_p2p=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_cache_max_size=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_cache_max_count=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_cache_monitor=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:1073237:1673298660::core:mr:ofi_default_cache_size():77<info> default cache size=3380746444
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable provider=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable universe_size=<not set>
libfabric:1073237:1673298660::core:core:fi_param_get_():278<info> variable provider_path=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable enable_passthru=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable buffer_size=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable tx_size=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable rx_size=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable msg_tx_size=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable msg_rx_size=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable cm_progress_interval=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable cq_eq_fairness=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable data_auto_progress=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable use_rndv_write=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable def_wait_obj=<not set>
libfabric:1073237:1673298660::ofi_rxm:core:fi_param_get_():278<info> variable def_tcp_wait_obj=<not set>
libfabric:1073237:1673298660::core:core:ofi_register_provider():466<info> registering provider: ofi_rxm (116.10)
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable tx_size=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable rx_size=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable tx_iov_limit=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable rx_iov_limit=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable inline_size=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable min_rnr_timer=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable use_odp=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable prefer_xrc=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable xrcd_filename=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable cqread_bunch_size=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable gid_idx=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable device_name=<not set>
libfabric:1073237:1673298660::verbs:core:vrb_read_params():716<info> dmabuf support is disabled
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable dgram_use_name_server=<not set>
libfabric:1073237:1673298660::verbs:core:fi_param_get_():278<info> variable dgram_name_server_port=<not set>
libfabric:1073237:1673298660::verbs:fabric:verbs_devs_print():887<info> list of verbs devices found for FI_EP_MSG:
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():614<info> device mlx5_3: there are no active ports
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():614<info> device mlx5_3: there are no active ports
libfabric:1073237:1673298661::verbs:fabric:vrb_get_device_attrs():614<info> device mlx5_3: there are no active ports
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: verbs (116.10)
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable port_high_range=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable port_low_range=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable tx_size=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable rx_size=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable nodelay=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable staging_sbuf_size=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable prefetch_rbuf_size=<not set>
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable zerocopy_size=<not set>
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: tcp (116.10)
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable prov_name=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable port_high_range=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable port_low_range=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable tx_size=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable rx_size=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable nodelay=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable staging_sbuf_size=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable prefetch_rbuf_size=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable zerocopy_size=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable poll_fairness=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable poll_cooldown=<not set>
libfabric:1073237:1673298661::net:core:fi_param_get_():278<info> variable disable_auto_progress=<not set>
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: net (116.10)
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_perf (116.10)
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_debug (116.10)
libfabric:1073237:1673298661::core:core:fi_param_get_():278<info> variable hmem_cuda_use_gdrcopy=<not set>
libfabric:1073237:1673298661::core:core:fi_param_get_():278<info> variable hmem_cuda_enable_xfer=<not set>
libfabric:1073237:1673298661::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1073237:1673298661::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1073237:1673298661::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:1073237:1673298661::core:core:ofi_hmem_init():247<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:1073237:1673298661::core:core:fi_param_get_():278<info> variable hmem_disable_p2p=<not set>
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_hmem (116.10)
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_dmabuf_peer_mem (116.10)
libfabric:1073237:1673298661::core:core:ofi_register_provider():466<info> registering provider: ofi_hook_noop (116.10)
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_fabric_attr():410<info> Requesting provider tcp;ofi_rxm, skipping verbs
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():689<info> Unsupported protocol
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Supported: FI_PROTO_RXM_TCP
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Requested: FI_PROTO_RXM
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_check_ep_attr():745<info> Provider requires use of shared rx context
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_check_ep_attr():745<info> Provider requires use of shared rx context
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_fabric_attr():410<info> Requesting provider tcp;ofi_rxm, skipping verbs
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():689<info> Unsupported protocol
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Supported: FI_PROTO_RXM_TCP
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Requested: FI_PROTO_RXM
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_check_ep_attr():745<info> Provider requires use of shared rx context
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_check_ep_attr():745<info> Provider requires use of shared rx context
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661::tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661::tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661::tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661::tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661::tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661::tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661::tcp:core:util_getinfo_ifs():333<info> Chosen addr for using: 172.16.53.1, speed 10000
libfabric:1073237:1673298661::core:core:fi_fabric_():1340<info> Opened fabric: 172.16.48.0/20
libfabric:1073237:1673298661::core:core:fi_fabric_():1340<info> Opened fabric: 172.16.48.0/20
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661::tcp:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable enable_dyn_rbuf=<not set>
libfabric:1073237:1673298661::ofi_rxm:av:util_av_init():487<info> AV size 1024
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable comp_per_progress=<not set>
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_fabric_attr():410<info> Requesting provider tcp;ofi_rxm, skipping verbs
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():689<info> Unsupported protocol
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Supported: FI_PROTO_RXM_TCP
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():690<info> Requested: FI_PROTO_RXM
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():789<info> Tag size exceeds supported size
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():790<info> Supported: 6148914691236517205
libfabric:1073237:1673298661::ofi_rxm:core:ofi_check_ep_attr():790<info> Requested: 12297829382473034410
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable use_srx=<not set>
libfabric:1073237:1673298661:ofi_rxm:core:core:ofi_layering_ok():1025<info> Provider ofi_rxm is excluded
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:fi_param_get_():278<info> variable iface=<not set>
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.16.53.1, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: 172.17.0.1, iface name: docker0, speed: 100
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_get_list_of_addr():2051<info> Available addr: fe80::dac4:97ff:feb8:3283, iface name: enp1s0f0, speed: 10000
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1883<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:1073237:1673298661:ofi_rxm:tcp:core:ofi_insert_loopback_addr():1898<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:1073237:1673298661::tcp:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_rx_attr():805<info> Tx only caps ignored in Rx caps
libfabric:1073237:1673298661::tcp:core:ofi_check_tx_attr():903<info> Rx only caps ignored in Tx caps
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable enable_direct_send=<not set>
libfabric:1073237:1673298661::ofi_rxm:core:fi_param_get_():278<info> variable eager_limit=<not set>
libfabric:1073237:1673298661::ofi_rxm:core:rxm_ep_settings_init():1259<info> Settings:
                 MR local: MSG - 0, RxM - 0
                 Completions per progress: MSG - 1
                 Buffered min: 0
                 Min multi recv size: 16384
                 inject size: 128
                 Protocol limits: Eager: 16384, SAR: 16384
libfabric:1073237:1673298661::ofi_rxm:av:ofi_av_insert_addr():291<info> inserting addr
: fi_sockaddr_in://172.16.53.1:38933
libfabric:1073237:1673298661::ofi_rxm:av:ofi_av_insert_addr():314<info> fi_addr: 0
libfabric:1073237:1673298661::ofi_rxm:av:ofi_av_insert_addr():291<info> inserting addr
: fi_sockaddr_in://127.0.0.1:1234
libfabric:1073237:1673298661::ofi_rxm:av:ofi_av_insert_addr():314<info> fi_addr: 1
[DEBUG 1] [client.cpp:37:main] Samples are in CUDA memory!
Function returned HG_OPNOTSUPPORTED
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/tmp/opt/software/linux-debian11-broadwell/gcc-10.2.1/mochi-thallium-main-akonmmfedv533tq745tml3x7x7fb6jvn/include/thallium/engine.hpp:1132][margo_bulk_create] Function returned HG_OPNOTSUPPORTED
Aborted

Many lines display <not set>, this looks suspicious to me. libfabric@1.16.1+cuda and rdma-core@41.0 are installed.

Input spec
--------------------------------
mochi-thallium@main
    ^argobots
    ^libfabric+cuda fabrics=rxm,tcp,verbs
    ^mercury@2.2.0~boostsys~checksum+ofi

I checked with the grid5000 team, OFED is not what's installed on the machine initially. They use the rdma-core packaged in debian. So it should be ok to build it myself using Spack (that's what I did).

I think the tool you describe would be really convenient, as RDMA+CUDA seems to be quite tricky to achieve :)

carns · 2023-01-10T16:24:37Z

I actually don't really see anything alarming in that output. Unfortunately it doesn't have any messages from Mercury, though. I forgot that you have to build the Mercury spack package with +debug for it to emit messages, though :(

Can you try again with it built that way?

I'll open an issue to track the CUDA validation command line utility idea.

soumagne · 2023-01-10T16:37:46Z

@carns error and warnings messages should be printed regardless. The +debug variant only affects debug messages. Actually in that case setting FI_LOG_LEVEL=warn and HG_LOG_LEVEL=warn should be sufficient I would expect.

thomas-bouvier · 2023-01-10T20:49:31Z

I built mercury+debug to be sure, and I got the exact same output :(

Input spec
--------------------------------
mochi-thallium@main
    ^argobots
    ^libfabric+cuda fabrics=rxm,tcp,verbs
    ^mercury@2.2.0~boostsys~checksum+debug+ofi

soumagne · 2023-01-10T21:11:46Z

You could also try setting HG_LOG_SUBSYS=na just in case something is somehow hiding the output.

thomas-bouvier · 2023-01-10T22:15:53Z

Interesting, HG_LOG_SUBSYS=na FI_LOG_LEVEL=warn HG_LOG_LEVEL=warn gives the following:

root@gemini-1:~/bug# HG_LOG_SUBSYS=na FI_LOG_LEVEL=warn HG_LOG_LEVEL=warn ./client
[DEBUG 1] [client.cpp:37:main] Samples are in CUDA memory!
# [10797.956487] mercury->mem: [error] /tmp/spack-stage/root/spack-stage-mercury-2.2.0-3eaxn43bgam7of6jxmjoq6jvbrcfn5md/spack-src/src/na/na_ofi.c:6076
 # na_ofi_mem_register(): selected provider does not support device registration
Function returned HG_OPNOTSUPPORTED
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/tmp/opt/software/linux-debian11-broadwell/gcc-10.2.1/mochi-thallium-main-ruzuusshqrs2jyz4wfiror6icwsefmui/include/thallium/engine.hpp:1132][margo_bulk_create] Function returned HG_OPNOTSUPPORTED
Aborted

The full output with FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug is a bit long (but seems useful):

full_output.txt

soumagne · 2023-01-10T22:26:02Z

which provider are you trying to use ? tcp ? I had somehow missed that in your previous log but it can only work with verbs and shm providers.

thomas-bouvier · 2023-01-11T00:06:29Z

I was using tcp as this example in the docs : https://github.com/mochi-hpc/mochi-doc/blob/29e3a87d30b500a7d70a4601893975254ec3e5de/code/thallium/08_rdma/client.cpp

I tried with verbs providers verbs and ofi+verbs, but the Thallium engine can't get initialized with these. margo-info shows them in red. I discussed this with @carns in the past : a suggestion was to build libfabric reusing the (external) vendor rdma-core. It turns out that what is installed on my system is the version of rdma-core packaged with debian. Building rdma-core myself should not be a problem.

Anyway, let's focus on shm providers for now. I updated my code as follows to leverage the na+sm provider:

client.cpp

#include <torch/extension.h>
#include <iostream>
#include <thallium.hpp>

#define __DEBUG
#include "debug.hpp"

namespace tl = thallium;

int main(int argc, char** argv) {
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;
    tl::engine myEngine("na+sm://122-1", THALLIUM_CLIENT_MODE, true, 1, &hii);

    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup("na+sm://123-1");

    auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
    torch::Tensor aug_samples = torch::zeros({3, 224, 224}, options);
    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(aug_samples.data_ptr(), aug_samples.nbytes());

    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    if (aug_samples.is_cuda()) {
        DBG("Samples are in CUDA memory!");
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
        attr.device = 0;
    } else {
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_HOST;
    }

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    return 0;
}

server.cpp

#include <torch/extension.h>
#include <iostream>
#include <thallium.hpp>
#include <thallium/serialization/stl/string.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {

    tl::engine myEngine("na+sm://123-1", THALLIUM_SERVER_MODE);

    std::function<void(const tl::request&, tl::bulk&)> f =
        [&myEngine](const tl::request& req, tl::bulk& b) {
            auto options = torch::TensorOptions().dtype(torch::kFloat32);
            torch::Tensor tensor = torch::zeros({3, 224, 224}, options);
            std::vector<std::pair<void*, std::size_t>> segments;
            segments.emplace_back(tensor.data_ptr(), tensor.nbytes());

            tl::bulk bulk = myEngine.expose(segments, tl::bulk_mode::read_only);
            bulk >> b.on(req.get_endpoint());
        };
    myEngine.define("do_rdma", f).disable_response();
}

With this code, I get the following error on the server:

Function returned HG_FAULT
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/tmp/opt/software/linux-debian11-broadwell/gcc-10.2.1/mochi-thallium-main-ruzuusshqrs2jyz4wfiror6icwsefmui/include/thallium/remote_bulk.hpp:157][margo_bulk_transfer] Function returned HG_FAULT
Aborted

Maybe something is wrong with my code though?

client.txt
Failling server.txt

soumagne · 2023-01-11T00:34:38Z

ok yeah I think it's better to focus on getting the verbs provider to work. Sorry for the confusion and for the redundancy there because of multiple libraries providing the same functionality but I actually meant ofi+shm (shared-memory provider from libfabric, which supports gdr copy) not na+sm (shared-memory plugin from mercury which is not GPU enabled) for the shared-memory functionality but I have not tested it with cuda, it is enabled to do what you want though, so in theory it should work. I don't think it has been tested with thallium either yet so you can try but it may be safer to stick with verbs.

carns · 2023-01-11T13:41:50Z

@soumagne Do you have any idea why we weren't getting this message originally (or even when setting HG_LOG_LEVEL=debug)?

# [10797.956487] mercury->mem: [error] /tmp/spack-stage/root/spack-stage-mercury-2.2.0-3eaxn43bgam7of6jxmjoq6jvbrcfn5md/spack-src/src/na/na_ofi.c:6076
 # na_ofi_mem_register(): selected provider does not support device registration

That might be a separate issue to open up. It would have helped a lot to see this sooner :)

soumagne · 2023-01-11T16:30:59Z

yes the default log subsys is fatal only so it still requires setting HG_LOG_SUBSYS to something other than fatal like hg or na to get all the log. When using HG_Set_log_level() it does that implicitly but not when setting the env variable, we could maybe improve that...

thomas-bouvier · 2023-01-11T23:54:01Z

@soumagne what about ucx? Does it support CUDA?

thomas-bouvier · 2023-01-12T17:54:21Z

Closing this as I get a different error from HG_OPNOTSUPPORTED now. Thank you for your input :)

soumagne · 2023-01-12T18:00:03Z

@thomas-bouvier UCX is another option that also supports CUDA, which we also enabled to support that type of transfer, though I don't think anybody tested it just yet :) the code is there though.

mdorier assigned carns Jan 3, 2023

carns mentioned this issue Jan 11, 2023

need command line utility for testing device memory access mochi-hpc/mochi-margo#232

Open

thomas-bouvier closed this as completed Jan 12, 2023

This was referenced Jan 12, 2023

explicitly set Mercury log level and log subsys mochi-hpc/mochi-margo#233

Open

automatically bump HG_LOG_SUBSYS when HG_LOG_LEVEL is set mercury-hpc/mercury#634

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7

Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7

thomas-bouvier commented Jan 3, 2023

carns commented Jan 6, 2023

carns commented Jan 6, 2023

thomas-bouvier commented Jan 9, 2023 •

edited

Loading

carns commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 11, 2023

soumagne commented Jan 11, 2023

carns commented Jan 11, 2023

soumagne commented Jan 11, 2023

thomas-bouvier commented Jan 11, 2023

thomas-bouvier commented Jan 12, 2023

soumagne commented Jan 12, 2023

Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7

Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7

Comments

thomas-bouvier commented Jan 3, 2023

carns commented Jan 6, 2023

carns commented Jan 6, 2023

thomas-bouvier commented Jan 9, 2023 • edited Loading

carns commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 10, 2023

soumagne commented Jan 10, 2023

thomas-bouvier commented Jan 11, 2023

soumagne commented Jan 11, 2023

carns commented Jan 11, 2023

soumagne commented Jan 11, 2023

thomas-bouvier commented Jan 11, 2023

thomas-bouvier commented Jan 12, 2023

soumagne commented Jan 12, 2023

thomas-bouvier commented Jan 9, 2023 •

edited

Loading