Getting HG_OPNOTSUPPORTED when performing RDMA on data living in CUDA memory #7
Following #6, I'm trying to create a bulk to transfer CUDA variables over RDMA. When running my code, I get the HG_OPNOTSUPPORTED error mentioned in the title.

As suggested by @carns, I used the `+cuda` variant of the libfabric Spack package (https://github.com/mochi-hpc/mochi-spack-packages/blob/e222ad18083171a2e6806a0d363621f9c142e45e/packages/libfabric/package.py#L75). However, this failed as I am working with Spack in a Docker container, and the build process checks for runtime CUDA availability. I used the `--enable-cuda-dlopen` flag as suggested in ofiwg/libfabric#7790 (comment) to overcome this, and opened PR mochi-hpc/mochi-spack-packages#16 to add the corresponding variant in the `mochi-spack-packages` repo. I should probably make sure that libfabric supports CUDA independently from Thallium first.

Any pointers to debug that issue? Thanks!
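A minimal standalone check of that kind might look like the sketch below: it asks libfabric for a provider advertising the FI_HMEM capability, then tries to register a cudaMalloc'd buffer. This is an illustration only, not code from this thread, and it assumes libfabric >= 1.9 built with CUDA support.

```cpp
// Sketch only (not from this thread): ask libfabric for a provider that
// advertises FI_HMEM, then try to register a cudaMalloc'd buffer.
// Assumes libfabric >= 1.9 built with CUDA support, and a CUDA runtime.
#include <cstdio>
#include <cstring>
#include <sys/uio.h>
#include <cuda_runtime.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

int main() {
    struct fi_info* hints = fi_allocinfo();
    hints->caps = FI_MSG | FI_RMA | FI_HMEM; // require device-memory support
    hints->domain_attr->mr_mode = FI_MR_HMEM | FI_MR_LOCAL | FI_MR_ALLOCATED |
                                  FI_MR_PROV_KEY | FI_MR_VIRT_ADDR;

    struct fi_info* info = nullptr;
    if (fi_getinfo(FI_VERSION(1, 9), nullptr, nullptr, 0, hints, &info) != 0) {
        fprintf(stderr, "no provider advertises FI_HMEM\n");
        return 1;
    }
    printf("provider: %s\n", info->fabric_attr->prov_name);

    struct fid_fabric* fabric = nullptr;
    struct fid_domain* domain = nullptr;
    fi_fabric(info->fabric_attr, &fabric, nullptr);
    fi_domain(fabric, info, &domain, nullptr);

    void* buf = nullptr;
    if (cudaMalloc(&buf, 1 << 20) != cudaSuccess) { // 1 MiB device buffer
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    struct iovec iov = { buf, 1 << 20 };
    struct fi_mr_attr mr_attr;
    memset(&mr_attr, 0, sizeof(mr_attr));
    mr_attr.mr_iov      = &iov;
    mr_attr.iov_count   = 1;
    mr_attr.access      = FI_REMOTE_READ | FI_REMOTE_WRITE;
    mr_attr.iface       = FI_HMEM_CUDA; // the buffer lives in GPU memory
    mr_attr.device.cuda = 0;

    struct fid_mr* mr = nullptr;
    int ret = fi_mr_regattr(domain, &mr_attr, 0, &mr);
    printf("fi_mr_regattr: %d (%s)\n", ret, fi_strerror(-ret));
    return ret == 0 ? 0 : 1;
}
```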
Hi @thomas-bouvier. It's a little hard to tell if the error is coming from Mercury or libfabric. Can you repeat running your example with more logging enabled? Tagging @soumagne in case he has any insight; I think the problem might be more obvious with libfabric debug messages, though.
Somewhat orthogonal to debugging the problem at hand, but I'll leave this thought here anyway: would it be helpful in the long run to have a standalone margo utility (similar to the margo-info tool) that can validate a given software stack's ability to register a CUDA region for RDMA with libfabric? That registration is the portion failing here, rather than the RDMA transfer itself, which means it could be validated with a single command-line process, I think.
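A hypothetical single-process validator in that spirit might look like the following, written with Thallium since that is what this thread uses (a real utility would presumably use margo directly); the protocol string is an illustrative command-line argument, not anything prescribed here.

```cpp
// Hypothetical validator sketch: one process tries to expose a CUDA buffer
// for RDMA and reports the outcome. Protocol string is illustrative.
#include <cstring>
#include <iostream>
#include <utility>
#include <vector>
#include <cuda_runtime.h>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {
    // Ask the NA layer for device (GPU) memory support.
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;

    tl::engine engine(argc > 1 ? argv[1] : "ofi+verbs", THALLIUM_CLIENT_MODE,
                      true, 1, &hii);

    void* buf = nullptr;
    if (cudaMalloc(&buf, 1 << 20) != cudaSuccess) {
        std::cerr << "cudaMalloc failed" << std::endl;
        return 1;
    }

    // Describe the region as CUDA memory and try to register it for RDMA.
    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
    attr.device = 0;

    std::vector<std::pair<void*, std::size_t>> segments{{buf, 1 << 20}};
    try {
        tl::bulk b = engine.expose(segments, tl::bulk_mode::read_write, attr);
        std::cout << "CUDA registration succeeded" << std::endl;
    } catch (const std::exception& e) {
        std::cout << "CUDA registration failed: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```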
Hi @carns, thank you very much for your feedback. Here are the complete logs. Many lines display the same message. I checked with the grid5000 team: OFED is not what's installed on the machines initially; they use a different stack. I think the tool you describe would be really convenient, as RDMA+CUDA seems to be quite tricky to achieve :)
I actually don't see anything alarming in that output. Unfortunately it doesn't have any messages from Mercury, though. I forgot that you have to build the Mercury Spack package with a specific variant for its logging to show up. Can you try again with it built that way? I'll open an issue to track the CUDA validation command-line utility idea.
@carns error and warning messages should be printed regardless.
I built Mercury with that variant.
You could also try setting an additional logging environment variable.
Interesting, that gives more detail. The full output with that setting is attached.
Which provider are you trying to use? tcp? I had somehow missed that in your previous log, but it can only work with the verbs and shm providers.
I was using tcp indeed. I tried with the verbs provider as well, without luck. Anyway, let's focus on shm providers for now. I updated my code as follows to leverage the shared-memory transport.

client.cpp:

```cpp
#include <torch/extension.h>
#include <iostream>
#include <cstring> // for memset
#include <thallium.hpp>

#define __DEBUG
#include "debug.hpp"

namespace tl = thallium;

int main(int argc, char** argv) {
    // Request device (GPU) memory support from the NA layer.
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;

    tl::engine myEngine("na+sm://122-1", THALLIUM_CLIENT_MODE, true, 1, &hii);
    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup("na+sm://123-1");

    // Allocate a tensor directly in CUDA memory.
    auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
    torch::Tensor aug_samples = torch::zeros({3, 224, 224}, options);

    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(aug_samples.data_ptr(), aug_samples.nbytes());

    // Tell Mercury that the exposed region lives in CUDA memory.
    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    if (aug_samples.is_cuda()) {
        DBG("Samples are in CUDA memory!");
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
        attr.device = 0;
    } else {
        attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_HOST;
    }

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    return 0;
}
```

server.cpp:

```cpp
#include <torch/extension.h>
#include <iostream>
#include <thallium.hpp>
#include <thallium/serialization/stl/string.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {
    tl::engine myEngine("na+sm://123-1", THALLIUM_SERVER_MODE);

    std::function<void(const tl::request&, tl::bulk&)> f =
        [&myEngine](const tl::request& req, tl::bulk& b) {
            // Host-memory tensor on the server side.
            auto options = torch::TensorOptions().dtype(torch::kFloat32);
            torch::Tensor tensor = torch::zeros({3, 224, 224}, options);

            std::vector<std::pair<void*, std::size_t>> segments;
            segments.emplace_back(tensor.data_ptr(), tensor.nbytes());

            // Push the local buffer to the client's exposed bulk region.
            tl::bulk bulk = myEngine.expose(segments, tl::bulk_mode::read_only);
            bulk >> b.on(req.get_endpoint());
        };
    myEngine.define("do_rdma", f).disable_response();
}
```

With this code, I get an error on the server. Maybe something is wrong with my code though? The client output is attached: client.txt
Ok yeah, I think it's better to focus on getting the verbs provider to work. Sorry for the confusion and the redundancy there, with multiple libraries providing the same functionality: by shm I actually meant libfabric's shm provider, not Mercury's built-in na+sm.
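To make that distinction concrete, here is a minimal sketch of the init strings involved, assuming a Mercury build with both the OFI and SM NA plugins (the addresses are illustrative, not from this thread):

```cpp
#include <thallium.hpp>

namespace tl = thallium;

int main() {
    // "na+sm" selects Mercury's built-in shared-memory plugin (what the
    // code above used); "ofi+shm" goes through libfabric's shm provider,
    // which is one of the CUDA-capable providers along with "ofi+verbs".
    tl::engine engine("ofi+shm://", THALLIUM_CLIENT_MODE);
    // tl::engine engine("na+sm://", THALLIUM_CLIENT_MODE);
    return 0;
}
```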
@soumagne Do you have any idea why we weren't getting this message originally (or even when setting HG_LOG_LEVEL=debug)? That might be a separate issue to open up. It would have helped a lot to see this sooner :)
Yes, the default log subsys filters that message out.
@soumagne what about UCX? Does it support CUDA?
Closing this, as I now get a different error.
@thomas-bouvier UCX is another option that also supports CUDA; we enabled it for that type of transfer as well, though I don't think anybody has tested it just yet :) The code is there, though.
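For anyone who wants to try that path, switching transports should only require changing the engine's init string. A hypothetical sketch, assuming a Mercury build with the UCX NA plugin enabled (the address string is illustrative):

```cpp
#include <thallium.hpp>

namespace tl = thallium;

int main() {
    // "ucx+all" lets UCX choose the best available transport; when UCX is
    // built with CUDA support, it can handle device memory itself.
    tl::engine engine("ucx+all://", THALLIUM_SERVER_MODE);
    return 0;
}
```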