Segmentation fault while transmitting with a GPU address. #104

Open
heaibao817 opened this issue May 7, 2022 · 1 comment

heaibao817 commented May 7, 2022

The GDB backtrace is:
#0 0x00007ffff6d16cb4 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007ffc805e7b16 in copy_to_scat (scat=0x7ff9bc18f6e0, buf=buf@entry=0x7ff9bc1894c0, size=size@entry=0x7ffa167fe2ec,
max=max@entry=1, ctx=ctx@entry=0x1c1e8780) at ../providers/mlx5/qp.c:88
#2 0x00007ffc805e7e07 in copy_to_scat (ctx=0x1c1e8780, max=1, size=0x7ffa167fe2ec, buf=0x7ff9bc1894c0, scat=)
at ../providers/mlx5/qp.c:78
#3 mlx5_copy_to_send_wqe (qp=qp@entry=0x7ff9bc18a230, idx=, buf=0x7ff9bc1894c0, size=)
at ../providers/mlx5/qp.c:161
#4 0x00007ffc805e51a4 in mlx5_parse_cqe (lazy=0, cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=,
cur_rsc=, cqe=, cqe64=, cq=) at ../providers/mlx5/cq.c:743
#5 mlx5_poll_one (cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=, cur_rsc=, cq=)
at ../providers/mlx5/cq.c:904
#6 poll_cq (cqe_ver=1, wc=, ne=, ibcq=0x7ff9bc188d40) at ../providers/mlx5/cq.c:932
#7 mlx5_poll_cq_v1 (ibcq=0x7ff9bc188d40, ne=32, wc=) at ../providers/mlx5/cq.c:1306
#8 0x00007ffce1248ab2 in ibv_poll_cq (wc=0x7ffa167fe5a0, num_entries=32, cq=)
/include/infiniband/verbs.h:2456

It looks like the crash happens inside ibv_poll_cq. When I switch to a CPU address, the problem does not occur.
I wonder what is going wrong.

@nnurlan008

Hi @heaibao817,

I have a similar problem. In my case, I need to use a GPU buffer for the completion queue. I have a Tesla K40 and a ConnectX-4, and the nvidia_peermem module is loaded. I get a segmentation fault (bad address error) when I pass a GPU memory address returned by cudaMalloc, but not when I pass a CPU address returned by malloc. Have you been able to solve the issue you mentioned, and if so, how?

Many thanks in advance
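
A note on the backtrace in the first comment: frames #0 to #4 show mlx5_parse_cqe calling mlx5_copy_to_send_wqe and copy_to_scat, which ends in a CPU memcpy into the posted buffer. In rdma-core's mlx5 provider, this path is taken when response data arrives inline in the CQE (the scatter-to-CQE feature), so a cudaMalloc pointer in the scatter list would end up being dereferenced by the CPU. One thing that may be worth trying is disabling scatter-to-CQE through the MLX5_SCATTER_TO_CQE environment variable before any QP is created. This is a sketch and not a confirmed fix for this issue. The helper name disable_scatter_to_cqe is hypothetical, and whether this resolves either report here is an assumption.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: ask the mlx5 provider not to deliver small
 * payloads inline in the CQE, so that poll_cq does not memcpy into
 * the user buffer on the CPU.
 *
 * Assumption: the provider reads MLX5_SCATTER_TO_CQE when a QP is
 * created, so this must run before ibv_create_qp().
 *
 * Returns 0 on success, -1 on failure (errno set by setenv).
 */
static int disable_scatter_to_cqe(void)
{
    /* Third argument 1: overwrite any value inherited from the shell. */
    return setenv("MLX5_SCATTER_TO_CQE", "0", 1);
}
```

Call the helper early in main, before opening the device and creating QPs, or equivalently launch the program with MLX5_SCATTER_TO_CQE=0 set in the shell. If QPs are created through mlx5dv_create_qp, the MLX5DV_QP_CREATE_DISABLE_SCATTER_TO_CQE create flag is the per-QP alternative.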
