How to run Transformer-XL with parallel experts on a single GPU? #211

Open
HudashiNeo opened this issue Sep 10, 2024 · 6 comments

Comments

@HudashiNeo

It seems FastMoE still cannot achieve running multiple experts in parallel on a single GPU card?

@laekov
Owner

laekov commented Sep 10, 2024

It is supported using multiple CUDA streams. Refer to the class FMoELinear for details.

@HudashiNeo
Author

Thanks for the reply~
I see that both FMoE and FMoETransformerMLP are built around a single expert (num_expert inside FMoELinear is set to 1). If I don't set num_expert=1, will the code still run?

@laekov
Owner

laekov commented Sep 11, 2024

You can set it to a larger number; it runs.
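
A minimal sketch of what that might look like, assuming fastmoe's FMoETransformerMLP accepts num_expert, d_model, d_hidden, and top_k keyword arguments (argument names may differ across fastmoe versions; check fmoe/transformer.py in your installation):

```python
import torch
from fmoe import FMoETransformerMLP

d_model = 512

# More than one expert on a single GPU; no model-parallel world is needed.
moe_layer = FMoETransformerMLP(
    num_expert=8,      # > 1, all experts live on the same card
    d_model=d_model,
    d_hidden=2048,
    top_k=2,           # assumed keyword, forwarded to the gate in recent versions
).cuda()

x = torch.randn(16, 128, d_model, device="cuda")  # (batch, seq_len, d_model)
y = moe_layer(x)
print(y.shape)  # expected: torch.Size([16, 128, 512])
```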

@HudashiNeo
Author

It works now, and it is indeed very powerful. On my setup the code is more than 3x faster than the for-loop version, and top-k hardly affects the speed. 👍🏻
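
For context, a hedged sketch of the kind of "for-loop" baseline being compared against: plain PyTorch that dispatches tokens to their top-k experts one expert at a time. This is not fastmoe code, only an illustration of the slower reference approach.

```python
import torch
import torch.nn as nn

class ForLoopMoE(nn.Module):
    def __init__(self, num_expert, d_model, d_hidden, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_expert)
        )
        self.gate = nn.Linear(d_model, num_expert)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        topv, topi = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):      # one launch chain per expert
            for k in range(self.top_k):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k, None] * expert(x[mask])
        return out
```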

@HudashiNeo
Author

@laekov If I set the GPU to cuda:1, I get this error:

  File "/usr/local/lib/python3.10/dist-packages/fastmoe-1.1.0-py3.10-linux-x86_64.egg/fmoe/linear.py", line 20, in forward
    global_output_buf = fmoe_cuda.linear_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@laekov
Owner

laekov commented Sep 12, 2024

Perhaps some of the data is still on cuda:0?
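
A minimal sketch of the likely fix: create (or move) the MoE layer and its inputs on the same device, and set the current CUDA device before the forward pass, since custom CUDA extensions typically launch kernels on the current device. The constructor arguments follow the hedged example above.

```python
import torch
from fmoe import FMoETransformerMLP

device = torch.device("cuda:1")
torch.cuda.set_device(device)   # make cuda:1 the current device for kernel launches

moe_layer = FMoETransformerMLP(num_expert=8, d_model=512, d_hidden=2048).to(device)
x = torch.randn(16, 128, 512, device=device)
y = moe_layer(x)                # runs once every tensor and parameter sits on cuda:1
```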
