ModuleNotFoundError: No module named 'fmoe_cuda' #177

Open
Taskii-Lei opened this issue Nov 6, 2023 · 3 comments

Comments


Taskii-Lei commented Nov 6, 2023

Describe the bug
I adapted fmoe into Megatron following the tutorial and want to run a script to train GPT. But when I run pretrain_gpt.sh, it raises "ModuleNotFoundError: No module named 'fmoe_cuda'". In detail, I cloned the Megatron-LM repository and modified the functions mentioned in fastmoe/examples/megatron/fmoefy-v2.2.patch. Then I cloned fastmoe and put it inside the Megatron folder, as "./Megatron-LM/fastmoe", to avoid a possible ModuleNotFoundError. But when I run pretrain_gpt.sh, it still raises the error. I don't know much about module compilation, so I'm asking for your help. Thanks a lot!!

To Reproduce
Steps to reproduce the behavior:

  1. Compile with "..."
  2. Run "Megatron-LM/pretrain_gpt.sh" with Linux processes on 1 node with 8 GPUs per node.

Expected behavior
I expect it to train the fmoefy-patched Megatron smoothly.

Logs

File "/workspace/S/huanglei/Megatron-LM-moefy/fmoe/functions.py", line 9, in <module>
    import fmoe_cuda
ModuleNotFoundError: No module named 'fmoe_cuda'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3457189) of binary: /lustre/S/huanglei/CondaEnv/MoE/bin/python
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-06_15:01:34
  host      : r8a100-b01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3457189)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Owner

laekov commented Nov 6, 2023

You are supposed to compile and install the CUDA module of fastmoe using setup.py.
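The traceback above fails on `import fmoe_cuda` inside `fmoe/functions.py`, which means the pure-Python `fmoe` package was found (it sits next to the training script) while the compiled extension was not. Copying the source tree into the Megatron folder cannot provide `fmoe_cuda`, because that module only exists after the build step in setup.py has run. The sketch below is a minimal way to see which of the two modules the current interpreter can locate; the helper name `module_status` is hypothetical and not part of fastmoe.

```python
import importlib.util


def module_status(name):
    """Return where a top-level module would be loaded from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file, so report that explicitly.
    return spec.origin or "<namespace package>"


if __name__ == "__main__":
    # 'fmoe' is the Python package; 'fmoe_cuda' is the compiled extension
    # that only exists after the setup.py build has been run and installed.
    for name in ("fmoe", "fmoe_cuda"):
        origin = module_status(name)
        print(f"{name}: {origin if origin else 'NOT FOUND on sys.path'}")
```

Run it with the same Python interpreter that launches pretrain_gpt.py (in the log above, the one from the conda environment). If `fmoe` resolves to the copied source folder and `fmoe_cuda` is not found, the extension has not been installed into that environment.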


a-adomavicius commented Oct 22, 2024

I'm getting
ModuleNotFoundError: No module named 'fmoe_cuda'

when attempting to use fmoefy. I installed the CUDA module using setup.py as suggested, but the fmoe_cuda module still cannot be imported. Here is the relevant CUDA-related output from running the installation:

/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.2
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')

Are there specific CUDA-related requirements that I am missing, or do I need to downgrade something?
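For context on the first warning: it compares the CUDA toolkit found on the machine (12.2) with the version PyTorch was built against (12.1) and says the difference is only in the minor number. The sketch below reproduces that comparison on plain version strings. The function name `classify_cuda_mismatch` is hypothetical, and the note about a major mismatch stopping the build reflects my understanding of torch's extension builder rather than anything stated in this thread.

```python
def classify_cuda_mismatch(detected, torch_built):
    """Compare two 'major.minor' CUDA version strings.

    Returns 'match', 'minor', or 'major' depending on where they differ.
    """
    d_major, d_minor = (int(part) for part in detected.split(".")[:2])
    t_major, t_minor = (int(part) for part in torch_built.split(".")[:2])
    if d_major != t_major:
        # Assumption: a major mismatch is treated as an error by the build.
        return "major"
    if d_minor != t_minor:
        # This is the situation reported in the warning above.
        return "minor"
    return "match"


if __name__ == "__main__":
    # Values taken from the warning in this comment. In a real environment
    # the second argument would come from torch.version.cuda.
    print(classify_cuda_mismatch("12.2", "12.1"))
```

A "minor" result matches the warning's own wording that it is most likely not a problem, so the warnings alone do not explain the missing module.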

Owner

laekov commented Oct 23, 2024

I have not tried to compile the fmoe_cuda module with a different nvcc, so I am not sure whether you should downgrade. I think you should first check whether the fmoe_cuda module is compiled and accessible. There should be an fmoe_cuda.cpython-***.so file in the site-packages/fastmoe* directory of your Python library directory.
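The check described here can be scripted. The following is a minimal sketch, assuming the extension was installed into one of the interpreter's standard package directories; the helper name `find_extension_files` is hypothetical. It looks for `fmoe_cuda*.so` both directly in each directory and one level down, which covers a `fastmoe*` subdirectory.

```python
import glob
import os
import site
import sysconfig


def find_extension_files(module_name, search_dirs=None):
    """Return sorted paths of compiled '<module_name>*.so' files under search_dirs."""
    if search_dirs is None:
        # Default to the package directories of the running interpreter.
        search_dirs = set(site.getsitepackages())
        search_dirs.add(site.getusersitepackages())
        paths = sysconfig.get_paths()
        search_dirs.update((paths["purelib"], paths["platlib"]))
    matches = []
    for base in search_dirs:
        # Check the directory itself and one level below it (e.g. fastmoe*/).
        patterns = (
            os.path.join(base, f"{module_name}*.so"),
            os.path.join(base, "*", f"{module_name}*.so"),
        )
        for pattern in patterns:
            matches.extend(glob.glob(pattern))
    return sorted(set(matches))


if __name__ == "__main__":
    found = find_extension_files("fmoe_cuda")
    if found:
        for path in found:
            print("found:", path)
    else:
        print("no fmoe_cuda*.so found in this interpreter's package directories")
```

An empty result suggests the build either failed or was installed for a different Python interpreter than the one used to run training.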
