
Error while running #25

Open
tutussss opened this issue Dec 10, 2020 · 11 comments

Comments
@tutussss

Running on a Linux system; the dependency packages and apex are installed.
The working directory is the project root.
Pretrained model stored under the project root: 80000/mp_rank_00_model_states.pt

Command: !bash scripts/generate_text.sh mpu/ example.txt
Error output:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "generate_samples.py", line 26, in <module>
    from utils import Timers
  File "/content/CPM-Generate/utils.py", line 25, in <module>
    from fp16 import FP16_Optimizer
  File "/content/CPM-Generate/fp16/__init__.py", line 15, in <module>
    from .fp16util import (
  File "/content/CPM-Generate/fp16/fp16util.py", line 21, in <module>
    import mpu
  File "/content/CPM-Generate/mpu/__init__.py", line 35, in <module>
    from .layers import ColumnParallelLinear
  File "/content/CPM-Generate/mpu/layers.py", line 28, in <module>
    from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
ModuleNotFoundError: No module named 'apex.normalization'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', 'mpu/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

@zzy14
Contributor

zzy14 commented Dec 14, 2020

There may be a problem with your apex installation; it reports No module named 'apex.normalization'. You could try the docker image we provide.

@lichen222

> There may be a problem with your apex installation; it reports No module named 'apex.normalization'. You could try the docker image we provide.

I hit this problem too, and it persists even when I use the docker image. What could be going on?

@zhenhao-huang

Run the command: bash scripts/generate_text.sh /path/to/CPM example.txt
Put the model files under the /path/to/CPM directory.

@lichen222

> Run the command: bash scripts/generate_text.sh /path/to/CPM example.txt
> Put the model files under the /path/to/CPM directory.

Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "generate_samples.py", line 380, in <module>
    main()
  File "generate_samples.py", line 361, in main
    initialize_distributed(args)
  File "generate_samples.py", line 96, in initialize_distributed
    device = args.rank % torch.cuda.device_count()
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

The model is already in the right place; what is causing this now?

@zhenhao-huang

> The model is already in the right place; what is causing this now?

See #22: this needs to run on two GPUs.

@tutussss
Author

I don't have two GPUs at the moment, just a single 32G V100. Any suggestions or workarounds?

@zzy14
Contributor

zzy14 commented Dec 25, 2020

Yes: there is now a model conversion script (change_mp.py) that converts the checkpoint from two GPUs to one. Please see the updated README.

@lichen222

I set up the third-party libraries myself. The machine has 8 GPUs; where do I specify which two of them to use?
bash scripts/generate_text.sh ./ example.txt


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "generate_samples.py", line 380, in <module>
    main()
  File "generate_samples.py", line 361, in main
    initialize_distributed(args)
  File "generate_samples.py", line 96, in initialize_distributed
    device = args.rank % torch.cuda.device_count()
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lichen298097/anaconda3/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

@zzy14
Contributor

zzy14 commented Jan 2, 2021

@lichen222 Try setting an environment variable: put CUDA_VISIBLE_DEVICES=0,1 in front of the command.

@lichen222

> @lichen222 Try setting an environment variable: put CUDA_VISIBLE_DEVICES=0,1 in front of the command.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lichen298097/anaconda3/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' died with <Signals.SIGSEGV: 11>.

Not sure why, but now there is another problem; it looks like something to do with multi-process, multi-GPU execution.

@zzy14
Contributor

zzy14 commented Jan 19, 2021

> Not sure why, but now there is another problem; it looks like something to do with multi-process, multi-GPU execution.

That error doesn't seem very informative 😂 Did you get it running in the end?
