
Error while running #25

Open
tutussss opened this issue Dec 10, 2020 · 11 comments

Comments
@tutussss

Running on a Linux system; the dependency packages and apex are installed.
The working directory is the project root.
Pretrained model stored under the project root: 80000/mp_rank_00_model_states.pt

Command: !bash scripts/generate_text.sh mpu/ example.txt
Error output:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "generate_samples.py", line 26, in <module>
    from utils import Timers
  File "/content/CPM-Generate/utils.py", line 25, in <module>
    from fp16 import FP16_Optimizer
  File "/content/CPM-Generate/fp16/__init__.py", line 15, in <module>
    from .fp16util import (
  File "/content/CPM-Generate/fp16/fp16util.py", line 21, in <module>
    import mpu
  File "/content/CPM-Generate/mpu/__init__.py", line 35, in <module>
    from .layers import ColumnParallelLinear
  File "/content/CPM-Generate/mpu/layers.py", line 28, in <module>
    from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
ModuleNotFoundError: No module named 'apex.normalization'
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', 'mpu/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

@zzy14
Contributor

zzy14 commented Dec 14, 2020

There may be a problem with your apex installation; it reports No module named 'apex.normalization'. You could try the docker image we provide.

@lichen222

> There may be a problem with your apex installation; it reports No module named 'apex.normalization'. You could try the docker image we provide.

I hit this problem too, and it persists even when I use the docker image. What could be going on?

@zhenhao-huang

Run the command: bash scripts/generate_text.sh /path/to/CPM example.txt
Put the model files under the /path/to/CPM directory.

@lichen222

> Run the command: bash scripts/generate_text.sh /path/to/CPM example.txt
> Put the model files under the /path/to/CPM directory.

Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "generate_samples.py", line 380, in <module>
    main()
  File "generate_samples.py", line 361, in main
    initialize_distributed(args)
  File "generate_samples.py", line 96, in initialize_distributed
    device = args.rank % torch.cuda.device_count()
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

The model is already in the right place; what is causing this now?

@zhenhao-huang

> The model is already in the right place; what is causing this now?

See #22: this needs to run on two GPUs.

@tutussss
Author

I don't have two GPUs at the moment, just a single 32G V100. Any suggestions or workarounds?

@zzy14
Contributor

zzy14 commented Dec 25, 2020

Yes: there is now a model conversion script (change_mp.py) that converts the checkpoint from two GPUs to one. Please see the updated README.

@lichen222

I set up the third-party libraries myself. The machine has 8 GPUs; where do I specify which two of them to use?
bash scripts/generate_text.sh ./ example.txt


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "generate_samples.py", line 380, in <module>
    main()
  File "generate_samples.py", line 361, in main
    initialize_distributed(args)
  File "generate_samples.py", line 96, in initialize_distributed
    device = args.rank % torch.cuda.device_count()
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lichen298097/anaconda3/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

@zzy14
Contributor

zzy14 commented Jan 2, 2021

@lichen222 Try setting an environment variable: put CUDA_VISIBLE_DEVICES=0,1 in front of the command.

@lichen222

> @lichen222 Try setting an environment variable: put CUDA_VISIBLE_DEVICES=0,1 in front of the command.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Generate Samples
WARNING: No training data specified
using world size: 2 and model-parallel size: 2

using dynamic loss scaling
Traceback (most recent call last):
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lichen298097/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/lichen298097/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/lichen298097/anaconda3/bin/python', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', './', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' died with <Signals.SIGSEGV: 11>.

Not sure why, but now there is another problem; it looks like something to do with multi-process, multi-GPU execution.

@zzy14
Contributor

zzy14 commented Jan 19, 2021

> Not sure why, but now there is another problem; it looks like something to do with multi-process, multi-GPU execution.

That error doesn't seem very informative 😂 Did you get it running in the end?
