resnet50 terminates abruptly partway through the first epoch #82

Open
smile0655 opened this issue May 30, 2023 · 0 comments

Comments

@smile0655
Contributor

Problem: the ResNet50 run in the routine job did not complete; it terminated abruptly partway through the first epoch.
Script: https://github.com/Oneflow-Inc/OneAutoTest/blob/main/onebench/models/ResNet50/run_week.sh
Error log:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


------------------------ arguments ------------------------
batches_per_epoch ............................... 1000
channel_last .................................... True
ddp ............................................. False
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 1.28
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 50
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /ssd/dataset/ImageNet/ofrecord
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... False
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 40
train_global_batch_size ......................... 1280
use_fp16 ........................................ True
use_gpu_decode .................................. True
val_batch_size .................................. 20
val_batches_per_epoch ........................... 78
val_global_batch_size ........................... 640
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
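
The derived values in the argument dump above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the training script's own logic: the helper name `derived_batches` is hypothetical, and the use of floor division (dropping an incomplete final batch) is an assumption that happens to reproduce the printed numbers.

```python
# Values copied from the argument dump above.
args = {
    "samples_per_epoch": 1281167,
    "val_samples_per_epoch": 50000,
    "train_batch_size": 40,
    "val_batch_size": 20,
    "train_global_batch_size": 1280,
    "val_global_batch_size": 640,
    "batches_per_epoch": 1000,
    "val_batches_per_epoch": 78,
    "num_devices_per_node": 8,
    "num_nodes": 1,
}


def derived_batches(samples, global_batch):
    # Assumption: an incomplete final batch is dropped (floor division).
    return samples // global_batch


train_batches = derived_batches(
    args["samples_per_epoch"], args["train_global_batch_size"]
)
val_batches = derived_batches(
    args["val_samples_per_epoch"], args["val_global_batch_size"]
)
world_size = args["num_devices_per_node"] * args["num_nodes"]

print("train batches per epoch:", train_batches)
print("val batches per epoch:", val_batches)
print("per-device batch x world size:", args["train_batch_size"] * world_size)
```

Under that assumption the sketch reproduces `batches_per_epoch` (1000) and `val_batches_per_epoch` (78). It also shows that `train_batch_size` × `num_devices_per_node` × `num_nodes` is 320, while the printed `train_global_batch_size` is 1280. The log does not show how the global value was derived, so the sketch takes the printed value as given.
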
***** Model Init *****
W20230526 08:19:08.200799 999559 eager_local_op_interpreter.cpp:272] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
***** Model Init Finish, time escapled: 1.93146 s *****
[rank:2] [train], epoch: 0/50, iter: 100/1000, loss: 0.86197, top1: 0.00300, throughput: 84.68 | 2023-05-26 08:19:55.622
[rank:6] [train], epoch: 0/50, iter: 100/1000, loss: 0.86224, top1: 0.00300, throughput: 84.68 | 2023-05-26 08:19:55.623
[rank:5] [train], epoch: 0/50, iter: 100/1000, loss: 0.86172, top1: 0.00262, throughput: 84.68 | 2023-05-26 08:19:55.624
[rank:7] [train], epoch: 0/50, iter: 100/1000, loss: 0.86216, top1: 0.00350, throughput: 84.68 | 2023-05-26 08:19:55.625
[rank:0] [train], epoch: 0/50, iter: 100/1000, loss: 0.86167, top1: 0.00281, throughput: 84.68 | 2023-05-26 08:19:55.625
[rank:4] [train], epoch: 0/50, iter: 100/1000, loss: 0.86215, top1: 0.00256, throughput: 84.68 | 2023-05-26 08:19:55.622
[rank:1] [train], epoch: 0/50, iter: 100/1000, loss: 0.86248, top1: 0.00275, throughput: 84.67 | 2023-05-26 08:19:55.624
[rank:3] [train], epoch: 0/50, iter: 100/1000, loss: 0.86204, top1: 0.00294, throughput: 84.67 | 2023-05-26 08:19:55.624
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/26 08:19:55.698, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/26 08:19:55.699, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/26 08:19:55.700, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/26 08:19:55.700, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2023/05/26 08:19:55.701, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.701, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
2023/05/26 08:19:55.702, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
2023/05/26 08:19:55.704, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB
2023/05/26 08:19:55.705, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.706, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.708, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.710, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.712, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB
2023/05/26 08:19:55.712, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.713, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.715, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.718, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.719, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB
2023/05/26 08:19:55.720, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.721, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.723, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.724, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.724, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.725, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.726, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.728, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.728, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.729, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.731, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.732, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.732, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.733, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.733, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.735, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB
2023/05/26 08:19:55.735, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.736, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.738, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.740, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.740, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.741, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.742, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.744, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB
2023/05/26 08:19:55.744, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.745, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.748, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.748, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.748, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.749, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.751, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB
2023/05/26 08:19:55.752, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.754, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.755, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.756, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
2023/05/26 08:19:55.758, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB
[rank:3] [train], epoch: 0/50, iter: 200/1000, loss: 0.83563, top1: 0.01081, throughput: 264.51 | 2023-05-26 08:20:10.746
[rank:4] [train], epoch: 0/50, iter: 200/1000, loss: 0.83491, top1: 0.01112, throughput: 264.48 | 2023-05-26 08:20:10.746
[rank:5] [train], epoch: 0/50, iter: 200/1000, loss: 0.83465, top1: 0.01087, throughput: 264.50 | 2023-05-26 08:20:10.747
[rank:0] [train], epoch: 0/50, iter: 200/1000, loss: 0.83560, top1: 0.01038, throughput: 264.48 | 2023-05-26 08:20:10.749
[rank:2] [train], epoch: 0/50, iter: 200/1000, loss: 0.83530, top1: 0.00956, throughput: 264.44 | 2023-05-26 08:20:10.748
[rank:1] [train], epoch: 0/50, iter: 200/1000, loss: 0.83483, top1: 0.01100, throughput: 264.50 | 2023-05-26 08:20:10.747
[rank:6] [train], epoch: 0/50, iter: 200/1000, loss: 0.83393, top1: 0.01156, throughput: 264.48 | 2023-05-26 08:20:10.747
[rank:7] [train], epoch: 0/50, iter: 200/1000, loss: 0.83595, top1: 0.01075, throughput: 264.48 | 2023-05-26 08:20:10.749
[rank:0] [train], epoch: 0/50, iter: 300/1000, loss: 0.81221, top1: 0.01681, throughput: 279.81 | 2023-05-26 08:20:25.044
[rank:3] [train], epoch: 0/50, iter: 300/1000, loss: 0.81065, top1: 0.01719, throughput: 279.77 | 2023-05-26 08:20:25.044
[rank:4] [train], epoch: 0/50, iter: 300/1000, loss: 0.81161, top1: 0.01575, throughput: 279.77 | 2023-05-26 08:20:25.044
[rank:6] [train], epoch: 0/50, iter: 300/1000, loss: 0.81175, top1: 0.01575, throughput: 279.77 | 2023-05-26 08:20:25.045
[rank:7] [train], epoch: 0/50, iter: 300/1000, loss: 0.81004, top1: 0.01738, throughput: 279.80 | 2023-05-26 08:20:25.045
[rank:2] [train], epoch: 0/50, iter: 300/1000, loss: 0.81092, top1: 0.01550, throughput: 279.79 | 2023-05-26 08:20:25.044
[rank:1] [train], epoch: 0/50, iter: 300/1000, loss: 0.81099, top1: 0.01838, throughput: 279.77 | 2023-05-26 08:20:25.045
[rank:5] [train], epoch: 0/50, iter: 300/1000, loss: 0.81109, top1: 0.01731, throughput: 279.77 | 2023-05-26 08:20:25.045
[rank:2] [train], epoch: 0/50, iter: 400/1000, loss: 0.79602, top1: 0.02162, throughput: 285.49 | 2023-05-26 08:20:39.055
[rank:7] [train], epoch: 0/50, iter: 400/1000, loss: 0.79583, top1: 0.02387, throughput: 285.50 | 2023-05-26 08:20:39.055
[rank:4] [train], epoch: 0/50, iter: 400/1000, loss: 0.79490, top1: 0.02044, throughput: 285.47 | 2023-05-26 08:20:39.056
[rank:5] [train], epoch: 0/50, iter: 400/1000, loss: 0.79409, top1: 0.02188, throughput: 285.49 | 2023-05-26 08:20:39.056
[rank:1] [train], epoch: 0/50, iter: 400/1000, loss: 0.79505, top1: 0.02175, throughput: 285.51 | 2023-05-26 08:20:39.055
[rank:3] [train], epoch: 0/50, iter: 400/1000, loss: 0.79605, top1: 0.02013, throughput: 285.48 | 2023-05-26 08:20:39.055
[rank:0] [train], epoch: 0/50, iter: 400/1000, loss: 0.79391, top1: 0.02344, throughput: 285.47 | 2023-05-26 08:20:39.056
[rank:6] [train], epoch: 0/50, iter: 400/1000, loss: 0.79521, top1: 0.02175, throughput: 285.49 | 2023-05-26 08:20:39.056
[rank:3] [train], epoch: 0/50, iter: 500/1000, loss: 0.78024, top1: 0.02750, throughput: 280.28 | 2023-05-26 08:20:53.327
[rank:4] [train], epoch: 0/50, iter: 500/1000, loss: 0.78196, top1: 0.02756, throughput: 280.26 | 2023-05-26 08:20:53.328
[rank:5] [train], epoch: 0/50, iter: 500/1000, loss: 0.78153, top1: 0.02837, throughput: 280.25 | 2023-05-26 08:20:53.329
[rank:7] [train], epoch: 0/50, iter: 500/1000, loss: 0.78232, top1: 0.02688, throughput: 280.24 | 2023-05-26 08:20:53.329
[rank:0] [train], epoch: 0/50, iter: 500/1000, loss: 0.78100, top1: 0.02794, throughput: 280.24 | 2023-05-26 08:20:53.329
[rank:2] [train], epoch: 0/50, iter: 500/1000, loss: 0.78063, top1: 0.02869, throughput: 280.26 | 2023-05-26 08:20:53.328
[rank:6] [train], epoch: 0/50, iter: 500/1000, loss: 0.78208, top1: 0.02794, throughput: 280.24 | 2023-05-26 08:20:53.329
[rank:1] [train], epoch: 0/50, iter: 500/1000, loss: 0.78039, top1: 0.02725, throughput: 280.22 | 2023-05-26 08:20:53.330
[rank:4] [train], epoch: 0/50, iter: 600/1000, loss: 0.76739, top1: 0.03381, throughput: 285.81 | 2023-05-26 08:21:07.323
[rank:6] [train], epoch: 0/50, iter: 600/1000, loss: 0.76635, top1: 0.03394, throughput: 285.83 | 2023-05-26 08:21:07.324
[rank:1] [train], epoch: 0/50, iter: 600/1000, loss: 0.76907, top1: 0.03556, throughput: 285.83 | 2023-05-26 08:21:07.324
[rank:7] [train], epoch: 0/50, iter: 600/1000, loss: 0.76877, top1: 0.03425, throughput: 285.80 | 2023-05-26 08:21:07.324
[rank:0] [train], epoch: 0/50, iter: 600/1000, loss: 0.76683, top1: 0.03475, throughput: 285.82[rank:2] [train], epoch: 0/50, iter: 600/1000, loss: 0.76919, top1: 0.03287, throughput: 285.79[rank:5] [train], epoch: 0/50, iter: 600/1000, loss: 0.76843, top1: 0.03237, throughput: 285.79 | 2023-05-26 08:21:07.325| 2023-05-26 08:21:07.324

| 2023-05-26 08:21:07.324
[rank:3] [train], epoch: 0/50, iter: 600/1000, loss: 0.76834, top1: 0.03563, throughput: 285.78 | 2023-05-26 08:21:07.324
[rank:0] [train], epoch: 0/50, iter: 700/1000, loss: 0.75627, top1: 0.04188, throughput: 281.29 | 2023-05-26 08:21:21.544
[rank:6] [train], epoch: 0/50, iter: 700/1000, loss: 0.75528, top1: 0.04062, throughput: 281.28 | 2023-05-26 08:21:21.544
[rank:2] [train], epoch: 0/50, iter: 700/1000, loss: 0.75596, top1: 0.03844, throughput: 281.28 | 2023-05-26 08:21:21.545
[rank:3] [train], epoch: 0/50, iter: 700/1000, loss: 0.75508, top1: 0.04425, throughput: 281.27 | 2023-05-26 08:21:21.545
[rank:4] [train], epoch: 0/50, iter: 700/1000, loss: 0.75610, top1: 0.04125, throughput: 281.25 | 2023-05-26 08:21:21.546
[rank:7] [train], epoch: 0/50, iter: 700/1000, loss: 0.75725, top1: 0.03919, throughput: 281.27 | 2023-05-26 08:21:21.546
[rank:5] [train], epoch: 0/50, iter: 700/1000, loss: 0.75407, top1: 0.04281, throughput: 281.24 | 2023-05-26 08:21:21.548
[rank:1] [train], epoch: 0/50, iter: 700/1000, loss: 0.75588, top1: 0.04056, throughput: 281.21 | 2023-05-26 08:21:21.548
[rank:7] [train], epoch: 0/50, iter: 800/1000, loss: 0.74315, top1: 0.04856, throughput: 280.02 | 2023-05-26 08:21:35.830
[rank:2] [train], epoch: 0/50, iter: 800/1000, loss: 0.74185, top1: 0.04681, throughput: 279.99 | 2023-05-26 08:21:35.831
[rank:0] [train], epoch: 0/50, iter: 800/1000, loss: 0.74356, top1: 0.04856, throughput: 279.97 | 2023-05-26 08:21:35.831
[rank:1] [train], epoch: 0/50, iter: 800/1000, loss: 0.74202, top1: 0.04931, throughput: 280.04 | 2023-05-26 08:21:35.832
[rank:5] [train], epoch: 0/50, iter: 800/1000, loss: 0.74417, top1: 0.04938, throughput: 280.04 | 2023-05-26 08:21:35.832
[rank:3] [train], epoch: 0/50, iter: 800/1000, loss: 0.74433, top1: 0.04813, throughput: 279.98 | 2023-05-26 08:21:35.832
[rank:4] [train], epoch: 0/50, iter: 800/1000, loss: 0.74303, top1: 0.04869, throughput: 279.95 | 2023-05-26 08:21:35.834
[rank:6] [train], epoch: 0/50, iter: 800/1000, loss: 0.74371, top1: 0.04662, throughput: 279.90 | 2023-05-26 08:21:35.835
[rank:6] [train], epoch: 0/50, iter: 900/1000, loss: 0.72924, top1: 0.05775, throughput: 282.84 | 2023-05-26 08:21:49.977
[rank:2] [train], epoch: 0/50, iter: 900/1000, loss: 0.73020, top1: 0.05494, throughput: 282.76 | 2023-05-26 08:21:49.977
[rank:3] [train], epoch: 0/50, iter: 900/1000, loss: 0.72997, top1: 0.05850, throughput: 282.80 | 2023-05-26 08:21:49.976
[rank:7] [train], epoch: 0/50, iter: 900/1000, loss: 0.73131, top1: 0.05394, throughput: 282.74 | 2023-05-26 08:21:49.978
[rank:0] [train], epoch: 0/50, iter: 900/1000, loss: 0.73049, top1: 0.05706, throughput: 282.77 | 2023-05-26 08:21:49.977
[rank:1] [train], epoch: 0/50, iter: 900/1000, loss: 0.72931, top1: 0.06069, throughput: 282.78 | 2023-05-26 08:21:49.977
[rank:5] [train], epoch: 0/50, iter: 900/1000, loss: 0.72920, top1: 0.05713, throughput: 282.73 | 2023-05-26 08:21:49.979
[rank:4] [train], epoch: 0/50, iter: 900/1000, loss: 0.73110, top1: 0.05444, throughput: 282.84 | 2023-05-26 08:21:49.976
[rank:6] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71598, top1: 0.06744, throughput: 279.09 | 2023-05-26 08:22:04.309
[rank:1] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71727, top1: 0.06263, throughput: 279.08 | 2023-05-26 08:22:04.310
[rank:7] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71911, top1: 0.06381, throughput: 279.09 | 2023-05-26 08:22:04.310
[rank:3] [train], epoch: 0/50, iter: 1000/1000, loss: 0.72109, top1: 0.06081, throughput: 279.06 | 2023-05-26 08:22:04.310
[rank:4] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71921, top1: 0.06088, throughput: 279.06 | 2023-05-26 08:22:04.310
[rank:0] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71580, top1: 0.06688, throughput: 279.08 | 2023-05-26 08:22:04.310
[rank:2] [train], epoch: 0/50, iter: 1000/1000, loss: 0.72062, top1: 0.06500, throughput: 279.05 | 2023-05-26 08:22:04.312
[rank:5] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71976, top1: 0.06400, throughput: 279.11 | 2023-05-26 08:22:04.311
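
To pin down where a run like this stops, the progress lines above can be summarised per rank and compared with the first fatal log line. The following is a minimal sketch: the function name `summarize` and both regular expressions are illustrative assumptions based only on the line formats visible in this log, and `SAMPLE` is a short excerpt adapted from it. Several ranks sometimes write to stdout at once, so one physical line can hold more than one progress record; `findall` is used for that reason.

```python
import re

# One progress record; several may be glued together on a single line.
PROGRESS = re.compile(
    r"\[rank:(\d+)\] \[train\], epoch: (\d+)/\d+, iter: (\d+)/(\d+)"
)
# A glog fatal line as it appears in this log, for example:
# F20230526 08:22:06.753672 1001567 normalization_kernel.cu:113] Check failed: ...
FATAL = re.compile(r"^F\d{8} (\d{2}:\d{2}:\d{2}\.\d+) \d+ (\S+)\] (.*)$")


def summarize(log_text):
    """Return ({rank: (epoch, iter, total_iters)}, first_fatal_or_None)."""
    last_iter = {}
    first_fatal = None
    for line in log_text.splitlines():
        # Later records overwrite earlier ones, so the last one per rank wins.
        for rank, epoch, it, total in PROGRESS.findall(line):
            last_iter[int(rank)] = (int(epoch), int(it), int(total))
        match = FATAL.match(line)
        if match and first_fatal is None:
            first_fatal = match.groups()  # (time, source location, message)
    return last_iter, first_fatal


SAMPLE = "\n".join([
    # Three records glued onto one line, as happens in the log above.
    "[rank:0] [train], epoch: 0/50, iter: 600/1000, loss: 0.76683, top1: 0.03475, throughput: 285.82"
    "[rank:2] [train], epoch: 0/50, iter: 600/1000, loss: 0.76919, top1: 0.03287, throughput: 285.79"
    "[rank:5] [train], epoch: 0/50, iter: 600/1000, loss: 0.76843, top1: 0.03237, throughput: 285.79 | 2023-05-26 08:21:07.325",
    "[rank:0] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71580, top1: 0.06688, throughput: 279.08 | 2023-05-26 08:22:04.310",
    "F20230526 08:22:06.753672 1001567 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL",
    "F20230526 08:22:06.776768 1001614 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL",
])

last_iter, first_fatal = summarize(SAMPLE)
for rank in sorted(last_iter):
    epoch, it, total = last_iter[rank]
    print(f"rank {rank}: epoch {epoch}, iter {it}/{total}")
print("first fatal:", first_fatal)
```

In the log above, every rank reports iter 1000/1000 at about 08:22:04, roughly two seconds before the first fatal check at 08:22:06. That suggests the failure comes after the last training iteration of epoch 0 rather than in the middle of it, although the log itself does not say which phase was running when the check failed.
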
F20230526 08:22:06.753672 1001567 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
F20230526 08:22:06.776768 1001614 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
F20230526 08:22:06.776928 1001622 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
*** Check failure stack trace: ***
F20230526 08:22:06.778252 1001596 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
F20230526 08:22:06.781836 1001541 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
@ 0x7f16b72b6e9a google::LogMessage::Fail()
F20230526 08:22:06.793205 1001690 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
F20230526 08:22:06.797328 1001585 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
F20230526 08:22:06.797571 1001620 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL
*** Check failure stack trace: ***
@ 0x7fac48c13e9a google::LogMessage::Fail()
@ 0x7f1aa6fade9a google::LogMessage::Fail()
@ 0x7fa4773f0e9a google::LogMessage::Fail()
@ 0x7f459cc5ce9a google::LogMessage::Fail()
@ 0x7f16b72b9bd1 google::LogMessage::SendToLog()
@ 0x7f054fe5fe9a google::LogMessage::Fail()
@ 0x7f91a3ae9e9a google::LogMessage::Fail()
@ 0x7f1db80fae9a google::LogMessage::Fail()
@ 0x7f1aa6fb0bd1 google::LogMessage::SendToLog()
@ 0x7fac48c16bd1 google::LogMessage::SendToLog()
@ 0x7fa4773f3bd1 google::LogMessage::SendToLog()
@ 0x7f459cc5fbd1 google::LogMessage::SendToLog()
@ 0x7f16b72b6998 google::LogMessage::Flush()
@ 0x7f054fe62bd1 google::LogMessage::SendToLog()
@ 0x7f1db80fdbd1 google::LogMessage::SendToLog()
@ 0x7f91a3aecbd1 google::LogMessage::SendToLog()
@ 0x7f1aa6fad998 google::LogMessage::Flush()
@ 0x7fac48c13998 google::LogMessage::Flush()
@ 0x7f459cc5c998 google::LogMessage::Flush()
@ 0x7fa4773f0998 google::LogMessage::Flush()
@ 0x7f16b72ba259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f054fe5f998 google::LogMessage::Flush()
@ 0x7f1db80fa998 google::LogMessage::Flush()
@ 0x7f91a3ae9998 google::LogMessage::Flush()
@ 0x7f1aa6fb1259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fac48c17259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa4773f4259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f459cc60259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f054fe63259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1db80fe259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f16b20c9aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f91a3aed259 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa472203aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f1aa1dc0aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f4597a6faef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7fac43a26aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f1db2f0daef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f919e8fcaef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7f054ac72aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor()
@ 0x7fa47220a242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7f1aa1dc7242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7f16b20d0242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7fac43a2d242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7f4597a76242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7f1db2f14242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7f919e903242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7fa46f9034ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f1a9f4c04ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f054ac79242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute()
@ 0x7fac411264ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f16af7c94ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f459516f4ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f1db060d4ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7f919bffc4ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7fa46f90369b oneflow::UserKernel::ForwardDataContent()
@ 0x7f1a9f4c069b oneflow::UserKernel::ForwardDataContent()
@ 0x7f05483724ad oneflow::UserKernel::ForwardUserKernel()
@ 0x7fac4112669b oneflow::UserKernel::ForwardDataContent()
@ 0x7f459516f69b oneflow::UserKernel::ForwardDataContent()
@ 0x7f16af7c969b oneflow::UserKernel::ForwardDataContent()
@ 0x7f1db060d69b oneflow::UserKernel::ForwardDataContent()
@ 0x7f919bffc69b oneflow::UserKernel::ForwardDataContent()
@ 0x7f054837269b oneflow::UserKernel::ForwardDataContent()
@ 0x7f1a9f481c53 oneflow::Kernel::Forward()
@ 0x7fa46f8c4c53 oneflow::Kernel::Forward()
@ 0x7fac410e7c53 oneflow::Kernel::Forward()
@ 0x7f4595130c53 oneflow::Kernel::Forward()
@ 0x7f16af78ac53 oneflow::Kernel::Forward()
@ 0x7f1db05cec53 oneflow::Kernel::Forward()
@ 0x7f919bfbdc53 oneflow::Kernel::Forward()
@ 0x7f0548333c53 oneflow::Kernel::Forward()
@ 0x7fa46f8c5229 oneflow::Kernel::Launch()
@ 0x7f1a9f482229 oneflow::Kernel::Launch()
@ 0x7fac410e8229 oneflow::Kernel::Launch()
@ 0x7f4595131229 oneflow::Kernel::Launch()
@ 0x7fa46fc1c4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f1a9f7d94e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7fac4143f4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f45954884e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f16af78b229 oneflow::Kernel::Launch()
@ 0x7f16afae24e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f1a9fef9b58 oneflow::Thread::PollMsgChannel()
@ 0x7fa47033cb58 oneflow::Thread::PollMsgChannel()
@ 0x7f1db05cf229 oneflow::Kernel::Launch()
@ 0x7f919bfbe229 oneflow::Kernel::Launch()
@ 0x7f4595ba8b58 oneflow::Thread::PollMsgChannel()
@ 0x7fac41b5fb58 oneflow::Thread::PollMsgChannel()
@ 0x7f0548334229 oneflow::Kernel::Launch()
@ 0x7f1a9fefb00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7fa47033e00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f1db09264e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f919c3154e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f4595baa00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7fac41b6100e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f16b0202b58 oneflow::Thread::PollMsgChannel()
@ 0x7f054868b4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg()
@ 0x7f16b020400e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f1aa6fc0a70 execute_native_thread_routine
@ 0x7fa477403a70 execute_native_thread_routine
@ 0x7f1bb0a06609 start_thread
@ 0x7fa580e49609 start_thread
@ 0x7f1bb07d1133 clone
Stack trace (most recent call last) in thread 1001614:
@ 0x7fa580c14133 clone
Stack trace (most recent call last) in thread 1001596:
@ 0x7f459cc6fa70 execute_native_thread_routine
@ 0x7fac48c26a70 execute_native_thread_routine
@ 0x7f46a66b5609 start_thread
@ 0x7fad5266c609 start_thread
@ 0x7f46a6480133 clone
Stack trace (most recent call last) in thread 1001541:
@ 0x7fad52437133 clone
Stack trace (most recent call last) in thread 1001622:
@ 0x7f16b72c9a70 execute_native_thread_routine
@ 0x7f17c0d0f609 start_thread
@ 0x7f1db1046b58 oneflow::Thread::PollMsgChannel()
@ 0x7f17c0ada133 clone
Stack trace (most recent call last) in thread 1001567:
@ 0x7f919ca35b58 oneflow::Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1aa6fc0a6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7fa477403a6f", at , in 0x7f1a9fefb00d
, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa47033e00d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9fef9b57, in Thread::PollMsgChannel()
Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7f1a9f7d94e3, in 0x7fa47033cb57
, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa46fc1c4e3 Object ", in /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so
", at 0x7f1a9f482228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9f481c52 Object ", in /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.soKernel::Forward(KernelContext*) const", at
0x7fa46f8c5228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so Object "", at /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7f1a9f4c069a", at , in 0x7fa46f8c4c52UserKernel::ForwardDataContent(KernelContext*) const, in
Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa46f90369a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9f4c04ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "0x7f1aa1dc7241/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so, in ", at (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const0x7fa46f9034ac
, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "0x7f1aa1dc0aee/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so, in ", at (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const0x7fa47220a241
, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1aa6fb1258, in
Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7fa472203aee0x7f1aa6fad997, in , in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const

Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7f1aa6fb0bd00x7fa4773f4258, in , in

Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7f1aa6fade99", at , in
0x7fa4773f0997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a988ddebe, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa4773f3bd0, in

Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa4773f0e99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa468d20ebe, in
Aborted (Signal sent by tkill() 999564 1017)

Aborted (Signal sent by tkill() 999563 1017)
@ 0x7f0548dabb58 oneflow::Thread::PollMsgChannel()
@ 0x7f1db104800e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f919ca3700e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc6fa6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595baa00d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595ba8b57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f45954884e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595131228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595130c52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459516f69a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459516f4ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4597a76241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4597a6faee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc60258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5c997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5fbd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5ce99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f458e58cebe, in

Aborted (Signal sent by tkill() 999566 1017)
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c26a6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac41b6100d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac41b5fb57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac4143f4e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac410e8228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac410e7c52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac4112669a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac411264ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac43a2d241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac43a26aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c17258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c13997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c16bd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c13e99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac3a543ebe, in

Aborted (Signal sent by tkill() 999568 1017)
@ 0x7f0548dad00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f1db810da70 execute_native_thread_routine
@ 0x7f91a3afca70 execute_native_thread_routine
@ 0x7f1ec1b53609 start_thread
@ 0x7f1ec191e133 clone
Stack trace (most recent call last) in thread 1001585:
@ 0x7f92ad542609 start_thread
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72c9a6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b020400d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b0202b57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16afae24e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af78b228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af78ac52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af7c969a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af7c94ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b20d0241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b20c9aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72ba258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b6997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b9bd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b6e99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16a8be6ebe, in

Aborted (Signal sent by tkill() 999559 1017)
@ 0x7f92ad30d133 clone
Stack trace (most recent call last) in thread 1001620:
@ 0x7f054fe72a70 execute_native_thread_routine
@ 0x7f06598b8609 start_thread
@ 0x7f0659683133 clone
Stack trace (most recent call last) in thread 1001690:
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db810da6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db104800d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db1046b57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db09264e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db05cf228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db05cec52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db060d69a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db060d4ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db2f14241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db2f0daee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fe258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fa997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fdbd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fae99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1da9a2aebe, in

Aborted (Signal sent by tkill() 999560 1017)
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3afca6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919ca3700d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919ca35b57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919c3154e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bfbe228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bfbdc52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bffc69a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bffc4ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919e903241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919e8fcaee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3aed258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3ae9997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3aecbd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3ae9e99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f9195419ebe, in

Aborted (Signal sent by tkill() 999562 1017)
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe72a6f, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548dad00d, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548dabb57, in Thread::PollMsgChannel()
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054868b4e3, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548334228, in Kernel::Launch(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548333c52, in Kernel::Forward(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054837269a, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f05483724ac, in UserKernel::ForwardUserKernel(std::function<Blob* (std::string const&)> const&, user_op::OpKernelState*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054ac79241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054ac72aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe63258, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe5f997, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe62bd0, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe5fe99, in
Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054178febe, in

Aborted (Signal sent by tkill() 999561 1017)
Killing subprocess 999559
Killing subprocess 999560
Killing subprocess 999561
Killing subprocess 999562
Killing subprocess 999563
Killing subprocess 999564
Killing subprocess 999566
Killing subprocess 999568
Traceback (most recent call last):
File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
main()
File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
sigkill_handler(signal.SIGTERM, None)
File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/data/home/zhouhongjun/miniconda3/envs/week_resnet/bin/python3', '-u', '/data/home/zhouhongjun/week_test/models/Vision/classification/image/resnet50/train.py', '--ofrecord-path', '/ssd/dataset/ImageNet/ofrecord', '--ofrecord-part-num', '256', '--num-devices-per-node', '8', '--lr', '1.28', '--momentum', '0.875', '--num-epochs', '50', '--train-batch-size', '40', '--train-global-batch-size', '1280', '--val-batch-size', '20', '--val-global-batch-size', '640', '--print-interval', '100', '--use-fp16', '--channel-last', '--scale-grad', '--graph', '--fuse-bn-relu', '--fuse-bn-add-relu', '--use-gpu-decode']' died with <Signals.SIGABRT: 6>.
oneflow-version(git_commit)=0.9.1.dev20230525+cu117
oneflow-commit(git_commit)=08ded68
oneflow-models(git_commit)=fc7cbf8da9b2ee21fa0e9613dd0668c3b45dad4d
