Training fails at iteration 20 #37

Open
pursuingz opened this issue Jan 11, 2024 · 1 comment
Comments

@pursuingz

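I ran training three times, and every run failed at iteration 20. Could the maintainer take a look?
The first two runs failed with the following error: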

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

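For the last run I followed the suggestion in the error message and set CUDA_LAUNCH_BLOCKING=1 before launching. That run failed with the following error: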

terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
  File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
                       ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
  File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
                                  ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
            ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input

Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Aborted (core dumped)
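
A side note on the `CUDA_LAUNCH_BLOCKING=1` setting used for the last run: it can also be applied from the Python entry point rather than the shell, provided this happens before CUDA is initialized. The following is only a minimal sketch, assuming it is placed at the very top of the launch script (`learner_test.py` in the trace above) ahead of any `torch` import; the helper name `enable_blocking_launches` is hypothetical and not part of the project.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper: the variable has to be set before CUDA is initialized,
    so call this before importing torch or loading the compiled library.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Assumed placement: first statement of the launch script.
enable_blocking_launches()
```

Synchronous launches slow training down noticeably, so this is a debugging aid rather than a setting to leave on.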
@hijkzzz
Owner

hijkzzz commented Jan 18, 2024

Use CUDA 11.6, PyTorch 1.10, LibTorch 1.10 (Pre-cxx11 ABI), and SWIG 4.0.2.
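
One way to compare an environment against the versions recommended above is a small check script. This is a rough sketch, not part of the project: the function names are hypothetical, it assumes the Python-side `torch` package is the build the extension links against (the stack trace loads its libraries from `site-packages/torch/lib`), and it does not check SWIG or a separately downloaded LibTorch.

```python
# Versions recommended in the comment above (major.minor only).
EXPECTED_TORCH = "1.10"
EXPECTED_CUDA = "11.6"


def major_minor(version):
    """Reduce a version such as '1.10.2+cu113' to '1.10'; None stays None."""
    if version is None:
        return None
    return ".".join(version.split("+")[0].split(".")[:2])


def find_mismatches(torch_version, cuda_version, cxx11_abi):
    """Return human-readable differences from the recommended setup."""
    problems = []
    if major_minor(torch_version) != EXPECTED_TORCH:
        problems.append(f"PyTorch {torch_version}, expected {EXPECTED_TORCH}.x")
    if major_minor(cuda_version) != EXPECTED_CUDA:
        problems.append(f"CUDA {cuda_version}, expected {EXPECTED_CUDA}")
    if cxx11_abi:
        problems.append("torch was built with the cxx11 ABI, expected Pre-cxx11")
    return problems


def check_installed():
    """Inspect the installed torch; returns None when torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return find_mismatches(
        torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi()
    )
```

Calling `check_installed()` returns an empty list when the installed PyTorch matches the recommendation, a list of mismatch descriptions otherwise, and `None` when `torch` cannot be imported.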
