Crash before Second Pass #124

Open
half-potato opened this issue Feb 27, 2023 · 2 comments

@half-potato

Not sure what kind of information you need to debug this.

Base mesh has 88958 triangles and 44260 vertices.
Writing mesh:  out/nerf_car/dmtet_mesh/mesh.obj
    writing 44260 vertices
    writing 88260 texcoords
    writing 44260 normals
    writing 88958 faces
Writing material:  out/nerf_car/dmtet_mesh/mesh.mtl
Done exporting mesh
Traceback (most recent call last):
  File "/home/amai/nvdiffrec/train.py", line 625, in <module>
    geometry, mat = optimize_mesh(glctx, geometry, base_mesh.material, lgt, dataset_train, dataset_validate, FLAGS, 
  File "/home/amai/nvdiffrec/train.py", line 420, in optimize_mesh
    img_loss, reg_loss = trainer(target, it)
  File "/home/amai/.conda/envs/31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/amai/nvdiffrec/train.py", line 304, in forward
    return self.geometry.tick(glctx, target, self.light, self.material, self.image_loss_fn, it)
  File "/home/amai/nvdiffrec/geometry/dlmesh.py", line 68, in tick
    reg_loss = torch.tensor([0], dtype=torch.float32, device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what():  Cuda error: 700[cudaGraphicsUnregisterResource(s.cudaPosBuffer);]
Exception raised from rasterizeReleaseBuffers at /home/amai/.conda/envs/31/lib/python3.10/site-packages/nvdiffrast/common/rasterize.cpp:616 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe5edf57497 in /home/amai/.conda/envs/31/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe5edf2ec94 in /home/amai/.conda/envs/31/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: rasterizeReleaseBuffers(int, RasterizeGLState&) + 0xe1 (0x7fe56cf53f49 in /home/amai/.cache/torch_extensions/py310_cu116/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #3: RasterizeGLStateWrapper::~RasterizeGLStateWrapper() + 0x33 (0x7fe56cfacee1 in /home/amai/.cache/torch_extensions/py310_cu116/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #4: std::default_delete<RasterizeGLStateWrapper>::operator()(RasterizeGLStateWrapper*) const + 0x22 (0x7fe56cf939fa in /home/amai/.cache/torch_extensions/py310_cu116/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #5: std::unique_ptr<RasterizeGLStateWrapper, std::default_delete<RasterizeGLStateWrapper> >::~unique_ptr() + 0x52 (0x7fe56cf88b62 in /home/amai/.cache/torch_extensions/py310_cu116/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #6: <unknown function> + 0xc06a5 (0x7fe56cf826a5 in /home/amai/.cache/torch_extensions/py310_cu116/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #7: <unknown function> + 0x355022 (0x7fe5c4c97022 in /home/amai/.conda/envs/31/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x355eef (0x7fe5c4c97eef in /home/amai/.conda/envs/31/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #19: <unknown function> + 0x23790 (0x7fe604562790 in /usr/lib/libc.so.6)
frame #20: __libc_start_main + 0x8a (0x7fe60456284a in /usr/lib/libc.so.6)

zsh: IOT instruction (core dumped)  python train.py --config configs/nerf_car.json
@jmunkberg
Collaborator

I would suspect a memory issue. In the second pass, we switch to learning 2D textures, so the memory requirement goes up a bit. If you are running near the memory limit, try decreasing the texture resolution, e.g., with the config flag "texture_res": [ 512, 512 ], or, if you are running on a GPU with less than 32 GB of memory, also reduce the batch size.
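
As a rough sketch, the overrides could look like this in configs/nerf_car.json; "texture_res" is the flag mentioned above, while the "batch" key name and the value 4 are assumptions and should be checked against the keys already present in the config:

{
    "texture_res": [ 512, 512 ],
    "batch": 4
}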

You can track memory usage with nvidia-smi or nvitop (https://github.com/XuehaiPan/nvitop).
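
Alternatively, a minimal sketch for logging GPU memory from inside the training loop using PyTorch's built-in counters; the helper name and the place it is called from are illustrative, not existing code in train.py:

import torch

def log_gpu_memory(it):
    # Current and peak allocated memory on the default CUDA device, in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"iter {it}: allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")

# Example usage, e.g. every 100 iterations inside the optimization loop:
# if it % 100 == 0:
#     log_gpu_memory(it)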

@half-potato
Author

I agree. I just realized nvdiffrecmc works with a batch size of 6 without crashing. This is for benchmarking purposes, so I hope it doesn't decrease accuracy too much.
