
torch.cuda.OutOfMemoryError #15

Open
iason-r opened this issue Dec 23, 2023 · 19 comments
@iason-r
iason-r commented Dec 23, 2023

I'm using a 16 GB GPU. Is this error related to the size of the dataset?

@JunyuanDeng
Owner

16 GB should be enough. You can try lowering the batch size; the simplest fix is to divide the chunk_size on that line by 10 (chunk_size//10). Of course, this will slow rendering down considerably.
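The trade-off behind this advice can be sketched generically: chunked rendering bounds peak memory by the chunk size instead of the total ray count, at the cost of more iterations. A minimal pure-Python sketch (the function name `render_in_chunks` and the dummy renderer are illustrative, not from the repository):

```python
def render_in_chunks(rays, render_fn, chunk_size):
    """Process rays in fixed-size chunks so peak memory scales with
    chunk_size rather than len(rays). Shrinking chunk_size (e.g. //10)
    trades rendering speed for a smaller memory footprint."""
    outputs = []
    for i in range(0, len(rays), chunk_size):
        outputs.extend(render_fn(rays[i:i + chunk_size]))
    return outputs

# Toy example: a dummy per-chunk "renderer" that squares each value.
rays = list(range(10))
full = render_in_chunks(rays, lambda c: [r * r for r in c], chunk_size=4)
small = render_in_chunks(rays, lambda c: [r * r for r in c], chunk_size=1)
assert full == small == [r * r for r in rays]  # same result, smaller peak memory
```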

@iason-r
Author
iason-r commented Jan 3, 2024

Which file's chunk_size do I need to modify?

@JunyuanDeng
Owner

You can click the hyperlink above.

@iason-r
Author
iason-r commented Jan 3, 2024

After changing chunk_size to chunk_size//10, I still get the error:
Traceback (most recent call last):
File "/home/sucronav/.conda/envs/torch/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/sucronav/.conda/envs/torch/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/sucronav/renbin/NeRF-LOAM/src/mapping.py", line 112, in spin
self.do_mapping(share_data, tracked_frame)
File "/home/sucronav/renbin/NeRF-LOAM/src/mapping.py", line 179, in do_mapping
bundle_adjust_frames(
File "/home/sucronav/renbin/NeRF-LOAM/src/variations/render_helpers.py", line 398, in bundle_adjust_frames
final_outputs = render_rays(
File "/home/sucronav/renbin/NeRF-LOAM/src/variations/render_helpers.py", line 211, in render_rays
intersections, hits = ray_intersect(
File "/home/sucronav/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sucronav/renbin/NeRF-LOAM/src/variations/voxel_helpers.py", line 534, in ray_intersect
pts_idx, min_depth, max_depth = svo_ray_intersect(
File "/home/sucronav/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/sucronav/renbin/NeRF-LOAM/src/variations/voxel_helpers.py", line 108, in forward
children = children.expand(S * G, *children.size()).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.61 GiB (GPU 0; 7.79 GiB total capacity; 979.16 MiB already allocated; 2.48 GiB free; 1016.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
While the program was running, I refreshed nvidia-smi once a second; from what I observed, GPU memory usage never reached 100%.
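As the error text itself suggests, one knob worth trying before shrinking chunk_size further is the allocator's max_split_size_mb setting, which can help when reserved memory far exceeds allocated memory (a fragmentation symptom). A minimal sketch; the value 128 is an arbitrary example, not a tuned recommendation:

```shell
# Cap the size (in MiB) of cached blocks the CUDA caching allocator is
# allowed to split; this is the setting the OOM message points to.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then launch the program in this same shell as usual.
```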

@iason-r
Author
iason-r commented Jan 3, 2024

[Screenshot from 2024-01-03 16-18-50]

@JunyuanDeng
Owner

8 GB of VRAM really is a bit small. I've never run the program across two GPUs, so I don't know how to modify it for that. If possible, use a GPU with at least 16 GB of VRAM.

@iason-r
Author
iason-r commented Jan 16, 2024

I'm running it on a 24 GB GPU, but after nearly 24 hours it has only reached 68%. Is that normal?
[Screenshot from 2024-01-16 09-33-17]

@JunyuanDeng
Owner

Yes, it does get slower and slower as it runs; that's something we're currently working to optimize. You can use the subscene branch to speed things up. Remember to fetch the latest git updates.

@iason-r
Author
iason-r commented Jan 17, 2024

Hmm, why do I get OutOfMemory even with 24 GB of VRAM?
insert keyframe
********** current num kfs: 18 **********
tracking frame: 99%|███████████████████████████████████████████████████████████████████████████████████▎| 4501/4540 [47:26:23<44:59, 69.22s/it]Process Process-2:
Traceback (most recent call last):
File "/home/rb/anaconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/rb/anaconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/rb/NeRF-LOAM/src/mapping.py", line 112, in spin
self.do_mapping(share_data, tracked_frame)
File "/home/rb/NeRF-LOAM/src/mapping.py", line 179, in do_mapping
bundle_adjust_frames(
File "/home/rb/NeRF-LOAM/src/variations/render_helpers.py", line 398, in bundle_adjust_frames
final_outputs = render_rays(
File "/home/rb/NeRF-LOAM/src/variations/render_helpers.py", line 211, in render_rays
intersections, hits = ray_intersect(
File "/home/rb/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/rb/NeRF-LOAM/src/variations/voxel_helpers.py", line 534, in ray_intersect
pts_idx, min_depth, max_depth = svo_ray_intersect(
File "/home/rb/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/rb/NeRF-LOAM/src/variations/voxel_helpers.py", line 108, in forward
children = children.expand(S * G, *children.size()).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.50 GiB (GPU 0; 23.69 GiB total capacity; 5.14 GiB already allocated; 5.22 GiB free; 7.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@iason-r
Author
iason-r commented Jan 17, 2024

By the way, earlier the line
sampled_rays_d = frame.rays_d[sample_mask].cuda()
raised "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)",
so I changed
sample_mask = frame.sample_mask.cuda()
sampled_rays_d = frame.rays_d[sample_mask].cuda()
to
sample_mask = frame.sample_mask.cuda()
sample_mask = sample_mask.cuda()
frame.rays_d = frame.rays_d.cuda()
sampled_rays_d = frame.rays_d[sample_mask]
Does this change affect anything overall?

@JunyuanDeng
Owner

Tried to allocate 5.50 GiB (GPU 0; 23.69 GiB total capacity; 5.14 GiB already allocated; 5.22 GiB free; 7.45 GiB reserved in total by PyTorch)

5 GB already allocated, 7.5 GB reserved; in theory, on a 24 GB card you should still have about 11.5 GB of VRAM left. Are you running other algorithms at the same time? You can run the subscene branch, which will also be faster.
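The arithmetic above can be checked directly against the figures in the traceback. A minimal sketch; the interpretation in the comments is one plausible reading, not a definitive diagnosis:

```python
# Figures from the OOM message above (all GiB).
total, allocated, reserved, free, tried = 23.69, 5.14, 7.45, 5.22, 5.50

# Memory outside PyTorch's reservation (reserved already includes allocated):
outside = total - reserved
assert round(outside, 2) == 16.24

# Yet the driver reports only 5.22 GiB free -- less than the 5.50 GiB
# requested, hence the OOM. The ~11 GiB gap between `outside` and `free`
# is consistent with VRAM held by other processes (or fragmentation).
gap = outside - free
assert round(gap, 2) == 11.02
```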

@iason-r
Author
iason-r commented Jan 20, 2024

Hello, I've been trying to run your package for a while now, but I keep getting stuck on the OutOfMemory problem. While working on it I'd like to study your code. Do you have any advice on learning the code, and on deep learning in general? I'm a first-year master's student; I've been doing project work since enrolling and am just starting research, and I haven't studied deep learning in depth before.

@JunyuanDeng
Owner

If you haven't studied deep learning before, start with the book Dive-into-DL-PyTorch (available in both Chinese and English). You could also study machine learning first, though you can skip that if you're short on time. Once you have the basics of deep learning, if you want to understand the SLAM side, read the book "14 Lectures on Visual SLAM" for the fundamentals. With basic SLAM and deep learning knowledge in hand, read a PyTorch implementation of NeRF to understand how NeRF works, and finally survey the NeRF-SLAM literature, such as this repository: look for the most-cited and most-starred repositories.

@iason-r
Author
iason-r commented Jan 22, 2024

OK, thank you.
I got it running over the last couple of days, but it diverged on several runs; the localization doesn't seem very good.

@JunyuanDeng
Owner

Are you running KITTI? If it's a different scene, you may need to adjust the learning rate.

@iason-r
Author
iason-r commented Jan 22, 2024

KITTI 00.

@hhongwei1009
hhongwei1009 commented Jun 20, 2024

> OK, thank you. I got it running over the last couple of days, but it diverged on several runs; the localization doesn't seem very good.

How did you get it running? I'm also stuck on out of memory.

@boyang9602
boyang9602 commented Sep 2, 2024

@hhongwei1009 What GPU do you have? I used the same one as the author, and it runs fine.
I tried KITTI 09 with the default 09 config; the result is close to the paper's (slightly worse).

APE w.r.t. translation part (m)
(with SE(3) Umeyama alignment)

       max	16.220123
      mean	5.422567
    median	3.834926
       min	0.216227
      rmse	7.052621
       sse	79135.482800
       std	4.509460
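For reference, the table above is the set of summary statistics that evaluation tools such as evo report for APE. They can be reproduced from per-frame absolute translation errors roughly like this (a pure-Python sketch with toy values, not the actual KITTI 09 data; whether the tool uses population or sample standard deviation is an assumption here):

```python
import math
from statistics import mean, median, pstdev

def ape_stats(errors):
    """Summary statistics over per-frame absolute translation errors (m)."""
    sse = sum(e * e for e in errors)  # sum of squared errors
    return {
        "max": max(errors),
        "mean": mean(errors),
        "median": median(errors),
        "min": min(errors),
        "rmse": math.sqrt(sse / len(errors)),
        "sse": sse,
        "std": pstdev(errors),  # population std (assumed convention)
    }

# Toy example (not real KITTI data):
stats = ape_stats([1.0, 2.0, 3.0, 4.0])
assert stats["mean"] == 2.5 and stats["sse"] == 30.0
assert abs(stats["rmse"] - math.sqrt(7.5)) < 1e-12
```

Note that rmse² = mean² + std² holds for these statistics, which is a quick consistency check on any reported table.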

@hhongwei1009

> @hhongwei1009 What GPU do you have? I used the same one as the author, and it runs fine. I tried KITTI 09 with the default 09 config; the result is close to the paper's (slightly worse). [...]

I have a 4080S; I've already given up.
