ALSA lib error #18
Open
Cherryjingyao opened this issue Mar 15, 2024 · 15 comments

@Cherryjingyao

While running the training code, I get the following ALSA error:

ALSA lib conf.c:5180:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5703:(snd_config_expand) Evaluate error: No such file or directory

Why does this use ALSA, and how can I fix it?
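For reference, ALSA messages like these usually come from an audio backend that one of the dependencies probes at import time; they are typically harmless warnings rather than the cause of a crash. If they are only noise, one common workaround (a sketch, assuming the messages are printed through libasound.so.2's error handler) is to register a no-op ALSA error handler via ctypes before the offending import:

```python
from ctypes import CDLL, CFUNCTYPE, c_char_p, c_int

# ALSA's error handler signature: (file, line, function, err, fmt, ...)
_ALSA_ERROR_HANDLER = CFUNCTYPE(None, c_char_p, c_int, c_char_p, c_int, c_char_p)

def _silence_alsa(filename, line, function, err, fmt):
    pass  # drop ALSA warnings instead of printing them to stderr

# Keep a reference so the callback is not garbage collected.
_alsa_handler = _ALSA_ERROR_HANDLER(_silence_alsa)
CDLL("libasound.so.2").snd_lib_error_set_handler(_alsa_handler)
```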

@lukashermann
Owner

Can you give a bit more context? At which point does this error occur? What are you running?

@Cherryjingyao
Copy link
Author

I'm running the training script with the debug dataset:
python hulc/training.py +trainer.gpus=-1 +datamodule.root_data_dir=/data/calvin/debug_dataset datamodule/datasets=vision_lang_shm
After validation, the error appears:
[screenshot]

PS: when I use the ABC_D dataset, it crashes while loading the data.

@lukashermann
Owner

It could be a problem with the GPU renderer when you run the rollouts. Can you try turning off rollouts and see if it still crashes? You have to add ~callbacks/rollout and ~callbacks/rollout_lh to the command line arguments. Which GPU do you have in your machine?

For the bigger dataset, you might not have enough shared memory, so try using the normal dataloader by setting datamodule/datasets=vision_lang.
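If you are unsure whether shared memory is the limiting factor, a quick check (a sketch, not part of the repo; it assumes a Linux machine where /dev/shm backs shared memory) is:

```python
import shutil

# /dev/shm is the tmpfs that typically backs shared memory on Linux.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```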


@Cherryjingyao
Author

When adding ~callbacks/rollout and ~callbacks/rollout_lh, it shows:
[screenshot]

I have 4 A100s with 40 GB each and 5 EGL devices; only the one with ID=4 can be used, so I set EGL_DEVICE_ID=4 when running the code, otherwise it crashes.
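For reference, the same device selection can also be pinned from inside a launch script instead of on the command line; a minimal sketch, assuming the renderer reads the EGL_DEVICE_ID environment variable as described above:

```python
import os

# Must be set before the simulator / EGL context is created,
# otherwise the default (possibly unusable) EGL device is chosen.
os.environ["EGL_DEVICE_ID"] = "4"
```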

@lukashermann
Owner

Ah sorry, my bad. Then try using only `~callbacks/rollout_lh`.

Is your Nvidia driver correctly installed? In the log you previously sent it mentions Mesa, which shouldn't be used if you have an Nvidia GPU with the correct driver.

@Cherryjingyao
Author

How can I avoid using Mesa? Here is my GPU:
[screenshot]

I tried using the normal dataloader by setting datamodule/datasets=vision_lang. It can load the data, but after validation of epoch 0, it crashes again:
```
[Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fe1fec886e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fe1fec8bc3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fe1fec8c839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fe1fec886e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fe1fec8bc3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fe1fec8c839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fe1fe9e2b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd10793f6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd107942c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd107943839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd10793f6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd107942c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd107943839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fd107699b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2aa421e6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2aa4221c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2aa4222839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2aa421e6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2aa4221c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2aa4222839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f2aa3f78b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

Error executing job with overrides: ['+trainer.gpus=-1', 'datamodule.root_data_dir=/data/calvin/task_ABC_D', 'datamodule/datasets=vision_lang', '+datamodule.num_workers=1']
Traceback (most recent call last):
File "/pfs-data/code/hulc/hulc/training.py", line 76, in train
trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore
File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
mp.start_processes(
File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGABRT
```

I wonder if this is related to the use of Mesa.

@lukashermann
Owner

I just realized there is a mistake in the README file: when pytorch-lightning upgraded to a newer version, they renamed the trainer.gpus argument to trainer.devices. This is already reflected in the code, but not in the documentation. Can you try running it on a single GPU without rollouts and without the shared memory dataloader?

python hulc/training.py trainer.devices=1 datamodule.root_data_dir=/data/calvin/debug_dataset datamodule/datasets=vision_lang ~callbacks/rollout_lh

If this works, then try the multiprocessing version by setting trainer.devices=-1.
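If the multi-GPU run later hits the 30-minute NCCL watchdog again (for example because one rank spends a long time in validation or rollouts while the others wait in an all-reduce), one option is to give the collectives more headroom. A sketch, assuming a pytorch-lightning version whose DDPStrategy accepts a timeout argument:

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Raise the process-group timeout from the default 30 minutes to 2 hours
# so a slow rank doesn't trip the NCCL watchdog during validation.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,
    strategy=DDPStrategy(timeout=timedelta(hours=2)),
)
```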

@Cherryjingyao
Author

I ran the code as you advised, but the training speed is very slow: with 4 GPUs and batch size 32, one epoch takes almost 10 hours.
[screenshot]
One more strange thing: with the debug dataset it can train normally for many epochs, but it crashes when using the ABC_D dataset; there is no error, it just stops at:
[screenshot]

@Cherryjingyao
Author

I also ran the evaluation code with the pretrained models:
python hulc/evaluation/evaluate_policy_ori.py --dataset_path /data/calvin/task_ABC_D --train_folder ./checkpoints/HULC_ABC_D
How do I get the video as output? I can't find any related parameters.

@lukashermann
Owner

lukashermann commented Mar 20, 2024

I suggest you increase the batch size, we used 8 NVIDIA RTX 2080 ti with only 12 GB of memory per GPU, so if you have A100, you can easily increase the batch size. Since you use 4 GPUs in your setup, you could start by using batch size 64 if you want to have the same effective batch size as we had in our experiments, but feel free to experiment with increasing it more. Also, you can try to use more dataloading workers by setting datamodule.datasets.vision_dataset.num_workers=4 and datamodule.datasets.lang_dataset.num_workers=4. Using the shared memory dataloader further speeds up training, but you need enough shared memory on your machine.
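For reference, the effective batch size is just the per-GPU batch size times the number of GPUs, so the suggestion above works out as follows (assuming the reference 8-GPU runs used a per-GPU batch size of 32):

```python
# effective batch size = per-GPU batch size * number of GPUs
reference = 32 * 8  # 8x RTX 2080 Ti (assumed per-GPU batch size of 32)
yours = 64 * 4      # 4x A100 with the suggested per-GPU batch size of 64
assert reference == yours == 256
```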

We trained our models for 30 epochs on 8 GPUs which took around 1 week (depending on the dataset).

> I also ran the evaluation code with the pretrained models: python hulc/evaluation/evaluate_policy_ori.py --dataset_path /data/calvin/task_ABC_D --train_folder ./checkpoints/HULC_ABC_D How do I get the video as output? I can't find any related parameters.

The code currently doesn't implement writing the video to a file; you can visualize it by setting --debug. However, it should be a straightforward modification to save the video output.
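A minimal sketch of such a modification (not part of the repo; it assumes imageio and imageio-ffmpeg are installed and that the RGB frames the rollout already renders are collected as HxWx3 uint8 numpy arrays):

```python
import imageio  # assumes imageio + imageio-ffmpeg are installed

def save_video(frames, path="rollout.mp4", fps=15):
    """Write a list of HxWx3 uint8 RGB frames collected during a rollout to an mp4 file."""
    imageio.mimwrite(path, frames, fps=fps)
```

You would append the camera image used for the --debug visualization to a list at every step and call save_video at the end of each episode.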

@Cherryjingyao
Author

Thanks for your suggestion. I can run the code normally with num_workers=4 and batch_size=64 (although, limited by memory, the speed is still slow).
After running one epoch I found no output. Where is the trained model saved, what is the save interval, and which parameters control this? (I'm not familiar with Hydra.)
Thanks again for your answers.

@lukashermann
Owner

By default, it saves the model every epoch. If you didn't set the log_dir command line argument, then it creates a folder runs in the hulc directory, where all the runs are saved.
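If you just want to locate the most recent checkpoint, a small sketch (assuming the default runs directory and pytorch-lightning's usual .ckpt extension):

```python
from pathlib import Path

# Find the newest checkpoint file anywhere under the default output directory.
ckpts = sorted(Path("runs").rglob("*.ckpt"), key=lambda p: p.stat().st_mtime)
print(ckpts[-1] if ckpts else "no checkpoints found yet")
```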

@Cherryjingyao
Author

I got it, thanks for answering!

@lukashermann
Owner

In order to make the rollouts work, did you have a look at this issue?
