ALSA lib error #18
While I run the training code, I get the following ALSA errors:

ALSA lib conf.c:5180:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5703:(snd_config_expand) Evaluate error: No such file or directory

Why does this use ALSA at all, and how can I fix it?

Comments
Can you give a bit more context? At which point does this error occur, and what are you running?
It could be a problem with the GPU renderer when you run the rollouts. Can you try turning off the rollouts and see if it still crashes? You have to add `~callbacks/rollout` and `~callbacks/rollout_lh` to the training command. For the bigger dataset, you might not have enough shared memory, so try using the normal dataloader by setting `datamodule/datasets=vision_lang`.
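A minimal sketch of the resulting command, assuming the training entry point is `hulc/training.py` (the script name is an assumption; the overrides are the ones quoted elsewhere in this thread):

```bash
# Sketch only: disable both rollout callbacks and use the normal
# (non-shared-memory) dataloader. Quote the ~ overrides so the shell
# does not attempt tilde expansion.
python hulc/training.py +trainer.gpus=-1 \
    datamodule.root_data_dir=/data/calvin/task_ABC_D \
    datamodule/datasets=vision_lang \
    '~callbacks/rollout' '~callbacks/rollout_lh'
```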
When adding `~callbacks/rollout` and `~callbacks/rollout_lh`, it shows that: … I have 4 A100s with 40 GB each, and 5 EGL devices, of which the one with ID=4 can be used.
Ah sorry, my bad, then try only using `…`. Is your Nvidia driver correctly installed? In the log you previously sent, it mentions Mesa, which shouldn't be used if you have an Nvidia GPU with the correct driver.
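A quick way to check the driver side; these are standard NVIDIA/GLVND tools, nothing specific to this repo:

```bash
# Is the NVIDIA kernel driver loaded and are all GPUs visible?
nvidia-smi

# With GLVND, EGL selects a vendor from these JSON files; a correctly
# installed NVIDIA driver ships a 10_nvidia.json entry here.
ls /usr/share/glvnd/egl_vendor.d/
```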
How can I not use Mesa? Here is my GPU: …

I tried using the normal dataloader by setting `datamodule/datasets=vision_lang`. It can load the data, but after the validation of epoch 0 it crashes again:

terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
Error executing job with overrides: ['+trainer.gpus=-1', 'datamodule.root_data_dir=/data/calvin/task_ABC_D', 'datamodule/datasets=vision_lang', '+datamodule.num_workers=1']

I wonder if this is related to the use of Mesa.
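One hedged way to keep EGL from falling back to Mesa on a machine with the NVIDIA driver installed is to point GLVND at the NVIDIA vendor library explicitly. This is standard GLVND behavior, not something from this repo, and the path assumes a default driver installation:

```bash
# Force the NVIDIA EGL vendor library instead of Mesa's.
export __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json

# Optional: verbose NCCL logging to get more detail on the allreduce timeout.
export NCCL_DEBUG=INFO
```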
I just realized there is a mistake in the README file, when … If this works, then try the multiprocessing version by setting `datamodule/datasets=vision_lang_shm`.
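Since the shared-memory dataloader needs enough space in `/dev/shm`, a quick generic check (plain Linux, not repo-specific):

```bash
# How much shared memory does this machine have?
df -h /dev/shm

# Note: inside Docker the default /dev/shm is only 64 MB; start the
# container with a larger size if needed, e.g.:
#   docker run --shm-size=32g ...
```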
I also ran the evaluation code with the pretrained models.
I suggest you increase the batch size. We used 8 NVIDIA RTX 2080 Ti GPUs with only 12 GB of memory per GPU, so if you have A100s, you can easily increase the batch size. Since you use 4 GPUs in your setup, you could start with a batch size of 64 if you want the same effective batch size as we had in our experiments, but feel free to experiment with increasing it further. Also, you can try to use more dataloading workers by setting `datamodule.num_workers` to a higher value. We trained our models for 30 epochs on 8 GPUs, which took around 1 week (depending on the dataset).
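For concreteness: the effective batch size is the per-GPU batch size times the number of GPUs, so the original setup used 8 × 32 = 256, and 4 × 64 = 256 matches it. A hedged sketch of the overrides (`datamodule.num_workers` appears in the error log above; the batch-size key name is an assumption):

```bash
# Sketch: 4 GPUs x 64 per-GPU batch = effective batch size 256,
# matching the original 8 GPUs x 32. Key names may differ in your config.
python hulc/training.py +trainer.gpus=-1 \
    datamodule.root_data_dir=/data/calvin/task_ABC_D \
    datamodule/datasets=vision_lang \
    +datamodule.num_workers=4 \
    datamodule.batch_size=64
```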
The code currently doesn't implement writing the video to a file; you can visualize it by setting `…`.
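A hedged usage sketch based on the CALVIN-style evaluation script this repo builds on; the script path, arguments, and the `--debug` flag are assumptions to adapt to your checkout:

```bash
# Sketch: evaluate a trained model and render each rollout on screen
# (--debug) instead of writing a video file, which is not implemented.
python hulc/evaluation/evaluate_policy.py \
    --dataset_path /data/calvin/task_ABC_D \
    --train_folder <path_to_trained_model_folder> \
    --debug
```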
Thanks for your suggestion. I can run the code normally with num_workers=4 and batch_size=64 (although, limited by memory, the speed is still slow).
By default, it saves the model every epoch. If you didn't set the …
I got it.
In order to make the rollouts work, did you have a look at this issue?