Data loaders abort when using multi-processing with a remote Aim repo #2540
Comments
Hey @andychisholm! Thanks for submitting the issue with such details. Really appreciate that 🙌
@alberttorosyan any thoughts on this one? Even a potentially fruitful direction to explore when debugging would be useful.
@andychisholm, I don't have good evidence on what's happening yet. The only possible thing that comes to mind is the following: I'll continue looking into this. Any additional information would be a huge help!
I'm seeing the same issue. My Aim repo is also remote. Seems related to this: Lightning-AI/pytorch-lightning#8821
Just to follow up on this one, I think it's to do with a lack of forking support in the gRPC client. Regardless of whether the Aim loggers are used in sub-processes, they blow up the data loaders in various non-deterministic ways. For example, if you do a DDP train with multiple GPUs and multiple dataloader workers per GPU this occurs, but if you switch the worker start method away from the default (fork), it does not.
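A minimal sketch of that workaround, assuming a plain PyTorch setup (the dataset, batch size, and worker count are illustrative); it spawns DataLoader workers instead of forking them:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader():
    # Illustrative dataset; the key detail is the multiprocessing_context argument.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=8,
        # "spawn" starts workers as fresh processes instead of forking the parent,
        # so they do not inherit the gRPC client state created for the remote Aim repo.
        multiprocessing_context="spawn",
        persistent_workers=True,  # keep spawned workers alive across epochs
    )

if __name__ == "__main__":  # guard required when workers are spawned rather than forked
    for batch in build_loader():
        pass
```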
I can also confirm this issue. It is related to #1297, I am guessing.
🐛 Bug
We're seeing a non-deterministic error which occurs during a PyTorch Lightning train when we adopt a remote Aim repo for logging, i.e. setting repo="aim://our-aim-server:53800/" when initializing an aim.pytorch_lightning.AimLogger. This only happens when switching from a local AimLogger to a remote repo, with no other changes to the codebase.
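For reference, a minimal sketch of the change described above (the experiment name and trainer arguments are illustrative; the server address is the one from the report):

```python
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

# Local repo: the configuration that trains cleanly with multi-process data loading.
local_logger = AimLogger(experiment="my-experiment")

# Remote repo: the only change made; after this the data loader aborts appear.
remote_logger = AimLogger(
    repo="aim://our-aim-server:53800/",
    experiment="my-experiment",
)

trainer = pl.Trainer(logger=remote_logger, max_epochs=1)
```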
Mitigations
If num_workers on torch data loaders is reduced to 0, the issue does not reproduce (as sketched below), so it seems to be multi-processing related.
It's difficult to see how Aim is involved at all in the data loader pipeline to produce a relationship like this, but this is what we can observe.
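A sketch of that mitigation, assuming an otherwise unchanged training script (the dataset and batch size are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

# num_workers=0 loads batches in the main process, so no worker processes are
# forked and the abort no longer reproduces, at the cost of loading throughput.
train_loader = DataLoader(dataset, batch_size=32, num_workers=0)
```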
Error Detail
During the first epoch we typically see one or more of the following:
Immediately followed by data loader abort stack traces, e.g.:
To reproduce
Unable to provide a minimal reproduction at this stage.
Appreciate this is going to be incredibly difficult to debug! Just hoping someone's seen something like this before.
Expected behavior
Aim logger initialisation should not cause torch data loader deadlocks.
Environment
Python version: 3.8.10
pip version: 23.0
OS: Ubuntu 20.04.5 LTS
Additional context: 2 GPU, DDP, dataloader num_workers=8