Multi-GPU error related to process initialization? #4289
Hi @mateuszpieniak, sorry that our distributed training mechanism is a little opaque at the moment. We have plans to add a tutorial soon. The issue here is with your configuration file. You shouldn't set "distributed", "cuda_device", or "world_size" in the "trainer" part of your config. Instead, you should just specify distributed training at the top level of your config like this: https://github.com/allenai/allennlp-models/blob/transformer-qa-training/training_config/rc/transformer_qa_distributed.jsonnet#L43
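For concreteness, a minimal sketch of that layout follows. The reader, model, paths, and hyperparameters are placeholders, and the keys are assumed to match the linked transformer_qa_distributed.jsonnet; the only point is where the "distributed" block goes.

```jsonnet
// Sketch only: placeholder model/reader/paths; assumes the layout of the
// linked transformer_qa_distributed.jsonnet config.
{
  "dataset_reader": { "type": "my_reader" },   // hypothetical reader
  "train_data_path": "/path/to/train.json",
  "validation_data_path": "/path/to/dev.json",
  "model": { "type": "my_model" },             // hypothetical model
  "data_loader": { "batch_size": 8 },
  "trainer": {
    "optimizer": { "type": "adam", "lr": 1e-5 },
    "num_epochs": 3
    // no "distributed", "cuda_device", or "world_size" in here
  },
  // distributed training is requested at the top level instead:
  "distributed": {
    "cuda_devices": [0, 1]
  }
}
```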
@epwalsh That's fine, thank you! I have a question though. Should …
Yes, …
I think we can close this, but if you have any other issues feel free to reach out! |
@epwalsh Just a quick follow up
Updated to …
Since I mostly fine-tune the model, is it possible to "disable vocab building" to speed things up (just for now)?
Hi @mateuszpieniak, would you mind making a PR to add to the docs to address # 3? As for # 4, the vocab is created from all of the instances by the main process before any of the workers are spawned. See here: https://github.com/allenai/allennlp/blob/master/allennlp/commands/train.py#L271 The vocab is then saved to the serialization directory, and then the "vocab" params are modified so that each spawned worker just reads that vocab from the saved files: https://github.com/allenai/allennlp/blob/master/allennlp/commands/train.py#L274
Let me know if that doesn't make sense, I'm pretty new to the distributed code. |
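For reference, here is a rough sketch of what the rewritten "vocabulary" params amount to for each worker. The "from_files" type and the directory path are illustrative assumptions rather than a quote from the code:

```jsonnet
// Illustrative only: after the main process builds and saves the vocab,
// each worker effectively trains with a "vocabulary" section like this,
// reading the already-built vocab from disk instead of rebuilding it.
{
  "vocabulary": {
    "type": "from_files",
    "directory": "/path/to/serialization_dir/vocabulary"   // placeholder path
  }
}
```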
@epwalsh Sure, I will do a PR. I think it makes sense if the training takes place on a single machine. Otherwise, the vocabulary should be sent over the network to the workers, because they cannot read the vocab files saved on the main machine.

Btw, how does gradient accumulation work for distributed training? Is it the number of steps per worker before the gradients are sent to the master? Let's consider an example with 4 GPUs with …
You're right, this won't work over the network. Currently our distributed training will only work on a single machine (so, not exactly distributed, but in theory it's more efficient than the old multi-GPU training that used …).

And yes, the …
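As a hedged illustration (with assumed numbers, not taken from this thread) of how per-worker batch size, accumulation steps, and worker count typically combine under data-parallel training:

```jsonnet
// Assumed example values. Each worker processes num_gradient_accumulation_steps
// batches before the optimizer steps; since gradients are averaged across
// workers, one optimizer update effectively covers
// batch_size * num_gradient_accumulation_steps * num_workers examples.
{
  "data_loader": { "batch_size": 8 },                   // per-worker batch size
  "trainer": { "num_gradient_accumulation_steps": 2 },
  "distributed": { "cuda_devices": [0, 1, 2, 3] }       // 4 workers
  // effective examples per optimizer step: 8 * 2 * 4 = 64
}
```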
System (please complete the following information):
Question
I tried to run some model using distributed computation:
but I failed 😢
I suspect that the work is ongoing (I use v1.0.0rc4), but clearly I need to somehow initialize a process group like below, which is either missing in the code or it's me who is missing something?

`dist.init_process_group("gloo", rank=rank, world_size=world_size)`