Conversation
* Refactor logging setup to support distributed attrs
* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard logging practice of using `logging.Filter`
* Introduce `rank` and `world_size` optional attributes to support distributed workers
* Support for distributed training in `get_metrics`
* Remove bad import
* Fix duplicate log messages in stdout
* Remove preemptive `logging.shutdown`; the logging module calls `logging.shutdown` by default during exit, which makes it unnecessary to call from `train_model`
* Fix black formatting issues
* Remove `tee_logger` references in API doc
* Set log level from `ALLENNLP_DEBUG` env
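As a rough illustration of the logging changes listed above (the helper name `prepare_worker_logging` and the `CarriageReturnFilter` class are hypothetical, not the PR's actual API): a `logging.Filter` takes over the role of the old `replace_cr_with_newline`, the formatter carries the worker's `rank`/`world_size`, and the level is driven by `ALLENNLP_DEBUG`.

```python
import logging
import os
import sys


class CarriageReturnFilter(logging.Filter):
    """Replace carriage returns in log messages so progress-bar style output
    does not mangle log files (the role the removed replace_cr_with_newline played)."""

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.msg, str):
            record.msg = record.msg.replace("\r", "\n")
        return True  # keep the record


def prepare_worker_logging(rank: int = 0, world_size: int = 1) -> None:
    # Log level comes from the ALLENNLP_DEBUG environment variable.
    level = logging.DEBUG if os.environ.get("ALLENNLP_DEBUG") else logging.INFO

    handler = logging.StreamHandler(sys.stdout)
    handler.addFilter(CarriageReturnFilter())
    # Tag every record with the worker's rank so interleaved stdout is readable.
    handler.setFormatter(
        logging.Formatter(
            f"[rank {rank}/{world_size}] %(asctime)s - %(levelname)s - %(name)s - %(message)s"
        )
    )

    root = logging.getLogger()
    root.handlers = [handler]  # a single handler avoids duplicate stdout messages
    root.setLevel(level)
    # No explicit cleanup is needed: logging.shutdown() is registered by the stdlib at exit.
```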
* High level API changes to support distributed training
* Fix flake8 error
* Fix mypy error
* Add docstring and misc fixes
* Fix flake tests
Followup PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance (a minimal worker sketch follows after the TODO list below)
* `Trainer.__init__` wraps up `self.model` with `DistributedDataParallel`
* Logging and metric aggregation were already handled in the previous PRs
* `Vocabulary` creation in the distributed case is done before spawning the workers and creating the `Trainer`

To run distributed training, the trainer needs the following flag enabled:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```

TODO:

* Try to reproduce comparable results and share extensive results for existing/selected models
* Check whether other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
* Should all the callbacks be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well (to selectively yield data based on their rank)?
* Write tests - _would be happy to get some suggestions on how to write tests for this_
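For readers skimming the conversation, here is a minimal sketch of the worker layout described above, using only plain PyTorch names; `_train_worker`, the TCP init address, and the `Linear` stand-in model are illustrative and not the actual AllenNLP training code.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def _train_worker(rank: int, world_size: int) -> None:
    # Each spawned worker joins the process group before building its trainer.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)

    # Stand-in for the real model; in the PR this wrapping happens inside
    # Trainer.__init__, which wraps self.model with DistributedDataParallel.
    model = torch.nn.Linear(10, 2).cuda(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])

    # ... build the data iterator and optimizer, then run the training loop
    #     against ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # The vocabulary would be built here, in the parent process, before spawning;
    # mp.spawn then launches one worker per GPU, mirroring the PR.
    mp.spawn(_train_worker, args=(world_size,), nprocs=world_size)
```

For the dataset-reader item in the TODO list, one straightforward option would be for each reader to skip instances whose index modulo `world_size` does not equal its `rank`, though that is just one possible sharding scheme.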
* add some tests
* another test, fix incorrect type annotations
* torch mp uses its own context, no need to set default
* lint
* strip out old DP stuff, ensure multiple cuda devices raises errors
* lint
* remove unused attribute
* remove _cuda_devices everywhere
* fixes
* move distributed config up to top level
* lint
* clean up
* rename occurrences of batch_group
* remove hack from find_learning_rate
* fix last tests
* black
* use a top level distributed config
* correct error for int
* change up parse_cuda_devices to raise good error and be strongly typed
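A rough sketch of the behaviour the first and last commits above describe, assuming a stand-in `ConfigurationError`; this is illustrative only, not the actual AllenNLP implementation: device parsing returns a single typed `int`, and a list of devices raises a clear error pointing at the new top-level `distributed` config now that the old DataParallel path is gone.

```python
from typing import List, Union


class ConfigurationError(Exception):
    """Stand-in for AllenNLP's configuration error type."""


def parse_cuda_devices(cuda_device: Union[int, str, List[int]]) -> int:
    # Lists of devices are no longer accepted; users are pointed at the
    # top-level "distributed" configuration instead.
    if isinstance(cuda_device, list):
        raise ConfigurationError(
            "Multiple cuda devices are no longer supported here; "
            "enable the top-level 'distributed' config instead."
        )
    try:
        return int(cuda_device)
    except (TypeError, ValueError):
        raise ConfigurationError(f"Expected cuda_device to be an int, got {cuda_device!r}")
```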
@matt-gardner I assigned you just in case you had any Qs/red flags.
Looks like a clean merge of all the PRs from torch-distributed, so this is an easy LGTM from me. :)
If you two are happy with it, no concerns from me. It'd be nice to see a brief writeup somewhere with what's new and different here (maybe in release notes?)
Yep, will do 👍
🚀
Hey @DeNeutoy, thanks for taking this up! Just adding a word of caution since I noticed that …
@scarecrow1123 we dug into this a bit and it's not clear that this is necessary. In particular, we're using …
@brendan-ai2 You're spot on with …
Thanks for the extra info, @scarecrow1123! Fortunately we're planning on ripping out the old …
Includes the following PRs from the torch-distributed branch:
* #3516
* #3515
* #3414
* #3390
* #3372