This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Dist tests #3515

Merged
merged 4 commits into allenai:torch-distributed from dist-tests on Dec 12, 2019

Conversation


@DeNeutoy DeNeutoy commented Dec 12, 2019

Add some tests, fix a couple of minor bugs:

  • The include packages argument had the wrong type, so we crashed.
  • torch.multiprocessing.set_start_method needs to be called with a force flag; otherwise, in some environments it will raise an error (a short sketch of this failure mode is shown below).

I tested this on my multi-GPU machine; it will run on CI too.
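A minimal sketch (not the allennlp code) of the failure mode behind the second bullet: once a start method has been set, whether implicitly by the environment or by another library, a second plain call to set_start_method raises RuntimeError, and force=True is what allows overriding it.

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")  # first call succeeds
    try:
        mp.set_start_method("spawn")  # a second call without force raises
    except RuntimeError as err:
        print(err)  # "context has already been set"
    mp.set_start_method("spawn", force=True)  # force=True overrides the existing setting
```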

@DeNeutoy DeNeutoy requested a review from brendan-ai2 December 12, 2019 20:47
@brendan-ai2 brendan-ai2 left a comment

Thanks for these tests, Mark!

allennlp/run.py Outdated
# means that it only runs if allennlp is being run as a binary.
# Without this, if another library that allennlp uses has set the start method,
# this line would crash.
torch.multiprocessing.set_start_method("spawn", force=True)
@brendan-ai2 (Contributor) commented:

Hmm, this is interesting. What are the implications if another library has set a different start method and we force past that? I'm having trouble finding the docs for force. PyTorch seems to import it from Python, which is here: https://github.com/python/cpython/blob/master/Lib/multiprocessing/context.py#L241 But the docs don't mention it and the code doesn't help much either... (The docs at https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_start_method don't mention it.)

@brendan-ai2 (Contributor) commented:

I'm guessing you have some other source that explains why it's necessary?

@DeNeutoy (Contributor, Author) commented:

It appears that setting this is some kind of global Python state, which I don't understand - e.g. running this sequentially in two separate Python interpreters one after another, closing in between, results in it crashing. I'm not super sure about the implications, but using distributed with PyTorch does require us to use this spawn method.

@DeNeutoy (Contributor, Author) commented:

Actually, on my Linux machine, even opening a Python interpreter causes it to already be set.

@brendan-ai2 (Contributor) commented:

Oof, this is nasty. What it sounds like is that some orphaned processes are keeping alive the process that Python creates to spawn the workers (or something in that vein). See https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
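For reference, a minimal sketch of the "contexts" alternative described in the linked docs, which sidesteps the global default entirely; this is presumably also why a later commit below could note "torch mp uses its own context, no need to set default". The worker function and process count here are illustrative only.

```python
import multiprocessing as mp

def _worker(rank: int) -> None:
    print(f"worker {rank} started")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # a local spawn context; the global default is untouched
    procs = [ctx.Process(target=_worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```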

@brendan-ai2 brendan-ai2 left a comment

LGTM, thanks!

@DeNeutoy DeNeutoy merged commit 9831717 into allenai:torch-distributed Dec 12, 2019
@DeNeutoy DeNeutoy deleted the dist-tests branch December 12, 2019 23:04
@DeNeutoy DeNeutoy mentioned this pull request Dec 16, 2019
DeNeutoy added a commit that referenced this pull request Dec 17, 2019
* Logging and metrics changes for distributed training (#3372)

* Refactor logging setup to support distributed attrs

* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard logging practice of a
  `logging.Filter` (a rough sketch follows this commit's summary)
* Introduce `rank` and `world_size` optional attributes to support
distributed workers

* Support for distributed training in `get_metrics`

* Remove bad import

* Fix duplicate log messages in stdout

* Remove preemptive `logging.shutdown`

`logging.shutdown` is called by the logging module
by default during exit, which makes it unnecessary to
call it from `train_model`

* Fix black formatting issues

* Remove `tee_logger` references in API doc

* Set log level from `ALLENNLP_DEBUG` env
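A minimal sketch of the `logging.Filter` pattern mentioned above; the class name, the carriage-return handling, and the rank prefix are illustrative, not the actual allennlp implementation.

```python
import logging

class WorkerLogFilter(logging.Filter):
    """Illustrative filter: normalize carriage returns and tag records with the worker rank."""

    def __init__(self, rank: int = -1) -> None:
        super().__init__()
        self._rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.msg, str):
            record.msg = record.msg.replace("\r", "\n")  # what replace_cr_with_newline used to do
        if self._rank != -1:
            record.msg = f"Rank {self._rank} | {record.msg}"
        return True  # always keep the record; we only rewrite it

handler = logging.StreamHandler()
handler.addFilter(WorkerLogFilter(rank=0))
logging.getLogger().addHandler(handler)
```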

* Changes to `train_model` for distributed training support (#3390)

* High level API changes to support distributed training

* Fix flake8 error

* Fix mypy error

* Add docstring and misc fixes

* Fix flake tests

* `Trainer` changes for distributed training (#3414)

Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance (a rough sketch of this pattern follows the config below)
* `Trainer.__init__` wraps `self.model` in `DistributedDataParallel`
* Logging and metric aggregation were already handled in the previous PRs
* `Vocabulary` creation in the case of distributed training is done before spawning the workers and creating the `Trainer` class

To run distributed training, the trainer needs to have the following flag enabled:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```
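A rough sketch of the spawned-worker pattern described in the bullets above, not the actual `Trainer` code; the backend, address/port, and toy model are placeholder assumptions.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def _train_worker(rank: int, world_size: int) -> None:
    # Each spawned worker joins the process group and builds its own model/Trainer.
    dist.init_process_group(
        backend="gloo",                       # "nccl" would be typical on multi-GPU machines
        init_method="tcp://127.0.0.1:29500",  # illustrative address/port
        rank=rank,
        world_size=world_size,
    )
    model = torch.nn.Linear(4, 2)               # toy stand-in for the real model
    ddp_model = DistributedDataParallel(model)  # gradients are synchronized across workers
    # ... a real worker would construct its Trainer around ddp_model and call train() ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(_train_worker, args=(world_size,), nprocs=world_size)
```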

TODO:
* Try to reproduce comparable results and share extensive results for existing/selected models
* Check if other commands like `evaluate`, `predict`, `fine-tune` work well with the new changes
* Should all the callbacks be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well (i.e., to selectively yield data based on their rank)? A rough sketch follows this list.
* Write tests - _would be happy to get some suggestions on how to write tests for this_
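A hypothetical illustration of the rank-based sharding idea in the third item above (not an existing allennlp reader); the helper name and the round-robin scheme are assumptions.

```python
from typing import Iterable, Iterator

def shard_by_rank(items: Iterable[str], rank: int, world_size: int) -> Iterator[str]:
    """Yield only every world_size-th item, offset by this worker's rank."""
    for i, item in enumerate(items):
        if i % world_size == rank:
            yield item

# With world_size=2: rank 0 sees items 0, 2, 4, ... and rank 1 sees items 1, 3, 5, ...
```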

* Dist tests (#3515)

* add some tests

* another test, fix incorrect type annotations

* torch mp uses its own context, no need to set default

* lint

* strip out old DP stuff, ensure multiple cuda devices raises errors (#3516)

* strip out old DP stuff, ensure multiple cuda devices raises errors

* lint

* remove unused attribute

* remove _cuda_devices everywhere

* fixes

* move distributed config up to top level

* lint

* clean up

* rename occurrences of batch_group

* remove hack from find_learning_rate

* fix last tests

* black

* use a top level distributed config

* correct error for int

* change up parse_cuda_devices to raise good error and be strongly typed

* fix merge