WandDBLogger ddp race condition #1972

Closed
sshleifer opened this issue May 27, 2020 · 8 comments · Fixed by #2029
Labels: help wanted (Open to be worked on), logger (Related to the Loggers)

@sshleifer
Contributor

sshleifer commented May 27, 2020

When I run pl.Trainer(gpus=2, logger=WandbLogger()), it tries to connect to wandb twice (I think), and then one of those attempts fails.

How to fix?

Original feature: #627

Code

logger = WandbLogger(name=model.output_dir.name)
logger.log_hyperparams(args) # you can see 1 copy of this on wandb.com

checkpoint_callback = ModelCheckpoint(
    filepath=str(model.output_dir / "{epoch}-{val_avg_rouge2:.4f}"), monitor="val_loss", mode="min", save_top_k=1,
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=2,
    max_epochs=2,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
    val_check_interval=0.25,
    logger=logger,
    weights_summary=None,
    use_amp=True
)
trainer = pl.Trainer(**train_params)
trainer.fit(model)

Epic traceback

wandb: Tracking run with wandb version 0.8.36
wandb: Run data is saved locally in wandb/run-20200527_155653-1yd9jb0c
wandb: Syncing run cnn_12_9_no_teacher_mgpu
wandb: ⭐ View project at https://app.wandb.ai/sshleifer/transformers_fork-examples_summarization_bart
wandb: 🚀 View run at https://app.wandb.ai/sshleifer/transformers_fork-examples_summarization_bart/runs/1yd9jb0c
wandb: Run `wandb off` to turn off syncing.

GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1]
Using 16bit precision.
MASTER_ADDR environment variable is not defined. Set as localhost
initializing proc_rank 0 world 2
MASTER_ADDR environment variable is not defined. Set as localhost
initializing proc_rank 1 world 2
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Here's the 2nd connection:

wandb: Tracking run with wandb version 0.8.36
wandb: Run data is saved locally in wandb/run-20200527_155706-1yd9jb0c
wandb: Syncing run cnn_12_9_no_teacher_mgpu
wandb: ⭐ View project at https://app.wandb.ai/sshleifer/transformers_fork-examples_summarization_bart
wandb: 🚀 View run at https://app.wandb.ai/sshleifer/transformers_fork-examples_summarization_bart/runs/1yd9jb0c
wandb: Run `wandb off` to turn off syncing.

Then error.

Traceback (most recent call last):
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 389, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in run_pretrain_routine
    self.configure_checkpoint_callback()
wandb: Waiting for W&B process to finish, PID 8253
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_config.py", line 58, in configure_checkpoint_callback
    "checkpoints"
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'
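
For reference, this error message is what os.path.join produces when one of the later path components is None. The sketch below reproduces it in isolation and shows one possible guard. The helper checkpoint_dir and the directory names are hypothetical, not Lightning's code, and the exact wording of the message varies across Python versions.

```python
import os


def checkpoint_dir(*components):
    """Join path components, skipping any that are None (hypothetical helper)."""
    present = [c for c in components if c is not None]
    return os.path.join(*present, "checkpoints")


# Reproduce the failure: a None component after the first one makes join raise.
error_message = None
try:
    os.path.join("lightning_logs", None, "checkpoints")
except TypeError as err:
    error_message = str(err)

print(error_message)
print(checkpoint_dir("lightning_logs", None, "version_0"))
```

Whether skipping a missing component or failing loudly is the better behavior depends on the caller; the point here is only where the None enters.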

wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Program ended successfully.
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/wandb/apis/file_stream.py", line 213, in _thread_body
    self._send(ready_chunks)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/wandb/apis/file_stream.py", line 249, in _send
    self._client.post, self._endpoint, json={'files': files}))
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/wandb/apis/file_stream.py", line 227, in _handle_response
    raise response
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/wandb/util.py", line 590, in request_with_retry
    response.raise_for_status()
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.wandb.ai/files/sshleifer/transformers_fork-examples_summarization_bart/a199yu5d/file_stream

wandb: Syncing 7 W&B file(s) and 0 media file(s)
wandb: Process crashed early, not syncing files
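
The two "Tracking run" banners above are consistent with each DDP process opening its own connection. A minimal sketch of one common mitigation, a rank-zero guard, is shown here. It assumes the process rank is exposed through a RANK or LOCAL_RANK environment variable, which may not hold for every launch mode, and ExperimentLogger is a hypothetical stand-in rather than the real WandbLogger.

```python
import functools
import os


def current_rank() -> int:
    """Read the process rank from the environment (assumed variables), defaulting to 0."""
    for var in ("RANK", "LOCAL_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return 0


def rank_zero_only(fn):
    """Run fn only in the rank-0 process; return None in every other process."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if current_rank() == 0:
            return fn(*args, **kwargs)
        return None
    return wrapper


class ExperimentLogger:
    """Hypothetical stand-in for a logger that talks to a remote service."""

    def __init__(self):
        self.connections = 0

    @rank_zero_only
    def connect(self):
        # In a real logger this is where the remote run would be created.
        self.connections += 1
        return "connected"
```

With this guard, only the rank-0 process would create the remote run, and the other processes would skip the call.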
@sshleifer
Contributor Author

When are you guys planning the 7.7 release?

@williamFalcon
Contributor

williamFalcon commented Jun 2, 2020

0.8.0 will be released next Friday (June 12). This will be the last release before the stable 1.0.0 release slated for mid-July.

Did this PR solve the problem you were having? Test out HF + PL on master now to make sure it works well. Mainly, we made it so that pickling is no longer an issue :)

This ddp implementation is much more scalable and works much better on single-node instances.

@sshleifer
Contributor Author

sshleifer commented Jun 4, 2020

Testing it now.
FYI

pip install https://github.com/PytorchLightning/pytorch-lightning/archive/master.zip --upgrade

fails with

  AttributeError: type object 'Callable' has no attribute '_abc_registry'
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/shleifer/miniconda3/envs/nb/bin/python /home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-dvi7mzvd/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel Check the logs for full command output.

@sshleifer
Contributor Author

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade

also fails

(nb) ➜  ~ pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade
Collecting git+https://github.com/PytorchLightning/pytorch-lightning.git@master
  Cloning https://github.com/PytorchLightning/pytorch-lightning.git (to revision master) to /tmp/pip-req-build-q98ryeoi
  Running command git clone -q https://github.com/PytorchLightning/pytorch-lightning.git /tmp/pip-req-build-q98ryeoi
  Installing build dependencies ... error
  ERROR: Command errored out with exit status 1:
   command: /home/shleifer/miniconda3/envs/nb/bin/python /home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-le91y8yf/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel
       cwd: None
  Complete output (44 lines):
  Traceback (most recent call last):
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/runpy.py", line 193, in _run_module_as_main
      "__main__", mod_spec)
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/runpy.py", line 85, in _run_code
      exec(code, run_globals)
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/__main__.py", line 26, in <module>
      sys.exit(_main())
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/cli/main.py", line 73, in main
      command = create_command(cmd_name, isolated=("--isolated" in cmd_args))
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/commands/__init__.py", line 104, in create_command
      module = importlib.import_module(module_path)
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/importlib/__init__.py", line 127, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
    File "<frozen importlib._bootstrap>", line 983, in _find_and_load
    File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 728, in exec_module
    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 24, in <module>
      from pip._internal.cli.req_command import RequirementCommand, with_cleanup
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/cli/req_command.py", line 16, in <module>
      from pip._internal.index.package_finder import PackageFinder
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/index/package_finder.py", line 21, in <module>
      from pip._internal.index.collector import parse_links
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_internal/index/collector.py", line 14, in <module>
      from pip._vendor import html5lib, requests
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_vendor/requests/__init__.py", line 114, in <module>
      from . import utils
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_vendor/requests/utils.py", line 25, in <module>
      from . import certs
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_vendor/requests/certs.py", line 15, in <module>
      from pip._vendor.certifi import where
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_vendor/certifi/__init__.py", line 1, in <module>
      from .core import contents, where
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip/_vendor/certifi/core.py", line 12, in <module>
      from importlib.resources import read_text
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/importlib/resources.py", line 11, in <module>
      from typing import Iterable, Iterator, Optional, Set, Union   # noqa: F401
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/typing.py", line 1357, in <module>
      class Callable(extra=collections_abc.Callable, metaclass=CallableMeta):
    File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/typing.py", line 1005, in __new__
      self._abc_registry = extra._abc_registry
  AttributeError: type object 'Callable' has no attribute '_abc_registry'
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/shleifer/miniconda3/envs/nb/bin/python /home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-le91y8yf/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel Check the logs for full command output.

@williamFalcon
Contributor

williamFalcon commented Jun 4, 2020

@sshleifer I can't replicate this; the install works fine.
Maybe reset your environment?

https://colab.research.google.com/drive/1G-UZqDxkORegvy0oXgYU6IENN3Y9PQPz?usp=sharing

Sounds like you have something weird in your env. I googled "type object 'Callable' has no attribute '_abc_registry'" and found:

[image]

FYI: we test installs on every PR for all operating systems and Python versions. I haven't seen a broken install test, so it's likely your env.
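
The pip traceback earlier in the thread shows typing being imported from site-packages/typing.py rather than from the standard library. A rough way to check for that kind of shadowing is sketched below. Both helper names are hypothetical, and the "site-packages" / "dist-packages" test is a heuristic, not an official API.

```python
import importlib.util
import os


def is_in_site_packages(path: str) -> bool:
    """Heuristic: does this file path sit inside a site-packages or dist-packages directory?"""
    parts = os.path.normpath(path).split(os.sep)
    return "site-packages" in parts or "dist-packages" in parts


def stdlib_module_shadowed(module_name: str) -> bool:
    """Return True if the named module would be loaded from site-packages."""
    spec = importlib.util.find_spec(module_name)
    origin = getattr(spec, "origin", None)
    if not origin or origin in ("built-in", "frozen"):
        return False
    return is_in_site_packages(origin)


if __name__ == "__main__":
    # A True result for a standard-library module suggests a stale backport is installed.
    print("typing shadowed:", stdlib_module_shadowed("typing"))
```

If the check reports True for typing, uninstalling the backport package is the usual remedy.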

@sshleifer
Contributor Author

sshleifer commented Jun 4, 2020

I ran pip uninstall typing and then it worked, thx.
Now I think multi-gpu eval is hanging.

I get terminal output

finetune.py: error: unrecognized arguments: --gpus 2
initializing ddp: LOCAL_RANK: 0/1 WORLD_SIZE:2

and my multi_gpu unittest also hangs with a similar message, even though I never pass a --gpus=2 through the command line. I pass n_gpu=2 and https://github.com/huggingface/transformers/blob/master/examples/lightning_base.py#L240 converts that to gpus=n_gpu.

Let me know if anything jumps out there, otherwise I'll debug.

@williamFalcon
Contributor

Ah yes. Lightning now calls the script again, passing in the gpus flag. But if you had called it yourself you would have specified gpus, so it's weird that this is not working. Can you share a colab?
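
To illustrate the "unrecognized arguments: --gpus 2" message: argparse rejects any flag the parser does not declare. The sketch below uses a hypothetical parser that only knows --n_gpu and shows how parse_known_args tolerates the extra flag. Whether to ignore the flag or declare --gpus explicitly is a design choice for the script, not something settled in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that declares --n_gpu but not --gpus."""
    parser = argparse.ArgumentParser(prog="finetune.py")
    parser.add_argument("--n_gpu", type=int, default=1)
    return parser


# Simulate a re-invocation where an extra --gpus flag has been appended.
argv = ["--n_gpu", "2", "--gpus", "2"]

# parse_args(argv) would exit with "unrecognized arguments: --gpus 2";
# parse_known_args returns the leftovers instead of failing.
args, unknown = build_parser().parse_known_args(argv)
print(args.n_gpu, unknown)
```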

@sshleifer
Contributor Author

sshleifer commented Jun 5, 2020

I can't share my code, but I am using SummarizationTrainer, which inherits from BaseTransformer.

There is another bug on master when loading checkpoints, potentially related to upgrading pl mid-run?
It happens at this line: https://github.com/huggingface/transformers/blob/master/examples/summarization/bart/finetune.py#L174

Traceback (most recent call last):
  File "finetune.py", line 731, in <module>
    main(args)
  File "finetune.py", line 702, in main
    model = model.load_from_checkpoint(checkpoints[-1])
  File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1563, in load_from_checkpoint
    checkpoint[CHECKPOINT_KEY_MODULE_ARGS].update(kwargs)
KeyError: 'module_arguments'
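
The KeyError indicates the loaded checkpoint dict has no 'module_arguments' entry. A defensive lookup could look like the sketch below. This is an illustration on a plain dict, not Lightning's implementation; the key name is taken from the traceback and merge_module_args is a hypothetical helper.

```python
CHECKPOINT_KEY_MODULE_ARGS = "module_arguments"  # key name as it appears in the traceback


def merge_module_args(checkpoint: dict, **overrides) -> dict:
    """Merge overrides into the stored module arguments, tolerating a missing key."""
    stored = dict(checkpoint.get(CHECKPOINT_KEY_MODULE_ARGS, {}))
    stored.update(overrides)
    return stored


# A checkpoint without the key no longer raises; it just yields the overrides.
old_style = {"state_dict": {}}
new_style = {"state_dict": {}, CHECKPOINT_KEY_MODULE_ARGS: {"lr": 0.01}}
print(merge_module_args(old_style, lr=0.1))
print(merge_module_args(new_style, batch_size=8))
```

Whether empty defaults are enough to rebuild the model depends on what its constructor requires.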

Borda added the logger (Related to the Loggers) label on Aug 4, 2020