Tensorboard logging crashes the trainer #11103

Closed
twaslowski opened this issue Dec 16, 2021 · 10 comments · Fixed by #11105
Labels
bug Something isn't working

Comments

@twaslowski

🐛 Bug

When trying to call trainer.fit() on a model, PyTorch Lightning attempts to log an empty hparams dict using Tensorboard. Down the call stack, this results in Tensorboard logging the following object:

{"hp_metric": -1}

which results in the following error being thrown:

ValueError:
you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

To Reproduce

I ran the boring model on my machine, as can be seen in the following gist:

https://gist.github.com/TobiasWaslowski/3c203ea6430e3a008703df6ff7437575
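For reference, the gist is a variant of the BoringModel along these lines (a paraphrased sketch, not the exact gist contents; note that no hparams are saved, so hparams_initial ends up empty):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, just enough to feed the model."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Note: self.save_hyperparameters() is never called,
        # so self.hparams_initial is an empty dict.
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    # The default logger is TensorBoardLogger, which triggers the hparams logging.
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(model=BoringModel(), train_dataloaders=train_data)


if __name__ == "__main__":
    run()
```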

Expected behavior

I'm assuming that if the hparams are empty, they should simply not be logged.

Environment

  • CUDA:
    • GPU:
    • available: False
    • version: None
  • Packages:
    • numpy: 1.21.4
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0
    • pytorch-lightning: 1.5.5
    • tqdm: 4.62.3
  • System:
    • OS: Darwin
    • architecture:
      • 64bit
    • processor: i386
    • python: 3.8.5
    • version: Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64

Additional context

twaslowski added the bug label Dec 16, 2021
@rohitgr7
Contributor

I ran your example and it's working fine for me. Can you try running it in a shareable session (like Colab) so we can see the exact issue? Because if there are no hyperparams, it doesn't log them:
https://github.com/PyTorchLightning/pytorch-lightning/blob/cec2d7946b9da07289025e27e57597538d2c50ec/pytorch_lightning/trainer/trainer.py#L1225-L1228

@twaslowski
Author

You're right, it does behave as expected on Colab. On my machine, however, it attempts to log the empty hparams object anyway, which causes the crash. As far as I can tell, this happens because there is a check
`if hparams is not None`
in trainer.py:1254, but no check for the object being empty.
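
To illustrate, here is a minimal sketch of that guard and a possible fix (paraphrased; the function names are mine, not the actual trainer.py source):

```python
def log_hyperparams_current(logger, hparams_initial):
    # Paraphrased current behavior: only guards against None,
    # so an empty dict {} is still passed to the logger.
    if hparams_initial is not None:
        logger.log_hyperparams(hparams_initial)


def log_hyperparams_fixed(logger, hparams_initial):
    # Possible fix: a truthiness check also skips an empty dict.
    if hparams_initial:
        logger.log_hyperparams(hparams_initial)
```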

In tensorboard, the following code block (tensorboard.py:202) then causes the issue that actually crashes the application:
```python
if metrics is None:
    if self._default_hp_metric:
        metrics = {"hp_metric": -1}
```

Arguably this could also be a tensorboard issue. From my understanding, the code defines a fallback object if the metrics are None, but then, not even ten lines further down, this fallback object crashes the application.
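
For context, the failing path in log_metrics (tensorboard.py:229-233 in the stacktrace below) looks roughly like this (paraphrased and simplified, not the exact source):

```python
import torch

# Paraphrased and simplified from pytorch_lightning/loggers/tensorboard.py
# (v1.5.x); `experiment` is the underlying SummaryWriter.
def log_metrics(experiment, metrics, step=None):
    for k, v in metrics.items():
        if isinstance(v, torch.Tensor):
            v = v.item()
        if isinstance(v, dict):
            experiment.add_scalars(k, v, step)
        else:
            try:
                # Anything that goes wrong here, including a SummaryWriter
                # that could not be constructed properly, is re-raised as the
                # ValueError seen in the stacktrace below.
                experiment.add_scalar(k, v, step)
            except Exception as ex:
                m = f"you tried to log {v} which is not currently supported. Try a dict or a scalar/tensor."
                raise ValueError(m) from ex
```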

For further debugging purposes, the stacktrace looks like this:

Traceback (most recent call last):
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 2127, in
main()
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 2118, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1427, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1434, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "PATH_TO_PROJECT/src/neuralnet/pytorch/lightning_test.py", line 66, in
run()
File "PATH_TO_PROJECT/src/neuralnet/pytorch/lightning_test.py", line 62, in run
trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 906, in test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _test_impl
results = self._run(model, ckpt_path=self.tested_ckpt_path)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1184, in _run
self._pre_dispatch()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1220, in _pre_dispatch
self._log_hyperparams()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1255, in _log_hyperparams
self.logger.log_hyperparams(hparams_initial)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 176, in log_hyperparams
@rank_zero_only
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 233, in log_metrics
raise ValueError(m) from ex
ValueError:
you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

Is there any reasonable explanation for this?

@rohitgr7
Contributor

Yes, possibly this can be improved a little. But just curious: how are your hparams empty there?

@twaslowski
Author

That's an excellent question. I'm new to PyTorch and Lightning, so I was asking myself the same thing. At first I thought I had probably initialized the model incorrectly, but then using the examples provided here yielded the same error. I can't help but feel like this is probably a version- or distribution-specific issue (as the same code works on Colab).

I wiped my entire venv and reinstalled everything, but unfortunately that hasn't fixed the issue either. For the record, my tensorboard version is 2.7.0 (in addition to all the versions listed above).

Right now I see a couple of solutions, but they all revolve around disabling logging (which I personally don't really need at the moment); see the sketch below. Do you see anything else?
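
For instance (standard Trainer/TensorBoardLogger options; a sketch, not a fix for the underlying bug):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Option 1: disable logging entirely.
trainer = pl.Trainer(logger=False)

# Option 2: keep TensorBoard, but disable the hp_metric placeholder so the
# {"hp_metric": -1} fallback is never created when hparams are empty.
logger = TensorBoardLogger(save_dir="lightning_logs", default_hp_metric=False)
trainer = pl.Trainer(logger=logger)
```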

@rohitgr7
Contributor

Okay, looks like it's triggering the log_hyperparams call, but for me, logging params with -1 didn't raise this error, using the same tensorboard version.

@twaslowski
Author

Oh shit, you're right! I might have misinterpreted the stacktrace. There is another stacktrace that looks as follows:

Traceback (most recent call last):
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 229, in log_metrics
self.experiment.add_scalar(k, v, step)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 43, in experiment
return get_experiment() or DummyExperiment()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in get_experiment
return fn(self)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 173, in experiment
self._experiment = SummaryWriter(log_dir=self.log_dir, **self._kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 220, in init
self._get_file_writer()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 250, in _get_file_writer
self.file_writer = FileWriter(self.log_dir, self.max_queue,
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 60, in init
self.event_writer = EventFileWriter(
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in init
tf.io.gfile.makedirs(logdir)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/tensorboard/lazy.py", line 65, in getattr
return getattr(load_once(self), attr_name)
AttributeError: module 'tensorflow' has no attribute 'io'

The above exception was the direct cause of the following exception:
[the stacktrace posted above]

I figured the true issue was probably the one below it, but it turns out that the 'module tensorflow has no attribute io' error message is better known and better documented. This issue was probably caused by a combination of a bad environment (I blame pip) and tensorboard's handling of the data. I'm not sure yet, but I'll update this thread as I dig deeper into this issue.
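
One way to confirm that the environment, rather than Lightning, is at fault is to exercise SummaryWriter directly, outside the trainer (a minimal check; the log directory is arbitrary):

```python
# If this raises the same "module 'tensorflow' has no attribute 'io'" error,
# the problem lies in the tensorboard installation, not in pytorch-lightning.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tb_sanity_check")
writer.add_scalar("sanity", 1.0, 0)
writer.close()
```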

@awaelchli
Contributor

Interesting, thanks for checking this. What you could do here is install everything again in a fresh environment, see if that fixes it, and if it does, compare the two environments.

@twaslowski
Author

Yeah, so this is the funniest thing ever. Yesterday at some point I just called it a day, and today I re-ran the example from the gist in my initial comment, and it just worked. I didn't change anything about the environment, I didn't even reboot, so I have no idea why it works now. My best guess is that deactivating and reactivating the venv somehow resolved the issue.
I won't actively look into it any further for now, because the project I'm working on is on a bit of a deadline, but I'll get back to this issue if it comes up again.

@rohitgr7
Contributor

Restarting whatever isn't working is always the end-game solution 😂

closing this for now. feel free to reopen if it comes up again :)

@yotaro-shimose

I had the same problem. In my case, the problem was that both tensorboard and tensorboardX were installed in my environment.

After uninstalling tensorboard with the following command, the error went away.

pip uninstall tensorboard

Hope this helps someone. My lightning version is 2.0.1.post0.
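
To check whether both packages are present before uninstalling, a small diagnostic along these lines works:

```python
import importlib.util

# tensorboard and tensorboardX can conflict; if both print True,
# removing one of them (as above) may resolve the error.
for pkg in ("tensorboard", "tensorboardX"):
    print(pkg, "installed:", importlib.util.find_spec(pkg) is not None)
```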
