Tensorboard logging crashes the trainer #11103

Closed
twaslowski opened this issue Dec 16, 2021 · 10 comments · Fixed by #11105
Labels
bug Something isn't working

Comments

@twaslowski

🐛 Bug

When trying to call trainer.fit() on a model, PyTorch Lightning attempts to log an empty hparams dict using Tensorboard. Down the call stack, this results in Tensorboard logging the following object:

{"hp_metric": -1}

which results in the following error being thrown:

ValueError:
you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

To Reproduce

I ran the boring model on my machine, as can be seen in the following gist:

https://gist.github.com/TobiasWaslowski/3c203ea6430e3a008703df6ff7437575
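For reference, the gist is a variant of the BoringModel along these lines (a paraphrased sketch, not the exact gist contents; note that no hparams are saved, so hparams_initial ends up empty):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, just enough to feed the model."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Note: self.save_hyperparameters() is never called,
        # so self.hparams_initial is an empty dict.
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    # The default logger is TensorBoardLogger, which triggers the hparams logging.
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(model=BoringModel(), train_dataloaders=train_data)


if __name__ == "__main__":
    run()
```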

Expected behavior

I'm assuming that if the hparams are empty, they should simply not be logged.

Environment

  • CUDA:
    • GPU:
    • available: False
    • version: None
  • Packages:
    • numpy: 1.21.4
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0
    • pytorch-lightning: 1.5.5
    • tqdm: 4.62.3
  • System:
    • OS: Darwin
    • architecture:
      • 64bit
    • processor: i386
    • python: 3.8.5
    • version: Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64

Additional context

twaslowski added the bug label Dec 16, 2021
@rohitgr7
Contributor

I ran your example and it's working fine for me. Can you try running it in a shareable session (like Colab) so we can see the exact issue? Because if there are no hyperparams, it doesn't log them:
https://github.com/PyTorchLightning/pytorch-lightning/blob/cec2d7946b9da07289025e27e57597538d2c50ec/pytorch_lightning/trainer/trainer.py#L1225-L1228

@twaslowski
Author

You're right, it does behave as expected on Colab. On my machine, however, it attempts to log the empty hparams object anyway, which causes the crash. As far as I can tell, this happens because there is a check
`if hparams is not None`
in trainer.py:1254, but no check for the object being empty.
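
To illustrate, here is a minimal sketch of that guard and a possible fix (paraphrased; the function names are mine, not the actual trainer.py source):

```python
def log_hyperparams_current(logger, hparams_initial):
    # Paraphrased current behavior: only guards against None,
    # so an empty dict {} is still passed to the logger.
    if hparams_initial is not None:
        logger.log_hyperparams(hparams_initial)


def log_hyperparams_fixed(logger, hparams_initial):
    # Possible fix: a truthiness check also skips an empty dict.
    if hparams_initial:
        logger.log_hyperparams(hparams_initial)
```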

In tensorboard, the following code block (tensorboard.py:202) then causes the issue that actually crashes the application:
```python
if metrics is None:
    if self._default_hp_metric:
        metrics = {"hp_metric": -1}
```

Arguably this could also be a tensorboard issue. From my understanding, the code defines a fallback object if the metrics are None, but then, not even ten lines further down, this fallback object crashes the application.
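
For context, the failing path in log_metrics (tensorboard.py:229-233 in the stacktrace below) looks roughly like this (paraphrased and simplified, not the exact source):

```python
import torch

# Paraphrased and simplified from pytorch_lightning/loggers/tensorboard.py
# (v1.5.x); `experiment` is the underlying SummaryWriter.
def log_metrics(experiment, metrics, step=None):
    for k, v in metrics.items():
        if isinstance(v, torch.Tensor):
            v = v.item()
        if isinstance(v, dict):
            experiment.add_scalars(k, v, step)
        else:
            try:
                # Anything that goes wrong here, including a SummaryWriter
                # that could not be constructed properly, is re-raised as the
                # ValueError seen in the stacktrace below.
                experiment.add_scalar(k, v, step)
            except Exception as ex:
                m = f"you tried to log {v} which is not currently supported. Try a dict or a scalar/tensor."
                raise ValueError(m) from ex
```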

For further debugging purposes, the stacktrace looks like this:

Traceback (most recent call last):
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 2127, in
main()
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 2118, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1427, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1434, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/Development/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "PATH_TO_PROJECT/src/neuralnet/pytorch/lightning_test.py", line 66, in
run()
File "PATH_TO_PROJECT/src/neuralnet/pytorch/lightning_test.py", line 62, in run
trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 906, in test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _test_impl
results = self._run(model, ckpt_path=self.tested_ckpt_path)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1184, in _run
self._pre_dispatch()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1220, in _pre_dispatch
self._log_hyperparams()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1255, in _log_hyperparams
self.logger.log_hyperparams(hparams_initial)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 176, in log_hyperparams
@rank_zero_only
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 233, in log_metrics
raise ValueError(m) from ex
ValueError:
you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

Is there any reasonable explanation for this?

@rohitgr7
Contributor

Yes, possibly this can be improved a little. But just curious: how are your hparams empty there?

@twaslowski
Author

That's an excellent question. I'm new to PyTorch and Lightning, so I was asking myself the same thing. At first I thought I had probably initialized the model incorrectly, but then using the examples provided here yielded the same error. I can't help but feel like this is probably a version- or distribution-specific issue (as the same code works on Colab).

I wiped my entire venv and reinstalled everything, but unfortunately that hasn't fixed the issue either. For the record, my tensorboard version is 2.7.0 (in addition to all the versions listed above).

Right now I see a couple of solutions, but they all revolve around disabling logging (which I personally don't really need at the moment); see the sketch below. Do you see anything else?
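
For instance (standard Trainer/TensorBoardLogger options; a sketch, not a fix for the underlying bug):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Option 1: disable logging entirely.
trainer = pl.Trainer(logger=False)

# Option 2: keep TensorBoard, but disable the hp_metric placeholder so the
# {"hp_metric": -1} fallback is never created when hparams are empty.
logger = TensorBoardLogger(save_dir="lightning_logs", default_hp_metric=False)
trainer = pl.Trainer(logger=logger)
```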

@rohitgr7
Contributor

Okay, looks like it's triggering the log_hyperparams call, but for me, logging params with -1 didn't raise this error, using the same tensorboard version.

@twaslowski
Author

Oh shit, you're right! I might have misinterpreted the stacktrace. There is another stacktrace that looks as follows:

Traceback (most recent call last):
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 229, in log_metrics
self.experiment.add_scalar(k, v, step)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 43, in experiment
return get_experiment() or DummyExperiment()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
return fn(*args, **kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in get_experiment
return fn(self)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 173, in experiment
self._experiment = SummaryWriter(log_dir=self.log_dir, **self._kwargs)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 220, in init
self._get_file_writer()
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 250, in _get_file_writer
self.file_writer = FileWriter(self.log_dir, self.max_queue,
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 60, in init
self.event_writer = EventFileWriter(
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in init
tf.io.gfile.makedirs(logdir)
File "PATH_TO_PROJECT/src/venv/lib/python3.8/site-packages/tensorboard/lazy.py", line 65, in getattr
return getattr(load_once(self), attr_name)
AttributeError: module 'tensorflow' has no attribute 'io'

The above exception was the direct cause of the following exception:
[the stacktrace posted above]

I figured the true issue was probably the one below it, but it turns out that the 'module tensorflow has no attribute io' error message is better known and better documented. This issue was probably caused by a combination of a bad environment (I blame pip) and tensorboard's handling of the data. I'm not sure yet, but I'll update this thread as I dig deeper into this issue.
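
One way to confirm that the environment, rather than Lightning, is at fault is to exercise SummaryWriter directly, outside the trainer (a minimal check; the log directory is arbitrary):

```python
# If this raises the same "module 'tensorflow' has no attribute 'io'" error,
# the problem lies in the tensorboard installation, not in pytorch-lightning.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tb_sanity_check")
writer.add_scalar("sanity", 1.0, 0)
writer.close()
```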

@awaelchli
Contributor

Interesting, thanks for checking this. What you could do here is install everything again in a fresh environment, see if that fixes it, and if it does, compare the two environments.

@twaslowski
Author

Yeah, so this is the funniest thing ever. Yesterday at some point I just called it a day, and today I re-ran the example from the gist in my initial comment, and it just worked. I didn't change anything about the environment, I didn't even reboot, so I have no idea why it works now. My best guess is that deactivating and reactivating the venv somehow resolved the issue.
I won't actively look into it any further for now, because the project I'm working on is on a bit of a deadline, but I'll get back to this issue if it comes up again.

@rohitgr7
Contributor

Restarting whatever isn't working is always the end-game solution 😂

closing this for now. feel free to reopen if it comes up again :)

@yotaro-shimose

I had the same problem. In my case, the problem was that both tensorboard and tensorboardX were installed in my environment.

After uninstalling tensorboard with the following command, the error went away.

pip uninstall tensorboard

Hope this helps someone. My lightning version is 2.0.1.post0.
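
To check whether both packages are present before uninstalling, a small diagnostic along these lines works:

```python
import importlib.util

# tensorboard and tensorboardX can conflict; if both print True,
# removing one of them (as above) may resolve the error.
for pkg in ("tensorboard", "tensorboardX"):
    print(pkg, "installed:", importlib.util.find_spec(pkg) is not None)
```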
