Tensorboard logging crashes the trainer #11103
Comments
I ran your example and it's working fine for me. Can you try running it in a shared session (like Colab) so we can see the exact issue? If there are no hyperparams, it doesn't log.
You're right, it does behave as expected on Colab. On my machine, however, it attempts to log the empty hparams object anyway, which causes the crash. As far as I can tell, this happens because there is a check. In TensorBoard, the following code block (tensorboard.py:202) then causes the issue that actually crashes the application: Arguably, this could also be a TensorBoard issue. From my understanding, the code defines a fallback object if the metrics are None, but then, not even ten lines further down, this fallback object crashes the application. For further debugging purposes, the stack trace looks like this: Traceback (most recent call last): Is there any reasonable explanation for this?
Yes, possibly this can be improved a little. But just curious, how is your
That's an excellent question. I'm new to PyTorch and Lightning, so I was asking myself the same thing. At first I thought I had probably initialized the model incorrectly, but running the examples provided here yielded the same error. I can't help but feel this is a version- or distribution-specific issue (since the same code works on Colab). I wiped my entire venv and reinstalled everything, but unfortunately that hasn't fixed the issue either. For the record, my TensorBoard version is 2.7.0 (in addition to all the versions listed above). Right now I see a couple of possible workarounds, but they all revolve around disabling logging (which I personally don't really need at the moment). Do you see anything else?
Okay, looks like it's triggering the
Oh shit, you're right! I might have misread the stack trace. There is another stack trace that looks as follows: Traceback (most recent call last): The above exception was the direct cause of the following exception: I figured the true issue was probably the one below that, but it turns out that the 'module tensorflow has no attribute io' error message is better known and well documented. This issue was probably caused by a combination of a bad environment (I blame pip) and TensorBoard's handling of the data. I'm not sure yet, but I'll update this thread as I dig deeper.
Interesting, thanks for checking this. One option is to install again in a fresh environment, see whether that fixes it, and if it does, compare the two environments.
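Comparing the two environments can be done by eye with `pip freeze`, but the comparison can also be scripted. The sketch below assumes each environment's packages are captured as a name-to-version mapping using the standard library's `importlib.metadata`; the helper names `snapshot` and `diff_envs` are hypothetical and not part of Lightning or TensorBoard.

```python
from importlib.metadata import distributions
from typing import Dict, Optional, Tuple


def snapshot() -> Dict[str, str]:
    """Map each installed distribution's lower-cased name to its version string."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken entries that carry no name
            packages[name.lower()] = dist.version
    return packages


def diff_envs(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {package: (old_version, new_version)} for every package that differs.

    None on either side means the package is absent from that environment.
    """
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

In practice, one would run `snapshot()` inside each venv, save the result (for example with `json.dump`), and pass the two loaded dicts to `diff_envs`; an entry like `(None, "1.0")` marks a package present only in the second environment.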
Yeah, so this is the funniest thing ever. Yesterday I eventually called it a day, and today I re-ran the example from the gist in my initial comment, and it just worked. I didn't change anything about the environment, I didn't even reboot, so I have no idea why it works now. My best guess is that deactivating and reactivating the venv somehow solved the issue.
Restarting what's not working is always the end-game solution 😂 Closing this for now. Feel free to reopen if it comes up again :)
I had the same problem. In my case, both tensorboard and tensorboardX were installed in my environment. After uninstalling tensorboard with the following command, the error went away.
Hope this helps someone. My lightning version is 2.0.1.post0.
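To check whether an environment has both tensorboard and tensorboardX installed, as in the comment above, one can ask the interpreter which of the two packages it can find. This is a minimal sketch using only `importlib.util.find_spec` from the standard library; the helper name `installed_tensorboard_packages` is hypothetical, and the check only reports what is present, it does not decide which package to remove.

```python
import importlib.util
from typing import Iterable, List


def installed_tensorboard_packages(
    candidates: Iterable[str] = ("tensorboard", "tensorboardX"),
) -> List[str]:
    """Return the subset of `candidates` that the current interpreter can import."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    found = installed_tensorboard_packages()
    if len(found) > 1:
        print(f"Both {' and '.join(found)} are installed; they may conflict.")
    else:
        print(f"Found: {found or 'none'}")
```

`find_spec` locates a top-level module without importing it, so the check stays cheap even if one of the packages is broken.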
🐛 Bug
When calling trainer.fit() on a model, PyTorch Lightning attempts to log an empty hparams dict using TensorBoard. Further down the call stack, this results in TensorBoard logging the following object:
{hp_metric:-1}
which results in the following error being thrown:
ValueError: you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.
To Reproduce
I ran the boring model on my machine, as can be seen in the following gist:
https://gist.github.com/TobiasWaslowski/3c203ea6430e3a008703df6ff7437575
Expected behavior
I'm assuming that if the hparams are empty, they should just not get logged.
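The expected behavior can be sketched as a guard that skips the hyperparameter call entirely when there is nothing to log. This is not Lightning's or TensorBoard's actual code: `log_hparams_if_any` and its `log_fn` callback are hypothetical stand-ins for whatever logger method would receive the hyperparameters.

```python
from typing import Any, Callable, Mapping, Optional


def log_hparams_if_any(
    hparams: Optional[Mapping[str, Any]],
    log_fn: Callable[[Mapping[str, Any]], None],
) -> bool:
    """Call log_fn(hparams) only when hparams is non-empty.

    Returns True if something was logged, False if the call was skipped.
    """
    if not hparams:  # covers both None and an empty mapping
        return False
    log_fn(dict(hparams))
    return True


# Usage: a plain list stands in for a real logger.
logged = []
log_hparams_if_any({}, logged.append)             # skipped, nothing logged
log_hparams_if_any({"lr": 1e-3}, logged.append)   # logged
print(logged)  # [{'lr': 0.001}]
```

Skipping the call, rather than substituting a placeholder such as `{hp_metric: -1}`, means no fallback value ever reaches the logger.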
Environment
Additional context