# TemporalFusionTransformer predicting with mode "raw" results in RuntimeError: Sizes of tensors must match except in dimension 0 #449

## Comments
Facing a similar issue with another dataset. Any help would be appreciated!

---
Could you try with the newest release 0.8.5?

---
@jdb78 I tried 0.8.5. Unfortunately I run into a new issue. The following gets triggered as soon as you start training a TFT model:

```
AttributeError: 'Ranger' object has no attribute 'radam_buffer'
```

---
This is very strange, because it is defined in the `__init__` of the Ranger optimizer.

---
Yes, that's what I saw as well. Very strange indeed. I'm running this on an EC2 instance with 4 GPUs. Could it be that some process tries to perform an optimizer step before it gets properly initialized?

---
Ok, from a script it is a bit surprising. Notebooks sometimes introduce weird issues - particularly with multi-processing. I assume it executes well on a single GPU? Lightning wraps the optimizer, so maybe that is the issue. Can you check through the debugger that the wrapped optimizer has a `radam_buffer` attribute?

---
Sorry for the late reply, only had time to look into this just now. Yes, using a single GPU I'm not getting this error. When I'm using 4 GPUs, it breaks though.

```python
>>> testopt = tft.optimizers()
>>> testopt
LightningRanger(groups=[{'N_sma_threshhold': 5, 'alpha': 0.5, 'betas': (0.95, 0.999), 'eps': 1e-05, 'k': 6, 'lr': 0.03, 'step_counter': 0, 'weight_decay': 0.0}])
>>> testopt._optimizer
Ranger (
Parameter Group 0
N_sma_threshhold: 5
alpha: 0.5
betas: (0.95, 0.999)
eps: 1e-05
k: 6
lr: 0.03
step_counter: 0
weight_decay: 0.0
)
>>> testopt._optimizer.radam_buffer
[[None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None], [None, None, None]]
>>> testopt._optimizer.alpha
0.5
```

Everything here looks as you'd expect, I think. However, inspecting `tft.optimizers()` elsewhere results in the following:

```python
>>> tft.optimizers()
Traceback (most recent call last):
File "/home/ubuntu/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/home/ubuntu/.local/share/virtualenvs/tmp-XVr6zr33/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 114, in optimizers
opts = list(self.trainer.lightning_optimizers.values())
AttributeError: 'NoneType' object has no attribute 'lightning_optimizers'
```

I'm not sure if this is expected? Anyway, even though everything at the breakpoint looks as expected, continuing the script from there still results in the error:

```
AttributeError: 'Ranger' object has no attribute 'radam_buffer'
```

---
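One plausible mechanism behind the multi-GPU failure (an editorial assumption, not something confirmed in this thread): spawning worker processes pickles the optimizer, and `torch.optim.Optimizer.__getstate__` only preserves `defaults`, `state`, and `param_groups`, so any attribute an optimizer sets directly in `__init__` (as Ranger does with `radam_buffer`) is silently dropped in the child process. A minimal sketch of that failure mode:

```python
import pickle

import torch

class BufferedSGD(torch.optim.SGD):
    """Toy optimizer that, like Ranger, stores a buffer as a plain attribute."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, lr=lr)
        self.radam_buffer = [[None, None, None] for _ in range(10)]

params = [torch.nn.Parameter(torch.zeros(1))]
opt = BufferedSGD(params)

# Pickle round-trip, similar to what multiprocessing spawn does:
opt2 = pickle.loads(pickle.dumps(opt))

print(hasattr(opt, "radam_buffer"))   # True
print(hasattr(opt2, "radam_buffer"))  # False: __getstate__ keeps only
                                      # defaults/state/param_groups
```

---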
Just as a test, could you try another optimizer, such as Adam? Simply pass `optimizer="adam"` to the network. This is really strange, as others did not experience the issue. I wonder if something has gone wrong in the installation process.

---
I faced the same problem.

---
@jdb78 A different optimizer works fine. I have installed a clean environment on several VMs now, and I run into the same error as above every time when using the default Ranger optimizer.

---
Hi @TomSteenbergen and @jdb78! I'm facing the same issue while training a TFT model and then running `predict` with `mode="raw"`. The error message is the `RuntimeError: Sizes of tensors must match except in dimension 0` from the title. Notably, if I call `predict` without `mode="raw"`, it works fine. I tried several setups. Did you find a setup (versions of pytorch-forecasting, sklearn, etc.) that allows solving this issue?

---
I have exactly the same issue. But commenting out `mode="raw"` is not an option: otherwise `best_tft.plot_prediction(x, raw_prediction, idx=0)` raises the error `too many indices for tensor of dimension 2`. So any help will be appreciated!

---
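For reference, this is the usage pattern in question, a sketch following the Stallion tutorial (`best_tft` and `val_dataloader` are placeholders from that tutorial): the plotting helper needs the raw network output, which is why `mode="raw"` cannot simply be dropped.

```python
# Raw output (a dict of tensors, including attention) is required for plotting:
raw_predictions, x = best_tft.predict(val_dataloader, mode="raw", return_x=True)
best_tft.plot_prediction(x, raw_predictions, idx=0)
```

---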
Hi, I also get the same errors as above. Has any of this been solved yet?

---
Any solution on this? I get the same thing...

---
It seems to be related to the concatenation of the attention tensor (n_batches x n_decoder_steps (that attend) x n_attention_heads x n_timesteps (to which is attended)). Seems like there is an issue with the concatenation logic. Will work on a fix.

---
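To illustrate the failure mode (an editorial sketch, not the library's actual fix): when encoder lengths differ across batches, the last dimension of the attention tensors differs, so `torch.cat` along the batch dimension fails; padding to a common length first, in the spirit of the `padded_stack` helper mentioned in the issue description, avoids this.

```python
import torch
import torch.nn.functional as F

def padded_cat(tensors, fill_value=float("nan")):
    """Concatenate along dim 0, right-padding the last dimension to a common size."""
    max_len = max(t.size(-1) for t in tensors)
    padded = [
        F.pad(t, (0, max_len - t.size(-1)), value=fill_value)  # pad last dim only
        for t in tensors
    ]
    return torch.cat(padded, dim=0)

# Two batches whose samples attend to a different number of timesteps:
a = torch.rand(32, 6, 4, 24)  # n_batches x n_decoder_steps x n_heads x n_timesteps
b = torch.rand(32, 6, 4, 30)
# torch.cat([a, b], dim=0)    # RuntimeError: Sizes of tensors must match ...
out = padded_cat([a, b])      # works: shape (64, 6, 4, 30)
```

---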
@jdb78 I am still getting this issue when retrying a failed run that got killed. So I am loading in the state_dict, and since it can't find that variable, it throws a KeyError. Any ideas on why that would be happening or how to debug?

---
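A generic way to see which keys actually mismatch when resuming (an editorial debugging sketch: the checkpoint path and the `tft` model are placeholders, and a Lightning checkpoint is assumed to store its weights under the `"state_dict"` key):

```python
import torch

ckpt = torch.load("some_checkpoint.ckpt", map_location="cpu")

model_keys = set(tft.state_dict().keys())
ckpt_keys = set(ckpt["state_dict"].keys())

print("missing from checkpoint:", sorted(model_keys - ckpt_keys))
print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys))
```

---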
Same issue here running the https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html tutorial.

---
Pass `optimizer="adam"` while creating the TFT model. It works with that.

---
@akshat-chandna Where exactly do you pass the optimizer in the Stallion code?

---
```python
tft = TemporalFusionTransformer.from_dataset(..., optimizer="adam")
```

---
I have the exact same issue when running the stallion tutorial code.

---
I solved the problem by doing what akshat-chandna suggested: adding `optimizer="adam"` into the `from_dataset` function call. You can try it.

---
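For anyone following along, a fuller sketch of this workaround; the hyperparameter values are illustrative (loosely following the Stallion tutorial, where `training` is the `TimeSeriesDataSet` built earlier), not prescriptive:

```python
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

tft = TemporalFusionTransformer.from_dataset(
    training,                 # TimeSeriesDataSet built earlier in the tutorial
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,            # one output per predicted quantile
    loss=QuantileLoss(),
    optimizer="adam",         # workaround: avoid the default Ranger optimizer
)
```

---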
```
---> 1 res = trainer.tuner.lr_find(

File /opt/homebrew/Caskroom/miniforge/base/envs/aaa/lib/python3.9/site-packages/pytorch_lightning/tuner/tuning.py:267, in Tuner.lr_find(self, model, train_dataloaders, val_dataloaders, dataloaders, datamodule, method, min_lr, max_lr, num_training, mode, early_stop_threshold, update_attr)

File /opt/homebrew/Caskroom/miniforge/base/envs/aaa/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:608, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

---
Not sure if it's the same issue, but we're facing the same problem when running the latest version of the package. What we noticed is that if you set ...

---
**Expected behavior**

In order to generate the interpretation plots of a Temporal Fusion Transformer model, I first try to generate the raw predictions using `predict` with `mode="raw"`.

**Actual behavior**

The `predict` method however raises an error when using `mode="raw"`:

```
RuntimeError: Sizes of tensors must match except in dimension 0
```

Note that I have used the same dataloader earlier in a `predict` call without using `mode="raw"`, and that works perfectly fine.

The same error was raised someplace else, as documented in #85, and that was fixed using https://github.com/jdb78/pytorch-forecasting/pull/108/files. Could it be that this `padded_stack` function should also be used someplace else, or is something else going on? Perhaps somewhere in `_concatenate_output()`, which is called in this line: https://github.com/jdb78/pytorch-forecasting/blob/master/pytorch_forecasting/models/base_model.py#L987?

Please let me know whether it is indeed a bug, or if I am doing something wrong 🙏 Thank you!
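For completeness, the interpretation plots mentioned in the issue description are generated from the raw output, per the tutorial's pattern (a sketch; `best_tft` and `val_dataloader` are placeholders, and on affected versions the first line is what raises the RuntimeError):

```python
# Raw predictions are the input to the interpretation helpers:
raw_predictions, x = best_tft.predict(val_dataloader, mode="raw", return_x=True)

# Aggregate attention and variable importances across the validation set:
interpretation = best_tft.interpret_output(raw_predictions, reduction="sum")
best_tft.plot_interpretation(interpretation)
```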