Query about the reproducibility of the Motion Capture dataset in "Scalable Gradients..." (Li et al., 2020) #112
I am trying to reproduce the results on the CMU Motion Capture dataset. I use references from `examples/latent_sde_lorenz.py`, the paper, and the preprocessed dataset linked by ODE2VAE's repo.

My current runs' results have large discrepancies from the results in the paper, so I want to check if there are any training details I'm missing. (I am not very familiar with Bayesian modeling, so I try to follow the hyperparameters in the paper as closely as possible.)

Here are the main issues:

- The log-likelihood is in the magnitude of `10^6` to `10^9` depending on the choice of hyperparameters, while the code for ODE2VAE has the log-likelihood for ODE2VAE in the magnitude of `10^4`.

Here are the main training details I think I am most likely to have misinterpreted from the paper:

- Solver: I have tried `euler_heun`, `milstein`, and even `reversible_heun`.
- Learning rate: `0.01` in the paper, but some runs with an initial learning rate of `0.001` seem to be more stable. I'm curious if you have any comments about this.
- `dt`: `1/5` of the minimum time difference between two observations. All observations are regularly spaced, so we can choose the minimum time difference to be a particular value `a` (e.g., 1), and then `dt` would be `0.2 * a`. I want to know if my interpretation here is correct. The paper didn't mention this value `a` or the start/end time. It would be nice if you remembered this.

Tagging @lxuechen since you know the most about the experimental details. Thank you for your time!
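For concreteness, here is a minimal sketch of how the solver and step-size choices above translate into a `torchsde.sdeint` call. The toy drift/diffusion module, the grid length, and `a = 1` are assumptions for illustration, not the setup from the paper:

```python
import torch
import torchsde


class ToySDE(torch.nn.Module):
    # Stratonovich form so that euler_heun / reversible_heun are applicable.
    noise_type = "diagonal"
    sde_type = "stratonovich"

    def __init__(self, latent_size):
        super().__init__()
        self.drift = torch.nn.Linear(latent_size, latent_size)
        self.log_scale = torch.nn.Parameter(torch.zeros(latent_size))

    def f(self, t, y):  # drift
        return self.drift(y)

    def g(self, t, y):  # diagonal diffusion
        return self.log_scale.exp().expand_as(y)


batch_size, latent_size = 16, 6
sde = ToySDE(latent_size)
z0 = torch.randn(batch_size, latent_size)

a = 1.0                              # assumed spacing between observations
ts = torch.arange(0.0, 10.0 + a, a)  # regular observation grid
zs = torchsde.sdeint(sde, z0, ts, method="euler_heun", dt=0.2 * a)
print(zs.shape)  # (len(ts), batch_size, latent_size)
```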
Comments

After looking further, I see that the KL penalty at `0.01` worked for my experiments if you apply reasonable decay.
You would need the KL at time 0 and the KL between the two stochastic processes. A nice example of how this is computed can be found here for a toy task.
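For later readers, here is a minimal sketch of those two KL terms, assuming a `torchsde`-style latent SDE where `f` is the posterior drift, `h` the prior drift, and `logqp=True` makes `sdeint` also return the log-ratio increments along the path. The layer sizes, distributions, and the `kl_coeff = 0.01` weight are illustrative, not the authors' exact setup:

```python
import torch
import torchsde
from torch.distributions import Normal, kl_divergence


class LatentSDE(torch.nn.Module):
    noise_type = "diagonal"
    sde_type = "stratonovich"

    def __init__(self, latent_size):
        super().__init__()
        self.post_drift = torch.nn.Linear(latent_size, latent_size)
        self.prior_drift = torch.nn.Linear(latent_size, latent_size)
        self.log_scale = torch.nn.Parameter(torch.zeros(latent_size))

    def f(self, t, y):  # posterior drift
        return self.post_drift(y)

    def h(self, t, y):  # prior drift, needed by logqp for the path KL
        return self.prior_drift(y)

    def g(self, t, y):  # shared diagonal diffusion
        return self.log_scale.exp().expand_as(y)


batch_size, latent_size = 16, 6
sde = LatentSDE(latent_size)

# KL term 1: at time 0, between the approximate posterior over z(t0)
# (produced by the encoder in a full model) and the prior over z(t0).
qz0 = Normal(torch.randn(batch_size, latent_size), torch.ones(batch_size, latent_size))
pz0 = Normal(torch.zeros(batch_size, latent_size), torch.ones(batch_size, latent_size))
kl_t0 = kl_divergence(qz0, pz0).sum(-1).mean()

# KL term 2: between the two stochastic processes; the summed log-ratio
# increments returned by logqp=True estimate the path-space KL.
z0 = qz0.rsample()
ts = torch.linspace(0.0, 1.0, 11)
zs, log_ratio = torchsde.sdeint(sde, z0, ts, dt=0.1, logqp=True)
kl_path = log_ratio.sum(0).mean()

kl_coeff = 0.01  # the penalty weight discussed above; decay it over training
kl_term = kl_coeff * (kl_t0 + kl_path)  # add the reconstruction NLL for the full loss
```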
@nghiahhnguyen were you able to reproduce the results for this dataset?
I was able to reduce the Test MSE to the range of [8.x, 1x.x] after some changes, which is much lower than my first attempt, but still short of the 4.03 ± 0.20 reported in the paper.
Thanks for your reply, @nghiahhnguyen! Can you please share insights/hyperparameters from your setup that led to the improvement? On my end, I am getting the lower bound around `1e7` and Test MSE around 30.
Hi @abdulfatir, the point I found most helpful, which I overlooked in my first attempt, is that the standard deviation of the predicted observations should be learnable (as stated in the paper), whereas the example used a fixed hyperparameter. When I noticed and changed this, the Test MSE dropped significantly to [8.x, 1x.x], as mentioned above. I'm not sure if you have noticed it already, but I guess it's worth mentioning! Other than that, I don't recall any significant changes I made. In case it helps, I'm listing some significant hyperparameters from my best run:
With the above setting, my log-likelihood is around `x * 1e4` for training, validation, and testing. Please let me know if you managed to make further progress!
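A minimal sketch of the change being described: making the observation standard deviation a learned output of the projection network rather than a fixed hyperparameter. The layer sizes, the `softplus` parameterization, and the `min_std` floor are my own assumptions, not code from the paper or the repo:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal


class Projector(torch.nn.Module):
    """Maps latent states to a Normal over observations with a learned scale."""

    def __init__(self, latent_size=6, obs_size=50, min_std=1e-3):
        super().__init__()
        self.net = torch.nn.Linear(latent_size, 2 * obs_size)
        self.min_std = min_std

    def forward(self, z):
        mean, pre_std = self.net(z).chunk(2, dim=-1)
        # Softplus keeps the scale positive; the floor guards against the
        # NaN crashes mentioned below when the predicted std collapses to 0.
        std = F.softplus(pre_std) + self.min_std
        return Normal(mean, std)


projector = Projector()
z = torch.randn(16, 6)
x = torch.randn(16, 50)
log_px = projector(z).log_prob(x).sum(-1).mean()  # reconstruction term
```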
Thanks a lot @nghiahhnguyen! I did use trainable scales in my experiments, but only as a trainable vector (not output from a NN, as in the paper). Thanks for sharing your hyperparameters. I am giving it a go now and will keep you posted. Are you sure about clipping the grad norm to `0.5`? Given the magnitude of the gradients for this model, this is quite a small value. Something I noticed when using scales output from a NN for the observation distribution is that the model becomes very susceptible to crashing with NaNs (possibly due to numerical errors). Did you experience something similar?
I'm glad that I can be of help, @abdulfatir. I also experienced a lot of crashing with NaNs; that's why I started with such a small max norm for gradient clipping. I'm using gradient clipping only to stabilize the training process, so I guess you can try larger values and see if the process is still stable. I remember that trying much larger values did not prevent NaN-related crashes, but your setup might differ.
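For reference, a sketch of the stabilization being discussed: clipping the global gradient norm and skipping steps whose gradients are already non-finite. The `max_norm=0.5` value is the one from this thread; the `loss_fn` signature and everything else are illustrative:

```python
import torch


def training_step(model, optimizer, loss_fn, batch, max_norm=0.5):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    # Clip the global gradient norm; clip_grad_norm_ returns the pre-clip norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # Skip the update entirely if gradients blew up to NaN/Inf; clipping
    # alone cannot repair a non-finite gradient.
    if torch.isfinite(total_norm):
        optimizer.step()
    return loss.detach(), total_norm
```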
@nghiahhnguyen @abdulfatir It's probably been a while since you worked on this, but any help is appreciated! Thank you.
Is anyone willing to share their implementation to reproduce the results of the paper? 🥺 🙏 |
@matteoguarrera Unfortunately, despite trying for several weeks, I wasn't able to reproduce a number anywhere close to what's reported in the paper. In the end, for these datasets, I just copied the numbers from prior works in our paper: https://arxiv.org/abs/2301.11308