Reproduce results from sec. 6.1 in "Variational inference using normalizing flows" #22
Comments
I was thinking something similar
Is there any reason to use squared error instead?
I agree with KL instead of squared error. Either we can just set q0 to a standard normal, or we could fit the mean/sigma to U before optimizing the flow parameters.
Yes, I think that makes sense. I didn't get the part about the sign changing, but I agree.
Are you up for testing it? Otherwise I'll try it out at some point, but that might not be until next week or something like that.
I think in the above equations the KL term was swapped. However, even with this change I didn't have much luck. The initial distribution (standard normal) just gets compressed to one of the sides (which side depends a bit on the optimizer, initialization, etc.). I also tried a simpler case. I wonder if maybe there is some issue with the implementation.
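For reference, one way to write the objective in this setting (not necessarily the exact form this comment had in mind) is a Monte Carlo estimate of KL(q_K || p), up to the unknown normalizer of p(z) ∝ exp(-U(z)). This is only a sketch; the `flow` and `u_energy` callables are hypothetical names, not anything from the repository.

```python
import numpy as np

def kl_loss_estimate(u_energy, flow, n_samples=500, rng=np.random):
    """Monte Carlo estimate of KL(q_K || p) up to a constant, where
    p(z) is proportional to exp(-U(z)) and q_0 = N(0, I) in 2D.

    `flow` is assumed to map z0 -> (zK, sum_logdet), applying the invertible
    transforms and accumulating log|det J_k| along the way."""
    z0 = rng.randn(n_samples, 2)                                  # z_0 ~ q_0 = N(0, I)
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)   # log N(z_0 | 0, I) in 2D
    zK, sum_logdet = flow(z0)
    log_qK = log_q0 - sum_logdet                                  # change-of-variables formula
    return np.mean(log_qK + u_energy(zK))                         # E[log q_K(z_K) + U(z_K)]
```

Minimizing this only requires sampling from q_0 and evaluating the unnormalized energy U at the transformed samples, which is why this KL direction is the tractable one here.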
Ok, thanks for the effort :)! So either we can't figure out how they ran the experiments, or there is a bug in the implementation. Can you share the code you used to run the experiments with me?
The code is here. Maybe one interesting experiment would be to apply two planar transformations.
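For experimenting with a couple of planar transformations directly, here is a minimal NumPy sketch of a stack of planar transforms and the accumulated log-determinant, following the planar flow in the paper (the names are hypothetical, and the û reparameterization that guarantees invertibility is omitted for brevity):

```python
import numpy as np

def planar_flow(z0, params):
    """Apply a stack of planar transforms f(z) = z + u * tanh(w^T z + b)
    and accumulate sum_k log|det J_k|, with det J = 1 + u^T psi(z) and
    psi(z) = (1 - tanh(w^T z + b)^2) * w."""
    z = z0
    sum_logdet = np.zeros(len(z0))
    for p in params:
        u, w, b = p["u"], p["w"], p["b"]
        a = z @ w + b                                  # w^T z + b, shape (N,)
        psi = (1.0 - np.tanh(a) ** 2)[:, None] * w     # h'(a) * w, shape (N, dim)
        sum_logdet += np.log(np.abs(1.0 + psi @ u))    # log|det J_k|
        z = z + np.outer(np.tanh(a), u)                # planar update
    return z, sum_logdet
```

A closure like `lambda z0: planar_flow(z0, params)` matches the `flow(z0) -> (zK, sum_logdet)` interface assumed in the loss sketch above.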
Thanks for the code. Here are a few comments.
"I think you need to subtract p(zK) = N(zK | 0,I) from the loss i.e. eq (20) in the paper": Yes that was a bit cryptic :). I just meant that log p(x,zK) = log p(zK) + log p(x|zK) = log p(zK) + U_z(zK). I think your code is missing log p(zK)? Your second plot looks very much like the paper - thats great. Maybe they just didn't completely specify how they initialized the params then. If I add log p(zK), use k=8 and w=lasagne.init.Normal(mean=0.0, std=1.0) i get and a loss of ≈2.15 |
Hmm.. but doesn't the equation you are referencing only make sense in the standard VAE setting, where there is observed data x? Here there is no observed data, and the KL-divergence is taken directly against the target density defined by U(z).
Sorry for the late response. It looks really good. If you make a short example I'll be very happy to include it in the example section. I hope to get around to testing the norm-flow implementation more thoroughly on MNIST soon, but your results seem to indicate that it is working.
I wrote an email to Rezende about this and he kindly confirmed that this is in fact a pretty tricky optimization problem and that the parameters should be initialized by drawing from a normal distribution with small variance (i.e. the transforms start out close to the identity map). Unfortunately, I still haven't been able to find a single configuration that works well for all the problems, but for some it is OK (although not quite as good as the plots in the paper). I'm a little short on time right now, but once I get a working example I don't mind contributing it to Parmesan. If you try the MNIST example and you can afford to do multiple runs, I'd be interested in knowing if initializing the flow parameters with small variance helps there as well.
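For concreteness, a minimal sketch of what near-identity initialization could look like for planar-flow parameters (the function and parameter names are hypothetical):

```python
import numpy as np

def init_planar_params(n_flows, dim=2, std=0.01, rng=np.random):
    """Draw all planar-flow parameters from N(0, std^2) with a small std,
    so each transform f(z) = z + u * tanh(w^T z + b) starts out close to
    the identity map."""
    return [
        {
            "u": rng.normal(0.0, std, size=dim),
            "w": rng.normal(0.0, std, size=dim),
            "b": rng.normal(0.0, std),
        }
        for _ in range(n_flows)
    ]
```

In the Lasagne-based code mentioned above, the same idea would presumably correspond to something like `lasagne.init.Normal(mean=0.0, std=0.01)` instead of `std=1.0`.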
Hello, I have been working on reproducing the work in this paper as well. I found that, in both the synthetic examples and on MNIST, increasing the variance of the distribution from which parameter initializations are drawn was very helpful. For example, try Uniform(-1.5, 1.5). Annealing was also quite helpful for the synthetic cases, and I have found some evidence that it is also helpful for MNIST. I also found iterative training helpful for MNIST (i.e. successively add each flow layer throughout training), though it wasn't helpful for the synthetic examples. I would be interested to hear how any of this works for you.
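To illustrate the annealing idea in the density-matching setting, here is a small sketch (hypothetical names; the linear ramp used here is just one plausible schedule, in the spirit of the annealed free energy described in the paper):

```python
import numpy as np

def annealed_kl_loss(u_energy, flow, step, n_samples=500, rng=np.random):
    """KL-style loss with an annealed weight (inverse temperature) on the
    energy term: early in training the flow mostly has to match the base
    distribution, and the multi-modal target exp(-U(z)) is phased in
    gradually."""
    beta = min(1.0, 0.01 + step / 10000.0)                        # ramp from 0.01 to 1
    z0 = rng.randn(n_samples, 2)
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)   # log N(z_0 | 0, I)
    zK, sum_logdet = flow(z0)
    return np.mean(log_q0 - sum_logdet + beta * u_energy(zK))
```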
@wuaalb would it be possible for you to share your implementation that produced the above gfycat with me?
@yberol
@wuaalb Thank you very much, really appreciate your help!
@wuaalb Thanks again for sharing your implementation, it guided me a lot as I re-implemented it using autograd. However, I have a question. When I plot the histogram of the samples z_K, everything is as expected. You are also plotting q_K(z_K) on the uniform grid. To compute q_K(z_K) on the grid, wouldn't I need to follow the inverse flow and get z_0, z_1, ..., z_{K-1} that produced z_K? Right now, I compute z_K's when z_0 is the uniform grid, but then the z_K's are warped and the plot does not look very nice. I would appreciate it if you could clarify how you computed q_K(z_K) for me.
It's been like a year and a half, so I don't remember any of the details.. To me your plots look more or less OK, just with a white background and maybe some scaling issue (?). I think in my code example this is the relevant code. This is to avoid the white background:

```python
cmap = matplotlib.cm.get_cmap(None)
ax.set_axis_bgcolor(cmap(0.))
```

I think q_K(z_K) is computed like:

```python
log_qK_zK = log_q0_z0 - sum_logdet_J
qK_zK = T.exp(log_qK_zK)
```
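To spell out that computation as a standalone sketch (hypothetical names): the uniform grid is treated as z_0 and pushed forward, and q_K is evaluated at the warped locations z_K via the change-of-variables formula, so no inverse flow is needed. The warped points are then plotted as colored points (with a dark background, as above) rather than trying to fill a regular grid in z_K space.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_qK(flow, lim=4.0, n=300):
    """Visualize q_K without inverting the flow: treat a uniform grid as z_0,
    push it forward, and color the warped points z_K by
    q_K(z_K) = exp(log q_0(z_0) - sum_logdet)."""
    xs = np.linspace(-lim, lim, n)
    xx, yy = np.meshgrid(xs, xs)
    z0 = np.column_stack([xx.ravel(), yy.ravel()])               # grid, treated as z_0
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)  # log N(z_0 | 0, I)
    zK, sum_logdet = flow(z0)                                    # warped locations
    qK = np.exp(log_q0 - sum_logdet)                             # change of variables
    plt.scatter(zK[:, 0], zK[:, 1], c=qK, s=2)
    plt.show()
```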
Hi, how many epochs do you need to train for each U(x) distribution? In my experiments, after 15 epochs of 10000 steps each, still nothing reasonable is sampled. Regards,
@justinmaojones, how do you train for MNIST if you don't know the target distribution? Would you be so kind as to share the (pseudo) code?
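For context, in the paper the MNIST model is a VAE whose approximate posterior is the normalizing flow, so training minimizes the free energy (eq. (20)) rather than a KL against a known target density. A rough NumPy sketch, where `encoder`, `flow`, and `decoder_log_lik` are hypothetical callables:

```python
import numpy as np

def flow_vae_free_energy(x, encoder, flow, decoder_log_lik, rng=np.random):
    """Per-batch free energy (negative ELBO) with a normalizing-flow posterior:
    F(x) = E_q[ log q_0(z_0) - sum_k log|det J_k| - log p(z_K) - log p(x | z_K) ]."""
    mu, log_sigma = encoder(x)                             # parameters of q_0(z | x)
    eps = rng.randn(*mu.shape)
    z0 = mu + np.exp(log_sigma) * eps                      # reparameterized sample from q_0
    log_q0 = np.sum(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * eps ** 2, axis=1)
    zK, sum_logdet = flow(z0)
    log_p_zK = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * zK ** 2, axis=1)  # log N(z_K | 0, I)
    return np.mean(log_q0 - sum_logdet - log_p_zK - decoder_log_lik(x, zK))
```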
Hi @yberol, did you manage to figure out how to get around the warping issue of the transformed grid?
As discussed in #21, it would be nice to reproduce the results from sec. 6.1 in the "Variational inference using normalizing flows" paper by Rezende et al.
I would guess the approach is: