Reproduce results from sec. 6.1 in "Variational inference using normalizing flows" #22
Comments
I was thinking something similar
Is there any reason to use squared error instead?
I agree with KL instead of squared error. Either we can just set q0 to a standard normal, or we could fit the mean/sigma to U before optimizing the flow parameters.
Yes, I think that makes sense. I didn't get the part about the sign changing, but I agree.
Are you up for testing it? Otherwise I'll try it out at some point, but that might not be until next week or something like that.
I think in the above equations the KL term was swapped. However, even with this change I didn't have much luck. The initial distribution (standard normal) just gets compressed to one of the sides (which side depends a bit on the optimizer, initialization, etc.). I also tried a simpler case. I wonder if maybe there is some issue with the implementation.
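For reference, one way to write the objective in this setting (not necessarily the exact form this comment had in mind) is a Monte Carlo estimate of KL(q_K || p), up to the unknown normalizer of p(z) ∝ exp(-U(z)). This is only a sketch; the `flow` and `u_energy` callables are hypothetical names, not anything from the repository.

```python
import numpy as np

def kl_loss_estimate(u_energy, flow, n_samples=500, rng=np.random):
    """Monte Carlo estimate of KL(q_K || p) up to a constant, where
    p(z) is proportional to exp(-U(z)) and q_0 = N(0, I) in 2D.

    `flow` is assumed to map z0 -> (zK, sum_logdet), applying the invertible
    transforms and accumulating log|det J_k| along the way."""
    z0 = rng.randn(n_samples, 2)                                  # z_0 ~ q_0 = N(0, I)
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)   # log N(z_0 | 0, I) in 2D
    zK, sum_logdet = flow(z0)
    log_qK = log_q0 - sum_logdet                                  # change-of-variables formula
    return np.mean(log_qK + u_energy(zK))                         # E[log q_K(z_K) + U(z_K)]
```

Minimizing this only requires sampling from q_0 and evaluating the unnormalized energy U at the transformed samples, which is why this KL direction is the tractable one here.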
Ok, thanks for the effort :)! So either we can't figure out how they ran the experiments, or there is a bug in the implementation. Can you share the code you used to run the experiments with me?
The code is here. Maybe one interesting experiment would be to apply two planar transformations.
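For experimenting with a couple of planar transformations directly, here is a minimal NumPy sketch of a stack of planar transforms and the accumulated log-determinant, following the planar flow in the paper (the names are hypothetical, and the û reparameterization that guarantees invertibility is omitted for brevity):

```python
import numpy as np

def planar_flow(z0, params):
    """Apply a stack of planar transforms f(z) = z + u * tanh(w^T z + b)
    and accumulate sum_k log|det J_k|, with det J = 1 + u^T psi(z) and
    psi(z) = (1 - tanh(w^T z + b)^2) * w."""
    z = z0
    sum_logdet = np.zeros(len(z0))
    for p in params:
        u, w, b = p["u"], p["w"], p["b"]
        a = z @ w + b                                  # w^T z + b, shape (N,)
        psi = (1.0 - np.tanh(a) ** 2)[:, None] * w     # h'(a) * w, shape (N, dim)
        sum_logdet += np.log(np.abs(1.0 + psi @ u))    # log|det J_k|
        z = z + np.outer(np.tanh(a), u)                # planar update
    return z, sum_logdet
```

A closure like `lambda z0: planar_flow(z0, params)` matches the `flow(z0) -> (zK, sum_logdet)` interface assumed in the loss sketch above.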
Thanks for the code. Here are a few comments.
"I think you need to subtract p(zK) = N(zK | 0,I) from the loss i.e. eq (20) in the paper": Yes that was a bit cryptic :). I just meant that log p(x,zK) = log p(zK) + log p(x|zK) = log p(zK) + U_z(zK). I think your code is missing log p(zK)? Your second plot looks very much like the paper - thats great. Maybe they just didn't completely specify how they initialized the params then. If I add log p(zK), use k=8 and w=lasagne.init.Normal(mean=0.0, std=1.0) i get and a loss of ≈2.15 |
Hmm.. but doesn't the equation you are referencing only make sense in the standard VAE setting, where there is observed data x? Here there is no observed data, and the KL-divergence is taken directly against the target density defined by U(z).
Sorry for the late response. It looks really good. If you make a short example I'll be very happy to include it in the example section. I hope to get around to testing the norm-flow implementation more thoroughly on MNIST soon, but your results seem to indicate that it is working.
I wrote an email to Rezende about this and he kindly confirmed that this is in fact a pretty tricky optimization problem and that the parameters should be initialized by drawing from a normal distribution with small variance (i.e. the transforms start out close to the identity map). Unfortunately, I still haven't been able to find a single configuration that works well for all the problems, but for some it is OK (although not quite as good as the plots in the paper). I'm a little short on time right now, but once I get a working example I don't mind contributing it to Parmesan. If you try the MNIST example and you can afford to do multiple runs, I'd be interested in knowing if initializing the flow parameters with small variance helps there as well.
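For concreteness, a minimal sketch of what near-identity initialization could look like for planar-flow parameters (the function and parameter names are hypothetical):

```python
import numpy as np

def init_planar_params(n_flows, dim=2, std=0.01, rng=np.random):
    """Draw all planar-flow parameters from N(0, std^2) with a small std,
    so each transform f(z) = z + u * tanh(w^T z + b) starts out close to
    the identity map."""
    return [
        {
            "u": rng.normal(0.0, std, size=dim),
            "w": rng.normal(0.0, std, size=dim),
            "b": rng.normal(0.0, std),
        }
        for _ in range(n_flows)
    ]
```

In the Lasagne-based code mentioned above, the same idea would presumably correspond to something like `lasagne.init.Normal(mean=0.0, std=0.01)` instead of `std=1.0`.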
Hello, I have been working on reproducing the work in this paper as well. I found that, in both the synthetic examples and on MNIST, increasing the variance of the distribution from which parameter initializations are drawn was very helpful. For example, try Uniform(-1.5, 1.5). Annealing was also quite helpful for the synthetic cases, and I have found some evidence that it is also helpful for MNIST. I also found iterative training helpful for MNIST (i.e. successively add each flow layer throughout training), though it wasn't helpful for the synthetic examples. I would be interested to hear how any of this works for you.
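To illustrate the annealing idea in the density-matching setting, here is a small sketch (hypothetical names; the linear ramp used here is just one plausible schedule, in the spirit of the annealed free energy described in the paper):

```python
import numpy as np

def annealed_kl_loss(u_energy, flow, step, n_samples=500, rng=np.random):
    """KL-style loss with an annealed weight (inverse temperature) on the
    energy term: early in training the flow mostly has to match the base
    distribution, and the multi-modal target exp(-U(z)) is phased in
    gradually."""
    beta = min(1.0, 0.01 + step / 10000.0)                        # ramp from 0.01 to 1
    z0 = rng.randn(n_samples, 2)
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)   # log N(z_0 | 0, I)
    zK, sum_logdet = flow(z0)
    return np.mean(log_q0 - sum_logdet + beta * u_energy(zK))
```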
@wuaalb would it be possible for you to share your implementation that produced the above gfycat with me?
@yberol
@wuaalb Thank you very much, really appreciate your help!
@wuaalb Thanks again for sharing your implementation, it guided me a lot as I re-implemented it using autograd. However, I have a question. When I plot the histogram of the samples z_K, everything is as expected. You are also plotting q_K(z_K) on the uniform grid. To compute q_K(z_K) on the grid, wouldn't I need to follow the inverse flow and get z_0, z_1, ..., z_{K-1} that produced z_K? Right now, I compute z_K's when z_0 is the uniform grid, but then the z_K's are warped and the plot does not look very nice. I would appreciate it if you could clarify how you computed q_K(z_K) for me.
It's been like a year and a half, so I don't remember any of the details.. To me your plots look more or less OK, just with a white background and maybe some scaling issue (?). I think in my code example this is the relevant code. This is to avoid the white background:

```python
cmap = matplotlib.cm.get_cmap(None)
ax.set_axis_bgcolor(cmap(0.))
```

I think q_K(z_K) is computed like:

```python
log_qK_zK = log_q0_z0 - sum_logdet_J
qK_zK = T.exp(log_qK_zK)
```
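To spell out that computation as a standalone sketch (hypothetical names): the uniform grid is treated as z_0 and pushed forward, and q_K is evaluated at the warped locations z_K via the change-of-variables formula, so no inverse flow is needed. The warped points are then plotted as colored points (with a dark background, as above) rather than trying to fill a regular grid in z_K space.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_qK(flow, lim=4.0, n=300):
    """Visualize q_K without inverting the flow: treat a uniform grid as z_0,
    push it forward, and color the warped points z_K by
    q_K(z_K) = exp(log q_0(z_0) - sum_logdet)."""
    xs = np.linspace(-lim, lim, n)
    xx, yy = np.meshgrid(xs, xs)
    z0 = np.column_stack([xx.ravel(), yy.ravel()])               # grid, treated as z_0
    log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - np.log(2 * np.pi)  # log N(z_0 | 0, I)
    zK, sum_logdet = flow(z0)                                    # warped locations
    qK = np.exp(log_q0 - sum_logdet)                             # change of variables
    plt.scatter(zK[:, 0], zK[:, 1], c=qK, s=2)
    plt.show()
```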
Hi, how many epochs do you need to train for each U(x) distribution? In my experiments, after 15 epochs of 10000 steps each, still nothing reasonable is sampled. Regards,
@justinmaojones, how do you train for MNIST if you don't know the target distribution? Would you be so kind as to share the (pseudo) code?
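For context, in the paper the MNIST model is a VAE whose approximate posterior is the normalizing flow, so training minimizes the free energy (eq. (20)) rather than a KL against a known target density. A rough NumPy sketch, where `encoder`, `flow`, and `decoder_log_lik` are hypothetical callables:

```python
import numpy as np

def flow_vae_free_energy(x, encoder, flow, decoder_log_lik, rng=np.random):
    """Per-batch free energy (negative ELBO) with a normalizing-flow posterior:
    F(x) = E_q[ log q_0(z_0) - sum_k log|det J_k| - log p(z_K) - log p(x | z_K) ]."""
    mu, log_sigma = encoder(x)                             # parameters of q_0(z | x)
    eps = rng.randn(*mu.shape)
    z0 = mu + np.exp(log_sigma) * eps                      # reparameterized sample from q_0
    log_q0 = np.sum(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * eps ** 2, axis=1)
    zK, sum_logdet = flow(z0)
    log_p_zK = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * zK ** 2, axis=1)  # log N(z_K | 0, I)
    return np.mean(log_q0 - sum_logdet - log_p_zK - decoder_log_lik(x, zK))
```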
Hi @yberol, did you manage to figure out how to get around the warping issue of the transformed grid?
As discussed in #21, it would be nice to reproduce the results from sec. 6.1 in the "Variational inference using normalizing flows" paper by Rezende et al.
I would guess the approach is: