Question about diffusion #1

Open
XiangMochu opened this issue Mar 11, 2023 · 3 comments
Labels
good first issue Good for newcomers

Comments

@XiangMochu

Hi there, great work! I really appreciate that you open-sourced the code so soon!

I have some questions about the diffusion and denoising process.

The image shown in the README is really impressive:
[Image from the README]

Does this image show the denoising process? If so, why do the depth contents appear in a 'near-to-far' order?

The random Gaussian noise is $\epsilon \sim \mathcal N(0, \mathbf I)$, so the GT depth map / depth prediction should have been normalized to $[-1, 1]$. However, since the image above shows contents appearing from near to far, should I assume that the final depth map / depth prediction is not in the range $[-1, 1]$ but in a larger range (e.g., $[0, 80]$ for KITTI and $[0, 10]$ for NYU)?

If so, the diffusion and denoising steps are probably problematic, since when the Gaussian noise is chosen as $\mathcal N(0, \mathbf I)$, the output range is commonly chosen as $[-1, 1]$. I have also not yet found the normalization step in the code.

Correct me if I'm wrong, I would be very happy to hear from you!
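
For readers following the question, the convention it refers to can be sketched roughly as below. This is a minimal illustration, not code from the repository: the function names, the linear `[0, max_depth]` to `[-1, 1]` mapping, the sample depth values, and the `alpha_bar_t` value are all assumptions made for the example.

```python
import math
import random

def normalize_depth(depth, max_depth):
    """Map metric depth in [0, max_depth] linearly to [-1, 1] (assumed convention)."""
    return [2.0 * d / max_depth - 1.0 for d in depth]

def denormalize_depth(x, max_depth):
    """Inverse of normalize_depth: map [-1, 1] back to [0, max_depth]."""
    return [(v + 1.0) * 0.5 * max_depth for v in x]

def q_sample(x0, alpha_bar_t, rng):
    """Forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
depth = [0.5, 10.0, 40.0, 80.0]              # made-up metric depths
x0 = normalize_depth(depth, max_depth=80.0)  # values now lie in [-1, 1]
xt = q_sample(x0, alpha_bar_t=0.5, rng=rng)  # noised sample at some step t
recovered = denormalize_depth(x0, max_depth=80.0)
```

With data in `[-1, 1]`, the signal and the unit-variance noise are on comparable scales, which is why the question asks whether a `[0, 80]` output range would be a problem.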

@duanyiqun
Owner

duanyiqun commented Mar 11, 2023

Hi there, thanks very much for your good question.
Your observation is correct; we also have the impression that the depth contents appear in a 'near-to-far' order in this case. A more precise description would be that the initial depth map decoded from the depth latent (Gaussian noise) is 'near' on average, and after denoising many pixels move to 'far', although some pixels become even 'closer'. In some cases we do observe that it is initialized with an intermediate depth value.

Yes, in that case the depth range of the output is [0, 88] or [0, 10], but the depth latent is normalized.

The diffusion-denoising process is performed in a latent space of shape (h/2, w/2, depth-dim), with the depth dim set to 16. This is implemented in the depth encoder-decoder. The diffusion latent space is normalized; however, to produce this figure, we apply the inverse depth transform at each step to turn the normalized latent feature into a depth map with real depth values. Please refer to this intermediate-layer visualization.

During a normal single inference, to save inference time, we decode the latent only once, after the whole denoising process.
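
The difference between decoding at every step (for a visualization) and decoding once at the end (for inference) can be sketched as follows. This is only a toy illustration under stated assumptions: `toy_decode` and `toy_denoise_step` are hypothetical stand-ins, not the repository's decoder or denoiser, and `MAX_DEPTH`, the target latents, and the step count are made up.

```python
import random

MAX_DEPTH = 10.0  # assumed output range [0, 10]; illustrative only
DEPTH_DIM = 16    # latent channels per pixel, as stated in the thread

def toy_decode(latent):
    """Hypothetical decoder stand-in: average each pixel's channels,
    clamp to [-1, 1], then rescale to metric depth in [0, MAX_DEPTH]."""
    depth = []
    for pixel in latent:
        v = sum(pixel) / len(pixel)
        v = max(-1.0, min(1.0, v))
        depth.append((v + 1.0) * 0.5 * MAX_DEPTH)
    return depth

def toy_denoise_step(latent, target, rate=0.5):
    """Hypothetical reverse-step stand-in: move the latent a fixed fraction
    toward a clean target. A real model would predict the noise instead."""
    return [[z + rate * (t - z) for z, t in zip(zp, tp)]
            for zp, tp in zip(latent, target)]

def run(num_steps, target, rng, visualize=False):
    """Denoise in latent space; decode every step only if visualize is True."""
    latent = [[rng.gauss(0.0, 1.0) for _ in range(DEPTH_DIM)] for _ in target]
    frames = []
    for _ in range(num_steps):
        latent = toy_denoise_step(latent, target)
        if visualize:
            frames.append(toy_decode(latent))  # extra decode per step
    return toy_decode(latent), frames          # final decode happens once

# Two "pixels": one near, one far, expressed as clean normalized latents.
target = [[-0.8] * DEPTH_DIM, [0.6] * DEPTH_DIM]
final_fast, no_frames = run(20, target, random.Random(0))
final_vis, frames = run(20, target, random.Random(0), visualize=True)
```

Both calls produce the same final depth; the visualization run simply pays for one additional decode per step, which is the cost the reply says is avoided during normal inference.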

As for the phenomenon itself, it is intuitively reasonable: the model first forms shapes and edges and then refines the depth values step by step. We do not yet have a clear mathematical proof of why it behaves this way. If you have any ideas, I would very much like to discuss this point with you.

Cheers

@duanyiqun duanyiqun added the good first issue Good for newcomers label Mar 11, 2023
@XiangMochu
Author

Thank you so much for the reply! Now I understand that the diffusion process happens in the latent space, and that makes sense to me. So we cannot expect the output decoded from a latent feature that is corrupted by random noise to represent any reasonable content until it is fully denoised.

I'm still trying to figure out the details of the entire process, and the code is helping a lot. Thanks again for open-sourcing it, I really appreciate that!

@duanyiqun
Owner

duanyiqun commented Jul 27, 2023 via email
