Question about diffusion #1
Hi there, thanks very much for your good question. Yes, in that case the depth range of the output is [0, 80] for KITTI and [0, 10] for NYU, but the depth latent is normalized. The diffusion-denoising process is performed in a latent space of shape h/2 × w/2 × depth-dim, where we set depth-dim to 16; this is realized by the depth encoder-decoder. The diffusion latent space is normalized; however, when we visualize this figure we apply the depth inverse transform at each step, mapping the normalized latent feature to a depth map with real depth values. Please refer to this intermediate-layer visualization. During a single inference, considering the inference time, we only decode the latent once, after the whole denoising process.

About the phenomenon itself, it is intuitively rational: the model first forms shapes and edges, then refines the depth values step by step. We don't yet have a clear mathematical proof of why it presents itself in this way. If you have any ideas, I would very much like to discuss this point with you. Cheers
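The per-step visualization described above can be sketched as follows. This is only an illustration under assumptions: `visualize_steps`, `denoise_step`, `decode`, and `inv_transform` are hypothetical names, not the repo's actual API.

```python
def visualize_steps(z_T, denoise_step, decode, inv_transform, T=5):
    # Hypothetical sketch: run the reverse diffusion in the normalized
    # latent space and, at every step, decode the current latent and map
    # it back to metric depth via the inverse transform. This is how the
    # per-step depth maps in the figure can be produced; at normal
    # inference time only the final latent would be decoded.
    z = z_T
    frames = []
    for t in range(T, 0, -1):
        z = denoise_step(z, t)                    # one reverse-diffusion step
        frames.append(inv_transform(decode(z)))   # depth map for this step
    return frames
```

With dummy callables (e.g. a step that halves the latent and identity decode/transform), the loop yields one depth frame per denoising step.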
Thank you so much for the reply! Now I get that the diffusion process happens in the latent space, which makes sense to me. So we cannot expect the output produced from a latent feature corrupted by random noise to represent any reasonable content until it is fully denoised. I'm still trying to figure out the details of the entire process, and the code is helping a lot. Thanks again for the open source, really appreciate it!
Hi, TangTao (this comment disappeared, which is quite strange).
Thank you for the question. I didn't find the question under this issue, so I just replied to it through mail.
I think the output of `conv_inv_transform` is the real depth value after `1/conv_inv_transform.clamp(1/max_depth)`. In this case eps is 0.001, which means the largest output value would be 1000 m.
Best regards
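The clamp formula quoted above can be sketched in plain Python. The function name and the list-based input are illustrative assumptions; the repo presumably applies this to tensors.

```python
import math

def inv_transform(x, eps=1e-3):
    # Sketch of depth = 1 / clamp(x, min=eps): clamping the network
    # output from below at eps bounds the recovered depth from above.
    # With eps = 0.001, the largest representable depth is 1/0.001 = 1000 m.
    return [1.0 / max(v, eps) for v in x]

print(inv_transform([0.25, 0.0005]))  # small inputs clamp to eps, capping depth near 1000 m
```

So any output smaller than eps saturates at the maximum depth rather than exploding to infinity.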
Hi there, great work! Really appreciate that you open-sourced the code so soon!
I have some questions about the diffusion and denoising process.
The image shown in the README is really impressive:
Does this image show the denoising process? If so, why are the depth contents shown in a 'near-to-far' way?
The random Gaussian noise is $\epsilon \sim \mathcal N(0, \mathbf I)$, and the GT depth map / depth prediction should have been normalized to $[-1, 1]$; however, since the above image shows contents appearing from near to far, should I assume that the final depth map / depth prediction is not in the range $[-1, 1]$, but in a greater range (e.g., $[0, 80]$ for KITTI and $[0, 10]$ for NYU)?
If so, the diffusion and denoising steps are probably problematic, since commonly, if we choose the Gaussian noise as $\mathcal N(0, \mathbf I)$, the output range is chosen as $[-1, 1]$. And I have not yet found the normalization process in the code.
Correct me if I'm wrong; I would be very happy to hear from you!
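The convention the question refers to can be sketched as a simple linear map. This is only the standard formulation assumed in the question, not necessarily what the repo does; the function names and list-based inputs are hypothetical.

```python
def normalize_depth(d, max_depth=80.0):
    # Assumed convention: map metric depth in [0, max_depth] to [-1, 1],
    # the range matched to N(0, I) noise in most diffusion formulations.
    return [2.0 * v / max_depth - 1.0 for v in d]

def denormalize_depth(x, max_depth=80.0):
    # Inverse map from the normalized diffusion range back to metric depth.
    return [(v + 1.0) * max_depth / 2.0 for v in x]

print(normalize_depth([0.0, 40.0, 80.0]))  # [-1.0, 0.0, 1.0]
```

Under this convention, noising and denoising happen entirely in $[-1, 1]$, and metric depth is recovered only by the final denormalization.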