Hello 👋
Thank you for your amazing work!
I have a few questions about your paper, specifically about Mask-Guided Coordination (Section 4.3):
Is the mask-guided coordination scheme also implemented during "appearance editing"?
Is masked attention applied in the spatial self-attention block or the temporal self-attention block, or both?
When and where is masked attention applied in terms of denoising timestep $t$ and attention layer $l$?
Is it applied only during content preservation, $t>t_0, l>l_0$ (resp. structure control, $t<t_2, l>l_2$)? In other words, is the $V$ (resp. $Q, K$) in formula (6) taken from the reconstruction branch?
For the mask $M$, do you use the same mask for all video frames, or do you concatenate the per-frame masks? If it is a single shared mask, could you elaborate on how it is generated?
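To make the questions above concrete, here is a minimal NumPy sketch of how I currently picture the scheme. Everything in it is my own assumption rather than something taken from the paper or your code: the helper names (`masked_attention`, `use_reconstruction_branch`, `shared_mask`, `concatenated_mask`), the key-side form of the mask, and the gating condition $t>t_0, l>l_0$ are hypothetical and only illustrate what I am asking about.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, key_mask=None):
    """Single-head attention; key_mask (n_k,) is True where a key may be attended.

    Assumption: the mask M acts on the key side. Formula (6) may differ.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_k)
    if key_mask is not None:
        # Large negative fill so masked keys get zero weight after softmax.
        logits = np.where(key_mask[None, :], logits, -1e9)
    return softmax(logits) @ v

def use_reconstruction_branch(t, l, t0, l0):
    # Hypothetical gating for content preservation: t > t0 and l > l0.
    return t > t0 and l > l0

def shared_mask(frame_mask, num_frames):
    # Option A: one (h*w,) mask reused for every frame.
    return np.tile(frame_mask, num_frames)

def concatenated_mask(frame_masks):
    # Option B: per-frame masks (num_frames, h*w) flattened frame by frame.
    return frame_masks.reshape(-1)

# Toy example: 2 frames x 2 spatial tokens, feature dimension 8.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))   # editing-branch Q, K, V
v_rec = rng.normal(size=(4, 8))        # stand-in for reconstruction-branch V

t, l, t0, l0 = 40, 12, 25, 10          # made-up timestep/layer thresholds
v_used = v_rec if use_reconstruction_branch(t, l, t0, l0) else v
key_mask = shared_mask(np.array([True, False]), num_frames=2)
out = masked_attention(q, k, v_used, key_mask=key_mask)
```

If option A matches what you do, I would also like to know how the single `frame_mask` is obtained. If option B is closer, I assume each frame contributes its own mask before the tokens are flattened.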
P.S. What's the exact source prompt you used to generate the results in Figure 1? I tried 'A raccoon is playing guitar', but it didn't quite reproduce the cartoonish style and detailed background of your demo.
Your guidance on these questions would be immensely valuable. Many thanks!