In the provided example checkpoint, max_attn_resolution is set to 16. During encoding, the image passes through down-blocks at 64x64, 32x32, 16x16, and 16x16, so cross-attention is added twice (after each of the 16x16 down-blocks). During decoding, however, the image passes through 16x16, 32x32, 64x64, and 64x64, so cross-attention is added only once. Is this expected behavior (resulting in an asymmetric encoder/decoder structure)?
Hi, thank you for your interest in our work!
You are correct: there are two cross-attention blocks in the encoder but only one in the decoder. This wasn't an intentional design choice. The cross-attention mechanism was originally meant to be applied to multi-scale features, but I set max_attn_resolution to 16 mainly to save memory. Despite this, the current architecture performs well in practice. I will run experiments with more cross-attention blocks (e.g., setting max_attn_resolution to 32) to see whether that further improves performance. Thank you for bringing this to my attention!
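For anyone else reading this, here is a minimal sketch (not the repository's actual code; the resolution schedules are taken from the description above) of how a single max_attn_resolution threshold produces the asymmetry: the encoder's downsampling path visits 16x16 twice, while the decoder's upsampling path visits it only once.

```python
def attn_flags(resolutions, max_attn_resolution=16):
    """Return True for each block whose feature resolution is small
    enough to receive a cross-attention layer."""
    return [res <= max_attn_resolution for res in resolutions]

# Encoder downsampling path: 64 -> 32 -> 16 -> 16
encoder_res = [64, 32, 16, 16]
# Decoder upsampling path: 16 -> 32 -> 64 -> 64
decoder_res = [16, 32, 64, 64]

print(attn_flags(encoder_res))  # [False, False, True, True]  -> 2 cross-attn blocks
print(attn_flags(decoder_res))  # [True, False, False, False] -> 1 cross-attn block
```

Raising the threshold to 32 would, under this scheme, add cross-attention at the 32x32 blocks on both paths (3 in the encoder, 2 in the decoder), at the cost of attention over 4x as many spatial positions at that scale.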