Assertion statements in attention implementation #264

Open
dnnspark opened this issue Apr 8, 2022 · 9 comments

@dnnspark

dnnspark commented Apr 8, 2022

❓ Questions and Help

I'm trying to implement Perceiver using xformers, and stumbled upon two assertion statements.

  1. The first one is this one: Doesn't this have to be `t.shape[2] % self.dim_head == 0`, to be consistent with the error message one line below?

  2. The second one is this one: why does the query projection have to preserve the dimension? I'm trying to implement a cross-attention scenario where query and key come from different sources (so they have different dimensions), and the linear projections make sure they end up with the same dimension (i.e. N x D_{query} -> N x d for query and M x D_{key} -> M x d for key); see the sketch below this list. However, the assertion above enforces that the query projection preserves its dimension. What's the point of this assertion (btw, this assertion does not exist in the original torchtext implementation)? Or, what's the right way to implement this idea?
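For concreteness, here is a sketch of the projection scheme I have in mind (plain PyTorch with made-up sizes, not the xformers code):

```python
import torch
import torch.nn as nn

# Query and key/value come from different sources with different channel counts;
# separate linear projections map both onto a common dimension d before attention.
D_query, D_key, d = 512, 256, 128
q_proj = nn.Linear(D_query, d)   # N x D_query -> N x d
k_proj = nn.Linear(D_key, d)     # M x D_key   -> M x d
v_proj = nn.Linear(D_key, d)     # M x D_key   -> M x d

query = torch.randn(4, 24, D_query)   # N = 24 query tokens
key = torch.randn(4, 36, D_key)       # M = 36 key/value tokens

q, k, v = q_proj(query), k_proj(key), v_proj(key)
print(q.shape, k.shape, v.shape)  # (4, 24, 128) (4, 36, 128) (4, 36, 128)
# The assertion in question rejects q_proj here, because D_query != d.
```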

dnnspark changed the title from "Assertion on linear projection of queries" to "Assertion statements in attention implementation" on Apr 11, 2022
@blefaudeux
Contributor

hi @dnnspark, thanks for your message! Replying to 1. first: can you walk me through the problem? I may be a little tired, but I don't see the issue right now. `dim_k` is defined as `dim_key // num_heads` (ok, the choice of letters is probably not great), so it looks like we're talking about the same thing.
Looking into 2. :)

@blefaudeux
Contributor

ok, 2. now:

  • it's not a part of the repo that I personally like a lot (I wrote a big part of it, so I'm probably allowed to say that); it's too complicated and confusing for what it does. The gist of it was to enable a small self-attention optimization (projecting with one buffer saves a lot of reads, that's here; see the sketch after this list) and to handle different init options, which are not always the same in NLP or ViT for instance
  • it looks like this assertion was just there to remove one variable; it's not fundamentally required as far as I can see. I can propose a PR to remove it; actually, I had a draft PR to rewrite this part, and it could be part of it
  • if you want absolute freedom (probably the best option if you have an exotic projection scheme), you can swap this block for another one that is just right for you; it's used as-is if it's part of the config
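To make the "project with one buffer" part of the first bullet concrete, the general trick looks roughly like this (a plain PyTorch sketch, not the actual xformers code):

```python
import torch
import torch.nn as nn

# Self-attention case: q, k and v are all projected from the same tensor, so the
# three projection weights can live in a single buffer and be applied with one
# matmul, which saves memory reads compared to three separate projections.
d = 384
in_proj = nn.Linear(d, 3 * d)            # one buffer instead of three
x = torch.randn(2, 128, d)               # (batch, sequence, model dim)

q, k, v = in_proj(x).chunk(3, dim=-1)    # each is (2, 128, 384)
```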

let me know if this helps, I can definitely follow up on this assert

blefaudeux self-assigned this on Apr 13, 2022
@dnnspark
Author

dnnspark commented Apr 13, 2022

Hi @blefaudeux, thanks for checking!

For 1, because the inputs of forward() are query, key and value with no constraints on their shape (e.g. assertion or docstring), I was assuming it works in general cross-attention scenarios. Let's say the shapes of query, key, value are (4, 24, 300), (4, 36, 300), (4, 36, 200) respectively (batch_size=4) -- they are projected from source and target data outside of this function -- and the number of heads is 20. It's a valid input because all channels (i.e. 300, 300, 200) are divisible by the number of heads: each chunk of the query and key, of shape (4, 24, 15) and (4, 36, 15), is used to compute an attention matrix of shape (4, 24, 36), which is multiplied with the corresponding chunk of the value input of shape (4, 36, 10) to fill the corresponding part of the output of shape (4, 24, 10). There are 20 of these outputs (num_heads=20), and they are concatenated to make the final output (4, 24, 200), the same shape as the query input (and it's added residually to it outside of this function).

In this case, `dim_head=20` and `dim_k=15` (300 / 20), but `dim_value` (200) is not divisible by `dim_k`.
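In code, the shape bookkeeping I mean is roughly the following (random tensors, just to make the shapes concrete):

```python
import torch

# Shapes from the example above: batch=4, 24 query tokens, 36 key/value tokens,
# query/key channels 300, value channels 200, num_heads=20.
B, Lq, Lk, Dqk, Dv, H = 4, 24, 36, 300, 200, 20
q = torch.randn(B, Lq, Dqk)   # (4, 24, 300)
k = torch.randn(B, Lk, Dqk)   # (4, 36, 300)
v = torch.randn(B, Lk, Dv)    # (4, 36, 200)

# Per-head chunks: 300 / 20 = 15 channels for q and k, 200 / 20 = 10 for v.
qh = q.reshape(B, Lq, H, Dqk // H).transpose(1, 2)   # (4, 20, 24, 15)
kh = k.reshape(B, Lk, H, Dqk // H).transpose(1, 2)   # (4, 20, 36, 15)
vh = v.reshape(B, Lk, H, Dv // H).transpose(1, 2)    # (4, 20, 36, 10)

attn = torch.softmax(qh @ kh.transpose(-2, -1) / (Dqk // H) ** 0.5, dim=-1)  # (4, 20, 24, 36)
out = (attn @ vh).transpose(1, 2).reshape(B, Lq, Dv)  # heads concatenated -> (4, 24, 200)
print(out.shape)
```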

For 2, I see; that makes a lot of sense for self-attention. And agreed, it may need some adjustments to be used in a more general cross-attention use case.

Thanks!

@blefaudeux
Contributor

> For 1, because the inputs of forward() are query, key and value with no constraints on their shape, I was assuming it works in general cross-attention scenarios. [...] In this case, dim_head=20 and dim_k=15 (300 / 20), but dim_value (200) is not divisible by dim_k.

aah yes, I see your point now: it implicitly assumes the same dimension everywhere, that's bad. It can be fixed; I'm trying to get out of a CI quagmire and will submit a PR, or feel free to do that if you fancy it

@blefaudeux
Contributor

oh, let me just volley a PR right now for 1. and this will be fixed. one sec

@dnnspark
Author

dnnspark commented Apr 13, 2022

Hey @blefaudeux, on second thought, I think you're right about this dimension issue. In my example above, there's a flaw:

> There are 20 of these outputs (num_heads=20), and they are concatenated to make the final output (4, 24, 200), the same shape as the query input (and it's added residually to it outside of this function).

It's actually not the same shape as the query input, which is (4, 24, 300). So I think the dimensions of all inputs (query, key, value) always have to be the same. In that case, the first assertion is actually correct, even though the name is a bit confusing (which makes your PR still legit).

@blefaudeux
Contributor

> Hey @blefaudeux, on second thought, I think you're right about this dimension issue. [...] So I think the dimensions of all inputs (query, key, value) always have to be the same. In that case, the first assertion is actually correct, even though the name is a bit confusing (which makes your PR still legit).

an afterthought on my side is that this assert is not at the right place anyway, unless the projection preserves dimensions (I thought that was partly your point in your explanation actually). We check the dimensions pre-projection, then project, then split heads (which is where the dimension misfit would be visible), but one could imagine an initial misfit which is "fixed" by differentiated projections (not saying that this would be a good thing to do, but it would work I believe). I'll try to fix that in the PR
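Roughly the ordering I have in mind (a sketch with made-up sizes, not the exact code that will land in the PR):

```python
import torch
import torch.nn as nn

# Differentiated projections can "fix" an initial dimension misfit, so the
# equality / divisibility checks only make sense after projection, right before
# the head split. All sizes below are made up for illustration.
num_heads, d_model = 8, 256
q_proj = nn.Linear(300, d_model)
k_proj = nn.Linear(200, d_model)
v_proj = nn.Linear(200, d_model)

query = torch.randn(4, 24, 300)
key = torch.randn(4, 36, 200)
value = torch.randn(4, 36, 200)

q, k, v = q_proj(query), k_proj(key), v_proj(value)

# Post-projection checks: the raw inputs did not match, but the projected tensors do.
assert q.shape[-1] == k.shape[-1] == v.shape[-1]
assert q.shape[-1] % num_heads == 0
```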

blefaudeux added a commit that referenced this issue Apr 14, 2022
* Fixing #264, thanks @dnnspark
* changelog addendum
* moving the dimension check to post projection
@blefaudeux
Contributor

it turns out that some of the checks were not correct (undue constraints); fixed with the attached PR

@blefaudeux
Contributor

@dnnspark I think this is fixed with the PR which landed yesterday?
