feat(triton): InplaceNorm + InstanceNorm #50

Closed
ClashLuke opened this issue Oct 28, 2021 · 1 comment · Fixed by #53

Comments

@ClashLuke
Contributor

I'd love to run LayerNorm in place and ideally also add InstanceNorm (by extracting the core normalization from LayerNorm) as HomebrewNLP is currently using a slow PyTorch-level implementation with a correct backward pass.
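
To make "extracting the core normalization" concrete, here is a rough PyTorch-level sketch (not the actual HomebrewNLP code): LayerNorm and InstanceNorm share the same standardization and only differ in which axes the statistics are reduced over.

```python
# Rough PyTorch-level sketch (not the actual HomebrewNLP code): LayerNorm and
# InstanceNorm share the same standardization, they just reduce over different axes.
import torch

def standardize(x: torch.Tensor, dim: int, eps: float = 1e-5) -> torch.Tensor:
    # Core normalization shared by LayerNorm and InstanceNorm
    mean = x.mean(dim=dim, keepdim=True)
    var = x.var(dim=dim, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 16, 128)  # (batch, features, sequence)

# LayerNorm-style: statistics over the feature dim, per position
layer_normed = standardize(x.transpose(1, 2), dim=-1).transpose(1, 2)

# InstanceNorm-style: statistics over the sequence dim, per feature channel
instance_normed = standardize(x, dim=-1)
```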

While we're at it, optionally fusing GLU and GLUv2 (gelu(f(x)) * g(x) + gelu(h(x))) with various activation functions and normalization might give another speed boost.
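
For reference, here is the GLUv2 expression as a plain, unfused PyTorch sketch; `f`, `g` and `h` are just placeholder projections standing in for whatever is used in practice, and this is the computation a fused kernel would replace.

```python
# Plain, unfused PyTorch reference for GLUv2; f, g and h are placeholder projections.
import torch
import torch.nn.functional as F

def glu_v2(x: torch.Tensor, f, g, h) -> torch.Tensor:
    # gelu(f(x)) * g(x) + gelu(h(x))
    return F.gelu(f(x)) * g(x) + F.gelu(h(x))

dim = 64
f, g, h = (torch.nn.Linear(dim, dim) for _ in range(3))
y = glu_v2(torch.randn(2, dim), f, g, h)
```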

To add this myself, I'd need to fully understand Triton's pointers and how to write to the output instead of the input in your LayerNorm implementation. Could you help me with that, or would you rather implement this yourself? Is this even in the scope of xformers?

@blefaudeux
Contributor

That's a great question! We're definitely open to contributions, and this looks very reasonable; there's a good chance that Triton gives you something a lot faster than PyTorch here. In terms of scope I think it's very much OK, since xformers to me is also a zoo of optimized parts (with some automatic builders on top, but those are optional).

Just a couple of caveats to begin with:

  • While some Triton parts will probably work on a GTX 1xxx (basically anything that doesn't use the tensor cores, i.e. no dot product), it's not guaranteed, so it's good to keep a PyTorch fallback in mind (a sketch of such a fallback follows this list).

  • We keep all our Triton code here; k_xx means kernel (the @jit code), kept apart so that it does not distort the code coverage metrics. LayerNorm is there (basically taken from @ptillet's tutorial). Making the norm parametrizable is a great idea; it's how the fused linear layer works (the activations are here), and Triton is awesome for that since it generates a fused kernel on the fly for whatever you pass in, so this becomes easily extensible.

  • Who does what? A PR is most certainly welcome if you feel like it; otherwise I can put up a PR that does not change LayerNorm per se, but makes the norm a @jit function you can pass in, like it's done for the fused linear layer (see the second sketch after this list). That way you could easily experiment with other norms and submit them in PRs. As you prefer!

  • Mini heads-up: for now GELU is a little on the slow side with Triton (80% of PyTorch's speed, give or take). That could improve over time, and the fused combination may still be worth it; in any case it would be easy enough to try.
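
A minimal sketch of the fallback idea from the first bullet; `build_triton_kernel` is a placeholder injected by the caller, not an existing xformers API.

```python
# Minimal sketch of the "keep a PyTorch fallback" idea; `build_triton_kernel`
# is a placeholder constructor, not an existing xformers API.
import torch

class NormWithFallback(torch.nn.Module):
    def __init__(self, shape, eps: float = 1e-5, build_triton_kernel=None):
        super().__init__()
        self.fallback = torch.nn.LayerNorm(shape, eps=eps)
        self.triton_fn = None
        if build_triton_kernel is not None and torch.cuda.is_available():
            # Older GTX 1xxx cards (sm_6x) are not guaranteed to work, so gate on
            # compute capability and keep the PyTorch path around.
            major, _ = torch.cuda.get_device_capability()
            if major >= 7:
                self.triton_fn = build_triton_kernel(shape, eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.triton_fn is not None and x.is_cuda:
            return self.triton_fn(x)
        return self.fallback(x)
```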
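
And a sketch of the "norm as a @jit function you pass in" idea, assuming a Triton version that lets you hand a @triton.jit function to a kernel as a constexpr argument (the way the fused linear layer handles activations). Names and signatures here are illustrative only, not the actual xformers kernels.

```python
# Illustrative only, not the actual xformers kernel: a row-wise kernel where the
# normalization itself is a @triton.jit function handed in as a constexpr argument.
import torch
import triton
import triton.language as tl


@triton.jit
def standardize(x, mask, n_cols, eps):
    # Core normalization: (x - mean) / sqrt(var + eps), computed over one row.
    # A production kernel would accumulate in fp32; this keeps the input dtype.
    mean = tl.sum(tl.where(mask, x, 0.0), axis=0) / n_cols
    centered = tl.where(mask, x - mean, 0.0)
    var = tl.sum(centered * centered, axis=0) / n_cols
    return (x - mean) / tl.sqrt(var + eps)


@triton.jit
def row_norm_kernel(X, Y, stride, n_cols, eps,
                    NORM_FN: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0)
    y = NORM_FN(x, mask, n_cols, eps)  # swap in any other @triton.jit norm here
    tl.store(Y + row * stride + cols, y, mask=mask)


def row_normalize(x: torch.Tensor, norm_fn=standardize, eps: float = 1e-5):
    # Each row of a 2D tensor is normalized independently by `norm_fn`.
    rows, cols = x.shape
    y = torch.empty_like(x)
    block = triton.next_power_of_2(cols)
    row_norm_kernel[(rows,)](x, y, x.stride(0), cols, eps,
                             NORM_FN=norm_fn, BLOCK_SIZE=block)
    return y
```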

xwhan pushed a commit to xwhan/xformers that referenced this issue Feb 8, 2022

* moving local attention to sparse backend
* better handling of the causal/window sizes implications
* some cleaning up