Kernels for GroupNorm #353
Conversation
c2 += tl.sum(wdy)

# Need to ensure additions to the same channel are atomic
tl.atomic_add(DW_ptr + channel_idx, dW.to(dtype))
@ByronHsu is it possible for us to test on multiple GPUs? Specifically, given that
scope (str, optional) – Defines the scope of threads that observe the synchronizing effect of the atomic operation. Acceptable values are “gpu” (default), “cta” (cooperative thread array, thread block), or “sys” (stands for “SYSTEM”). The default value is “gpu”.
does the default value work for multi-GPU?
What kind of testing? Running in a 4-GPU environment to ensure the kernel works fine on a single GPU? I'm not sure how this is related to multi-GPU; my understanding is that the kernel only runs on one GPU.
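For illustration, here is a minimal, self-contained sketch of per-channel accumulation with an explicit atomic scope. This is not the PR's kernel; the kernel, launcher, and shapes are made up for the example, and it assumes a Triton version whose tl.atomic_add accepts the scope keyword (“gpu”, “cta”, or “sys”). Since a single kernel launch runs on one device, the default device-wide “gpu” scope should be sufficient, which matches the reply above.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _channel_sum_kernel(X_ptr, OUT_ptr, n_cols, n_channels, BLOCK_SIZE: tl.constexpr):
    # One program per row; rows that map to the same channel race on OUT_ptr.
    row_idx = tl.program_id(0)
    channel_idx = row_idx % n_channels
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(X_ptr + row_idx * n_cols + cols, mask=mask, other=0.0)
    partial = tl.sum(x)
    # Additions to the same channel must be atomic. "gpu" (the default) makes the
    # update visible device-wide; "sys" would only be needed if another device or
    # the host raced on the same memory.
    tl.atomic_add(OUT_ptr + channel_idx, partial, scope="gpu")


def channel_sums(x: torch.Tensor, n_channels: int) -> torch.Tensor:
    # x: float32 tensor of shape (rows, cols); rows cycle through channels.
    rows, cols = x.shape
    out = torch.zeros(n_channels, device=x.device, dtype=x.dtype)
    BLOCK_SIZE = triton.next_power_of_2(cols)
    _channel_sum_kernel[(rows,)](x, out, cols, n_channels, BLOCK_SIZE=BLOCK_SIZE)
    return out
```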
Very solid PR!
@pramodith can you update the readme to include groupnorm?
Will do tomorrow!
Summary
Implementation of group norm that achieves output parity with torch's GroupNorm. This feature is a part of #285.
Details
The formulas/equations involved in GroupNorm are the same as in LayerNorm/BatchNorm. The main differences lie in the axes along which the mean and std are computed and in the dimensions of the affine transformation parameters. In group norm, W and B are of shape (n_channels), but the mean and std are calculated over all the channels in a given group.
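As a concrete illustration of the math described above, here is a small PyTorch reference sketch (not the PR's Triton kernel): one mean/std per (sample, group), a per-channel affine, checked against torch.nn.GroupNorm. The shapes and the helper name are illustrative.

```python
import torch


def group_norm_reference(x, weight, bias, num_groups, eps=1e-5):
    # x: (N, C, *); weight, bias: (C,)
    N, C = x.shape[:2]
    g = x.reshape(N, num_groups, -1)                  # each group's channels + spatial dims
    mean = g.mean(dim=-1, keepdim=True)               # one mean/std per (sample, group)
    var = g.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = ((g - mean) / torch.sqrt(var + eps)).reshape_as(x)
    # The affine parameters W and B are per channel, not per group.
    shape = (1, C) + (1,) * (x.dim() - 2)
    return x_hat * weight.reshape(shape) + bias.reshape(shape)


x = torch.randn(8, 16, 32)
gn = torch.nn.GroupNorm(num_groups=4, num_channels=16)
torch.nn.init.normal_(gn.weight)
torch.nn.init.normal_(gn.bias)
torch.testing.assert_close(
    group_norm_reference(x, gn.weight, gn.bias, num_groups=4), gn(x), rtol=1e-5, atol=1e-5
)
```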
Testing Done
Testing was done on an A100 PCIe and an A100 SXM4.
We see an increase in speed, while the total memory used remains about the same. Note that benchmarking was done with a batch size of 128, a hidden dimension of 512, and the number of channels per group fixed at 4.
These results look very similar to the LayerNorm benchmark.
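For context, here is a minimal sketch of how a forward+backward timing at roughly these sizes could be reproduced with CUDA events. The input shape, the group layout, and the warmup/iteration counts are assumptions and are not taken from the PR's benchmark script.

```python
import torch


def time_forward_backward(module, x, warmup=10, iters=100):
    # Time forward + backward with CUDA events; returns mean milliseconds per iteration.
    for _ in range(warmup):
        module(x).sum().backward()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        module(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


# Batch size 128, 512 channels split into groups of 4 channels each (shape assumed).
x = torch.randn(128, 512, 64, device="cuda", requires_grad=True)
baseline = torch.nn.GroupNorm(num_groups=512 // 4, num_channels=512).cuda()
print(f"torch GroupNorm: {time_forward_backward(baseline, x):.3f} ms")
# The GroupNorm module added by this PR could be timed the same way by
# swapping it in for `baseline`.
```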
- make test to ensure correctness
- make checkstyle to ensure code style
- make test-convergence to ensure convergence