CompatHelper: bump compat for Flux in [weakdeps] to 0.15, (keep existing compat) #1124
Closed
github-actions wants to merge 1 commit into main from compathelper/new_version/2024-12-06-00-19-20-030-01136868030
avik-pal force-pushed the compathelper/new_version/2024-12-06-00-19-20-030-01136868030 branch from fa03127 to 325951b (December 6, 2024 00:19)
Benchmark Results (ASV)
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
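Reading the Ratio column: judging from the rows in the table, Ratio appears to be Current divided by Previous, rounded to two decimals, so a value above 1 means the PR commit (325951b) measured slower than the baseline (ef0d450). The following is a minimal Python sketch under that assumption. The helper names and the 1.10 threshold are hypothetical and are not part of the benchmark tooling.

```python
# Sketch only: assumes Ratio = Current / Previous, rounded to two decimals.
# The function names and the 1.10 threshold are illustrative assumptions.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows, threshold=1.10):
    """Return names of benchmarks whose ratio exceeds the threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


# Three rows copied from the table: (name, current ns, previous ns).
rows = [
    ("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)", 318083, 224167),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)", 1083, 1334),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4187.5, 4208),
]

for name, cur, prev in rows:
    print(f"{ratio(cur, prev):.2f}  {name}")

print(flag_regressions(rows))
```

Sub-microsecond CPU timings in the table swing noticeably between runs (for example 0.78 and 1.28 on benchmarks of a few hundred nanoseconds), so any fixed threshold is at best a rough filter.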
Benchmark suite | Current: 325951b | Previous: ef0d450 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
4187.5 ns |
4208 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4750 ns |
4834 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
5417 ns |
5375 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4458 ns |
4083 ns |
1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
61500 ns |
58557 ns |
1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
10500 ns |
10625 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10542 ns |
10542 ns |
1 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10541 ns |
11375 ns |
0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10292 ns |
10083 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
434414 ns |
415171 ns |
1.05 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1083 ns |
1334 ns |
0.81 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1125 ns |
1209 ns |
0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
1250 ns |
1333.5 ns |
0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
1250 ns |
1208 ns |
1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
18684 ns |
17961 ns |
1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
3791 ns |
4084 ns |
0.93 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
3958 ns |
3959 ns |
1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4292 ns |
4333 ns |
0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
4125 ns |
4000 ns |
1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
112466 ns |
107003.5 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
71000 ns |
70834 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
64000 ns |
64375 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
63958 ns |
64500 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
78459 ns |
80375 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38059 ns |
36906 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2053104.5 ns |
2031562.5 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2089167 ns |
2088542 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2094875 ns |
2093958 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1997229.5 ns |
1926833 ns |
1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
198929 ns |
192315 ns |
1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
193958.5 ns |
196625 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
182917 ns |
195542 ns |
0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
189667 ns |
185209 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
183062.5 ns |
182375 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
166083 ns |
166552 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1146437.5 ns |
1111896 ns |
1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1124750 ns |
1118729.5 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1124917 ns |
1119708 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1114208.5 ns |
1130333.5 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
536148.5 ns |
514050 ns |
1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3604.5 ns |
3500 ns |
1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4042 ns |
3416 ns |
1.18 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4145.5 ns |
4459 ns |
0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3583 ns |
3416.5 ns |
1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
68781 ns |
67303.5 ns |
1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8875 ns |
9084 ns |
0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9458 ns |
9750 ns |
0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8875 ns |
9625 ns |
0.92 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8875 ns |
8625 ns |
1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
500066 ns |
472568 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15333.5 ns |
15020.5 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
15833 ns |
14666 ns |
1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17041.5 ns |
18625 ns |
0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
15083.5 ns |
14875 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
56024 ns |
53079 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
216375 ns |
224750 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
215271 ns |
215104.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
214417 ns |
215917 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
214479 ns |
215083 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
278999 ns |
267364.5 ns |
1.04 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
584 ns |
750 ns |
0.78 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
666 ns |
709 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
750 ns |
750 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
812.5 ns |
750 ns |
1.08 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
17806 ns |
17115 ns |
1.04 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1542 ns |
1500 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1833 ns |
1792 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1750 ns |
1500 ns |
1.17 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1375 ns |
1375 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
105021 ns |
99326.5 ns |
1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
9334 ns |
7833 ns |
1.19 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
8125 ns |
7291 ns |
1.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
7958 ns |
7083 ns |
1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10250 ns |
9958 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
24322 ns |
23212 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221437.5 ns |
233458.5 ns |
0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
236500 ns |
228125 ns |
1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228459 ns |
228666 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
228125 ns |
214125 ns |
1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
173657.5 ns |
164950.5 ns |
1.05 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3875 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3916 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3875 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
23898.5 ns |
23508 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16750 ns |
16959 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16917 ns |
17042 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
17583 ns |
17083 ns |
1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16917 ns |
16708 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
166893 ns |
160457.5 ns |
1.04 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
610520.5 ns |
611125 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
601167 ns |
609042 ns |
0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
604333 ns |
606834 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
605666 ns |
605520.5 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113814 ns |
113172 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1429334 ns |
1423834 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1422125 ns |
1422458 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1427166 ns |
1424292 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
1422708 ns |
1420334 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
217800 ns |
209423.5 ns |
1.04 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) |
1075458 ns |
1082229.5 ns |
0.99 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) |
969416.5 ns |
970792 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) |
1352521 ns |
1346208 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) |
1319166.5 ns |
1300333 ns |
1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA |
282523.5 ns |
270348.5 ns |
1.05 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) |
6011438 ns |
5996021 ns |
1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) |
4574438 ns |
4506125 ns |
1.02 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) |
4957125 ns |
4914416 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) |
5725208.5 ns |
5507375 ns |
1.04 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA |
1113189 ns |
1074060 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
541 ns |
542 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
542 ns |
542 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
583 ns |
542 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
541 ns |
541 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
24154 ns |
23487 ns |
1.03 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2209 ns |
2167 ns |
1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2208 ns |
2125 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2291 ns |
2167 ns |
1.06 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2125 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
180828 ns |
168855 ns |
1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4500 ns |
4167 ns |
1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4666 ns |
4334 ns |
1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5000 ns |
5041 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4229.5 ns |
3667 ns |
1.15 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
67149 ns |
64100 ns |
1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11625 ns |
11291 ns |
1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11666 ns |
11875 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11875 ns |
12291 ns |
0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10916.5 ns |
11000 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
464242.5 ns |
442842 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6625 ns |
6042 ns |
1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6583 ns |
6104.5 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7166 ns |
7209 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5917 ns |
5708 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
53689.5 ns |
51573 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
17791 ns |
17041.5 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
18270.5 ns |
17292 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
18937.5 ns |
17625 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
16875 ns |
17250 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
312733.5 ns |
299598.5 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
33904.5 ns |
32513 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
8917 ns |
8458 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8958 ns |
9000 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9458 ns |
9084 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8417 ns |
8458 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
161406.5 ns |
155298 ns |
1.04 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
96583 ns |
96666 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
96041 ns |
96708 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
96500 ns |
96292 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
91292 ns |
96375 ns |
0.95 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
112280 ns |
111447.5 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
285125 ns |
278125 ns |
1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
275916 ns |
275250 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
279000 ns |
274583.5 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
275041 ns |
277584 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
192737.5 ns |
190076 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) |
3398084 ns |
3409792 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) |
3118416 ns |
3047666 ns |
1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) |
3027854 ns |
3023958 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) |
4090167 ns |
3959958 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA |
577236 ns |
579376.5 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) |
7633396 ns |
7632583 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) |
7448020.5 ns |
7497667 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) |
7461875 ns |
7451520.5 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) |
8159958 ns |
8199583 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA |
1336808 ns |
1349456 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) |
17544583 ns |
17500916.5 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) |
17552542 ns |
17545437.5 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) |
17549333 ns |
17599584 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) |
14130000 ns |
14108083 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
24110562.5 ns |
23772875 ns |
1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
33971375 ns |
34134729 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37615667 ns |
37435375 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34505479 ns |
34708708 ns |
0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1844860 ns |
1860458 ns |
0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
316909854.5 ns |
316659729.5 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
235311521 ns |
235623563 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
196995687.5 ns |
195619437 ns |
1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
279801417 ns |
279867979.5 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
13918562 ns |
13932935 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
274414625 ns |
273833833 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
265951833 ns |
267231583 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
256018875 ns |
255610333 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
330216209 ns |
329098667 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
22125 ns |
21375 ns |
1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
22646 ns |
22125 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23542 ns |
25292 ns |
0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
21959 ns |
21125 ns |
1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
95421 ns |
94977 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
103875 ns |
103542 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
105354.5 ns |
103791 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
103854 ns |
105125 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
115292 ns |
103250 ns |
1.12 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
511347 ns |
500332.5 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6333 ns |
5875 ns |
1.08 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6125 ns |
6417 ns |
0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
6333 ns |
6750 ns |
0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6084 ns |
6000 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
69796 ns |
68160.5 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14709 ns |
14500 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
16042 ns |
15000 ns |
1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15792 ns |
16500 ns |
0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14750 ns |
14584 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
489291 ns |
477825.5 ns |
1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3123667 ns |
3101458 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2118374.5 ns |
2118542 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2311791 ns |
2321249.5 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
5004083 ns |
4650021 ns |
1.08 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
589263 ns |
585427 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
23622375 ns |
23564209 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18685708 ns |
18768041 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
17945875 ns |
17974229 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
35440521 ns |
35659708 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
2898295.5 ns |
2760352.5 ns |
1.05 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
33848562.5 ns |
34076750.5 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
27636625.5 ns |
27653896 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28626542 ns |
28752229 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
40805979.5 ns |
40853625 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
71979.5 ns |
74667 ns |
0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
72708 ns |
71833.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
73833 ns |
73521 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
72604.5 ns |
71770.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
105033.5 ns |
100115 ns |
1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
210291.5 ns |
292083 ns |
0.72 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
318083 ns |
224167 ns |
1.42 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
316833 ns |
297708 ns |
1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
275083 ns |
205792 ns |
1.34 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
559668 ns |
537710 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
11833 ns |
11750 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
11750 ns |
11416 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
12583.5 ns |
12542 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12000 ns |
12270.5 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
73102 ns |
71148.5 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26292 ns |
26208 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
27417 ns |
26875 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
27834 ns |
27625 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26834 ns |
26500 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
487396 ns |
468928 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
12417 ns |
12250 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
12583 ns |
12166 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
13459 ns |
13500 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12250 ns |
12042 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
54541.5 ns |
52398 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26541 ns |
25250 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
26083 ns |
26125 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
26709 ns |
26042 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26000 ns |
26000 ns |
1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
312239.5 ns |
301242 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
179625 ns |
179104.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
180084 ns |
179750 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
182020.5 ns |
180583 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
180458.5 ns |
178625 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
59147.5 ns |
55842.5 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
593312 ns |
582584 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
583041 ns |
591917 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
603459 ns |
594313 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
582292 ns |
583166 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
287774 ns |
280084 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6042 ns |
5958 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6333 ns |
6000 ns |
1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7312.5 ns |
6500 ns |
1.13 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6187.5 ns |
5625 ns |
1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
70958 ns |
70229 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14312.5 ns |
13875 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15000 ns |
14542 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15458 ns |
15187.5 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14375 ns |
14458 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
467345.5 ns |
456073.5 ns |
1.02 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
1257541.5 ns |
1235292 ns |
1.02 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
1308291 ns |
1304042 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
1375979.5 ns |
1374021 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
1091750 ns |
1092083 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
302420 ns |
302409 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
4231312 ns |
4120521 ns |
1.03 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
4443833 ns |
4446875 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
4589250 ns |
4623750 ns |
0.99 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
3708520.5 ns |
3716729.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1044852 ns |
1039016 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1875 ns |
1792 ns |
1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1834 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1875 ns |
1833 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1875 ns |
1917 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
24020 ns |
23753 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4917 ns |
4833 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4958 ns |
4917 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
5042 ns |
4875 ns |
1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4916 ns |
4875 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
193691 ns |
186693 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6166 ns |
5959 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
6000 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6875 ns |
7083 ns |
0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5645.5 ns |
5667 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
56339 ns |
54622.5 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11334 ns |
11167 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11625 ns |
11541 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11834 ns |
11250 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10500 ns |
10542 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
339285.5 ns |
325703 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
334 ns |
375 ns |
0.89 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
375 ns |
334 ns |
1.12 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
375 ns |
292 ns |
1.28 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
334 ns |
333 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
23097 ns |
22898 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2792 ns |
0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
3042 ns |
3041 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2834 ns |
3041 ns |
0.93 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2750 ns |
2750 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
161548 ns |
157339 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11375 ns |
11625 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
12125 ns |
12083 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12750 ns |
12417 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
11417 ns |
11229.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
58287.5 ns |
55735 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24750 ns |
24959 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
25166 ns |
25042 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25083 ns |
25042 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25041 ns |
25042 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
302525 ns |
288122.5 ns |
1.05 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4209 ns |
4250 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4167 ns |
4208 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4209 ns |
4208 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4208 ns |
4250 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24886 ns |
24760 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16208 ns |
16333 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16167 ns |
16333 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16708 ns |
16500 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16250 ns |
16459 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
200919.5 ns |
193221.5 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5792 ns |
5791 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5791 ns |
5792 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5750 ns |
5791 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5833 ns |
5750 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
35005 ns |
33178 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
20750 ns |
20750 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21083 ns |
20708 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21500 ns |
20916 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20709 ns |
20708 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
178740 ns |
172900.5 ns |
1.03 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
416958 ns |
420188 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
386792 ns |
386937.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
484624.5 ns |
482833 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
138875 ns |
106250 ns |
1.31 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
67016 ns |
67134 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
938167 ns |
865417 ns |
1.08 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
980375 ns |
948604 ns |
1.03 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
1193208 ns |
1189500 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
457062.5 ns |
411770.5 ns |
1.11 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
190892.5 ns |
190610 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
140688 ns |
136750 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
139083 ns |
133396 ns |
1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
144500 ns |
133166.5 ns |
1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
78146 ns |
138854 ns |
0.56 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193931.5 ns |
192824 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1928521 ns |
1917250 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1927791.5 ns |
1912124.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1927541 ns |
1920250 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1916479.5 ns |
1942521 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
406874.5 ns |
395139 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
333 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
22097 ns |
22003 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1875 ns |
1834 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1834 ns |
1833 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1875 ns |
1875 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1833 ns |
1833 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
176834 ns |
168855 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6625 ns |
6812.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7083.5 ns |
6750 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7917 ns |
8187.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6833 ns |
6334 ns |
1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
63646 ns |
59378.5 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9459 ns |
9312.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9458 ns |
9209 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9500 ns |
9333 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9083 ns |
9083 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
320974.5 ns |
305200.5 ns |
1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
113058083 ns |
112669000 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
173084604 ns |
174180000 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
143396333 ns |
143189875 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
111140500 ns |
112387917 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5478656 ns |
5463061 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
623531250 ns |
616937396 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
557544917 ns |
558474917 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
453377687.5 ns |
448891770.5 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
625106375 ns |
624388062.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
34926627 ns |
38238112 ns |
0.91 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
670253666 ns |
665577792 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
665374020.5 ns |
667381166.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
602557458.5 ns |
616459979 ns |
0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
744888292 ns |
747251209 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
62792 ns |
62750 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
53333 ns |
53834 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
53167 ns |
53458 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81875 ns |
82125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38096 ns |
37037 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1931958 ns |
1926667 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1976500 ns |
1974291 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1982812.5 ns |
1980021 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1883666 ns |
1901875 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
176530.5 ns |
171617 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
268354 ns |
265333 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
266084 ns |
269750 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
266770.5 ns |
269083.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
267083 ns |
264854.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
138471 ns |
124229 ns |
1.11 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
684875 ns |
687584 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
704895.5 ns |
678833 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
680063 ns |
680125 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
693625 ns |
635854 ns |
1.09 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
733589 ns |
697446 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2236667 ns |
2242458 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2232792 ns |
2097875 ns |
1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2245895.5 ns |
2254458 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2260750 ns |
2199750.5 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
133706 ns |
132519 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5517709 ns |
5507312 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5515458 ns |
5516959 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5507792 ns |
5495292 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5478667 ns |
5486271 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
757656.5 ns |
737355 ns |
1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
682208 ns |
678417 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
665834 ns |
671291 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
667625 ns |
668458 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
676792 ns |
682958 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
47666 ns |
46914 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1824500 ns |
1824791.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1715667 ns |
1728375 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1715792 ns |
1718604.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
2083667 ns |
2080500 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
226213.5 ns |
221890.5 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
72292 ns |
70750 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
52625 ns |
53125 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
52875 ns |
52916 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82333 ns |
82375 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
29129 ns |
28168 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2054083 ns |
2031792 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2092958 ns |
2096833.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2095458 ns |
2088000 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2000313 ns |
2001083.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
191172 ns |
187289.5 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
13413292 ns |
13449750 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
12504667 ns |
12528021.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
12610375 ns |
12554687.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
15299333 ns |
15230083 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
514224 ns |
513617 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
46987896 ns |
46862979 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
41518750 ns |
41543521 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
40802062.5 ns |
40829437.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
57892000 ns |
58532271 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3029248.5 ns |
2896866 ns |
1.05 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
98175875 ns |
74392375 ns |
1.32 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
90780729.5 ns |
90893292 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
92033792 ns |
92732000 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
75986750 ns |
76658749.5 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
73292 ns |
70625 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
61667 ns |
64875 ns |
0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
53292 ns |
64625 ns |
0.82 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82208.5 ns |
81917 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
47360.5 ns |
47851 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1930000 ns |
1923187.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1976916.5 ns |
1983437.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1977875 ns |
1973333 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1892000 ns |
1883833 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
190114 ns |
193982.5 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
416 ns |
292 ns |
1.42 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
416 ns |
375 ns |
1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
32602 ns |
32956 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6458 ns |
6125 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6708 ns |
6416 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6708 ns |
6375 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6084 ns |
5875 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
175347.5 ns |
176118.5 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
291 ns |
1.14 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32038 ns |
32831 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2792 ns |
2667 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2875 ns |
2916 ns |
0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2875 ns |
2875 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2667 ns |
2625 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
165406 ns |
165694 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
277154687.5 ns |
278326104 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
340341375 ns |
340448937.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
310234625 ns |
308909437.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
279111562.5 ns |
278977666.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
7096722 ns |
7109405 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
999677125 ns |
997951584 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
940767958 ns |
940941292 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
833671583 ns |
832217625 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
1012000792 ns |
1009333917 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
34086228 ns |
33893371 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
1815872125 ns |
1394325042 ns |
1.30 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
1685763250 ns |
1705224209 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
1637202792 ns |
1693911291 ns |
0.97 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
1305223708 ns |
1308776729 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1463833 ns |
1456667 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1432479 ns |
1462958 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1452084 ns |
1454521 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1468708.5 ns |
1451416.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
127733 ns |
127922 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5038771 ns |
5012417 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5036500 ns |
5028750 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5032417 ns |
5027959 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5015521 ns |
5027187.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
597041 ns |
506424 ns |
1.18 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) |
158379834 ns |
157716375 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) |
128622999.5 ns |
136859042 ns |
0.94 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) |
164768458 ns |
164218250 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) |
151616791 ns |
151479417 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA |
4983504 ns |
4879107 ns |
1.02 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) |
627145292 ns |
634203459 ns |
0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) |
492688333 ns |
607766083 ns |
0.81 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) |
467716375 ns |
456653750 ns |
1.02 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) |
650997542 ns |
653815125 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA |
16031412 ns |
17510307 ns |
0.92 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
8989292 ns |
8926646 ns |
1.01 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
9014166 ns |
9038916.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
7920250 ns |
7947771 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
10062500 ns |
10104354 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1605825.5 ns |
1594648 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
36825604 ns |
36795042 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
37713917 ns |
38004792 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
34552167 ns |
34295916.5 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
37842541.5 ns |
37862042 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6453497 ns |
6452447 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47250 ns |
47334 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
47583 ns |
47417 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
47709 ns |
47625 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
47250 ns |
47042 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
17936 ns |
18361 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50250 ns |
50042 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
50417 ns |
50292 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
50625 ns |
50542 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50333 ns |
50292 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
211767 ns |
194710.5 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7041.5 ns |
6750 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6834 ns |
6875 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7791 ns |
7709 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6791.5 ns |
6541 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
101571.5 ns |
94841 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10042 ns |
9542 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10375 ns |
10209 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10500 ns |
10292 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10042 ns |
9958 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
554325.5 ns |
543786 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5917 ns |
5917 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6083 ns |
6292 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7437.5 ns |
6750 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5666 ns |
5666 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
110518.5 ns |
105080 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13209 ns |
12583 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13292 ns |
13750 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13709 ns |
13375 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13041 ns |
13375 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
509430.5 ns |
521491.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1083 ns |
1083 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1083 ns |
1083 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1125 ns |
1083 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1084 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
33078 ns |
33226 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8250 ns |
8125 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8333 ns |
8500 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8333 ns |
7875 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8145.5 ns |
8041 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
210419.5 ns |
215927 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23291 ns |
23125 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23042 ns |
23209 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
23437.5 ns |
23250 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23333 ns |
23250 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
18552 ns |
18682 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
52250 ns |
52250 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
52750 ns |
53125 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
52791.5 ns |
52833 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
52250 ns |
52250 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
290907 ns |
310779 ns |
0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1451833 ns |
1455520.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1403125 ns |
1461770.5 ns |
0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1459625 ns |
1464563 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1445291 ns |
1420375.5 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
196080 ns |
196494.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5025083.5 ns |
5004917 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5017937.5 ns |
4928042 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4999374.5 ns |
5012292 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4864042 ns |
5010708.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
613757 ns |
619791 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3125458.5 ns |
3153125 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2143333 ns |
2140000 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2332708.5 ns |
2307083.5 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4916959 ns |
4612500 ns |
1.07 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
580570.5 ns |
580901 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
24461333.5 ns |
24408833 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
19674833 ns |
19732667 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
18932416.5 ns |
19045729.5 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
36440292 ns |
36515125 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3020876 ns |
2842137 ns |
1.06 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
34147666 ns |
34057083.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
28226125 ns |
28326333 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28040083 ns |
28024667 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
42650334 ns |
42838792 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
141275708 ns |
140571271 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
143424625 ns |
143484104 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
121020437.5 ns |
120774500 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
186571750 ns |
187527416 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22544817 ns |
22777810 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
2534668458.5 ns |
1387998541 ns |
1.83 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1999737917 ns |
2164279542 ns |
0.92 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
2143878500 ns |
1082658958.5 ns |
1.98 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
823358604 ns |
828842208.5 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
116555809 ns |
118414466 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 71625 ns | 79708.5 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73166.5 ns | 72542 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75458 ns | 75520.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 83709 ns | 73458 ns | 1.14 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 302677.5 ns | 238954.5 ns | 1.27 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287916 ns | 286459 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 191896 ns | 295292 ns | 0.65 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 193875 ns | 302292 ns | 0.64 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 282458.5 ns | 240521 ns | 1.17 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1280743 ns | 1217040 ns | 1.05 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35380770.5 ns | 35202521 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35682750 ns | 35899625 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 31317625 ns | 31197042 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 39667312.5 ns | 39929583.5 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842485 ns | 5845222 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148257437.5 ns | 147855667 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 151961833.5 ns | 153555375 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 136251833.5 ns | 134579979 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 150216771.5 ns | 150196958.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34881249 ns | 34892998 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 112925875 ns | 114292542 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173003479.5 ns | 173321542 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 143333750 ns | 143543334 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 93518125 ns | 93943084 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5460899.5 ns | 5434556 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 471090667 ns | 473131708 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 516727250 ns | 515810125.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 443934041.5 ns | 442518292 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 612554750.5 ns | 614699291.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32275777 ns | 35179278 ns | 0.92 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 805778542 ns | 804964083 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 655573354 ns | 656838729.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 583320500 ns | 594341604 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 734069042 ns | 735687542 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1368041 ns | 1353083 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 919584 ns | 1020917 ns | 0.90 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 919958 ns | 995292 ns | 0.92 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2054875 ns | 2104875 ns | 0.98 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 566946 ns | 569348 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 3006395.5 ns | 2979875 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2633438 ns | 2615833 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2441625 ns | 2614124.5 ns | 0.93 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3438792 ns | 3699541.5 ns | 0.93 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1769601 ns | 1670621 ns | 1.06 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5817666.5 ns | 5794812.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5798541.5 ns | 5833354.5 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5755562.5 ns | 5800917 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2898500 ns | 2911437.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 7875 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7083 ns | 7000 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 7083 ns | 7000 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10500 ns | 10583 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24997 ns | 24801 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225729 ns | 222541.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220854 ns | 221250 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223458 ns | 220833.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 208541.5 ns | 217041.5 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 255805 ns | 245776 ns | 1.04 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 454522209 ns | 451162917 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 206014875 ns | 205123625.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 172978167 ns | 178414666.5 ns | 0.97 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 453775750 ns | 454897875 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7670617.5 ns | 7671486 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1106555666.5 ns | 1093247396 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 920762020.5 ns | 925248250 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 917863666 ns | 837547083 ns | 1.10 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1163983167 ns | 1163363584 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26581282 ns | 26761104.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5416 ns | 5500 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 5458 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6812 ns | 6875 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 5291.5 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 152662.5 ns | 149694 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 6833.5 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7395.5 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7792 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 6875 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 605121.5 ns | 579102 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 541 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24120 ns | 23601 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9292 ns | 9166 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 9042 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9250 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 10166.5 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 215400.5 ns | 199458 ns | 1.08 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 371667 ns | 354500 ns | 1.05 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 375417 ns | 352375 ns | 1.07 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351583 ns | 355687.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352146 ns | 357479.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21526 ns | 21220 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 803229 ns | 824396 ns | 0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 803291 ns | 778375 ns | 1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 782000 ns | 777666 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 823583 ns | 821813 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 255289 ns | 231309.5 ns | 1.10 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 334750 ns | 331125 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 337375 ns | 344833 ns | 0.98 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 452125 ns | 453000 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 11145.5 ns | 10292 ns | 1.08 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17940 ns | 18084 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 715917 ns | 709750 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 725833.5 ns | 741354 ns | 0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1008166.5 ns | 1003291.5 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26375 ns | 26479 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 229391 ns | 223194.5 ns | 1.03 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 372084 ns | 370292 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 347875.5 ns | 353396 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 437562.5 ns | 439292 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 43937.5 ns | 29916.5 ns | 1.47 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22253 ns | 22856 ns | 0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 735458.5 ns | 727458 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 779000 ns | 790208 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1027500 ns | 1034916 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 97750.5 ns | 90395.5 ns | 1.08 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 203262.5 ns | 197661 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3542 ns | 3417 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3625 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3750 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3500 ns | 3417 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17828 ns | 17539 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4333 ns | 4208 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4458 ns | 4375 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4334 ns | 4250 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4166 ns | 4125 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 250213.5 ns | 213017 ns | 1.17 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3792 ns | 3729 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3959 ns | 4083 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4167 ns | 4958 ns | 0.84 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 3417 ns | 1.15 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 189460 ns | 159837 ns | 1.19 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8167 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8958.5 ns | 8583 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8667 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8458 ns | 8375 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1128440.5 ns | 1042725 ns | 1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206000 ns | 205667 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212583 ns | 213208 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211625 ns | 213500 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201167 ns | 200458 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34618 ns | 34523 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 644375 ns | 645542 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 634375 ns | 671042 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623645.5 ns | 621458.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 634438 ns | 580854.5 ns | 1.09 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324576 ns | 298737.5 ns | 1.09 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1261583 ns | 1234437.5 ns | 1.02 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1252458 ns | 1277666 ns | 0.98 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1193020.5 ns | 1190750 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1154583 ns | 1152750 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207043 ns | 206763.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4562458 ns | 4518542 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4633334 ns | 4787042 ns | 0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4473209 ns | 4473666.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4541812.5 ns | 5146541 ns | 0.88 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 928231 ns | 931436.5 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3375 ns | 3667 ns | 0.92 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4208 ns | 3667 ns | 1.15 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3895.5 ns | 4041 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3437.5 ns | 2959 ns | 1.16 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 198397 ns | 185683 ns | 1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 7167 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7333 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7125 ns | 7667 ns | 0.93 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7125 ns | 6833 ns | 1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 954038 ns | 942579 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1639667 ns | 1642000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1208084 ns | 1207250 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1372541 ns | 1390000 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2354041.5 ns | 2427938 ns | 0.97 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212523.5 ns | 212907.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12357250 ns | 12368250 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9563604 ns | 9590500 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9298917 ns | 9295438 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18074854 ns | 18019000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1941189 ns | 1954764 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17373646 ns | 17359458 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14399791 ns | 14385104 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14361833 ns | 14370541 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21107083.5 ns | 21035500 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 137812.5 ns | 134083.5 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 85520.5 ns | 139416.5 ns | 0.61 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 90834 ns | 134958 ns | 0.67 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85708 ns | 131334 ns | 0.65 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126331.5 ns | 125600 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2035479 ns | 2022916.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1955875 ns | 2047021 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2015750 ns | 2034334 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2032000 ns | 2039125 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 946151.5 ns | 948556 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 1125 ns | 1458 ns | 0.77 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1792 ns | 1792 ns | 1 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3416.5 ns | 3520.5 ns | 0.97 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 1625 ns | 1229.5 ns | 1.32 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15769 ns | 16310 ns | 0.97 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2750 ns | 2542 ns | 1.08 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2791 ns | 2792 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2834 ns | 2875 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2792 ns | 2834 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 179385 ns | 182763.5 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 7958 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 6875 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 6875 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10500 ns | 10583 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33770 ns | 33908 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 249667 ns | 225041 ns | 1.11 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229500 ns | 221625 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220625 ns | 220833 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206542 ns | 215291 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 298866.5 ns | 320916 ns | 0.93 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3708 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22007 ns | 22605 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14417 ns | 14500 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14583 ns | 14625 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14417 ns | 14500 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14417 ns | 14500 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 433380 ns | 456450 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 143979.5 ns | 142749.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 135729.5 ns | 91312 ns | 1.49 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94667 ns | 142292 ns | 0.67 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 134792 ns | 138792 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125748 ns | 125035 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1934750 ns | 1919500 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1925812.5 ns | 1942104 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1653333 ns | 1929000 ns | 0.86 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1911042 ns | 1927250 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 914085 ns | 877064 ns | 1.04 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 878291 ns | 877458.5 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 825334 ns | 825458.5 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1230104 ns | 1230104 ns | 1 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 951084 ns | 955479 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 268596 ns | 269410 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2838667 ns | 2816333 ns | 1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2541125.5 ns | 2528771 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3337917 ns | 3342458 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3402687.5 ns | 3349729.5 ns | 1.02 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1617436.5 ns | 1555391.5 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14958 ns | 14833 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15500 ns | 14875 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16958 ns | 18500 ns | 0.92 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15500 ns | 16875 ns | 0.92 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 151571 ns | 131035 ns | 1.16 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 263125 ns | 227209 ns | 1.16 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 256666 ns | 215791 ns | 1.19 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223750 ns | 216958 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255125 ns | 225250 ns | 1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 665282.5 ns | 594103.5 ns | 1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 218833.5 ns | 221333 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221375 ns | 222875 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 221166 ns | 222583 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223792 ns | 219042 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 309310.5 ns | 242007 ns | 1.28 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 509062.5 ns | 548917 ns | 0.93 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 501208 ns | 511041.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 505458 ns | 509917 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 504145.5 ns | 508458 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1356900.5 ns | 1234181 ns | 1.10 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 3875 ns | 4083 ns | 0.95 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4083.5 ns | 4041 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 4188 ns | 4417 ns | 0.95 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4375 ns | 3666.5 ns | 1.19 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17142 ns | 17140 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7500 ns | 7209 ns | 1.04 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7292 ns | 7459 ns | 0.98 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7291 ns | 7333.5 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7416.5 ns | 7417 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 185412 ns | 183429.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17708 ns | 18833 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17167 ns | 16666 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18416 ns | 21083 ns | 0.87 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15958 ns | 18396 ns | 0.87 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 186970.5 ns | 131942 ns | 1.42 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 218833 ns | 245395.5 ns | 0.89 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 247292 ns | 212292 ns | 1.16 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213625 ns | 214833 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 211500 ns | 213708 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 962747 ns | 833743 ns | 1.15 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4917 ns | 4208 ns | 1.17 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5166 ns | 4833 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4833 ns | 4916.5 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4166.5 ns | 3854.5 ns | 1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 195909 ns | 208168.5 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10416 ns | 10333 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10875 ns | 10459 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 11084 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10145.5 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1010229 ns | 994315 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3875 ns | 3458 ns | 1.12 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3917 ns | 3791 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4458 ns | 4042 ns | 1.10 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3333 ns | 3167 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 219391 ns | 209797 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7917 ns | 7416 ns | 1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7833 ns | 7459 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7834 ns | 8083.5 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 7459 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1030231.5 ns | 997101.5 ns | 1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23757250 ns | 23443625 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35471625 ns | 34805208 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 38290895.5 ns | 37298500 ns | 1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34483875 ns | 34536209 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1833809 ns | 1851929 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 185699542 ns | 185954395.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 160028771 ns | 159888645.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146263209 ns | 144873209 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 437740041 ns | 438754792 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16480986 ns | 16496173 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 271222229 ns | 269927937.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 260376333 ns | 259799312.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 300106000 ns | 298856875 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 486503812 ns | 487045354.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182791 ns | 189541.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 190667 ns | 182167 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184333 ns | 183416.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181792 ns | 182375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208501 ns | 187318 ns | 1.11 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 637249.5 ns | 636187.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 634645.5 ns | 597458.5 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 592520.5 ns | 588459 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632229 ns | 596146 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1057588 ns | 944443 ns | 1.12 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3863020.5 ns | 3952375 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3940187.5 ns | 4007646 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3577667 ns | 3594292 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4857375 ns | 4885708 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 533313.5 ns | 552348.5 ns | 0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 18144146 ns | 18061833 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18426875 ns | 18498208.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16971458 ns | 17053770.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19835208 ns | 19733813 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2625144.5 ns | 2636788.5 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 500 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33403 ns | 32315 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9145.5 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9417 ns | 9625 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9291 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 8792 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 256811 ns | 247143.5 ns | 1.04 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 502531666.5 ns | 497882542 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 468706708 ns | 466893292 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 359169771 ns | 356555750 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 597742458 ns | 601192353.5 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12474979 ns | 12465773.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1889533104 ns | 1887759917 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1628715000 ns | 1627534167 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1500257208.5 ns | 1505961604 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2135455146 ns | 2123318791.5 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49477206.5 ns | 49303078 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1655875 ns | 1652917 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1207208 ns | 1209833 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1417834 ns | 1397667 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2430000 ns | 2460062.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215864 ns | 214417 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12748479.5 ns | 12745021 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9953749.5 ns | 9950208 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9718749.5 ns | 9693541 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18387500 ns | 18371500 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2012401 ns | 2028129 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17706542 ns | 17681833 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14704250 ns | 14711375 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14671854 ns | 14648250 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21442208 ns | 21429709 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26209 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26208 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26167 ns | 26166 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24383 ns | 23744 ns | 1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67125 ns | 67208 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67458 ns | 67208 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66875 ns | 67166 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67000 ns | 66916 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 387270.5 ns | 365755.5 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 206833 ns | 206375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212000 ns | 212666 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211917 ns | 211542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200041 ns | 200291 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26798 ns | 25711 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 657750 ns | 655729 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 667438 ns | 632000 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 670645.5 ns | 673667 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 587959 ns | 630708 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 332552 ns | 322192 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 693500 ns | 683459 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 541604.5 ns | 682708 ns | 0.79 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 688312.5 ns | 691916.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 688417 ns | 680834 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131360 ns | 130902.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2267792 ns | 2242354.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2254250 ns | 2244709 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2249042 ns | 2244875.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2240291 ns | 2229125 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1150961 ns | 1093705 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19875 ns | 20396 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16875 ns | 16833 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18042 ns | 23020.5 ns | 0.78 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16959 ns | 19166 ns | 0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133829.5 ns | 131648.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 270250 ns | 265541.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227875 ns | 232167 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 260125 ns | 264625 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218417 ns | 259979 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 989553.5 ns | 939947 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 541 ns | 1.16 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23299 ns | 23249 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9583.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9708 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10042 ns | 10041 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9583 ns | 9541 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 245463 ns | 242690 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5417 ns | 5542 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5709 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7125 ns | 6667 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 5250 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 207180 ns | 206130.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns | 6709 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7417 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7667 ns | 7875 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 6708 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 736539 ns | 735324.5 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2250 ns | 2000 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2167 ns | 2229.5 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2250 ns | 2125 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2292 ns | 2292 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18095 ns | 17909 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6750 ns | 6375 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6666 ns | 6792 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6875 ns | 6875 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6334 ns | 6208 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 305699.5 ns | 303359 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 776770.5 ns | 751688 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749458.5 ns | 779292 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 780583 ns | 779395.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 778083.5 ns | 776146 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21274 ns | 20845 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 801416 ns | 796792 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 810687.5 ns | 791166 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 813917 ns | 808708 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 810584 ns | 775292 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 270452 ns | 267264 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8042 ns | 8000 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7000 ns | 6687.5 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 6958 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10625 ns | 10458 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32850 ns | 32932 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 266167 ns | 261062.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 266062.5 ns | 237583 ns | 1.12 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 269771 ns | 271396 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255895.5 ns | 252646 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 335080 ns | 331767 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10334 ns | 10250 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10334 ns | 10542 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10708 ns | 11208 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10250 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 224193 ns | 218675.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24666 ns | 25000 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24708 ns | 24625 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 26208 ns | 25583 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25000 ns | 24416 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1065472 ns | 1056250 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106377708 ns | 106355042 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117402750 ns | 117397229.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120815791 ns | 120585312.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117612208.5 ns | 117183084 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2661070 ns | 2657952 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 375259750 ns | 374187771 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 347518125 ns | 350821292 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 360594625 ns | 361003333 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 481637083 ns | 479876375 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15260454.5 ns | 15234863.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 791338708.5 ns | 604863708 ns | 1.31 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 771555583 ns | 773786667 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 811283708 ns | 812604291 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 768279167 ns | 770323375 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6750 ns | 6833 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7229.5 ns | 7084 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 8062.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6875 ns | 6250 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 213488 ns | 213616 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13917 ns | 13458 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14125 ns | 13875 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13959 ns | 14416 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 13625 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1023254 ns | 1017707 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6000 ns | 6208 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6458 ns | 6042 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7167 ns | 7145.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5666 ns | 5417 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 210071 ns | 208255 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12541 ns | 11958 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12792 ns | 12729.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12916 ns | 13250 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12000 ns | 12500 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 725821.5 ns | 723959 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5500 ns | 6209 ns | 0.89 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5834 ns | 6375 ns | 0.92 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6250 ns | 6375 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 6000 ns | 5500 ns | 1.09 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16767 ns | 16943 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15666 ns | 15250 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15458 ns | 15625 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15584 ns | 15625 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15541 ns | 15500 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 185594.5 ns | 186257 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23662 ns | 23245 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6375 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6667 ns | 6375 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6625 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6187.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 227826.5 ns | 225046 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5750 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5833 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5792 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24912 ns | 24205 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21084 ns | 20875 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21334 ns | 21417 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21750 ns | 21541.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21250 ns | 21229.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 249065 ns | 246651 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 149583 ns | 194166.5 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 193250 ns | 200521 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 189875 ns | 190666.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 183271 ns | 185562 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167107 ns | 166320.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1347374.5 ns | 1329104.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1326834 ns | 1324792 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1334167 ns | 1328041 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1143875 ns | 1337729.5 ns | 0.86 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1251168.5 ns | 1221500 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21812.5 ns | 24687.5 ns | 0.88 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21834 ns | 22000 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25500 ns | 25667 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21917 ns | 21250 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 319912.5 ns | 254624.5 ns | 1.26 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 145292 ns | 130791 ns | 1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 178958 ns | 132062.5 ns | 1.36 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 183625 ns | 179458 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 178729 ns | 179520.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1370058 ns | 1317432 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 417 ns | 0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23172 ns | 22902 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6459 ns | 6208 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6667 ns | 6709 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 6917 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6292 ns | 6291 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 243330.5 ns | 240780 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4542 ns | 4875 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4416 ns | 4542 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5500 ns | 5500 ns | 1 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4542 ns | 4417 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 230541.5 ns | 229531.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10083 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 10375 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10416 ns | 10583 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10416 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1274677 ns | 1276460 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1667 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1667 ns | 1583 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1584 ns | 1584 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22978 ns | 22954 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5625 ns | 5792 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5917 ns | 5958 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5875 ns | 5875 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5708 ns | 5584 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 262617.5 ns | 258626 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6869583 ns | 6841563 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6385187.5 ns | 6377645.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6569104 ns | 6542167 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7620979 ns | 7612146 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212993 ns | 213873 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24088541 ns | 24061541 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21276291 ns | 21280959 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21151250 ns | 21049937 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29764104.5 ns | 29725708.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2094452 ns | 2091556 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48969166 ns | 37658500 ns | 1.30 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45455437 ns | 45669958 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45774146 ns | 45878312.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 37954083.5 ns | 38309416.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6250 ns | 5917 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6042 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7125 ns | 6958.5 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5666 ns | 5542 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 204070.5 ns | 210091 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8250 ns | 8041 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 8250 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8459 ns | 8500 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 8250 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 984301.5 ns | 992082 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1573917 ns | 1552375 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1268875 ns | 1278292 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1643334 ns | 1634959 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2167478.5 ns | 2176750 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271090 ns | 269882.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7942625 ns | 7890000 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6611271 ns | 6564479 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7244083.5 ns | 7223979 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10441166.5 ns | 10470041 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1758088 ns | 1748953.5 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 371479.5 ns | 375500 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 374416.5 ns | 379708 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 454791 ns | 454583 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 30291 ns | 34834 ns | 0.87 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 44073.5 ns | 46336 ns | 0.95 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 745166 ns | 739834 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 811042 ns | 821979 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1067042 ns | 1062042 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 77167 ns | 119270.5 ns | 0.65 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 274366 ns | 274066 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 414167 ns | 412125 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 305917 ns | 305917 ns | 1 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 306000 ns | 305916 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 758584 ns | 757958 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43580 ns | 44006 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 659625 ns | 658583 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 524708 ns | 525792 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 524583 ns | 523167 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973250 ns | 973083 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189177.5 ns | 189089 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 692125 ns | 672875 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 672375 ns | 676521 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 676292 ns | 644292 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 674917 ns | 672333 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131547 ns | 131017.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2463791.5 ns | 2466812.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2466917 ns | 2456312.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2455459 ns | 2425417 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2437166 ns | 2465333 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1168757 ns | 1103271 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2125 ns | 2333 ns | 0.91 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 4958.5 ns | 2875 ns | 1.72 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4458 ns | 4500 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3125 ns | 3167 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15381 ns | 16213 ns | 0.95 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5417 ns | 5208 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5625 ns | 5625 ns | 1 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5667 ns | 5667 ns | 1 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5583 ns | 5459 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 184580 ns | 184737.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1480542 ns | 1481125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1517458 ns | 1519875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1517583 ns | 1522875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1449000 ns | 1453417 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40149 ns | 40096 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5141521 ns | 5124333 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5293812.5 ns | 5295937.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5299875 ns | 5290354 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5011041 ns | 4993187.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195851 ns | 194429.5 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3666 ns | 3666 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3666 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3666 ns | 3625 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3666 ns | 3667 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33274 ns | 33150 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15333 ns | 15208 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15500 ns | 15375 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15458 ns | 15416 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15166 ns | 15250 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 349588.5 ns | 349182 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 93375 ns | 93000 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 94000 ns | 103209 ns | 0.91 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 92917 ns | 92958 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 102709 ns | 92833 ns | 1.11 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112783 ns | 113197 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317750 ns | 315959 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 316292 ns | 319270.5 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 317584 ns | 317000 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317584 ns | 317333 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 193819 ns | 191577 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1084 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24048 ns | 23307 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 7792 ns | 1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8375 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8125 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7750 ns | 8000 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 246693 ns | 244539 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 536187.5 ns | 531791 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 512687.5 ns | 517334 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 577375.5 ns | 578729.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 240479 ns | 256916 ns | 0.94 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129151.5 ns | 130622 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1428937.5 ns | 1386812.5 ns | 1.03 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1472834 ns | 1483208.5 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1798541.5 ns | 1776708 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 869958 ns | 871125 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274083.5 ns | 273552 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31689 ns | 31822 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 5958 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6542 ns | 6459 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6416 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5958 ns | 6167 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 249959 ns | 246678.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1773333 ns | 1774479 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1768771 ns | 1782250.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1792416.5 ns | 1777916 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1795771 ns | 1766937 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168746.5 ns | 169504.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4384292 ns | 4354563 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4287458 ns | 3899583 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4380250 ns | 4361500 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4188958 ns | 4355333 ns | 0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1121956.5 ns | 1064911 ns | 1.05 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6812.5 ns | 24479 ns | 0.28 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7312.5 ns | 7541 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7520.5 ns | 7833 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7000 ns | 22208.5 ns | 0.32 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20861 ns | 19777 ns | 1.05 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 79833.5 ns | 72854.5 ns | 1.10 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 64104 ns | 51667 ns | 1.24 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 52000 ns | 51833 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 70000 ns | 70542 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 268016 ns | 193123 ns | 1.39 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17625 ns | 17625 ns | 1 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17917 ns | 18250 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18250 ns | 17708 ns | 1.03 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17917 ns | 17250 ns | 1.04 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18127 ns | 18352 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53250 ns | 53000 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53334 ns | 53250 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53458 ns | 53542 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53333 ns | 53375 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 312603 ns | 317963.5 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 107625 ns | 107500 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 97000 ns | 107125 ns | 0.91 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 98375 ns | 105625 ns | 0.93 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 99458 ns | 97584 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47074.5 ns | 46786 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324291 ns | 323417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 323334 ns | 327750 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324458.5 ns | 322667 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 323375 ns | 325000 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 210382 ns | 207825 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1505584 ns | 1504209 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1545458 ns | 1545458 ns | 1 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1545125 ns | 1549042 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1467020.5 ns | 1478167 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52015 ns | 51382 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5125208 ns | 5122771 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5288334 ns | 5291458 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5293292 ns | 5291125 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4977583 ns | 5000125 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202621 ns | 200987.5 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28166 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28167 ns | 28250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28250 ns | 28125 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28084 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24292 ns | 24367 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66583.5 ns | 66375 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66458 ns | 66583 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67584 ns | 66375 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66833 ns | 66375 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 479926.5 ns | 493214.5 ns | 0.97 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1516292 ns | 1497500 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1145334 ns | 1150584 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1158375 ns | 1142791.5 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2105229 ns | 2256875 ns | 0.93 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 577247 ns | 579142.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3108479 ns | 3080625.5 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2741458 ns | 2682000 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2737042 ns | 2729917 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3824687.5 ns | 3656583 ns | 1.05 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1965139 ns | 1939352 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7930583 ns | 7890875 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7916584 ns | 7897375 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7914458 ns | 7904208 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4821083.5 ns | 4815458 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 139291 ns | 138395.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 137125 ns | 78917 ns | 1.74 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 139979.5 ns | 132458.5 ns | 1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 140687.5 ns | 140084 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192231 ns | 193872 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2033500 ns | 2020209 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2035250 ns | 1690750 ns | 1.20 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2028999.5 ns | 2025250 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2011666 ns | 2006209 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 744838 ns | 742900 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal deleted the compathelper/new_version/2024-12-06-00-19-20-030-01136868030 branch on December 6, 2024, 03:53
This pull request changes the compat entry for the `Flux` package from `0.14.25` to `0.14.25, 0.15`. This keeps the compat entries for earlier versions.
Note: I have not tested your package with this new compat entry.
It is your responsibility to make sure that your package tests pass before you merge this pull request.
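
For illustration, the compat change described above would look roughly like the following in the package's `Project.toml`. This is a minimal sketch, not the actual diff: it assumes the entry sits in the standard `[compat]` section (which is where compat bounds for weak dependencies are also declared), and all other entries are omitted.

```toml
# Project.toml (fragment; every other entry omitted for brevity)
[compat]
# Before this PR: Flux = "0.14.25"
# After this PR, the existing bound is kept and 0.15 is added:
Flux = "0.14.25, 0.15"
```

Under Julia's default caret semantics for `0.x` versions, `0.14.25` allows `[0.14.25, 0.15.0)` and `0.15` allows `[0.15.0, 0.16.0)`, so the combined entry admits both release series.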