This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
avik-pal force-pushed the ap/dropout_enz branch from cdfc713 to 26a745f on September 5, 2024 03:31
avik-pal changed the title from "fix: looped dropout implementation on CPU" to "fix: dropout enzyme test fixes" on Sep 5, 2024
Codecov Report

All modified and coverable lines are covered by tests ✅

Coverage diff, main vs. #153:

Metric | main | #153 | +/- |
---|---|---|---|
Coverage | 76.11% | 78.11% | +2.00% |
Files | 38 | 38 | |
Lines | 1959 | 1956 | -3 |
Hits | 1491 | 1528 | +37 |
Misses | 468 | 428 | -40 |
LuxLib Benchmarks
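The Ratio column in the table below is consistent with Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit was slower on that benchmark. The following is a minimal Python sketch of that arithmetic, not the benchmark bot's actual implementation. The helper names `ratio` and `is_slower` and the 1.5 threshold are hypothetical, chosen only for illustration.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous rounded to two decimals (assumed table convention)."""
    return round(current_ns / previous_ns, 2)


def is_slower(current_ns: float, previous_ns: float, threshold: float = 1.5) -> bool:
    """Flag a row whose unrounded ratio exceeds a hypothetical alert threshold."""
    return current_ns / previous_ns > threshold


# A few (current, previous) timings in nanoseconds copied from the table.
rows = {
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)": (5875, 5479.5),
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)": (3125, 1687.5),
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)": (1417, 2521),
}

for name, (current, previous) in rows.items():
    flag = "SLOWER" if is_slower(current, previous) else "ok"
    print(f"{name}: ratio={ratio(current, previous)} [{flag}]")
```

Sub-microsecond CPU timings in particular can swing noticeably between runs, so a single ratio far from 1 is a prompt to rerun the benchmark rather than proof of a regression.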
Benchmark suite | Current: 26a745f | Previous: 1afc1c7 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5479.5 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5291.5 ns | 6375 ns | 0.83 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7958 ns | 8000 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5645.5 ns | 6375 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 119440 ns | 119198 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2765240 ns | 2649209 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 742041 ns | 704000 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 412334 ns | 417764 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 9812 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9625 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10875 ns | 10042 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10041 ns | 9541 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 555583 ns | 551456 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 16647491 ns | 16841216 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2610875 ns | 2645125 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 681268 ns | 659636 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1458 ns | 1395.5 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3125 ns | 1687.5 ns | 1.85 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1750 ns | 1875 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1417 ns | 2521 ns | 0.56 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21829 ns | 21867 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1292057 ns | 1304894 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 215500 ns | 212604 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 30965 ns | 30820.5 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4229.5 ns | 4209 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3625 ns | 4312.5 ns | 0.84 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4708 ns | 3917 ns | 1.20 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3834 ns | 4375 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 147051 ns | 146279 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9309598 ns | 8894773.5 ns | 1.05 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1600188 ns | 1523375 ns | 1.05 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 149932 ns | 148982 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57500 ns | 57542 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46500 ns | 46584 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 40000 ns | 39875 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82792 ns | 83708 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36927 ns | 36787 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 563011 ns | 582007 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1075125 ns | 985625 ns | 1.09 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 80601 ns | 84391 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2029208 ns | 2036583 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2092917 ns | 2086750 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2089208 ns | 2079917 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1995416.5 ns | 1987312.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230122 ns | 227214 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8058504 ns | 7854957 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7868625 ns | 7818750 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1282994 ns | 967560 ns | 1.33 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 153125 ns | 154083 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 149500 ns | 146958 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148396 ns | 149979.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145458 ns | 165187.5 ns | 0.88 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166122.5 ns | 166381 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7788069 ns | 7795058 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1449625 ns | 1464583 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 190462 ns | 207072 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1107375 ns | 1110895.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1094750 ns | 1103209 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1122771 ns | 1118687 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1110937 ns | 1109562.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 712693.5 ns | 711437 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 34722508 ns | 33922938.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5817083 ns | 6051917 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1040261 ns | 1036360 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5292 ns | 5208 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5000 ns | 4271 ns | 1.17 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5375 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4791.5 ns | 4584 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 94128 ns | 94268 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5352782 ns | 5136056 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 451041.5 ns | 711583 ns | 0.63 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71450 ns | 69481 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8667 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9084 ns | 8500 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9167 ns | 8917 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8625 ns | 8333 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 610507 ns | 603970 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 36416804.5 ns | 33683319.5 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 7166292 ns | 5821292 ns | 1.23 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388814 ns | 389889 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17645.5 ns | 17729.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17875 ns | 20042 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19833 ns | 20584 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17708.5 ns | 20416.5 ns | 0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 67107 ns | 66995 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2751251.5 ns | 2897295 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1290999.5 ns | 1301292 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 72801 ns | 73931 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 211875 ns | 211625 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 245374.5 ns | 218875 ns | 1.12 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214770.5 ns | 218667 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 223833 ns | 224875 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 359840 ns | 357740 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 12166191 ns | 14308445 ns | 0.85 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5640520.5 ns | 5704396 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 475685 ns | 473855 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 666 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 875 ns | 750 ns | 1.17 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 708 ns | 666 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20835 ns | 20965 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1137595 ns | 1157358.5 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 295208.5 ns | 283542 ns | 1.04 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 34241 ns | 32571 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1375 ns | 1375 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1541 ns | 1375 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583 ns | 1500 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1334 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 126497 ns | 125947 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8400943 ns | 8433349.5 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1686083 ns | 1594979.5 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 138152 ns | 138471 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7334 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6166 ns | 6125 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5417 ns | 5333 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9875 ns | 10417 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23913.5 ns | 23836 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1233546.5 ns | 1232101.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 599042 ns | 583125 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49130 ns | 46460 ns | 1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260604 ns | 227708 ns | 1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 263083 ns | 235583 ns | 1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 264875 ns | 264667 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 249916 ns | 248583 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 192935 ns | 190580 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29855996.5 ns | 29562269.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8650417 ns | 8564854.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 617941.5 ns | 611281 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4084 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4084 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23226 ns | 23789 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1923166 ns | 2018577 ns | 0.95 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 225125 ns | 219791.5 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48510 ns | 50370 ns | 0.96 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16875 ns | 16958 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17334 ns | 17083 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 ns | 17083 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17000 ns | 16666 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 194789 ns | 197449 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10525168 ns | 9693737.5 ns | 1.09 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 951896 ns | 940458 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 175272 ns | 176226.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 512270.5 ns | 509500 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405292 ns | 405083 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 332458 ns | 332459 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 864917 ns | 865125 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113343 ns | 113130 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 392476 ns | 391060 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 457875 ns | 451416 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 247872 ns | 248703 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2323375 ns | 2324333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2034417 ns | 2025375.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1753937.5 ns | 1752833.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3275375 ns | 3200583 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 242794 ns | 244865 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 9219742 ns | 11656548 ns | 0.79 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1995000 ns | 1966229 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 759198 ns | 761317.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 7000 ns | 6250 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6895.5 ns | 6145.5 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6916 ns | 7729 ns | 0.89 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6187.5 ns | 6375 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 92518.5 ns | 93009 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5487357.5 ns | 5406797 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 855771 ns | 758167 ns | 1.13 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 59771 ns | 60110 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11958 ns | 10646 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11667 ns | 10542 ns | 1.11 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11667 ns | 11084 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10437 ns | 10375 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 637059 ns | 660576 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38550571.5 ns | 38819677 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5722834 ns | 5487104 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 411715 ns | 416424 ns | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 541 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23788 ns | 23635 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2286830 ns | 2221310 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 329896 ns | 319750 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 53510 ns | 53401 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 ns | 2083 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2083 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2042 ns | 2125 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 221806.5 ns | 232566 ns | 0.95 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10957008 ns | 11381984 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2037167 ns | 1912541.5 ns | 1.07 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 186492 ns | 186466.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8958 ns | 8375 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9292 ns | 8750 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10042 ns | 10438 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8042 ns | 8958 ns | 0.90 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 107667.5 ns | 104173 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 2997805.5 ns | 3244842 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 910042 ns | 896708 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74635.5 ns | 74231 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17250 ns | 17708 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18520.5 ns | 17750 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18750 ns | 18187.5 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17333.5 ns | 18041.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 614632 ns | 610296 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 15956679.5 ns | 17126722 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5596125 ns | 5229458 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 387124 ns | 387209 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 708 ns | 625 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 541 ns | 1.16 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35673 ns | 35555 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1212929.5 ns | 1100087 ns | 1.10 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 455250 ns | 438541 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45700 ns | 47930 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8937.5 ns | 9312 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9167 ns | 8125 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9916.5 ns | 9792 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9104 ns | 9146 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 257298 ns | 256000 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19099554 ns | 19311232 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5329750 ns | 4774937.5 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 377764 ns | 378844 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396833.5 ns | 397000 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288042 ns | 288125 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215375 ns | 215667 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751604.5 ns | 756875 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111805 ns | 111981 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 321560.5 ns | 320003 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 419417 ns | 365500 ns | 1.15 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 78301 ns | 78230 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1453833 ns | 1460875 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1134125 ns | 1135291.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 860417 ns | 862687.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2438959 ns | 2357291 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 208454 ns | 209166.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 9610442 ns | 9267436 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1660083.5 ns | 1516312.5 ns | 1.09 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 323654 ns | 323643 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7354.5 ns | 6667 ns | 1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7625 ns | 6959 ns | 1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 8958.5 ns | 0.91 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7166.5 ns | 7334 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 141224.5 ns | 144567 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5656682.5 ns | 5867002 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 722312.5 ns | 707270.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59641 ns | 70660 ns | 0.84 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13292 ns | 15395.5 ns | 0.86 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14791.5 ns | 12417 ns | 1.19 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15749.5 ns | 14250 ns | 1.11 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14687.5 ns | 13312 ns | 1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 935389 ns | 958993.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41759340.5 ns | 40369162 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6916250 ns | 5752729.5 ns | 1.20 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 432564 ns | 433804 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26375 ns | 24416 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25000 ns | 26417 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28042 ns | 28687 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 27062.5 ns | 26874.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 201126.5 ns | 201880.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7831227.5 ns | 8100056 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1175458 ns | 896833 ns | 1.31 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116261 ns | 114876.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 151167 ns | 148834 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 119479.5 ns | 104708 ns | 1.14 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 148999.5 ns | 153500 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 152500 ns | 116979 ns | 1.30 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1087665 ns | 1086710 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43137168 ns | 41151661 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6444584 ns | 5843229.5 ns | 1.10 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 593036 ns | 594985 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 85833 ns | 73958 ns | 1.16 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76500 ns | 76791.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76833 ns | 80166 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76625 ns | 75417 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208103.5 ns | 207189 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7493588 ns | 7362606 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 540624.5 ns | 519687.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 131571 ns | 126391.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 313791 ns | 297334 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 296834 ns | 221667 ns | 1.34 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 311958 ns | 288917 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221937.5 ns | 221041.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1133559 ns | 1119401 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43223483.5 ns | 41008184.5 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6691417 ns | 6497687.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 701802 ns | 694627 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16312 ns | 16417 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17395.5 ns | 16583 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18291 ns | 17792 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 17167 ns | 16708 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 147610 ns | 147421 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5804228 ns | 5759467 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 765792 ns | 427292 ns | 1.79 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239413 ns | 237703 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25979 ns | 24833.5 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27937.5 ns | 27042 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27958 ns | 27166.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25937.5 ns | 27125 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 986333 ns | 984196 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 40747244 ns | 40719457 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6169458 ns | 5828333 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 715038 ns | 714022 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
10417 ns |
11562.5 ns |
0.90 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
11709 ns |
10375 ns |
1.13 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
14083 ns |
12083 ns |
1.17 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
10167 ns |
11083 ns |
0.92 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
124659 ns |
124895.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
3568552 ns |
3575871 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
924583 ns |
912833 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
239052 ns |
242943 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
22958.5 ns |
21125 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
22104.5 ns |
21917 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
22334 ns |
22000 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
20917 ns |
21416 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
707851 ns |
706086.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
21734391 ns |
21428227.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
5474542 ns |
5387146 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
688247 ns |
673547 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
62500 ns |
64000.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
66291 ns |
63500 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
67937 ns |
66166 ns |
1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
67979 ns |
62584 ns |
1.09 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
107328 ns |
105629.5 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3446851 ns |
3434086.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1322146 ns |
1323250 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
236942 ns |
237572 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
441666.5 ns |
448750 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
438625 ns |
437958 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
448292 ns |
446666 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
437333 ns |
449583 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
518756.5 ns |
517219 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
21007823 ns |
21208755 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6177354.5 ns |
5978042 ns |
1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
719262.5 ns |
730458 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6958 ns |
6958.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7791 ns |
6833 ns |
1.14 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
9313 ns |
8041 ns |
1.16 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7603.5 ns |
7771 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
146777.5 ns |
145909.5 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
5499279 ns |
5602766 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
699541.5 ns |
628395.5 ns |
1.11 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
59140 ns |
58991 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14166 ns |
14042 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14084 ns |
15750 ns |
0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14584 ns |
13917 ns |
1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15062.5 ns |
13479 ns |
1.12 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
958445 ns |
954313 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
38754584 ns |
38432249.5 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
6493041.5 ns |
5549500 ns |
1.17 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
408904 ns |
404584 ns |
1.01 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
6149000 ns |
6160416 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
6375812.5 ns |
6378167 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
3225875.5 ns |
3224791.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
11899792 ns |
11924000 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
301882 ns |
301800.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU |
293803 ns |
294983 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
19107145.5 ns |
19104958 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
19915333.5 ns |
19957229 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
11120042 ns |
11123708.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
37016645.5 ns |
36532604 ns |
1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1023438 ns |
1023618 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU |
1152527 ns |
1158122 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
959 ns |
917 ns |
1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1000 ns |
958 ns |
1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1000 ns |
958 ns |
1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
958 ns |
959 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
23409 ns |
23554 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2081403 ns |
2143802 ns |
0.97 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal |
337209 ns |
316188 ns |
1.07 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
213602 ns |
215672 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3666 ns |
3625 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3750 ns |
3667 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3750 ns |
3666 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3667 ns |
3666 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
281249 ns |
283503 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
11252797 ns |
11257238 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal |
2188854 ns |
2086333.5 ns |
1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
646317 ns |
637297 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8958 ns |
8000 ns |
1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8584 ns |
7958 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9813 ns |
9042 ns |
1.09 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7916.5 ns |
7854 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
121231.5 ns |
120818.5 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3432461 ns |
3517154 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
881937 ns |
776959 ns |
1.14 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
66270 ns |
67641 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11938 ns |
11729.5 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
12375 ns |
12250 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
12792 ns |
12334 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
11833 ns |
12458.5 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
646205 ns |
643501 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
21341489 ns |
21447178 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5427458 ns |
5189125.5 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
369394 ns |
365334 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
333 ns |
291 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
291 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22699 ns |
22596 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2011541.5 ns |
1951713 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal |
330708 ns |
225750 ns |
1.46 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
52201 ns |
52251 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2917 ns |
3041 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
3292 ns |
3208 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3333 ns |
3375 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
3083 ns |
3042 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
204312 ns |
204741 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9376732 ns |
9227567 ns |
1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal |
1703999.5 ns |
1619250 ns |
1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
172072 ns |
172842 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11083 ns |
11250 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
12459 ns |
11334 ns |
1.10 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
14083.5 ns |
13125 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
11354 ns |
11458 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
122250.5 ns |
121547.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3430038 ns |
3353104 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
931916 ns |
869041 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
239013 ns |
243193 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
22062.5 ns |
22000 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21645.5 ns |
20583 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
22708 ns |
21167 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20875 ns |
20791 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
602097 ns |
598450 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20239973.5 ns |
19931223.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
4932625 ns |
4695229 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
648807 ns |
652706.5 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4334 ns |
4375 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4458 ns |
4417 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4417 ns |
4416 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4375 ns |
4416 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24546 ns |
24359 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2228549 ns |
2166080 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal |
229500 ns |
223833 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
52241 ns |
52541 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16584 ns |
16667 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16625 ns |
16500 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16541 ns |
16375 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16500 ns |
16333 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
329742.5 ns |
331128 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
12273947 ns |
12599810 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal |
1797000 ns |
1647875.5 ns |
1.09 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
212507 ns |
212037.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
2083 ns |
1959 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
2167 ns |
2083 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2167 ns |
1958 ns |
1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
2084 ns |
1958 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
35728 ns |
35684 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1153804 ns |
1146851 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
490000 ns |
441458.5 ns |
1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
206052 ns |
206802 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
19624.5 ns |
16645.5 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
18416 ns |
16750 ns |
1.10 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
18459 ns |
16562.5 ns |
1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
17187.5 ns |
17208.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
294948.5 ns |
294264.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
21393926 ns |
20813859 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5546958 ns |
5292083 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
703407 ns |
703797.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
59375 ns |
59583.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
64104.5 ns |
63625 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
62458.5 ns |
62625 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
51167 ns |
51292 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66631 ns |
66405 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
101111 ns |
103511 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
206500 ns |
199395.5 ns |
1.04 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
158604 ns |
157250 ns |
1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
116854 ns |
133937.5 ns |
0.87 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
310812.5 ns |
317729 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
217149 ns |
216342 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
580606 ns |
579316 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
84375 ns |
82458.5 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
82375 ns |
85271 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
86542 ns |
90209 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82375 ns |
140417 ns |
0.59 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192561 ns |
192334 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5470941 ns |
5533381 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1673458 ns |
1893708 ns |
0.88 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
171362 ns |
170101.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1893750 ns |
1851687.5 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1917895.5 ns |
1882334 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1925896 ns |
1926500 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1822750 ns |
1891958.5 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
536315 ns |
532324 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
26386495.5 ns |
25979046 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
8985104 ns |
9683125 ns |
0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1083051 ns |
1080090 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21947.5 ns |
21761 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2129791 ns |
2115738 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal |
369791 ns |
346875 ns |
1.07 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
45251 ns |
45220 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1792 ns |
1792 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1834 ns |
1750 ns |
1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1834 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1792 ns |
1792 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
252263 ns |
253104 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
9862510 ns |
9490240.5 ns |
1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal |
1601917 ns |
1088979 ns |
1.47 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
185317 ns |
187502 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8083 ns |
8084 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9395.5 ns |
8438 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11417 ns |
10875 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8833 ns |
8209 ns |
1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
119822.5 ns |
119061 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3293817 ns |
3459549.5 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
930041.5 ns |
880209 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
236242.5 ns |
237872 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9709 ns |
10167 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10125 ns |
9208 ns |
1.10 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10541.5 ns |
9500 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9375 ns |
9167 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
528892 ns |
527070 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19321268 ns |
18222497.5 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4719958.5 ns |
4417458 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
647817 ns |
634411 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57375 ns |
58417 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46542 ns |
46333 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39375 ns |
39500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80395.5 ns |
84083 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39845 ns |
39770 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1345580.5 ns |
1341281.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1133854.5 ns |
1100583.5 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
81551 ns |
75935.5 ns |
1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1904875 ns |
1901542 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1944500 ns |
1921833.5 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1978417 ns |
1955833 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1870416.5 ns |
1881792 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
222036.5 ns |
221320 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34272722 ns |
33766076 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11239500 ns |
11588792 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1033441 ns |
1036440 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
418958 ns |
415958 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
417625 ns |
420042 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
421458 ns |
419875 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
418084 ns |
418708 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
210091.5 ns |
210156.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7195821 ns |
7606443 ns |
0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
539833 ns |
522750 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
286323 ns |
287858 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
678021 ns |
764709 ns |
0.89 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
778708 ns |
781812 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
773959 ns |
753417 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
673417 ns |
678791.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1061649.5 ns |
1059447 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
44767025 ns |
43854665.5 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6653271 ns |
6323063 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
927159 ns |
916300 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3439333.5 ns |
3425978.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
3391104 ns |
3451792 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
3461875 ns |
3458979.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
3470958 ns |
3412708 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
172710 ns |
170950 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8280433.5 ns |
8189493 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1462521 ns |
1396875 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
436115 ns |
435150 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
6126375 ns |
6194166.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
6222354 ns |
6230791.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
6240750 ns |
6222854 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
6176479.5 ns |
6218875 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1002674 ns |
1001834 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
47673707 ns |
49254606 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
8127937.5 ns |
8528604 ns |
0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1567646 ns |
1556125 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
472792 ns |
472667 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
340625 ns |
339875 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
253583 ns |
253208 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
902333 ns |
902000 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46656 ns |
46534 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
827389.5 ns |
886552 ns |
0.93 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal |
448875 ns |
478875 ns |
0.94 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
251102 ns |
249963 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2326542 ns |
2333750 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2032958 ns |
2036625 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1761771 ns |
1763167 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3281417 ns |
3203312 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
256650 ns |
258879 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
13683336.5 ns |
13032420 ns |
1.05 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal |
2212583.5 ns |
2178375 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
787003 ns |
787818 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57209 ns |
57542 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46042 ns |
45875 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39417 ns |
39458 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82417 ns |
83791 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
29161 ns |
28376 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1330970 ns |
1391893 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1148563 ns |
1124083 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
78581 ns |
77840.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2035708.5 ns |
2032250 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2089812.5 ns |
2093187.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2097687 ns |
2091917 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1920999.5 ns |
1972229.5 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
240945.5 ns |
235913 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
37146267 ns |
35452366 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11479375 ns |
11558395.5 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1222283 ns |
1056250.5 ns |
1.16 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57708 ns |
57708 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46458 ns |
46625 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
39417 ns |
39875 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80250 ns |
83916.5 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
50765 ns |
49455 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
827004 ns |
809068 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1086000 ns |
1084875 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
73451 ns |
72105.5 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1889042 ns |
1921083 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1942292 ns |
1945916.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1982354.5 ns |
1974729.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1850625 ns |
1864791 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
247266.5 ns |
238800.5 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
16809144 ns |
17238198 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9912833.5 ns |
10023791.5 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1067510 ns |
934629 ns |
1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 333 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35925 ns | 34886 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1169075 ns | 1200155 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 462667 ns | 279833 ns | 1.65 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47970 ns | 48281 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6792 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6667 ns | 6208.5 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7000 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6667 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 213300 ns | 212384.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22271987 ns | 19751565 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5441708 ns | 5078916.5 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 373894 ns | 379104 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 33056 ns | 32763 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1173203 ns | 1167700 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 260750 ns | 253542 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40680 ns | 41150 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 3833 ns | 0.73 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3417 ns | 3041 ns | 1.12 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3167 ns | 3375 ns | 0.94 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2792 ns | 3125 ns | 0.89 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 194789 ns | 190584.5 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7289420 ns | 7912209 ns | 0.92 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 961353.5 ns | 1265542 ns | 0.76 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 155561 ns | 153656.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 422146 ns | 454937 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 422458 ns | 454750 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 431375 ns | 458229 ns | 0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 446500.5 ns | 427188 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 139235.5 ns | 138010.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5896513 ns | 5819207 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2058104 ns | 2011000 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 381574 ns | 325693 ns | 1.17 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3767291.5 ns | 3801708.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3755874.5 ns | 3811125 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3827959 ns | 3821292 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3647292 ns | 3815375 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 723602 ns | 710674 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33144724.5 ns | 32043185 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11368791 ns | 10832625.5 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1330629 ns | 1491590 ns | 0.89 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49871458.5 ns | 49856479 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35510750 ns | 35516042 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 26029104.5 ns | 26022291 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97438834 ns | 97102959 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593681 ns | 1594251.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1009851 ns | 1009650 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154550187.5 ns | 154623520.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112345583 ns | 112350625 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 88949792 ns | 89065125 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 299491791.5 ns | 296081125 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6536119 ns | 6489845.5 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5557747 ns | 5556104 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 18146 ns | 17312.5 ns | 1.05 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 17041 ns | 16834 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 14917 ns | 14291.5 ns | 1.04 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15042 ns | 15167 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 22030 ns | 21687 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1119955.5 ns | 1157478.5 ns | 0.97 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 224333.5 ns | 218167 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26051 ns | 27541 ns | 0.95 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10708 ns | 11042 ns | 0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 9062.5 ns | 9000.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 8083 ns | 7875 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17291 ns | 17416.5 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 266847 ns | 261161 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9633341 ns | 9552185 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1603229 ns | 1560042 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 153121 ns | 155181 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 8125 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8791.5 ns | 8084 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10417 ns | 10083.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8437.5 ns | 8542 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 128922 ns | 116504 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3499288 ns | 3349407.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 831208 ns | 798667 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 239042 ns | 238952.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10750 ns | 9854 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 10229.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10687 ns | 10083 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9354.5 ns | 9958 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 639262.5 ns | 623888 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21817843 ns | 22194230 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5444791 ns | 4515667 ns | 1.21 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 670891.5 ns | 656976 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9771 ns | 9520.5 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9875 ns | 9125 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11666.5 ns | 11625 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8854 ns | 9479.5 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 124952.5 ns | 120769 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3245458 ns | 3531092 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 911292 ns | 888291 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 77221 ns | 79170 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 15125 ns | 14208 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13333 ns | 13208.5 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 15083 ns | 16333 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16667 ns | 17000 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 609842 ns | 594781 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19612533.5 ns | 19851682 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4902166 ns | 4474458 ns | 1.10 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 355303 ns | 357348.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 625 ns | 459 ns | 1.36 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 458 ns | 1.28 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 35616 ns | 34855 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1099929 ns | 1184802 ns | 0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 481667 ns | 423042 ns | 1.14 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 209262 ns | 209842 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10292 ns | 7709 ns | 1.34 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7708 ns | 7084 ns | 1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 7708 ns | 1.24 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8042 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 236913 ns | 231568.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21559240 ns | 22217593.5 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5704459 ns | 5660167 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 678127 ns | 679867 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16208 ns | 16042 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 15041.5 ns | 15333 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13333 ns | 13854 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10083 ns | 10375 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 22727 ns | 22215 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1191974 ns | 1158702.5 ns | 1.03 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 321687.5 ns | 205521 ns | 1.57 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 193902 ns | 194012 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32167 ns | 31958 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32208.5 ns | 32145.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32458 ns | 32250 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32042 ns | 32250 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 281783 ns | 276502.5 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 10921459.5 ns | 11085623 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1840125 ns | 1721729 ns | 1.07 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 605236 ns | 605276.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 442417 ns | 474834 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 441208 ns | 445167 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 444667 ns | 486875 ns | 0.91 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 446792 ns | 474916 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194055 ns | 194410 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5844423 ns | 5748288 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1998937.5 ns | 2751937.5 ns | 0.73 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 325264 ns | 326354 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3799583.5 ns | 3823792 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3786145.5 ns | 3824042 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3841479.5 ns | 3849500 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3776729 ns | 3847584 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 555887 ns | 546410 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27559795 ns | 27926309 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10126917 ns | 10140750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1386964 ns | 1388348.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 782483459 ns | 782652917 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 542653375 ns | 542161792 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 417336604.5 ns | 420966458.5 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1561846646 ns | 1553203729.5 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22756973.5 ns | 22558411.5 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14049919 ns | 14062784.5 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3017390750 ns | 2518008250 ns | 1.20 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1789540125 ns | 1785714792 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1525075250 ns | 1525039667 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 5275786875 ns | 4874366334 ns | 1.08 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 366208751 ns | 367235490 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 89425410 ns | 88231178 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 82209 ns | 77646 ns | 1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 81334 ns | 75959 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80375 ns | 82625 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76395.5 ns | 77291 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 214623 ns | 208602.5 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7766051 ns | 8336540 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 665062.5 ns | 525229 ns | 1.27 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 109531 ns | 109211 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287479 ns | 199042 ns | 1.44 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 257000 ns | 262396 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 194709 ns | 276625 ns | 0.70 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 234083 ns | 287458 ns | 0.81 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1077150 ns | 1056833 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42088564.5 ns | 40754174 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6234000 ns | 6090583 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 648046 ns | 646691 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199633792 ns | 199913000 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139392833 ns | 139280375 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 103930750 ns | 104140916 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388502917 ns | 389020708 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5820718.5 ns | 5827400 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3416784 ns | 3419864.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 619755417 ns | 620313062.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 441375250 ns | 440225000 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 354920645.5 ns | 352767458 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1184933208 ns | 1182963541 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26386522.5 ns | 26862507 ns | 0.98 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21815610 ns | 21755438 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7166 ns | 7292 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6083 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5416 ns | 5291 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 10041 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28498 ns | 28028 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1116372 ns | 1272660 ns | 0.88 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 611458 ns | 627458 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49170.5 ns | 48010 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219833 ns | 220750 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221375 ns | 220521 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222000 ns | 221875 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206458.5 ns | 209208.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 227567 ns | 222206 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31738279 ns | 29719216 ns | 1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9196020.5 ns | 9434666.5 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 536326 ns | 527475 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7291.5 ns | 8458.5 ns | 0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9375 ns | 9209 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10521 ns | 10375 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8417 ns | 8083 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 119937 ns | 119377.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3293949 ns | 3449983 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 889708 ns | 855000 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71981 ns | 72520 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9229 ns | 8958.5 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7896 ns | 7500 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9000 ns | 10084 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 11083 ns | 10187.5 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 536667 ns | 521950 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19493979.5 ns | 18008002 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4818459 ns | 4315292 ns | 1.12 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 321554 ns | 321943 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 666 ns | 625 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 625 ns | 1.20 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 625 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 27275 ns | 26701 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1222604 ns | 1195571.5 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 497250 ns | 459104 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 50630 ns | 48701 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10875 ns | 10375 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9333 ns | 8479 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10750 ns | 11375 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15083 ns | 9375 ns | 1.61 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 258393 ns | 252977 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23478126 ns | 24052360 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 6003417 ns | 5702709 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 398169 ns | 397983.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 106542 ns | 106500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 99625 ns | 98125 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 86895.5 ns | 87479.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146499.5 ns | 147229 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 25487 ns | 24863 ns | 1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1171426 ns | 1228355 ns | 0.95 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 273000 ns | 263458.5 ns | 1.04 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 189472 ns | 190212 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 516500 ns | 478667 ns | 1.08 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 478500 ns | 509250 ns | 0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 478520.5 ns | 518562.5 ns | 0.92 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 478667 ns | 520417 ns | 0.92 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 238705 ns | 234381 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11556495 ns | 11772054 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2283167 ns | 2148312.5 ns | 1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 621496.5 ns | 621156 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5208 ns | 5375 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6917 ns | 5167 ns | 1.34 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7833 ns | 7500 ns | 1.04 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4791.5 ns | 4833.5 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16844 ns | 16136 ns | 1.04 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 73180.5 ns | 79061 ns | 0.93 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 13000 ns | 14083 ns | 0.92 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10916.5 ns | 10208.5 ns | 1.07 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10250 ns | 10292 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17500.5 ns | 16708 ns | 1.05 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 219010.5 ns | 213958 ns | 1.02 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 371483 ns | 374963 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 40958 ns | 40000 ns | 1.02 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50709 ns | 50584 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 51875 ns | 52458.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 15166.5 ns | 13895.5 ns | 1.09 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20635.5 ns | 19866 ns | 1.04 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 86331 ns | 87035.5 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 37292 ns | 38625 ns | 0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 32417 ns | 30646 ns | 1.06 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 29333 ns | 30791.5 ns | 0.95 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 58104.5 ns | 57666 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 197497 ns | 192524 ns | 1.03 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 408484 ns | 416745 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1750 ns | 1604.5 ns | 1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 2000 ns | 1791 ns | 1.12 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2042 ns | 1.08 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1729.5 ns | 1708 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 21528 ns | 21123 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1119527 ns | 1140764 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 330250 ns | 294500 ns | 1.12 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 33530 ns | 30391 ns | 1.10 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2042 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2125 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2333 ns | 2292 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2167 ns | 2208 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 208397 ns | 205122.5 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 8834014.5 ns | 8519681 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1669875 ns | 1638500 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 137426.5 ns | 139726.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5833 ns | 5709 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4667 ns | 5104 ns | 0.91 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6625 ns | 5750 ns | 1.15 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5770.5 ns | 4271 ns | 1.35 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 149631.5 ns | 146388.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5835867 ns | 5488369.5 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 758875 ns | 465291 ns | 1.63 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71411 ns | 72161 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8708 ns | 8479.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 8209 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 8750 ns | 1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8958 ns | 9042 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 903242 ns | 884256.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 37421660 ns | 38177021 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6164958 ns | 5496125 ns | 1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388963 ns | 394569 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56750 ns | 56791 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57708 ns | 57625 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57000 ns | 56875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57708 ns | 58166 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 38228 ns | 37427.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1180922 ns | 1210467.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 755458 ns | 468667 ns | 1.61 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205452.5 ns | 208482 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449666.5 ns | 487354.5 ns | 0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 464874.5 ns | 501250 ns | 0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 464667 ns | 492208.5 ns | 0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 453208 ns | 437438 ns | 1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 274665.5 ns | 267413 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26139340 ns | 26782051.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8078333 ns | 8248375 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 799638 ns | 839679 ns | 0.95 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3310209 ns |
3311333.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2329479.5 ns |
2340166.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
1770896 ns |
1769958 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6317959 ns |
6319645.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
205935 ns |
205610 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
202792 ns |
202712 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11454916.5 ns |
11497979 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8308854.5 ns |
8319667 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
6560334 ns |
6588125 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21092271 ns |
21221896 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
744389 ns |
736463 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1059296 ns |
1065445 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5667 ns |
5562.5 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4958.5 ns |
4666.5 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6792 ns |
6437.5 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5958 ns |
6104 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
142184.5 ns |
139569.5 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5394650.5 ns |
5734965.5 ns |
0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
792375 ns |
826042 ns |
0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
61591 ns |
59531 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9145.5 ns |
9333.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
7000 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7875 ns |
11875 ns |
0.66 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9791.5 ns |
8708 ns |
1.12 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
775371 ns |
764194 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
34345007 ns |
34028843.5 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5281125 ns |
5176312.5 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
379583 ns |
378403 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
97500 ns |
99625 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
97041 ns |
136708 ns |
0.71 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
99521 ns |
101312.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
144416 ns |
129709 ns |
1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
153437 ns |
151420 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5924089.5 ns |
6034399 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2030687.5 ns |
1982667 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
210747 ns |
206692 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2013833 ns |
2031041 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2026084 ns |
2037417 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2035834 ns |
2036291 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2002500 ns |
2038584 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
724917.5 ns |
708221 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31735074 ns |
31488037 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10929291.5 ns |
11251291 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1125621 ns |
1126246 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
33542 ns |
33459 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
36125.5 ns |
36750 ns |
0.98 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
34458 ns |
33833 ns |
1.02 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
750 ns |
667 ns |
1.12 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16054 ns |
15506 ns |
1.04 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
80321 ns |
86920 ns |
0.92 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
3500 ns |
4792 ns |
0.73 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2792 ns |
2709 ns |
1.03 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3125 ns |
3167 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2916 ns |
2291.5 ns |
1.27 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
143545.5 ns |
140769.5 ns |
1.02 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
358138 ns |
351474 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7125 ns |
7250 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6041 ns |
6000 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5375 ns |
5375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9750 ns |
10000 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37536 ns |
36795 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1179967 ns |
1247042.5 ns |
0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
669000.5 ns |
351333 ns |
1.90 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
51590.5 ns |
49030 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213270.5 ns |
213334 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221541.5 ns |
220166.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220812.5 ns |
228125 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
207375 ns |
206875 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
245557 ns |
244945 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27398743.5 ns |
24969632 ns |
1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7962500 ns |
7965166.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
590596 ns |
578090.5 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3916 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3959 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3958 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3916 ns |
3958 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
21981 ns |
21762 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2171600 ns |
2067928.5 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal |
247458.5 ns |
245104 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
45951 ns |
45631 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14916 ns |
14875 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15000 ns |
14916 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14709 ns |
14667 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14917 ns |
14667 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
309306 ns |
310256.5 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11748235 ns |
11269459 ns |
1.04 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal |
1034396 ns |
1000292 ns |
1.03 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
196272 ns |
193502 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
128084 ns |
102917 ns |
1.24 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
102584 ns |
103667 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
103583 ns |
108625 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
104209 ns |
131875 ns |
0.79 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
136798 ns |
137366.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6093498 ns |
5955500.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2059854.5 ns |
1988958 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
203112 ns |
200842 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1919896.5 ns |
1926354.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1856625 ns |
1913500 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1919937.5 ns |
1917792 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1868750 ns |
1936729 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
694137 ns |
692519 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30885450 ns |
33116808.5 ns |
0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10571000 ns |
11144584 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1230312 ns |
1078360.5 ns |
1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17979.5 ns |
17708 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18958 ns |
22291.5 ns |
0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
22041 ns |
21250 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18625 ns |
19146 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
109512 ns |
109241 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3501819 ns |
3392625.5 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1337916.5 ns |
1271125 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
76371 ns |
81331 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
218229.5 ns |
221229.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
252958.5 ns |
216791 ns |
1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216958.5 ns |
230083.5 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
216458 ns |
216083.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
524772 ns |
522920 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19866707.5 ns |
19545470 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6113500 ns |
6165645.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
491915 ns |
476780 ns |
1.03 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
25458 ns |
26250 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
29770.5 ns |
31250 ns |
0.95 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
26708.5 ns |
27875 ns |
0.96 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1459 ns |
1292 ns |
1.13 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16298 ns |
16312 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
83171 ns |
87751 ns |
0.95 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
5250 ns |
6625 ns |
0.79 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
5104 ns |
4645.5 ns |
1.10 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
5292 ns |
4917 ns |
1.08 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
4541 ns |
4792 ns |
0.95 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
208683 ns |
207882.5 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
387234 ns |
402074 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
304958 ns |
305938 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
304709 ns |
305917 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
310083.5 ns |
307521 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
305125 ns |
305375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
230658.5 ns |
230214 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7814600 ns |
7500239 ns |
1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1113875 ns |
643000 ns |
1.73 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
278313 ns |
280903 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
530375 ns |
538541 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
531208 ns |
549750 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
532709 ns |
542666 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
529417 ns |
529708 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1090787 ns |
1085631 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43485820.5 ns |
44253871 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6297104 ns |
6154687.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
884370 ns |
872599 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19542 ns |
19021 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19812.5 ns |
19833.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23875.5 ns |
22542 ns |
1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19666 ns |
21917 ns |
0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
113863.5 ns |
114174 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3373534.5 ns |
3531348.5 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1442083 ns |
1449271 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
79591 ns |
81471 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
222250 ns |
218834 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
214333.5 ns |
227542 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
213833 ns |
219708 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
213584 ns |
212708 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
765292 ns |
761865.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
24070784 ns |
24050167 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7204250 ns |
7412916.5 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
545746 ns |
543136 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6709 ns |
7125.5 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7312.5 ns |
6479 ns |
1.13 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8708 ns |
8458 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6667 ns |
6084 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
141798 ns |
141785 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
5523548 ns |
5370056 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
816958 ns |
777458 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
68790 ns |
69581 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9792 ns |
12958 ns |
0.76 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10166 ns |
9583.5 ns |
1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11083.5 ns |
10687.5 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9584 ns |
9625 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
837915.5 ns |
832452.5 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
38080103 ns |
38810557 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5622750 ns |
5231375 ns |
1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
401534 ns |
395184 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6000 ns |
5145.5 ns |
1.17 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5791 ns |
4812.5 ns |
1.20 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7458 ns |
6958 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4125 ns |
6833 ns |
0.60 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
145690.5 ns |
144967.5 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
5623588 ns |
5514807.5 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
838375 ns |
829125 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
70361 ns |
70250 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7375 ns |
7770.5 ns |
0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7667 ns |
7333 ns |
1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8291 ns |
7667 ns |
1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7500 ns |
7208 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
793785.5 ns |
790491 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
39186335.5 ns |
37869840 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
6086875 ns |
5670687 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
404344 ns |
398424.5 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14528416 ns |
14518959 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
10140541 ns |
10120000 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
7739520.5 ns |
7708791.5 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27755500 ns |
27832250 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
533007 ns |
532409 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
396764 ns |
399949.5 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46273458.5 ns |
46375083.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
33442312.5 ns |
33404583.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
26625187.5 ns |
26627416.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85281333 ns |
85835750 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2648843 ns |
2644453 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
3280893 ns |
3278895 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
65042 ns |
66042 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
68437.5 ns |
66125 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
71167 ns |
70520.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
67833 ns |
67875 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
119287 ns |
119873.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3455134 ns |
3330724 ns |
1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1439375 ns |
1410021 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
231657.5 ns |
229907.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
443062.5 ns |
453292 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
484834 ns |
441208 ns |
1.10 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
441417 ns |
450208 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
440834 ns |
445541 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
732137 ns |
732886.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25953439 ns |
26274297 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7552375 ns |
7781500 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
813538 ns |
794638 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
667 ns |
0.75 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32502 ns |
32132 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1227556 ns |
1164338 ns |
1.05 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
474604 ns |
431645.5 ns |
1.10 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
51870 ns |
49160 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9375 ns |
8292 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9729 ns |
8708 ns |
1.12 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9979 ns |
9250 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8583 ns |
8959 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
286691.5 ns |
286401.5 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
21308278 ns |
21940598 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5559917 ns |
5096125 ns |
1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
380464 ns |
388934 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9834 ns |
9792 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
9834 ns |
9875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
9875 ns |
9833 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
9792 ns |
9875 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23253 ns |
23178 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2074064 ns |
1908743.5 ns |
1.09 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal |
224792 ns |
222541 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
216413 ns |
217383 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
46250 ns |
45875 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
46209 ns |
45917 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
46375 ns |
46167 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
46000 ns |
45875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
291933.5 ns |
293089 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11156765 ns |
10988297.5 ns |
1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal |
1508167 ns |
982875 ns |
1.53 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
625531.5 ns |
621107 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56209 ns |
56250 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57208 ns |
57125 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
56458 ns |
56334 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
57750 ns |
57792 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
28351 ns |
28527 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1247340 ns |
1186883 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
717520.5 ns |
578645.5 ns |
1.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
212262.5 ns |
204943 ns |
1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
495167 ns |
448333 ns |
1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
471209 ns |
494125 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
466354 ns |
507583 ns |
0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
453729.5 ns |
439437 ns |
1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
247898 ns |
247232 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
31410562 ns |
33216066 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9721459 ns |
9499166 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
892599 ns |
891519.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
646125 ns |
652937.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
611208 ns |
647333 ns |
0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
659458 ns |
662854 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
622229 ns |
668500 ns |
0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
207532 ns |
207996 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8263342 ns |
8125052.5 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1391958.5 ns |
1384354.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
254943 ns |
233282 ns |
1.09 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2234625 ns |
2235042 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2230250 ns |
2238979 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2254374.5 ns |
2248959 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2230084 ns |
2260792 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
981398 ns |
984096 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
49241405.5 ns |
45382984 ns |
1.09 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6989291 ns |
8132833.5 ns |
0.86 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1255398 ns |
1370494 ns |
0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
20458 ns |
20958 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19354.5 ns |
20000 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
23583.5 ns |
22667 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19500 ns |
22083 ns |
0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
113732 ns |
113160 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3641860 ns |
3278898 ns |
1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1478625 ns |
1472792 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
81541 ns |
81561 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
231562.5 ns |
222313 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
222667 ns |
257542 ns |
0.86 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221667 ns |
232250 ns |
0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
227771 ns |
228000.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
732151 ns |
734156.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26497381.5 ns |
27357269 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7749334 ns |
7692750 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
560606 ns |
559476 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
542 ns |
0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
583 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
584 ns |
541 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
23209 ns |
23248 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1215660.5 ns |
1222462 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
487375 ns |
466625 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
51290 ns |
51870 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9166.5 ns |
9167 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9667 ns |
9208 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10375 ns |
9292 ns |
1.12 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8917 ns |
9312.5 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
269153 ns |
268568 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
23827343 ns |
24289416 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
6253667 ns |
6049709 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
410245 ns |
410500 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9334 ns | 10333 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8583.5 ns | 8458 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10979 ns | 10354 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7875 ns | 8333 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 119974 ns | 120393.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3326234.5 ns | 3445203 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 884291 ns | 832874.5 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 74671 ns | 72921 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7437.5 ns | 7583 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7937.5 ns | 8208 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 7417 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7562.5 ns | 7770.5 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 511775.5 ns | 511772 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17975530 ns | 16339001 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4375354.5 ns | 3959271 ns | 1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 328823 ns | 328364 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1333 ns | 1458 ns | 0.91 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1666.5 ns | 1542 ns | 1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 1833 ns | 1.20 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1334 ns | 1541 ns | 0.87 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21678 ns | 21725 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1152981 ns | 1136020 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 320854 ns | 296000 ns | 1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 194212 ns | 194712 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3250 ns | 3250 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3417 ns | 3250 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3625 ns | 3500 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3375 ns | 3209 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 219466 ns | 220221.5 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10834184 ns | 9698879 ns | 1.12 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1832687.5 ns | 1612667 ns | 1.14 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 596936 ns | 596166 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 150437 ns | 148167 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 128250 ns | 127709 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 108166.5 ns | 107958.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225041.5 ns | 225958 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24160 ns | 24338 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1310090.5 ns | 1138772 ns | 1.15 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 273792 ns | 270854.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 40781 ns | 40151 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 180083 ns | 156125 ns | 1.15 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 123167 ns | 127209 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 87791 ns | 100750 ns | 0.87 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 262687.5 ns | 256666.5 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 218451.5 ns | 218905 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10657598 ns | 10030041 ns | 1.06 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2067708.5 ns | 2003417 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 220982 ns | 240417.5 ns | 0.92 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7292 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 5375 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10375 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32820 ns | 32865 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1163489 ns | 1134920.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 617459 ns | 562875 ns | 1.10 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50661 ns | 52191 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232458 ns | 230854.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229042 ns | 270500 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228729.5 ns | 264875 ns | 0.86 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219875.5 ns | 213771 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 265376 ns | 263381.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26768413 ns | 28212764 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8263416.5 ns | 8517000 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 529665 ns | 607266 ns | 0.87 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15083 ns | 14958 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14958.5 ns | 15500 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17125 ns | 16500 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14833 ns | 15625 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 140181 ns | 140749.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5308940 ns | 5465169 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 888979.5 ns | 787125 ns | 1.13 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 238352 ns | 238512 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23438 ns | 22583 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24208 ns | 23500 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25166.5 ns | 24084 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24375 ns | 23167 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 870581 ns | 875101 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 41063907 ns | 37582744 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5887062.5 ns | 5600270.5 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 700012 ns | 692048 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9625 ns | 9125 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9542 ns | 9250.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11979.5 ns | 10521 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9125 ns | 9209 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 124080.5 ns | 124561 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3416043 ns | 3393331 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 822708 ns | 802083 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 79851 ns | 79030 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13750 ns | 13750 ns | 1 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14667 ns | 14125 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15021 ns | 14125 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13625 ns | 13917 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 671966 ns | 670894 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 19977653 ns | 20295661 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5398125 ns | 5274042 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 371888.5 ns | 375405 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8958 ns | 9208.5 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9916 ns | 9167 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11812.5 ns | 10438 ns | 1.13 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9041 ns | 9584 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 122970.5 ns | 122339.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3293970 ns | 3319433.5 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 925417 ns | 882875 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73870 ns | 75581 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12417 ns | 12333.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12708 ns | 12645.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13167 ns | 12708 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12584 ns | 12708 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 556881 ns | 557225 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19063398.5 ns | 18661226 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4862458 ns | 4435167 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 347418.5 ns | 345844 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30417 ns | 30292 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34729 ns | 34021.5 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 29916.5 ns | 30854.5 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2167 ns | 1791 ns | 1.21 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16379 ns | 16303 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 82421 ns | 82211 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5084 ns | 5270.5 ns | 0.96 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5250 ns | 5354 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5167 ns | 5375 ns | 0.96 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6500 ns | 6625 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 140549.5 ns | 140733 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 368674 ns | 394064.5 ns | 0.94 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26252 ns | 26135 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1216332 ns | 1123770.5 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 474208 ns | 474625 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47830 ns | 50311 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6375 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6750 ns | 6145.5 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6916.5 ns | 6416 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6416 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 187727.5 ns | 187828 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22632657 ns | 23626156 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6029520.5 ns | 5544437.5 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391454 ns | 395104 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 2042 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 2000 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 1959 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1958 ns | 2000 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26767 ns | 26544 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1182321 ns | 1165809 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 489208 ns | 461708.5 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 208282 ns | 209972 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16250 ns | 15792 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16792 ns | 16375 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17750 ns | 17000 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16270.5 ns | 16084 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 275397.5 ns | 275962 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24766157 ns | 24890960.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6216125 ns | 5972833 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 716018 ns | 713667.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 200667 ns | 178250 ns | 1.13 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 178687.5 ns | 184187.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 152375 ns | 153417 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147249.5 ns | 147459 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 203130 ns | 204372 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7922362.5 ns | 7857309.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1454125 ns | 1392667 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 176432 ns | 196752 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1327938 ns | 1326895.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1323854.5 ns | 1320625 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1334542 ns | 1330833 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1328041 ns | 1334750 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 915965.5 ns | 917280 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46188516 ns | 46023181 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6707708 ns | 6714958.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1115346.5 ns | 1108992 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26042 ns | 25229.5 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24792 ns | 26583 ns | 0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28458 ns | 26833 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24583 ns | 25917 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 237159 ns | 239791.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7575003.5 ns | 7972748 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1212208 ns | 980542 ns | 1.24 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 104651 ns | 116941 ns | 0.89 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117583 ns | 179917 ns | 0.65 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 172521 ns | 141604.5 ns | 1.22 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118500 ns | 127354.5 ns | 0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 117062 ns | 118604 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1090625 ns | 1092585 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43917712 ns | 43816902.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6372458 ns | 6033333 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 621286 ns | 606086 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23137 ns | 22970 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1238877 ns | 1175116 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 498542 ns | 456125 ns | 1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48881 ns | 48591 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6625 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6833 ns | 6750 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7167 ns | 6542 ns | 1.10 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6459 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 205267.5 ns | 204628 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23886381 ns | 23603781 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6302042 ns | 6092458 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 402254.5 ns | 397554 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7041.5 ns | 6125 ns | 1.15 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5541.5 ns | 6334 ns | 0.87 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7625 ns | 6709 ns | 1.14 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6000 ns | 5937.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 146785.5 ns | 147027 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5822475 ns | 5559804 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 746708.5 ns | 583167 ns | 1.28 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 241962 ns | 237472 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9958.5 ns | 9666.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10500 ns | 10041 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10041 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9979 ns | 9854 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 912783 ns | 910526.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 42603631 ns | 39406121 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6373125 ns | 5909375 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 691307 ns | 686288 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 667 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 667 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 667 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22725 ns | 22655 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1963806 ns | 2037996 ns | 0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 336042 ns | 222583 ns | 1.51 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216303 ns | 215862 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4584 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4792 ns | 4584 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4916 ns | 4625 ns | 1.06 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 229275 ns | 232442.5 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10608963 ns | 9881227 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1723167 ns | 1690521 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 602246 ns | 600181 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7791 ns | 8562.5 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8416.5 ns | 7937.5 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10375 ns | 9771 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8229.5 ns | 8520.5 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 122398.5 ns | 122197 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3507632 ns | 3361719 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 910208 ns | 761542 ns | 1.20 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 78850.5 ns | 76241 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8104 ns | 8792 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8833 ns | 8459 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9583 ns | 8875 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8750 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 595869 ns | 595652 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22073835 ns | 20278296 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5319000 ns | 4718125 ns | 1.13 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 356484 ns | 354274 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126292 ns | 125917 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 128916 ns | 128958 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 97208.5 ns | 96959 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183042 ns | 181416 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46271 ns | 46106 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 97571 ns | 96666 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 333542 ns | 317875 ns | 1.05 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 347479 ns | 346375 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 167291.5 ns | 178979 ns | 0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 625999.5 ns | 569062.5 ns | 1.10 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 192528.5 ns | 191966 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 496135 ns | 487875 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398041 ns | 397125 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288000 ns | 288292 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215416 ns | 215791 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756458 ns | 757959 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43488.5 ns | 43243.5 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1364389 ns | 1345812 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 410625 ns | 404062.5 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83870 ns | 83381 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1469145.5 ns | 1459854 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1135500 ns | 1136645.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 863146 ns | 865270.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2442125 ns | 2359813 ns | 1.03 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 247561 ns | 259216 ns | 0.96 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11237365 ns | 11177773 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1833625 ns | 1833666 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 354774 ns | 349653.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 659187.5 ns | 642333 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 640354.5 ns | 649875 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 662145.5 ns | 660416.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 548312.5 ns | 623542 ns | 0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 203446 ns | 202604 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7976375 ns | 7957177 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1383791.5 ns | 1348791.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 265813 ns | 265108 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2448916.5 ns | 2448583 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2438562 ns | 2452104 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2469417 ns | 2473833 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2434958 ns | 2455791 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1005073 ns | 1005284.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50692046 ns | 50767854.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9506625 ns | 10026166 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1461465 ns | 1511186 ns | 0.97 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32333.5 ns | 32375 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35166.5 ns | 35749.5 ns | 0.98 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 33042 ns | 34312.5 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 1125 ns | 916 ns | 1.23 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15934.5 ns | 15700 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 81141 ns | 81140 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3083 ns | 3166 ns | 0.97 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3250 ns | 3083 ns | 1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3500 ns | 3125 ns | 1.12 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3083 ns | 3000 ns | 1.03 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 139173.5 ns | 139352.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 343023.5 ns | 344664 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406750 ns | 405583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408334 ns | 408750 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 402208 ns | 403083 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420125 ns | 422042 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43496 ns | 43343.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1528026.5 ns | 1354478 ns | 1.13 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1169000 ns | 1109583 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 239982 ns | 240442 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3872041 ns | 3869125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3991083.5 ns | 3994396 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4002833 ns | 3999708 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3774729 ns | 3774354.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 244269 ns | 244251 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35724955 ns | 35978667 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11845709 ns | 11608750 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1455475 ns | 1245273.5 ns | 1.17 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3875 ns |
3958 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33788 ns |
34866 ns |
0.97 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1302459 ns |
1227111 ns |
1.06 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
181708 ns |
175291 ns |
1.04 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
40650 ns |
42710 ns |
0.95 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15708 ns |
15750 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15958 ns |
15667 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15792 ns |
15500 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15667 ns |
15542 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
255478 ns |
256386 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
8672466 ns |
8908913 ns |
0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
889209 ns |
872958 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
173551.5 ns |
174412 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404708 ns |
404166 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
296104 ns |
295666 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
221291 ns |
221625 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760584 ns |
760500 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
112862 ns |
113218 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1002276 ns |
1016425 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
423187 ns |
393437 ns |
1.08 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89911 ns |
90851 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1475917 ns |
1473333 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1161354.5 ns |
1161666 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
887896 ns |
888166.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2464979 ns |
2383791 ns |
1.03 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
241481.5 ns |
241468.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
11763790 ns |
11846004 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1944292 ns |
1877938 ns |
1.04 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
358523 ns |
360704 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
459 ns |
500 ns |
0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
583 ns |
583 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
459 ns |
1.27 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
542 ns |
0.85 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
26062.5 ns |
25943 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1278854 ns |
1192515 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
460042 ns |
470937.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
208442 ns |
208143 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7333 ns |
7458 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7958 ns |
7583 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7958 ns |
7458 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7417 ns |
7709 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
211704 ns |
214477.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24805137 ns |
25777295.5 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5538000 ns |
5998979.5 ns |
0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
700917 ns |
700287 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
832145.5 ns |
831271 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
616959 ns |
617041 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
471750 ns |
470000 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1544042 ns |
1545709 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
130757 ns |
129860.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
184082 ns |
169171.5 ns |
1.09 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2688499.5 ns |
2689145.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1996249.5 ns |
2013250 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1538625 ns |
1538125 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4933916 ns |
4941375 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
245136 ns |
241461 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
883529 ns |
867019 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32166 ns |
31985 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1203462 ns |
1142400.5 ns |
1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
460354 ns |
453291.5 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48821 ns |
48580 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6375 ns |
6250 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6750 ns |
6375 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7000 ns |
6416 ns |
1.09 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6333 ns |
6166 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
225402.5 ns |
224593 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19737803.5 ns |
21127237.5 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5669541 ns |
5053916 ns |
1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
369024 ns |
372504 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2394958 ns |
2423917 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2408625 ns |
2397291.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2394708 ns |
2403792 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2370458 ns |
2371125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
201622 ns |
203214 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7737337 ns |
8123069 ns |
0.95 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1450791 ns |
1393562 ns |
1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
380674 ns |
332763.5 ns |
1.14 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4647583 ns |
4645250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4653208.5 ns |
4645125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4673334 ns |
4654250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4514458.5 ns |
4658042 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
912510 ns |
910071 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
47986504.5 ns |
48057492 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6739083.5 ns |
6619584 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1262608 ns |
1416215 ns |
0.89 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6833 ns |
7438 ns |
0.92 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
9771 ns |
7083 ns |
1.38 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7250 ns |
6958 ns |
1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7542 ns |
6979 ns |
1.08 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
23275 ns |
23722 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1207675 ns |
1176238 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
265333 ns |
263000 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
37750 ns |
34150 ns |
1.11 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
64978.5 ns |
68020.5 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
71125 ns |
50312 ns |
1.41 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
52833 ns |
53292 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
70666.5 ns |
32583 ns |
2.17 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
218519 ns |
218170 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10524819 ns |
10824043 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2194937.5 ns |
2030958 ns |
1.08 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
240683 ns |
244333 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
21375.5 ns |
21437 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
26271 ns |
25333 ns |
1.04 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
22917 ns |
23479.5 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5709 ns |
6083 ns |
0.94 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
16707.5 ns |
16786.5 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
85561 ns |
91501 ns |
0.94 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
11812.5 ns |
12208.5 ns |
0.97 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
10625 ns |
10083 ns |
1.05 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
9417 ns |
9458.5 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18104.5 ns |
17854.5 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
227822 ns |
228126 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
392324 ns |
376824 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406375 ns |
406500 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
296875 ns |
297312.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
223583 ns |
223791 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762417 ns |
762958 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46850 ns |
46683 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1392045 ns |
1412498.5 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
490875 ns |
476666.5 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
90551 ns |
89121 ns |
1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1484020.5 ns |
1499875 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1170375 ns |
1167833.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
895687 ns |
894271 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2469958 ns |
2389834 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
291935 ns |
292932.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
11212854.5 ns |
13048501 ns |
0.86 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2109562.5 ns |
2098166 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
377974 ns |
380285 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
434000 ns |
433875 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
436708 ns |
436334 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
430250 ns |
430709 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
446334 ns |
448020.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54345 ns |
54564 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1014127 ns |
1024914 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1104583.5 ns |
1099208.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
235673 ns |
236522.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3886750 ns |
3897208 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4017812.5 ns |
4021833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4032979.5 ns |
4027708 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3781312.5 ns |
3812146 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
264536.5 ns |
264154 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31087542.5 ns |
31494055 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10495333 ns |
10517749.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1395074 ns |
1245028 ns |
1.12 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8708 ns |
8750 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
7666 ns |
7666 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
6958 ns |
6834 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12500 ns |
12459 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24046 ns |
24707 ns |
0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2146721 ns |
2085760.5 ns |
1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
230625 ns |
225250 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
216742 ns |
215337.5 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
45125 ns |
45042 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
45208 ns |
45125 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
44875 ns |
45083 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
45334 ns |
45187.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
346890 ns |
350283.5 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13295582 ns |
11134325 ns |
1.19 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1934541.5 ns |
1805125 ns |
1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
671457 ns |
662902 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
106854.5 ns |
93959 ns |
1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
119958 ns |
129416 ns |
0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
87875 ns |
87916.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
87250 ns |
125062.5 ns |
0.70 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189626 ns |
189645 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5844625 ns |
5972246.5 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1928125.5 ns |
1906021.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
187382 ns |
201947 ns |
0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2014874.5 ns |
2011375 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2017709 ns |
2017791 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2027542 ns |
2029459 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2000000 ns |
2017916.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
539260 ns |
537811 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28910116 ns |
27667805 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9321083.5 ns |
9734479.5 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1102071 ns |
1103102 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
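For readers scanning the table above, the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1 mean the current commit was slower. The following is a minimal sketch of that arithmetic using three rows copied from the table. The `ALERT_RATIO` cutoff and the helper names are assumptions made for illustration; they are not taken from this PR's workflow configuration.

```python
# Rows copied from the benchmark table: (name, current_ns, previous_ns).
ROWS = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 70666.5, 32583.0),
    ("batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)", 1125.0, 916.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)", 87250.0, 125062.5),
]

# Hypothetical cutoff chosen for illustration only.
ALERT_RATIO = 2.0


def ratio(current_ns, previous_ns):
    """Current time over previous time, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows, alert_ratio=ALERT_RATIO):
    """Return (name, ratio) for rows whose ratio exceeds the cutoff."""
    return [
        (name, ratio(cur, prev))
        for name, cur, prev in rows
        if ratio(cur, prev) > alert_ratio
    ]


if __name__ == "__main__":
    for name, cur, prev in ROWS:
        print(f"{name}: {ratio(cur, prev):.2f}")
    for name, r in flag_regressions(ROWS):
        print(f"possible regression: {name} ({r:.2f}x)")
```

With these three rows, only the `bias_activation` zygote entry on one CPU thread crosses the illustrative cutoff. Sub-microsecond CPU timings elsewhere in the table swing by 10 to 30 percent between runs, so small ratios on those rows are likely noise.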
avik-pal force-pushed the ap/dropout_enz branch 4 times, most recently from 78b8546 to 08e559d (September 5, 2024 13:57)
avik-pal force-pushed the ap/dropout_enz branch from 08e559d to 8b54f89 (September 5, 2024 14:37)