This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 3 changed files with 10 additions and 9 deletions.
1afc1c7
LuxLib Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5479.5
ns5750
ns0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6375
ns6187.5
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
8000
ns7979
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6375
ns6958.5
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
119198
ns119461
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
2649209
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
704000
ns723417
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
417764
ns417664
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9812
ns9834
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9625
ns9792
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10042
ns9916
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9541
ns10166
ns0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
551456
ns551816
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
16841216
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2645125
ns2364708
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
659636
ns695047
ns0.95
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1395.5
ns1458
ns0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1687.5
ns1687.5
ns1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1875
ns1917
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
2521
ns1250
ns2.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21867
ns21782
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI
1304894
nsbias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
212604
ns189208
ns1.12
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
30820.5
ns30960
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4209
ns3958.5
ns1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4312.5
ns4167
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
3917
ns4000
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4375
ns4334
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
146279
ns148046.5
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI
8894773.5
nsbias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1523375
ns1745084
ns0.87
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
148982
ns148342
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57542
ns56083
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46584
ns39917
ns1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39875
ns47000
ns0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83708
ns82750
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36787
ns37366
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
582007
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
985625
ns1348187.5
ns0.73
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
84391
ns80291
ns1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2036583
ns2017708
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2086750
ns2083959
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2079917
ns2090792
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1987312.5
ns1999604
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
227214
ns232635
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
7854957
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7818750
ns7104833
ns1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
967560
ns1540007
ns0.63
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
154083
ns143708
ns1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
146958
ns173750.5
ns0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
149979.5
ns165562.5
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
165187.5
ns165979
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166381
ns166570
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
7795058
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1464583
ns1701792
ns0.86
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
207072
ns205502.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1110895.5
ns1100292
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1103209
ns1114709
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1118687
ns1122042
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1109562.5
ns1119916
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
711437
ns713685
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33922938.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6051917
ns7357125
ns0.82
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1036360
ns1039502
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5208
ns4458
ns1.17
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4271
ns4291
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5375
ns6208
ns0.87
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4584
ns4416
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
94268
ns94296
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
5136056
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
711583
ns782083.5
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
69481
ns69431
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8667
ns8542
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8500
ns8834
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8917
ns9083
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8333
ns8583
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
603970
ns608245
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
33683319.5
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5821292
ns5666604.5
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
389889
ns384864
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17729.5
ns17229
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
20042
ns17250
ns1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
20584
ns22250
ns0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
20416.5
ns18312.5
ns1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
66995
ns68096
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
2897295
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1301292
ns1292667
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
73931
ns74070.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
211625
ns218583
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
218875
ns244459
ns0.90
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
218667
ns213333
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
224875
ns220875
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
357740
ns359693
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
14308445
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5704396
ns7278917
ns0.78
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
473855
ns475315
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
625
ns708
ns0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
666
ns584
ns1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns916.5
ns0.82
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
666
ns583
ns1.14
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20965
ns20807.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI
1157358.5
nsbias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
283542
ns297208
ns0.95
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
32571
ns33001
ns0.99
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1375
ns1375
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1375
ns1458
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1500
ns1583
ns0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1334
ns1417
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
125947
ns126203
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI
8433349.5
nsbias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1594979.5
ns1457625
ns1.09
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
138471
ns138172
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7334
ns7333
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6125
ns5375
ns1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5333
ns6083
ns0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10417
ns10291
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23836
ns24430
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1232101.5
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
583125
ns351229
ns1.66
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
46460
ns47101
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
227708
ns219208
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
235583
ns261791
ns0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
264667
ns228625
ns1.16
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
248583
ns223750
ns1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
190580
ns194664
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
29562269.5
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8564854.5
ns11964250
ns0.72
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
611281
ns617187
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4084
ns4125
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4084
ns4167
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4125
ns4084
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23789
ns23689
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI
2018577
nsdense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
219791.5
ns203375
ns1.08
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
50370
ns48541
ns1.04
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16958
ns16958
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17083
ns16583
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17083
ns17250
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16666
ns16917
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
197449
ns196884
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI
9693737.5
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
940458
ns1560667
ns0.60
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
176226.5
ns174782
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
509500
ns509333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
405083
ns332250
ns1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
332459
ns404250
ns0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
865125
ns865708
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113130
ns114284.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI
391060
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
451416
ns392875
ns1.15
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
248703
ns248273
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2324333
ns2318021
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
2025375.5
ns1745083
ns1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1752833.5
ns2021000
ns0.87
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3200583
ns3274791.5
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
244865
ns244508
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI
11656548
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
1966229
ns2001875
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
761317.5
ns763478
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6250
ns5833
ns1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6145.5
ns7167
ns0.86
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7729
ns7271
ns1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6375
ns6124.5
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
93009
ns92855.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
5406797
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
758167
ns861271
ns0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
60110
ns60401
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10646
ns11375
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10542
ns11750
ns0.90
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11084
ns12229
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10375
ns11125
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
660576
ns638820
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
38819677
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5487104
ns6435375
ns0.85
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
416424
ns416514.5
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns541
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
541
ns541
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23635
ns23671
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI
2221310
nsdense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
319750
ns318791
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
53401
ns53351
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2083
ns2167
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2083
ns2084
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2084
ns2166
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
232566
ns222818.5
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI
11381984
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
1912541.5
ns1967167
ns0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
186466.5
ns180782
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8375
ns8708
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8750
ns8833
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10438
ns9895.5
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8958
ns8709
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
104173
ns100619
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
3244842
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
896708
ns898521
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
74231
ns74410.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17708
ns17375
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17750
ns17167
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18187.5
ns19375
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18041.5
ns18250
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
610296
ns574738
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
17126722
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5229458
ns5654917
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
387209
ns389229
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns625
ns0.80
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns500
ns1.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
541
ns667
ns0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
35555
ns36237
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
1100087
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
438541
ns463667
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
47930
ns48401
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9312
ns8437.5
ns1.10
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8125
ns9312
ns0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9792
ns9875
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9146
ns9708
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
256000
ns254845
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
19311232
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
4774937.5
ns5087792
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
378844
ns375784
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397000
ns395833.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288125
ns215750
ns1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
215667
ns288166
ns0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756875
ns756000
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111981
ns112957
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI
320003
nsdense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
365500
ns299833
ns1.22
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
78230
ns76681
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1460875
ns1455646
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
1135291.5
ns862000
ns1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
862687.5
ns1130021
ns0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2357291
ns2442563
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
209166.5
ns210541
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI
9267436
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1516312.5
ns1636104.5
ns0.93
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
323643
ns325573.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6667
ns7000
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6959
ns7084
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8958.5
ns8125
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7334
ns7041
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
144567
ns136948
ns1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
5867002
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
707270.5
ns760125
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
70660
ns68820
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15395.5
ns14625
ns1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
12417
ns15042
ns0.83
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14250
ns14958.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13312
ns15625
ns0.85
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
958993.5
ns931253.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
40369162
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5752729.5
ns6306249.5
ns0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
433804
ns436305
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
24416
ns25542
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
26417
ns27334
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
28687
ns28354
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
26874.5
ns31542
ns0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
201880.5
ns200462.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
8100056
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
896833
ns1129500
ns0.79
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
114876.5
ns112942
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
148834
ns149250
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104708
ns131583.5
ns0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
153500
ns106479
ns1.44
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
116979
ns153208
ns0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1086710
ns1062590
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
41151661
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5843229.5
ns5978292
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
594985
ns590197
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
73958
ns76250
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
76791.5
ns74291.5
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
80166
ns77333
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75417
ns76792
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
207189
ns209030.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7362606
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
519687.5
ns638458
ns0.81
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
126391.5
ns130572
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
297334
ns216500
ns1.37
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
221667
ns297395.5
ns0.75
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
288917
ns212146
ns1.36
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
221041.5
ns306208
ns0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1119401
ns1140320
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
41008184.5
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6497687.5
ns7480542
ns0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
694627
ns697363
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
16417
ns15833
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
16583
ns17291.5
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
17792
ns17875
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16708
ns16687.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
147421
ns150183
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
5759467
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
427292
ns779979
ns0.55
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
237703
ns237943
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
24833.5
ns26458.5
ns0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27042
ns25708
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27166.5
ns27625
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27125
ns27750
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
984196
ns987976
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
40719457
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5828333
ns7131041.5
ns0.82
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
714022
ns701547
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11562.5
ns10396
ns1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
10375
ns11563
ns0.90
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12083
ns12833
ns0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11083
ns10875.5
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
124895.5
ns125970.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
3575871
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
912833
ns910812.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
242943
ns241512
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21125
ns21083
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21917
ns21604.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
22000
ns23041.5
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21416
ns21541.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
706086.5
ns709336
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
21428227.5
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5387146
ns5733333
ns0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
673547
ns676248
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
64000.5
ns62667
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
63500
ns63771
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
66166
ns65667
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
62584
ns67667
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
105629.5
ns107292
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3434086.5
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1323250
ns1352583.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
237572
ns240373
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
448750
ns444083
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
437958
ns448875
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
446666
ns440458
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
449583
ns445833.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
517219
ns521267
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
21208755
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5978042
ns8808750
ns0.68
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
730458
ns728812.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6958.5
ns6958.5
ns1
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6833
ns7291
ns0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8041
ns8771
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7771
ns7104
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
145909.5
ns147758.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
5602766
ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
628395.5
ns763583
ns0.82
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
58991
ns60941
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14042
ns15125
ns0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15750
ns14417
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
13917
ns15334
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13479
ns15958
ns0.84
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
954313
ns958359.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
38432249.5
ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
5549500
ns6378396
ns0.87
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
404584
ns409474
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6160416
ns6155291
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
6378167
ns3225687.5
ns1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
3224791.5
ns6379541
ns0.51
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11924000
ns11906125
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301800.5
ns351844
ns0.86
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU
294983
ns301554
ns0.98
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19104958
ns19041833.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
19957229
ns11118520.5
ns1.79
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
11123708.5
ns19989395.5
ns0.56
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36532604
ns36469125
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1023618
ns1015731
ns1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1158122
ns1151512
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
917
ns959
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
958
ns958
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
958
ns959
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
959
ns958
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23554
ns23791
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI
2143802
ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal
316188
ns317417
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU
215672
ns215032
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
3625
ns3667
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
3667
ns3667
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
3666
ns3750
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
3666
ns3708
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
283503
ns283833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI
11257238
ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal
2086333.5
ns2116208
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
637297
ns634877
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
8000
ns7167
ns1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
7958
ns7833.5
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
9042
ns9291
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7854
ns7500
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
120818.5
ns122503
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI
3517154
ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
776959
ns866646
ns0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
67641
ns66931
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11729.5
ns11709
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
12250
ns11834
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12334
ns13291
ns0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
12458.5
ns11875
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
643501
ns651319
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI
21447178
ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5189125.5
ns5038083
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
365334
ns365314
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
250
ns292
ns0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
291
ns291
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
291
ns250
ns1.16
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22596
ns22923
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI
1951713
ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal
225750
ns208979.5
ns1.08
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU
52251
ns50651
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
3041
ns3000
ns1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3208
ns2959
ns1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3375
ns3250
ns1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
3042
ns2959
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
204741
ns206218
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI
9227567
ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal
1619250
ns1699541.5
ns0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU
172842
ns158851.5
ns1.09
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11250
ns10375
ns1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11334
ns11854.5
ns0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13125
ns12417
ns1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11458
ns12333
ns0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
121547.5
ns123182.5
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI
3353104
ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal
869041
ns877125
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU
243193
ns241463
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
22000
ns22062
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
20583
ns21625
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21167
ns21708
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20791
ns20084
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
598450
ns605852.5
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI
19931223.5
ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal
4695229
ns5025000
ns0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
652706.5
ns667502
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4375
ns4417
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4417
ns4584
ns0.96
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4416
ns4417
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4416
ns4375
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24359
ns24334
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI
2166080
ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal
223833
ns208417
ns1.07
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU
52541
ns54130
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16667
ns16375
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16500
ns16375
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16375
ns16667
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16333
ns16875
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
331128
ns333246
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI
12599810
ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal
1647875.5
ns1768771
ns0.93
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU
212037.5
ns214042.5
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
1959
ns2084
ns0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
2083
ns2000
ns1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
1958
ns2166
ns0.90
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
1958
ns2041
ns0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
35684
ns36196
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI
1146851
ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal
441458.5
ns473000
ns0.93
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU
206802
ns205752
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
16645.5
ns17667
ns0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
16750
ns18937.5
ns0.88
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
16562.5
ns17625
ns0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
17208.5
ns16896
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
294264.5
ns297235
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI
20813859
ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal
5292083
ns5572167
ns0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
703797.5
ns694748
ns1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s)
59583.5
ns55979.5
ns1.06
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s)
63625
ns60709
ns1.05
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s)
62625
ns65812.5
ns0.95
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s)
51292
ns51583
ns0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA
66405
ns66558
ns1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU
103511
ns120591.5
ns0.86
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s)
199395.5
ns185895.5
ns1.07
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s)
157250
ns146354
ns1.07
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s)
133937.5
ns136208
ns0.98
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s)
317729
ns297104
ns1.07
batchedmm(16, Bsize=512)/zygote/GPU/CUDA
216342
ns218976.5
ns0.99
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU
579316
ns584106
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
82458.5
ns112833.5
ns0.73
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
85271
ns86417
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
90209
ns89416
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
140417
ns81000
ns1.73
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
192334
ns191966
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
5533381
ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1893708
ns1945000
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
170101.5
ns209467.5
ns0.81
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1851687.5
ns1912250
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1882334
ns1923916
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1926500
ns1917917
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1891958.5
ns1922250
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
532324
ns536309
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
25979046
ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9683125
ns11093750
ns0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1080090
ns935284.5
ns1.15
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s)
291
ns291
ns1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s)
292
ns291
ns1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s)
291
ns292
ns1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s)
292
ns250
ns1.17
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA
21761
ns21820
ns1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI
2115738
ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal
346875
ns327833.5
ns1.06
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU
45220
ns46181
ns0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
1792
ns1792
ns1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
1750
ns1791
ns0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
1792
ns1833
ns0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
1792
ns1792
ns1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA
253104
ns254627
ns0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI
9490240.5
ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal
1088979
ns1640833
ns0.66
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU
187502
ns187212
ns1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
8084
ns8209
ns0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
8438
ns9083
ns0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
10875
ns9896
ns1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
8209
ns8417
ns0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
119061
ns120586.5
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI
3459549.5
ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
880209
ns873250
ns1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
237872
ns236722
ns1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10167
ns10292
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9208
ns8958
ns1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9500
ns9917
ns0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9167
ns8666
ns1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
527070
ns532717.5
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI
18222497.5
ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
4417458
ns4452292
ns0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
634411
ns646767
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58417
ns56750
ns1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46333
ns39708
ns1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39500
ns47166
ns0.84
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
84083
ns83125
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
39770
ns40431
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
1341281.5
ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1100583.5
ns1093666
ns1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
75935.5
ns77971
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1901542
ns1903833
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1921833.5
ns1979312
ns0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1955833
ns1983896
ns0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1881792
ns1849208
ns1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
221320
ns224788
ns0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
33766076
ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11588792
ns14363791.5
ns0.81
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1036440
ns1042991
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
415958
ns415042
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
420042
ns418584
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
419875
ns420291
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
418708
ns420459
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
210156.5
ns212100.5
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7606443
ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
522750
ns1065709
ns0.49
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
287858
ns286133
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
764709
ns742875
ns1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
781812
ns758958
ns1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
753417
ns691062.5
ns1.09
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
678791.5
ns742624.5
ns0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1059447
ns1063422.5
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
43854665.5
ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6323063
ns7312146
ns0.86
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
916300
ns924920
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
3425978.5
ns3442959
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
3451792
ns3441833
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
3458979.5
ns3417500
ns1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
3412708
ns3453000
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
170950
ns174858
ns0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8189493
ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1396875
ns1420583
ns0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
435150
ns452865
ns0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
6194166.5
ns6180375
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
6230791.5
ns6232875
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
6222854
ns6229979
ns1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
6218875
ns6252666
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1001834
ns1007257
ns0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
49254606
ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
8528604
ns9641124.5
ns0.88
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1556125
ns1560736
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
472667
ns471375
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
339875
ns253334
ns1.34
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
253208
ns341708
ns0.74
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
902000
ns902583
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA
46534
ns46913
ns0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI
886552
ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal
478875
ns338020.5
ns1.42
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU
249963
ns250492
ns1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2333750
ns2320416
ns1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
2036625
ns1761167
ns1.16
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1763167
ns2033167
ns0.87
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3203312
ns3279375
ns0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA
258879
ns260626
ns0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI
13032420
ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal
2178375
ns2319917
ns0.94
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
787818
ns785678
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57542
ns56166
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
45875
ns39417
ns1.16
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39458
ns46584
ns0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83791
ns82917
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
28376
ns28863
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
1391893
ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1124083
ns1130625
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
77840.5
ns79170.5
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2032250
ns2020083
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2093187.5
ns2062917
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2091917
ns2078437.5
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1972229.5
ns2004145.5
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
235913
ns238429
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
35452366
ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11558395.5
ns15264270.5
ns0.76
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1056250.5
ns1057241
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57708
ns56292
ns1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46625
ns39833
ns1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
39875
ns47416
ns0.84
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83916.5
ns82875
ns1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
49455
ns50090
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
809068
ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1084875
ns1054834
ns1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
72105.5
ns74900
ns0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1921083
ns1924167
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1945916.5
ns1968250
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1974729.5
ns1980792
ns1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1864791
ns1891208
ns0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
238800.5
ns243592
ns0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
17238198
ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10023791.5
ns12800042
ns0.78
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
934629
ns1070466
ns0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
291
ns292
ns1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
292
ns333
ns0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
333
ns375
ns0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
333
ns292
ns1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
34886
ns35236
ns0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI
1200155
ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal
279833
ns461750
ns0.61
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
48281
ns50011
ns0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6792
ns6709
ns1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6208.5
ns6520.5
ns0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7000
ns7625
ns0.92
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6667
ns6541
ns1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
212384.5
ns216284
ns0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI
19751565
ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal
5078916.5
ns5088292
ns1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
379104
ns373774
ns1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s)
250
ns292
ns0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s)
250
ns250
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s)
291
ns291
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s)
250
ns250
ns1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA
32763
ns32446
ns1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI
1167700
ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal
253542
ns248500
ns1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU
41150
ns40510
ns1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
3833
ns2917
ns1.31
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
3041
ns3250
ns0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
3375
ns3083
ns1.09
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
3125
ns3458
ns0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA
190584.5
ns191592.5
ns0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI
7912209
ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal
1265542
ns1031291.5
ns1.23
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU
153656.5
ns153502
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
454937
ns423917
ns1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
454750
ns473500
ns0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
458229
ns427833
ns1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
427188
ns424125
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
138010.5
ns138519
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
5819207
ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
2011000
ns2048875
ns0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
325693
ns380684
ns0.86
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3801708.5
ns3799062.5
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
3811125
ns3822458
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
3821292
ns3802667
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3815375
ns3823563
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
710674
ns717031.5
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
32043185
ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10832625.5
ns12950229
ns0.84
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1491590
ns1325953
ns1.12
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s)
49856479
ns49840813
ns1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s)
35516042
ns25988833
ns1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s)
26022291
ns35525750
ns0.73
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s)
97102959
ns96904729.5
ns1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA
1594251.5
ns1593190
ns1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU
1009650
ns1014101
ns1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s)
154623520.5
ns153775938
ns1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s)
112350625
ns89008896
ns1.26
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s)
89065125
ns112384750
ns0.79
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s)
296081125
ns296752479
ns1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA
6489845.5
ns6476290
ns1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU
5556104
ns5534451
ns1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s)
17312.5
ns15062.5
ns1.15
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s)
16834
ns15625
ns1.08
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s)
14291.5
ns16875
ns0.85
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s)
15167
ns15333
ns0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA
21687
ns21010
ns1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI
1157478.5
ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal
218167
ns204959
ns1.06
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU
27541
ns27230
ns1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s)
11042
ns11083
ns1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s)
9000.5
ns7583
ns1.19
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s)
7875
ns9209
ns0.86
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s)
17416.5
ns17188
ns1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA
261161
ns264057
ns0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI
9552185
ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal
1560042
ns1736125.5
ns0.90
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU
155181
ns152581.5
ns1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
8125
ns7417
ns1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
8084
ns8833
ns0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
10083.5
ns10041.5
ns1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
8542
ns8292
ns1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
116504
ns117259.5
ns0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI
3349407.5
ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
798667
ns887417
ns0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
238952.5
ns236902.5
ns1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9854
ns9708.5
ns1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10229.5
ns9292
ns1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10083
ns10791.5
ns0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9958
ns9584
ns1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
623888
ns631614
ns0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI
22194230
nsgroupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
4515667
ns5189583
ns0.87
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
656976
ns668942
ns0.98
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9520.5 ns | 8812.5 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9125 ns | 9583 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11625 ns | 11042 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9479.5 ns | 9250 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 120769 ns | 122641 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3531092 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 888291 ns | 876791.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 79170 ns | 74481 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14208 ns | 13708 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13208.5 ns | 14979 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16333 ns | 14416 ns | 1.13 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17000 ns | 13625.5 ns | 1.25 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 594781 ns | 601521.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19851682 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4474458 ns | 4885250 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 357348.5 ns | 353174 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 458 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 459 ns | 500 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 458 ns | 584 ns | 0.78 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34855 ns | 35180 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1184802 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 423042 ns | 441166 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 209842 ns | 206562 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7709 ns | 7042 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7084 ns | 10458 ns | 0.68 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 8042 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8042 ns | 7125 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 231568.5 ns | 233713.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22217593.5 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5660167 ns | 5300958.5 ns | 1.07 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 679867 ns | 658707 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16042 ns | 12666 ns | 1.27 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 15333 ns | 13833 ns | 1.11 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13854 ns | 15667 ns | 0.88 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10375 ns | 10270.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 22215 ns | 22010 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1158702.5 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 205521 ns | 186625 ns | 1.10 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 194012 ns | 191282 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 31958 ns | 32042 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32145.5 ns | 32020.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32250 ns | 32458 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32250 ns | 31854.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 276502.5 ns | 278049 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11085623 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1721729 ns | 1885500 ns | 0.91 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 605276.5 ns | 606396.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 474834 ns | 438291 ns | 1.08 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 445167 ns | 484125 ns | 0.92 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 486875 ns | 446062.5 ns | 1.09 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 474916 ns | 477208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194410 ns | 194398.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5748288 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2751937.5 ns | 1968250 ns | 1.40 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 326354 ns | 375174 ns | 0.87 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3823792 ns | 3825292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3824042 ns | 3837396 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3849500 ns | 3828687.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3847584 ns | 3836875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 546410 ns | 549907 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27926309 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10140750 ns | 12010500 ns | 0.84 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1388348.5 ns | 1226382.5 ns | 1.13 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 782652917 ns | 836787979.5 ns | 0.94 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 542161792 ns | 426008000 ns | 1.27 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 420966458.5 ns | 542930250 ns | 0.78 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1553203729.5 ns | 1533058916 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22558411.5 ns | 22531506 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14062784.5 ns | 14059203 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2518008250 ns | 3617643875 ns | 0.70 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1785714792 ns | 1519606625 ns | 1.18 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1525039667 ns | 1791220042 ns | 0.85 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4874366334 ns | 4771769708 ns | 1.02 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 367235490 ns | 370760684 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88231178 ns | 89879564 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77646 ns | 75354.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75959 ns | 77417 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 82625 ns | 80167 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77291 ns | 76625 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208602.5 ns | 210924.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8336540 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 525229 ns | 1045583.5 ns | 0.50 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 109211 ns | 110131.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 199042 ns | 231500 ns | 0.86 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 262396 ns | 195167 ns | 1.34 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 276625 ns | 244583 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 287458 ns | 234875 ns | 1.22 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1056833 ns | 1060035 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 40754174 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6090583 ns | 6603312.5 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 646691 ns | 643791.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199913000 ns | 199256958.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139280375 ns | 103813958.5 ns | 1.34 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 104140916 ns | 139098125 ns | 0.75 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389020708 ns | 388864875 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5827400 ns | 5820038 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3419864.5 ns | 3424485 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 620313062.5 ns | 615907583.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 440225000 ns | 354224562 ns | 1.24 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 352767458 ns | 440166291.5 ns | 0.80 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1182963541 ns | 1188432875 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26862507 ns | 26804213.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21755438 ns | 21815881 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7333 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5416 ns | 1.12 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5291 ns | 6291 ns | 0.84 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10041 ns | 10458 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28028 ns | 28403 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1272660 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 627458 ns | 361437.5 ns | 1.74 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48010 ns | 48715.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220750 ns | 213333.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220521 ns | 221708 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221875 ns | 220916 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 209208.5 ns | 205750 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 222206 ns | 226122 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29719216 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9434666.5 ns | 11493583.5 ns | 0.82 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 527475 ns | 541195.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8458.5 ns | 7291 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9209 ns | 8417 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10375 ns | 10770.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8083 ns | 8583 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 119377.5 ns | 119656 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3449983 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 855000 ns | 855542 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72520 ns | 72200 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8958.5 ns | 7667 ns | 1.17 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 9395.5 ns | 0.80 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10084 ns | 8375 ns | 1.20 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10187.5 ns | 7542 ns | 1.35 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 521950 ns | 526844.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18008002 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4315292 ns | 4384667 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 321943 ns | 322463 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 459 ns | 1.36 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 458 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 416 ns | 1.50 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26701 ns | 27306 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1195571.5 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 459104 ns | 483625 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48701 ns | 48601 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 9917 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8479 ns | 10167 ns | 0.83 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 11375 ns | 9542 ns | 1.19 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9375 ns | 8667 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 252977 ns | 256488 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24052360 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5702709 ns | 5936416 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 397983.5 ns | 396784 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 106500 ns | 108542 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 98125 ns | 85333 ns | 1.15 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 87479.5 ns | 100208 ns | 0.87 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 147229 ns | 146625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24863 ns | 25074 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1228355 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 263458.5 ns | 244333 ns | 1.08 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190212 ns | 190632 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 478667 ns | 479625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 509250 ns | 518583.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 518562.5 ns | 481000 ns | 1.08 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 520417 ns | 478125 ns | 1.09 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 234381 ns | 235150 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11772054 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2148312.5 ns | 2164333 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 621156 ns | 622586 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5375 ns | 5500 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5167 ns | 5750 ns | 0.90 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7500 ns | 6666.5 ns | 1.13 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4833.5 ns | 4125 ns | 1.17 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16136 ns | 16723 ns | 0.96 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 79061 ns | 78130 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 14083 ns | 11812 ns | 1.19 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10208.5 ns | 11916 ns | 0.86 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10292 ns | 11000 ns | 0.94 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16708 ns | 16500 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 213958 ns | 216336 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 374963 ns | 370958.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 40000 ns | 35917 ns | 1.11 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50584 ns | 50500 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52458.5 ns | 52709 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13895.5 ns | 13541 ns | 1.03 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 19866 ns | 20359 ns | 0.98 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 87035.5 ns | 79931 ns | 1.09 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 38625 ns | 36625 ns | 1.05 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30646 ns | 29625 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 30791.5 ns | 31458 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57666 ns | 57209 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 192524 ns | 195413 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 416745 ns | 409364 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1604.5 ns | 1959 ns | 0.82 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1791 ns | 1792 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2042 ns | 2125 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1708 ns | 1792 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 21123 ns | 21014.5 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1140764 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 294500 ns | 324459 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 30391 ns | 33550 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2042 ns | 2209 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2125 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2292 ns | 2417 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2291 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 205122.5 ns | 207244.5 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 8519681 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1638500 ns | 1670895.5 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 139726.5 ns | 137121 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5709 ns | 4583 ns | 1.25 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5104 ns | 4750 ns | 1.07 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5750 ns | 6333 ns | 0.91 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4271 ns | 4917 ns | 0.87 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 146388.5 ns | 147827 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5488369.5 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 465291 ns | 771709 ns | 0.60 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72161 ns | 71711 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8479.5 ns | 8270.5 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8209 ns | 8666 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8792 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 8125 ns | 1.11 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 884256.5 ns | 888135.5 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 38177021 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5496125 ns | 6483625 ns | 0.85 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394569 ns | 391164 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56791 ns | 56875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57625 ns | 56875 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56875 ns | 57750 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58166 ns | 58292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37427.5 ns | 37890 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1210467.5 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 468667 ns | 379312.5 ns | 1.24 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 208482 ns | 205582 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 487354.5 ns | 448479 ns | 1.09 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 501250 ns | 465229 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 492208.5 ns | 464687.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 437438 ns | 433500 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 267413 ns | 270782 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26782051.5 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8248375 ns | 10306000 ns | 0.80 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 839679 ns | 801818 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3311333.5 ns | 3291000 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2340166.5 ns | 1770084 ns | 1.32 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1769958 ns | 2335292 ns | 0.76 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6319645.5 ns | 6297083.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205610 ns | 206316 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 202712 ns | 203322 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11497979 ns | 11333854.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8319667 ns | 6594562.5 ns | 1.26 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6588125 ns | 8324937.5 ns | 0.79 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21221896 ns | 21089229 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 736463 ns | 735605 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1065445 ns | 1072271 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5562.5 ns | 5625 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4666.5 ns | 5667 ns | 0.82 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6437.5 ns | 7500 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6104 ns | 6750 ns | 0.90 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 139569.5 ns | 139700 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5734965.5 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 826042 ns | 867541.5 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 59531 ns | 56260 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9333.5 ns | 7500 ns | 1.24 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7000 ns | 14625 ns | 0.48 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11875 ns | 7375 ns | 1.61 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8708 ns | 7000 ns | 1.24 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 764194 ns | 766028 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 34028843.5 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5176312.5 ns | 5998084 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 378403 ns | 380414 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 99625 ns | 117604 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 136708 ns | 125375 ns | 1.09 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 101312.5 ns | 102396 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 129709 ns | 98145.5 ns | 1.32 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 151420 ns | 152876 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6034399 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1982667 ns | 2030624.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 206692 ns | 185692 ns | 1.11 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2031041 ns | 2021875 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2037417 ns | 2037125 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2036291 ns | 2013542 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2038584 ns | 2033354 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 708221 ns | 716061.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31488037 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11251291 ns | 13591542 ns | 0.83 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1126246 ns | 1265732.5 ns | 0.89 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 33459 ns | 29833 ns | 1.12 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36750 ns | 34167 ns | 1.08 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 33833 ns | 35542 ns | 0.95 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15506 ns | 15704 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 86920 ns | 71560.5 ns | 1.21 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 4792 ns | 2583 ns | 1.86 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2709 ns | 4583 ns | 0.59 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3167 ns | 3000 ns | 1.06 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2291.5 ns | 2209 ns | 1.04 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 140769.5 ns | 143464 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 351474 ns | 351354 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7208 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5334 ns | 1.12 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 6166 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10000 ns | 1 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36795 ns | 37164 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1247042.5 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 351333 ns | 334396 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49030 ns | 49180 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213334 ns | 212895.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220166.5 ns | 222000 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228125 ns | 221041.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206875 ns | 205979 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 244945 ns | 249374 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24969632 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7965166.5 ns | 9656333 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 578090.5 ns | 581561 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3916 ns | 3959 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3959 ns | 4000 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21762 ns | 21939 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2067928.5 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 245104 ns | 227375 ns | 1.08 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 45631 ns | 45671 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14875 ns | 14916 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14916 ns | 14708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14667 ns | 15000 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14667 ns | 14875 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 310256.5 ns | 314728.5 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11269459 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1000292 ns | 1635750 ns | 0.61 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 193502 ns | 192832 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 102917 ns | 109166 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 103667 ns | 132541 ns | 0.78 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 108625 ns | 109875 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 131875 ns | 102125 ns | 1.29 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 137366.5 ns | 138355.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5955500.5 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1988958 ns | 2016354 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 200842 ns | 188667 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926354.5 ns | 1918396 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913500 ns | 1939229 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1917792 ns | 1913584 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1936729 ns | 1937625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 692519 ns | 700104 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33116808.5 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11144584 ns | 13264020.5 ns | 0.84 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1078360.5 ns | 1233652.5 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17708 ns | 17667 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22291.5 ns | 18458 ns | 1.21 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21250 ns | 22270.5 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19146 ns | 18250 ns | 1.05 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 109241 ns | 110588.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3392625.5 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1271125 ns | 1374104.5 ns | 0.93 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81331 ns | 81891 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221229.5 ns | 216417 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216791 ns | 249771 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230083.5 ns | 216541.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216083.5 ns | 217312.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 522920 ns | 527304 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19545470 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6165645.5 ns | 8411584 ns | 0.73 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 476780 ns | 488925 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 26250 ns | 24063 ns | 1.09 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 31250 ns | 28500 ns | 1.10 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 27875 ns | 29459 ns | 0.95 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1292 ns | 1334 ns | 0.97 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16312 ns | 16479 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 87751 ns | 82590 ns | 1.06 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 6625 ns | 4708.5 ns | 1.41 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4645.5 ns | 4708 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 4917 ns | 5208 ns | 0.94 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4792 ns | 4875 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 207882.5 ns | 210198 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 402074 ns | 398304 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
305938
ns304792
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
305917
ns305542
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
307521
ns311083
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
305375
ns306375
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
230214
ns232191.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
7500239 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
643000
ns1156396
ns0.56
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
280903
ns279563
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
538541
ns530625
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
549750
ns542459
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
542666
ns542000.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
529708
ns535875
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1085631
ns1096065
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
44253871 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
6154687.5
ns6678000
ns0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
872599
ns873778.5
ns1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
19021
ns20083
ns0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
19833.5
ns20187.5
ns0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22542
ns23187
ns0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21917
ns20959
ns1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
114174
ns115290.5
ns0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3531348.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
1449271
ns1265792
ns1.14
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
81471
ns80731
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
218834
ns212042
ns1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
227542
ns224625
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
219708
ns214333
ns1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
212708
ns213708.5
ns1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
761865.5
ns758025
ns1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
24050167 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7412916.5
ns10158583
ns0.73
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
543136
ns542975
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7125.5
ns6458
ns1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6479
ns6917
ns0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
8458
ns8542
ns0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6084
ns6417
ns0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
141785
ns143078
ns0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
5370056 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
777458
ns869500
ns0.89
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
69581
ns69771
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12958
ns10709
ns1.21
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9583.5
ns9771
ns0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
10687.5
ns10729.5
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9625
ns10291
ns0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
832452.5
ns834187
ns1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
38810557 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
5231375
ns6274750
ns0.83
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
395184
ns396084
ns1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5145.5
ns5333
ns0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4812.5
ns4958
ns0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6958
ns7125
ns0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6833
ns5958
ns1.15
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
144967.5
ns146313.5
ns0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI
5514807.5 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
829125
ns875000
ns0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
70250
ns67660
ns1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7770.5
ns7667
ns1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7333
ns7500
ns0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7667
ns7625
ns1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7208
ns7459
ns0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
790491
ns797995
ns0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI
37869840 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
5670687
ns6580999.5
ns0.86
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
398424.5
ns400804
ns0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s)
14518959
ns14350958
ns1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s)
10120000
ns7722625
ns1.31
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s)
7708791.5
ns10132750
ns0.76
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s)
27832250
ns27757125
ns1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA
532409
ns532327
ns1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU
399949.5
ns403538.5
ns0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s)
46375083.5
ns45806208
ns1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s)
33404583.5
ns26766750.5
ns1.25
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s)
26627416.5
ns33520000
ns0.79
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s)
85835750
ns85306916
ns1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA
2644453
ns2661047
ns0.99
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU
3278895
ns3296413
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
66042
ns66000
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
66125
ns67333
ns0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
70520.5
ns69854
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
67875
ns67375
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
119873.5
ns120529
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3330724 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
1410021
ns1329083.5
ns1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
229907.5
ns228112
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
453292
ns444083
ns1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
441208
ns444083
ns0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
450208
ns441292
ns1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
445541
ns442521.5
ns1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
732886.5
ns736542.5
ns1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
26274297 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7781500
ns10732062.5
ns0.73
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
794638
ns809398
ns0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
667
ns542
ns1.23
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
625
ns542
ns1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
542
ns667
ns0.81
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns542
ns0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
32132
ns32886
ns0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI
1164338 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
431645.5
ns466834
ns0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
49160
ns49230
ns1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8292
ns9375
ns0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8708
ns9250
ns0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9250
ns9500
ns0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8959
ns8125
ns1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
286401.5
ns290314.5
ns0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI
21940598 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
5096125
ns5519708
ns0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
388934
ns387394
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
9792
ns9875
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
9875
ns9833
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
9833
ns9833
ns1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
9875
ns9791
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA
23178
ns23928
ns0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI
1908743.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal
222541
ns204979.5
ns1.09
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU
217383
ns214872
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
45875
ns46000
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
45917
ns45667
ns1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
46167
ns46666
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
45875
ns46250
ns0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA
293089
ns293307
ns1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI
10988297.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal
982875
ns1595562.5
ns0.62
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
621107
ns621217
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
56250
ns56333
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
57125
ns56792
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
56334
ns57083
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
57792
ns57834
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
28527
ns29516
ns0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1186883 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
578645.5
ns704333.5
ns0.82
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
204943
ns205082
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
448333
ns455021
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
494125
ns465375
ns1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
507583
ns473000
ns1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
439437
ns434208.5
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
247232
ns252003
ns0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
33216066 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
9499166
ns12166125
ns0.78
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
891519.5
ns893508.5
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
652937.5
ns624416
ns1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
647333
ns662083
ns0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
662854
ns619083
ns1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
668500
ns633895.5
ns1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
207996
ns212333
ns0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
8125052.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal
1384354.5
ns1471333
ns0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
233282
ns236152
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2235042
ns2220834
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2238979
ns2250000
ns1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2248959
ns2213792
ns1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2260792
ns2240750
ns1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
984096
ns990521.5
ns0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
45382984 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal
8132833.5
ns9717333
ns0.84
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1370494
ns1376089
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
20958
ns19000
ns1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
20000
ns19979
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22667
ns22333.5
ns1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22083
ns22250
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
113160
ns114382.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
3278898 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
1472792
ns1244584
ns1.18
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
81561
ns81450
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
222313
ns222479
ns1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
257542
ns224959
ns1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
232250
ns221208
ns1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
228000.5
ns218917
ns1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
734156.5
ns738666.5
ns0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
27357269 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7692750
ns10456396
ns0.74
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
559476
ns562856
ns0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
542
ns584
ns0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
583
ns500
ns1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
541
ns667
ns0.81
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
500
ns542
ns0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
23248
ns23746
ns0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI
1222462 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
466625
ns488062.5
ns0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
51870
ns49670
ns1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
9167
ns9541.5
ns0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
9208
ns9792
ns0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
9292
ns9833
ns0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9312.5
ns9291.5
ns1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
268568
ns272510
ns0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI
24289416 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
6049709
ns6224583.5
ns0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
410500
ns407824
ns1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
10333
ns7708
ns1.34
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
8458
ns8687.5
ns0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
10354
ns11166.5
ns0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
8333
ns9666
ns0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
120393.5
ns121220
ns0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI
3445203 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal
832874.5
ns860208
ns0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU
72921
ns72661
ns1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7583
ns7708
ns0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8208
ns7250
ns1.13
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7417
ns8125
ns0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7770.5
ns7334
ns1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
511772
ns516336
ns0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI
16339001 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal
3959271
ns4339813
ns0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU
328364
ns328244
ns1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1458
ns1458
ns1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1542
ns1375
ns1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns2041.5
ns0.90
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1541
ns1583
ns0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA
21725
ns21646
ns1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI
1136020 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal
296000
ns305020.5
ns0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU
194712
ns191511.5
ns1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
3250
ns3334
ns0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
3250
ns3375
ns0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
3500
ns3459
ns1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
3209
ns3458
ns0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA
220221.5
ns224911
ns0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI
9698879 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal
1612667
ns1768041
ns0.91
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
596166
ns595216
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
148167
ns145708.5
ns1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
127709
ns106562.5
ns1.20
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
107958.5
ns129292
ns0.83
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
225958
ns225125
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
24338
ns24473.5
ns0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI
1138772 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal
270854.5
ns252375
ns1.07
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU
40151
ns38390
ns1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
156125
ns143771
ns1.09
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
127209
ns88167
ns1.44
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
100750
ns110771
ns0.91
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
256666.5
ns250875
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
218905
ns220914.5
ns0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI
10030041 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal
2003417
ns2045709
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU
240417.5
ns237933
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7292
ns7250
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6083
ns5333
ns1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5375
ns5916
ns0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10375
ns10208
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
32865
ns33448
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI
1134920.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
562875
ns335833
ns1.68
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
52191
ns50340
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
230854.5
ns224250
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
270500
ns228375
ns1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
264875
ns236083.5
ns1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213771
ns212562.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
263381.5
ns267943.5
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI
28212764 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8517000
ns9170083
ns0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
607266
ns609306
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
14958
ns14458
ns1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
15500
ns14812.5
ns1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
16500
ns16791.5
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
15625
ns15334
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
140749.5
ns141134
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI
5465169 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal
787125
ns873104
ns0.90
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU
238512
ns238182
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
22583
ns24083.5
ns0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
23500
ns23875
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24084
ns24167
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
23167
ns23625
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
875101
ns878285
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI
37582744 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal
5600270.5
ns6385188
ns0.88
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
692048
ns692226
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
9125
ns8916
ns1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9250.5
ns9687.5
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
10521
ns12125
ns0.87
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
9209
ns10416
ns0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
124561
ns124959.5
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI
3393331 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
802083
ns918334
ns0.87
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
79030
ns75531
ns1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13750
ns14000
ns0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14125
ns13729
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14125
ns14708
ns0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13917
ns13834
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
670894
ns676549
ns0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI
20295661 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
5274042
ns5573041
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
375405
ns373189
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
9208.5
ns8062
ns1.14
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
9167
ns9750
ns0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10438
ns11916.5
ns0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
9584
ns10187.5
ns0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
122339.5
ns124116
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI
3319433.5 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
882875
ns883646
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
75581
ns69690
ns1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12333.5
ns12625
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12645.5
ns12750
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12708
ns13542
ns0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12708
ns12312
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
557225
ns561116
ns0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI
18661226 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
4435167
ns4630937
ns0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
345844
ns345083.5
ns1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
30292
ns27208.5
ns1.11
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
34021.5
ns32333.5
ns1.05
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
30854.5
ns31958
ns0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
1791
ns2041
ns0.88
batchedmm(2, Bsize=128)/forward/GPU/CUDA
16303
ns16556
ns0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU
82211
ns82091
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
5270.5
ns5229
ns1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
5354
ns4687.5
ns1.14
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
5375
ns5334
ns1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
6625
ns6458
ns1.03
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
140733
ns142634
ns0.99
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU
394064.5
ns367964
ns1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
250
ns334
ns0.75
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
ns250
ns1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
250
ns375
ns0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns250
ns1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
26135
ns26682
ns0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI
1123770.5 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal
474625
ns482271
ns0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU
50311
ns47990
ns1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6375
ns6500
ns0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6145.5
ns6562.5
ns0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6416
ns6709
ns0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6416
ns6188
ns1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
187828
ns190767.5
ns0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI
23626156 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal
5544437.5
ns5874834
ns0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU
395104
ns394363.5
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
2042
ns2042
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
2000
ns1917
ns1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
1959
ns2125
ns0.92
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
2000
ns2000
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
26544
ns27167
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI
1165809 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
461708.5
ns492292
ns0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
209972
ns210002
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15792
ns16833.5
ns0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16375
ns16417
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17000
ns17354.5
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16084
ns16458.5
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
275962
ns278278
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI
24890960.5 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5972833
ns6125604
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
713667.5
ns714427
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
178250
ns146500
ns1.22
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
184187.5
ns171396
ns1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
153417
ns155584
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
147459
ns154167
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
204372
ns204804
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI
7857309.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1392667
ns1553583
ns0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
196752
ns231362.5
ns0.85
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1326895.5
ns1324312.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1320625
ns1348021
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1330833
ns1319083.5
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1334750
ns1326542
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
917280
ns925557
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI
46023181
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
6714958.5
ns8602229.5
ns0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1108992
ns1014380
ns1.09
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25229.5 ns | 23792 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26583 ns | 25354 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 26833 ns | 28250 ns | 0.95 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25917 ns | 24604.5 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 239791.5 ns | 238411 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7972748 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 980542 ns | 1139000 ns | 0.86 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116941 ns | 120312 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 179917 ns | 117854 ns | 1.53 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 141604.5 ns | 124667 ns | 1.14 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 127354.5 ns | 174458.5 ns | 0.73 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 118604 ns | 118354 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1092585 ns | 1098934 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43816902.5 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6033333 ns | 7919042 ns | 0.76 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 606086 ns | 614406 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 250 ns | 1.34 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22970 ns | 23522 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1175116 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 456125 ns | 491791.5 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48591 ns | 50790 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6583 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6750 ns | 6375 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6833 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6459 ns | 6167 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 204628 ns | 207746.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23603781 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6092458 ns | 5956667 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 397554 ns | 395954 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5958 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6334 ns | 6041.5 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6709 ns | 7604.5 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5937.5 ns | 6500 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 147027 ns | 147981.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5559804 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 583167 ns | 774875 ns | 0.75 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237472 ns | 239202 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9666.5 ns | 10000 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10041 ns | 10083 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10041 ns | 10667 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9854 ns | 9791.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 910526.5 ns | 916090 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 39406121 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5909375 ns | 7392292 ns | 0.80 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 686288 ns | 688747.5 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 666 ns | 708 ns | 0.94 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 666 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22655 ns | 23031 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2037996 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 222583 ns | 209625 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 215862 ns | 215712 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4584 ns | 4833 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4584 ns | 4584 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4625 ns | 4833 ns | 0.96 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 232442.5 ns | 230125.5 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9881227 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1690521 ns | 1700146 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 600181 ns | 599396 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8562.5 ns | 8396 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7937.5 ns | 8000 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9771 ns | 10125 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8520.5 ns | 9062.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 122197 ns | 123106.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3361719 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 761542 ns | 907333 ns | 0.84 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 76241 ns | 76081 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 8792 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8459 ns | 8459 ns | 1 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 9041 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 ns | 8270.5 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 595652 ns | 600302.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20278296 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4718125 ns | 4960583.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 354274 ns | 353604 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125917 ns | 122750 ns | 1.03 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 128958 ns | 95625 ns | 1.35 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 96959 ns | 130334 ns | 0.74 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 181416 ns | 183125 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46106 ns | 46375 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 96666 ns | 98981 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 317875 ns | 303292 ns | 1.05 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 346375 ns | 182750 ns | 1.90 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 178979 ns | 345917 ns | 0.52 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 569062.5 ns | 608729 ns | 0.93 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 191966 ns | 195364.5 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 487875 ns | 494734 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397125 ns | 396125 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288292 ns | 215375 ns | 1.34 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215791 ns | 287708 ns | 0.75 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 757959 ns | 756000 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43243.5 ns | 43820 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1345812 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 404062.5 ns | 358000 ns | 1.13 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83381 ns | 83390 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1459854 ns | 1446958.5 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1136645.5 ns | 863667 ns | 1.32 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 865270.5 ns | 1133375 ns | 0.76 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2359813 ns | 2443417 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 259216 ns | 252085 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11177773 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1833666 ns | 1851958 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 349653.5 ns | 350863.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 642333 ns | 626459 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 649875 ns | 682479 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 660416.5 ns | 615000 ns | 1.07 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 623542 ns | 641167 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 202604 ns | 203045 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7957177 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1348791.5 ns | 1359542 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 265108 ns | 254223 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2448583 ns | 2435250 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2452104 ns | 2470979.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2473833 ns | 2445042 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2455791 ns | 2415792 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1005284.5 ns | 1014910 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50767854.5 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10026166 ns | 11589916 ns | 0.87 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1511186 ns | 1478675 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32375 ns | 29458.5 ns | 1.10 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35749.5 ns | 33812.5 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34312.5 ns | 34541 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 916 ns | 1042 ns | 0.88 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15700 ns | 15442 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 81140 ns | 85531 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3166 ns | 3250 ns | 0.97 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3083 ns | 3042 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3125 ns | 3416 ns | 0.91 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3000 ns | 3166 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 139352.5 ns | 142240.5 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 344664 ns | 360413 ns | 0.96 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 405583 ns | 404291 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408750 ns | 403708 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 403083 ns | 409042 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 422042 ns | 421875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43343.5 ns | 44262 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1354478 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1109583 ns | 1119041 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 240442 ns | 242882 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3869125 ns | 3855208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3994396 ns | 3997771 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3999708 ns | 3998125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3774354.5 ns | 3773938 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 244251 ns | 248524 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35978667 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11608750 ns | 14976771 ns | 0.78 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1245273.5 ns | 1453704 ns | 0.86 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3959 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3875 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34866 ns | 34278.5 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1227111 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 175291 ns | 161167 ns | 1.09 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42710 ns | 40280 ns | 1.06 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15750 ns | 15875 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15667 ns | 15583 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15500 ns | 16041 ns | 0.97 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15542 ns | 15791 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 256386 ns | 257529.5 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8908913 ns | | |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 872958 ns | 864083.5 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 174412 ns | 168256.5 ns | 1.04 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404166 ns | 403417 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295666 ns | 221375 ns | 1.34 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 221625 ns | 295666 ns | 0.75 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760500 ns | 760500 ns | 1 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113218 ns | 113952 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1016425 ns | | |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 393437 ns | 335792 ns | 1.17 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90851 ns | 88615.5 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1473333 ns | 1471958 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1161666 ns | 887791.5 ns | 1.31 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 888166.5 ns | 1157167 ns | 0.77 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2383791 ns | 2467666 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 241468.5 ns | 255583.5 ns | 0.94 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 11846004 ns | | |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1877938 ns | 1946854 ns | 0.96 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 360704 ns | 360243.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 459 ns | 584 ns | 0.79 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25943 ns | 26902 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1192515 ns | | |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 470937.5 ns | 486187.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 208143 ns | 208227.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7667 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7583 ns | 7666 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7458 ns | 7916.5 ns | 0.94 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7709 ns | 7250 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 214477.5 ns | 219818 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25777295.5 ns | | |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5998979.5 ns | 6151042 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 700287 ns | 686716.5 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 831271 ns | 825562.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 617041 ns | 468833 ns | 1.32 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 470000 ns | 620188 ns | 0.76 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1545709 ns | 1547479 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129860.5 ns | 131055 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 169171.5 ns | 231953 ns | 0.73 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2689145.5 ns | 2669042 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2013250 ns | 1538125.5 ns | 1.31 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1538125 ns | 2006270.5 ns | 0.77 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4941375 ns | 4938583 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 241461 ns | 242713 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 867019 ns | 860168 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31985 ns | 32634 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1142400.5 ns | | |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 453291.5 ns | 452000 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48580 ns | 48761 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6250 ns | 6437.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6541.5 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6416 ns | 6750 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6166 ns | 6000 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 224593 ns | 228896 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21127237.5 ns | | |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5053916 ns | 5302916 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 372504 ns | 369843 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2423917 ns | 2391250 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2397291.5 ns | 2400000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2403792 ns | 2405958 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2371125 ns | 2372125 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 203214 ns | 204395 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8123069 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1393562 ns | 1597249.5 ns | 0.87 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 332763.5 ns | 377704 ns | 0.88 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4645250 ns | 4646708.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4645125 ns | 4648958 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4654250 ns | 4659021 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4658042 ns | 4685792 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 910071 ns | 915367 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48057492 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6619584 ns | 7426833 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1416215 ns | 1261857 ns | 1.12 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7438 ns | 7479 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7083 ns | 7125 ns | 0.99 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6958 ns | 7959 ns | 0.87 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6979 ns | 7250 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23722 ns | 23573 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1176238 ns | | |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 263000 ns | 243500 ns | 1.08 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 34150 ns | 39571 ns | 0.86 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 68020.5 ns | 70291.5 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 50312 ns | 45542 ns | 1.10 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 53292 ns | 63500 ns | 0.84 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 32583 ns | 33104 ns | 0.98 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 218170 ns | 217821 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10824043 ns | | |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2030958 ns | 2084458 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 244333 ns | 226612 ns | 1.08 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21437 ns | 20396 ns | 1.05 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25333 ns | 24479.5 ns | 1.03 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 23479.5 ns | 24854.5 ns | 0.94 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 6083 ns | 5500 ns | 1.11 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 16786.5 ns | 16892 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 91501 ns | 85151 ns | 1.07 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12208.5 ns | 11958 ns | 1.02 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10083 ns | 9000 ns | 1.12 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 9458.5 ns | 10958.5 ns | 0.86 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17854.5 ns | 18167 ns | 0.98 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 228126 ns | 227664.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 376824 ns | 389024 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406500 ns | 404791 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297312.5 ns | 223500 ns | 1.33 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 223791 ns | 296709 ns | 0.75 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762958 ns | 762750 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46683 ns | 46360 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1412498.5 ns | | |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 476666.5 ns | 340000 ns | 1.40 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89121 ns | 88940 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1499875 ns | 1485750.5 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1167833.5 ns | 895812 ns | 1.30 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 894271 ns | 1165791.5 ns | 0.77 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2389834 ns | 2472333 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 292932.5 ns | 290272 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 13048501 ns | | |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2098166 ns | 2106583 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 380285 ns | 377424 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 433875 ns | 432770.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436334 ns | 430583 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 430709 ns | 436958 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 448020.5 ns | 448209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54564 ns | 54092 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1024914 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1099208.5 ns | 1074083.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236522.5 ns | 235772 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3897208 ns | 3888958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4021833 ns | 4016791.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4027708 ns | 4025938 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3812146 ns | 3793958.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 264154 ns | 263523 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31494055 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10517749.5 ns | 11929333 ns | 0.88 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1245028 ns | 1247352 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8750 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7666 ns | 6875 ns | 1.12 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 6834 ns | 7667 ns | 0.89 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12459 ns | 12417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24707 ns | 24084 ns | 1.03 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2085760.5 ns | | |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 225250 ns | 211583 ns | 1.06 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 215337.5 ns | 216562 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45042 ns | 45125 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45125 ns | 44750 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45083 ns | 45375 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45187.5 ns | 45187.5 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 350283.5 ns | 347338.5 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11134325 ns | | |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1805125 ns | 1883625.5 ns | 0.96 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 662902 ns | 671931.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 93959 ns | 104146.5 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 129416 ns | 86437 ns | 1.50 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 87916.5 ns | 92875 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 125062.5 ns | 126625 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189645 ns | 189767 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5972246.5 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1906021.5 ns | 1966250 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 201947 ns | 183982 ns | 1.10 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2011375 ns | 2011000 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2017791 ns | 2025000 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2029459 ns | 2009458 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2017916.5 ns | 2016917 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 537811 ns | 535873.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27667805 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9734479.5 ns | 11961958.5 ns | 0.81 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1103102 ns | 982380 ns | 1.12 |
This comment was automatically generated by workflow using github-action-benchmark.
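
A note on reading the figures above: the last number in each entry appears to be the first timing divided by the second, rounded to two decimals (for example 2000 / 1917 gives 1.04). The Python sketch below reproduces that arithmetic on a few entries copied from this comment and flags the ones above a cutoff. The helper name `timing_ratio` and the cutoff `THRESHOLD` are hypothetical names chosen for illustration; they are not part of github-action-benchmark or LuxLib.

```python
def timing_ratio(first_ns: float, second_ns: float) -> float:
    """Return first_ns / second_ns rounded to two decimals.

    Assumption: this mirrors how the ratio figure in the comment
    relates to the two timings listed before it.
    """
    if second_ns <= 0:
        raise ValueError("second timing must be positive")
    return round(first_ns / second_ns, 2)


# (benchmark name, first timing in ns, second timing in ns), copied from the comment.
rows = [
    ("batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)", 2000, 1917),
    ("layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)", 179917, 117854),
    ("batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)", 346375, 182750),
    ("batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)", 178979, 345917),
]

# Hypothetical cutoff for illustration only; the workflow's real alert
# threshold is not stated in this comment.
THRESHOLD = 1.5

# Keep only the entries whose ratio exceeds the cutoff.
flagged = [
    (name, timing_ratio(first, second))
    for name, first, second in rows
    if timing_ratio(first, second) > THRESHOLD
]

for name, ratio in flagged:
    print(f"{name}: ratio {ratio}")
```

Many of the CPU entries at 4 and 8 threads show roughly reciprocal ratios (for example 1.90 and 0.52 for `batchedmm(128, Bsize=4)/zygote`), so a single large ratio on its own may reflect run-to-run variation rather than a real change.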