Commit
fix: update default rng for reactant (#1152)
* fix: update default rng for reactant
* feat: handle RNGs in layers correctly
Showing 5 changed files with 46 additions and 10 deletions.
367680b
@JuliaRegistrator register
367680b
Error while trying to register: Version 1.4.3 already exists
367680b
@JuliaRegistrator register subdir=lib/MLDataDevices
367680b
Registration pull request created: JuliaRegistries/General/122251
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
367680b
Lux Benchmarks
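Each benchmark entry in this report gives two timings in nanoseconds followed by their ratio. The following is a minimal sketch, assuming the first timing belongs to the current commit and the second to the previous baseline, so that the ratio is the first divided by the second (this matches the reported figures, e.g. 4042 / 4083.5 ≈ 0.99). The `flag_regressions` helper and its 1.10 threshold are hypothetical and chosen only for illustration; they are not part of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous; values above 1 mean the current run was slower."""
    return current_ns / previous_ns


def flag_regressions(rows, threshold=1.10):
    """Return the names of rows whose ratio exceeds the (hypothetical) threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


# Three entries copied from the report: (name, first timing, second timing).
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4042.0, 4083.5),
    ("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)", 322333.0, 218458.0),
    ("bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)", 667.0, 791.0),
]

for name, cur, prev in rows:
    print(f"{ratio(cur, prev):.2f}  {name}")

print(flag_regressions(rows))
```

With these three rows, only the second entry exceeds the illustrative 1.10 threshold. Sub-microsecond CPU timings fluctuate noticeably between runs, so a ratio on its own is not strong evidence of a regression.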
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4042
ns4083.5
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4125
ns4042
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4833.5
ns4917
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3958
ns3833
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60780
ns59941
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10500
ns11250
ns0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10333
ns10500
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10625
ns11541
ns0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10833
ns10958
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
423470
ns421187
ns1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1084
ns1167
ns0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1125
ns1250
ns0.90
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1416
ns1417
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1208
ns1167
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18313
ns17939
ns1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4042
ns4125
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4083
ns3958
ns1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4208
ns4292
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3625
ns4062.5
ns0.89
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
110716
ns108432
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57375
ns57333
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46292
ns46250
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46500
ns47041
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82709
ns82125
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37768
ns36736
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2006604.5
ns1991000.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2082209
ns2094313
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2011667
ns2094167
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2018937.5
ns1997041.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
196514.5
ns194384.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
141709
ns143854.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
144000
ns143125
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145187
ns147041
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144208
ns144750
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165424.5
ns165602
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1001541.5
ns1114896
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1118791.5
ns1128937.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1097124.5
ns1128792
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1141417
ns1114542
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
532439
ns526049
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3667
ns3458
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3542
ns3416
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
3917
ns4145.5
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3541.5
ns3584
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
71776.5
ns70040
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9042
ns8917
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9584
ns9042
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8500
ns9459
ns0.90
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9042
ns8917
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
486557
ns447136
ns1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15125
ns15041
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17792
ns17541.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16916.5
ns17625
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
15250
ns15917
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56432
ns54471
ns1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
214500
ns217417
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214625
ns213417
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
215333.5
ns214979.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
216041
ns225771
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
280343
ns270355
ns1.04
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
667
ns791
ns0.84
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
584
ns625
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
708
ns708
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
667
ns667
ns1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17273.5
ns17190
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1583
ns1500
ns1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1667
ns1500
ns1.11
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1667
ns1666
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1541
ns1500
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
103457
ns101385
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7000
ns7208
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5937.5
ns5916
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5709
ns5917
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9833
ns9875
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24396
ns23163
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
222750
ns223083
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
229041
ns228500
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
230041
ns230208
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213500
ns217000
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
171992
ns166961
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3916
ns3958
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4000
ns3958
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23948
ns23600
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16750
ns16792
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16583
ns16750
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17041
ns17041
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16916
ns17000
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
165565.5
ns161078
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
572458
ns577750
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
576208
ns572709
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
581250
ns574833
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
575042
ns575625
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113609
ns112893
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1419604
ns1420292
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1420333
ns1425209
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1421834
ns1426583
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1421062.5
ns1429020.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
216706.5
ns211317.5
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1089896
ns1077500
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
966312
ns960792
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1351792
ns1350854.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1307959
ns1298750
ns1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA
276909
ns273506
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5979271
ns6004937.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4608000
ns4547292
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4925667
ns4929708.5
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5767000
ns5555333
ns1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1097403.5
ns1074648
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns583
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23800
ns23430
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2167
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2084
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2084
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
174099
ns173597
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4209
ns4292
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4042
ns3750
ns1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5020.5
ns4917
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3667
ns3958
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66593
ns65160
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10958
ns11209
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11167
ns11250
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12083
ns12208
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11167
ns11125
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
455844
ns447745.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6583
ns6166
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6417
ns6375
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7562.5
ns8125
ns0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6333
ns6583
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
53149
ns52163
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17375
ns16750
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17250
ns18209
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18250
ns18500
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16458
ns17000
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
301789.5
ns298259.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
542
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
542
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
584
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33109.5
ns32532
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8542
ns8208
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8500
ns8667
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9375
ns9333
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8416.5
ns8083
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
161412.5
ns158900.5
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64666
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64583
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64459
ns64458
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64208
ns64375
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112066
ns111633.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
275959
ns274542
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
279333
ns287042
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
280167
ns274708
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
284791
ns280292
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
190816.5
ns186083
ns1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3359666.5
ns3329333
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3020708
ns3017229
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3019708
ns3024687.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4044937.5
ns3956250
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
582824
ns577429
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7633375
ns7623958
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7444749.5
ns7210334
ns1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7451687.5
ns7453270.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8276916.5
ns8209375
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1416070
ns1359043.5
ns1.04
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17541687.5
ns17513124.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17532229.5
ns17530146
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17547042
ns17518395.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14143625
ns14128813
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23437021
ns23645979.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33669000
ns33821104.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
36847792
ns37080041
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
35241729
ns34888834
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1852807
ns1866294
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
188072458
ns189046208
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
164284791
ns164619624.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152400917
ns152711479
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
434137916
ns436948083
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13886569
ns13894254.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
288796896
ns289373791
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
251588375
ns251042625
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
296639417
ns296809167
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
474281875
ns474994229.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22000
ns22250
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22625
ns24542
ns0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24250
ns23188
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21812.5
ns22417
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
98991
ns96027
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
104791
ns116584
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103292
ns113125
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104708
ns117833
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103625
ns103854
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
514494
ns510213
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5917
ns5833
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5834
ns5917
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6459
ns6812.5
ns0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6167
ns6292
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
69465
ns68158.5
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14417
ns14875
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15250
ns14812.5
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15459
ns14875
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14666
ns15042
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
483934.5
ns478636.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
2986042
ns3009146
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2014792
ns2061334
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2274354.5
ns2279208
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4589125
ns4871541.5
ns0.94
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
584502
ns589315.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23505916.5
ns23547375
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18035749.5
ns17982875.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16922042
ns16893209
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
34856104.5
ns34849958
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2763874
ns2772744
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33341541.5
ns33314834
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27602208
ns27464208
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27326333
ns27410208
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41263417
ns41078500
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72791.5
ns72375
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
73208
ns74375
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
83958
ns75166
ns1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
83208
ns75167
ns1.11
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
103702
ns102682
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
286979.5
ns286145.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
206625.5
ns210021.5
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
322750
ns315000
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
322333
ns218458
ns1.48
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
559306
ns553543
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11458.5
ns11875
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11666.5
ns11708
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12333
ns13334
ns0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11958
ns13125
ns0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
73645.5
ns71259
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26208.5
ns26833.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27000
ns26375
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27416
ns27417
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26645.5
ns25854.5
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
483328.5
ns477064.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11917
ns12041.5
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
14750
ns12229.5
ns1.21
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13708
ns13958
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12708
ns12584
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
54699.5
ns53895.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25375
ns25875
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25500
ns25834
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26333
ns26125
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27875
ns25667
ns1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
308185.5
ns305285
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
182041.5
ns179417
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
181583
ns179417
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
183167
ns181041
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
182167
ns180042
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
58753
ns58113
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
592604
ns590084
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
583041
ns585083
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
594209
ns591062.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
586791
ns584333
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
294181
ns289662.5
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6083
ns6083
ns1
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5958.5
ns5500
ns1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6833
ns7542
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6250
ns6604.5
ns0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
72095.5
ns70599
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14375
ns14291
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
13083
ns14209
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14791
ns14917
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14292
ns13062.5
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
473402.5
ns466681.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1210604.5
ns1223541.5
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1239854
ns1236625
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1297479
ns1285666.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1024875
ns1007959
ns1.02
batchedmm(512, Bsize=4)/forward/GPU/CUDA
300941
ns301986
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4097875.5
ns4226959
ns0.97
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4434062.5
ns4384249.5
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4563541
ns4572312.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3722313
ns3695104.5
ns1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1037751.5
ns1047036
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1791
ns1833
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1792
ns1792
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1834
ns1833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1833
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23494
ns24200
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4834
ns4875
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4834
ns4833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4917
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
188396
ns192268.5
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5625
ns5458
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5459
ns5542
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6500
ns6791.5
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5562.5
ns5792
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
54865
ns56595.5
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10583
ns10500
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10500
ns10416
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11125
ns11375
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10666
ns10875
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
324083
ns335979.5
ns0.96
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
292
ns334
ns0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns333
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
375
ns333
ns1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
ns334
ns0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22774
ns23172
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2708
ns2833
ns0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
ns2709
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2959
ns3042
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2708
ns2791
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
158123.5
ns162255.5
ns0.97
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11375 ns | 11084 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11083 ns | 11000 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12125 ns | 13563 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11542 ns | 11458 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 56425.5 ns | 58685.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24583 ns | 24542 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24667 ns | 24542 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24833.5 ns | 25167 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25250 ns | 25000 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 289503 ns | 298266 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4208 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4208 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4250 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4250 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24426.5 ns | 25307 ns | 0.97 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16417 ns | 16166 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16167 ns | 16292 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16334 ns | 16334 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16084 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 194624 ns | 199542 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5709 ns | 5709 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5917 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5834 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5834 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33182 ns | 33833 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20792 ns | 20292 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20645.5 ns | 20375 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20792 ns | 20875 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20417 ns | 20250 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 174846 ns | 178083 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 423688 ns | 420500 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 381917 ns | 372625 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 480521 ns | 482833 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 104125 ns | 103292 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66873.5 ns | 67723.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 934375 ns | 922417 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 984083 ns | 955208.5 ns | 1.03 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1186625 ns | 1180875 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 471042 ns | 379083 ns | 1.24 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189890.5 ns | 192988 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81458.5 ns | 136917 ns | 0.59 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80125 ns | 79854.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81104.5 ns | 82750 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136333 ns | 81167 ns | 1.68 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192847 ns | 194081 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918292 ns | 1915042 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1908625 ns | 1919750 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922750 ns | 1926125 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1953687.5 ns | 1915750 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 394765 ns | 401908.5 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21680 ns | 22364 ns | 0.97 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 167307.5 ns | 174295 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6625 ns | 6042 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6333 ns | 6500 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7375 ns | 7812.5 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6541 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 59094.5 ns | 61489.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8958 ns | 9000 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8959 ns | 8792 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9375 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9416 ns | 9459 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 303401 ns | 308375 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120415166.5 ns | 118419979.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173861833 ns | 173770000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147873916 ns | 148397083 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104464750 ns | 104919541 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5466659 ns | 5493586 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 607892187.5 ns | 611739750.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555380583 ns | 553521958 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 449180562.5 ns | 449841709 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 624687437 ns | 631089333.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34960099 ns | 38209825 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 655676042 ns | 652096250 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 664719854.5 ns | 661126562.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 586317000.5 ns | 580970687.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 854444125 ns | 848782167 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57541 ns | 58667 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47500 ns | 47500 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46625 ns | 48250 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85500 ns | 83625 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37532 ns | 37628 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919792 ns | 1919312.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1980000 ns | 1980333.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978083.5 ns | 1982541.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1915584 ns | 1895625 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 173336.5 ns | 176341 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 266563 ns | 266208 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 285125 ns | 265334 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 286313 ns | 288604 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267916 ns | 268167 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130327.5 ns | 130454.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 588541 ns | 664646 ns | 0.89 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 688375 ns | 671062.5 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 691667 ns | 665875 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 713875 ns | 597542 ns | 1.19 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 704236.5 ns | 690208 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2209792 ns | 2192312.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2211250 ns | 2179542 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2214666 ns | 2181333.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2251125 ns | 2207146 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133526 ns | 134808 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5473459 ns | 5469791 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5495771 ns | 5472958.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5506084 ns | 5499916 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5555625 ns | 5442583.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 758118 ns | 720984 ns | 1.05 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 641209 ns | 644667 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 638417 ns | 644084 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 648750 ns | 642042 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 647250 ns | 644167 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46678 ns | 47636.5 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1823542 ns | 1819917 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1728500 ns | 1720500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1721125 ns | 1721792 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2101541 ns | 2100000 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220988 ns | 224071 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58375 ns | 57667 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47291 ns | 46666 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46667 ns | 46583 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84417 ns | 83750 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28560 ns | 28795 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2021604 ns | 2029583 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2078542 ns | 2087375 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2089792 ns | 2087791.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2018458 ns | 1991416.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 188289 ns | 190320 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13165083 ns | 13371041.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12437062.5 ns | 12439187.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12496625 ns | 12491875 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15241708 ns | 15195833.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 511138.5 ns | 516777 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47044896 ns | 47119104.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41734229 ns | 41727062.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41006041 ns | 41051417 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58474250 ns | 58599458 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2887641 ns | 2892052.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74158583 ns | 74212666 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68293166 ns | 67877750 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90787478.5 ns | 90536499.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76120020.5 ns | 98549792 ns | 0.77 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58708 ns | 58375 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47417 ns | 46459 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47333 ns | 47708 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81500 ns | 83958 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48467.5 ns | 47165 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1906541 ns | 1919583.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1966979 ns | 1980791 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1972250 ns | 1979229.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1919083.5 ns | 1886958 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194955.5 ns | 193816.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31682 ns | 32624 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5979.5 ns | 5833 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 5959 ns | 6083 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6417 ns | 6416.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6250 ns | 5833 ns | 1.07 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 173280.5 ns | 171378.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31661 ns | 32204 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2583 ns | 2583 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2875 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2584 ns | 2625 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 162166.5 ns | 159764 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 285912791.5 ns | 286393770.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 341793875 ns | 340253500 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314064437.5 ns | 313806270.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 269291750 ns | 268566520.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7104649.5 ns | 7103110 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1013628833 ns | 1012043792 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 955735416 ns | 955581708 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 855387437.5 ns | 855297583 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1263250834 ns | 1259239875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33975753 ns | 33847341 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1379120562.5 ns | 1418325958.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1314342812 ns | 1338395020.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1634956500 ns | 1636087292 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1372311479 ns | 1775858125 ns | 0.77 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1410229 ns | 1409833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1415750 ns | 1414458.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1412896 ns | 1465562.5 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1460375 ns | 1413458.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127578 ns | 127951 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5011584 ns | 5027250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5015500 ns | 5036354 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5020521 ns | 5030437.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5052375 ns | 5027250.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 577903.5 ns | 479205.5 ns | 1.21 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 171180458 ns | 170869291 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 128541250 ns | 128735708 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 109850250 ns | 105431542 ns | 1.04 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 169107792 ns | 167706958 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4873683 ns | 4877746.5 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 624949333 ns | 511068334 ns | 1.22 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 491287250 ns | 490911792 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 454790833 ns | 385742875 ns | 1.18 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 648542167 ns | 650161000 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16059874 ns | 16340937 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8910395.5 ns | 9003042 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8995792 ns | 8983042 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7901000 ns | 7909375 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9817770.5 ns | 9604229.5 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593491 ns | 1611438.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35975583 ns | 36334167 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37440812.5 ns | 37265291.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33423291.5 ns | 33553354 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38560271 ns | 37555333 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6452757.5 ns | 6454550 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47625 ns | 47333 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47583 ns | 47500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47625 ns | 47625 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47375 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18605 ns | 18252 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50417 ns | 50666 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50625 ns | 50625 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50459 ns | 50250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 218596.5 ns | 164880 ns | 1.33 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6416 ns | 6417 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6625 ns | 6792 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7209 ns | 7583.5 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7000 ns | 6792 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 120537.5 ns | 76692.5 ns | 1.57 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 10125 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9583 ns | 9750 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10250 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10209 ns | 9875 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 676959 ns | 448214.5 ns | 1.51 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5584 ns | 5666 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6167 ns | 5791 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7146 ns | 7583 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5562.5 ns | 6042 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 144983 ns | 81735 ns | 1.77 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12875 ns | 13208 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13084 ns | 12709 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13875 ns | 13375 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12959 ns | 13417 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 555671 ns | 399198.5 ns | 1.39 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 959 ns | 959 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32054 ns | 32447 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7666 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7875 ns | 7708 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 7958 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7958.5 ns | 8166 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 215727.5 ns | 187787.5 ns | 1.15 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23166.5 ns | 23167 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23292 ns | 23209 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23458 ns | 23250 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23334 ns | 23292 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18589.5 ns | 18320.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52625 ns | 52917 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52500 ns | 52167 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52958 ns | 52917 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52333 ns | 52875 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 299146 ns | 214503.5 ns | 1.39 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1401500 ns | 1398125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1396145.5 ns | 1402146 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1398562.5 ns | 1406437.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1435792 ns | 1448937.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195172 ns | 196187.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5009646 ns | 5003458 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4800875 ns | 5029708 ns | 0.95 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5005896 ns | 5015042 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5025041.5 ns | 5005729.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 612010.5 ns | 509817 ns | 1.20 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3032250 ns | 3051834 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072292 ns | 2076520.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2300667 ns | 2302500 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4921042 ns | 4658291.5 ns | 1.06 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580134 ns | 581685 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24343228.5 ns | 24315708 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18906020.5 ns | 18877250 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17758521.5 ns | 17822166 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35734042 ns | 35790999.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2830179 ns | 2842698 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33956916.5 ns | 33982916.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28347958 ns | 28228208.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28079666 ns | 27940958 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42065000 ns | 41757334 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144437916 ns | 143078500 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 147635291 ns | 146668125 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 125109916 ns | 127355624.5 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173674875 ns | 171841729.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22545545 ns | 22550146 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 908256562.5 ns | 1234730083.5 ns | 0.74 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1584608041.5 ns | 1060723417 ns | 1.49 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 749118208 ns | 1027004875 ns | 0.73 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 669868292 ns | 674561583 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118395391 ns | 117659213 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 81333 ns | 74125 ns | 1.10 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75042 ns | 73146 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77166 ns | 76000 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73625 ns | 85834 ns | 0.86 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 243285.5 ns | 175925 ns | 1.38 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287145.5 ns | 215750 ns | 1.33 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 285833 ns | 192541.5 ns | 1.48 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 283104.5 ns | 284542 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 279041 ns | 285708 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1239705 ns | 952026.5 ns | 1.30 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35487666 ns | 35486000 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36325875 ns | 36428646.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32416604 ns | 32475229 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40654875 ns | 40408041.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5840513 ns | 5831517 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 146753459 ns | 146000771 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153140083.5 ns | 154808750 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135055542 ns | 137043083.5 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 286267791 ns | 285556542 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34875869 ns | 34852076.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120929708.5 ns | 121592083 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174008000 ns | 174639125 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147856792 ns | 148027541 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 102357166.5 ns | 105917833 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5458379 ns | 5344344 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 472290792 ns | 468650958 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 468203875 ns | 466713000 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437903521 ns | 437158458 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 743156542 ns | 744371959 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32279044 ns | 35992005 ns | 0.90 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 709215666.5 ns | 712765167 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 641585354.5 ns | 641204167 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 623424125.5 ns | 624084979.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 853935458 ns | 856208084 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1289084 ns | 1270583 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 912625 ns | 995709 ns | 0.92 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 959625 ns | 995875 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2066167 ns | 2037625 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 576350.5 ns | 569478 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2954792 ns | 2961229.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2624645.5 ns | 2647792 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2616708 ns | 2621500 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3750458 ns | 3709750 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1708662 ns | 1587708.5 ns | 1.08 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5780625 ns | 5785812.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5802646 ns | 5824083 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5793708 ns | 5785375 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2916792 ns | 2904896 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7250 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6042 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10042 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24959.5 ns | 24479.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212666.5 ns | 223812.5 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219979.5 ns | 222667 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220458 ns | 220792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 244353.5 ns | 240666 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 249958 ns | 212315.5 ns | 1.18 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 296320791 ns | 296229125 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 216911667 ns | 216728584 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 196230687 ns | 190254604.5 ns | 1.03 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 303909375 ns | 304954521 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7672082.5 ns | 7671461.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1231911312.5 ns | 1229817167 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 900530270.5 ns | 902846291.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 828047958 ns | 824304209 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1151206292 ns | 1157856750.5 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26738113 ns | 26996841 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4833 ns | 5292 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5500 ns | 5291.5 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6375 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5000 ns | 5250 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 149363.5 ns | 112898 ns | 1.32 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7041 ns | 6875 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 6958 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7541 ns | 7583 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 7125 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 600699 ns | 535221.5 ns | 1.12 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 584 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23466 ns | 23660 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8667 ns | 8625 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 9084 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9417 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9125 ns | 8708 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 211340 ns | 195936.5 ns | 1.08 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 368458 ns | 352958.5 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351459 ns | 352792 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352500 ns | 351479 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352146 ns | 356708.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21302 ns | 20962 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 826271 ns | 775625 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 824958.5 ns | 825833 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 792000 ns | 812229.5 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 830250.5 ns | 834959 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 269586 ns | 234827 ns | 1.15 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 340937.5 ns | 341562.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 343062.5 ns | 341958 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 454770.5 ns | 455917 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 14084 ns | 11083 ns | 1.27 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17990 ns | 17699 ns | 1.02 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 710583 ns | 712500 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 728458 ns | 739896 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1004208 ns | 1007854 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 27417 ns | 26459 ns | 1.04 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 239886 ns | 214680.5 ns | 1.12 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 383166.5 ns | 381042 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 350542 ns | 346750 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 443208 ns | 449187.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 31250 ns | 39042 ns | 0.80 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22514 ns | 22537 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 718250 ns | 733792 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 782083 ns | 788958 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1028417 ns | 1032500 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 105334 ns | 105583 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 217107 ns | 200835.5 ns | 1.08 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3333 ns | 3791 ns | 0.88 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3541 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3708 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3417 ns | 3708 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17516 ns | 17542 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4104.5 ns | 4250 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4208 ns | 4167 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4291 ns | 4250 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4166 ns | 4250 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 232485 ns | 204574.5 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3333 ns | 3834 ns | 0.87 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4084 ns | 4250 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4250 ns | 3625 ns | 1.17 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 176024.5 ns | 160115.5 ns | 1.10 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8291 ns | 8292 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8166 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8458 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8542 ns | 8333 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1051146 ns | 989699 ns | 1.06 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204709 ns | 203375 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210709 ns | 212791 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210583 ns | 210666 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199833.5 ns | 200834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34425 ns | 34428 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 647229 ns | 652624.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 649666.5 ns | 622667 ns | 1.04 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 626208 ns | 631604.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 640479.5 ns | 632750 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 293508 ns | 280400.5 ns | 1.05 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 993750 ns | 994229.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1020395.5 ns | 1040292 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 958396 ns | 956020.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 887291 ns | 853917 ns | 1.04 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206487.5 ns | 208023.5 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4504792 ns | 4502437.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4702583.5 ns | 4668229.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4449000 ns | 4455084 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4321500 ns | 4280937 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 979904 ns | 935555 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3167 ns | 3292 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3541 ns | 3458 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4166 ns | 4042 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3333.5 ns | 3209 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 174711 ns | 159049 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7042 ns | 7291 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7042 ns | 7333 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7334 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7083 ns | 6833 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 911927 ns | 850635.5 ns | 1.07 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1650250 ns | 1640041 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1195333 ns | 1196604.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1375625 ns | 1383250 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2471000 ns | 2417500 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213276 ns | 215018 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12340062 ns | 12333396 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9568500 ns | 9592791.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9298896 ns | 9267625 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18088041 ns | 18011459 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1943838 ns | 1959459 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17384833.5 ns | 17332937.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14357854 ns | 14386792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14387313 ns | 14369396.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21175104 ns | 21112291.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 100083 ns | 87708 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 87750 ns | 88542 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93416.5 ns | 92833 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89625 ns | 116000 ns | 0.77 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125990 ns | 126352.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026687.5 ns | 2022959 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2031083.5 ns | 2049666 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2031250 ns | 2035562.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2050458.5 ns | 2025938 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 951363 ns | 878938 ns | 1.08 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2979 ns | 2750 ns | 1.08 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2875 ns | 3209 ns | 0.90 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3520.5 ns | 3417 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2521 ns | 2792 ns | 0.90 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16207 ns | 16283 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2666.5 ns | 2542 ns | 1.05 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2500 ns | 2708 ns | 0.92 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2875 ns | 2875 ns | 1 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2959 ns | 2834 ns | 1.04 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 179422.5 ns | 176848 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7083 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6000 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6041 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10042 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33838 ns | 34134 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225292 ns | 221583 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219750 ns | 220000 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220542 ns | 220417 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 244708 ns | 215333 ns | 1.14 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 293649.5 ns | 285763.5 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3709 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22219 ns | 22875 ns | 0.97 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14417 ns | 14500 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14375 ns | 14375 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14625 ns | 14458 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14583 ns | 14500 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 436265 ns | 410580 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 140000 ns | 92125 ns | 1.52 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 92458 ns | 92916 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96792 ns | 96979 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 96792 ns | 138000 ns | 0.70 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125211.5 ns | 125660 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921583.5 ns | 1923792 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1923937.5 ns | 1935291 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928188 ns | 1932916.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1942771 ns | 1920500 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 855373 ns | 861874.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 874041 ns | 873916 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 820458 ns | 826583 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1223417 ns | 1222000 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 972500 ns | 963750 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 272168 ns | 276546 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2804167 ns | 2791083 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2520875 ns | 2445687.5 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3337667 ns | 3347916 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3424895.5 ns | 3371375 ns | 1.02 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1501496.5 ns | 1487194.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16791.5 ns | 17250 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 14854.5 ns | 17959 ns | 0.83 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18375 ns | 17875 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15229 ns | 17417 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131230 ns | 130892 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 227959 ns | 218625 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 250729 ns | 260667 ns | 0.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216125 ns | 227792 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 262791 ns | 256083 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 582129.5 ns | 584591.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222062.5 ns | 222000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 219125 ns | 222667 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222041.5 ns | 222312.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221584 ns | 220833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 244344.5 ns | 243596.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 508270.5 ns | 501417 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 521083 ns | 496084 ns | 1.05 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 498833 ns | 508541.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 565541.5 ns | 561833 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1195773 ns | 1202534 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4479.5 ns | 3895.5 ns | 1.15 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 3583.5 ns | 4270.5 ns | 0.84 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 4750 ns | 5708 ns | 0.83 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4625 ns | 4458.5 ns | 1.04 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16818 ns | 16584 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7208 ns | 7208.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7250 ns | 7000 ns | 1.04 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7333 ns | 7625 ns | 0.96 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7458.5 ns | 7500 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 180977.5 ns | 179332 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18583 ns | 17687 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17583.5 ns | 17917 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19958.5 ns | 18625 ns | 1.07 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17333 ns | 18729 ns | 0.93 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132074.5 ns | 135434 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212166 ns | 211041 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212146 ns | 220417 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212917 ns | 212542 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218959 ns | 212271 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 814362 ns | 847267 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4042 ns | 3959 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4208 ns | 4209 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5000 ns | 4875 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4000 ns | 4291 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 175168.5 ns | 187480.5 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10250 ns | 10459 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9687.5 ns | 10541.5 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11083 ns | 10042 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10125 ns | 1 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 961404 ns | 955985 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3041.5 ns | 3145.5 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3291 ns | 2937.5 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4000 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3416.5 ns | 3167 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 193655 ns | 188520.5 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7208.5 ns | 7375 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7209 ns | 7209 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7542 ns | 7625 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7458 ns | 7333 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 972220 ns | 987324 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23356708 ns | 23406938 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34480833.5 ns | 35765125 ns | 0.96 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37583875 ns | 37705500 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35001895.5 ns | 34946604 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1828165 ns | 1830206.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184126958 ns | 183995333 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 166867125 ns | 165575375 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146311896 ns | 146468292 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 275288375 ns | 274483625 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16524063 ns | 16521685 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 276685520.5 ns | 276817937 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 252606729 ns | 246377395.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231173396 ns | 231576042 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 324261749.5 ns | 325032833.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184542 ns | 182896.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182833 ns | 184292 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185583 ns | 184958 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184895.5 ns | 183167 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 166499.5 ns | 200810.5 ns | 0.83 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 634000 ns | 635333 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585209 ns | 633354.5 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 592708.5 ns | 600291 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630958 ns | 597271 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 926373.5 ns | 958799 ns | 0.97 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3858042 ns | 3842750 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3914708 ns | 3997500 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3549917 ns | 3542792 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4595104.5 ns | 4556625 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532803 ns | 532425 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17337937.5 ns | 17396104 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17877583 ns | 18078958 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16422125 ns | 16589917 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20130416.5 ns | 19981167 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2619405 ns | 2633170 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32935 ns | 32094 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8958 ns | 8917 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 8750 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 9041 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9209 ns | 9042 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 248903 ns | 249030 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 649671041.5 ns | 652464437.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 390100166.5 ns | 394034604 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 355146542 ns | 326393417 ns | 1.09 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 750210500 ns | 748745833 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12471745.5 ns | 12466975 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1883695042 ns | 1885107791.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1646365041 ns | 1638827875 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1513696187.5 ns | 1512914354 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2208789146 ns | 2208603583.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49495223 ns | 49231175.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1642208 ns | 1616792 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1192812.5 ns | 1200917 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1386104 ns | 1389625 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2519667 ns | 2477916.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215937.5 ns | 215338 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12672750 ns | 12691834 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9911875 ns | 9979354.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9658417 ns | 9689896 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18448708.5 ns | 18371271 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1992558.5 ns | 1985308 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17681874.5 ns | 17676916 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14694333 ns | 14722000 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14589750 ns | 14613667 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21582250 ns | 21413395.5 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26291 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26667 ns | 26250 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26291 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23957 ns | 23721 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66959 ns | 67333 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67750 ns | 67333 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67250 ns | 67209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67459 ns | 67333 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 371563.5 ns | 367128.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203875 ns | 203542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209500 ns | 208625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209125 ns | 209584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200459 ns | 199792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26219 ns | 25494 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 647500 ns | 604625 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 669416.5 ns | 670666.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 685542 ns | 632166.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 632166.5 ns | 630000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324278 ns | 321975.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 675000 ns | 639021 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 541042 ns | 643458 ns | 0.84 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 637375 ns | 658750 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 666542 ns | 632750 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132249.5 ns | 131332 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2232250 ns | 2244229 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2239333.5 ns | 2277708.5 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2241084 ns | 2240167 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2299271.5 ns | 2235458.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1091764 ns | 1075922 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17833 ns | 17167 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17917 ns | 17916 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20584 ns | 18167 ns | 1.13 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18709 ns | 18208 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 133803 ns | 130720.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260333 ns | 258584 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 255395.5 ns | 227459 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 253687.5 ns | 232750 ns | 1.09 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 230479 ns | 230791 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 901721 ns | 887768.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
541
ns625
ns0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
667
ns625
ns1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
666
ns666
ns1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
542
ns542
ns1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
23720
ns23104
ns1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
8333.5
ns9750
ns0.85
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
9666
ns9250
ns1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
10208
ns9208
ns1.11
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9750
ns9417
ns1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
244421
ns242418
ns1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5125
ns5208
ns0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5750
ns5125
ns1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6584
ns6375
ns1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5125
ns5375
ns0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
195651
ns193804
ns1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7083
ns7167
ns0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7375
ns7250
ns1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7750
ns7375
ns1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7875
ns7042
ns1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
711373.5
ns706410
ns1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
2041
ns2125
ns0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
2250
ns2250
ns1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
2250
ns2209
ns1.02
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
2208
ns2208
ns1
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA
18128
ns17672
ns1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
6542
ns6458
ns1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6542
ns6291
ns1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6625
ns6709
ns0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
6417
ns6500
ns0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA
296966
ns300575
ns0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
751937.5
ns749459
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
746542
ns748959
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
750125
ns750854
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
751833.5
ns749167
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21365
ns20805
ns1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
811458
ns775208
ns1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
810958
ns795916.5
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
790958
ns792791
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
813167
ns792792
ns1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
271261
ns274546.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7334
ns7208
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5958
ns5917
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5917
ns5959
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10250
ns10250
ns1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
33874
ns33244
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
258396
ns219625
ns1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
269104
ns240291
ns1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
253416
ns237583
ns1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
245208
ns260042
ns0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
333723
ns337443
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10250
ns10084
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10334
ns9583
ns1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10625
ns10750
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10250
ns10167
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
213790.5
ns223296.5
ns0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24583
ns25125
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24500
ns24312.5
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24792
ns24917
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24916
ns24667
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1032950.5
ns1047460.5
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
107140583
ns106018062.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
117792062
ns118144520.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
120863042
ns120409292
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
117603375
ns117468833
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2946778
ns2652084
ns1.11
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
393794791.5
ns373672500
ns1.05
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
359678396
ns359102771.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
357838334
ns356068521.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
545418083.5
ns543525042
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15489580
ns15230726
ns1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
607837250
ns605345333
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
579716416
ns584604208
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
747642396
ns744606604.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
607166334
ns793208583.5
ns0.77
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7292
ns6500
ns1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6958
ns6375
ns1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7625
ns8062
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6834
ns7146
ns0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
206235.5
ns216878
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13709
ns13625
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14167
ns13625
ns1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14500
ns14125
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14292
ns14084
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
968613
ns1010131
ns0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5625
ns5625
ns1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6250
ns6000
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
6875
ns7895.5
ns0.87
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5750
ns5958
ns0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
204166
ns211472.5
ns0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12625
ns12583
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12583
ns12333
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13000
ns12708
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12292
ns12709
ns0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
694587
ns725788
ns0.96
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5917
ns5583
ns1.06
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5458
ns5875
ns0.93
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
5875
ns6583.5
ns0.89
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5958
ns6167
ns0.97
batchedmm(2, Bsize=128)/forward/GPU/CUDA
16951
ns17002
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15583
ns15916
ns0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15375
ns15250
ns1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15625
ns16125
ns0.97
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15708
ns15834
ns0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
185517
ns187784.5
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
333
ns375
ns0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
333
ns334
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
22862.5
ns23531
ns0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6209
ns6167
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6208
ns6292
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6542
ns6459
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6375
ns6084
ns1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
223995
ns228744
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5834
ns5834
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5917
ns5916
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6000
ns5959
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5833
ns5959
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
23989
ns24273
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
20833
ns20833
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
20583
ns20750
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21625
ns21292
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21375
ns21041
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
246983.5
ns251207.5
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
169125
ns185375
ns0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
144292
ns144625
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
148291.5
ns147917
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
189062.5
ns144417
ns1.31
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166865
ns166909.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1326271
ns1321833
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1323042
ns1350479
ns0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1320500
ns1337166
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1341500
ns1323625
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1189366
ns1251196
ns0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23000
ns24833
ns0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23479
ns25041
ns0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24875
ns23958
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24750
ns24271
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
254630.5
ns315591
ns0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
130167
ns131292
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
128375
ns118396
ns1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
123229
ns176916
ns0.70
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
131062.5
ns129458
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1279498
ns1353120
ns0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns333
ns0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
375
ns417
ns0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns292
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23209
ns23127
ns1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6042
ns6125
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6416
ns6459
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6792
ns6333
ns1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6458
ns6125
ns1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
238830
ns245064.5
ns0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4333
ns4208
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4542
ns4875
ns0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4708
ns5125
ns0.92
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4791
ns4667
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
217579.5
ns228957.5
ns0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9666
ns9875
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10042
ns9875
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10125
ns10334
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10208
ns10208
ns1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1231902.5
ns1285818.5
ns0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1584
ns1584
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1584
ns1625
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
22989
ns23344
ns0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5667
ns5750
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5625
ns5709
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6041
ns6000
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5750
ns5666
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
258706.5
ns264086.5
ns0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6877625
ns6807541.5
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6431167
ns6433375
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6497166
ns6489875
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7600437.5
ns7649521
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
213793
ns214938
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24074875
ns24073959
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21241875
ns21296000
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21023583.5
ns21044062.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29822125.5
ns29805771
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2088714.5
ns2104181
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37413209
ns37247625
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
34256250
ns34089791
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45704562.5
ns45725979.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
38148271
ns49397750
ns0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5416
ns5500
ns0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6104.5
ns5708
ns1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6667
ns6541
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6167
ns5708
ns1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
206549
ns208256
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7917
ns8084
ns0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8229.5
ns8125
ns1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8584
ns8375
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8542
ns8375
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
962776
ns991485
ns0.97
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1560583
ns1509000
ns1.03
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1259145.5
ns1282542
ns0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1626291.5
ns1634916.5
ns0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2161625
ns2162000.5
ns1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA
280818.5
ns271116.5
ns1.04
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7902229
ns7902209
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6567125
ns6449312.5
ns1.02
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7147750
ns7195708
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10485771
ns10462229
ns1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1771472.5
ns1752716.5
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
373687.5
ns371187.5
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
370583
ns374208
ns0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
462021
ns461250
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
23584
ns22208
ns1.06
batchedmm(128, Bsize=4)/forward/GPU/CUDA
45539
ns42428.5
ns1.07
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
728750
ns745437.5
ns0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
804208.5
ns815833
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1065312.5
ns1062958
ns1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
96666.5
ns117396
ns0.82
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
226465
ns283256.5
ns0.80
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397333
ns397208
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
288042
ns288667
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288417
ns287875
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
751375
ns750917
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44356
ns43636
ns1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
672167
ns667000
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
531292
ns531375
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
528292
ns531417
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
975666
ns974083
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
193617.5
ns188745
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
669291
ns644833
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
642666
ns648750
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
644708.5
ns644479
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
687208
ns652458.5
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
132960
ns131347.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2454209
ns2445334
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2456687
ns2500021
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2455291
ns2463250
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2470521
ns2463375
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1122477
ns1238313
ns0.91
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
3541
ns3417
ns1.04
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
3208
ns3625
ns0.88
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
4458
ns4250
ns1.05
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
2958
ns3437.5
ns0.86
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16816
ns16066
ns1.05
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5292
ns5375
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5333
ns5292
ns1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5625
ns5750
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5750
ns5583
ns1.03
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
187435
ns182995
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1458000
ns1458042
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1498250
ns1499750
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1497083
ns1503250
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1439583
ns1437708
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
40900
ns40191
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5127041
ns5113291
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5298083.5
ns5287958
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5287583
ns5307041.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5015875
ns4985125
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
198989
ns196599
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3708
ns3709
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3708
ns3708
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3709
ns3709
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3709
ns3709
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
34297
ns33557
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15125
ns15125
ns1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15083.5
ns15167
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15375
ns15416
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15166
ns15208
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
348507
ns349206
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
71250
ns71125
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
71333
ns71542
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
70959
ns71209
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
71209
ns71041
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113569.5
ns113114
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
317792
ns317667
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
319125
ns324125
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
319500
ns318292
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
319875
ns317625
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
197937.5
ns193277
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
959
ns958
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1000
ns1041
ns0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1084
ns1083
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
1000
ns1125
ns0.89
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23702
ns23048
ns1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7500
ns7750
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7750
ns8270.5
ns0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8334
ns8250
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7958
ns8041
ns0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
249887
ns245757.5
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
504875
ns502770.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
484208
ns484500
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
564708
ns561750
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
236458
ns219917
ns1.08
batchedmm(128, Bsize=32)/forward/GPU/CUDA
130159
ns129178
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1379479.5
ns1387645.5
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1446458.5
ns1473958
ns0.98
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1730646
ns1779041.5
ns0.97
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
884667
ns862917
ns1.03
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
273315.5
ns273950
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
333
ns333
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
333
ns334
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
416
ns416
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
334
ns333
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
32089
ns31657.5
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6083
ns6125
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6000
ns6208
ns0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6500
ns6541
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6083
ns6042
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
250296.5
ns251419
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1723562.5
ns1733792
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1725958.5
ns1721208
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1731208
ns1724250
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1767667
ns1773541
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168954.5
ns168671
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4352187.5
ns4114542
ns1.06
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4302209
ns4392834
ns0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4360250
ns4368208.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4366750
ns4369208.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1065222
ns1291475.5
ns0.82
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6916
ns6834
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
6750
ns6667
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
6875
ns7999.5
ns0.86
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6958
ns7041
ns0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
20747
ns20138.5
ns1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
67792
ns51250
ns1.32
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
48292
ns32625
ns1.48
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
32958
ns73833
ns0.45
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
51583
ns51084
ns1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
198224
ns340107
ns0.58
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
18375
ns17833
ns1.03
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
17625
ns18083
ns0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
18542
ns18875
ns0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
18291
ns18208
ns1.00
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18190
ns18400
ns0.99
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
53292
ns53250
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
53541
ns53041
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
53500
ns53375
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
53500
ns53542
ns1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
306993
ns319083.5
ns0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
75375
ns75166
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
75208
ns75625
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
75000
ns75291.5
ns1.00
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75458 ns | 75083 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46432 ns | 47469 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 323792 ns | 324958 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 324916 ns | 342000 ns | 0.95 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 325000 ns | 325000 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 327375 ns | 324542 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 209114 ns | 211595 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485167 ns | 1484959 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1524792 ns | 1526854.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1525000 ns | 1527250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466042 ns | 1462542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51777 ns | 51799 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5115209 ns | 5111083.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5290000 ns | 5312417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5261979.5 ns | 5299333.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5012167 ns | 4982354 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202581 ns | 204934 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28208 ns | 28208 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28208 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28187.5 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24112 ns | 24742 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66333 ns | 66500 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66375 ns | 66709 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66667 ns | 66500 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67041 ns | 66541 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 467729 ns | 484630.5 ns | 0.97 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1491583.5 ns | 1480583.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1128834 ns | 1136563 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1128084 ns | 1136750 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2260833.5 ns | 2265937.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 577757.5 ns | 579622.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3056208 ns | 3074562.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2732395.5 ns | 2788145.5 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2734709 ns | 2743021 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3843875 ns | 3819500.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1892225.5 ns | 1931643 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7896000 ns | 7902458 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7928041.5 ns | 7834062.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7897562.5 ns | 7920375 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4840958 ns | 4826312.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81709 ns | 77625 ns | 1.05 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81062.5 ns | 81167 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85084 ns | 84041.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90541 ns | 111396 ns | 0.81 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194858.5 ns | 193746 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2012792 ns | 2012875 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2022916.5 ns | 2046292 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2012625 ns | 2031354 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2042500 ns | 2015417 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 690147 ns | 746361.5 ns | 0.92 |
This comment was automatically generated by workflow using github-action-benchmark.
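
For readers checking these numbers by hand: each benchmark entry above lists two timings in nanoseconds followed by a ratio, and the ratio appears to be the first timing divided by the second, rounded to two decimals (values below 1 mean the first timing is faster). The sketch below reproduces that arithmetic for a few rows copied from the results above. The helper names `ratio` and `flag_slowdowns` and the 1.05 cutoff are hypothetical, chosen only for illustration; they are not part of github-action-benchmark.

```python
# Rows copied from the benchmark results above:
# (benchmark name, first timing in ns, second timing in ns).
ROWS = [
    ("dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)", 75458.0, 75083.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)", 81709.0, 77625.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)", 90541.0, 111396.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA", 690147.0, 746361.5),
]


def ratio(first_ns, second_ns):
    """Return first/second rounded to two decimals (assumed definition of the ratio)."""
    return round(first_ns / second_ns, 2)


def flag_slowdowns(rows, threshold=1.05):
    """Return names whose unrounded ratio meets a hypothetical slowdown threshold."""
    return [name for name, first, second in rows if first / second >= threshold]


if __name__ == "__main__":
    for name, first, second in ROWS:
        print(f"{ratio(first, second):.2f}  {name}")
    print("flagged:", flag_slowdowns(ROWS))
```

With the rows above, only the groupnorm forward benchmark on 2 CPU threads reaches the illustrative 1.05 cutoff; a difference that small may well be within run-to-run noise.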