CompatHelper: bump compat for Flux in [weakdeps] to 0.15, (keep existing compat) #1124

github-actions bot commented Dec 6, 2024

This pull request changes the compat entry for the Flux package from 0.14.25 to 0.14.25, 0.15.
This keeps the compat entries for earlier versions.
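
To make the new entry concrete: under Julia's default caret-style compat semantics, `0.14.25` admits versions in [0.14.25, 0.15.0) and `0.15` admits [0.15.0, 0.16.0), so the combined entry `0.14.25, 0.15` admits [0.14.25, 0.16.0). The sketch below models only that plain comma-separated form in Python. The function names `caret_range` and `allows` are hypothetical and are not part of Pkg or CompatHelper, and tilde, inequality, and hyphen specifiers are deliberately not handled.

```python
def parse_version(text):
    """Parse 'X', 'X.Y' or 'X.Y.Z' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def caret_range(spec):
    """Return half-open (lower, upper) bounds for one caret-style specifier.

    Simplified model: the upper bound bumps the leftmost nonzero component
    that was written; if every written component is zero, it bumps the last
    one that was written.
    """
    given = [int(p) for p in spec.strip().split(".")]
    lower = tuple(given + [0] * (3 - len(given)))
    index = len(given) - 1
    for i, component in enumerate(given):
        if component != 0:
            index = i
            break
    upper = list(lower[: index + 1])
    upper[index] += 1
    upper += [0] * (3 - len(upper))
    return lower, tuple(upper)


def allows(compat, version):
    """True if `version` falls inside any comma-separated specifier of `compat`."""
    v = parse_version(version)
    for spec in compat.split(","):
        lower, upper = caret_range(spec)
        if lower <= v < upper:
            return True
    return False


if __name__ == "__main__":
    entry = "0.14.25, 0.15"  # the compat entry proposed in this pull request
    for candidate in ["0.14.24", "0.14.25", "0.15.0", "0.15.2", "0.16.0"]:
        print(candidate, allows(entry, candidate))
```

Read this way, the change widens the accepted Flux range without dropping the existing 0.14.25 lower bound, which is what "keep existing compat" in the title refers to.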

Note: I have not tested your package with this new compat entry.
It is your responsibility to make sure that your package tests pass before you merge this pull request.

avik-pal force-pushed the compathelper/new_version/2024-12-06-00-19-20-030-01136868030 branch from fa03127 to 325951b on December 6, 2024 00:19
github-actions bot commented Dec 6, 2024

Benchmark Results (ASV)

| Benchmark | main | 325951b... | main/325951ba971b51... |
|---|---|---|---|
| basics/overhead | 0.123 ± 0.0011 μs | 0.124 ± 0.0042 μs | 0.996 |
| time_to_load | 1.26 ± 0.0027 s | 1.26 ± 0.0065 s | 1 |
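
The last column, labeled `main/325951ba971b51...`, is the `main` measurement divided by the measurement on this pull request's commit. A minimal Python sketch of that arithmetic follows, assuming each cell has the form `value ± spread unit` as shown above. The helper names `parse_measurement` and `ratio` are hypothetical and do not come from the benchmarking tool.

```python
def parse_measurement(cell):
    """Split a cell such as '0.123 ± 0.0011 μs' into (value, spread, unit)."""
    value_text, rest = cell.split("±")
    spread_text, unit = rest.split()
    return float(value_text), float(spread_text), unit


def ratio(main_cell, pr_cell):
    """Return the main value divided by the pull request value, same unit required."""
    main_value, _, main_unit = parse_measurement(main_cell)
    pr_value, _, pr_unit = parse_measurement(pr_cell)
    if main_unit != pr_unit:
        raise ValueError(f"unit mismatch: {main_unit} vs {pr_unit}")
    return main_value / pr_value


if __name__ == "__main__":
    rows = {
        "basics/overhead": ("0.123 ± 0.0011 μs", "0.124 ± 0.0042 μs"),
        "time_to_load": ("1.26 ± 0.0027 s", "1.26 ± 0.0065 s"),
    }
    for name, (main_cell, pr_cell) in rows.items():
        print(name, round(ratio(main_cell, pr_cell), 3))
```

Recomputing from the rounded cells gives roughly 0.992 for `basics/overhead` rather than the reported 0.996. The likely explanation is that the tool divides the unrounded measurements, though the report itself does not say so. In both rows the difference is within the reported spread, so neither indicates a measurable change.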

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: 325951b Previous: ef0d450 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4187.5 ns 4208 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 4834 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5417 ns 5375 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4458 ns 4083 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61500 ns 58557 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10500 ns 10625 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10542 ns 10542 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10541 ns 11375 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10292 ns 10083 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 434414 ns 415171 ns 1.05
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1083 ns 1334 ns 0.81
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1125 ns 1209 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1250 ns 1333.5 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1250 ns 1208 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18684 ns 17961 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3791 ns 4084 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3958 ns 3959 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4292 ns 4333 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4125 ns 4000 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 112466 ns 107003.5 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 71000 ns 70834 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 64000 ns 64375 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 63958 ns 64500 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78459 ns 80375 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38059 ns 36906 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2053104.5 ns 2031562.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2089167 ns 2088542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2094875 ns 2093958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997229.5 ns 1926833 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198929 ns 192315 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 193958.5 ns 196625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 182917 ns 195542 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 189667 ns 185209 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 183062.5 ns 182375 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166083 ns 166552 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1146437.5 ns 1111896 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1124750 ns 1118729.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1124917 ns 1119708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1114208.5 ns 1130333.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 536148.5 ns 514050 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3604.5 ns 3500 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4042 ns 3416 ns 1.18
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4145.5 ns 4459 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3416.5 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 68781 ns 67303.5 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 9084 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9458 ns 9750 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8875 ns 9625 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 8625 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 500066 ns 472568 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15333.5 ns 15020.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15833 ns 14666 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17041.5 ns 18625 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15083.5 ns 14875 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56024 ns 53079 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216375 ns 224750 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215271 ns 215104.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214417 ns 215917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214479 ns 215083 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 278999 ns 267364.5 ns 1.04
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 584 ns 750 ns 0.78
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 666 ns 709 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 750 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 812.5 ns 750 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17806 ns 17115 ns 1.04
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1542 ns 1500 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1792 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1750 ns 1500 ns 1.17
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 105021 ns 99326.5 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 9334 ns 7833 ns 1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 8125 ns 7291 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7958 ns 7083 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 9958 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24322 ns 23212 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221437.5 ns 233458.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 236500 ns 228125 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228459 ns 228666 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 228125 ns 214125 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 173657.5 ns 164950.5 ns 1.05
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3875 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23898.5 ns 23508 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16750 ns 16959 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16917 ns 17042 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17583 ns 17083 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16917 ns 16708 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 166893 ns 160457.5 ns 1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 610520.5 ns 611125 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 601167 ns 609042 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 604333 ns 606834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 605666 ns 605520.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113814 ns 113172 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1429334 ns 1423834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1422125 ns 1422458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1427166 ns 1424292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422708 ns 1420334 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 217800 ns 209423.5 ns 1.04
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1075458 ns 1082229.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 969416.5 ns 970792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1352521 ns 1346208 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1319166.5 ns 1300333 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 282523.5 ns 270348.5 ns 1.05
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6011438 ns 5996021 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4574438 ns 4506125 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4957125 ns 4914416 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5725208.5 ns 5507375 ns 1.04
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1113189 ns 1074060 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 24154 ns 23487 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2291 ns 2167 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 180828 ns 168855 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4500 ns 4167 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4666 ns 4334 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5000 ns 5041 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4229.5 ns 3667 ns 1.15
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 67149 ns 64100 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11625 ns 11291 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11666 ns 11875 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11875 ns 12291 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10916.5 ns 11000 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 464242.5 ns 442842 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6625 ns 6042 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6583 ns 6104.5 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7166 ns 7209 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5917 ns 5708 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53689.5 ns 51573 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17791 ns 17041.5 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18270.5 ns 17292 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18937.5 ns 17625 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16875 ns 17250 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 312733.5 ns 299598.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33904.5 ns 32513 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8917 ns 8458 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8958 ns 9000 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9458 ns 9084 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8417 ns 8458 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 161406.5 ns 155298 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 96583 ns 96666 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 96041 ns 96708 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 96500 ns 96292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 91292 ns 96375 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112280 ns 111447.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 285125 ns 278125 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 275916 ns 275250 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 279000 ns 274583.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 275041 ns 277584 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 192737.5 ns 190076 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3398084 ns 3409792 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3118416 ns 3047666 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3027854 ns 3023958 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4090167 ns 3959958 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 577236 ns 579376.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7633396 ns 7632583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7448020.5 ns 7497667 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7461875 ns 7451520.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8159958 ns 8199583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1336808 ns 1349456 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17544583 ns 17500916.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17552542 ns 17545437.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17549333 ns 17599584 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14130000 ns 14108083 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24110562.5 ns 23772875 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33971375 ns 34134729 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37615667 ns 37435375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34505479 ns 34708708 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1844860 ns 1860458 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 316909854.5 ns 316659729.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 235311521 ns 235623563 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 196995687.5 ns 195619437 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 279801417 ns 279867979.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13918562 ns 13932935 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 274414625 ns 273833833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 265951833 ns 267231583 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 256018875 ns 255610333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 330216209 ns 329098667 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22125 ns 21375 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22646 ns 22125 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23542 ns 25292 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21959 ns 21125 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95421 ns 94977 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103875 ns 103542 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 105354.5 ns 103791 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 103854 ns 105125 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 115292 ns 103250 ns 1.12
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 511347 ns 500332.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6333 ns 5875 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6125 ns 6417 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6333 ns 6750 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6084 ns 6000 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 69796 ns 68160.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14709 ns 14500 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16042 ns 15000 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15792 ns 16500 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14750 ns 14584 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 489291 ns 477825.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3123667 ns 3101458 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2118374.5 ns 2118542 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2311791 ns 2321249.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 5004083 ns 4650021 ns 1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 589263 ns 585427 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23622375 ns 23564209 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18685708 ns 18768041 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17945875 ns 17974229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35440521 ns 35659708 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2898295.5 ns 2760352.5 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33848562.5 ns 34076750.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27636625.5 ns 27653896 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28626542 ns 28752229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 40805979.5 ns 40853625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 71979.5 ns 74667 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72708 ns 71833.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 73833 ns 73521 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72604.5 ns 71770.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105033.5 ns 100115 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 210291.5 ns 292083 ns 0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 318083 ns 224167 ns 1.42
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 316833 ns 297708 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 275083 ns 205792 ns 1.34
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 559668 ns 537710 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11833 ns 11750 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11750 ns 11416 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12583.5 ns 12542 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12000 ns 12270.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 73102 ns 71148.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26292 ns 26208 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27417 ns 26875 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27834 ns 27625 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26834 ns 26500 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 487396 ns 468928 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12417 ns 12250 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12583 ns 12166 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13459 ns 13500 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12250 ns 12042 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 54541.5 ns 52398 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26541 ns 25250 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26083 ns 26125 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26709 ns 26042 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26000 ns 26000 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 312239.5 ns 301242 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179625 ns 179104.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180084 ns 179750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 182020.5 ns 180583 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180458.5 ns 178625 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 59147.5 ns 55842.5 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 593312 ns 582584 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583041 ns 591917 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 603459 ns 594313 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582292 ns 583166 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 287774 ns 280084 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6042 ns 5958 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6333 ns 6000 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7312.5 ns 6500 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6187.5 ns 5625 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70958 ns 70229 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14312.5 ns 13875 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15000 ns 14542 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15458 ns 15187.5 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14375 ns 14458 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 467345.5 ns 456073.5 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1257541.5 ns 1235292 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1308291 ns 1304042 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1375979.5 ns 1374021 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1091750 ns 1092083 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302420 ns 302409 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4231312 ns 4120521 ns 1.03
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4443833 ns 4446875 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4589250 ns 4623750 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3708520.5 ns 3716729.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1044852 ns 1039016 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 24020 ns 23753 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4917 ns 4833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4958 ns 4917 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5042 ns 4875 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4916 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 193691 ns 186693 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6166 ns 5959 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 6000 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6875 ns 7083 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5645.5 ns 5667 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56339 ns 54622.5 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11334 ns 11167 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11625 ns 11541 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11834 ns 11250 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10500 ns 10542 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 339285.5 ns 325703 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 334 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23097 ns 22898 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2792 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 3041 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 3041 ns 0.93
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 161548 ns 157339 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11375 ns 11625 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12125 ns 12083 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12750 ns 12417 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11417 ns 11229.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58287.5 ns 55735 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24750 ns 24959 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25166 ns 25042 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25083 ns 25042 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25041 ns 25042 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 302525 ns 288122.5 ns 1.05
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4209 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4209 ns 4208 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24886 ns 24760 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16208 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16167 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16708 ns 16500 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16250 ns 16459 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 200919.5 ns 193221.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5792 ns 5791 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5791 ns 5792 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5791 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35005 ns 33178 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20750 ns 20750 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21083 ns 20708 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21500 ns 20916 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20709 ns 20708 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178740 ns 172900.5 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 416958 ns 420188 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 386792 ns 386937.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 484624.5 ns 482833 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 138875 ns 106250 ns 1.31
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67016 ns 67134 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 938167 ns 865417 ns 1.08
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 980375 ns 948604 ns 1.03
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1193208 ns 1189500 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 457062.5 ns 411770.5 ns 1.11
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190892.5 ns 190610 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 140688 ns 136750 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 139083 ns 133396 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 144500 ns 133166.5 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78146 ns 138854 ns 0.56
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193931.5 ns 192824 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1928521 ns 1917250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1927791.5 ns 1912124.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1927541 ns 1920250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1916479.5 ns 1942521 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 406874.5 ns 395139 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22097 ns 22003 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 176834 ns 168855 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6625 ns 6812.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7083.5 ns 6750 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7917 ns 8187.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6833 ns 6334 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 63646 ns 59378.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9459 ns 9312.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9458 ns 9209 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9333 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9083 ns 9083 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 320974.5 ns 305200.5 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 113058083 ns 112669000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173084604 ns 174180000 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143396333 ns 143189875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 111140500 ns 112387917 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5478656 ns 5463061 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 623531250 ns 616937396 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 557544917 ns 558474917 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453377687.5 ns 448891770.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 625106375 ns 624388062.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34926627 ns 38238112 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 670253666 ns 665577792 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665374020.5 ns 667381166.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 602557458.5 ns 616459979 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 744888292 ns 747251209 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 62792 ns 62750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 53333 ns 53834 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 53167 ns 53458 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81875 ns 82125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38096 ns 37037 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1931958 ns 1926667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976500 ns 1974291 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982812.5 ns 1980021 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1883666 ns 1901875 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 176530.5 ns 171617 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 268354 ns 265333 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 266084 ns 269750 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 266770.5 ns 269083.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267083 ns 264854.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 138471 ns 124229 ns 1.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 684875 ns 687584 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 704895.5 ns 678833 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 680063 ns 680125 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 693625 ns 635854 ns 1.09
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 733589 ns 697446 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2236667 ns 2242458 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2232792 ns 2097875 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2245895.5 ns 2254458 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2260750 ns 2199750.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133706 ns 132519 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5517709 ns 5507312 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5515458 ns 5516959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5507792 ns 5495292 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5478667 ns 5486271 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 757656.5 ns 737355 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 682208 ns 678417 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 665834 ns 671291 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 667625 ns 668458 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 676792 ns 682958 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47666 ns 46914 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1824500 ns 1824791.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1715667 ns 1728375 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1715792 ns 1718604.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2083667 ns 2080500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 226213.5 ns 221890.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 72292 ns 70750 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 52625 ns 53125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 52875 ns 52916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82333 ns 82375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 29129 ns 28168 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2054083 ns 2031792 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2092958 ns 2096833.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2095458 ns 2088000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000313 ns 2001083.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191172 ns 187289.5 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13413292 ns 13449750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12504667 ns 12528021.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12610375 ns 12554687.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15299333 ns 15230083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 514224 ns 513617 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 46987896 ns 46862979 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41518750 ns 41543521 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40802062.5 ns 40829437.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 57892000 ns 58532271 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3029248.5 ns 2896866 ns 1.05
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 98175875 ns 74392375 ns 1.32
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90780729.5 ns 90893292 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 92033792 ns 92732000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 75986750 ns 76658749.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 73292 ns 70625 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 61667 ns 64875 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 53292 ns 64625 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82208.5 ns 81917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47360.5 ns 47851 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930000 ns 1923187.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976916.5 ns 1983437.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977875 ns 1973333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1892000 ns 1883833 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190114 ns 193982.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 416 ns 292 ns 1.42
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32602 ns 32956 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708 ns 6416 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6375 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6084 ns 5875 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 175347.5 ns 176118.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32038 ns 32831 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2667 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2916 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2875 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2667 ns 2625 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 165406 ns 165694 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 277154687.5 ns 278326104 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340341375 ns 340448937.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 310234625 ns 308909437.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 279111562.5 ns 278977666.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7096722 ns 7109405 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 999677125 ns 997951584 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 940767958 ns 940941292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 833671583 ns 832217625 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1012000792 ns 1009333917 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34086228 ns 33893371 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1815872125 ns 1394325042 ns 1.30
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1685763250 ns 1705224209 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1637202792 ns 1693911291 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1305223708 ns 1308776729 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1463833 ns 1456667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1432479 ns 1462958 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1452084 ns 1454521 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1468708.5 ns 1451416.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127733 ns 127922 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5038771 ns 5012417 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5036500 ns 5028750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5032417 ns 5027959 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5015521 ns 5027187.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 597041 ns 506424 ns 1.18
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 158379834 ns 157716375 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 128622999.5 ns 136859042 ns 0.94
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 164768458 ns 164218250 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 151616791 ns 151479417 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4983504 ns 4879107 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 627145292 ns 634203459 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 492688333 ns 607766083 ns 0.81
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 467716375 ns 456653750 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 650997542 ns 653815125 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16031412 ns 17510307 ns 0.92
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8989292 ns 8926646 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 9014166 ns 9038916.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7920250 ns 7947771 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10062500 ns 10104354 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1605825.5 ns 1594648 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36825604 ns 36795042 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37713917 ns 38004792 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34552167 ns 34295916.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37842541.5 ns 37862042 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6453497 ns 6452447 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47250 ns 47334 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47583 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47709 ns 47625 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47250 ns 47042 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 17936 ns 18361 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50042 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50417 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50625 ns 50542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50333 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 211767 ns 194710.5 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7041.5 ns 6750 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6834 ns 6875 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7791 ns 7709 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6791.5 ns 6541 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 101571.5 ns 94841 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 9542 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 10209 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10292 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 9958 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 554325.5 ns 543786 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5917 ns 1
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6083 ns 6292 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7437.5 ns 6750 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5666 ns 5666 ns 1
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 110518.5 ns 105080 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13209 ns 12583 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13292 ns 13750 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13709 ns 13375 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13041 ns 13375 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 509430.5 ns 521491.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1084 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33078 ns 33226 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 8125 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8333 ns 8500 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8333 ns 7875 ns 1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8145.5 ns 8041 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 210419.5 ns 215927 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23291 ns 23125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23042 ns 23209 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23437.5 ns 23250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23333 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18552 ns 18682 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52250 ns 52250 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52750 ns 53125 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52791.5 ns 52833 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52250 ns 52250 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 290907 ns 310779 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1451833 ns 1455520.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1403125 ns 1461770.5 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1459625 ns 1464563 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1445291 ns 1420375.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196080 ns 196494.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5025083.5 ns 5004917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5017937.5 ns 4928042 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4999374.5 ns 5012292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4864042 ns 5010708.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 613757 ns 619791 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3125458.5 ns 3153125 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2143333 ns 2140000 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2332708.5 ns 2307083.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4916959 ns 4612500 ns 1.07
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580570.5 ns 580901 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24461333.5 ns 24408833 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19674833 ns 19732667 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18932416.5 ns 19045729.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36440292 ns 36515125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3020876 ns 2842137 ns 1.06
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34147666 ns 34057083.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28226125 ns 28326333 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28040083 ns 28024667 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42650334 ns 42838792 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 141275708 ns 140571271 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143424625 ns 143484104 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 121020437.5 ns 120774500 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 186571750 ns 187527416 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22544817 ns 22777810 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2534668458.5 ns 1387998541 ns 1.83
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1999737917 ns 2164279542 ns 0.92
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 2143878500 ns 1082658958.5 ns 1.98
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 823358604 ns 828842208.5 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 116555809 ns 118414466 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 71625 ns 79708.5 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73166.5 ns 72542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75458 ns 75520.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 83709 ns 73458 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 302677.5 ns 238954.5 ns 1.27
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 287916 ns 286459 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 191896 ns 295292 ns 0.65
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 193875 ns 302292 ns 0.64
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 282458.5 ns 240521 ns 1.17
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1280743 ns 1217040 ns 1.05
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35380770.5 ns 35202521 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35682750 ns 35899625 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 31317625 ns 31197042 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 39667312.5 ns 39929583.5 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5842485 ns 5845222 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148257437.5 ns 147855667 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151961833.5 ns 153555375 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 136251833.5 ns 134579979 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 150216771.5 ns 150196958.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34881249 ns 34892998 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 112925875 ns 114292542 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173003479.5 ns 173321542 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 143333750 ns 143543334 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 93518125 ns 93943084 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5460899.5 ns 5434556 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 471090667 ns 473131708 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 516727250 ns 515810125.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 443934041.5 ns 442518292 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 612554750.5 ns 614699291.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32275777 ns 35179278 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 805778542 ns 804964083 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 655573354 ns 656838729.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 583320500 ns 594341604 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 734069042 ns 735687542 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1368041 ns 1353083 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 919584 ns 1020917 ns 0.90
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 919958 ns 995292 ns 0.92
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2054875 ns 2104875 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 566946 ns 569348 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 3006395.5 ns 2979875 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2633438 ns 2615833 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2441625 ns 2614124.5 ns 0.93
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3438792 ns 3699541.5 ns 0.93
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1769601 ns 1670621 ns 1.06
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5817666.5 ns 5794812.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5798541.5 ns 5833354.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5755562.5 ns 5800917 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2898500 ns 2911437.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8042 ns 7875 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7083 ns 7000 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 7083 ns 7000 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10500 ns 10583 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24997 ns 24801 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225729 ns 222541.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220854 ns 221250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223458 ns 220833.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208541.5 ns 217041.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 255805 ns 245776 ns 1.04
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 454522209 ns 451162917 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 206014875 ns 205123625.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 172978167 ns 178414666.5 ns 0.97
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 453775750 ns 454897875 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7670617.5 ns 7671486 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1106555666.5 ns 1093247396 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 920762020.5 ns 925248250 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 917863666 ns 837547083 ns 1.10
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1163983167 ns 1163363584 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26581282 ns 26761104.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5416 ns 5500 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5916 ns 5458 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6812 ns 6875 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 5291.5 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 152662.5 ns 149694 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 6833.5 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns 7395.5 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7792 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 6875 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 605121.5 ns 579102 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24120 ns 23601 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9292 ns 9166 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 9042 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9541 ns 9250 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8875 ns 10166.5 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 215400.5 ns 199458 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 371667 ns 354500 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 375417 ns 352375 ns 1.07
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351583 ns 355687.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352146 ns 357479.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21526 ns 21220 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 803229 ns 824396 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 803291 ns 778375 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 782000 ns 777666 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 823583 ns 821813 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 255289 ns 231309.5 ns 1.10
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 334750 ns 331125 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 337375 ns 344833 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 452125 ns 453000 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 11145.5 ns 10292 ns 1.08
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17940 ns 18084 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 715917 ns 709750 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 725833.5 ns 741354 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1008166.5 ns 1003291.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26375 ns 26479 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 229391 ns 223194.5 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 372084 ns 370292 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 347875.5 ns 353396 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 437562.5 ns 439292 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 43937.5 ns 29916.5 ns 1.47
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22253 ns 22856 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 735458.5 ns 727458 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 779000 ns 790208 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1027500 ns 1034916 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 97750.5 ns 90395.5 ns 1.08
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 203262.5 ns 197661 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3542 ns 3417 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3750 ns 3625 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3917 ns 3750 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3500 ns 3417 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17828 ns 17539 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4333 ns 4208 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4458 ns 4375 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4334 ns 4250 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4166 ns 4125 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 250213.5 ns 213017 ns 1.17
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3792 ns 3729 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3959 ns 4083 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4167 ns 4958 ns 0.84
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 3417 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 189460 ns 159837 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8167 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8958.5 ns 8583 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8667 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8458 ns 8375 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1128440.5 ns 1042725 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206000 ns 205667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212583 ns 213208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211625 ns 213500 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201167 ns 200458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34618 ns 34523 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 644375 ns 645542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 634375 ns 671042 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623645.5 ns 621458.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 634438 ns 580854.5 ns 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 324576 ns 298737.5 ns 1.09
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1261583 ns 1234437.5 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1252458 ns 1277666 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1193020.5 ns 1190750 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1154583 ns 1152750 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207043 ns 206763.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4562458 ns 4518542 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4633334 ns 4787042 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4473209 ns 4473666.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4541812.5 ns 5146541 ns 0.88
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 928231 ns 931436.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3667 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4208 ns 3667 ns 1.15
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3895.5 ns 4041 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3437.5 ns 2959 ns 1.16
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 198397 ns 185683 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7250 ns 7167 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7333 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7125 ns 7667 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7125 ns 6833 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 954038 ns 942579 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1639667 ns 1642000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1208084 ns 1207250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1372541 ns 1390000 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2354041.5 ns 2427938 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212523.5 ns 212907.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12357250 ns 12368250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9563604 ns 9590500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9298917 ns 9295438 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18074854 ns 18019000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1941189 ns 1954764 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17373646 ns 17359458 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14399791 ns 14385104 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14361833 ns 14370541 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21107083.5 ns 21035500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 137812.5 ns 134083.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 85520.5 ns 139416.5 ns 0.61
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 90834 ns 134958 ns 0.67
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85708 ns 131334 ns 0.65
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126331.5 ns 125600 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2035479 ns 2022916.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1955875 ns 2047021 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2015750 ns 2034334 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2032000 ns 2039125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 946151.5 ns 948556 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1125 ns 1458 ns 0.77
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 1792 ns 1792 ns 1
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3416.5 ns 3520.5 ns 0.97
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1625 ns 1229.5 ns 1.32
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15769 ns 16310 ns 0.97
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2750 ns 2542 ns 1.08
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2791 ns 2792 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2834 ns 2875 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2792 ns 2834 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 179385 ns 182763.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8042 ns 7958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6875 ns 6875 ns 1
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6917 ns 6875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10500 ns 10583 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33770 ns 33908 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 249667 ns 225041 ns 1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229500 ns 221625 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220625 ns 220833 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206542 ns 215291 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 298866.5 ns 320916 ns 0.93
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3667 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3667 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22007 ns 22605 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14417 ns 14500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14583 ns 14625 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14417 ns 14500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14417 ns 14500 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 433380 ns 456450 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143979.5 ns 142749.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 135729.5 ns 91312 ns 1.49
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 94667 ns 142292 ns 0.67
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 134792 ns 138792 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125748 ns 125035 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1934750 ns 1919500 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1925812.5 ns 1942104 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1653333 ns 1929000 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1911042 ns 1927250 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 914085 ns 877064 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 878291 ns 877458.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 825334 ns 825458.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1230104 ns 1230104 ns 1
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 951084 ns 955479 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 268596 ns 269410 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2838667 ns 2816333 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2541125.5 ns 2528771 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3337917 ns 3342458 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3402687.5 ns 3349729.5 ns 1.02
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1617436.5 ns 1555391.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14958 ns 14833 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15500 ns 14875 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16958 ns 18500 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15500 ns 16875 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 151571 ns 131035 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 263125 ns 227209 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 256666 ns 215791 ns 1.19
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223750 ns 216958 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255125 ns 225250 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 665282.5 ns 594103.5 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 218833.5 ns 221333 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221375 ns 222875 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 221166 ns 222583 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 223792 ns 219042 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 309310.5 ns 242007 ns 1.28
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 509062.5 ns 548917 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 501208 ns 511041.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 505458 ns 509917 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 504145.5 ns 508458 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1356900.5 ns 1234181 ns 1.10
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3875 ns 4083 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4083.5 ns 4041 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4188 ns 4417 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4375 ns 3666.5 ns 1.19
batchedmm(16, Bsize=4)/forward/GPU/CUDA 17142 ns 17140 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7500 ns 7209 ns 1.04
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7292 ns 7459 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7291 ns 7333.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7416.5 ns 7417 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 185412 ns 183429.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17708 ns 18833 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17167 ns 16666 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18416 ns 21083 ns 0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15958 ns 18396 ns 0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 186970.5 ns 131942 ns 1.42
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218833 ns 245395.5 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 247292 ns 212292 ns 1.16
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213625 ns 214833 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 211500 ns 213708 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 962747 ns 833743 ns 1.15
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4917 ns 4208 ns 1.17
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5166 ns 4833 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4833 ns 4916.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4166.5 ns 3854.5 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 195909 ns 208168.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10416 ns 10333 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10875 ns 10459 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10625 ns 11084 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10375 ns 10145.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1010229 ns 994315 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 3458 ns 1.12
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 3791 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4458 ns 4042 ns 1.10
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3333 ns 3167 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 219391 ns 209797 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7917 ns 7416 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7833 ns 7459 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7834 ns 8083.5 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7083 ns 7459 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1030231.5 ns 997101.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23757250 ns 23443625 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35471625 ns 34805208 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 38290895.5 ns 37298500 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34483875 ns 34536209 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1833809 ns 1851929 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 185699542 ns 185954395.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 160028771 ns 159888645.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146263209 ns 144873209 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 437740041 ns 438754792 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16480986 ns 16496173 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 271222229 ns 269927937.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 260376333 ns 259799312.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 300106000 ns 298856875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 486503812 ns 487045354.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182791 ns 189541.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 190667 ns 182167 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184333 ns 183416.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181792 ns 182375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 208501 ns 187318 ns 1.11
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 637249.5 ns 636187.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 634645.5 ns 597458.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 592520.5 ns 588459 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 632229 ns 596146 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1057588 ns 944443 ns 1.12
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3863020.5 ns 3952375 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3940187.5 ns 4007646 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3577667 ns 3594292 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4857375 ns 4885708 ns 0.99
batchedmm(128, Bsize=512)/forward/GPU/CUDA 533313.5 ns 552348.5 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 18144146 ns 18061833 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 18426875 ns 18498208.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16971458 ns 17053770.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19835208 ns 19733813 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2625144.5 ns 2636788.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33403 ns 32315 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 9145.5 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9417 ns 9625 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9291 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9042 ns 8792 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 256811 ns 247143.5 ns 1.04
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 502531666.5 ns 497882542 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 468706708 ns 466893292 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 359169771 ns 356555750 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 597742458 ns 601192353.5 ns 0.99
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12474979 ns 12465773.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1889533104 ns 1887759917 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1628715000 ns 1627534167 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1500257208.5 ns 1505961604 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2135455146 ns 2123318791.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49477206.5 ns 49303078 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1655875 ns 1652917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1207208 ns 1209833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1417834 ns 1397667 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2430000 ns 2460062.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215864 ns 214417 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12748479.5 ns 12745021 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9953749.5 ns 9950208 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9718749.5 ns 9693541 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18387500 ns 18371500 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2012401 ns 2028129 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17706542 ns 17681833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14704250 ns 14711375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14671854 ns 14648250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21442208 ns 21429709 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26209 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26250 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26208 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26167 ns 26166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24383 ns 23744 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67125 ns 67208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67458 ns 67208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66875 ns 67166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67000 ns 66916 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 387270.5 ns 365755.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206833 ns 206375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 212000 ns 212666 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211917 ns 211542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200041 ns 200291 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26798 ns 25711 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 657750 ns 655729 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 667438 ns 632000 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 670645.5 ns 673667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587959 ns 630708 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 332552 ns 322192 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 693500 ns 683459 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 541604.5 ns 682708 ns 0.79
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 688312.5 ns 691916.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 688417 ns 680834 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131360 ns 130902.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2267792 ns 2242354.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2254250 ns 2244709 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2249042 ns 2244875.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2240291 ns 2229125 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1150961 ns 1093705 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19875 ns 20396 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16875 ns 16833 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18042 ns 23020.5 ns 0.78
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16959 ns 19166 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 133829.5 ns 131648.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 270250 ns 265541.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227875 ns 232167 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 260125 ns 264625 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218417 ns 259979 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 989553.5 ns 939947 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 541 ns 1.16
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23299 ns 23249 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9542 ns 9583.5 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 9708 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10042 ns 10041 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9583 ns 9541 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 245463 ns 242690 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5417 ns 5542 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 5709 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7125 ns 6667 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 5250 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 207180 ns 206130.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7125 ns 6709 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7417 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7875 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 6708 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 736539 ns 735324.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2250 ns 2000 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2167 ns 2229.5 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2250 ns 2125 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2292 ns 2292 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18095 ns 17909 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6750 ns 6375 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6666 ns 6792 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6875 ns 6875 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6334 ns 6208 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 305699.5 ns 303359 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 776770.5 ns 751688 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 749458.5 ns 779292 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 780583 ns 779395.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 778083.5 ns 776146 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21274 ns 20845 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 801416 ns 796792 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 810687.5 ns 791166 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 813917 ns 808708 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 810584 ns 775292 ns 1.05
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 270452 ns 267264 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 8042 ns 8000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 7000 ns 6687.5 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6917 ns 6958 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10625 ns 10458 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32850 ns 32932 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 266167 ns 261062.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 266062.5 ns 237583 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 269771 ns 271396 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255895.5 ns 252646 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 335080 ns 331767 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10334 ns 10250 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10334 ns 10542 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10708 ns 11208 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10167 ns 10250 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 224193 ns 218675.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24666 ns 25000 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24708 ns 24625 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 26208 ns 25583 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25000 ns 24416 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1065472 ns 1056250 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106377708 ns 106355042 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117402750 ns 117397229.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120815791 ns 120585312.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117612208.5 ns 117183084 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2661070 ns 2657952 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 375259750 ns 374187771 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 347518125 ns 350821292 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 360594625 ns 361003333 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 481637083 ns 479876375 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15260454.5 ns 15234863.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 791338708.5 ns 604863708 ns 1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 771555583 ns 773786667 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 811283708 ns 812604291 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 768279167 ns 770323375 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6750 ns 6833 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7229.5 ns 7084 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 8062.5 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6875 ns 6250 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 213488 ns 213616 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13917 ns 13458 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14125 ns 13875 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13959 ns 14416 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13375 ns 13625 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1023254 ns 1017707 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6000 ns 6208 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6458 ns 6042 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7167 ns 7145.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5666 ns 5417 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 210071 ns 208255 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12541 ns 11958 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12792 ns 12729.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12916 ns 13250 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12000 ns 12500 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 725821.5 ns 723959 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5500 ns 6209 ns 0.89
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5834 ns 6375 ns 0.92
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6250 ns 6375 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 6000 ns 5500 ns 1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16767 ns 16943 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15666 ns 15250 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15458 ns 15625 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15584 ns 15625 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15541 ns 15500 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 185594.5 ns 186257 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23662 ns 23245 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6375 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6375 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6625 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6187.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 227826.5 ns 225046 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5750 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5834 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24912 ns 24205 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21084 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21334 ns 21417 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 21541.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21250 ns 21229.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 249065 ns 246651 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149583 ns 194166.5 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 193250 ns 200521 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 189875 ns 190666.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 183271 ns 185562 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167107 ns 166320.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1347374.5 ns 1329104.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1326834 ns 1324792 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1334167 ns 1328041 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1143875 ns 1337729.5 ns 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1251168.5 ns 1221500 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21812.5 ns 24687.5 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21834 ns 22000 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25500 ns 25667 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21917 ns 21250 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 319912.5 ns 254624.5 ns 1.26
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 145292 ns 130791 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 178958 ns 132062.5 ns 1.36
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 183625 ns 179458 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 178729 ns 179520.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1370058 ns 1317432 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23172 ns 22902 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6459 ns 6208 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6709 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6917 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6292 ns 6291 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 243330.5 ns 240780 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4542 ns 4875 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4416 ns 4542 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5500 ns 5500 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4542 ns 4417 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 230541.5 ns 229531.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 10083 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 10375 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10416 ns 10583 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10416 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1274677 ns 1276460 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1667 ns 1583 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 1584 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22978 ns 22954 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 5792 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5917 ns 5958 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5875 ns 5875 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5708 ns 5584 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 262617.5 ns 258626 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6869583 ns 6841563 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6385187.5 ns 6377645.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6569104 ns 6542167 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7620979 ns 7612146 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212993 ns 213873 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24088541 ns 24061541 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21276291 ns 21280959 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21151250 ns 21049937 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29764104.5 ns 29725708.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2094452 ns 2091556 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48969166 ns 37658500 ns 1.30
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45455437 ns 45669958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45774146 ns 45878312.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37954083.5 ns 38309416.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 5917 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6125 ns 6042 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7125 ns 6958.5 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5666 ns 5542 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 204070.5 ns 210091 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 8041 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8250 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8459 ns 8500 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 8250 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 984301.5 ns 992082 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1573917 ns 1552375 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1268875 ns 1278292 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1643334 ns 1634959 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2167478.5 ns 2176750 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 271090 ns 269882.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7942625 ns 7890000 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6611271 ns 6564479 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7244083.5 ns 7223979 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10441166.5 ns 10470041 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1758088 ns 1748953.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 371479.5 ns 375500 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 374416.5 ns 379708 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 454791 ns 454583 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 30291 ns 34834 ns 0.87
batchedmm(128, Bsize=4)/forward/GPU/CUDA 44073.5 ns 46336 ns 0.95
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 745166 ns 739834 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 811042 ns 821979 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1067042 ns 1062042 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 77167 ns 119270.5 ns 0.65
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 274366 ns 274066 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 414167 ns 412125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 305917 ns 305917 ns 1
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 306000 ns 305916 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 758584 ns 757958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43580 ns 44006 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 659625 ns 658583 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 524708 ns 525792 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 524583 ns 523167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973250 ns 973083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 189177.5 ns 189089 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 692125 ns 672875 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 672375 ns 676521 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 676292 ns 644292 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 674917 ns 672333 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131547 ns 131017.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2463791.5 ns 2466812.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2466917 ns 2456312.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2455459 ns 2425417 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2437166 ns 2465333 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1168757 ns 1103271 ns 1.06
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 2125 ns 2333 ns 0.91
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 4958.5 ns 2875 ns 1.72
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4458 ns 4500 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3125 ns 3167 ns 0.99
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15381 ns 16213 ns 0.95
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5417 ns 5208 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5625 ns 5625 ns 1
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5667 ns 5667 ns 1
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5583 ns 5459 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 184580 ns 184737.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1480542 ns 1481125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1517458 ns 1519875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1517583 ns 1522875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1449000 ns 1453417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40149 ns 40096 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5141521 ns 5124333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5293812.5 ns 5295937.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5299875 ns 5290354 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011041 ns 4993187.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195851 ns 194429.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3666 ns 3666 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3666 ns 3625 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3666 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33274 ns 33150 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15333 ns 15208 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15500 ns 15375 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15458 ns 15416 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15166 ns 15250 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 349588.5 ns 349182 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 93375 ns 93000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 94000 ns 103209 ns 0.91
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 92917 ns 92958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 102709 ns 92833 ns 1.11
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112783 ns 113197 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317750 ns 315959 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 316292 ns 319270.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 317584 ns 317000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317584 ns 317333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 193819 ns 191577 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24048 ns 23307 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 7792 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8375 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8125 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 8000 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 246693 ns 244539 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 536187.5 ns 531791 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 512687.5 ns 517334 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 577375.5 ns 578729.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 240479 ns 256916 ns 0.94
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129151.5 ns 130622 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1428937.5 ns 1386812.5 ns 1.03
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1472834 ns 1483208.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1798541.5 ns 1776708 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 869958 ns 871125 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274083.5 ns 273552 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31689 ns 31822 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 5958 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6542 ns 6459 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6416 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 5958 ns 6167 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 249959 ns 246678.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1773333 ns 1774479 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1768771 ns 1782250.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1792416.5 ns 1777916 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1795771 ns 1766937 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168746.5 ns 169504.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4384292 ns 4354563 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4287458 ns 3899583 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4380250 ns 4361500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4188958 ns 4355333 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1121956.5 ns 1064911 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6812.5 ns 24479 ns 0.28
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7312.5 ns 7541 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7520.5 ns 7833 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7000 ns 22208.5 ns 0.32
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20861 ns 19777 ns 1.05
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 79833.5 ns 72854.5 ns 1.10
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 64104 ns 51667 ns 1.24
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52000 ns 51833 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 70000 ns 70542 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 268016 ns 193123 ns 1.39
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17625 ns 17625 ns 1
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17917 ns 18250 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18250 ns 17708 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17917 ns 17250 ns 1.04
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18127 ns 18352 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53250 ns 53000 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53334 ns 53250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53458 ns 53542 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53333 ns 53375 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 312603 ns 317963.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 107625 ns 107500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 97000 ns 107125 ns 0.91
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 98375 ns 105625 ns 0.93
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 99458 ns 97584 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47074.5 ns 46786 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324291 ns 323417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 323334 ns 327750 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324458.5 ns 322667 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 323375 ns 325000 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 210382 ns 207825 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1505584 ns 1504209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1545458 ns 1545458 ns 1
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1545125 ns 1549042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1467020.5 ns 1478167 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52015 ns 51382 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5125208 ns 5122771 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5288334 ns 5291458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5293292 ns 5291125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4977583 ns 5000125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202621 ns 200987.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28166 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28167 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28084 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24292 ns 24367 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66583.5 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66458 ns 66583 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67584 ns 66375 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66833 ns 66375 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 479926.5 ns 493214.5 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1516292 ns 1497500 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1145334 ns 1150584 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1158375 ns 1142791.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2105229 ns 2256875 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 577247 ns 579142.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3108479 ns 3080625.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2741458 ns 2682000 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2737042 ns 2729917 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3824687.5 ns 3656583 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1965139 ns 1939352 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7930583 ns 7890875 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7916584 ns 7897375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7914458 ns 7904208 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4821083.5 ns 4815458 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 139291 ns 138395.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 137125 ns 78917 ns 1.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 139979.5 ns 132458.5 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 140687.5 ns 140084 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192231 ns 193872 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033500 ns 2020209 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2035250 ns 1690750 ns 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2028999.5 ns 2025250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2011666 ns 2006209 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 744838 ns 742900 ns 1.00
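
Reading note on the Ratio column: the values in the rows above are consistent with Ratio = Current / Previous, so a value above 1 means the current commit was slower. The sketch below recomputes the ratio for a few rows copied from the table and lists those above a cutoff. The 1.15 cutoff and the helper names `ratio` and `flag_slower` are illustrative assumptions, not settings or functions taken from the benchmark workflow.

```python
# Rows copied from the table above: (benchmark name, current ns, previous ns).
ROWS = [
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)",
     137125.0, 78917.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)",
     2035250.0, 1690750.0),
    ("dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)",
     97000.0, 107125.0),
    ("mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)",
     2105229.0, 2256875.0),
]


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; above 1 means slower."""
    return current_ns / previous_ns


def flag_slower(rows, threshold: float = 1.15):
    """Return (name, rounded ratio) for rows whose ratio exceeds the threshold.

    The default threshold is a hypothetical cutoff chosen for illustration.
    """
    return [
        (name, round(ratio(cur, prev), 2))
        for name, cur, prev in rows
        if ratio(cur, prev) > threshold
    ]


if __name__ == "__main__":
    for name, r in flag_slower(ROWS):
        print(f"{r:.2f}  {name}")
```

With these four rows, only the two groupnorm entries at 4 threads exceed the cutoff. Both are CPU timings where the previous run was unusually fast compared with the other thread counts, so they may reflect run-to-run noise rather than an effect of the compat bump.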

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal closed this Dec 6, 2024
@avik-pal avik-pal deleted the compathelper/new_version/2024-12-06-00-19-20-030-01136868030 branch December 6, 2024 03:53