This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix: dropout enzyme test fixes #153

Merged
merged 1 commit into from
Sep 5, 2024

avik-pal (Member) commented Sep 5, 2024

No description provided.

avik-pal changed the title from "fix: looped dropout implementation on CPU" to "fix: dropout enzyme test fixes" on Sep 5, 2024

codecov bot commented Sep 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.11%. Comparing base (1afc1c7) to head (26a745f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   76.11%   78.11%   +2.00%     
==========================================
  Files          38       38              
  Lines        1959     1956       -3     
==========================================
+ Hits         1491     1528      +37     
+ Misses        468      428      -40     
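
The percentages in the diff above can be reproduced from the hit and miss counts. The following is a minimal sketch, assuming coverage is computed as hits / (hits + misses) with no partially covered lines (the report lists none); the function name `coverage_percent` is illustrative and not part of Codecov.

```python
# Counts copied from the Codecov diff above: (hits, misses) per commit.
base_hits, base_misses = 1491, 468   # base 1afc1c7
head_hits, head_misses = 1528, 428   # head 26a745f


def coverage_percent(hits: int, misses: int) -> float:
    """Line coverage in percent, assuming lines = hits + misses (no partials)."""
    return 100.0 * hits / (hits + misses)


# Total coverable lines should match the "Lines" row of the diff.
base_lines = base_hits + base_misses
head_lines = head_hits + head_misses

base_pct = coverage_percent(base_hits, base_misses)
head_pct = coverage_percent(head_hits, head_misses)
delta = head_pct - base_pct

print(f"lines:    {base_lines} -> {head_lines}")
print(f"coverage: {base_pct:.4f}% -> {head_pct:.4f}% ({delta:+.4f})")
```

The line totals come out to 1959 and 1956, matching the "Lines" row, and the computed percentages agree with the reported 76.11%, 78.11% and +2.00% to within about 0.01 percentage points; the exact rounding Codecov applies to the last digit is not stated in the report.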


github-actions bot (Contributor) left a comment

LuxLib Benchmarks

Benchmark suite Current: 26a745f Previous: 1afc1c7 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5875 ns 5479.5 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5291.5 ns 6375 ns 0.83
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7958 ns 8000 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5645.5 ns 6375 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119440 ns 119198 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2765240 ns 2649209 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 742041 ns 704000 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 412334 ns 417764 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 9812 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9917 ns 9625 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10875 ns 10042 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10041 ns 9541 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 555583 ns 551456 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 16647491 ns 16841216 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2610875 ns 2645125 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 681268 ns 659636 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458 ns 1395.5 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3125 ns 1687.5 ns 1.85
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1750 ns 1875 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1417 ns 2521 ns 0.56
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21829 ns 21867 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1292057 ns 1304894 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 215500 ns 212604 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 30965 ns 30820.5 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4229.5 ns 4209 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3625 ns 4312.5 ns 0.84
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4708 ns 3917 ns 1.20
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3834 ns 4375 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 147051 ns 146279 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9309598 ns 8894773.5 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1600188 ns 1523375 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 149932 ns 148982 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57500 ns 57542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46500 ns 46584 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 40000 ns 39875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82792 ns 83708 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36927 ns 36787 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 563011 ns 582007 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1075125 ns 985625 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80601 ns 84391 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2029208 ns 2036583 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2092917 ns 2086750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2089208 ns 2079917 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1995416.5 ns 1987312.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230122 ns 227214 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8058504 ns 7854957 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7868625 ns 7818750 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1282994 ns 967560 ns 1.33
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 153125 ns 154083 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 149500 ns 146958 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148396 ns 149979.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 145458 ns 165187.5 ns 0.88
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166122.5 ns 166381 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7788069 ns 7795058 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1449625 ns 1464583 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 190462 ns 207072 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1107375 ns 1110895.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1094750 ns 1103209 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1122771 ns 1118687 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1110937 ns 1109562.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 712693.5 ns 711437 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34722508 ns 33922938.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5817083 ns 6051917 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1040261 ns 1036360 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5292 ns 5208 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5000 ns 4271 ns 1.17
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5875 ns 5375 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4791.5 ns 4584 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 94128 ns 94268 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5352782 ns 5136056 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 451041.5 ns 711583 ns 0.63
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 71450 ns 69481 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8667 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9084 ns 8500 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9167 ns 8917 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8333 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 610507 ns 603970 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 36416804.5 ns 33683319.5 ns 1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 7166292 ns 5821292 ns 1.23
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388814 ns 389889 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17645.5 ns 17729.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17875 ns 20042 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19833 ns 20584 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17708.5 ns 20416.5 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67107 ns 66995 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2751251.5 ns 2897295 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1290999.5 ns 1301292 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 72801 ns 73931 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 211875 ns 211625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 245374.5 ns 218875 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214770.5 ns 218667 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223833 ns 224875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 359840 ns 357740 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12166191 ns 14308445 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5640520.5 ns 5704396 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 475685 ns 473855 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 666 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 750 ns 1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 666 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20835 ns 20965 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1137595 ns 1157358.5 ns 0.98
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 295208.5 ns 283542 ns 1.04
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 34241 ns 32571 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1541 ns 1375 ns 1.12
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1500 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1334 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 126497 ns 125947 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8400943 ns 8433349.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1686083 ns 1594979.5 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 138152 ns 138471 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7334 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6166 ns 6125 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5417 ns 5333 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 10417 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23913.5 ns 23836 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1233546.5 ns 1232101.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 599042 ns 583125 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49130 ns 46460 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 260604 ns 227708 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 263083 ns 235583 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 264875 ns 264667 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 249916 ns 248583 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 192935 ns 190580 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29855996.5 ns 29562269.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8650417 ns 8564854.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 617941.5 ns 611281 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23226 ns 23789 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1923166 ns 2018577 ns 0.95
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 225125 ns 219791.5 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48510 ns 50370 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16875 ns 16958 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17334 ns 17083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16958 ns 17083 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17000 ns 16666 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 194789 ns 197449 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10525168 ns 9693737.5 ns 1.09
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 951896 ns 940458 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 175272 ns 176226.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 512270.5 ns 509500 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405292 ns 405083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 332458 ns 332459 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 864917 ns 865125 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113343 ns 113130 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 392476 ns 391060 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 457875 ns 451416 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 247872 ns 248703 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2323375 ns 2324333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2034417 ns 2025375.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1753937.5 ns 1752833.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3275375 ns 3200583 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 242794 ns 244865 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 9219742 ns 11656548 ns 0.79
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1995000 ns 1966229 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 759198 ns 761317.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 6250 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6895.5 ns 6145.5 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6916 ns 7729 ns 0.89
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6187.5 ns 6375 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 92518.5 ns 93009 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5487357.5 ns 5406797 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 855771 ns 758167 ns 1.13
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 59771 ns 60110 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11958 ns 10646 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11667 ns 10542 ns 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11667 ns 11084 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10437 ns 10375 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 637059 ns 660576 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38550571.5 ns 38819677 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5722834 ns 5487104 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 411715 ns 416424 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23788 ns 23635 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2286830 ns 2221310 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 329896 ns 319750 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 53510 ns 53401 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2042 ns 2125 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 221806.5 ns 232566 ns 0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10957008 ns 11381984 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2037167 ns 1912541.5 ns 1.07
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 186492 ns 186466.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8958 ns 8375 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9292 ns 8750 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10042 ns 10438 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8042 ns 8958 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 107667.5 ns 104173 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 2997805.5 ns 3244842 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 910042 ns 896708 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 74635.5 ns 74231 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17250 ns 17708 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18520.5 ns 17750 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18750 ns 18187.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17333.5 ns 18041.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 614632 ns 610296 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 15956679.5 ns 17126722 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5596125 ns 5229458 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387124 ns 387209 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 541 ns 1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35673 ns 35555 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1212929.5 ns 1100087 ns 1.10
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 455250 ns 438541 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45700 ns 47930 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8937.5 ns 9312 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9167 ns 8125 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9916.5 ns 9792 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9104 ns 9146 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 257298 ns 256000 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19099554 ns 19311232 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5329750 ns 4774937.5 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 377764 ns 378844 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396833.5 ns 397000 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288042 ns 288125 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215375 ns 215667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751604.5 ns 756875 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111805 ns 111981 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 321560.5 ns 320003 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 419417 ns 365500 ns 1.15
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 78301 ns 78230 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1453833 ns 1460875 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1134125 ns 1135291.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 860417 ns 862687.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2438959 ns 2357291 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 208454 ns 209166.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 9610442 ns 9267436 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1660083.5 ns 1516312.5 ns 1.09
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 323654 ns 323643 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7354.5 ns 6667 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7625 ns 6959 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 8958.5 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7166.5 ns 7334 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 141224.5 ns 144567 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5656682.5 ns 5867002 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 722312.5 ns 707270.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 59641 ns 70660 ns 0.84
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13292 ns 15395.5 ns 0.86
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14791.5 ns 12417 ns 1.19
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15749.5 ns 14250 ns 1.11
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14687.5 ns 13312 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 935389 ns 958993.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41759340.5 ns 40369162 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6916250 ns 5752729.5 ns 1.20
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 432564 ns 433804 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26375 ns 24416 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25000 ns 26417 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28042 ns 28687 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 27062.5 ns 26874.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 201126.5 ns 201880.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7831227.5 ns 8100056 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1175458 ns 896833 ns 1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 116261 ns 114876.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 151167 ns 148834 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 119479.5 ns 104708 ns 1.14
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 148999.5 ns 153500 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 152500 ns 116979 ns 1.30
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1087665 ns 1086710 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43137168 ns 41151661 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6444584 ns 5843229.5 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 593036 ns 594985 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 85833 ns 73958 ns 1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76500 ns 76791.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76833 ns 80166 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76625 ns 75417 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 208103.5 ns 207189 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7493588 ns 7362606 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 540624.5 ns 519687.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 131571 ns 126391.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 313791 ns 297334 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 296834 ns 221667 ns 1.34
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 311958 ns 288917 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221937.5 ns 221041.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1133559 ns 1119401 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43223483.5 ns 41008184.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6691417 ns 6497687.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 701802 ns 694627 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16312 ns 16417 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17395.5 ns 16583 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18291 ns 17792 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 17167 ns 16708 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 147610 ns 147421 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5804228 ns 5759467 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 765792 ns 427292 ns 1.79
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239413 ns 237703 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25979 ns 24833.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27937.5 ns 27042 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27958 ns 27166.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25937.5 ns 27125 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 986333 ns 984196 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 40747244 ns 40719457 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6169458 ns 5828333 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 715038 ns 714022 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10417 ns 11562.5 ns 0.90
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11709 ns 10375 ns 1.13
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14083 ns 12083 ns 1.17
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10167 ns 11083 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 124659 ns 124895.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3568552 ns 3575871 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 924583 ns 912833 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239052 ns 242943 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22958.5 ns 21125 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22104.5 ns 21917 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22334 ns 22000 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20917 ns 21416 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 707851 ns 706086.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21734391 ns 21428227.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5474542 ns 5387146 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 688247 ns 673547 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62500 ns 64000.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 66291 ns 63500 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67937 ns 66166 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67979 ns 62584 ns 1.09
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107328 ns 105629.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3446851 ns 3434086.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1322146 ns 1323250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 236942 ns 237572 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 441666.5 ns 448750 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 438625 ns 437958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 448292 ns 446666 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 437333 ns 449583 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 518756.5 ns 517219 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21007823 ns 21208755 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6177354.5 ns 5978042 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 719262.5 ns 730458 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6958 ns 6958.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7791 ns 6833 ns 1.14
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9313 ns 8041 ns 1.16
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7603.5 ns 7771 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 146777.5 ns 145909.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5499279 ns 5602766 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 699541.5 ns 628395.5 ns 1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59140 ns 58991 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14166 ns 14042 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14084 ns 15750 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14584 ns 13917 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15062.5 ns 13479 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 958445 ns 954313 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38754584 ns 38432249.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6493041.5 ns 5549500 ns 1.17
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 408904 ns 404584 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6149000 ns 6160416 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6375812.5 ns 6378167 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3225875.5 ns 3224791.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11899792 ns 11924000 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301882 ns 301800.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 293803 ns 294983 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19107145.5 ns 19104958 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19915333.5 ns 19957229 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11120042 ns 11123708.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 37016645.5 ns 36532604 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1023438 ns 1023618 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1152527 ns 1158122 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 959 ns 917 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1000 ns 958 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 958 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 959 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23409 ns 23554 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2081403 ns 2143802 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 337209 ns 316188 ns 1.07
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 213602 ns 215672 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3666 ns 3625 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3666 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3667 ns 3666 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 281249 ns 283503 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11252797 ns 11257238 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2188854 ns 2086333.5 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 646317 ns 637297 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8958 ns 8000 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8584 ns 7958 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9813 ns 9042 ns 1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7916.5 ns 7854 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 121231.5 ns 120818.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3432461 ns 3517154 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 881937 ns 776959 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 66270 ns 67641 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11938 ns 11729.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 12375 ns 12250 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12792 ns 12334 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11833 ns 12458.5 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 646205 ns 643501 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21341489 ns 21447178 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5427458 ns 5189125.5 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 369394 ns 365334 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22699 ns 22596 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2011541.5 ns 1951713 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 330708 ns 225750 ns 1.46
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 52201 ns 52251 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 3041 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3292 ns 3208 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3333 ns 3375 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3083 ns 3042 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 204312 ns 204741 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9376732 ns 9227567 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1703999.5 ns 1619250 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 172072 ns 172842 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11083 ns 11250 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12459 ns 11334 ns 1.10
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14083.5 ns 13125 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11354 ns 11458 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 122250.5 ns 121547.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3430038 ns 3353104 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 931916 ns 869041 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 239013 ns 243193 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22062.5 ns 22000 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21645.5 ns 20583 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 22708 ns 21167 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20875 ns 20791 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 602097 ns 598450 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20239973.5 ns 19931223.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4932625 ns 4695229 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 648807 ns 652706.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4334 ns 4375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4458 ns 4417 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4416 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24546 ns 24359 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2228549 ns 2166080 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 229500 ns 223833 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 52241 ns 52541 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16584 ns 16667 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16625 ns 16500 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16541 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16333 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 329742.5 ns 331128 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12273947 ns 12599810 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1797000 ns 1647875.5 ns 1.09
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 212507 ns 212037.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2083 ns 1959 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2167 ns 2083 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2167 ns 1958 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2084 ns 1958 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35728 ns 35684 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1153804 ns 1146851 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 490000 ns 441458.5 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 206052 ns 206802 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 19624.5 ns 16645.5 ns 1.18
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 18416 ns 16750 ns 1.10
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 18459 ns 16562.5 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17187.5 ns 17208.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 294948.5 ns 294264.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21393926 ns 20813859 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5546958 ns 5292083 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 703407 ns 703797.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59375 ns 59583.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 64104.5 ns 63625 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 62458.5 ns 62625 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51167 ns 51292 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66631 ns 66405 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 101111 ns 103511 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 206500 ns 199395.5 ns 1.04
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 158604 ns 157250 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 116854 ns 133937.5 ns 0.87
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 310812.5 ns 317729 ns 0.98
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 217149 ns 216342 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 580606 ns 579316 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84375 ns 82458.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82375 ns 85271 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86542 ns 90209 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82375 ns 140417 ns 0.59
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192561 ns 192334 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5470941 ns 5533381 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1673458 ns 1893708 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171362 ns 170101.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1893750 ns 1851687.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1917895.5 ns 1882334 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1925896 ns 1926500 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1822750 ns 1891958.5 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 536315 ns 532324 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26386495.5 ns 25979046 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8985104 ns 9683125 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1083051 ns 1080090 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21947.5 ns 21761 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2129791 ns 2115738 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 369791 ns 346875 ns 1.07
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 45251 ns 45220 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1750 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 252263 ns 253104 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9862510 ns 9490240.5 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1601917 ns 1088979 ns 1.47
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 185317 ns 187502 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8083 ns 8084 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9395.5 ns 8438 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11417 ns 10875 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8833 ns 8209 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119822.5 ns 119061 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3293817 ns 3459549.5 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 930041.5 ns 880209 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 236242.5 ns 237872 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9709 ns 10167 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 9208 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10541.5 ns 9500 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9375 ns 9167 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 528892 ns 527070 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19321268 ns 18222497.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4719958.5 ns 4417458 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 647817 ns 634411 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57375 ns 58417 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46542 ns 46333 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39375 ns 39500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80395.5 ns 84083 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39845 ns 39770 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1345580.5 ns 1341281.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1133854.5 ns 1100583.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 81551 ns 75935.5 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1904875 ns 1901542 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1944500 ns 1921833.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1978417 ns 1955833 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1870416.5 ns 1881792 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 222036.5 ns 221320 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 34272722 ns 33766076 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11239500 ns 11588792 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1033441 ns 1036440 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418958 ns 415958 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 417625 ns 420042 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421458 ns 419875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 418084 ns 418708 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210091.5 ns 210156.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7195821 ns 7606443 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 539833 ns 522750 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 286323 ns 287858 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 678021 ns 764709 ns 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 778708 ns 781812 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 773959 ns 753417 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 673417 ns 678791.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1061649.5 ns 1059447 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44767025 ns 43854665.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6653271 ns 6323063 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 927159 ns 916300 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3439333.5 ns 3425978.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3391104 ns 3451792 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3461875 ns 3458979.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3470958 ns 3412708 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 172710 ns 170950 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8280433.5 ns 8189493 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1462521 ns 1396875 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 436115 ns 435150 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6126375 ns 6194166.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6222354 ns 6230791.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6240750 ns 6222854 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6176479.5 ns 6218875 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1002674 ns 1001834 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47673707 ns 49254606 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8127937.5 ns 8528604 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1567646 ns 1556125 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 472792 ns 472667 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 340625 ns 339875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 253583 ns 253208 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 902333 ns 902000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46656 ns 46534 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 827389.5 ns 886552 ns 0.93
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 448875 ns 478875 ns 0.94
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 251102 ns 249963 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2326542 ns 2333750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2032958 ns 2036625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1761771 ns 1763167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3281417 ns 3203312 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 256650 ns 258879 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 13683336.5 ns 13032420 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2212583.5 ns 2178375 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 787003 ns 787818 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57209 ns 57542 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46042 ns 45875 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39417 ns 39458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82417 ns 83791 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 29161 ns 28376 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1330970 ns 1391893 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1148563 ns 1124083 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78581 ns 77840.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2035708.5 ns 2032250 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2089812.5 ns 2093187.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2097687 ns 2091917 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920999.5 ns 1972229.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 240945.5 ns 235913 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37146267 ns 35452366 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11479375 ns 11558395.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1222283 ns 1056250.5 ns 1.16
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57708 ns 57708 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46458 ns 46625 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39417 ns 39875 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80250 ns 83916.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50765 ns 49455 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 827004 ns 809068 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1086000 ns 1084875 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73451 ns 72105.5 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1889042 ns 1921083 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1942292 ns 1945916.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982354.5 ns 1974729.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1850625 ns 1864791 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 247266.5 ns 238800.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 16809144 ns 17238198 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9912833.5 ns 10023791.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1067510 ns 934629 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35925 ns 34886 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1169075 ns 1200155 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 462667 ns 279833 ns 1.65
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47970 ns 48281 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6792 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6208.5 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7000 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6708 ns 6667 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 213300 ns 212384.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 22271987 ns 19751565 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5441708 ns 5078916.5 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 373894 ns 379104 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 33056 ns 32763 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1173203 ns 1167700 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 260750 ns 253542 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 40680 ns 41150 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 3833 ns 0.73
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3417 ns 3041 ns 1.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3167 ns 3375 ns 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 3125 ns 0.89
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 194789 ns 190584.5 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7289420 ns 7912209 ns 0.92
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 961353.5 ns 1265542 ns 0.76
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 155561 ns 153656.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 422146 ns 454937 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 422458 ns 454750 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 431375 ns 458229 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 446500.5 ns 427188 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 139235.5 ns 138010.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5896513 ns 5819207 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2058104 ns 2011000 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 381574 ns 325693 ns 1.17
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3767291.5 ns 3801708.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3755874.5 ns 3811125 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3827959 ns 3821292 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3647292 ns 3815375 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 723602 ns 710674 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33144724.5 ns 32043185 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11368791 ns 10832625.5 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1330629 ns 1491590 ns 0.89
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49871458.5 ns 49856479 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35510750 ns 35516042 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 26029104.5 ns 26022291 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97438834 ns 97102959 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593681 ns 1594251.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1009851 ns 1009650 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154550187.5 ns 154623520.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112345583 ns 112350625 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 88949792 ns 89065125 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 299491791.5 ns 296081125 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6536119 ns 6489845.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5557747 ns 5556104 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 18146 ns 17312.5 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 17041 ns 16834 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 14917 ns 14291.5 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15042 ns 15167 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 22030 ns 21687 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1119955.5 ns 1157478.5 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 224333.5 ns 218167 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26051 ns 27541 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10708 ns 11042 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9062.5 ns 9000.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 8083 ns 7875 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17291 ns 17416.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 266847 ns 261161 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9633341 ns 9552185 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1603229 ns 1560042 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 153121 ns 155181 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8750 ns 8125 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8791.5 ns 8084 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10417 ns 10083.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8437.5 ns 8542 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 128922 ns 116504 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3499288 ns 3349407.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 831208 ns 798667 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 239042 ns 238952.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10750 ns 9854 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9917 ns 10229.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10687 ns 10083 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9354.5 ns 9958 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 639262.5 ns 623888 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21817843 ns 22194230 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5444791 ns 4515667 ns 1.21
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 670891.5 ns 656976 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9771 ns 9520.5 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9875 ns 9125 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11666.5 ns 11625 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8854 ns 9479.5 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 124952.5 ns 120769 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3245458 ns 3531092 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 911292 ns 888291 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 77221 ns 79170 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15125 ns 14208 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13333 ns 13208.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 15083 ns 16333 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16667 ns 17000 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 609842 ns 594781 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19612533.5 ns 19851682 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4902166 ns 4474458 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 355303 ns 357348.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 625 ns 459 ns 1.36
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 458 ns 1.28
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35616 ns 34855 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1099929 ns 1184802 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 481667 ns 423042 ns 1.14
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 209262 ns 209842 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10292 ns 7709 ns 1.34
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 7084 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9542 ns 7708 ns 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8042 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 236913 ns 231568.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21559240 ns 22217593.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5704459 ns 5660167 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 678127 ns 679867 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16208 ns 16042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 15041.5 ns 15333 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 13333 ns 13854 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10083 ns 10375 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22727 ns 22215 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1191974 ns 1158702.5 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 321687.5 ns 205521 ns 1.57
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 193902 ns 194012 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32167 ns 31958 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32208.5 ns 32145.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32458 ns 32250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32042 ns 32250 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 281783 ns 276502.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10921459.5 ns 11085623 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1840125 ns 1721729 ns 1.07
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 605236 ns 605276.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 442417 ns 474834 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 441208 ns 445167 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 444667 ns 486875 ns 0.91
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 446792 ns 474916 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194055 ns 194410 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5844423 ns 5748288 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1998937.5 ns 2751937.5 ns 0.73
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 325264 ns 326354 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3799583.5 ns 3823792 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3786145.5 ns 3824042 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3841479.5 ns 3849500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3776729 ns 3847584 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 555887 ns 546410 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27559795 ns 27926309 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10126917 ns 10140750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1386964 ns 1388348.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 782483459 ns 782652917 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 542653375 ns 542161792 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 417336604.5 ns 420966458.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1561846646 ns 1553203729.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22756973.5 ns 22558411.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14049919 ns 14062784.5 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3017390750 ns 2518008250 ns 1.20
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1789540125 ns 1785714792 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1525075250 ns 1525039667 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 5275786875 ns 4874366334 ns 1.08
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 366208751 ns 367235490 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 89425410 ns 88231178 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 82209 ns 77646 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 81334 ns 75959 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80375 ns 82625 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76395.5 ns 77291 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 214623 ns 208602.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7766051 ns 8336540 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 665062.5 ns 525229 ns 1.27
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 109531 ns 109211 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 287479 ns 199042 ns 1.44
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 257000 ns 262396 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 194709 ns 276625 ns 0.70
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 234083 ns 287458 ns 0.81
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1077150 ns 1056833 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42088564.5 ns 40754174 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6234000 ns 6090583 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 648046 ns 646691 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199633792 ns 199913000 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139392833 ns 139280375 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 103930750 ns 104140916 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388502917 ns 389020708 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5820718.5 ns 5827400 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3416784 ns 3419864.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619755417 ns 620313062.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 441375250 ns 440225000 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 354920645.5 ns 352767458 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1184933208 ns 1182963541 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26386522.5 ns 26862507 ns 0.98
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21815610 ns 21755438 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7292 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6208 ns 6083 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5416 ns 5291 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9959 ns 10041 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28498 ns 28028 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1116372 ns 1272660 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 611458 ns 627458 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49170.5 ns 48010 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219833 ns 220750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221375 ns 220521 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222000 ns 221875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206458.5 ns 209208.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 227567 ns 222206 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31738279 ns 29719216 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9196020.5 ns 9434666.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 536326 ns 527475 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7291.5 ns 8458.5 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9375 ns 9209 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10521 ns 10375 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8417 ns 8083 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 119937 ns 119377.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3293949 ns 3449983 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 889708 ns 855000 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 71981 ns 72520 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9229 ns 8958.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7896 ns 7500 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 10084 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 11083 ns 10187.5 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 536667 ns 521950 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19493979.5 ns 18008002 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4818459 ns 4315292 ns 1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321554 ns 321943 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 666 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 750 ns 625 ns 1.20
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 27275 ns 26701 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1222604 ns 1195571.5 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 497250 ns 459104 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 50630 ns 48701 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10875 ns 10375 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9333 ns 8479 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10750 ns 11375 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15083 ns 9375 ns 1.61
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 258393 ns 252977 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23478126 ns 24052360 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6003417 ns 5702709 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 398169 ns 397983.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106542 ns 106500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 99625 ns 98125 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 86895.5 ns 87479.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146499.5 ns 147229 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 25487 ns 24863 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1171426 ns 1228355 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 273000 ns 263458.5 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189472 ns 190212 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 516500 ns 478667 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 478500 ns 509250 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 478520.5 ns 518562.5 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478667 ns 520417 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 238705 ns 234381 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11556495 ns 11772054 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2283167 ns 2148312.5 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 621496.5 ns 621156 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5208 ns 5375 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6917 ns 5167 ns 1.34
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7833 ns 7500 ns 1.04
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4791.5 ns 4833.5 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16844 ns 16136 ns 1.04
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 73180.5 ns 79061 ns 0.93
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 13000 ns 14083 ns 0.92
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10916.5 ns 10208.5 ns 1.07
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 10250 ns 10292 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17500.5 ns 16708 ns 1.05
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 219010.5 ns 213958 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 371483 ns 374963 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 40958 ns 40000 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50709 ns 50584 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 51875 ns 52458.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 15166.5 ns 13895.5 ns 1.09
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20635.5 ns 19866 ns 1.04
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86331 ns 87035.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37292 ns 38625 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32417 ns 30646 ns 1.06
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 29333 ns 30791.5 ns 0.95
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 58104.5 ns 57666 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 197497 ns 192524 ns 1.03
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 408484 ns 416745 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1750 ns 1604.5 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 2000 ns 1791 ns 1.12
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2042 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1729.5 ns 1708 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 21528 ns 21123 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1119527 ns 1140764 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 330250 ns 294500 ns 1.12
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 33530 ns 30391 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2042 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2333 ns 2292 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2167 ns 2208 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 208397 ns 205122.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8834014.5 ns 8519681 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1669875 ns 1638500 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 137426.5 ns 139726.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5833 ns 5709 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4667 ns 5104 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6625 ns 5750 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5770.5 ns 4271 ns 1.35
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 149631.5 ns 146388.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5835867 ns 5488369.5 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 758875 ns 465291 ns 1.63
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 71411 ns 72161 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8708 ns 8479.5 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 8209 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9375 ns 8750 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8958 ns 9042 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 903242 ns 884256.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 37421660 ns 38177021 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6164958 ns 5496125 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388963 ns 394569 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56750 ns 56791 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57708 ns 57625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57000 ns 56875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57708 ns 58166 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38228 ns 37427.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1180922 ns 1210467.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 755458 ns 468667 ns 1.61
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205452.5 ns 208482 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449666.5 ns 487354.5 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 464874.5 ns 501250 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 464667 ns 492208.5 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 453208 ns 437438 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 274665.5 ns 267413 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26139340 ns 26782051.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8078333 ns 8248375 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 799638 ns 839679 ns 0.95
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3310209 ns 3311333.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2329479.5 ns 2340166.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1770896 ns 1769958 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6317959 ns 6319645.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205935 ns 205610 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 202792 ns 202712 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11454916.5 ns 11497979 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8308854.5 ns 8319667 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6560334 ns 6588125 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21092271 ns 21221896 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 744389 ns 736463 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1059296 ns 1065445 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5667 ns 5562.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4958.5 ns 4666.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6792 ns 6437.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 6104 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 142184.5 ns 139569.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5394650.5 ns 5734965.5 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 792375 ns 826042 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 61591 ns 59531 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9145.5 ns 9333.5 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7000 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 11875 ns 0.66
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9791.5 ns 8708 ns 1.12
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 775371 ns 764194 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 34345007 ns 34028843.5 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5281125 ns 5176312.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 379583 ns 378403 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97500 ns 99625 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 97041 ns 136708 ns 0.71
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 99521 ns 101312.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144416 ns 129709 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 153437 ns 151420 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5924089.5 ns 6034399 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2030687.5 ns 1982667 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 210747 ns 206692 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2013833 ns 2031041 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026084 ns 2037417 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2035834 ns 2036291 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2002500 ns 2038584 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 724917.5 ns 708221 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31735074 ns 31488037 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10929291.5 ns 11251291 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1125621 ns 1126246 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33542 ns 33459 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36125.5 ns 36750 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 34458 ns 33833 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 750 ns 667 ns 1.12
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16054 ns 15506 ns 1.04
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 80321 ns 86920 ns 0.92
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3500 ns 4792 ns 0.73
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 2709 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3125 ns 3167 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2916 ns 2291.5 ns 1.27
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 143545.5 ns 140769.5 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 358138 ns 351474 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 7250 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6041 ns 6000 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9750 ns 10000 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37536 ns 36795 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1179967 ns 1247042.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 669000.5 ns 351333 ns 1.90
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 51590.5 ns 49030 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213270.5 ns 213334 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221541.5 ns 220166.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220812.5 ns 228125 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207375 ns 206875 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 245557 ns 244945 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27398743.5 ns 24969632 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7962500 ns 7965166.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 590596 ns 578090.5 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21981 ns 21762 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2171600 ns 2067928.5 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 247458.5 ns 245104 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45951 ns 45631 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14916 ns 14875 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15000 ns 14916 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14709 ns 14667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14917 ns 14667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 309306 ns 310256.5 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11748235 ns 11269459 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1034396 ns 1000292 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 196272 ns 193502 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 128084 ns 102917 ns 1.24
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 102584 ns 103667 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 103583 ns 108625 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 104209 ns 131875 ns 0.79
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 136798 ns 137366.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6093498 ns 5955500.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2059854.5 ns 1988958 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203112 ns 200842 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919896.5 ns 1926354.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1856625 ns 1913500 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1919937.5 ns 1917792 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1868750 ns 1936729 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 694137 ns 692519 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30885450 ns 33116808.5 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10571000 ns 11144584 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1230312 ns 1078360.5 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17979.5 ns 17708 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18958 ns 22291.5 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22041 ns 21250 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18625 ns 19146 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 109512 ns 109241 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3501819 ns 3392625.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1337916.5 ns 1271125 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76371 ns 81331 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218229.5 ns 221229.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 252958.5 ns 216791 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216958.5 ns 230083.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216458 ns 216083.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 524772 ns 522920 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19866707.5 ns 19545470 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6113500 ns 6165645.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 491915 ns 476780 ns 1.03
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 25458 ns 26250 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 29770.5 ns 31250 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 26708.5 ns 27875 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1459 ns 1292 ns 1.13
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16298 ns 16312 ns 1.00
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 83171 ns 87751 ns 0.95
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5250 ns 6625 ns 0.79
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5104 ns 4645.5 ns 1.10
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5292 ns 4917 ns 1.08
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4541 ns 4792 ns 0.95
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 208683 ns 207882.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 387234 ns 402074 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 304958 ns 305938 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 304709 ns 305917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 310083.5 ns 307521 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 305125 ns 305375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230658.5 ns 230214 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7814600 ns 7500239 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1113875 ns 643000 ns 1.73
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 278313 ns 280903 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 530375 ns 538541 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 531208 ns 549750 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 532709 ns 542666 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 529417 ns 529708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1090787 ns 1085631 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43485820.5 ns 44253871 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6297104 ns 6154687.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 884370 ns 872599 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19542 ns 19021 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19812.5 ns 19833.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23875.5 ns 22542 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19666 ns 21917 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 113863.5 ns 114174 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3373534.5 ns 3531348.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1442083 ns 1449271 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79591 ns 81471 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222250 ns 218834 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214333.5 ns 227542 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213833 ns 219708 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213584 ns 212708 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 765292 ns 761865.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24070784 ns 24050167 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7204250 ns 7412916.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 545746 ns 543136 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6709 ns 7125.5 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7312.5 ns 6479 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8708 ns 8458 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6667 ns 6084 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 141798 ns 141785 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5523548 ns 5370056 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 816958 ns 777458 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 68790 ns 69581 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9792 ns 12958 ns 0.76
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10166 ns 9583.5 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11083.5 ns 10687.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9584 ns 9625 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 837915.5 ns 832452.5 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 38080103 ns 38810557 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5622750 ns 5231375 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 401534 ns 395184 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 5145.5 ns 1.17
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5791 ns 4812.5 ns 1.20
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7458 ns 6958 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 6833 ns 0.60
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 145690.5 ns 144967.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5623588 ns 5514807.5 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 838375 ns 829125 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 70361 ns 70250 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 7770.5 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7333 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8291 ns 7667 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7208 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 793785.5 ns 790491 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39186335.5 ns 37869840 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 6086875 ns 5670687 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 404344 ns 398424.5 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14528416 ns 14518959 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10140541 ns 10120000 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7739520.5 ns 7708791.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27755500 ns 27832250 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 533007 ns 532409 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 396764 ns 399949.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46273458.5 ns 46375083.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33442312.5 ns 33404583.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26625187.5 ns 26627416.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85281333 ns 85835750 ns 0.99
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2648843 ns 2644453 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3280893 ns 3278895 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 65042 ns 66042 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 68437.5 ns 66125 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 71167 ns 70520.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67833 ns 67875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 119287 ns 119873.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3455134 ns 3330724 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1439375 ns 1410021 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 231657.5 ns 229907.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 443062.5 ns 453292 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 484834 ns 441208 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 441417 ns 450208 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 440834 ns 445541 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 732137 ns 732886.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25953439 ns 26274297 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7552375 ns 7781500 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 813538 ns 794638 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 667 ns 0.75
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32502 ns 32132 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1227556 ns 1164338 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 474604 ns 431645.5 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 51870 ns 49160 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 8292 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9729 ns 8708 ns 1.12
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9979 ns 9250 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8583 ns 8959 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 286691.5 ns 286401.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21308278 ns 21940598 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5559917 ns 5096125 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 380464 ns 388934 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9834 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9834 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9792 ns 9875 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23253 ns 23178 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2074064 ns 1908743.5 ns 1.09
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 224792 ns 222541 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216413 ns 217383 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46250 ns 45875 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46209 ns 45917 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46375 ns 46167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46000 ns 45875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 291933.5 ns 293089 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11156765 ns 10988297.5 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 1508167 ns 982875 ns 1.53
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 625531.5 ns 621107 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56209 ns 56250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57208 ns 57125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56458 ns 56334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57750 ns 57792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28351 ns 28527 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1247340 ns 1186883 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 717520.5 ns 578645.5 ns 1.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 212262.5 ns 204943 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 495167 ns 448333 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 471209 ns 494125 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 466354 ns 507583 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 453729.5 ns 439437 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 247898 ns 247232 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31410562 ns 33216066 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9721459 ns 9499166 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 892599 ns 891519.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 646125 ns 652937.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 611208 ns 647333 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 659458 ns 662854 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 622229 ns 668500 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 207532 ns 207996 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8263342 ns 8125052.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1391958.5 ns 1384354.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 254943 ns 233282 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2234625 ns 2235042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2230250 ns 2238979 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2254374.5 ns 2248959 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2230084 ns 2260792 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 981398 ns 984096 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49241405.5 ns 45382984 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6989291 ns 8132833.5 ns 0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1255398 ns 1370494 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20458 ns 20958 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19354.5 ns 20000 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23583.5 ns 22667 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19500 ns 22083 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 113732 ns 113160 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3641860 ns 3278898 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1478625 ns 1472792 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81541 ns 81561 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 231562.5 ns 222313 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222667 ns 257542 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221667 ns 232250 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227771 ns 228000.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 732151 ns 734156.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26497381.5 ns 27357269 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7749334 ns 7692750 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 560606 ns 559476 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23209 ns 23248 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1215660.5 ns 1222462 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 487375 ns 466625 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 51290 ns 51870 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9166.5 ns 9167 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9667 ns 9208 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10375 ns 9292 ns 1.12
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8917 ns 9312.5 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 269153 ns 268568 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 23827343 ns 24289416 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6253667 ns 6049709 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 410245 ns 410500 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9334 ns 10333 ns 0.90
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8583.5 ns 8458 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10979 ns 10354 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7875 ns 8333 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 119974 ns 120393.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3326234.5 ns 3445203 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 884291 ns 832874.5 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 74671 ns 72921 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7437.5 ns 7583 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7937.5 ns 8208 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8042 ns 7417 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7562.5 ns 7770.5 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 511775.5 ns 511772 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17975530 ns 16339001 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4375354.5 ns 3959271 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 328823 ns 328364 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1333 ns 1458 ns 0.91
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1666.5 ns 1542 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 1833 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1334 ns 1541 ns 0.87
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21678 ns 21725 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1152981 ns 1136020 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 320854 ns 296000 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 194212 ns 194712 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3250 ns 3250 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3417 ns 3250 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3625 ns 3500 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3375 ns 3209 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 219466 ns 220221.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10834184 ns 9698879 ns 1.12
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1832687.5 ns 1612667 ns 1.14
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596936 ns 596166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 150437 ns 148167 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 128250 ns 127709 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 108166.5 ns 107958.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225041.5 ns 225958 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24160 ns 24338 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1310090.5 ns 1138772 ns 1.15
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 273792 ns 270854.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 40781 ns 40151 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 180083 ns 156125 ns 1.15
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 123167 ns 127209 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 87791 ns 100750 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 262687.5 ns 256666.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 218451.5 ns 218905 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10657598 ns 10030041 ns 1.06
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2067708.5 ns 2003417 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 220982 ns 240417.5 ns 0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7292 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10375 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32820 ns 32865 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1163489 ns 1134920.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 617459 ns 562875 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50661 ns 52191 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232458 ns 230854.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229042 ns 270500 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228729.5 ns 264875 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219875.5 ns 213771 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 265376 ns 263381.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26768413 ns 28212764 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8263416.5 ns 8517000 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 529665 ns 607266 ns 0.87
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15083 ns 14958 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14958.5 ns 15500 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 17125 ns 16500 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14833 ns 15625 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 140181 ns 140749.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5308940 ns 5465169 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 888979.5 ns 787125 ns 1.13
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238352 ns 238512 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23438 ns 22583 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24208 ns 23500 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25166.5 ns 24084 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24375 ns 23167 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 870581 ns 875101 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 41063907 ns 37582744 ns 1.09
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5887062.5 ns 5600270.5 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 700012 ns 692048 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9625 ns 9125 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9542 ns 9250.5 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11979.5 ns 10521 ns 1.14
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9125 ns 9209 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 124080.5 ns 124561 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3416043 ns 3393331 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 822708 ns 802083 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 79851 ns 79030 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13750 ns 13750 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14667 ns 14125 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15021 ns 14125 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13625 ns 13917 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 671966 ns 670894 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 19977653 ns 20295661 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5398125 ns 5274042 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 371888.5 ns 375405 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8958 ns 9208.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9916 ns 9167 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11812.5 ns 10438 ns 1.13
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9041 ns 9584 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 122970.5 ns 122339.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3293970 ns 3319433.5 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 925417 ns 882875 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73870 ns 75581 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12417 ns 12333.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12708 ns 12645.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13167 ns 12708 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12584 ns 12708 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 556881 ns 557225 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19063398.5 ns 18661226 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4862458 ns 4435167 ns 1.10
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 347418.5 ns 345844 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 30417 ns 30292 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34729 ns 34021.5 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 29916.5 ns 30854.5 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2167 ns 1791 ns 1.21
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16379 ns 16303 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 82421 ns 82211 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5084 ns 5270.5 ns 0.96
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5250 ns 5354 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5167 ns 5375 ns 0.96
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6500 ns 6625 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 140549.5 ns 140733 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 368674 ns 394064.5 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 250 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26252 ns 26135 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1216332 ns 1123770.5 ns 1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 474208 ns 474625 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 47830 ns 50311 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6375 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6145.5 ns 1.10
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6916.5 ns 6416 ns 1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6416 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 187727.5 ns 187828 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22632657 ns 23626156 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6029520.5 ns 5544437.5 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391454 ns 395104 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 2042 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 2000 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2084 ns 1959 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1958 ns 2000 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26767 ns 26544 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1182321 ns 1165809 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 489208 ns 461708.5 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 208282 ns 209972 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16250 ns 15792 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16792 ns 16375 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17750 ns 17000 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16270.5 ns 16084 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 275397.5 ns 275962 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24766157 ns 24890960.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6216125 ns 5972833 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 716018 ns 713667.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 200667 ns 178250 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 178687.5 ns 184187.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 152375 ns 153417 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147249.5 ns 147459 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 203130 ns 204372 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7922362.5 ns 7857309.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1454125 ns 1392667 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 176432 ns 196752 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1327938 ns 1326895.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1323854.5 ns 1320625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1334542 ns 1330833 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1328041 ns 1334750 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 915965.5 ns 917280 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46188516 ns 46023181 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6707708 ns 6714958.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1115346.5 ns 1108992 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26042 ns 25229.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24792 ns 26583 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28458 ns 26833 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24583 ns 25917 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 237159 ns 239791.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7575003.5 ns 7972748 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1212208 ns 980542 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 104651 ns 116941 ns 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 117583 ns 179917 ns 0.65
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 172521 ns 141604.5 ns 1.22
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118500 ns 127354.5 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117062 ns 118604 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1090625 ns 1092585 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43917712 ns 43816902.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6372458 ns 6033333 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 621286 ns 606086 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23137 ns 22970 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1238877 ns 1175116 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 498542 ns 456125 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48881 ns 48591 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6625 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6833 ns 6750 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7167 ns 6542 ns 1.10
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6459 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 205267.5 ns 204628 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23886381 ns 23603781 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6302042 ns 6092458 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 402254.5 ns 397554 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7041.5 ns 6125 ns 1.15
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5541.5 ns 6334 ns 0.87
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7625 ns 6709 ns 1.14
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6000 ns 5937.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 146785.5 ns 147027 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5822475 ns 5559804 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 746708.5 ns 583167 ns 1.28
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 241962 ns 237472 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9958.5 ns 9666.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10500 ns 10041 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10041 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9979 ns 9854 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 912783 ns 910526.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 42603631 ns 39406121 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6373125 ns 5909375 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 691307 ns 686288 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 667 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 667 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22725 ns 22655 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 1963806 ns 2037996 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 336042 ns 222583 ns 1.51
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 216303 ns 215862 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4916 ns 4625 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 229275 ns 232442.5 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10608963 ns 9881227 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1723167 ns 1690521 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 602246 ns 600181 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7791 ns 8562.5 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8416.5 ns 7937.5 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10375 ns 9771 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8229.5 ns 8520.5 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 122398.5 ns 122197 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3507632 ns 3361719 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 910208 ns 761542 ns 1.20
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 78850.5 ns 76241 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8104 ns 8792 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8833 ns 8459 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9583 ns 8875 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8750 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 595869 ns 595652 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22073835 ns 20278296 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5319000 ns 4718125 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 356484 ns 354274 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126292 ns 125917 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 128916 ns 128958 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 97208.5 ns 96959 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183042 ns 181416 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46271 ns 46106 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 97571 ns 96666 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 333542 ns 317875 ns 1.05
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 347479 ns 346375 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 167291.5 ns 178979 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 625999.5 ns 569062.5 ns 1.10
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 192528.5 ns 191966 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 496135 ns 487875 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398041 ns 397125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288000 ns 288292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215416 ns 215791 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756458 ns 757959 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43488.5 ns 43243.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1364389 ns 1345812 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 410625 ns 404062.5 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83870 ns 83381 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1469145.5 ns 1459854 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1135500 ns 1136645.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 863146 ns 865270.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2442125 ns 2359813 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 247561 ns 259216 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11237365 ns 11177773 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1833625 ns 1833666 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 354774 ns 349653.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 659187.5 ns 642333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 640354.5 ns 649875 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 662145.5 ns 660416.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 548312.5 ns 623542 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 203446 ns 202604 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7976375 ns 7957177 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1383791.5 ns 1348791.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 265813 ns 265108 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2448916.5 ns 2448583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2438562 ns 2452104 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2469417 ns 2473833 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2434958 ns 2455791 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1005073 ns 1005284.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50692046 ns 50767854.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9506625 ns 10026166 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1461465 ns 1511186 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32333.5 ns 32375 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35166.5 ns 35749.5 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 33042 ns 34312.5 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 1125 ns 916 ns 1.23
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15934.5 ns 15700 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 81141 ns 81140 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3083 ns 3166 ns 0.97
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3250 ns 3083 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3500 ns 3125 ns 1.12
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3083 ns 3000 ns 1.03
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 139173.5 ns 139352.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 343023.5 ns 344664 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406750 ns 405583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408334 ns 408750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 402208 ns 403083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420125 ns 422042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43496 ns 43343.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1528026.5 ns 1354478 ns 1.13
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1169000 ns 1109583 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 239982 ns 240442 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3872041 ns 3869125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3991083.5 ns 3994396 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4002833 ns 3999708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3774729 ns 3774354.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 244269 ns 244251 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35724955 ns 35978667 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11845709 ns 11608750 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1455475 ns 1245273.5 ns 1.17
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3958 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33788 ns 34866 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1302459 ns 1227111 ns 1.06
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 181708 ns 175291 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40650 ns 42710 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15708 ns 15750 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15958 ns 15667 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15792 ns 15500 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15667 ns 15542 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 255478 ns 256386 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8672466 ns 8908913 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 889209 ns 872958 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 173551.5 ns 174412 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404708 ns 404166 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 296104 ns 295666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 221291 ns 221625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760584 ns 760500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112862 ns 113218 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1002276 ns 1016425 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 423187 ns 393437 ns 1.08
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89911 ns 90851 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1475917 ns 1473333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1161354.5 ns 1161666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 887896 ns 888166.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2464979 ns 2383791 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 241481.5 ns 241468.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 11763790 ns 11846004 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1944292 ns 1877938 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 358523 ns 360704 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 459 ns 542 ns 0.85
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26062.5 ns 25943 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1278854 ns 1192515 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 460042 ns 470937.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208442 ns 208143 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7333 ns 7458 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7958 ns 7583 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7958 ns 7458 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 7709 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 211704 ns 214477.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24805137 ns 25777295.5 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5538000 ns 5998979.5 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 700917 ns 700287 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 832145.5 ns 831271 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 616959 ns 617041 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 471750 ns 470000 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1544042 ns 1545709 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130757 ns 129860.5 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 184082 ns 169171.5 ns 1.09
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2688499.5 ns 2689145.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1996249.5 ns 2013250 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1538625 ns 1538125 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4933916 ns 4941375 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 245136 ns 241461 ns 1.02
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 883529 ns 867019 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32166 ns 31985 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1203462 ns 1142400.5 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 460354 ns 453291.5 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48821 ns 48580 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6250 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6375 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 6416 ns 1.09
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6333 ns 6166 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 225402.5 ns 224593 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19737803.5 ns 21127237.5 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5669541 ns 5053916 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 369024 ns 372504 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2394958 ns 2423917 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2408625 ns 2397291.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2394708 ns 2403792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2370458 ns 2371125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 201622 ns 203214 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7737337 ns 8123069 ns 0.95
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1450791 ns 1393562 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 380674 ns 332763.5 ns 1.14
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4647583 ns 4645250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4653208.5 ns 4645125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4673334 ns 4654250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4514458.5 ns 4658042 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 912510 ns 910071 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47986504.5 ns 48057492 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6739083.5 ns 6619584 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1262608 ns 1416215 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6833 ns 7438 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 9771 ns 7083 ns 1.38
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7250 ns 6958 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7542 ns 6979 ns 1.08
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23275 ns 23722 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1207675 ns 1176238 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 265333 ns 263000 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 37750 ns 34150 ns 1.11
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 64978.5 ns 68020.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 71125 ns 50312 ns 1.41
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52833 ns 53292 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 70666.5 ns 32583 ns 2.17
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 218519 ns 218170 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10524819 ns 10824043 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2194937.5 ns 2030958 ns 1.08
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 240683 ns 244333 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21375.5 ns 21437 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26271 ns 25333 ns 1.04
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22917 ns 23479.5 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5709 ns 6083 ns 0.94
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16707.5 ns 16786.5 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 85561 ns 91501 ns 0.94
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11812.5 ns 12208.5 ns 0.97
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10625 ns 10083 ns 1.05
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9417 ns 9458.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18104.5 ns 17854.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 227822 ns 228126 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 392324 ns 376824 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406375 ns 406500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 296875 ns 297312.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 223583 ns 223791 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762417 ns 762958 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46850 ns 46683 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1392045 ns 1412498.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 490875 ns 476666.5 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 90551 ns 89121 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1484020.5 ns 1499875 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1170375 ns 1167833.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 895687 ns 894271 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2469958 ns 2389834 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 291935 ns 292932.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11212854.5 ns 13048501 ns 0.86
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2109562.5 ns 2098166 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 377974 ns 380285 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 434000 ns 433875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436708 ns 436334 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 430250 ns 430709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 446334 ns 448020.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54345 ns 54564 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1014127 ns 1024914 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1104583.5 ns 1099208.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 235673 ns 236522.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3886750 ns 3897208 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4017812.5 ns 4021833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4032979.5 ns 4027708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3781312.5 ns 3812146 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 264536.5 ns 264154 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31087542.5 ns 31494055 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10495333 ns 10517749.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1395074 ns 1245028 ns 1.12
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8708 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7666 ns 7666 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 6958 ns 6834 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12500 ns 12459 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24046 ns 24707 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2146721 ns 2085760.5 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 230625 ns 225250 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216742 ns 215337.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45125 ns 45042 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45208 ns 45125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 44875 ns 45083 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45334 ns 45187.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 346890 ns 350283.5 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13295582 ns 11134325 ns 1.19
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1934541.5 ns 1805125 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 671457 ns 662902 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 106854.5 ns 93959 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 119958 ns 129416 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 87875 ns 87916.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87250 ns 125062.5 ns 0.70
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189626 ns 189645 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5844625 ns 5972246.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1928125.5 ns 1906021.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 187382 ns 201947 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2014874.5 ns 2011375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2017709 ns 2017791 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027542 ns 2029459 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000000 ns 2017916.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 539260 ns 537811 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28910116 ns 27667805 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9321083.5 ns 9734479.5 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1102071 ns 1103102 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal force-pushed the ap/dropout_enz branch 4 times, most recently from 78b8546 to 08e559d Compare September 5, 2024 13:57
@avik-pal avik-pal merged commit da67a46 into main Sep 5, 2024
4 of 13 checks passed
@avik-pal avik-pal deleted the ap/dropout_enz branch September 5, 2024 14:38