This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

Commit

fix: missing import; fixes #179 (#180)
avik-pal authored Oct 28, 2024
1 parent 877ef96 commit 6976693
Showing 2 changed files with 3 additions and 2 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "LuxLib"
uuid = "82251201-b29d-42c6-8e01-566dec8acb11"
authors = ["Avik Pal <avikpal@mit.edu> and contributors"]
-version = "1.3.5"
+version = "1.3.6"

[deps]
ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
3 changes: 2 additions & 1 deletion src/impl/Impl.jl
@@ -29,7 +29,8 @@ using ..Utils: Utils, NotaNumber, batchview, concrete_bias_act_output_eltype, co
copy_drop_gradients, eltype_mismatch, expand_batchdim,
maybe_reduce_BLAS_threads, ofeltype_array, only_derivative, remove_tracking,
reset_BLAS_threads, run_ka_kernel, safe_eltype, safe_vec, safe_warning,
-unsafe_known, unrolled_mapreduce, can_loopvec_args, @enzyme_alternative
+unsafe_known, unrolled_mapreduce, can_loopvec_args, is_extension_loaded,
+@enzyme_alternative
using ..Traits: activation_intermediate_not_needed, activation_has_rrule, is_mutable_array,
fuse_cpu_activation
using ..System: explicit_blas_loaded, use_octavian, fits_in_l1cache, fits_in_l2cache,
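The change to src/impl/Impl.jl adds `is_extension_loaded` to the explicit `using ..Utils: ...` list. A minimal sketch of why a name left out of such a list tends to fail only when the code that uses it runs is shown here. The modules `Demo`, `BrokenImpl`, and `FixedImpl`, the function `check`, and the function bodies are hypothetical stand-ins, not LuxLib's actual code.

```julia
# Hypothetical layout for illustration only; not LuxLib's real module contents.
module Demo

module Utils
# Placeholder bodies: the real helpers' logic is not shown in this commit.
is_extension_loaded(::Val) = false
unsafe_known(x) = x
end

module BrokenImpl
# `is_extension_loaded` is missing from the explicit import list.
using ..Utils: Utils, unsafe_known
check() = is_extension_loaded(Val(:SomeExtension))
end

module FixedImpl
# Same code, with the name added to the import list.
using ..Utils: Utils, unsafe_known, is_extension_loaded
check() = is_extension_loaded(Val(:SomeExtension))
end

end # module Demo

# The broken module still loads; the error surfaces only when `check` is called.
broken_fails = try
    Demo.BrokenImpl.check()
    false
catch err
    err isa UndefVarError
end

fixed_result = Demo.FixedImpl.check()
```

Because an explicit `using Module: a, b` brings in only the listed names, a code path that is rarely exercised can hide a missing name until that path is hit.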

3 comments on commit 6976693

@avik-pal
Member Author

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/118229

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a v1.3.6 -m "<description of version>" 6976693d49414e0b0d2e23374cb27ab10ce951c8
git push origin v1.3.6

@github-actions
Contributor

LuxLib Benchmarks

Benchmark suite Current: 6976693 Previous: 877ef96 Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6333 ns 5000 ns 1.27
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5125 ns 5125 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8041 ns 7375 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5542 ns 4833 ns 1.15
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 106513 ns 108327 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 892833 ns 704958 ns 1.27
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 441815 ns 452318 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9833 ns 10000 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 9917 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10209 ns 10229.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9895.5 ns 9729.5 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 513421 ns 538089 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2325125 ns 2390625 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 666767 ns 709441 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1333 ns 1792 ns 0.74
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1500 ns 1792 ns 0.84
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1666.5 ns 2000.5 ns 0.83
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1459 ns 1584 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 20135 ns 19729 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 453896 ns 439229 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 33340 ns 33851 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4375 ns 4375 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4375 ns 3833.5 ns 1.14
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4417 ns 4250 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4271 ns 3520.5 ns 1.21
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 134295 ns 134838 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 2377250 ns 2235354 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 150416.5 ns 143632.5 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58042 ns 56375 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46833 ns 46875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46833 ns 46750 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81500 ns 78375 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36986 ns 36801 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1095167 ns 1444229 ns 0.76
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 81585.5 ns 84285 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2034750 ns 2037375.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082521 ns 2083500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087250 ns 2090334 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1980458 ns 1999916 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 221788 ns 215168.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 4497250 ns 5415625 ns 0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1017651 ns 1280705 ns 0.79
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 155458 ns 148666.5 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145750 ns 145833 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149875 ns 152417 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147042 ns 160792 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166515 ns 167254 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1577896 ns 1500250 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 173242 ns 172909 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1120521 ns 1133479.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1117125 ns 1112750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1113979.5 ns 1115292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1110958 ns 1109687.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 654141.5 ns 623047 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8479083 ns 10180459 ns 0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1028872 ns 1022168 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4375 ns 4771 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4395.5 ns 4708 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6125 ns 6666 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 4167 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 85237.5 ns 80121.5 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 1327375 ns 1222709 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69631 ns 56392.5 ns 1.23
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8584 ns 8521 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 8542 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 9375 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8584 ns 8542 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 558891 ns 547974 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 7932562.5 ns 7799104.5 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388659.5 ns 384758 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17541.5 ns 18062.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19417 ns 16875 ns 1.15
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19729.5 ns 21625 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17104 ns 17666.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 63486.5 ns 62259 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1339292 ns 1327729 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 72720.5 ns 76443 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221959 ns 212542 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212104 ns 217708 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217417 ns 222604.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219541 ns 235416.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 331752 ns 326680 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5750229 ns 5672875 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 468975 ns 468011 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 959 ns 0.78
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 792 ns 0.74
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18902 ns 18885 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 430875 ns 446167 ns 0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30710 ns 31881 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1416 ns 1417 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1667 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1334 ns 1375 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 115876 ns 117120.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 2152000 ns 2151437.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 136481 ns 135835 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7250 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9959 ns 10166 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23983 ns 23630 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 870625 ns 838084 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48931 ns 48897 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221500 ns 220042 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 241041 ns 234750 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267583 ns 270833.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212771 ns 253000.5 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 190752 ns 188891 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9536917 ns 8581771 ns 1.11
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 612916 ns 612944.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23861 ns 23120 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 449042 ns 433416 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48310 ns 47491 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16792 ns 16542 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16917 ns 17041 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16791 ns 17167 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17084 ns 16875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 187040 ns 186342.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 2209583 ns 2081000 ns 1.06
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 176662 ns 174571.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 922958 ns 919250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 828208 ns 828041 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 826833 ns 838917 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1251916 ns 1258333 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113841 ns 113235.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 539875 ns 452875 ns 1.19
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 244213 ns 243040 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2596083.5 ns 2556167 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2332333 ns 2320333.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2327583.5 ns 2328916.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3546333 ns 3549104.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 231757 ns 229235 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 2200250 ns 2156125 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 745088 ns 739658 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6666 ns 6084 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5312.5 ns 5520.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7000 ns 8354 ns 0.84
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6625 ns 5834 ns 1.14
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 83981 ns 83528.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 1197104.5 ns 1131521 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 59291 ns 58842 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11604.5 ns 11729.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11979 ns 11583 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10917 ns 11479.5 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11250 ns 10999.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 599108 ns 596279 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 7575937.5 ns 7505021 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 410064 ns 402564 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23782 ns 23594 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 438000 ns 436875 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48641 ns 48301 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2084 ns 2208 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 224264.5 ns 224089.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2511958 ns 2406437.5 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 175852 ns 182056 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8875 ns 8916 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9125 ns 8292 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9229 ns 11209 ns 0.82
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8792 ns 8375 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 107320.5 ns 101414 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 1245250 ns 1214500 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 73535.5 ns 73272.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18084 ns 18625 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17500 ns 17208.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17979 ns 18667 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17875.5 ns 16771 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 580294 ns 555190.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5603625 ns 5531208.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 384069 ns 379272 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 459 ns 458 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33286 ns 34468 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 662708.5 ns 654854 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46191 ns 45552 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9229 ns 9854 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 9250 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9041.5 ns 9458 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8875 ns 8562.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 252697 ns 257386.5 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5732729 ns 5553750 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 368379 ns 366942 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397000 ns 396542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288166 ns 288042 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288167 ns 287541 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756250 ns 756167 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 110629.5 ns 112104 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 394292 ns 519187.5 ns 0.76
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 74970 ns 76352 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1465084 ns 1409875 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1132458 ns 1132584 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1130750 ns 1126791.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2438896 ns 2436813 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 196505 ns 199625 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1727792 ns 1712834 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 325688.5 ns 322335 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7667 ns 7083 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6917 ns 6874.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7437.5 ns 8458 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6479.5 ns 6938 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 134612.5 ns 134438.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 1200042 ns 1132749.5 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 59360.5 ns 59441 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15146 ns 16563 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14750 ns 13917 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15417 ns 16167 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13270.5 ns 15187.5 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 893159 ns 880177 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 8030208 ns 7959042 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 422534.5 ns 418702.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25666 ns 24146 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27708 ns 23791.5 ns 1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 26375 ns 28250 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 31250 ns 24896 ns 1.26
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 184506 ns 185908.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1655917 ns 1653167 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 112126 ns 114524 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104541 ns 152041 ns 0.69
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 149479.5 ns 105395.5 ns 1.42
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 117292 ns 113125 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 131500 ns 104979 ns 1.25
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1008242 ns 1011252 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8271541.5 ns 8155875 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 585661.5 ns 577332 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 79667 ns 79000 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 83312.5 ns 76417 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75250 ns 76833 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74625 ns 80250 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 188947 ns 190543 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1588521 ns 1268166 ns 1.25
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 126096 ns 125494 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 295166 ns 301375.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 297958 ns 295750 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 302917 ns 231208 ns 1.31
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208479.5 ns 209499.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1011400 ns 1046615 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9046229.5 ns 9187687.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 695657.5 ns 689189 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13333 ns 13333 ns 1
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13458 ns 13334 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13291.5 ns 15062.5 ns 0.88
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13166 ns 12750 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 136849.5 ns 137754.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 1163104 ns 1170125 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 234843 ns 233927 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26270.5 ns 28270.5 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27250 ns 26542 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26187.5 ns 27166.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26750 ns 26062 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 906727.5 ns 912323.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 8072458.5 ns 7923459 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 686187 ns 689579 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13375 ns 15042 ns 0.89
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 15167 ns 14625 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14917 ns 17292 ns 0.86
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 14042 ns 13834 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 119077.5 ns 119657.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 1244520.5 ns 1225791.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239672 ns 239157 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26583 ns 26375 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26874.5 ns 26208 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26583 ns 26375 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27458 ns 26375 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 671026 ns 665016.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6008312.5 ns 5755000 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 677847 ns 674067.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184417 ns 183750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182500 ns 181645.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184875 ns 187833 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182167 ns 183666 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100828 ns 101191 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1357167 ns 1353021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234093 ns 235596.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584250 ns 636291 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 584458 ns 594625 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586834 ns 592062.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 601875 ns 613458 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 487358 ns 491587 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6120229 ns 6127021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 715228 ns 708249 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7854.5 ns 7375 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8000 ns 8333 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7542 ns 9417 ns 0.80
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7250 ns 7229.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 135774.5 ns 137783 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 1199333 ns 1110021 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 58821 ns 57461 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13041 ns 14812.5 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14709 ns 14791 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13937.5 ns 14875 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14562.5 ns 12896 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 874051 ns 881205 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 7775916 ns 7653313 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 397914.5 ns 399470 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6152062.5 ns 6156708 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6375958.5 ns 6375958.5 ns 1
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6375125 ns 6373937.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11916542 ns 11907750 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 336665 ns 347134 ns 0.97
batchedmm(512, Bsize=4)/forward/GPU/Metal 1592000 ns 1596208 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 301084 ns 300417.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19075667 ns 19072062.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19963000 ns 19937292 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19972937.5 ns 19969000 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36525333.5 ns 36484084 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1014155 ns 1007983 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/Metal 7901000 ns 7924354 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1170477.5 ns 1163329 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1750 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1791 ns 1875 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23343 ns 23636 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 470292 ns 431667 ns 1.09
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209012 ns 208896 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4959 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4833 ns 4833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 268124 ns 270525.5 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2721875 ns 2513333 ns 1.08
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 627407 ns 618686 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083.5 ns 9416.5 ns 0.75
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8229 ns 7917 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8458 ns 9625 ns 0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7666.5 ns 7271 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 115788 ns 116370.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 1227812.5 ns 1185875 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 66840 ns 68072 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12083 ns 11937.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11541.5 ns 10958 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12125 ns 12417 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11313 ns 11083.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 602742 ns 603718 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5680125 ns 5647937.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 358744 ns 355648 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22574 ns 22877 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 434292 ns 443875 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 50011 ns 46351 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2916 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 3083 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2958 ns 3250 ns 0.91
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3084 ns 2958 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 193009.5 ns 196283.5 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 2163750 ns 2099292 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 169072 ns 160444 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14791.5 ns 14208.5 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14458.5 ns 14375 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15749.5 ns 17521 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 16229 ns 14729 ns 1.10
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 117264.5 ns 116923.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 1169583 ns 1146125 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 239503 ns 237206 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25938 ns 25666 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25166.5 ns 25500 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25667 ns 25875 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25417 ns 25791 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 554698 ns 551650 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5215833 ns 5245875 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 655417 ns 650325 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4292 ns 4208 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4333 ns 4208 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4167 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24021 ns 24277 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 448792 ns 445125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 49241 ns 48561 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16334 ns 15917 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16458 ns 16208 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16417 ns 16250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16125 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 315889 ns 320460 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 2507333 ns 2478875 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 210422 ns 206705 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5625 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5708 ns 5917 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5834 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5833 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34426 ns 35140 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 658604.5 ns 657000 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 206672 ns 205735 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22542 ns 20708 ns 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22791.5 ns 21146 ns 1.08
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21541.5 ns 22208 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 22125 ns 21750 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 281382.5 ns 281377 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 6163916 ns 5995542 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 690657 ns 679901 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60229 ns 58583 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 67000 ns 65083 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 66291 ns 66334 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51000 ns 51645.5 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66426 ns 66570 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/Metal 14940791.5 ns 14881125 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 99716.5 ns 95562 ns 1.04
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 133354.5 ns 181791.5 ns 0.73
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 166083 ns 125000 ns 1.33
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 109833 ns 149958.5 ns 0.73
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 298209 ns 310334 ns 0.96
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 207628 ns 209829 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/Metal 46423875 ns 46762875 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 557541 ns 579958 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82458 ns 82625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81542 ns 80750 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84459 ns 86292 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83500 ns 82500 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192154 ns 192479 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2110021 ns 1995437.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 193692 ns 168164 ns 1.15
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1637875 ns 1923792 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1914562.5 ns 1884271 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1911000 ns 1888583 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1914333 ns 1917291 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 500508 ns 508617 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8777854.5 ns 8813959 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1073351 ns 923511 ns 1.16
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 333 ns 291 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21580 ns 21906 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 497166.5 ns 450667 ns 1.10
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 43201 ns 41861 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1791 ns 1875 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1916 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 241066 ns 246989 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 2295374.5 ns 2172458.5 ns 1.06
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 182807 ns 186805 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10625 ns 9979 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9687.5 ns 8562.5 ns 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11875 ns 11458 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8063 ns 8666.5 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 115805 ns 114779 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 1135833.5 ns 1098750 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 236673 ns 238165 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 9771 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9542 ns 10000 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10062.5 ns 10291 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10042 ns 9604.5 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 486530 ns 492318 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5071146 ns 5055604 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 637096.5 ns 634834 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 56541 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46541 ns 46708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47000 ns 46792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78709 ns 77500 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38846 ns 38130.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1372041 ns 1203084 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75456 ns 79889 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1862978.5 ns 1937792 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1924708 ns 1980021 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1939583.5 ns 1936541.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1603042 ns 1886999.5 ns 0.85
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 209319 ns 211665 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10839458.5 ns 11204125 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1019311.5 ns 1008110 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 269563 ns 267979 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268083 ns 266375 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 270937.5 ns 271000 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 269646 ns 268291.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 192827 ns 193827.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1582625 ns 1446458.5 ns 1.09
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 284593 ns 282897 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 658729.5 ns 675542 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 675437.5 ns 673792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 589229.5 ns 589042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 691584 ns 681292 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 977700 ns 994673.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9348125 ns 8996396 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 902340 ns 898667.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2226917 ns 2161437 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2207375 ns 2211833 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2181500 ns 2212042 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2213209 ns 2215687.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 156097 ns 154115 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1440375 ns 1427083.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 409085 ns 406627 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5503437.5 ns 5581500 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5507667 ns 5501104 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5517375 ns 5517083.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5238729 ns 5264333.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 926541 ns 937351 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9910166.5 ns 10010417 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1552722.5 ns 1552019 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 991958 ns 986917 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 898791 ns 898250 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 900521 ns 898500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1323375 ns 1324292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46383 ns 46763 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 460146 ns 458458.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 244933 ns 243438 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2620500 ns 2547916.5 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2325791 ns 2324625 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2331292 ns 2333583 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3556812 ns 3548709 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 253489 ns 256534 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2357041 ns 2463833 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 774424 ns 770755 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56375 ns 56084 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46395.5 ns 46250 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46750 ns 46542 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82708 ns 81750 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27939 ns 27782 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1375750 ns 1193583 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76386 ns 72909 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1707541 ns 2048500 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085833 ns 2090917 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2085083 ns 2061417 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1715833 ns 1996958.5 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 222633 ns 223774 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11461896 ns 11058874.5 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1034336.5 ns 1035585 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 55854.5 ns 56458 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46833 ns 46709 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46791 ns 47084 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78416.5 ns 78584 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48253 ns 48280 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1130333 ns 1315916.5 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 61021 ns 71380 ns 0.85
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1652750 ns 1903125 ns 0.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1976145.5 ns 1963666.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1972583 ns 1961854 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1873145.5 ns 1850771 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 229304.5 ns 231382 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9660125.5 ns 9466667 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 919460 ns 913772 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32770 ns 34209 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 649167 ns 630896 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 46020 ns 48489 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6541 ns 6625 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7291 ns 6375 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6875 ns 7208 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6500 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 203642.5 ns 205122.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5720167 ns 5599333 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 320874 ns 366869 ns 0.87
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 291 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31916 ns 32165 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 380667 ns 385250 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 34140 ns 40300 ns 0.85
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2959 ns 2875 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 3083 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2917 ns 2959 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3125 ns 3000 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 180055.5 ns 183941 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1844396 ns 1836854.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 148011 ns 164169.5 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1425959 ns 1427166.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1413333 ns 1449750 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1411729 ns 1417625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1408583 ns 1441604 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134588.5 ns 134383 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2798124.5 ns 2843875 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 352599 ns 355189 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5010083.5 ns 4996833 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5062521 ns 5015708 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5038334 ns 5020625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011959 ns 4981250 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 667945 ns 673084.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10907500 ns 10662292 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1472837 ns 1463829 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49831333 ns 49772312.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35550667 ns 35522417 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35529334 ns 35489333 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97719291 ns 96946583 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1621745 ns 1601690 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/Metal 10632250 ns 10627562.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1054872 ns 1042214.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154578979 ns 154216458 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112329583.5 ns 112301604.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112410541 ns 112218667 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 298542500 ns 294869708.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6574515 ns 6475752.5 ns 1.02
batchedmm(512, Bsize=32)/zygote/GPU/Metal 72685667 ns 70117375 ns 1.04
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5575743 ns 5557063.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 48000 ns 48417 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47583 ns 47916 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47959 ns 48021 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47625 ns 47541 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19664 ns 19924.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 506125 ns 496041 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27380 ns 25680 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 51146 ns 49792 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50834 ns 50708.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50792 ns 51209 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50667 ns 51458 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 242242 ns 245262 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 2303041.5 ns 2146500 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 148061.5 ns 146160 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7958 ns 10209 ns 0.78
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9875 ns 8959 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10167 ns 10750 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8791 ns 9000 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 118650 ns 118313 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 1220104.5 ns 1163542 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 237682.5 ns 237350.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10834 ns 10708 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10834 ns 10417 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10854.5 ns 10833 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12083 ns 10208 ns 1.18
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 577803 ns 582997 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5781833 ns 5755625 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 657058 ns 653411 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9250 ns 8417 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9333 ns 8979 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10270.5 ns 11208 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8625 ns 9875 ns 0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 115000 ns 115767 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 1179208.5 ns 1146625 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72391 ns 72681 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 14833 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 14604.5 ns 14584 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14708 ns 14979.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13937.5 ns 14125 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 553338.5 ns 554958.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5232750 ns 5137041 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 345108.5 ns 345660.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 958 ns 1083 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 958 ns 958 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33129 ns 34204.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 644958 ns 638979.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 207273 ns 207831 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9667 ns 8291 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9833 ns 8541 ns 1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9125 ns 9292 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9625 ns 9500 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 222729 ns 223363.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5938167 ns 5901875 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 660822 ns 657971.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23250 ns 23500 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23708 ns 23542 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23625 ns 23834 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23666 ns 23125 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 19720 ns 20050 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 463291 ns 448583.5 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 187092 ns 188301 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 54583.5 ns 53770.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 53417 ns 53042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53500 ns 54042 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 53437 ns 55020.5 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 257983 ns 258832 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 2497375 ns 2415625 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 593097 ns 588042 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1400083 ns 1448437.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1397354 ns 1438125 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1396958 ns 1405125 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1439145.5 ns 1396021 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193949 ns 194395.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2076583 ns 2058625 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 346604 ns 346302 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5008750 ns 5024812.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4964729 ns 5026125 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5035396 ns 5011083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4929271 ns 5006958 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 504516 ns 510089 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9047750 ns 9178458 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1200808.5 ns 1198365 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 788818625 ns 779661000 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 551612334 ns 541756209 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 540244250 ns 545828709 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1530028375 ns 1513614750 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22553472 ns 22673094 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/Metal 107424833 ns 107171459 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14536744 ns 14686436 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2524317083 ns 2975273958 ns 0.85
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1813606000 ns 2889890291 ns 0.63
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1790647500 ns 1793050500 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4736014333 ns 4711214375 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118463004 ns 118916960 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/Metal 3156126000 ns 2622707250 ns 1.20
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 87116333 ns 87900974 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76750 ns 76541 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79667 ns 79375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78750 ns 79167 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75625 ns 85583 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 191467.5 ns 191949 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1500958 ns 1500104 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 107821 ns 105890.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 197854 ns 261583.5 ns 0.76
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 282500 ns 232562.5 ns 1.21
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 194437.5 ns 196625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 193354.5 ns 192687.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 989075 ns 996248 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8722083.5 ns 8743333 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 632717 ns 628158 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199760333.5 ns 198984604 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138689875 ns 139204167 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139093875 ns 139144125 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 393577208 ns 393236834 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5844135 ns 5825572 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/Metal 33596666.5 ns 33344937.5 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3554721 ns 3611135.5 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 618168312.5 ns 617564646 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 438128750 ns 440013042 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439741104.5 ns 438881145.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1193079708 ns 1193608916 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26785977 ns 26745549.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/Metal 111846000 ns 110179542 ns 1.02
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21973627 ns 21869093 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7500 ns 7083 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6208 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6042 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9750 ns 9833 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26816 ns 26360.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 870208 ns 873478.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47821 ns 46220 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227291 ns 213416.5 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 248354 ns 232437.5 ns 1.07
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221417 ns 222375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206500 ns 219250 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 213108 ns 215332 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9197375 ns 8943333 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 528466 ns 524234 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8125 ns 8083 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8333 ns 8291 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10771 ns 10709 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7125 ns 8500 ns 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 112963 ns 113094.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 1159166 ns 1123895.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 72821 ns 70651 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9000 ns 8917 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8541.5 ns 8958 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8520.5 ns 8584 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8041 ns 8208 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 485370 ns 492563 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5087333 ns 5073167 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321064 ns 317437.5 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns 459 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 375 ns 541 ns 0.69
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24957 ns 25048 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 721416 ns 713958 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 48420 ns 46561 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 12146 ns 10666.5 ns 1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 11771 ns 11479 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11625 ns 11583 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 11833.5 ns 10354 ns 1.14
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 244608 ns 244034 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6507583 ns 6283709 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 390079.5 ns 383588 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351395.5 ns 353416 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351312.5 ns 353792 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352000 ns 352021 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352667 ns 350958 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 22442 ns 22877.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 331292 ns 312208 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189122 ns 188432 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 827250 ns 793000 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 807500 ns 807333.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 779084 ns 777437 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824062.5 ns 830979 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 215869 ns 218580 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2827333 ns 2766209 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 609147 ns 604914.5 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5375 ns 5521 ns 0.97
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5041.5 ns 5479 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 6542 ns 7396 ns 0.88
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4292 ns 4166 ns 1.03
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17243 ns 17982 ns 0.96
batchedmm(16, Bsize=32)/forward/GPU/Metal 1971750 ns 1438291.5 ns 1.37
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 71641 ns 71380 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12896 ns 12520.5 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 12770.5 ns 11521 ns 1.11
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11687.5 ns 11521 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17250 ns 18042 ns 0.96
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 201112.5 ns 207562.5 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/Metal 5430958.5 ns 5079708 ns 1.07
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 376764 ns 368113 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39958 ns 38125 ns 1.05
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51479.5 ns 51291.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 53084 ns 52584 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13375 ns 13500 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20843 ns 20289 ns 1.03
batchedmm(16, Bsize=128)/forward/GPU/Metal 4986791.5 ns 4978875 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 80861 ns 84681 ns 0.95
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37666 ns 36896 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32604.5 ns 31458 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31666 ns 31958 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 67625 ns 66000 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 178132 ns 184469 ns 0.97
batchedmm(16, Bsize=128)/zygote/GPU/Metal 13540167 ns 13432687 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 411414 ns 412423 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3583 ns 3583 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3458 ns 3666 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3958.5 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3458 ns 3500 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19148 ns 19634 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 475042 ns 458041 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 26470 ns 28900 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4208 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4375 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4500 ns 4625 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4500 ns 4167 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 192719 ns 197467.5 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 2182750 ns 2168666 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 138367 ns 138551.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 5208 ns 1.18
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4833 ns 4792 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5833.5 ns 7250 ns 0.80
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 3792 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 139477 ns 142334.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 1191667 ns 1171167 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 59621 ns 58781 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9312.5 ns 9125 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8541 ns 8833 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 9125 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8250 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 804737 ns 822603 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 7653458.5 ns 7665708 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 389654.5 ns 387763.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 206083 ns 204042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210292 ns 212000 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210625 ns 210875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 202208 ns 200958 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36872 ns 36985.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 841958 ns 853417 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205472 ns 205912 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 654354 ns 653187.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 673208.5 ns 665958 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 622021 ns 622770.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 623250 ns 585667 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 254216 ns 260510 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8205667 ns 8195083 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 802774 ns 799653 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3322833.5 ns 3369291 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2326375 ns 2332125 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2332417 ns 2329166 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6281542 ns 6307167 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204376.5 ns 205325 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/Metal 6114500 ns 6066541 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 214983 ns 212943 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11470521 ns 11648041 ns 0.98
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8342437.5 ns 8330687.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8323000 ns 8348104 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21067292 ns 21116042 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 785999 ns 734131.5 ns 1.07
batchedmm(128, Bsize=128)/zygote/GPU/Metal 28332646 ns 26082375 ns 1.09
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1073402 ns 1069061 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7562.5 ns 4521 ns 1.67
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 5208 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6416 ns 7583 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4208.5 ns 5500 ns 0.77
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 130660 ns 132826.5 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 1185854.5 ns 1175375 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 55851 ns 55421 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8750 ns 9292 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9541 ns 8334 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10834 ns 9562.5 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 12333 ns 8604.5 ns 1.43
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 712252 ns 716825.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 7308583.5 ns 7184437.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 372784 ns 369984 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 124416 ns 98313 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93999.5 ns 125521 ns 0.75
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98583 ns 100541 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99812.5 ns 103500 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148811.5 ns 149399 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2832625 ns 2228333.5 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186182 ns 182342 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027833 ns 2046104.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2027833.5 ns 2031250 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2031312 ns 1985791.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1974000 ns 2021416.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 662744 ns 674153.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10775125 ns 10587167 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1253594 ns 1250004 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33875 ns 34188 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36708 ns 36000 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35000 ns 35021 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 750 ns 833 ns 0.90
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15499 ns 15860 ns 0.98
batchedmm(2, Bsize=4)/forward/GPU/Metal 543666.5 ns 553417 ns 0.98
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 76180 ns 75761 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3708 ns 3083.5 ns 1.20
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 4333 ns 3541 ns 1.22
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3458.5 ns 3625 ns 0.95
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2583.5 ns 3375 ns 0.77
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 135016 ns 140010.5 ns 0.96
batchedmm(2, Bsize=4)/zygote/GPU/Metal 1502041 ns 1942729.5 ns 0.77
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 356474 ns 353624 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7000 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 6041 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 5958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9709 ns 9958 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36031.5 ns 35885 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 847396 ns 854042 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49931 ns 50330 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221584 ns 223104 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223416 ns 234125 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221500 ns 221250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219292 ns 215667 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238688 ns 243422 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7960125 ns 8021021 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 520506 ns 512516 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21870 ns 22271.5 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 453375 ns 468292 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 43661 ns 43460 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14750 ns 14167 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14666 ns 14541 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14542 ns 14583 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14459 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 296421 ns 303531 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 2342458 ns 2253708.5 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 189867.5 ns 200012.5 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 130583 ns 99083 ns 1.32
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 100333 ns 128333.5 ns 0.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102250.5 ns 103812 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122167 ns 103958.5 ns 1.18
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131841 ns 150020 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2397729 ns 2875583 ns 0.83
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 183912 ns 195772 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1931437.5 ns 1887875.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1901292 ns 1929042 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1914291.5 ns 1884833 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1655875.5 ns 1894729 ns 0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 656012 ns 670688 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10340375 ns 10463500 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1066462 ns 1065452 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18125 ns 18959 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17208.5 ns 17354.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20479 ns 22208 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17749.5 ns 17541.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104389.5 ns 104525.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1377917 ns 1362312.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79261 ns 79351 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230583.5 ns 252250 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233166 ns 260833 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219208.5 ns 219458 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218334 ns 257937 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 490023 ns 495429 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6087208 ns 6195583 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 468975 ns 462125 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24459 ns 24958.5 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 32521 ns 32604.5 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28917 ns 27500 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1291.5 ns 1208 ns 1.07
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15797 ns 16021 ns 0.99
batchedmm(16, Bsize=4)/forward/GPU/Metal 560437.5 ns 533959 ns 1.05
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 87701 ns 80071 ns 1.10
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5833 ns 5250 ns 1.11
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 6959 ns 5854.5 ns 1.19
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5708 ns 5792 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 5145.5 ns 6125 ns 0.84
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 199406 ns 201439.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/Metal 2041979 ns 2014541.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 394889.5 ns 376235 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222542 ns 221583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221708 ns 222541.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222750 ns 226291 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221437.5 ns 221875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 217746 ns 219232.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1704083 ns 1686583 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 274143 ns 271454 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 509083.5 ns 559604 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 504916.5 ns 548354 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 500250 ns 500083.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 510916.5 ns 498250 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1021374 ns 1034159 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8740667 ns 8587229 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 857769.5 ns 850955.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19750 ns 19625 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18333 ns 19313 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22500 ns 23208 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19104.5 ns 20583 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111476.5 ns 111518.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1485375 ns 1475625 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77575.5 ns 80186 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225146 ns 215020.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215291.5 ns 250333 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215562 ns 214500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217354.5 ns 221729.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 701275.5 ns 708936 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7154208 ns 7292833 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 539526 ns 539977 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7416.5 ns 6166 ns 1.20
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5916.5 ns 6479 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8958.5 ns 8042 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 6417 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 131574 ns 133623 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 1218250 ns 1170916 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65831 ns 66921 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12062.5 ns 12250 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13125 ns 11729.5 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14500 ns 13334 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16562.5 ns 11645.5 ns 1.42
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 768388 ns 771416.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 7371292 ns 7239334 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 382304 ns 391255 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7187.5 ns 4500 ns 1.60
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4771 ns 5041.5 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 7042 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6375 ns 5500 ns 1.16
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 133143 ns 134989.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 1214458 ns 1146875 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 57010 ns 58260 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7687.5 ns 7750 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7750 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7708 ns 8125 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7958.5 ns 7709 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 736643 ns 738275 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 7612479.5 ns 7536771 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 390394 ns 386245 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14470208.5 ns 14664541 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10098583 ns 10093041 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10105833 ns 10106791 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27692583 ns 27704625 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 536068 ns 529053 ns 1.01
batchedmm(128, Bsize=512)/forward/GPU/Metal 22376167 ns 22466021 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 395754 ns 401266 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46412458.5 ns 46793583 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33522500 ns 33459958.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33458958 ns 33523667 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85300667 ns 85429125 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2879954 ns 2854223 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/Metal 83664854.5 ns 89341312.5 ns 0.94
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3291966 ns 3309294 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 187833.5 ns 188000 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 185250 ns 186250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 187792 ns 188667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 185292 ns 185938 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101714 ns 101713 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1511792 ns 1484500 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 230572 ns 235268 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 598667 ns 641812.5 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 641604.5 ns 636958 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 590145.5 ns 589208 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 598458.5 ns 591771 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 702997.5 ns 704450.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7556645.5 ns 7517417 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 789679 ns 785986 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 750 ns 0.67
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 750 ns 750 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 750 ns 667 ns 1.12
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31490 ns 32067 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 666250 ns 651375 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47360 ns 47241 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12458 ns 9979 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9958 ns 11521 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12270.5 ns 10188 ns 1.20
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10292 ns 9500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 276432 ns 276358.5 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 6005333.5 ns 5875459 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 380359 ns 374075 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26291 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26750 ns 26500 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26292 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23151.5 ns 23479 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 439917 ns 437083 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210222 ns 210433 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67084 ns 67042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67292 ns 68833 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67083 ns 68917 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66958 ns 67583 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 272442 ns 274089 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 2242375 ns 2210459 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 609967 ns 606899 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203875 ns 204500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210083 ns 210417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209458 ns 211125 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199541 ns 200125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28002 ns 27585 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 878708.5 ns 861208 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205762.5 ns 205157.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 647687 ns 652542 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 630416 ns 671541 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 624563 ns 624208 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 629666 ns 580625 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 234240 ns 236486 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9313917 ns 9239500 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 850729 ns 837472 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 655291 ns 650083 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 642270.5 ns 650625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 617187.5 ns 550709 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 610333 ns 652708 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 187686 ns 186884 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1354667 ns 1405750 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 249712 ns 234974 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2271521 ns 2244125 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2229354.5 ns 2249625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2251625 ns 2253687.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2105417 ns 2232292 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 903523 ns 908141 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9610542 ns 9610291 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1362010.5 ns 1356860 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20437.5 ns 19479 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18708 ns 20020.5 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22875 ns 22000 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19604.5 ns 20500 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107237 ns 107405.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1498312.5 ns 1497959 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81410 ns 82031 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232333 ns 259687.5 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232917 ns 234896 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222708 ns 223354.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 231584 ns 222104 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 698969 ns 701938 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7523124.5 ns 7694083.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 552146 ns 552123 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 750 ns 0.67
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 750 ns 750 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22758 ns 22889 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 720354 ns 713250.5 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 49520 ns 47681 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 11209 ns 10833 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10708 ns 11458 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16250 ns 10958 ns 1.48
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10000 ns 11333 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 258661.5 ns 258094.5 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6648083 ns 6601250 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400195 ns 398396 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10292 ns 8021 ns 1.28
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8166 ns 7916.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9771 ns 10479 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8687.5 ns 7771 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 114739 ns 114650.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 1152208.5 ns 1128833 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69971 ns 67611 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8167 ns 8625 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11375 ns 9459 ns 1.20
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 9334 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7479 ns 10083 ns 0.74
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 475261 ns 474110.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4907479.5 ns 4853125 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 325933 ns 322085 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2375 ns 2104.5 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2208 ns 2375 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2562.5 ns 2667 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2333 ns 2125 ns 1.10
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 19068 ns 19503 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 427542 ns 435896 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 191412 ns 189822 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 8208 ns 7666.5 ns 1.07
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 7292 ns 7083 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 7083 ns 7771 ns 0.91
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6583 ns 8417 ns 0.78
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 208633.5 ns 209638.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 2362021 ns 2304438 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 583987 ns 579508 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 747083 ns 749167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746666.5 ns 749833.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 747771 ns 747292 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 746625 ns 748521 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 22542.5 ns 22733 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 340104 ns 312604 ns 1.09
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 38681 ns 37375.5 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 773000 ns 778000 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791312.5 ns 807229 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775416.5 ns 774167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 792583 ns 776625 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 207047 ns 207826 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2669500 ns 2597208 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 233853 ns 220633 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7209 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6042 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32722 ns 32931 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 855791 ns 855708.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48751 ns 50540 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230146 ns 262833 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 264521 ns 263396 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228791 ns 229333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212833 ns 212854 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 253139 ns 255573 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8191500 ns 8358834 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 522136 ns 524047.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13021 ns 12083 ns 1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11375 ns 11959 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14187 ns 13583 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12395.5 ns 12771 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 131017.5 ns 132456 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 1184667 ns 1189125 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 233542 ns 233113 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25271 ns 25021 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24167 ns 25500 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25041 ns 25458 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24333 ns 24792 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 811743.5 ns 815326 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 7729583 ns 7701292 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 685097 ns 681611 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9187.5 ns 9562.5 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9958 ns 9833 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10208 ns 12000 ns 0.85
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9458.5 ns 9541.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 118607 ns 118599 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 1259500 ns 1229416 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 72621 ns 74341 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15791.5 ns 14375 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14125 ns 20917 ns 0.68
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15334 ns 17250 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14583.5 ns 15562.5 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 623973.5 ns 626256 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5665292 ns 5717062 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 372954 ns 368145 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9771 ns 9270.5 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9125 ns 9208 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10917 ns 11042 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9375 ns 9145.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 117527.5 ns 117653 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 1177500 ns 1158958 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 72381 ns 73341 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13687 ns 14062.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 16084 ns 15125 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13000 ns 15125 ns 0.86
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12937.5 ns 15146 ns 0.85
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 517238 ns 518369.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5092125 ns 5051833 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 342213.5 ns 340775 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 28209 ns 27708 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34167 ns 33875 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 32375 ns 31792 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2020.5 ns 2229.5 ns 0.91
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16305 ns 16522 ns 0.99
batchedmm(2, Bsize=128)/forward/GPU/Metal 4760375 ns 4854041.5 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 77641 ns 78412 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 6458 ns 5583 ns 1.16
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 6625 ns 5917 ns 1.12
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5666 ns 6084 ns 0.93
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6417 ns 7770.5 ns 0.83
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 134001.5 ns 136257 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/Metal 13311250 ns 13273333 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 386434 ns 379326 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 24308 ns 24751 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 701250 ns 682541.5 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48831 ns 48791 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9291.5 ns 7520.5 ns 1.24
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 8583 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8625 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6416.5 ns 7458.5 ns 0.86
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 180410.5 ns 181857 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6473583 ns 6285375 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388845 ns 389326 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5708 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5916 ns 6208 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6166 ns 6000 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25125 ns 25394 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 723375 ns 714417 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 207722 ns 207474 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22833 ns 26375 ns 0.87
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 23000 ns 23250 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23187.5 ns 21459 ns 1.08
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21625 ns 20250 ns 1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 264109 ns 262619.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6721166.5 ns 6644125 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 692722.5 ns 695681 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149812.5 ns 145625 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147708 ns 178292 ns 0.83
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149834 ns 150417 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147375 ns 153812.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 185601 ns 188204 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1536750 ns 1588584 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 193682 ns 190633 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1331000 ns 1345771 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1332124.5 ns 1331542 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1336000 ns 1322333.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1317917 ns 1167354 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 846555 ns 856737 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9345604.5 ns 9165250 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1090882 ns 997975 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25313 ns 24250 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23709 ns 24458.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 26250 ns 27084 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24500 ns 24417 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 225227.5 ns 225455 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1714291 ns 1705354 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 103451 ns 115742 ns 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 154250 ns 127500 ns 1.21
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 144603.5 ns 174187 ns 0.83
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 122187.5 ns 119042 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 118542 ns 130375 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 967258 ns 984493 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8776208.5 ns 8679292 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 545716 ns 591319 ns 0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22145 ns 22641 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 722291.5 ns 689208 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 49071 ns 47290 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7083.5 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 8083 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12458 ns 6958 ns 1.79
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6750 ns 6500 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 196429.5 ns 197931.5 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6548875 ns 6549187.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391269.5 ns 395326.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7125 ns 6333.5 ns 1.12
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5333 ns 5708 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6771 ns 7541 ns 0.90
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5833 ns 6000 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 135991 ns 137058.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 1202458 ns 1181916.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 235063 ns 232733 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10875 ns 10833.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10166 ns 10583 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10959 ns 10416 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 9792 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 836132.5 ns 841858 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 8143458 ns 8090729 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 676977 ns 672580 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1542 ns 1584 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22716 ns 22927 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 438354.5 ns 429250 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209702 ns 208003 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5792 ns 5917 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6084 ns 6375 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5833 ns 6125 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5750 ns 5750 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 215208 ns 217549 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 2221500 ns 2167125 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 586451.5 ns 581914.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8292 ns 8562 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8541 ns 8458 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8562.5 ns 10291.5 ns 0.83
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7875 ns 8229.5 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 116817 ns 116906 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 1243833 ns 1209583 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 71641 ns 77271.5 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10084 ns 9104.5 ns 1.11
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9041 ns 15417 ns 0.59
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9708 ns 8792 ns 1.10
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8084 ns 1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 553365 ns 557267.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5644167 ns 5634417 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 347974 ns 344656 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126146 ns 125125 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129625 ns 130729 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 129167 ns 130250 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180792 ns 181042 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45896 ns 46296.5 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/Metal 360083 ns 364354 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 98361 ns 100232 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 331812.5 ns 309333 ns 1.07
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 345666 ns 342125 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 328917 ns 313833 ns 1.05
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 614041 ns 570709 ns 1.08
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 182821 ns 185266 ns 0.99
batchedmm(128, Bsize=4)/zygote/GPU/Metal 2280959 ns 1373875 ns 1.66
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 505920.5 ns 506148 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397250 ns 396437.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288041 ns 289000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288166 ns 288375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756000 ns 756250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43372 ns 43482.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 445416 ns 434458 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 81381 ns 79761 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1449813 ns 1408916.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1136895.5 ns 1136979 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132916 ns 1132062 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2442792 ns 2443000.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 244597 ns 248184 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1978292 ns 1965375 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 348724 ns 349476 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 652792 ns 645500 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 653208.5 ns 650562.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 548209 ns 546541.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 584396 ns 545645.5 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 183317 ns 173484 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1360500 ns 1350375 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 252133 ns 242424 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2466333.5 ns 2520666.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2483562.5 ns 2473750 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2467833 ns 2447792 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2282833.5 ns 2452584 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 931909 ns 937381.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10376583 ns 10132041 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1299404 ns 1450713 ns 0.90
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32959 ns 30500 ns 1.08
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 36125 ns 36187.5 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34625 ns 34146 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 958 ns 958 ns 1
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15098 ns 15458 ns 0.98
batchedmm(2, Bsize=32)/forward/GPU/Metal 1378749.5 ns 1293854 ns 1.07
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 79251 ns 71001 ns 1.12
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 4084 ns 3084 ns 1.32
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 4750 ns 3958 ns 1.20
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3917 ns 3333 ns 1.18
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3125 ns 3042 ns 1.03
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 132398 ns 135380 ns 0.98
batchedmm(2, Bsize=32)/zygote/GPU/Metal 5241250 ns 5260562.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 338354 ns 340585.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1459250 ns 1460666 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500166 ns 1503375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1498875 ns 1503000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1438479.5 ns 1441729 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42417 ns 41871 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1219791 ns 1242250 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 240737.5 ns 239254 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5130542 ns 5151979 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5293020.5 ns 5296833.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5286125 ns 5285437.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4986583 ns 4980042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230731.5 ns 230225 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11576750 ns 11359208.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1242163 ns 1233400 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33753 ns 33654 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 376312 ns 352750 ns 1.07
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 39530 ns 39741 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15708.5 ns 15041 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15709 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15750 ns 15500 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15292 ns 15375 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 249731.5 ns 251748 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 1855541 ns 1635667 ns 1.13
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 171782 ns 165632 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404167 ns 401812.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295875 ns 296666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296083 ns 295167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760916 ns 760709 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113370 ns 113125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 587375.5 ns 574187 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89731 ns 87471 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1479833 ns 1429500 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1153270.5 ns 1159833 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1161500 ns 1157541 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2457520.5 ns 2466395.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 240863 ns 235512 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 2046250 ns 1507125 ns 1.36
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 353339 ns 353405 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 959 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1042 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24309 ns 24950 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 718583 ns 692770.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 209613 ns 208254 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10000 ns 7917 ns 1.26
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 9916 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9916 ns 8583 ns 1.16
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 206553.5 ns 202658.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6653271 ns 6448187.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 690112.5 ns 697032 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 829375 ns 831021 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 616937.5 ns 619667 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 616312 ns 618250 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1443333.5 ns 1541417 ns 0.94
batchedmm(128, Bsize=32)/forward/GPU/CUDA 132766 ns 131643 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/Metal 1745250 ns 1716917 ns 1.02
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 166192 ns 166023 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2679709 ns 2699312.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1975667 ns 1995500 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2000708 ns 1985791 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4917479.5 ns 4946958 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 234270.5 ns 234057 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/Metal 6784500 ns 6761458 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 861380 ns 852834 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 30986 ns 32746 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 649291.5 ns 642249.5 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 47661 ns 47461 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9250 ns 6208 ns 1.49
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 9334 ns 0.82
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9145.5 ns 6708 ns 1.36
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6458 ns 6229 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 216879.5 ns 223155 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 6034187.5 ns 6000375 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 362223.5 ns 361916 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1727271 ns 1731292 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1726375 ns 1754791 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726708 ns 1728874.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1718541.5 ns 1745562.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 184752 ns 190073 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1378104.5 ns 1502437.5 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 355824 ns 353886 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4378291.5 ns 4404625 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4427500 ns 4422041 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4404270.5 ns 4362625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4352625.5 ns 4346521 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 828954 ns 855907 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9565917 ns 9512792 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1254843 ns 1246280 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7167 ns 6875 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7000 ns 17395.5 ns 0.40
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7167 ns 7250 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6896 ns 6834 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22145 ns 22751 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 276666.5 ns 272959 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 36790 ns 37041 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 66916 ns 33000 ns 2.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 46000 ns 68979.5 ns 0.67
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 39708.5 ns 33333 ns 1.19
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 33041 ns 45500 ns 0.73
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 204844 ns 212527.5 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2631208.5 ns 2608042 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 218562.5 ns 221728.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21375 ns 23417 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26250 ns 25542 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 25542 ns 23312.5 ns 1.10
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5292 ns 5625 ns 0.94
batchedmm(2, Bsize=512)/forward/GPU/CUDA 17697 ns 18456 ns 0.96
batchedmm(2, Bsize=512)/forward/GPU/Metal 15018041 ns 14791020.5 ns 1.02
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 84241 ns 89826.5 ns 0.94
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 13041 ns 11917 ns 1.09
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 11875 ns 11125 ns 1.07
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 11000 ns 10625 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18250 ns 17958 ns 1.02
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 215801 ns 223372.5 ns 0.97
batchedmm(2, Bsize=512)/zygote/GPU/Metal 46206666 ns 45999500 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 371364 ns 382947 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405666.5 ns 403917 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297167 ns 297500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 297083 ns 297375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762584 ns 762334 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46537 ns 47041 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 552000 ns 533542 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88131 ns 89431 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1486542 ns 1426250 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1163917 ns 1164625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1163479.5 ns 1163125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2469750 ns 2468250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 271051 ns 281846 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2273937.5 ns 2244750 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 377594 ns 378111.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1484541 ns 1487625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526541 ns 1529979.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526583 ns 1529729.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1461708 ns 1464667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53340 ns 54740 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1143959 ns 1143667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236227.5 ns 235424 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5132125 ns 5146979 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5279312.5 ns 5286395.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5291750.5 ns 5251625 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4941792 ns 4982541.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 250059.5 ns 258236.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10397667 ns 10236958 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1214863 ns 1218755 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28458.5 ns 28125 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28292 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23939 ns 24960 ns 0.96
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 446500 ns 430583 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 211713 ns 212483 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66708 ns 66375 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67542 ns 66542 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66500 ns 67000 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66417 ns 66584 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 332916 ns 344216.5 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 2719875 ns 2732875 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 656417.5 ns 652061 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86042 ns 84500 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 131334 ns 93000 ns 1.41
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85417 ns 85541 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80250 ns 81042 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192392 ns 190669 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2062187.5 ns 2029208 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203152 ns 183273 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2025521 ns 2023313 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1850750 ns 2010958 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2019000 ns 1979291.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1835792 ns 1995645.5 ns 0.92
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 505719.5 ns 520209.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9377312.5 ns 9143521 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086637 ns 1082408 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
