refactor: move all subpackages into a mono-repo #1002

Merged on Nov 3, 2024 (1,196 commits)

@avik-pal (Member) commented Nov 3, 2024

  • CI
    • LuxTestUtils
    • LuxCUDA
    • LuxLib
    • LuxCore
    • MLDataDevices
    • WeightInitializers
    • Lux
  • Install LuxCUDA from the local path
  • Open a PR in the General registry to fix links
    • Check all versions are available here
  • Benchmarks need to be unified
  • Documentation CI
  • Add all build badges to the Lux README
  • CompatHelper
  • Move all issues to this repo

@github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite    Current: 910fb3a    Previous: 699c8d8    Ratio (current / previous)
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 410375 ns 411375 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 321625 ns 323084 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322750 ns 241750 ns 1.34
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739396 ns 742584 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43662 ns 43670.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 615625.5 ns 638083 ns 0.96
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 490583 ns 521459 ns 0.94
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 469667 ns 403792 ns 1.16
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 949916 ns 908000 ns 1.05
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 195838 ns 188991 ns 1.04
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 751625 ns 744083.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 637916 ns 624667 ns 1.02
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 609500 ns 521562.5 ns 1.17
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 988166.5 ns 1006750 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1618750 ns 1618667 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1150500 ns 1189854.5 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1351666.5 ns 1358375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2446750 ns 2360458 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212964 ns 211422.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12208375 ns 12284958.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9532354 ns 9550979.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9263895.5 ns 9390791 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18000916.5 ns 18060041.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1919703 ns 1906624.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17325667 ns 17280916 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14342687.5 ns 14329167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14309292 ns 14463083 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21140646 ns 21088375 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 122324917 ns 121038500 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174050729.5 ns 174268209 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147981416 ns 155647417 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 103728375 ns 103289458 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5486859 ns 5459016 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 593769375 ns 592681937.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 540254041 ns 540116125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 443330833 ns 460022146 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 621468875 ns 623412250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38148484 ns 38146652 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 712530979.5 ns 751859749.5 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 672059166 ns 667614542 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 605679875 ns 606980437.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 736203250 ns 744028250 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 867750 ns 861145.5 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 828521 ns 826334 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1221687.5 ns 1164604.5 ns 1.05
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 948646 ns 959395.5 ns 0.99
lenet(28, 28, 1, 32)/forward/GPU/CUDA 267646 ns 263975.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2781563 ns 2730708 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2455416 ns 2455708.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3324667 ns 3317604.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3352208.5 ns 3286521.5 ns 1.02
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1067634.5 ns 1038213 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6746187.5 ns 6779291.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6394041.5 ns 6365500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6493500 ns 6531583 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7521520.5 ns 7635875 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212213 ns 210025 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24032458 ns 24055375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21271709 ns 21237625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21344875 ns 21535792 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29716875 ns 29721771 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1977016 ns 1973993 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37416250 ns 37426416 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45549167 ns 34385895.5 ns 1.32
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45656917 ns 45888792 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37866583.5 ns 49367041.5 ns 0.77
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13369437.5 ns 13355875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12433895.5 ns 12430958.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12479520.5 ns 12600937.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15092812.5 ns 15122729 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 513752 ns 518849 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47146959 ns 47134500 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41624042 ns 41671875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40662062.5 ns 41125499.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 57955000.5 ns 58336333 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2971047 ns 3218047 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74388062.5 ns 74376750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91684083 ns 68965000 ns 1.33
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91334209 ns 91496292 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 75923583 ns 98399104 ns 0.77
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287784708 ns 286107083.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340294999.5 ns 339607208 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313237875 ns 321183396 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 268896291 ns 268796333 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7099084 ns 7107764 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 974563458 ns 971792250 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 922104479.5 ns 922480542 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 823100917 ns 835684104 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1124136437.5 ns 1117474583 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33817317 ns 33742759 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1426228458.5 ns 1448964667 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1732184792 ns 1371326875 ns 1.26
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1589204583 ns 1656412041 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1296915520.5 ns 1663889000 ns 0.78
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1513833 ns 1528208 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1256250.5 ns 1277937.5 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1617250 ns 1635937.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2100271 ns 2136917 ns 0.98
lenet(28, 28, 1, 128)/forward/GPU/CUDA 267408.5 ns 277390.5 ns 0.96
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7855792 ns 7872250 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6609083 ns 6588000 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7129958 ns 7229396.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10427583.5 ns 10478041 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1101772 ns 1130644 ns 0.97
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 177559792 ns 177405459 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 134698625 ns 132546709 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 116750812.5 ns 130053917 ns 0.90
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 166832000 ns 165568083 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4841379 ns 4878153.5 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 638544042 ns 643663333 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 609109167 ns 496969000 ns 1.23
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 550183416 ns 558568375 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 658007084 ns 654929750 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16323115 ns 18110009 ns 0.90
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1077687.5 ns 1068292 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 971791.5 ns 983291 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1355916.5 ns 1327542 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1365791 ns 1373792 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 269969 ns 281111 ns 0.96
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5959604.5 ns 6002271 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4641791.5 ns 4660958.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4996479.5 ns 5006354 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5667042 ns 5624708 ns 1.01
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1130354 ns 1151478.5 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23702250 ns 23602937.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35287917 ns 34462041.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37102750 ns 41206708 ns 0.90
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34955125 ns 34998812.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1830165.5 ns 1861561 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184351146 ns 184955020.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 160101312.5 ns 159249771 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 145775125 ns 150499917 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 391466917 ns 390550250 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16490669 ns 16472871 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 283855583 ns 286689500 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 244916416 ns 244388646 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 287580791 ns 296120917 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 442181417 ns 440533417 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 621682291 ns 624998521 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 475715042 ns 477642917 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 377666417 ns 411867812.5 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 655324521.5 ns 656030104 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12470870 ns 12477905 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1807135667 ns 1873735437.5 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1613894500 ns 1636021583 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1552828834 ns 1558895000 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2103353020.5 ns 2103890062.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49667614 ns 49609571 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3035395.5 ns 3064313 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2088333 ns 2106875 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2298562.5 ns 2301542 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4529187.5 ns 4944708.5 ns 0.92
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585054.5 ns 586671 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25522562 ns 25694166 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19793583.5 ns 20092625.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19132291.5 ns 19545895.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36529125 ns 36568812 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2984903 ns 3200820 ns 0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35045167 ns 35138250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28386333.5 ns 28420084 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29775375 ns 30280062.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 43881334 ns 42544854.5 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1642291 ns 1650167 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1185042 ns 1195708 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1384250 ns 1388458 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2486791.5 ns 2498125 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215264.5 ns 218867 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12687125 ns 12700771 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9979292 ns 9962124.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9688917 ns 9800459 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18438270.5 ns 18403354 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1943963.5 ns 1957280 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17711875 ns 17702708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14732146 ns 14737000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14565417 ns 14865041 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21440250 ns 21477333.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23368375 ns 23644021 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34475875 ns 34568146 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37242333 ns 41693959 ns 0.89
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34907208 ns 34878583 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1836858.5 ns 1840287 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189818125 ns 188357375 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 232958916 ns 233488333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 194219375 ns 202742250 ns 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 433843687.5 ns 429823895.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13937728 ns 13939550 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 290295000 ns 291377187.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250001292 ns 249397167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 292767062.5 ns 300701042 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 447731583 ns 446062833 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3385312.5 ns 3387083 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3107292 ns 3112854 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3020041 ns 2905708 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4097104 ns 3940000 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 584212 ns 570283 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7637750 ns 7636021 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7447916.5 ns 7442000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7451875 ns 7380521 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8206917 ns 8212750 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1433678 ns 1364212 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18783083 ns 13685833.5 ns 1.37
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19125417 ns 19094334 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19126167 ns 19126041 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 10729541 ns 15649500.5 ns 0.69
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 68312.5 ns 69459 ns 0.98
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 68166.5 ns 69875 ns 0.98
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 71166 ns 72083 ns 0.99
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 70167 ns 68812.5 ns 1.02
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 48867 ns 47850 ns 1.02
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 343583 ns 318833.5 ns 1.08
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 330271 ns 285875.5 ns 1.16
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 339812.5 ns 326000 ns 1.04
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 290000 ns 319625 ns 0.91
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 219544 ns 210144 ns 1.04
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 400062 ns 447500 ns 0.89
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 406250 ns 437791 ns 0.93
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 448000 ns 413375 ns 1.08
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 364250 ns 328959 ns 1.11
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3037417 ns 3055292 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2086458 ns 2092833 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2275542 ns 2283687.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4814625 ns 4895416.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585443.5 ns 585359 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23616479.5 ns 23561833 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18027146 ns 18085229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16961541.5 ns 18562458 ns 0.91
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35005958 ns 35017833 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2898476 ns 3105298.5 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33365500 ns 33378229 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27654792 ns 27662145.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27370834 ns 27887458 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41989125 ns 41809854.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121294166 ns 120765334 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174334083 ns 174275666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147981625 ns 156098417 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 109331959 ns 103997770.5 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5464056 ns 5461795.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 470670208.5 ns 471697125 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 469290292 ns 468205208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 436302917 ns 455789333 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 735119292 ns 728998166 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35168838 ns 35173763 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 638891333.5 ns 640412562.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 653076271 ns 655505917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 576963916.5 ns 590476187.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 728588750 ns 732032000 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1243209 ns 1249541 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 987854 ns 949958.5 ns 1.04
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 975229.5 ns 764125 ns 1.28
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2064062.5 ns 2000458 ns 1.03
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 570575.5 ns 568299.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2951521 ns 2960792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2624959 ns 2611021 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2613709 ns 2513020.5 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3698958.5 ns 3690271 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1330708 ns 1319857 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6648250 ns 6641791 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6511917 ns 6504791 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6505854 ns 6489375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4458562.5 ns 4443166 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 103271 ns 104249.5 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 101895.5 ns 105166 ns 0.97
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 104645.5 ns 105250 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 102875 ns 105625 ns 0.97
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28169 ns 28456 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 237000 ns 236750 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 237292 ns 236541 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 236875 ns 237667 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 250083 ns 249625 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 219764 ns 217310.5 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 742333 ns 330167 ns 2.25
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 742416 ns 742062.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 742584 ns 748209 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 322584 ns 721792 ns 0.45
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13020.5 ns 13583 ns 0.96
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 12417 ns 14250 ns 0.87
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14167 ns 14354 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 13042 ns 13791 ns 0.95
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28587 ns 28098 ns 1.02
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25917 ns 25333.5 ns 1.02
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 26000 ns 25750 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 26167 ns 25667 ns 1.02
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 25625 ns 25750 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 210517 ns 206637.5 ns 1.02
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 45833.5 ns 45583.5 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 46000 ns 45875 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 46334 ns 46000 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 27938 ns 28209 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 306901708.5 ns 309099062.5 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 235674896 ns 232469666.5 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 188123167 ns 216377833 ns 0.87
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 310304750 ns 308762583 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7674455 ns 7672114 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1094591792 ns 1103432604 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 992766021 ns 1001458208 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 897397000 ns 901919771 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1303823270.5 ns 1293921625 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27217919 ns 27115979 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 416417 ns 414208.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 413667 ns 415583 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 417771 ns 416958 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 416833 ns 418375.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47774 ns 48086 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1457750 ns 1344667 ns 1.08
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1344208 ns 1315687 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1353833 ns 1294125 ns 1.05
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 1722937.5 ns 1745083.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 225041 ns 221906 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 3506083 ns 1836104.5 ns 1.91
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 3470604 ns 3473770.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 3472167 ns 3450771 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 1988875 ns 3660083 ns 0.54
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1498041 ns 1396583.5 ns 1.07
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1080334 ns 1097333 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1146834 ns 939062.5 ns 1.22
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2142125 ns 2231792 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 576012 ns 574483.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3068416 ns 2873417 ns 1.07
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2740584 ns 2715208 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2734667 ns 2626645.5 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3799166 ns 3813542 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1356849 ns 1401203 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8839542 ns 8821895.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8778250 ns 8770604 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8790334 ns 8763666.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6362666.5 ns 6350229.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2645.5 ns 2250 ns 1.18
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2375 ns 2583 ns 0.92
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3375 ns 3333 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2208 ns 2583 ns 0.85
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24665 ns 24886 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7291 ns 7292 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7334 ns 7042 ns 1.04
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7000 ns 7375 ns 0.95
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7083 ns 6959 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 188559 ns 184871.5 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8500 ns 8479.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8417 ns 8667 ns 0.97
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8542 ns 8625 ns 0.99
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5792 ns 6000 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 13708 ns 13291 ns 1.03
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 13292 ns 13750 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 14000 ns 14521 ns 0.96
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 13334 ns 13458 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24861 ns 25102 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 29083 ns 29250 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 28833 ns 28959 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 28895.5 ns 29167 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 29125 ns 29208.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 197487 ns 194866.5 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 93000 ns 43333 ns 2.15
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 92625 ns 94750 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 93125 ns 93687.5 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 41042 ns 90834 ns 0.45
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28500 ns 27916 ns 1.02
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28125 ns 28500 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28416 ns 27166 ns 1.05
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 45958 ns 46166 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26150.5 ns 26285 ns 0.99
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 44208 ns 44541 ns 0.99
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 44292 ns 44250 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 45459 ns 44666 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 63708 ns 63625 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 169236 ns 167275 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 68333 ns 68458 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 68209 ns 68125 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 69125 ns 68708 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 71292 ns 68208 ns 1.05
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 2041 ns 1834 ns 1.11
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1625 ns 2042 ns 0.80
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2084 ns 2250 ns 0.93
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1791 ns 1958 ns 0.91
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22711 ns 23492 ns 0.97
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5333 ns 5416 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5291.5 ns 5333 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5291 ns 5375 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5000 ns 5291.5 ns 0.94
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 173262 ns 171557 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 8417 ns 8312.5 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 8187.5 ns 8250 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8208 ns 8208 ns 1.00
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5542 ns 5667 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 105871250 ns 106272125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116494916.5 ns 117220895.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 119452666 ns 123891541 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117536208 ns 117462292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2657394 ns 2638590.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 389086750 ns 390984854 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 365207375 ns 370181584 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 354574916 ns 344393625 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 482493375 ns 481330584 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15173832 ns 15192721.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 612900125 ns 619409458 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 843986250 ns 668415479 ns 1.26
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 805484791.5 ns 816519375 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 734781250 ns 916595917 ns 0.80

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit 0f874c5 into main Nov 3, 2024
95 of 143 checks passed
@avik-pal avik-pal deleted the ap/mono_repo branch November 3, 2024 18:50
avik-pal added a commit that referenced this pull request Nov 3, 2024
* perf: cleanup the benchmarking script

* perf: add benchmarks for Zygote

* perf: try reclaiming memory

* fix: incorrect system parameters

* perf: temporarily disable non-dense benchmarks

[skip tests]

* ci(benchmark): allow proceed on failure

[skip tests]

* perf: update polyalg selection for matmul and matmuladd

* test: ensure no additional allocations for matmul

* fix: typo in AMDGPU batched matmul

* perf: restore running all benchmarks

* docs: add link to benchmarks

* ci: fix benchmarks config

* test: run allocs test only on CPU

* fix: mixed-precision use Octavian if possible

* feat: add traits to fuse activation functions

[skip ci]

* perf: selective vectorization of operations bias_add/activation

* perf: fused bias activation for certain operations

* perf: optimize batchnorm implementation

* perf: don't fuse tanh

* perf: run specific benchmarks

* perf: be conservative while fusing activation functions

* refactor: qualify CPU functions with `_cpu`

* perf: restore running all benchmarks

* fix(tracker): expand custom Tracker AD for wrapper types

* fix: subtyping correction

* test: ignore tests for batched_vec (not our code)

* perf: faster version of groupnorm

* ci: run downstream testing only on pull requests

* refactor: remove unnecessary forced inlining

* refactor: move PartialFunctions into a module

* refactor: move utilities into Utils

* refactor: move device agnostic functions to `DeviceAgnostic`

* test: separate out the testing project file

* refactor: move internal functions into separate modules

* test: separate out the testing project file

* fix: incorrect internal calls

* refactor: remove unnecessary turbo loop

* perf: don't rely on compile time branch removal for KA

* perf: static ndrange kernel launches

* perf: let it autotune

* refactor: use multiple dispatch for cleaner kernels

* refactor: disable cpu codegen for kernels

* fix: nicer information for fallback mixed-precision matmul

* fix: allow zero-sized arrays in bias_activation

* fix: don't restrict bias_act to number

* fix: don't restrict traits/ext/utils to number

* fix: more aggressive type specialization

* chore: update version

* fix: broken qa tests

* fix: use `fmap_with_path` to correctly identify all internal states

* chore: apply formatting suggestion

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix: don't error on detecting arrays with undefined entries

* refactor: move ChainRulesCore into an extension

* fix: skip enzyme tests if it is a pre-release

* chore: bump version for release

* fix: decide internal operation based on unwrapped arrays

* fix: avoid wrappers for SVector using `insert_batch_dim`

* fix: enzyme forward mode with octavian

* feat: swap Enzyme forward rules along with reverse

* test: simple enzyme forward test to check no crash

* chore: bump crate-ci/typos from 1.23.6 to 1.24.1

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.23.6 to 1.24.1.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.23.6...v1.24.1)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump crate-ci/typos from 1.23.6 to 1.24.1

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.23.6 to 1.24.1.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.23.6...v1.24.1)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* feat: add `unsafe_free!`

* feat: add DeviceIterator (and support parallel Device DataLoader)

* test: basic tests for free-ing data

* refactor: simplify parallel dataloader

* test: DataLoader aggressive freeing

* docs: add docstrings for `DeviceIterator`

* refactor: deprecate "Explicit" in favor of "Lux"

* chore: add deprecation for the single arg outputsize

* fix: remove old uses of Explicit

* fix!: remove deprecations

* chore: add exports for abstract layers

* refactor: move Functors and Setfield into ext

* fix!: remove hacky version of outputsize

* feat: add `AbstractLuxWrapperLayer`

* refactor: cleanup extension usage

* test: update test to new API

* test: extension loading errors

* feat: support functors for WrappedLayer

* test: LuxWrappedLayer tested

* test: don't qualify unnecessarily

* refactor: cleanup internal functions

* fix!: remove default slow handling of outputsize

* fix: update removed API

* test: update old tests

* fix!: remove unused `inputsize`

* fix: add fmap_with_path support

* chore: fix formatting

* feat: default call for wrapper layers

* fix: remove hacky usage of module getproperty rrules

* fix: accidental dual usage of `ofeltype_array`

* feat: auto-training mode and strict checks

* chore: bump compat for LuxCore to 1, (keep existing compat) (#147)

Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>

* feat: extend the layernorm API

* test: more detailed layernorm testing

* chore: bump version for release

* fix!: remove deprecations for 1.0 release

* chore!: remove Reexport of NNlib (will be done via Lux)

* perf: add NNlib to benchmarks deps

* fix: remove unused explicit imports

* chore: update to using LuxCore@1.0

* fix!: remove dropout branching based on size

* fix!: change the default layernorm dims

* chore: bump crate-ci/typos from 1.24.1 to 1.24.3

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.1 to 1.24.3.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.1...v1.24.3)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump crate-ci/typos from 1.24.1 to 1.24.3

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.1 to 1.24.3.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.1...v1.24.3)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* feat: add enzyme reverse rules for `fused_dense!`

* test: add tests for the enzyme fused_dense rules

* fix: typo in reverse rule

* test: run tests with more activations

* feat: instancenorm with running statistics

* fix: fixes for testing

* fix: modify the dropout testing

* fix: windows testing for dropout

* chore(deps): bump crate-ci/typos from 1.24.3 to 1.24.5

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.3...v1.24.5)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump peter-evans/create-pull-request from 6 to 7

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v7)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: bump peter-evans/create-pull-request from 6 to 7

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v7)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: bump crate-ci/typos from 1.24.3 to 1.24.5

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.3...v1.24.5)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: bump peter-evans/create-pull-request from 6 to 7 (#19)

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v7)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* test: add tests comparing the fused op with unfused op

* fix: improve load times by moving CRC to ext

* fix: remove UnrolledUtilities dep

* fix: remove UnrolledUtilities dep

* chore: bump minimum MLDataDevices version

* fix: dropout tests are no longer broken

* chore: accidentally left deprecations file

* fix: missing enzyme rules for matmuladd! (CUDA support)

* test: incorrect condition

* test: incorrect function name

* fix: zero out shadows

* fix: enzyme reverse bias needs a check on Const

* chore: bump crate-ci/typos from 1.24.5 to 1.24.6

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.5...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* feat: better test integration in test_gradients

* feat: add test_gradients macro

* chore: apply formatting suggestion

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix: update to use test_gradients macro

* fix: bias needs to add accum gradients

* chore: bump `EnzymeCore` version

* CompatHelper: bump compat for EnzymeCore in [weakdeps] to 0.8, (keep existing compat)

* chore: bump version for release

---------

Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>
Co-authored-by: Avik Pal <avikpal@mit.edu>

* chore: install latest enzyme version

* chore: update Enzyme version

* chore: bump minimum versions

* ci: update buildkite settings

* feat: wider support for batched_matmul

* perf: benchmark fallback batched_matmul

* feat: slow fallback conv impl

* feat: parallel fallback batchedmm

* ci(buildkite): add GPU testing for Metal and oneAPI

* test: check for FP64 support

* fix: convert element type before broadcasting

* fix: dispatch for NNlib conv

* ci(buildkite): disable testing for Metal and oneAPI

* chore: bump version

* feat: update minimum version of Enzyme to 0.13

* feat: support within_gradient for Enzyme

* refactor: rename within_gradient to within_autodiff

* fix: update forward rules to new API

* fix: use known on the return type

* fix: forward enzyme rules

* fix: broken enzyme tests

* feat: support runtime activity for enzyme

* fix: check was accidentally broken

* chore(deps): bump crate-ci/typos from 1.24.5 to 1.24.6

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.5...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: bump crate-ci/typos from 1.24.3 to 1.24.6

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.6.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.3...v1.24.6)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: rollback custom gelu implementation

* feat: XLADevice via Reactant

* chore: apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* chore: bump version

* feat: more extensive testing of XLA backend

* fix: incorrect function call

* test: rename

* test: incorrect env var

* fix: copy to XLA in main thread

* fix: don't support pre-moving the data

* fix: urgent patch for reactant breakage

* chore: bump crate-ci/typos from 1.24.6 to 1.25.0

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.6 to 1.25.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.6...v1.25.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): bump crate-ci/typos from 1.24.6 to 1.25.0 (#41)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.6 to 1.25.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.6...v1.25.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump crate-ci/typos from 1.24.6 to 1.26.0

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.6 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.6...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* ci: run on `1.10` and `1` (#57)

* ci: run on `1.10` and `1`

* ci: run on `1.10` and `1`

* ci: run on `1.10` and `1` (#81)

* ci: run on 1.10 and 1

* ci: run on `1.10` and `1`

* ci: run on `1.10` and `1`

* ci: run on `1.10` and `1` (#43)

* ci: run on `1.10` and `1`

* ci: run on `1.10` and `1`

* test: mark truncated normal on Metal as unbroken

* ci: run buildkite on `1.10` and `1`

* chore: bump peter-evans/create-pull-request from 6 to 7 (#40)

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v7)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* ci: run tests only on `1.10` for now (#172)

* fix: relax cublaslt types (#173)

* docs: add Flux.jl to the README (#83)

After FluxML/Flux.jl#2492 also Flux relies on MLDataDevices.

* chore: bump crate-ci/typos from 1.25.0 to 1.26.0 (#58)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump crate-ci/typos from 1.25.0 to 1.26.0 (#44)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump crate-ci/typos from 1.25.0 to 1.26.0 (#174)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump compat for GPUArrays in [weakdeps] to 11, (keep existing compat) (#86)

Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>

* chore: bump version for release

* chore: bump compat for GPUArrays in [weakdeps] to 11, (keep existing compat) (#46)

Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>

* chore: bump compat for GPUArraysCore to 0.2, (keep existing compat) (#47)

Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>
Co-authored-by: Avik Pal <avikpal@mit.edu>

* chore: bump version for release

* feat: add fallbacks for unknown objects (#87)

* feat: add fallbacks for unknown objects

* feat: handle RNGs and undef arrays gracefully

* test: RNG movement

* test: functions and closures

* refactor: move `JuliaSIMD` deps to extensions (#175)

* fix: remove LV.vmap! usage

* fix: remove LV handling for bias_activation

* fix: remove LV usage in dropout

* refactor: move LV and octavian behind an extension

* docs: add docs for loading packages

* refactor: move SLEEFPirates to an ext

* fix: enzyme rules for batched matmul

* fix: patch more enzyme issues

* feat: add a preference to disable loop vectorization

* fix: incorrect dispatch called

* fix: enzyme segfault bypass

* feat: define isleaf (#84)

* isleaf

* exclude

* add tests and docs

* more tests

* import functors

* fix test

* chore: reduce min compat

* chore: run formatter

* chore: bump version for release

* fix: handle bitstypes and wrapped arrays in isleaf (#88)

* bitstype and wrapped arrays

* fixes

* fix import

* bound

* cleanup

* chore: fix min version of LinearAlgebra

* chore: run formatter

---------

Co-authored-by: Avik Pal <avik.pal.2017@gmail.com>
Co-authored-by: Avik Pal <avikpal@mit.edu>

* fix: task switching in AMDGPU complex batched_matmul (#178)

* ci(buildkite): add downstream testing for NeuralOperators

* perf: restore old batched_mul

* fix: disable threading for certain devices

* revert: "perf: restore old batched_mul"

This reverts commit a8c0f3b4615f96a8773577e16fac61ba310d8123.

* fix: correctly handle adjoints of wrapped arrays (#90)

* fix: correctly handle adjoints of wrapped arrays

* fix: use fast paths for adapt

* fix: adapt ranges to JuliaGPU/Adapt.jl#86

* chore(deps): bump crate-ci/typos from 1.25.0 to 1.26.8 (#44)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.8.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.25.0...v1.26.8)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump crate-ci/typos from 1.26.0 to 1.26.8 (#49)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.26.0 to 1.26.8.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.26.0...v1.26.8)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump crate-ci/typos from 1.26.0 to 1.26.8 (#60)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.26.0 to 1.26.8.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.26.0...v1.26.8)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: missing import; fixes #179 (#180)

* chore: bump crate-ci/typos from 1.26.0 to 1.26.8 (#93)

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.26.0 to 1.26.8.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.26.0...v1.26.8)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* ci: merge LuxCUDA testing scripts

* ci: merge LuxCore testing scripts

* ci: merge WeightInitializers testing scripts

* ci: add WI to pipeline launch

* ci: add MLDataDevices to pipeline launch

* ci: change 1.10 to "lts"

* test: LuxCore test fixes

* ci: soft fail MLDataDevices

* ci: add a central downstream testing

* ci: partially migrate LuxLib CI

* ci: remove name field

* ci: minor fixes to build scripts

* ci: move LuxTestUtils CI scripts

* ci: update LuxLib workflow

* ci: update LuxLib workflows

* ci: split out downstream testing

* ci: fix certain pipelines

* ci: minor tweaks

* fix: workflows

* test: use local LuxCUDA for tests

* fix: use develop

* docs: update

* fix: add dev packages

* docs: dev required packages

* perf: merge the benchmarks

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CompatHelper Julia <compathelper_noreply@julialang.org>
Co-authored-by: Christopher Rackauckas <accounts@chrisrackauckas.com>
Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>
3 participants