Stabilize MulAddMul strategically #52439

Merged: 11 commits into master from dk/muladdmul, May 8, 2024
Conversation

dkarrasch
Member

Manual rebase of #47206. Closes #47206.

Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>

@KristofferC
Sponsor Member

Bump, what is the status here?

@dkarrasch
Member Author

This is internal, but breaking. It affects a function that we encouraged data-storage packages such as SparseArrays.jl, GPUArrays.jl, etc. to overload. If we want to do this, it requires coordinated action. I thought that would be best done right after the v1.11 cut, since I didn't have time to follow up with all the packages before the branching.

@dkarrasch dkarrasch force-pushed the dk/muladdmul branch 3 times, most recently from 9ebeeb3 to 21e3f79 on February 27, 2024 11:01
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
@dkarrasch
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@dkarrasch
Member Author

Awesome, this seems to speed up multiplication of small matrices significantly.
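(As a concrete check of this path, a micro-benchmark along the following lines could be used. This is an illustrative sketch, not part of the nanosoldier run; BigFloat is chosen because it avoids BLAS, so mul! goes through the generic kernels touched by this PR.)

using LinearAlgebra, BenchmarkTools

A = randn(BigFloat, 5, 5); B = randn(BigFloat, 5, 5); C = similar(A);

# times the generic small-matrix path with non-trivial alpha/beta
@btime mul!($C, $A, $B, 3.7, 2.8);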

@dkarrasch
Member Author

@nanosoldier runtests()

@dkarrasch
Member Author

Let's see what the pkgeval run says, but I think we could start thinking collectively about whether we want to do this, and about what needs to be taken care of to finish it.

The clear upside is that for packages that have their own 5-arg mul! implementations like the packages that have already received PRs (except for GPUArrays.jl), there is no MulAddMul in the pathway anymore, so no issues with dynamic calls to generic_mat[vec/mat]mul! due to type instability of the _add::MulAddMul argument. Those pathways should be somewhat lighter on the compiler, because alpha and beta are just pushed forward.
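(For illustration, a package-side 5-arg mul! of the kind meant here looks roughly as follows. MyMatrix and its kernel are hypothetical; this is a sketch rather than any package's actual code.)

using LinearAlgebra

struct MyMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.size(A::MyMatrix) = size(A.data)
Base.getindex(A::MyMatrix, i::Int, j::Int) = A.data[i, j]

# The public 5-arg mul! receives alpha and beta directly, so no MulAddMul is
# constructed anywhere on this pathway; the scalars are just pushed forward.
function LinearAlgebra.mul!(C::MyMatrix, A::MyMatrix, B::MyMatrix,
                            alpha::Number, beta::Number)
    mul!(C.data, A.data, B.data, alpha, beta)  # delegate to the wrapped storage
    return C
end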

The downside is that those pathways that run through MulAddMul constructors (including the *diagonal ones and the most generic _generic_matmatmul!) need to compile four different multiplication kernel methods. So, initially, the load on the compiler is increased, for the benefit of avoiding dynamic calls and the allocation of MulAddMul objects.
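(To make the trade-off concrete, here is a simplified sketch, not the actual Base code, of why the wrapper is value-dependent and of the branching that recovers type stability at the cost of the four specializations.)

# Sketch of the MulAddMul pattern: isone(alpha)/iszero(beta) are encoded as
# type parameters, so the concrete type depends on runtime *values*.
struct MulAddMulSketch{ais1,bis0,TA,TB}
    alpha::TA
    beta::TB
end

# Value-dependent construction: the return type is unknown to inference, so
# every downstream call taking the wrapper turns into a dynamic dispatch.
unstable(alpha, beta) =
    MulAddMulSketch{isone(alpha),iszero(beta),typeof(alpha),typeof(beta)}(alpha, beta)

# Hand-written version of the branching strategy: inside each branch the
# wrapper's type is a compile-time constant, restoring type stability at the
# cost of compiling the kernel f up to four times.
function stable_call(f, alpha, beta)
    if isone(alpha)
        iszero(beta) ?
            f(MulAddMulSketch{true,true,typeof(alpha),typeof(beta)}(alpha, beta)) :
            f(MulAddMulSketch{true,false,typeof(alpha),typeof(beta)}(alpha, beta))
    else
        iszero(beta) ?
            f(MulAddMulSketch{false,true,typeof(alpha),typeof(beta)}(alpha, beta)) :
            f(MulAddMulSketch{false,false,typeof(alpha),typeof(beta)}(alpha, beta))
    end
end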

I'd like to call for some support here, especially with running some benchmarks of your favorite workload and reporting issues.

@maleadt On the above PRs, the nightly runs seem to fail (independently from my changes?). Is there a way to test this PR here together with the PRs from GPUArrays.jl and CUDA/oneAPI/Metal respectively?

@chriselrod Would you be able to run some of your compilation benchmarks? I'm afraid this will undo some of the speed-ups we achieved in v1.10, but let's see the numbers.

If we agree to include this, I believe it would be good to have it early in the cycle, so that chances are high to catch packages that rely on these internal functions.

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@maleadt
Member

maleadt commented Mar 1, 2024

On the above PRs, the nightly runs seem to fail (independently from my changes?). Is there a way to test this PR here together with the PRs from GPUArrays.jl and CUDA/oneAPI/Metal respectively?

No, there isn't.

@dkarrasch
Member Author

@nanosoldier runtests(["FloatTracker", "EulerAngles", "NumericalAlgorithms", "ConvexHulls2d", "SubSIt", "CompressedSparseBlocks", "MLKernels", "NMF", "VibrationGEPHelpers", "JSOSolvers", "ExtendableSparse", "EcologicalNetworksPlots", "CrystalNets", "OptimalPortfolios", "GLPK", "PlantRayTracer", "ProximalOperators", "Gtk4", "SkyDomes", "BasisMatrices", "TaylorIntegration", "TransitionsInTimeseries", "MixedModels", "ClimaCoreSpectra", "ClimaCorePlots", "LowLevelParticleFilters", "MultiStateSystems", "ManifoldDiffEq", "ConScape", "BLASBenchmarksCPU", "LowRankIntegrators", "SpiDy", "AiidaDFTK", "Petri", "StirredReactor", "BatchReactor", "ONSAS", "ChargeTransport", "SMLMFrameConnection", "MathepiaModels", "NamedTrajectories", "BloqadeGates", "Population"])

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Member Author

@nanosoldier runtests(["SubSIt", "ReduceWindows", "CompressedSparseBlocks", "EasyCurl", "IterativeSolvers", "LinearMaps", "VibrationGEPHelpers", "PreallocationTools", "ParameterizedQuantumControl", "CrystalNets", "MRICoilSensitivities", "CDDLib", "TaylorIntegration", "BasisMatrices", "ReservoirComputing", "MixedModels", "StructuralEquationModels", "LongwaveModePropagator", "HierarchicalGaussianFiltering", "LowRankIntegrators", "MimiRFFSPs", "Plots", "Knockoffs", "ConceptualClimateModels", "AstrodynamicalModels", "MinimallyDisruptiveCurves", "Biofilm", "SMLMSim"])

Most failing tests were due to time-outs. From what I've seen, there are no failures due to method errors or the like. One possible source of the long runtimes is that SparseArrays.jl is run without adaptation, so multiplication goes through the most generic kernel rather than the sparse kernels.

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@dkarrasch
Member Author

A few packages fail with a signal in the most generic multiplication kernel, in the @simd loop that applies muladd. It could be that this is related to sparse arrays, which are not handled by the current state of SparseArrays.jl. Failures, however, are limited to a handful of packages, so if we decide to go with this, we could first merge the SparseArrays.jl PR, then bump the stdlib, then update this branch and rerun pkgeval.

@dkarrasch
Member Author

dkarrasch commented Apr 30, 2024

The following compile-time benchmark from @chriselrod

using ForwardDiff, LinearAlgebra

# wrap a scalar in a ForwardDiff.Dual with n random partials
d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

# elementwise-wrap A in (optionally nested) Duals to vary the element type
function dualify(A, n, j)
  if n > 0
    A = d.(A, n)
    if (j > 0)
      A = d.(A, j)
    end
  end
  A
end

# compile the seven mul! variants for each of many distinct eltypes
@time for n = 0:8, j = (n!=0):4
  A = dualify.(randn(5,5), n, j);
  B = dualify.(randn(5,5), n, j);
  C = similar(A);
  mul!(C, A, B);
  mul!(C, A', B);
  mul!(C, A, B');
  mul!(C, A', B');
  mul!(C, transpose(A), B);
  mul!(C, A, transpose(B));
  mul!(C, transpose(A), transpose(B));
end

doesn't show a dramatic increase, something like 10%. Testing with this:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0-DEV.394 (2024-04-22)
 _/ |\__'_|_|_|\__'_|  |  dk/muladdmul/0562602986* (fork: 6 commits, 7 days)
|__/                   |

julia> using LinearAlgebra

julia> A = randn(BigFloat, 5, 5);

julia> B = randn(BigFloat, 5, 5);

julia> @time C = A * B;
  0.358988 seconds (988.50 k allocations: 49.872 MiB, 31.52% gc time, 99.89% compilation time) # PR
  0.188448 seconds (671.25 k allocations: 32.915 MiB, 99.79% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, true, false);
  0.169249 seconds (6.58 k allocations: 306.602 KiB, 99.96% compilation time: 100% of which was recompilation) # PR
  0.036884 seconds (1.60 k allocations: 76.930 KiB, 99.78% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, true, true);
  0.000080 seconds (600 allocations: 30.469 KiB) # PR
  0.115821 seconds (252.68 k allocations: 12.372 MiB, 13.96% gc time, 99.80% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, false);
  0.313045 seconds (804.67 k allocations: 40.479 MiB, 5.31% gc time, 99.97% compilation time) # PR
  0.139149 seconds (496.09 k allocations: 24.297 MiB, 99.82% compilation time) # v1.11-beta1

julia> @time mul!(C, A, B, 3.7, 2.8);
  0.384808 seconds (804.58 k allocations: 40.463 MiB, 24.57% gc time, 99.97% compilation time) # PR
  0.142509 seconds (504.06 k allocations: 24.781 MiB, 9.86% gc time, 99.81% compilation time) # v1.11-beta1

however, surprises me a bit. Why does it need to recompile mul!(C, A, B, true, false)? Other than that, the increase in compile times is less than the factor of 4 one would naively expect from how the macro is set up. OTOH, this may be outweighed by the fact that external packages no longer need to compile through MulAddMul, only w.r.t. the types of the alpha and beta arguments. Plus: small-matrix multiplication is sped up dramatically!

I'd suggest leaving this open for a few more days to wait for comments and then, absent objections, merging it.

@dkarrasch
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@dkarrasch
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":release-v1.11")

@amilsted
Contributor

@dkarrasch thanks for picking this up! Did you consider keeping an inference barrier at the call to generic_matmatmul to avoid compile-time overhead in BLAS or small-matrix cases, at the expense of slightly slowing down the generic_matmatmul case due to always-on runtime dispatch? The last version of my PR made that choice - I assumed generic_matmatmul is typically a slow path anyway. Not sure if that's right in all cases though!
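(As a sketch of the barrier being described, written against the pre-PR internals; MulAddMul and generic_matmatmul! are non-public, and their exact signatures vary across Julia versions.)

using LinearAlgebra: MulAddMul, generic_matmatmul!

# Constructing MulAddMul from runtime alpha/beta is type-unstable, so the call
# below is a single dynamic dispatch: an inference barrier that keeps callers
# from specializing on the four wrapper types, confined to the already-slow
# generic path.
function barrier_mul!(C, A, B, alpha, beta)
    _add = MulAddMul(alpha, beta)                 # concrete type known only at runtime
    generic_matmatmul!(C, 'N', 'N', A, B, _add)   # runtime dispatch happens here
    return C
end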

@dkarrasch
Member Author

@nanosoldier runbenchmarks("linalg", vs = ":release-v1.10")

@dkarrasch
Member Author

The last version of my PR

Could you please point me to that commit?

@nanosoldier runbenchmarks("linalg", vs = ":release-1.11")

@nanosoldier
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@amilsted
Contributor

amilsted commented Apr 30, 2024

Could you please point me to that commit?

It's in this commit. It just means not using the macro for the generic_matmatmul call (see comment).

I decided not to do it for the vector case (see here).

@dkarrasch
Member Author

I'll try that out. Thanks!

@dkarrasch
Member Author

dkarrasch commented May 1, 2024

Ok, I included that suggestion, but kept the macro call at the most generic version (for now). I believe this should not affect the usual small-matrix or BLAS paths. I'll double-check and then re-evaluate.

ADDENDUM: I made the inference barrier complete. That generic method may get called even by BLAS-type matrices and hence gets compiled. So, for generic or mixed eltypes this PR doesn't change anything about the runtime dispatch.

@dkarrasch
Member Author

All green, let's go!

@dkarrasch dkarrasch merged commit 29ced9e into master May 8, 2024
7 checks passed
@dkarrasch dkarrasch deleted the dk/muladdmul branch May 8, 2024 07:57
xlxs4 pushed a commit to xlxs4/julia that referenced this pull request May 9, 2024
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Jul 12, 2024
Co-authored-by: Ashley Milsted <ashmilsted@gmail.com>
Labels: linear algebra, performance