
Add SimpleGMRES implementation #366

Merged: 14 commits merged into SciML:main on Sep 20, 2023
Conversation

@avik-pal (Member) commented Aug 22, 2023

  • Working version of GMRES
  • If BlockDiagonal, check whether we can use a faster algorithm
    • Uniform square case
    • Non-uniform square blocks
  • Fast dispatch for structured block-diagonal matrices
    • Uniform square case
    • Non-uniform square blocks
  • Improve the base GMRES code
    • Performance is really bad compared to Krylov.jl
    • Figure out how to do the inner linear solves in-place
    • Support preconditioners: GMRES is really bad without them
    • FIXMEs in the code
  • Tests
  • Needs testing on GPU
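
For orientation, a minimal sketch of how the new solver is driven through the LinearSolve.jl interface (the keywords blocksize, restart, and maxiters are the ones exercised in the benchmark script further down; the matrix construction here is purely illustrative):

using LinearAlgebra, LinearSolve

bsize, nblocks = 4, 2
blocks = [randn(bsize, bsize) + 5I for _ in 1:nblocks]  # well-conditioned square blocks
A = cat(blocks...; dims = (1, 2))                       # dense block-diagonal matrix
b = randn(size(A, 1))

prob = LinearProblem(A, b)
# `blocksize` opts into the uniform-square-block fast path; omit it for the generic solver.
sol = solve(prob, SimpleGMRES(; blocksize = bsize, restart = 20); maxiters = 1000)
sol.u  # solution vector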

@ai-maintainer (bot) left a comment

AI-Maintainer Review for PR - Add SimpleGMRES implementation

Title and Description 👍

The Title and Description are clear and focused
The title and description of the pull request are clear and focused. They effectively communicate the purpose of the changes, which is to add a SimpleGMRES implementation. The description provides a checklist of tasks that need to be completed, indicating the different aspects of the implementation that are being worked on.

Scope of Changes 👍

The changes are narrowly focused
The changes in this pull request appear to be focused on a single issue, which is the addition of the SimpleGMRES implementation. The author is working on various aspects of this implementation, such as handling different types of matrices, improving performance, and adding tests. All these tasks are related to the main goal of implementing SimpleGMRES.

Testing 👎

Testing details are missing
The description does not mention how the author tested the changes. It would be helpful for the author to include information about the testing methodology, such as the specific test cases used, the platforms or environments tested on, and any relevant test results or observations.

Documentation 👎

Docstrings are missing for several functions, classes, or methods
The following newly added functions, classes, or methods do not have docstrings:
  • LinearSolveBlockDiagonalsExt
  • LinearSolveNNlibExt
  • SimpleGMRES
  • SimpleGMRESCache
  • init_cacheval
  • _init_cacheval
  • default_alias_A
  • default_alias_b
  • SciMLBase.solve! (for SimpleGMRESCache)
  • SciMLBase.solve! (for SimpleGMRES)

These entities should have docstrings added to describe their behavior, arguments, and return values.

Suggested Changes

  • Please add docstrings to the functions, classes, or methods listed above. Each docstring should describe the behavior, arguments, and return values of the function, class, or method.
  • Please provide details about how you tested the changes. This could include the specific test cases you used, the platforms or environments you tested on, and any relevant test results or observations.
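
For concreteness, the kind of docstring being requested could look like the sketch below (the wording and the listed keywords are illustrative, based on the blocksize and restart options used elsewhere in this PR; this is not the merged documentation):

"""
    SimpleGMRES(; restart = 20, blocksize = nothing)

A native Julia implementation of restarted GMRES.

## Keyword Arguments

  - `restart`: number of inner iterations before the Krylov basis is restarted
    (the default shown here is illustrative).
  - `blocksize`: if the operator is block diagonal with uniform square blocks of
    this size, a specialized batched code path is used; leave unset for the
    generic solver.
"""
struct SimpleGMRES  # placeholder fields; the real definition lives in src/simplegmres.jl
    restart::Int
    blocksize::Union{Int, Nothing}
end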

Reviewed with AI Maintainer

@codecov (bot) commented Aug 22, 2023

Codecov Report

Merging #366 (38e3401) into main (60ae26a) will increase coverage by 30.51%.
The diff coverage is 85.08%.

@@             Coverage Diff             @@
##             main     #366       +/-   ##
===========================================
+ Coverage   41.83%   72.35%   +30.51%     
===========================================
  Files          20       23        +3     
  Lines        1446     1787      +341     
===========================================
+ Hits          605     1293      +688     
+ Misses        841      494      -347     
Files Changed                               Coverage            Δ
ext/LinearSolveKernelAbstractionsExt.jl      0.00% <0.00%>      (ø)
ext/LinearSolveBlockDiagonalsExt.jl         81.81% <81.81%>     (ø)
src/simplegmres.jl                          88.25% <88.25%>     (ø)
src/LinearSolve.jl                          93.84% <100.00%>    (-1.40%) ⬇️
src/iterative_wrappers.jl                   72.78% <100.00%>    (+16.41%) ⬆️

... and 14 files with indirect coverage changes


@avik-pal (Member, Author) commented Aug 23, 2023

Simple Benchmark Script

using LinearAlgebra, LinearSolve, SciMLOperators
using BlockDiagonals, NNlib
using BenchmarkTools  # needed for @belapsed below
using Plots

blocksizes_ = [1, 2, 4, 5, 10, 25, 50]
nblocks = fill(100, length(blocksizes_))  # number of blocks per trial (assumed fixed at 100; cf. the "_100" figure below)
simplegmres_timings = Vector{Float64}(undef, length(blocksizes_))
simplegmres_nobsize_timings = Vector{Float64}(undef, length(blocksizes_))
krylovjlgmres_timings = Vector{Float64}(undef, length(blocksizes_))
for (i, (bsize, N)) in enumerate(zip(blocksizes_, nblocks))
    A_ = Array(BlockDiagonal([randn(bsize, bsize) for _ in 1:N]))
    b_ = randn(size(A_, 1))

    lincache_ = init(LinearProblem(A_, b_), SimpleGMRES(; blocksize=bsize, restart = 20);
        maxiters = 1000)
    lincache_krylov_ = init(LinearProblem(A_, b_), KrylovJL_GMRES(; gmres_restart = 20);
        maxiters = 1000)
    lincache_simple_nobsize = init(LinearProblem(A_, b_), SimpleGMRES(; restart = 20);
        maxiters = 1000)

    simplegmres_timings[i] = @belapsed solve!($(deepcopy(lincache_)))
    krylovjlgmres_timings[i] = @belapsed solve!($(deepcopy(lincache_krylov_)))
    simplegmres_nobsize_timings[i] = @belapsed solve!($(deepcopy(lincache_simple_nobsize)))

    @info "Trial $i" bsize=bsize N=N SimpleGMRES=simplegmres_timings[i] KrylovJL_GMRES=krylovjlgmres_timings[i] SimpleGMRES_nobsize=simplegmres_nobsize_timings[i]
end

begin
    plt = plot(blocksizes_, simplegmres_timings, label = "SimpleGMRES", xlabel = "Block Size",
        ylabel = "Time (s)", title = "SimpleGMRES vs KrylovJL_GMRES", legend = :topleft,
        xscale = :log10, yscale = :log10)
    scatter!(plt, blocksizes_, simplegmres_timings; label = "")
    plot!(plt, blocksizes_, krylovjlgmres_timings, label = "KrylovJL_GMRES")
    scatter!(plt, blocksizes_, krylovjlgmres_timings; label = "")
    plot!(plt, blocksizes_, simplegmres_nobsize_timings, label = "SimpleGMRES (no blocksize)")
    scatter!(plt, blocksizes_, simplegmres_nobsize_timings; label = "")
    savefig(plt, "wip/simplegmres_vs_krylovjl_gmres.png")
end

[Figure: simplegmres_vs_krylovjl_gmres.png — solve time vs. block size for SimpleGMRES, SimpleGMRES (no blocksize), and KrylovJL_GMRES]

Takeaways

  • We should exploit the block structure: it significantly reduces the computational cost, and the convergence rate should be governed by the individual blocks rather than the full system (see the sketch below).
  • The generic SimpleGMRES code is still painfully slow compared to KrylovJL_GMRES.
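
A quick way to see where the structural win comes from (a standalone sketch, not code from this PR): a block-diagonal system decouples into independent per-block solves, so the iteration only has to resolve a single block rather than the full matrix, and the per-iteration work shrinks accordingly.

using LinearAlgebra

bsize, nblocks = 4, 3
blocks = [randn(bsize, bsize) + 5I for _ in 1:nblocks]
A = cat(blocks...; dims = (1, 2))   # block-diagonal system matrix
b = randn(bsize * nblocks)

x_full   = A \ b                    # solve the coupled system
x_blocks = reduce(vcat, [blocks[i] \ b[(i - 1) * bsize + 1:(i * bsize)] for i in 1:nblocks])
@assert x_full ≈ x_blocks           # same solution, obtained block by block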

@avik-pal (Member, Author) commented

[Figures: simplegmres_vs_krylovjl_gmres_100 and simplegmres_vs_krylovjl_gmres_1000]

@avik-pal changed the title from "[WIP] Add SimpleGMRES implementation" to "Add SimpleGMRES implementation" on Aug 25, 2023
@avik-pal (Member, Author) commented

The non-uniform block handling is a bit more finicky (not hard, just a bit irritating to deal with). Unless we have a concrete use case for it, we can hold off and implement it in a later PR.

SimpleGMRES is mostly a copy of KrylovJL_GMRES, but we should hold off on using it as a default anywhere until it has been battle-tested.

I will follow up with benchmarks in DEQs.jl and SciMLSensitivity.jl; my initial runs suggest large improvements in backpropagation (simply because fewer iterations means fewer VJPs computed).

Again, we can't default to the blocksize version. It would make the code really fast for 99% of users, but for the remaining 1% (say, anyone using BatchNorm, which couples the blocks) it would give incorrect results. So I will add a doc entry in SteadyStateAdjoint encouraging its use only if the user is absolutely sure of what is happening here.

@avik-pal (Member, Author) commented

(I don't know why I can't seem to request reviews.)

@ChrisRackauckas, this is ready for review

Comment on lines +280 to +294

    # Update the QR factorization of Hₖ₊₁.ₖ.
    # Apply previous Givens reflections Ωᵢ.
    # [cᵢ  sᵢ] [ r̄ᵢ.ₖ ] = [ rᵢ.ₖ ]
    # [s̄ᵢ -cᵢ] [rᵢ₊₁.ₖ]   [r̄ᵢ₊₁.ₖ]
    for i in 1:(inner_iter - 1)
        Rtmp = c[i] * R[nr + i] + s[i] * R[nr + i + 1]
        R[nr + i + 1] = conj(s[i]) * R[nr + i] - c[i] * R[nr + i + 1]
        R[nr + i] = Rtmp
    end

    # Compute and apply current Givens reflection Ωₖ.
    # [cₖ  sₖ] [ r̄ₖ.ₖ ] = [rₖ.ₖ]
    # [s̄ₖ -cₖ] [hₖ₊₁.ₖ]   [ 0 ]
    (c[inner_iter], s[inner_iter], R[nr + inner_iter]) = _sym_givens(R[nr + inner_iter],
        Hbis)
Member:

Aren't there Base functions for this?

@avik-pal (Member, Author) replied:

It seems like there are. I have opened an issue to track those: #375

A Contributor replied:

@avik-pal This is a 2x2 Givens reflection, not a 2x2 Givens rotation.
They are not the same thing.
The advantage of reflections is that they are Hermitian, which makes it easier to perform products with Q'.
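
To spell out the distinction in real arithmetic (a standalone sketch following the convention in the review comments above, not the PR's actual _sym_givens):

# A Givens *reflection* [c s; s -c] is symmetric (Hermitian in the complex case),
# while a Givens *rotation* [c s; -s c] is orthogonal but not symmetric.
function sym_givens_sketch(a::Float64, b::Float64)
    ρ = hypot(a, b)                   # length of the (a, b) vector
    ρ == 0 && return (1.0, 0.0, 0.0)  # degenerate case: a and b both zero
    return (a / ρ, b / ρ, ρ)
end

c, s, ρ = sym_givens_sketch(3.0, 4.0)
R = [c s; s -c]                    # the 2×2 reflection
@assert R * [3.0, 4.0] ≈ [ρ, 0.0]  # zeroes out the second entry
@assert R == R'                    # Hermitian, so products with Q' reuse the same c and s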

@ChrisRackauckas (Member) commented

It's at least good enough to merge. I think some things could be simplified but that can be done over time.

@ChrisRackauckas merged commit ac2b5aa into SciML:main on Sep 20, 2023
15 of 17 checks passed
@avik-pal deleted the ap/simplegmres branch on September 20, 2023 20:30
@amontoison (Contributor) commented Oct 31, 2023

@ChrisRackauckas @avik-pal
Why not just implement an efficient linear operator to perform your matrix-vector products?
You can create a structure and apply a dense mat-vec product (BLAS gemm) to each block.
You just need to define LinearAlgebra.mul! for your structure.
You can also apply each mat-vec in parallel.
We have been doing that for years in Krylov.jl...
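
A minimal sketch of what is being suggested here (the type name and storage layout are hypothetical, not an existing Krylov.jl or LinearSolve.jl type): store the diagonal blocks and define LinearAlgebra.mul!, which is the one method the Krylov solver needs in order to apply the operator.

using LinearAlgebra

# Hypothetical block-diagonal operator: `blocks[i]` is the i-th square diagonal block.
struct BlockDiagOperator{T, M <: AbstractMatrix{T}}
    blocks::Vector{M}
end

Base.eltype(::BlockDiagOperator{T}) where {T} = T
Base.size(A::BlockDiagOperator) = (n = sum(size.(A.blocks, 1)); (n, n))
Base.size(A::BlockDiagOperator, i::Int) = size(A)[i]

# y = A * x applied block by block; each inner mul! is a dense BLAS mat-vec,
# and the blocks could just as well be processed in parallel or batched on a GPU.
function LinearAlgebra.mul!(y::AbstractVector, A::BlockDiagOperator, x::AbstractVector)
    offset = 0
    for B in A.blocks
        n = size(B, 1)
        mul!(view(y, (offset + 1):(offset + n)), B, view(x, (offset + 1):(offset + n)))
        offset += n
    end
    return y
end

Base.:*(A::BlockDiagOperator, x::AbstractVector) = mul!(similar(x, size(A, 1)), A, x)

With such an operator the dense per-block products stay fast; the point raised below is that the dot products in the Krylov iteration still reduce over the whole concatenated vector.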

@ChrisRackauckas (Member) commented

That's fine, but it requires the user to do that. Sometimes doing something for the user is simpler and removes restrictions.

@avik-pal (Member, Author) commented

@amontoison I was testing whether we can use an efficient linear operator to get rid of this. I have a UniformBlockDiagonalMatrix which internally stores an N x N x M array, i.e. M diagonal blocks of size N x N. Even with a really fast matrix multiply (I call the CUBLAS batched gemm routines), the problem arises from the dot operation, which needs to perform a much larger reduction than necessary. See this flamegraph:

[Flamegraph image]

julia> @benchmark CUDA.@sync solve(prob1, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  462.166 ms … 480.612 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     476.150 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   473.402 ms ±   6.727 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁     ▁            ▁▁             ▁           ▁ ▁   ▁      █▁  
  █▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁█▁▁▁▁▁▁██ ▁
  462 ms           Histogram: frequency by time          481 ms <

 Memory estimate: 1.25 MiB, allocs estimate: 37978.

julia> @benchmark CUDA.@sync solve(prob2, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  467.924 ms … 488.971 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     474.196 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   474.975 ms ±   6.809 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█ ▁         ▁    ▁▁      ▁           ▁ ▁                   ▁  
  ██▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  468 ms           Histogram: frequency by time          489 ms <

 Memory estimate: 1.09 MiB, allocs estimate: 36442.

julia> @benchmark CUDA.@sync solve(prob2, SimpleGMRES(; blocksize = size(A1.data, 1)))
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  143.097 ms … 641.160 ms  ┊ GC (min … max): 0.00% … 8.66%
 Time  (median):     152.867 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   186.930 ms ± 125.551 ms  ┊ GC (mean ± σ):  2.32% ± 2.52%

  █▆                                                             
  ██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ ▁
  143 ms           Histogram: frequency by time          641 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 289135.

prob1 is constructed with the operator, and prob2 with the equivalent Matrix form.

@amontoison (Contributor) commented Mar 11, 2024

@avik-pal
Can you display the output of CUDA.@profile solve(prob1, KrylovJL_GMRES()) and of your SimpleGMRES?

I don't understand why the dot is different between your variant and mine; do you use your own kernel?
The linear system has the same size, so the time spent in dot products should be similar with the generic dispatch.

        mul!(w, A, p)            # w ← ANvₖ
        PlisI || ldiv!(q, Pl, w) # q ← MANvₖ
        for i in 1:inner_iter
            sum!(R[nr + i]', __batch(V[i]) .* __batch(q))
@avik-pal (Member, Author) replied:

@amontoison see this line. For the batched case (an N x B input), the dot only needs to reduce to 1 x B instead of all the way down to a scalar.
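
To illustrate the difference being pointed at (a standalone sketch; __batch in the quoted code is the PR's internal reshape helper and is only mimicked here by working with N x B matrices directly):

using LinearAlgebra

N, B = 8, 1000                  # block size and number of blocks
V = randn(N, B)                 # one Krylov basis "column" per block
q = randn(N, B)

# Generic GMRES sees a single length N*B vector, so every dot is one global reduction:
global_dot = dot(vec(V), vec(q))          # one scalar

# The batched path only needs B independent length-N reductions:
block_dots = vec(sum(V .* q; dims = 1))   # a length-B vector, one dot per block

@assert sum(block_dots) ≈ global_dot      # the global dot is just the sum of the block dots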

@avik-pal (Member, Author) commented

> Can you display the output of CUDA.@profile solve(prob1, KrylovJL_GMRES()) and of your SimpleGMRES?

I will share the profile in a couple of hours.

@avik-pal (Member, Author) commented Mar 11, 2024

julia> CUDA.@profile solve(prob2, SimpleGMRES(; blocksize=8))
Profiler ran for 5.86 ms, capturing 5269 events.

Host-side activity: calling CUDA APIs took 3.58 ms (61.09% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                        │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────────────┤
│   25.14% │    1.47 ms │   469 │   3.14 µs ± 26.0    (  0.72 … 535.73)  │ cuMemAllocFromPoolAsync     │
│   16.65% │  976.32 µs │   417 │   2.34 µs ± 0.96    (  1.67 … 14.07)   │ cuLaunchKernel              │
│    5.96% │  349.28 µs │     2 │ 174.64 µs ± 240.57  (  4.53 … 344.75)  │ cuMemcpyDtoDAsync           │
│    3.07% │  179.77 µs │    24 │   7.49 µs ± 0.91   (   6.2 … 10.01)    │ cuMemcpyDtoHAsync           │
│    1.32% │   77.25 µs │     2 │  38.62 µs ± 43.5    (  7.87 … 69.38)   │ cudaMemcpyAsync             │
│    0.56% │    32.9 µs │    11 │   2.99 µs ± 1.13    (  1.91 … 5.01)    │ cudaLaunchKernel            │
│    0.50% │   29.09 µs │    48 │ 605.98 ns ± 230.47  (238.42 … 1192.09) │ cuStreamSynchronize         │
│    0.11% │    6.68 µs │     1 │                                        │ cuMemsetD32Async            │
│    0.07% │    4.29 µs │     1 │                                        │ cudaFuncGetAttributes       │
│    0.04% │    2.62 µs │    14 │ 187.33 ns ± 166.72  (   0.0 … 476.84)  │ cudaGetLastError            │
│    0.04% │    2.15 µs │     1 │                                        │ cudaStreamSynchronize       │
│    0.02% │    1.43 µs │     5 │  286.1 ns ± 261.17  (   0.0 … 715.26)  │ cudaStreamGetCaptureInfo_v2 │
│    0.02% │    1.43 µs │     1 │                                        │ cudaEventRecord             │
│    0.01% │  476.84 ns │     4 │ 119.21 ns ± 137.65  (   0.0 … 238.42)  │ cuDeviceGet                 │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 679.49 µs (11.59% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name (mangled kernel names abbreviated)  │
├──────────┼────────────┼───────┼──────────────────────────────────────────┤
│    1.38% │   80.82 µs │    36 │ partial_mapreduce_grid (add_sum)         │
│    1.10% │   64.37 µs │    52 │ broadcast_kernel                         │
│    0.98% │   57.46 µs │    37 │ broadcast_kernel                         │
│    0.94% │   54.84 µs │    36 │ broadcast_kernel                         │
│    0.92% │   54.12 µs │    45 │ broadcast_kernel (identity)              │
│    0.64% │   37.43 µs │    36 │ adjoint copy kernel                      │
│    0.48% │   28.13 µs │    25 │ [copy device to pageable memory]         │
│    0.36% │   20.98 µs │     9 │ partial_mapreduce_grid (add_sum)         │
│    0.31% │   18.12 µs │     8 │ cuBLAS gemvx (strided-batched gemv)      │
│    0.20% │   11.92 µs │     8 │ gpu___fast_sym_givens_kernel_            │
│    0.10% │    5.72 µs │     2 │ cuBLAS nrm2_kernel                       │
│    0.02% │    1.43 µs │     1 │ cuBLAS axpy_kernel_val                   │
└──────────┴────────────┴───────┴──────────────────────────────────────────┘
(remaining rows are further small broadcast/mapreduce kernels and memory copies, each 0.7% of the trace or less)
julia> CUDA.@profile solve(prob2, KrylovJL_GMRES())
Profiler ran for 672.18 ms, capturing 767034 events.

Host-side activity: calling CUDA APIs took 573.04 ms (85.25% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │  Calls │ Time distribution                             │ Name                        │
├──────────┼────────────┼────────┼───────────────────────────────────────────────┼─────────────────────────────┤
│   32.08% │  215.65 ms │  33412 │   6.45 µs ± 1.86      (  1.91 … 108.48)       │ cudaMemcpyAsync             │
│   30.94% │  207.96 ms │  99716 │   2.09 µs ± 10.1      (  1.43 … 2249.96)      │ cudaLaunchKernel            │
│    7.66% │   51.47 ms │  33154 │   1.55 µs ± 0.94      (  1.19 … 113.49)       │ cudaFuncGetAttributes       │
│    4.98% │   33.45 ms │ 165770 │ 201.78 ns ± 164.18    (   0.0 … 12397.77)     │ cudaStreamGetCaptureInfo_v2 │
│    3.49% │   23.49 ms │  33154 │ 708.47 ns ± 11711.15  (476.84 … 2.13193893e6) │ cudaStreamSynchronize       │
│    2.27% │   15.25 ms │ 166280 │  91.73 ns ± 5416.7    (   0.0 … 2.20823288e6) │ cudaGetLastError            │
│    1.78% │   11.93 ms │  33154 │ 359.97 ns ± 693.56    (238.42 … 117301.94)    │ cudaEventRecord             │
│    1.01% │    6.81 ms │      1 │                                               │ cuMemsetD32Async            │
│    0.16% │    1.07 ms │    262 │   4.07 µs ± 2.01      (  0.72 … 16.21)        │ cuMemAllocFromPoolAsync     │
│    0.12% │  833.75 µs │    278 │    3.0 µs ± 1.41      (  1.43 … 16.69)        │ cuLaunchKernel              │
└──────────┴────────────┴────────┴───────────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 163.36 ms (24.30% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name (mangled kernel names abbreviated)  │
├──────────┼────────────┼───────┼──────────────────────────────────────────┤
│    7.12% │   47.83 ms │ 32896 │ cuBLAS dot_kernel                        │
│    6.64% │   44.64 ms │ 32896 │ cuBLAS reduce_1Block_kernel              │
│    5.61% │   37.68 ms │ 33152 │ cuBLAS axpy_kernel_val                   │
│    4.57% │   30.74 ms │ 33154 │ [copy device to pageable memory]         │
│    0.21% │    1.39 ms │   516 │ cuBLAS nrm2_kernel                       │
│    0.10% │  641.11 µs │   256 │ cuBLAS gemvx (strided-batched gemv)      │
│    0.04% │  302.08 µs │   256 │ broadcast_kernel                         │
│    0.02% │  103.95 µs │   258 │ [copy pageable to device memory]         │
│    0.00% │   21.46 µs │    21 │ broadcast_kernel (identity)              │
└──────────┴────────────┴───────┴──────────────────────────────────────────┘
(plus two one-off entries totalling under 3 µs)

As a sanity check, both algorithms do converge:

julia> solve(prob1, KrylovJL_GMRES()).resid
0.00023247192f0

julia> solve(prob2, KrylovJL_GMRES()).resid
0.00023807214f0

julia> solve(prob2, SimpleGMRES(; blocksize=8)).resid
1.1054129f-5
