Fusing FH datalayouts is slower than fusing HF datalayouts #2165

Open
charleskawczynski opened this issue Jan 29, 2025 · 0 comments

Reproducer (in ClimaAtmos, from CliMA/ClimaAtmos.jl#3540):

#=
julia --project=examples

ENV["CLIMACOMMS_DEVICE"] = "CUDA"
=#
using ClimaComms
ClimaComms.@import_required_backends
import ClimaAtmos as CA
using ClimaCore.CommonSpaces
using ClimaCore: Spaces, Fields, Geometry, ClimaCore
using Test
using Base.Broadcast: materialize
using ClimaCore.DataLayouts

FT = Float64;
ᶜspace = ExtrudedCubedSphereSpace(
    FT;
    z_elem = 30,
    z_min = 0,
    z_max = 1,
    radius = 10,
    h_elem = 15,
    n_quad_points = 4,
    horizontal_layout_type = DataLayouts.IJHF, # HF layout; swap in DataLayouts.IJFH for the FH comparison
    staggering = CellCenter(),
);
ᶠspace = Spaces.face_space(ᶜspace);
ᶜz = Fields.coordinate_field(ᶜspace).z;
ᶠz = Fields.coordinate_field(ᶠspace).z;
zmax = maximum(ᶠz);
vs = CA.ViscousSponge{FT}(; zd = 0, κ₂ = 1);
ᶜuₕ = map(z -> zero(Geometry.Covariant12Vector{eltype(z)}), ᶜz);
ᶜuₕₜ = similar(ᶜuₕ);
@. ᶜuₕ.components.data.:1 = 1;
@. ᶜuₕ.components.data.:2 = 1;
rs = CA.RayleighSponge(; zd = FT(0), α_uₕ = FT(1), α_w = FT(1));
rst = CA.rayleigh_sponge_tendency_uₕ(ᶜuₕ, rs);
vst = CA.viscous_sponge_tendency_uₕ(ᶜuₕ, vs);
function main_unfused(ᶜuₕₜ, rst, vst)
    @. ᶜuₕₜ += vst
    @. ᶜuₕₜ += rst
    nothing
end
function main_fused(ᶜuₕₜ, rst, vst)
    @. ᶜuₕₜ += vst + rst
    nothing
end

using BenchmarkTools
main_fused(ᶜuₕₜ, rst, vst)
main_unfused(ᶜuₕₜ, rst, vst)
device = ClimaComms.device()
@benchmark ClimaComms.@cuda_sync device main_unfused($ᶜuₕₜ, $rst, $vst)
@benchmark ClimaComms.@cuda_sync device main_fused($ᶜuₕₜ, $rst, $vst)

We should boil this down to a minimal example and better understand why fusing hurts performance for the Cartesian-indexed kernels but improves performance for the linear-indexed kernels.
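As a starting point for that reduction, here is a minimal CPU-only sketch of the two indexing styles applied to the same fused update. It assumes plain `Array`s as stand-ins for ClimaCore data layouts; the function names `fused_linear!` and `fused_cartesian!` and the array sizes are hypothetical. It does not reproduce the GPU slowdown, it only shows what differs between the two kernel shapes.

```julia
# Hypothetical CPU-only sketch: plain Arrays stand in for ClimaCore data layouts.

# Fused update with linear indexing: one integer index per element.
function fused_linear!(out, a, b)
    @inbounds for i in 1:length(out)
        out[i] += a[i] + b[i]
    end
    return nothing
end

# Same fused update with Cartesian indexing: one multi-dimensional index per element.
function fused_cartesian!(out, a, b)
    @inbounds for I in CartesianIndices(out)
        out[I] += a[I] + b[I]
    end
    return nothing
end

dims = (4, 4, 3, 8)  # illustrative sizes only, not taken from the reproducer
a = rand(dims...)
b = rand(dims...)
out_lin = zeros(dims...)
out_cart = zeros(dims...)

fused_linear!(out_lin, a, b)
fused_cartesian!(out_cart, a, b)

# Both loops perform the same arithmetic per element, so the results match exactly.
@assert out_lin == out_cart
```

Both loops do identical arithmetic per element; only the index computation differs, which is the part worth isolating when comparing the two kernel types.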

We saw this in MultiBroadcastFusion (MBF.jl), too, but, IIRC, that was when fusing across different broadcast expressions. Here the slowdown also shows up when fusing into a single broadcast expression, so this case is slightly different, though it agrees with the result found in MBF.jl.
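For reference, the single-expression fusion discussed here can be sketched on plain arrays as follows. This is a hedged illustration, not the ClimaCore kernels: `out`, `vst`, and `rst` are hypothetical `Vector` stand-ins for `ᶜuₕₜ`, `vst`, and `rst`, and the size `n` is arbitrary.

```julia
# Hypothetical stand-ins for the fields in the reproducer; size is illustrative only.
n = 1_000
vst = rand(n)
rst = rand(n)

# Unfused: two broadcast expressions, so `out` is read and written twice.
function unfused!(out, vst, rst)
    @. out += vst
    @. out += rst
    return nothing
end

# Fused: one broadcast expression, so `out` is read and written once.
function fused!(out, vst, rst)
    @. out += vst + rst
    return nothing
end

out_unfused = zeros(n)
out_fused = zeros(n)
unfused!(out_unfused, vst, rst)
fused!(out_fused, vst, rst)

# The two forms associate the additions differently, (out + vst) + rst versus
# out + (vst + rst), so compare with a tolerance rather than exact equality.
@assert out_unfused ≈ out_fused
```

The fused form makes one pass over `out` instead of two, which is why it would normally be expected to be at least as fast as the unfused form.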
