Fusing FH datalayouts is slower than fusing HF datalayouts #2165

charleskawczynski · 2025-01-29T02:32:13Z

Reproducer (in ClimaAtmos, from CliMA/ClimaAtmos.jl#3540):

#=
julia --project=examples

ENV["CLIMACOMMS_DEVICE"] = "CUDA"
=#
using ClimaComms
ClimaComms.@import_required_backends
import ClimaAtmos as CA
using ClimaCore.CommonSpaces
using ClimaCore: Spaces, Fields, Geometry, ClimaCore
using Test
using Base.Broadcast: materialize
using ClimaCore.DataLayouts

FT = Float64;
ᶜspace = ExtrudedCubedSphereSpace(
    FT;
    z_elem = 30,
    z_min = 0,
    z_max = 1,
    radius = 10,
    h_elem = 15,
    n_quad_points = 4,
    horizontal_layout_type = DataLayouts.IJHF,
    staggering = CellCenter(),
);
ᶠspace = Spaces.face_space(ᶜspace);
ᶜz = Fields.coordinate_field(ᶜspace).z;
ᶠz = Fields.coordinate_field(ᶠspace).z;
zmax = maximum(ᶠz);
vs = CA.ViscousSponge{FT}(; zd = 0, κ₂ = 1);
ᶜuₕ = map(z -> zero(Geometry.Covariant12Vector{eltype(z)}), ᶜz);
ᶜuₕₜ = similar(ᶜuₕ);
@. ᶜuₕ.components.data.:1 = 1;
@. ᶜuₕ.components.data.:2 = 1;
rs = CA.RayleighSponge(; zd = FT(0), α_uₕ = FT(1), α_w = FT(1));
rst = CA.rayleigh_sponge_tendency_uₕ(ᶜuₕ, rs);
vst = CA.viscous_sponge_tendency_uₕ(ᶜuₕ, vs);
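# Unfused: the two tendencies are added in two separate broadcast expressions.
function main_unfused(ᶜuₕₜ, rst, vst)
    @. ᶜuₕₜ += vst
    @. ᶜuₕₜ += rst
    nothing
end
# Fused: both tendencies are added in a single broadcast expression.
function main_fused(ᶜuₕₜ, rst, vst)
    @. ᶜuₕₜ += vst + rst
    nothing
end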

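using BenchmarkTools
# Call both versions once so compilation is not included in the timings.
main_fused(ᶜuₕₜ, rst, vst)
main_unfused(ᶜuₕₜ, rst, vst)
device = ClimaComms.device()
# Synchronize the device inside each sample so the kernel time is measured.
@benchmark ClimaComms.@cuda_sync device main_unfused($ᶜuₕₜ, $rst, $vst)
@benchmark ClimaComms.@cuda_sync device main_fused($ᶜuₕₜ, $rst, $vst)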

We should boil this down to a smaller reproducer and better understand why fusing hurts performance for the Cartesian-indexed kernels but improves performance for the linear-indexed kernels.

We saw this in MultiBroadcastFusion too but, IIRC, that was when fusing across different broadcast expressions. Here the slowdown also shows up when fusing into a single broadcast expression, so this case is slightly different, though it agrees with the result found in MBF.jl.
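
As a starting point for boiling this down, here is a minimal CPU-only sketch of the two things being compared: unfused vs. fused updates, and linear vs. Cartesian indexing inside a fused loop. It assumes plain `Array`s as stand-ins for the ClimaCore fields, and the function names (`add_unfused!`, `add_fused!`, `fused_linear!`, `fused_cartesian!`) are hypothetical. It does not reproduce the GPU kernels or the FH/HF datalayouts; it only checks that all variants compute the same result.

```julia
# Plain Arrays are an assumption here; they stand in for ClimaCore fields.
x = ones(4, 4, 3)
a = fill(2.0, 4, 4, 3)
b = fill(3.0, 4, 4, 3)

# Unfused: two separate broadcast passes over x.
function add_unfused!(x, a, b)
    @. x += a
    @. x += b
    return nothing
end

# Fused: a single broadcast pass over x.
function add_fused!(x, a, b)
    @. x += a + b
    return nothing
end

# Fused loop using linear indices (one integer per element).
function fused_linear!(x, a, b)
    @inbounds for i in eachindex(x, a, b)
        x[i] += a[i] + b[i]
    end
    return nothing
end

# Fused loop using Cartesian indices (one index per dimension).
function fused_cartesian!(x, a, b)
    @inbounds for I in CartesianIndices(x)
        x[I] += a[I] + b[I]
    end
    return nothing
end

# Run a variant on a fresh copy of x and return the result.
run_variant(f!) = (y = copy(x); f!(y, a, b); y)

results = map(run_variant, (add_unfused!, add_fused!, fused_linear!, fused_cartesian!))
@assert all(r -> r == results[1], results)
```

Timing each variant with `@benchmark` on the CPU will not necessarily show the same trend as the CUDA kernels; the sketch is only meant to pin down the four cases before swapping in the real datalayouts.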

charleskawczynski added the performance label Jan 29, 2025

charleskawczynski self-assigned this Jan 29, 2025
