Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed remapping interpolation bug #2108

Open
charleskawczynski opened this issue Dec 20, 2024 · 10 comments · May be fixed by #2169
Open

Distributed remapping interpolation bug #2108

charleskawczynski opened this issue Dec 20, 2024 · 10 comments · May be fixed by #2169
Labels
bug Something isn't working

Comments

@charleskawczynski
Copy link
Member

Our unit tests seem to be flaky: https://buildkite.com/clima/climacore-ci/builds/4846#0193e0c9-7253-46f0-b029-699ab6c5bf95 (the corresponding PR does not touch anything related to distributed remapping / interpolation).

@charleskawczynski charleskawczynski added the bug Something isn't working label Dec 20, 2024
@Sbozzolo
Copy link
Member

I don't think I've ever seen this error, and we haven't touched that code in a long time.

I am wondering if something else has changed that led to this

@charleskawczynski
Copy link
Member Author

Here's another failure, but this time on updating the dependencies: https://buildkite.com/clima/climacore-ci/builds/4912#01944dc1-7201-4793-8d9f-13723f028c96

@charleskawczynski
Copy link
Member Author

Looking at the failed test, the first value in the array is very different, and not machine precision. So, I think there's likely a race condition somewhere in Remapper.interpolate.

@Sbozzolo
Copy link
Member

Sbozzolo commented Jan 10, 2025

I still have no clue what this could be. What I see is that the entire array is not exactly wrong: it seems that the values are out of order (e.g., we are looking at the first vs last level data).

@Sbozzolo
Copy link
Member

Sbozzolo commented Jan 14, 2025

Here's what I am finding:

  • On clima, I ran the test 20 times and it always passed
  • On the centeral cluster, the following script fails most of the time
using Logging
using Test
using IntervalSets

import ClimaCore:
    Domains,
    Fields,
    Geometry,
    Meshes,
    Operators,
    Spaces,
    Quadratures,
    Topologies,
    Remapping,
    Hypsography
using ClimaComms
ClimaComms.@import_required_backends
const context = ClimaComms.context()
const pid, nprocs = ClimaComms.init(context)
const device = ClimaComms.device()
ArrayType = ClimaComms.array_type(device)

# log output only from root process
logger_stream = ClimaComms.iamroot(context) ? stderr : devnull
prev_logger = global_logger(ConsoleLogger(logger_stream, Logging.Info))
atexit() do
    global_logger(prev_logger)
end

@testset "3D sphere" begin
    vertdomain = Domains.IntervalDomain(
        Geometry.ZPoint(0.0),
        Geometry.ZPoint(1000.0);
        boundary_names = (:bottom, :top),
    )

    vertmesh = Meshes.IntervalMesh(vertdomain, nelems = 30)
    verttopo = Topologies.IntervalTopology(
        ClimaComms.SingletonCommsContext(ClimaComms.device()),
        vertmesh,
    )
    vert_center_space = Spaces.CenterFiniteDifferenceSpace(verttopo)

    horzdomain = Domains.SphereDomain(1e6)

    quad = Quadratures.GLL{4}()
    horzmesh = Meshes.EquiangularCubedSphere(horzdomain, 6)
    horztopology = Topologies.Topology2D(context, horzmesh)
    horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)

    hv_center_space =
        Spaces.ExtrudedFiniteDifferenceSpace(horzspace, vert_center_space)

    longpts = range(-120.0, 120.0, 21)
    latpts = range(-80.0, 80.0, 21)
    zpts = range(0.0, 1000.0, 21)
    hcoords =
        [Geometry.LatLongPoint(lat, long) for long in longpts, lat in latpts]
    zcoords = [Geometry.ZPoint(z) for z in zpts]

    remapper =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)

    coords = Fields.coordinate_field(hv_center_space)

    interp_sin_long = Remapping.interpolate(remapper, sind.(coords.long))
    interp_sin_lat = Remapping.interpolate(remapper, sind.(coords.lat))
    interp_long_lat =
        Remapping.interpolate(remapper, [sind.(coords.long), sind.(coords.lat)])

    interp_long_lat_long = Remapping.interpolate(
        remapper,
        [sind.(coords.long), sind.(coords.lat)],
    )
    if ClimaComms.iamroot(context)
        @test interp_sin_long  interp_long_lat_long[:, :, :, 1]
        @test interp_sin_lat  interp_long_lat_long[:, :, :, 2]
    end
end

The script stops failing if I checkout commit b4a04d8402cf95792bfc5b89d63279d9539da3b2 (as tested by running it 8 times)

@Sbozzolo
Copy link
Member

With older dependencies, the test seems to pass:

https://buildkite.com/clima/climacore-ci/builds/4953

(I tried only once on buildkite, so maybe it's luck. I tried more times locally)

@Sbozzolo
Copy link
Member

Sbozzolo commented Jan 23, 2025

I spent (many) more hours on this.

The problem seems to be a subtle issue with synchronization or something along those lines. This is a reproducer that consistently triggers the problem on the Caltech cluster (running with 2 GPUs)

using Logging
using Test
using IntervalSets

import ClimaCore:
    Domains,
    Fields,
    Geometry,
    Meshes,
    Operators,
    Spaces,
    Quadratures,
    Topologies,
    Remapping,
    Hypsography
using ClimaComms
ClimaComms.@import_required_backends
const context = ClimaComms.context()
const pid, nprocs = ClimaComms.init(context)
const device = ClimaComms.device()
ArrayType = ClimaComms.array_type(device)

@testset "3D sphere" begin
    vertdomain = Domains.IntervalDomain(
        Geometry.ZPoint(0.0),
        Geometry.ZPoint(1000.0);
        boundary_names = (:bottom, :top),
    )

    vertmesh = Meshes.IntervalMesh(vertdomain, nelems = 30)
    verttopo = Topologies.IntervalTopology(
        ClimaComms.SingletonCommsContext(ClimaComms.device()),
        vertmesh,
    )
    vert_center_space = Spaces.CenterFiniteDifferenceSpace(verttopo)

    horzdomain = Domains.SphereDomain(1e6)

    quad = Quadratures.GLL{4}()
    horzmesh = Meshes.EquiangularCubedSphere(horzdomain, 6)
    horztopology = Topologies.Topology2D(context, horzmesh)
    horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)

    hv_center_space =
        Spaces.ExtrudedFiniteDifferenceSpace(horzspace, vert_center_space)

    longpts = range(-120.0, 120.0, 21)
    latpts = range(-80.0, 80.0, 21)
    zpts = range(0.0, 1000.0, 21)
    hcoords =
        [Geometry.LatLongPoint(lat, long) for long in longpts, lat in latpts]
    zcoords = [Geometry.ZPoint(z) for z in zpts]

    remapper =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)

    coords = Fields.coordinate_field(hv_center_space)

    interp_sin_long = Remapping.interpolate(remapper, sind.(coords.long))
    interp_sin_lat = Remapping.interpolate(remapper, sind.(coords.lat))
    interp_long_lat =
        Remapping.interpolate(remapper, [sind.(coords.long), sind.(coords.lat)])

    remapper2 =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)

    # @info "HERE", ClimaComms.mypid(context)

    interp_long_lat_long = Remapping.interpolate(
        remapper2,
        [sind.(coords.long), sind.(coords.lat)],
    )

    if ClimaComms.iamroot(context)
        @test Array(interp_long_lat_long)[1, 1, 1, 1] < 0
        @info Array(interp_long_lat_long)[1, 1, 1, 1]
    end
end

If the lines

    interp_sin_long = Remapping.interpolate(remapper, sind.(coords.long))
    interp_sin_lat = Remapping.interpolate(remapper, sind.(coords.lat))
    interp_long_lat =
        Remapping.interpolate(remapper, [sind.(coords.long), sind.(coords.lat)])

are uncommented, the bug is not triggered.

If the @info is uncommented, the bug is not triggered.

What I observed is that interp_long_lat_long[1, 1, 1, 1] does not get written (so it has its initialization value of 0). interp_long_lat_long should be obtained by the reduction of the local interpolated terms in the two MPI processes. The relevant block is:

    cat_fn = (l...) -> cat(l..., dims = length(remapper.colons) + 1)

    interpolated_values = mapreduce(cat_fn, index_ranges) do range
        num_fields = length(range)

        # Reset interpolated_values. This is needed because we collect distributed results
        # with a + reduction.
        _reset_interpolated_values!(remapper)
        # Perform the interpolations (horizontal and vertical)
        _set_interpolated_values!(
            remapper,
            view(fields, index_field_begin:index_field_end),
        )

        if !isa_vertical_space
            # For spaces with an horizontal component, reshape the output so that it is a nice grid.
            _apply_mpi_bitmask!(remapper, num_fields)
        else
            # For purely vertical spaces, just move to _interpolated_values
            remapper._interpolated_values .= remapper._local_interpolated_values
        end

        # Finally, we have to send all the _interpolated_values to root and sum them up to
        # obtain the final answer. Only the root will contain something useful.
        return _collect_and_return_interpolated_values!(remapper, num_fields)
    end

I added print statements to check that the returned value in _collect_and_return_interpolated_values!(remapper, num_fields) are correct. When the bug is triggered, they are not. I verified that the first entries of remapper._interpolated_values are they what they should be. So, my impression is that the only place where this can go wrong ClimaComms.reduce (which is just MPI.Reduce, which just calls MPI_Reduce`).

I thought it had to do with barriers, but I verified that explicitely synching or adding barriers does not fix the issue.

Of course, reducing the problem further would be good, but it is really hard to make smaller reproducers and still trigger the bug. At this point, I am going to leave it here. If this happens in the wild, I think we would immediately see because the output files would come with large holes filled with significantly incorrect values.

@ph-kev
Copy link
Member

ph-kev commented Jan 28, 2025

On this buildkite, for this run GPU AMIP FINE: new target amip: topo + diagedmf, there are a bunch of zeros in the pressure. See the plot below. This bug could be the cause of this.

Code to reproduce the plot

# Go on clima and download the data in clima_atmos
# Then, run this script
using ClimaAnalysis
using Makie
using GeoMakie
using CairoMakie

var = OutputVar("clima_atmos/pfull_1M_average.nc")

times = copy(var.dims["time"])
z_ref = copy(var.dims["z_reference"])
var = slice(var, time = times[end])
var = slice(var, z_reference = z_ref[1])

fig = Figure()
ClimaAnalysis.Visualize.contour2D_on_globe!(fig, var)
save("pfull.png", fig)

The code produces this plot.

Image

@Sbozzolo
Copy link
Member

Note that the run posted above was on clima, so the bug is not as machine-dependent as I was expecting it to be.

Sbozzolo added a commit that referenced this issue Jan 30, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
@Sbozzolo Sbozzolo linked a pull request Jan 30, 2025 that will close this issue
Sbozzolo added a commit that referenced this issue Jan 30, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
Sbozzolo added a commit that referenced this issue Jan 30, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
Sbozzolo added a commit that referenced this issue Jan 30, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
Sbozzolo added a commit that referenced this issue Jan 31, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
@Sbozzolo
Copy link
Member

Sbozzolo commented Feb 4, 2025

I discussed this on the Julia slack, and bisected this to JuliaParallel/MPI.jl@aac9688.

Here is a failing build with commit JuliaParallel/MPI.jl@aac9688, here is a passing build with JuliaParallel/MPI.jl@9584ac8

Sbozzolo added a commit that referenced this issue Feb 6, 2025
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simplify redefine
`interpolate` in terms of `interpolate!`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants