# Distributed remapping interpolation bug #2108
I don't think I've ever seen this error, and we haven't touched that code in a long time. I am wondering if something else has changed that led to this.
Here's another failure, this time after updating the dependencies: https://buildkite.com/clima/climacore-ci/builds/4912#01944dc1-7201-4793-8d9f-13723f028c96
Looking at the failed test, the first value in the array is very different from the expected one, and the discrepancy is far larger than machine precision. So I think there's likely a race condition somewhere in …
I still have no clue what this could be. What I see is that the array is not simply wrong everywhere: it seems that the values are out of order (e.g., we are looking at first-level vs last-level data).
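One way to test the out-of-order hypothesis is to check whether a failing horizontal slice actually matches a *different* vertical level of the reference. A minimal sketch, assuming the interpolated arrays are laid out as `(long, lat, z)` as in the reproducers below; `find_matching_level` is a hypothetical helper, not part of ClimaCore:

```julia
# Given a 2D slice that fails the comparison, find which vertical level of
# the reference it matches, if any. A return value of `size(reference, 3)`
# instead of `1` would confirm a first-vs-last level swap.
function find_matching_level(bad_slice::AbstractMatrix, reference::AbstractArray{<:Any, 3})
    for k in axes(reference, 3)
        isapprox(bad_slice, view(reference, :, :, k)) && return k
    end
    return nothing
end

# e.g. find_matching_level(interp_long_lat_long[:, :, 1, 1], interp_sin_long)
```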
Here's what I am finding:

```julia
using Logging
using Test
using IntervalSets
import ClimaCore:
Domains,
Fields,
Geometry,
Meshes,
Operators,
Spaces,
Quadratures,
Topologies,
Remapping,
Hypsography
using ClimaComms
ClimaComms.@import_required_backends
const context = ClimaComms.context()
const pid, nprocs = ClimaComms.init(context)
const device = ClimaComms.device()
ArrayType = ClimaComms.array_type(device)
# log output only from root process
logger_stream = ClimaComms.iamroot(context) ? stderr : devnull
prev_logger = global_logger(ConsoleLogger(logger_stream, Logging.Info))
atexit() do
    global_logger(prev_logger)
end
@testset "3D sphere" begin
    vertdomain = Domains.IntervalDomain(
        Geometry.ZPoint(0.0),
        Geometry.ZPoint(1000.0);
        boundary_names = (:bottom, :top),
    )
    vertmesh = Meshes.IntervalMesh(vertdomain, nelems = 30)
    verttopo = Topologies.IntervalTopology(
        ClimaComms.SingletonCommsContext(ClimaComms.device()),
        vertmesh,
    )
    vert_center_space = Spaces.CenterFiniteDifferenceSpace(verttopo)
    horzdomain = Domains.SphereDomain(1e6)
    quad = Quadratures.GLL{4}()
    horzmesh = Meshes.EquiangularCubedSphere(horzdomain, 6)
    horztopology = Topologies.Topology2D(context, horzmesh)
    horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)
    hv_center_space =
        Spaces.ExtrudedFiniteDifferenceSpace(horzspace, vert_center_space)
    longpts = range(-120.0, 120.0, 21)
    latpts = range(-80.0, 80.0, 21)
    zpts = range(0.0, 1000.0, 21)
    hcoords =
        [Geometry.LatLongPoint(lat, long) for long in longpts, lat in latpts]
    zcoords = [Geometry.ZPoint(z) for z in zpts]
    remapper =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)
    coords = Fields.coordinate_field(hv_center_space)
    interp_sin_long = Remapping.interpolate(remapper, sind.(coords.long))
    interp_sin_lat = Remapping.interpolate(remapper, sind.(coords.lat))
    interp_long_lat =
        Remapping.interpolate(remapper, [sind.(coords.long), sind.(coords.lat)])
    interp_long_lat_long = Remapping.interpolate(
        remapper,
        [sind.(coords.long), sind.(coords.lat)],
    )
    if ClimaComms.iamroot(context)
        @test interp_sin_long ≈ interp_long_lat_long[:, :, :, 1]
        @test interp_sin_lat ≈ interp_long_lat_long[:, :, :, 2]
    end
end
```

The script stops failing if I check out commit …
With older dependencies, the test seems to pass: https://buildkite.com/clima/climacore-ci/builds/4953 (I tried only once on Buildkite, so it might be luck; I tried more times locally).
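For stress-testing the flakiness locally, something like the following sketch can run the reproducer repeatedly and stop at the first failure. It assumes MPI.jl's `mpiexec` helper (which returns the launcher `Cmd`) and that the script above is saved as `repro.jl` (a hypothetical file name):

```julia
# Launch the reproducer 20 times with 2 MPI ranks; `run` throws on the first
# nonzero exit code, so the loop stops at the first failing attempt.
using MPI: mpiexec
for i in 1:20
    @info "Attempt $i"
    run(`$(mpiexec()) -n 2 $(Base.julia_cmd()) --project repro.jl`)
end
```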
I spent (many) more hours on this. The problem seems to be a subtle synchronization issue, or something along those lines. Here is a reproducer that consistently triggers the problem on the Caltech cluster (running with 2 GPUs):

```julia
using Logging
using Test
using IntervalSets
import ClimaCore:
Domains,
Fields,
Geometry,
Meshes,
Operators,
Spaces,
Quadratures,
Topologies,
Remapping,
Hypsography
using ClimaComms
ClimaComms.@import_required_backends
const context = ClimaComms.context()
const pid, nprocs = ClimaComms.init(context)
const device = ClimaComms.device()
ArrayType = ClimaComms.array_type(device)
@testset "3D sphere" begin
    vertdomain = Domains.IntervalDomain(
        Geometry.ZPoint(0.0),
        Geometry.ZPoint(1000.0);
        boundary_names = (:bottom, :top),
    )
    vertmesh = Meshes.IntervalMesh(vertdomain, nelems = 30)
    verttopo = Topologies.IntervalTopology(
        ClimaComms.SingletonCommsContext(ClimaComms.device()),
        vertmesh,
    )
    vert_center_space = Spaces.CenterFiniteDifferenceSpace(verttopo)
    horzdomain = Domains.SphereDomain(1e6)
    quad = Quadratures.GLL{4}()
    horzmesh = Meshes.EquiangularCubedSphere(horzdomain, 6)
    horztopology = Topologies.Topology2D(context, horzmesh)
    horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)
    hv_center_space =
        Spaces.ExtrudedFiniteDifferenceSpace(horzspace, vert_center_space)
    longpts = range(-120.0, 120.0, 21)
    latpts = range(-80.0, 80.0, 21)
    zpts = range(0.0, 1000.0, 21)
    hcoords =
        [Geometry.LatLongPoint(lat, long) for long in longpts, lat in latpts]
    zcoords = [Geometry.ZPoint(z) for z in zpts]
    remapper =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)
    coords = Fields.coordinate_field(hv_center_space)
    interp_sin_long = Remapping.interpolate(remapper, sind.(coords.long))
    interp_sin_lat = Remapping.interpolate(remapper, sind.(coords.lat))
    interp_long_lat =
        Remapping.interpolate(remapper, [sind.(coords.long), sind.(coords.lat)])
    remapper2 =
        Remapping.Remapper(hv_center_space, hcoords, zcoords, buffer_length = 2)
    # @info "HERE", ClimaComms.mypid(context)
    interp_long_lat_long = Remapping.interpolate(
        remapper2,
        [sind.(coords.long), sind.(coords.lat)],
    )
    if ClimaComms.iamroot(context)
        @test Array(interp_long_lat_long)[1, 1, 1, 1] < 0
        @info Array(interp_long_lat_long)[1, 1, 1, 1]
    end
end
```

If the commented `@info` lines are uncommented, the bug is not triggered. What I observed concerns this section of the remapping code:

```julia
cat_fn = (l...) -> cat(l..., dims = length(remapper.colons) + 1)

interpolated_values = mapreduce(cat_fn, index_ranges) do range
    num_fields = length(range)
    # Reset interpolated_values. This is needed because we collect distributed results
    # with a + reduction.
    _reset_interpolated_values!(remapper)
    # Perform the interpolations (horizontal and vertical)
    _set_interpolated_values!(
        remapper,
        view(fields, index_field_begin:index_field_end),
    )
    if !isa_vertical_space
        # For spaces with an horizontal component, reshape the output so that it is a nice grid.
        _apply_mpi_bitmask!(remapper, num_fields)
    else
        # For purely vertical spaces, just move to _interpolated_values
        remapper._interpolated_values .= remapper._local_interpolated_values
    end
    # Finally, we have to send all the _interpolated_values to root and sum them up to
    # obtain the final answer. Only the root will contain something useful.
    return _collect_and_return_interpolated_values!(remapper, num_fields)
end
```

I added print statements to check that the returned value in … I thought it had to do with barriers, but I verified that explicitly synchronizing or adding barriers does not fix the issue. Of course, reducing the problem further would be good, but it is really hard to make smaller reproducers that still trigger the bug. At this point, I am going to leave it here. If this happens in the wild, I think we would see it immediately, because the output files would come with large holes filled with significantly incorrect values.
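The "+ reduction" mentioned in the code comments is the key invariant here: every interpolation point must be contributed by exactly one rank, exactly once. A minimal sketch of that collection pattern in plain MPI.jl (illustrative names; not ClimaCore's actual implementation):

```julia
# Each rank owns a disjoint set of points, so its buffer is zero everywhere
# except at the points it owns; a + reduction on root reassembles the full
# array. Run with e.g. `mpiexec -n 2 julia reduce_sketch.jl`.
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

local_values = zeros(nranks)      # stand-in for _interpolated_values
local_values[rank + 1] = rank + 1 # each rank fills only the slot it "owns"

# Only root receives the summed array; the other ranks get `nothing`.
result = MPI.Reduce(local_values, +, comm)
rank == 0 && @show result  # expected: [1.0, 2.0, ..., nranks]
```

If a rank's buffer were stale (e.g., reused before a device-to-host copy completed) or not reset between calls, the reduced array would contain exactly the kind of out-of-place or zeroed values described above.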
On this Buildkite, for the run "GPU AMIP FINE: new target amip: topo + diagedmf", there are a bunch of zeros in the pressure field; see the plot below. This bug could be the cause of this.

[Plot of the remapped pressure field showing the zeros; the code to reproduce it was attached to the original comment.]
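Independent of the attached script, a quick scan of an output file for this kind of corruption might look like the following sketch. Assumptions: the diagnostics are written as NetCDF, NCDatasets.jl is available, and the file path `output/pressure.nc` and variable name `pressure` are hypothetical:

```julia
# Count exact zeros in a remapped 3D (lon, lat, z) pressure variable;
# physical pressure should never be exactly zero, so any hit is suspicious.
using NCDatasets
NCDataset("output/pressure.nc") do ds
    p = ds["pressure"][:, :, :]
    n_bad = count(iszero, p)
    n_bad > 0 && @warn "Found $n_bad exact zeros in the pressure field"
end
```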
Note that the run posted above was on …
I spent many hours tracking down #2108 and could not find the root issue. I decided to take a different approach and simplify things: redefine `interpolate` in terms of `interpolate!`.
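A minimal, self-contained sketch of that refactoring pattern (illustrative names, not ClimaCore's actual implementation): the allocating version just allocates a destination and forwards to the in-place version, so there is a single code path to debug.

```julia
# In-place version: writes into a preallocated destination.
function interpolate_inplace!(dest::AbstractArray, values::AbstractArray)
    dest .= 2 .* values  # stand-in for the real interpolation kernel
    return dest
end

# Allocating version: a thin wrapper over the in-place one.
function interpolate_alloc(values::AbstractArray)
    dest = similar(values)
    return interpolate_inplace!(dest, values)
end

interpolate_alloc([1.0, 2.0, 3.0])  # [2.0, 4.0, 6.0]
```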
I discussed this on the Julia Slack and bisected this to JuliaParallel/MPI.jl@aac9688. Here is a failing build with commit JuliaParallel/MPI.jl@aac9688, and here is a passing build with JuliaParallel/MPI.jl@9584ac8.
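To check the bisection locally, one can pin MPI.jl to the two commits named above (a sketch; best run in a scratch environment):

```julia
# Pin MPI.jl to a specific git revision; the hashes are from the bisection above.
using Pkg
Pkg.add(name = "MPI", rev = "aac9688")   # failing commit
# Pkg.add(name = "MPI", rev = "9584ac8") # passing commit
Pkg.status("MPI")  # confirm which revision is active
```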
Our unit tests seem to be flaky: https://buildkite.com/clima/climacore-ci/builds/4846#0193e0c9-7253-46f0-b029-699ab6c5bf95 (the corresponding PR does not touch anything related to distributed remapping / interpolation).