Logging variable values outside of valid ranges in GPU kernels #2136

Open
charleskawczynski opened this issue Sep 21, 2023 · 0 comments

charleskawczynski commented Sep 21, 2023

There are several situations where we need the following two capabilities:

  • logging, inside GPU kernels, variables passed to certain functions whose values fall outside their valid ranges (as set by physics assumptions).

  • collecting statistics on how often these variables fall outside their valid ranges and by how much. If these variables are used in an iterative solve, we would also like to track how often we reach the maximum number of iterations and how often convergence fails. These stats will help us reduce the frequency and extremity of these outliers.
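As a rough illustration of the second item, the statistics could be expressed as small per-value records merged by an associative combine function, which is the shape a parallel reduction needs. This is only a sketch: `range_record`, `solve_record`, `combine_range`, and `combine_solve` are hypothetical names, the range [0, 1] for `q_tot` is an assumption made for the example, and the code is host-side Julia that has not been run on a GPU.

```julia
# Record for one checked value: was it out of range, and by how much?
function range_record(x, lo, hi)
    d = max(lo - x, x - hi, zero(x))   # distance outside [lo, hi]; zero if inside
    return (n_bad = d > 0 ? 1 : 0, worst = d)
end

# Record for one iterative solve (inputs assumed to be reported by the solver).
solve_record(iters, maxiter, converged) =
    (n_maxiter = iters >= maxiter ? 1 : 0, n_failed = converged ? 0 : 1)

# Associative combiners, so partial results can be merged in any order.
combine_range(a, b) = (n_bad = a.n_bad + b.n_bad, worst = max(a.worst, b.worst))
combine_solve(a, b) =
    (n_maxiter = a.n_maxiter + b.n_maxiter, n_failed = a.n_failed + b.n_failed)

# Example data (made up): four q_tot values and three solver outcomes.
q_tot = [0.01, -0.002, 0.02, -0.0005]
range_stats = mapreduce(x -> range_record(x, 0.0, 1.0), combine_range, q_tot)

solves = [(3, 10, true), (10, 10, false), (10, 10, true)]  # (iters, maxiter, converged)
solve_stats = mapreduce(s -> solve_record(s...), combine_solve, solves)
```

Because the records are plain named tuples of numbers, the same map-then-reduce shape might carry over to device arrays, but that would need to be checked.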

This functionality is needed for different variables, in different packages:

  • Thermodynamics (density, total specific humidity)
  • CloudMicrophysics (cc @trontyl?)
  • SurfaceFluxes (cc @akshaysridhar)

In code, this may look something like:

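function saturation_adjustment(ρ, q_tot, e_int)
    # q_tot (total specific humidity) is outside its valid range when negative
    if q_tot < 0
        log_bad!("q_tot", q_tot)
    end
    # ... (rest of the computation elided; the result is used as T below)
end
function PhaseEquil(...)
    T = saturation_adjustment(...)
    PhaseEquil{FT}(...)
end

@. ts = PhaseEquil(....) # GPU kernel launch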

where log_bad! logs the most negative value of q_tot.
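To make the intended semantics concrete, here is a minimal host-side sketch of a `log_bad!` that keeps only the most negative value per variable name instead of printing each occurrence. The registry `WORST` and the helper `check_q_tot` are hypothetical, and a `Dict` keyed by strings would not work inside a GPU kernel, so this shows only the behaviour we want, not a device-ready implementation.

```julia
# Hypothetical host-side registry: variable name => most negative value seen.
const WORST = Dict{String, Float64}()

function log_bad!(name::AbstractString, value::Real)
    # Keep only the running minimum rather than emitting one message per call.
    WORST[name] = min(get(WORST, name, Inf), Float64(value))
    return nothing
end

# Stand-in for the range check at the top of saturation_adjustment.
function check_q_tot(q_tot)
    if q_tot < 0
        log_bad!("q_tot", q_tot)
    end
    return nothing
end

foreach(check_q_tot, [0.01, -0.002, 0.02, -0.0005])
worst_q_tot = WORST["q_tot"]   # most negative q_tot encountered
```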

In the past, we have tried simply printing this value. However, this can produce a prohibitive number of prints: the Buildkite output exceeds the buffer or, worse, the simulation appears to hang. We therefore need logging that performs reductions in space and time, along with the ability to collect statistics, so that we can work out how to reduce how often the model is evaluated with these bad values.
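One possible shape for "reductions in space and time" is sketched below: each step's field is collapsed to two scalars (space), those are folded into a running summary (time), and a single line is printed at the end of the run. `BadValueLog`, `record_step!`, and `summary_line` are hypothetical names, and the arrays are ordinary CPU arrays with made-up values.

```julia
# Running summary across time steps (hypothetical type).
mutable struct BadValueLog
    n_bad::Int       # total negative values over all steps
    worst::Float64   # smallest value seen over all steps
end
BadValueLog() = BadValueLog(0, Inf)

# Space reduction: collapse one step's field to two scalars, then
# fold them into the running summary (time reduction).
function record_step!(badlog::BadValueLog, q_tot::AbstractArray)
    badlog.n_bad += count(<(0), q_tot)
    badlog.worst = min(badlog.worst, minimum(q_tot))
    return badlog
end

# One line of output per run rather than one per bad grid point.
summary_line(badlog::BadValueLog) =
    badlog.n_bad == 0 ? "q_tot: no negative values" :
    "q_tot: $(badlog.n_bad) negative values, most negative = $(badlog.worst)"

badlog = BadValueLog()
for step_field in ([0.01, -0.002], [0.02, 0.03], [-0.0005, -0.01])
    record_step!(badlog, step_field)
end
println(summary_line(badlog))
```

The output volume is then independent of grid size and step count, which is the property that plain printing lacked.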

Issues in dependencies:
