Warn the user when choice of grid size can cause a CUDA error #340

Merged
JonathanMaes merged 2 commits into 3.11 from feature/warnPrimeFactors
Nov 4, 2024

JonathanMaes
Contributor

This PR adds warnings for two situations where the choice of grid size could result in a CUDA error.

  1. Simulations failing to run: panic: CUDA_ERROR_INVALID_VALUE  #284
    When the number of cells along an axis has a prime factor >127, the CUDA_ERROR_INVALID_VALUE error occurs because of the inner workings of the cuFFT algorithm (see @jplauzie's reply).

    The new warning (example below) is raised as soon as an axis is not 7-smooth, i.e. when its number of cells has a prime factor greater than 7. This covers the >127 case and also draws attention to the recommendation to use 7-smooth grid sizes.

    // WARNING: y-axis is not 7-smooth. It has 501 cells, with prime
    //          factors [3 167], at least one of which is greater than 7.
    //          This may reduce performance or cause a CUDA_ERROR_INVALID_VALUE error.
    
  2. panic: CURAND_STATUS_LENGTH_NOT_MULTIPLE issue when grid size odd and temperature finite #314
    When temperature is nonzero, and the grid contains an odd number of cells, the CURAND_STATUS_LENGTH_NOT_MULTIPLE error occurs. This is explained in the curandGenerateNormal documentation:

    Normally distributed results are generated from pseudorandom generators with a Box-Muller transform, and so require n to be even.

    The new warning (example below) is raised the first time the random thermal field is updated, if the total number of grid cells is odd.

    // WARNING: nonzero temperature requires an even amount of grid cells,
    //          but all axes have an odd number of cells: [625 625 1].
    //          This may cause a CURAND_STATUS_LENGTH_NOT_MULTIPLE error.
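
The two checks described above can be sketched in Go as follows. This is a minimal illustration, not the code merged in this PR: the names `primeFactors`, `isSmooth` and `allOdd`, as well as the example grid size, are hypothetical. It relies on the fact that the total number of cells is odd only when every axis has an odd number of cells.

```go
package main

import "fmt"

// primeFactors returns the distinct prime factors of n in ascending order.
// (Hypothetical helper, not taken from the PR.)
func primeFactors(n int) []int {
	var factors []int
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			factors = append(factors, p)
			// Divide out every power of p before moving on.
			for n%p == 0 {
				n /= p
			}
		}
	}
	// Whatever remains above 1 is itself a prime factor.
	if n > 1 {
		factors = append(factors, n)
	}
	return factors
}

// isSmooth reports whether all prime factors of n are <= bound.
// isSmooth(n, 7) is the 7-smooth test mentioned in item 1.
func isSmooth(n, bound int) bool {
	f := primeFactors(n)
	return len(f) == 0 || f[len(f)-1] <= bound
}

// allOdd reports whether every axis has an odd number of cells,
// which is exactly when the total number of cells is odd (item 2).
func allOdd(size [3]int) bool {
	for _, n := range size {
		if n%2 == 0 {
			return false
		}
	}
	return true
}

func main() {
	size := [3]int{625, 501, 1} // assumed example grid
	axes := [3]string{"x", "y", "z"}

	for i, n := range size {
		if !isSmooth(n, 7) {
			fmt.Printf("// WARNING: %s-axis is not 7-smooth. It has %d cells, with prime factors %v.\n",
				axes[i], n, primeFactors(n))
		}
	}
	if allOdd(size) {
		fmt.Printf("// WARNING: all axes have an odd number of cells: %v.\n", size)
	}
}
```

Checking each axis separately lets the message name the offending axis, as in the first warning example above.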
    

These warnings are printed during program execution, so they may get buried in the output. One alternative would be to raise an error, but that seems premature when the CUDA error has not yet occurred. Another would be to print the warning at the very end of the output, but that seems hard to implement.

MathieuMoalic added a commit to MathieuMoalic/amumax that referenced this pull request Oct 18, 2024
JonathanMaes mentioned this pull request Nov 4, 2024
JonathanMaes merged commit 01bda93 into 3.11 Nov 4, 2024
JonathanMaes deleted the feature/warnPrimeFactors branch November 4, 2024 10:01
MathieuMoalic added a commit to MathieuMoalic/amumax that referenced this pull request Nov 20, 2024