test_cb05 CVODE convergence fails when MPI=OFF and different rates #143

cguzman95 opened this issue Jun 9, 2020 · 9 comments

@cguzman95
Collaborator

cguzman95 commented Jun 9, 2020

Hi @mattldawson,

Let me give some context: this error can easily be seen on the branch chem_mod_testcb05_monarch. This branch adds different photo_rates to test_cb05 (extracted from a MONARCH experiment), and also has an extra test_cb05 file that runs test_cb05 with all the MONARCH input values (same photo_rates, temp, press, time step and concs).

CMake flags:

cmake -D CMAKE_C_COMPILER=gcc \
-D CMAKE_BUILD_TYPE=debug \
-D CMAKE_C_FLAGS_DEBUG="-g" \
-D CMAKE_Fortran_FLAGS_DEBUG="-g" \
-D CMAKE_Fortran_COMPILER=mpifort \
-D ENABLE_JSON=ON \
-D ENABLE_SUNDIALS=ON \
-D ENABLE_TESTS=OFF \
-D ENABLE_GPU=OFF \
-D ENABLE_DEBUG=OFF \
-D FAILURE_DETAIL=OFF \
-D ENABLE_CXX=OFF \
-D ENABLE_MPI=ON \
..

Then, testing test_cb05 with:

  • Monarch photo_rates
  • MPI=ON
  • i_repeat=1, NUM_TIME_STEPS=1

[screenshot: test_cb05 output with MONARCH photo_rates, MPI=ON]

It converges, with an expected difference with respect to EBI.

But using the same config with MPI=OFF:

[screenshot: test_cb05 output with the same config, MPI=OFF]

I'm not sure if it is an error in test_cb05 or in CAMP.

It also happens when running the file test_cb05_monarch (which has the complete MONARCH config). As an extra detail (maybe this is caused by another bug, or by the same one, I'm not sure), using this config with MPI=ON it takes a long time to converge on the first time step (~3 seconds):

[screenshot: test_cb05_monarch output with MPI=ON, slow first time step]

@mattldawson
Collaborator

Hi @cguzman95,

For the MPI=OFF tests are you still compiling with mpiifort?

My guess is that there could be a bug in the test. A couple things to try/note:

  • I'm pretty sure the "corrector convergence failed repeatedly" error is coming from KPP or EBI - that does not look like a CVODE error. My guess is that it's KPP failing, given that the KPP rates are NaN.

  • I would plot the results with the gnuplot scripts for each of these scenarios and look at the profiles for the species that are triggering warnings/errors.

  • Are you sure that the new rates/conditions you are using are getting into KPP and EBI correctly? KPP doesn't have any DO_MPI flags in its code, but there could be DO_MPI flags in the cb05 test code that affect how the new initial conditions you have added are getting to KPP (see the sketch after this list).
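
As an illustration of the last point, here is a minimal C-style sketch (the real cb05 test driver is Fortran, so this is only illustrative) of the pattern worth looking for: initial conditions that are only assigned inside an MPI-guarded block, so an MPI=OFF build never sets them and NaN rates reach the chemistry solvers. The array and names are hypothetical.

/* Illustrative sketch only, not CAMP code: values assigned only under an
 * MPI guard never get set in an MPI=OFF build. */
#include <math.h>
#include <stdio.h>
#ifdef DO_MPI
#include <mpi.h>
#endif

#define N_PHOTO_RXN 3

int main(int argc, char **argv) {
  double photo_rates[N_PHOTO_RXN] = {NAN, NAN, NAN}; /* not yet set */
  (void)argc; (void)argv;

#ifdef DO_MPI
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)                                   /* set on rank 0 only */
    for (int i = 0; i < N_PHOTO_RXN; ++i) photo_rates[i] = 1.0e-2;
  MPI_Bcast(photo_rates, N_PHOTO_RXN, MPI_DOUBLE, 0, MPI_COMM_WORLD);
#endif
  /* Bug to look for: with DO_MPI undefined, nothing ever assigns the
   * rates, so KPP/EBI would receive NaN values. */

  for (int i = 0; i < N_PHOTO_RXN; ++i)
    printf("photo_rate[%d] = %le\n", i, photo_rates[i]);

#ifdef DO_MPI
  MPI_Finalize();
#endif
  return 0;
}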

@cguzman95
Collaborator Author

For the MPI=OFF tests are you still compiling with mpiifort?

Yes.

Are you sure that the new rates/conditions you are using are getting into KPP and EBI correctly?

In EBI yes; in KPP I'm not sure. But CAMP and KPP should be independent. Could an error in KPP stop the execution of CAMP? But it's strange that it only appears when MPI is OFF...

@mattldawson
Collaborator

I'm not sure what you mean by KPP stopping the execution of CAMP. Is this in the test? It seems from the output that KPP just printed the convergence failure message and let the test continue, but it would be possible for KPP to just exit the whole test (although I don't think it does this).

The fact that there are NaN rates in KPP seems like there must be some problem with the way the initial conditions are being passed to KPP. I would check the test code, particularly for blocks affected by DO_MPI flags.
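
One way to localize where the NaNs first appear is a quick sanity check on the rate array right before it is handed to KPP/EBI. A hedged sketch; all names here are placeholders, not CAMP/KPP identifiers:

/* Hedged sketch: print and count NaN entries in a rate array before it is
 * passed on to KPP/EBI. */
#include <math.h>
#include <stdio.h>

static int check_rates(const char *label, const double *rates, int n) {
  int n_bad = 0;
  for (int i = 0; i < n; ++i) {
    if (isnan(rates[i])) {
      printf("%s: rate %d is NaN\n", label, i);
      ++n_bad;
    }
  }
  return n_bad;
}

int main(void) {
  double photo_rates[] = {1.0e-2, 0.0, NAN};   /* made-up example data */
  if (check_rates("before KPP call", photo_rates, 3) > 0)
    printf("NaN rates detected before handing conditions to KPP\n");
  return 0;
}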

@cguzman95
Collaborator Author

cguzman95 commented Jun 9, 2020

Thanks. From your deduction and the hints I provided, it seems to be only a KPP problem. But I must add one more clue (the one that brought me here):

When executing test_cb05_monarch (the same cb05 with all the MONARCH input), the results of the EBI comparison with CAMP differ between MPI=ON and MPI=OFF. With MPI=ON the test passes successfully:

[screenshot: test_cb05_monarch passing with MPI=ON]

But with MPI=OFF:

[screenshot: test_cb05_monarch failing with MPI=OFF]

The convergence-failure message could well be from KPP, but the problem is that now the test fails with different results in CAMP just by disabling the MPI flag. I think it's something related to the photo_rates, because the test works fine if you set these rates to zero.

@mattldawson
Collaborator

Ah, ok - yeah, seems like it could be a problem. Could you somehow output the photolysis rates during the solving, to compare with EBI and between MPI=ON/OFF?

I would also fix the KPP problem, so you can compare among the three, because EBI includes parameterizations that aren't in KPP or CAMP that can affect the results.

@mattldawson
Collaborator

It could also be that whatever the problem is with KPP and MPI=OFF is also a problem with getting conditions to EBI or CAMP, but that it is showing up as a difference in results rather than a solver failure.

@cguzman95
Collaborator Author

cguzman95 commented Jun 10, 2020

Ah, ok - yeah, seems like it could be a problem. Could you somehow output the photolysis rates during the solving, to compare with EBI and between MPI=ON/OFF?

Printing BASE_RATE_ in rxn_photolysis_update_env_state shows the same photolysis rates that were set at init, both with MPI=ON and MPI=OFF.
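
For reference, a standalone sketch of that kind of diagnostic (not the actual CAMP function; BASE_RATE_ here mimics a CAMP-style macro over the reaction float data, and the function name and values are stand-ins):

/* Standalone sketch: print the base photolysis rate each time the
 * environment-dependent state is updated. */
#include <stdio.h>

#define BASE_RATE_ (rxn_float_data[0])

static void update_env_state(const double *rxn_float_data) {
  printf("photolysis base rate = %le\n", BASE_RATE_);
}

int main(void) {
  double rxn_float_data[] = {1.0e-2};   /* hypothetical MONARCH photo rate */
  update_env_state(rxn_float_data);
  return 0;
}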

This may sound strange, but the error is not happening on mn4, only on p9. CMake flags for mn4 are:

cmake -D CMAKE_C_COMPILER=$(which mpicc) \
-D CMAKE_Fortran_COMPILER=$(which mpiifort) \
-D CMAKE_BUILD_TYPE=release \
-D CMAKE_C_FLAGS_DEBUG="-std=c99 " \
-D CMAKE_C_FLAGS_RELEASE="-std=c99 -O3 " \
-D CMAKE_Fortran_FLAGS_DEBUG="" \
-D ENABLE_JSON=ON \
-D ENABLE_SUNDIALS=ON \
-D ENABLE_MPI=OFF \
-D ENABLE_GSL=ON \
-D ENABLE_TESTS=OFF \
..

Changing the C and Fortran flags to the same ones as the p9 configuration doesn't make a difference. Maybe it is an error in the gcc compiler? Or an error that only shows up with gcc?
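
Since the failure only appears with the gcc build on p9, one option (a hedged sketch, glibc-specific) is to enable floating-point exception trapping so the run stops with SIGFPE at the first invalid operation; on the Fortran side, gfortran's -ffpe-trap=invalid,zero does the equivalent.

/* Hedged sketch: trap invalid floating-point operations so a debugger or
 * core dump points at the first place a NaN is produced. feenableexcept()
 * is a glibc (GNU) extension, so this applies to the gcc builds only. */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void) {
  feenableexcept(FE_INVALID | FE_DIVBYZERO);  /* raise SIGFPE on NaN/Inf */
  double a = 0.0;
  double b = a / a;   /* would silently give NaN; now the program traps here */
  printf("b = %f\n", b);
  return 0;
}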

@cguzman95
Collaborator Author

Another discovery, more related to the MONARCH bug but also related to the photolysis rates:

(With MPI=ON and photo_rates=X.) I just enabled the FAILURE_DETAIL flag, and in test_cb05 CVODE returns a convergence error: "mxstep steps taken before reaching tout.", even though the final results are pretty similar to EBI's.

It seems that when the photolysis rates are not homogeneous (0 or 0.01), CVODE doesn't converge. Oriol and I are checking the setting of the photolysis rates; maybe some values are wrong.
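
For reference on that error text: "mxstep steps taken before reaching tout." is CVODE's CV_TOO_MUCH_WORK return, meaning the integrator hit its internal step limit (500 by default) before reaching the requested output time. A hedged sketch of raising the limit for diagnosis with the standard SUNDIALS call (cvode_mem stands in for the solver object CAMP already creates); raising it only masks the symptom if the rates themselves are wrong.

/* Hedged sketch: raise CVODE's internal step limit while diagnosing the
 * CV_TOO_MUCH_WORK failure. Call it where CAMP configures its CVODE
 * solver, e.g. raise_step_limit(cvode_mem, 10000); */
#include <stdio.h>
#include <cvode/cvode.h>   /* CVodeSetMaxNumSteps, CV_SUCCESS */

static int raise_step_limit(void *cvode_mem, long int max_steps) {
  int flag = CVodeSetMaxNumSteps(cvode_mem, max_steps);
  if (flag != CV_SUCCESS)
    fprintf(stderr, "CVodeSetMaxNumSteps failed with flag %d\n", flag);
  return flag;
}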

@cguzman95
Collaborator Author

cguzman95 commented Jun 15, 2020

More info about the CVODE convergence failure when using the MONARCH photo_rates:

Output:

[screenshot: CVODE convergence-failure output]

Output when testing on my GPU branch (old CAMP version, GPU=OFF):

[screenshot: output on the GPU branch]
