
External matrix solver #89

Merged
115 commits merged from external-matrix-solver into master on Apr 27, 2017

Conversation

@edoddridge (Owner) commented Mar 29, 2017

This implements a preconditioned conjugate gradient algorithm from Hypre to solve the implicit equation for eta, and closes #1.

Although the external library is built around MPI and is intrinsically parallel, this branch does not implement any parallelisation. One day that will have its own PR.

The solver implemented in this PR relies on two external libraries: Hypre and MPI. The Hypre dependency is dealt with by including Hypre as a submodule. I have not yet automated the process of downloading and compiling Hypre, but I think it should be possible. The MPI dependency requires the user to have a working version of MPI already installed.

To make this branch pass the test suite I had to disable the -Wall compiler flag. I tried to come up with some way of making sure all variables were used in all modes, or only declared in the modes in which they were used, but I failed. Possibly we could deal with this by moving the offending subroutines to a separate module, overloading them, and implementing generic interfaces (a sketch of that idea is below). Doing that would be a fair amount of work, and should probably be discussed in its own issue and implemented in its own PR. Either way, if this branch is merged without a fix for this, a new issue should be opened about the -Wall flag.
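To make the overloading idea concrete, here is a minimal sketch of what such a module could look like (the module and procedure names are invented for illustration and the bodies are stubs; the point is only that each mode-specific variant declares exactly the variables it needs, while callers go through a single generic name):

    module eta_solvers
      implicit none

      ! Callers invoke solve_eta; the compiler selects the mode-specific
      ! routine from the argument list, so each variant only declares the
      ! variables it actually needs.
      interface solve_eta
        module procedure solve_eta_internal
        module procedure solve_eta_hypre
      end interface solve_eta

    contains

      subroutine solve_eta_internal(etanew, etastar, dt)
        double precision, intent(out) :: etanew(:,:)
        double precision, intent(in)  :: etastar(:,:)
        double precision, intent(in)  :: dt
        ! the existing pure-Fortran solve would go here (stub body)
        etanew = etastar
      end subroutine solve_eta_internal

      subroutine solve_eta_hypre(etanew, etastar, dt, hypre_A)
        double precision, intent(out) :: etanew(:,:)
        double precision, intent(in)  :: etastar(:,:)
        double precision, intent(in)  :: dt
        integer*8, intent(in)         :: hypre_A   ! Hypre object handle
        ! the Hypre-based solve would go here (stub body)
        etanew = etastar
      end subroutine solve_eta_hypre

    end module eta_solvers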

I tried running this version in the pytest test suite, but the numerics are sufficiently different that, of the n-layer tests, only test_f_plane passes. This is another good reason to sort out #66 as soon as possible.

Benchmarking shows that this version is between 2 and 4 times faster than the original code, even though it is still purely serial. It also shows that its execution time grows far more slowly with nx and ny than the original code's does (see #1, comment), which means its relative advantage increases with domain size.

Summary of things to do when this is merged:

  • write documentation about how to install the dependencies
    • automate the installation of Hypre (optional)
  • file a new ticket about the -Wall compiler flag in the test suite
  • parallelise the code, or at least make it MPI-safe, so that we can use the external solver in parallel
  • experiment with different solvers and preconditioners from Hypre to see whether there are additional performance gains to be had with little effort (optional)

This brings the new layout and names to the external matrix solver branch.

Conflicts:
	.gitignore
	Makefile
	aronnax.f90
@edoddridge (Owner, Author) commented:

Actually, I forgot about my style sweep. I'll push that later today.

Groveling the Ubuntu packages database at packages.ubuntu.com suggests that `mpirun` is not in the libopenmpi-dev package, which would explain Travis's inability to execute the test suite. It's fairly standard in Debian-based distros such as Ubuntu to split the development files (libfoo-dev) from the runtime support (openmpi-bin, in this case), because there are purposes that need one but not the other (e.g., just running a pre-compiled MPI application does not need libopenmpi-dev).
@axch (Collaborator) commented Apr 26, 2017

OK, making good progress. The outstanding issues I see now are:

  • Trailing whitespace in code files (I think I got all of it)
  • Travis was missing the openmpi-bin package, which actually contains the mpirun command
  • If the Fortran core exits abnormally, mpirun complains about exiting without calling finalize. We should prevent that from happening. Probably the most robust way to do that would be to define a subroutine named something like clean_stop which finalizes MPI and then invokes stop.
  • When I run the test suite with the external solver executable, Hypre emits many errors like
    ERROR detected by Hypre ...  BEGIN
    ERROR -- hypre_PCGSolve: INFs and/or NaNs detected in input.
    User probably placed non-numerics in supplied b.
    Returning error flag += 101.  Program not terminated.
    ERROR detected by Hypre ...  END
    
    What's going on there? (That run eventually crashes with the core detecting a NaN, which is how I discovered the previous infelicity.)
  • (Minor) for some reason, the test output also includes a dump, to standard output, of eta.av.000000001. Was that a debug print @edoddridge forgot to turn back off?

These loops are not needed when running on a single core, and by all accounts didn't work as intended anyway.
Adds a clean_stop subroutine, which finalizes MPI before stopping and may or may not print an unhappy message, depending on whether the 'happy' input is true.
@edoddridge (Owner, Author) commented Apr 26, 2017

That's a good idea to put in a clean_stop subroutine. I'll add it. The subroutine takes a logical argument happy that controls whether it prints a message to stdout describing the stop. I verified that it prints the expected message when the model fails, but I don't know how to make the test suite expect an error from the Fortran code and catch it in a way that doesn't propagate. If you know how to make the Python do that, then all you need to do is increase the value of dt sufficiently and the model will reliably crash.
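For reference, here is a minimal sketch of the idea (this is not the exact code in the branch; the message text is just a placeholder):

    subroutine clean_stop(happy)
      use mpi
      implicit none
      logical, intent(in) :: happy
      integer :: ierr

      ! Finalize MPI before stopping so that mpirun does not complain about
      ! the program exiting without calling finalize.
      call MPI_Finalize(ierr)
      if (.not. happy) then
        print *, 'Aronnax stopped abnormally.'
      end if
      stop
    end subroutine clean_stop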

The screen dump actually comes from output_preservation_test.assert_outputs_close. If the assertion fails it prints the filename, the offending values, and the expected values of the relevant array. If you disable the assertion in assert_outputs_close and just raise an assertion error manually, you'll see that no arrays get printed.

The Hypre error was from me trying to get it ready for multiple processors and forgetting that the test suite didn't automatically run that section of code. It's been commented out now. It should probably have been deleted to keep the source code stylish, but given that parallelising this is a high priority it felt odd to delete prior attempts before finding one that works.

@coveralls (Coverage Status) commented:

Coverage remained the same at 82.27% when pulling 52e965e on external-matrix-solver into d14420d on master.

@edoddridge (Owner, Author) commented:

I tried cranking the convergence tolerance to within a whisker of machine precision, to see if that would make the two methods converge. It didn't. I think it's going to take careful inspection of simulations at higher resolution than the ones in the test suite to work out what is going on.

@@ -1694,21 +1694,21 @@ subroutine Ext_solver(MPI_COMM_WORLD, hypre_A, hypre_grid, num_procs, &
end do
end do

@axch (Collaborator) commented on the diff on Apr 27, 2017

This commit is a teachable moment.

  1. "by all accounts didn't work as intended anyway" is not a very helpful phrase to see in a log message. What accounts? What was the reported problem? What evidence blames this particular part of the code?

  2. Turning a loop that should execute exactly once (there being one process) into just invoking the loop body one time shouldn't be a solution to anything, because it shouldn't have any effect. If you suspected that the loop was iterating more than once, a simple `print *, num_procs` or `print *, i` should have assuaged your concern (a toy illustration of this check follows this list).

    • Aside: Why is this thing looping over the number of processes anyway? Shouldn't it just be filling its own box? But perhaps I don't understand MPI/Hypre yet.
  3. The actual change here is bogus: instead of looping over all i from 0 to num_procs-1 inclusive, the code as written uses the old value of i from the enclosing scope, which by dumb luck happens to be defined. That the compiler accepted this is a consequence of this barbarous programming language forcing one to declare one's loop variables in advance, instead of defining them on the spot in the looping construct and taking them back out of scope when the looping construct ends, like all civilized systems do (for future reference, Python is not civilized in this sense either).

  4. In this instance, i happens to be nx+1 = 11, which has the effect of reading the box boundaries from somewhere off the ends of the ilower and iupper arrays, i.e., making them up from whole cloth. When I tried it, I got ilower(11,1) ~ 32700, ilower(11,2) = 32, iupper(11,1) ~ 1.95 billion, iupper(11,2) ~ 1.91 billion. (Interestingly, they were the same in every time step, but varied somewhat across runs. I suppose it's actually reading from some nearby array that may have some floating-point input data or something.) I guess this causes Hypre to decide that you are setting values for an empty box, and leave hypre_b and hypre_x at some default value, like 0. Conversely, at extraction time, I guess Hypre simply doesn't modify the values array at all, and, though I haven't checked this, I guess this whole routine is now equivalent to etanew = etastar/dt**2. This, in turn, stops Hypre from complaining about NaNs in its inputs, and manifests as the test suite reporting "wrong answer". Commenting out the three calls to VectorSet/GetBoxValues entirely also exhibits the same external behavior.

  5. Undoing this change and adding some more debug prints instead suggests that what's actually happening to cause those Hypre complaints is that the simulation blows up numerically over multiple iterations, eventually leading to NaNs, which Hypre catches before break_if_NaN does because the latter is only called at dump time. That's an argument in favor of break_if_NaN at every timestep.
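For concreteness, here is a toy, standalone version of the check suggested in point 2 (the program and variable names are made up for illustration; the real loop body with the Set/GetBoxValues calls is elided):

    program check_box_loop
      implicit none
      integer :: i, num_procs

      ! In a serial run there is one process, so this loop should execute
      ! exactly once, with i = 0.
      num_procs = 1

      print *, 'num_procs = ', num_procs
      do i = 0, num_procs-1
        print *, 'filling box for process ', i
        ! in the real routine, the Set/GetBoxValues calls for this
        ! process's box would go here
      end do
    end program check_box_loop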

Possible ways to proceed from here:

  • Revert this commit regardless.
  • Could try to further debug the simulation by "force of will" (i.e., adding print statements and trying to figure out what's happening)
  • Could step back and try to develop more effective test inspection. A one-number summary of the discrepancy in the first time point that has a discrepancy is not very informative; perhaps the test suite should make some visualizations of the differences when it fails.
  • Will the physics-tests branch provide such debugging tools? Should we merge Hypre as still-broken-but-probably-fixable "experimental code" so that we can develop the physics-based testing well enough to figure out what's happening here?

@axch (Collaborator) commented Apr 27, 2017

Re: cranking the tolerance: Were you experimenting with that while Hypre was being completely ignored (see my comment #89 (comment))?

@axch (Collaborator) commented Apr 27, 2017

After conversation, it was decided to merge this PR as-is, on the following grounds:

@coveralls (Coverage Status) commented:

Coverage remained the same at 82.27% when pulling b1a2b32 on external-matrix-solver into d14420d on master.

@axch axch merged commit cf7efd9 into master Apr 27, 2017
@edoddridge edoddridge deleted the external-matrix-solver branch May 4, 2017 23:00
edoddridge added a commit that referenced this pull request Jun 22, 2017
The test suite no longer uses valgrind by default due to issues with MPI and valgrind (see PR #89 and issue #125). Travis is currently failing because `sudo apt-get install valgrind` is returning an error. Since the test suite doesn't use valgrind, that command is being removed.
Successfully merging this pull request may close: Use a faster algorithm for the matrix solver (#1).