
External matrix solver #89

Merged
115 commits merged from external-matrix-solver into master on Apr 27, 2017

Conversation

@edoddridge (Owner) commented Mar 29, 2017

This implements a preconditioned conjugate gradient algorithm from Hypre to solve the implicit equation for eta, and closes #1.

Although the external library is built around MPI and is intrinsically parallel, this branch does not implement any parallelisation. One day that will have its own PR.

The solver implemented in this PR relies on two external libraries: Hypre and MPI. The Hypre dependency is dealt with by including Hypre as a submodule. I have not yet automated the process of downloading and compiling Hypre, but I think it should be possible. The MPI dependency requires the user to have a working version of MPI already installed.

To make this branch pass the test suite I had to disable the -Wall compiler flag. I tried to come up with some way of making sure all variables were used in all modes, or only declared in the modes in which they were used, but I failed. Possibly we could deal with this by moving the offending subroutines to a separate module, overloading them, and implementing generic interfaces (a sketch of that idea is below). Doing that would be a fair amount of work, and should probably be discussed in its own issue and implemented in its own PR. Either way, if this branch is merged without a fix for this, a new issue should be opened about the -Wall flag.
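To make the overloading idea concrete, here is a minimal sketch of what such a module could look like (the module and procedure names are invented for illustration and the bodies are stubs; the point is only that each mode-specific variant declares exactly the variables it needs, while callers go through a single generic name):

    module eta_solvers
      implicit none

      ! Callers invoke solve_eta; the compiler selects the mode-specific
      ! routine from the argument list, so each variant only declares the
      ! variables it actually needs.
      interface solve_eta
        module procedure solve_eta_internal
        module procedure solve_eta_hypre
      end interface solve_eta

    contains

      subroutine solve_eta_internal(etanew, etastar, dt)
        double precision, intent(out) :: etanew(:,:)
        double precision, intent(in)  :: etastar(:,:)
        double precision, intent(in)  :: dt
        ! the existing pure-Fortran solve would go here (stub body)
        etanew = etastar
      end subroutine solve_eta_internal

      subroutine solve_eta_hypre(etanew, etastar, dt, hypre_A)
        double precision, intent(out) :: etanew(:,:)
        double precision, intent(in)  :: etastar(:,:)
        double precision, intent(in)  :: dt
        integer*8, intent(in)         :: hypre_A   ! Hypre object handle
        ! the Hypre-based solve would go here (stub body)
        etanew = etastar
      end subroutine solve_eta_hypre

    end module eta_solvers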

I tried running this version in the pytest test suite, but the numerics are sufficiently different that, of the n-layer tests, only test_f_plane passes. This is another good reason to sort out #66 as soon as possible.

Benchmarking shows that this version is between 2 and 4 times faster than the original code, even though it is still purely serial. It also shows that its execution time grows far more slowly with nx and ny than the original code's does (see #1, comment), which means its relative advantage increases with domain size.

Summary of things to do when this is merged:

  • write documentation about how to install the dependencies
    • automate the installation of Hypre (optional)
  • file a new ticket about the -Wall compiler flag in the test suite
  • parallelise the code, or at least make it MPI-safe, so that we can use the external solver in parallel
  • experiment with different solvers and preconditioners from Hypre to see whether there are additional performance gains to be had with little effort (optional)

This brings the new layout and names to the external matrix solver branch.

Conflicts:
	.gitignore
	Makefile
	aronnax.f90
@edoddridge (Owner, Author) commented:

Actually, I forgot about my style sweep. I'll push that later today.

Groveling the Ubuntu packages database at packages.ubuntu.com suggests that `mpirun` is not in the libopenmpi-dev package, which would explain Travis's inability to execute the test suite. It's fairly standard in Debian-based distros such as Ubuntu to split the development files (libfoo-dev) from the runtime support (openmpi-bin, in this case), because there are purposes that need one but not the other (e.g., just running a pre-compiled MPI application does not need libopenmpi-dev).
@axch (Collaborator) commented Apr 26, 2017

OK, making good progress. The outstanding issues I see now are:

  • Trailing whitespace in code files (I think I got all of it)
  • Travis was missing the openmpi-bin package, which actually contains the mpirun command
  • If the Fortran core exits abnormally, mpirun complains about exiting without calling finalize. We should prevent that from happening. Probably the most robust way to do that would be to define a subroutine named something like clean_stop which finalizes MPI and then invokes stop.
  • When I run the test suite with the external solver executable, Hypre emits many errors like
    ERROR detected by Hypre ...  BEGIN
    ERROR -- hypre_PCGSolve: INFs and/or NaNs detected in input.
    User probably placed non-numerics in supplied b.
    Returning error flag += 101.  Program not terminated.
    ERROR detected by Hypre ...  END
    
    What's going on there? (That run eventually crashes with the core detecting a NaN, which is how I discovered the previous infelicity.)
  • (Minor) for some reason, the test output also includes a dump, to standard output, of eta.av.000000001. Was that a debug print @edoddridge forgot to turn back off?

These loops are not needed when running on a single core, and by all accounts didn't work as intended anyway.
Adds a clean_stop subroutine, which finalizes MPI before stopping and may or may not print an unhappy message, depending on whether the 'happy' input is true.
@edoddridge (Owner, Author) commented Apr 26, 2017

That's a good idea to put in a clean_stop subroutine. I'll add it. The subroutine takes a logical argument happy that controls whether it prints a message to stdout describing the stop. I verified that it prints the expected message when the model fails, but I don't know how to make the test suite expect an error from the Fortran code and catch it in a way that doesn't propagate. If you know how to make the Python do that, then all you need to do is increase the value of dt sufficiently and the model will reliably crash.
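For reference, here is a minimal sketch of the idea (this is not the exact code in the branch; the message text is just a placeholder):

    subroutine clean_stop(happy)
      use mpi
      implicit none
      logical, intent(in) :: happy
      integer :: ierr

      ! Finalize MPI before stopping so that mpirun does not complain about
      ! the program exiting without calling finalize.
      call MPI_Finalize(ierr)
      if (.not. happy) then
        print *, 'Aronnax stopped abnormally.'
      end if
      stop
    end subroutine clean_stop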

The screen dump actually comes from output_preservation_test.assert_outputs_close. If the assertion fails it prints the filename, the offending values, and the expected values of the relevant array. If you disable the assertion in assert_outputs_close and just raise an assertion error manually, you'll see that no arrays get printed.

The Hypre error was from me trying to get it ready for multiple processors and forgetting that the test suite didn't automatically run that section of code. It's been commented out now. It should probably have been deleted to keep the source code stylish, but given that parallelising this is a high priority it felt odd to delete prior attempts before finding one that works.

@coveralls (Coverage Status) commented:

Coverage remained the same at 82.27% when pulling 52e965e on external-matrix-solver into d14420d on master.

@edoddridge (Owner, Author) commented:

I tried cranking the convergence tolerance to within a whisker of machine precision, to see if that would make the two methods converge. It didn't. I think it's going to take careful inspection of simulations at higher resolution than the ones in the test suite to work out what is going on.

@@ -1694,21 +1694,21 @@ subroutine Ext_solver(MPI_COMM_WORLD, hypre_A, hypre_grid, num_procs, &
end do
end do

@axch (Collaborator) commented on the diff on Apr 27, 2017

This commit is a teachable moment.

  1. "by all accounts didn't work as intended anyway" is not a very helpful phrase to see in a log message. What accounts? What was the reported problem? What evidence blames this particular part of the code?

  2. Turning a loop that should execute exactly once (there being one process) into just invoking the loop body one time shouldn't be a solution to anything, because it shouldn't have any effect. If you suspected that the loop was iterating more than once, a simple `print *, num_procs` or `print *, i` should have assuaged your concern (a toy illustration of this check follows this list).

    • Aside: Why is this thing looping over the number of processes anyway? Shouldn't it just be filling its own box? But perhaps I don't understand MPI/Hypre yet.
  3. The actual change here is bogus: instead of looping over all i from 0 to num_procs-1 inclusive, the code as written uses the old value of i from the enclosing scope, which by dumb luck happens to be defined. That the compiler accepted this is a consequence of this barbarous programming language forcing one to declare one's loop variables in advance, instead of defining them on the spot in the looping construct and taking them back out of scope when the looping construct ends, like all civilized systems do (for future reference, Python is not civilized in this sense either).

  4. In this instance, i happens to be nx+1 = 11, which has the effect of reading the box boundaries from somewhere off the ends of the ilower and iupper arrays, i.e., making them up from whole cloth. When I tried it, I got ilower(11,1) ~ 32700, ilower(11,2) = 32, iupper(11,1) ~ 1.95 billion, iupper(11,2) ~ 1.91 billion. (Interestingly, they were the same in every time step, but varied somewhat across runs. I suppose it's actually reading from some nearby array that may have some floating-point input data or something.) I guess this causes Hypre to decide that you are setting values for an empty box, and leave hypre_b and hypre_x at some default value, like 0. Conversely, at extraction time, I guess Hypre simply doesn't modify the values array at all, and, though I haven't checked this, I guess this whole routine is now equivalent to etanew = etastar/dt**2. This, in turn, stops Hypre from complaining about NaNs in its inputs, and manifests as the test suite reporting "wrong answer". Commenting out the three calls to VectorSet/GetBoxValues entirely also exhibits the same external behavior.

  5. Undoing this change and adding some more debug prints instead suggests that what's actually happening to cause those Hypre complaints is that the simulation blows up numerically over multiple iterations, eventually leading to NaNs, which Hypre catches before break_if_NaN does because the latter is only called at dump time. That's an argument in favor of break_if_NaN at every timestep.
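For concreteness, here is a toy, standalone version of the check suggested in point 2 (the program and variable names are made up for illustration; the real loop body with the Set/GetBoxValues calls is elided):

    program check_box_loop
      implicit none
      integer :: i, num_procs

      ! In a serial run there is one process, so this loop should execute
      ! exactly once, with i = 0.
      num_procs = 1

      print *, 'num_procs = ', num_procs
      do i = 0, num_procs-1
        print *, 'filling box for process ', i
        ! in the real routine, the Set/GetBoxValues calls for this
        ! process's box would go here
      end do
    end program check_box_loop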

Possible ways to proceed from here:

  • Revert this commit regardless.
  • Could try to further debug the simulation by "force of will" (i.e., adding print statements and trying to figure out what's happening)
  • Could step back and try to develop more effective test inspection. A one-number summary of the discrepancy in the first time point that has a discrepancy is not very informative; perhaps the test suite should make some visualizations of the differences when it fails.
  • Will the physics-tests branch provide such debugging tools? Should we merge Hypre as still-broken-but-probably-fixable "experimental code" so that we can develop the physics-based testing well enough to figure out what's happening here?

@axch (Collaborator) commented Apr 27, 2017

Re: cranking the tolerance: Were you experimenting with that while Hypre was being completely ignored (see my comment #89 (comment))?

@axch (Collaborator) commented Apr 27, 2017

After conversation, it was decided to merge this PR as-is, on the following grounds:

@coveralls (Coverage Status) commented:

Coverage remained the same at 82.27% when pulling b1a2b32 on external-matrix-solver into d14420d on master.

@axch axch merged commit cf7efd9 into master Apr 27, 2017
@edoddridge edoddridge deleted the external-matrix-solver branch May 4, 2017 23:00
edoddridge added a commit that referenced this pull request Jun 22, 2017
The test suite no longer uses valgrind by default due to issues with MPI and valgrind (see PR #89 and issue #125). Travis is currently failing because `sudo apt-get install valgrind` is returning an error. Since the test suite doesn't use valgrind, that command is being removed.
Successfully merging this pull request may close: Use a faster algorithm for the matrix solver (#1).