External matrix solver #89
Conversation
This brings the new layout and names to the external matrix solver branch. Conflicts: .gitignore, Makefile, aronnax.f90
Actually, I forgot about my style sweep. I'll push that later today.
Groveling the Ubuntu packages database at packages.ubuntu.com suggests that `mpirun` is not in the libopenmpi-dev package, which would explain Travis's inability to execute the test suite. It's fairly standard in Debian-based distros such as Ubuntu to split the development files (libfoo-dev) from the runtime support (openmpi-bin, in this case), because there are purposes that need one but not the other (e.g., just running a pre-compiled MPI application does not need libopenmpi-dev).
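If that diagnosis is right, a minimal sketch of the fix would be to install the runtime package alongside the dev package via Travis's APT addon. This is a hypothetical `.travis.yml` fragment, not the repo's actual file:

```yaml
# Hypothetical .travis.yml fragment: openmpi-bin supplies mpirun at runtime,
# while libopenmpi-dev only supplies headers and link-time libraries.
addons:
  apt:
    packages:
      - libopenmpi-dev
      - openmpi-bin
```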
OK, making good progress. The outstanding issues I see now are:
…e/aronnax into external-matrix-solver
Conflicts: .travis.yml
These loops are not needed when running on a single core, and by all accounts didn't work as intended anyway.
and may or may not print an unhappy message, depending on whether the 'happy' input is true.
That's a good idea to put in a […]. The screen dump actually comes from […]. The Hypre error was from me trying to get it ready for multiple processors and forgetting that the test suite didn't automatically run that section of code. It's been commented out now. It should probably have been deleted to keep the source code stylish, but given that parallelising this is a high priority, it felt odd to delete prior attempts before finding one that works.
I tried cranking the convergence tolerance to within a whisker of machine precision, to see if that would make the two methods converge. It didn't. I think it's going to take careful inspection of simulations at higher resolution than the ones in the test suite to work out what is going on.
```
@@ -1694,21 +1694,21 @@ subroutine Ext_solver(MPI_COMM_WORLD, hypre_A, hypre_grid, num_procs, &
      end do
    end do
```
This commit is a teachable moment.
- "by all accounts didn't work as intended anyway" is not a very helpful phrase to see in a log message. What accounts? What was the reported problem? What evidence blames this particular part of the code?
- Turning a loop that should execute exactly once (there being one process) into just invoking the loop body one time shouldn't be a solution to anything, because it shouldn't have any effect. If you suspected that the loop was iterating more than once, a simple `print *, num_procs` or `print *, i` should have assuaged your concern.
- Aside: Why is this thing looping over the number of processes anyway? Shouldn't it just be filling its own box? But perhaps I don't understand MPI/Hypre yet.
- The actual change here is bogus: instead of looping over all `i` from `0` to `num_procs-1` inclusive, the code as written uses the old value of `i` from the enclosing scope, which by dumb luck happens to be defined. That the compiler accepted this is a consequence of this barbarous programming language forcing one to declare one's loop variables in advance, instead of defining them on the spot in the looping construct and taking them back out of scope when the looping construct ends, like all civilized systems do (for future reference, Python is not civilized in this sense either). A self-contained sketch of this failure mode follows this list.
- In this instance, `i` happens to be `nx+1 = 11`, which has the effect of reading the box boundaries from somewhere off the ends of the `ilower` and `iupper` arrays, i.e., making them up from whole cloth. When I tried it, I got `ilower(11,1) ~ 32700`, `ilower(11,2) = 32`, `iupper(11,1) ~ 1.95 billion`, `iupper(11,2) ~ 1.91 billion`. (Interestingly, they were the same in every time step, but varied somewhat across runs. I suppose it's actually reading from some nearby array that may have some floating-point input data or something.) I guess this causes Hypre to decide that you are setting values for an empty box, and leave `hypre_b` and `hypre_x` at some default value, like 0. Conversely, at extraction time, I guess Hypre simply doesn't modify the `values` array at all, and, though I haven't checked this, I guess this whole routine is now equivalent to `etanew = etastar/dt**2`. This, in turn, stops Hypre from complaining about NaNs in its inputs, and manifests as the test suite reporting "wrong answer". Commenting out the three calls to VectorSet/GetBoxValues entirely also exhibits the same external behavior.
- Undoing this change and adding some more debug prints instead suggests that what's actually happening to cause those Hypre complaints is that the simulation blows up numerically over multiple iterations, eventually leading to NaNs, which Hypre catches before `break_if_NaN` does, because the latter is only called at dump time. That's an argument in favor of `break_if_NaN` at every timestep.
Possible ways to proceed from here:
- Revert this commit regardless.
- Could try to further debug the simulation by "force of will" (i.e., adding print statements and trying to figure out what's happening).
- Could step back and try to develop more effective test inspection. A one-number summary of the discrepancy in the first time point that has a discrepancy is not very informative; perhaps the test suite should make some visualizations of the differences when it fails.
- Will the `pysics-tests` branch provide such debugging tools? Should we merge Hypre as still-broken-but-probably-fixable "experimental code" so that we can develop the physics-based testing well enough to figure out what's happening here?
Re: cranking the tolerance: Were you experimenting with that while Hypre was being completely ignored (see my comment #89 (comment))?
After conversation, it was decided to merge this PR as-is, on the following grounds:
This reverts commit 7c74684.
This implements a preconditioned conjugate gradient algorithm from Hypre to solve the implicit equation for `eta`, and closes #1.

Although the external library is built around MPI and is intrinsically parallel, this branch does not implement any parallelisation. One day that will have its own PR.
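For readers unfamiliar with Hypre, the structured-grid PCG lifecycle the solver is built around looks roughly like the following. This is a hedged sketch in the style of Hypre's Fortran examples; the handle names (`hypre_A`, `hypre_b`, `hypre_x`) and the tolerance are placeholders, not the exact calls in `aronnax.f90`.

```fortran
! Sketch of Hypre's structured-grid PCG solver lifecycle (Fortran interface).
! Assumes hypre_A, hypre_b, hypre_x are already-assembled Hypre handles.
integer*8 solver
integer   ierr, num_iterations

call HYPRE_StructPCGCreate(MPI_COMM_WORLD, solver, ierr)
call HYPRE_StructPCGSetTol(solver, 1.0d-7, ierr)   ! convergence tolerance
call HYPRE_StructPCGSetup(solver, hypre_A, hypre_b, hypre_x, ierr)
call HYPRE_StructPCGSolve(solver, hypre_A, hypre_b, hypre_x, ierr)
call HYPRE_StructPCGGetNumIterations(solver, num_iterations, ierr)
call HYPRE_StructPCGDestroy(solver, ierr)
```

Note that even in this serial branch the calls go through `MPI_COMM_WORLD`, which is why a working MPI installation is required despite there being no parallelism yet.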
The solver implemented in this PR relies on two external libraries: Hypre and MPI. The Hypre dependency is dealt with by including Hypre as a submodule. I have not yet automated the process of downloading and compiling Hypre, but I think it should be possible. The MPI dependency requires the user to have a working version of MPI already installed.
To make this branch pass the test suite I had to disable the `-Wall` compiler flag. I tried to come up with some way of making sure all variables were used in all modes, or only declared in the modes in which they were used, but I failed. Possibly we could deal with this by moving the offending subroutines to a separate module, overloading them, and implementing generic interfaces (a rough sketch of the idea follows below). Doing that would be a fair amount of work, and should probably be discussed in its own issue and implemented in its own PR. Either way, if this branch is merged without a fix for this, a new issue should be opened about the `-Wall` flag.
I tried running this version in the `pytest` test suite, but the numerics are sufficiently different that of the n-layer tests only the `test_f_plane` test passes. This is another good reason to sort out #66 as soon as possible.

Benchmarking shows that this version is between 2 and 4 times faster than the original code, even though it is still purely serial. It also shows that the execution time does not grow anywhere near as rapidly as `nx` and `ny` increase (see #1, comment), which means its relative advantage increases as the domain size increases.

Summary of things to do when this is merged:
- `-Wall` compiler flag in test suite