Output model data in a portable and friendly format #30

edoddridge · 2017-03-05T03:20:47Z

The output is currently dumped as unformatted Fortran native files. These come with no metadata, are not particularly friendly to deal with, and may not be portable across systems.
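To make the portability concern concrete, here is a minimal sketch of how such a file is typically laid out. It assumes sequential unformatted output, where each record is framed by a leading and a trailing length marker. The marker width (4 bytes here) and the byte order are assumptions that depend on the compiler and the machine, which is exactly the problem. The helper names `write_record` and `read_records` are made up for this illustration and are not part of the model.

```python
import io
import struct


def write_record(f, payload, marker="<i"):
    """Write one record framed by leading and trailing length markers.

    The marker format is an assumption: "<i" means a 4-byte little-endian
    integer, which is a common default but not guaranteed.
    """
    length = struct.pack(marker, len(payload))
    f.write(length + payload + length)


def read_records(f, marker="<i"):
    """Yield the raw payload of each record in a sequential unformatted file."""
    size = struct.calcsize(marker)
    while True:
        head = f.read(size)
        if not head:
            return  # clean end of file
        if len(head) < size:
            raise ValueError("truncated record marker")
        (n,) = struct.unpack(marker, head)
        payload = f.read(n)
        tail = f.read(size)
        # A mismatch usually means the wrong marker size or byte order.
        if len(payload) != n or len(tail) != size or struct.unpack(marker, tail)[0] != n:
            raise ValueError("record markers disagree; wrong marker size or byte order?")
        yield payload


# Round trip through an in-memory file: three doubles in one record.
buf = io.BytesIO()
write_record(buf, struct.pack("<3d", 1.0, 2.0, 3.0))
buf.seek(0)
values = [struct.unpack("<3d", rec) for rec in read_records(buf)]
```

Note that nothing in the file says the payload is three doubles, or what they mean. The reader has to know that in advance, and has to guess the marker format correctly.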

Other options include:

- Outputting human-readable text files. This would lead to substantial storage requirements and is probably slow.
- Outputting NetCDF. The data and metadata would then be associated with each other, and we could even add information about the model version, the runtime environment, and many other things. However, it introduces a dependency, and thus complicates the compilation and installation of the model. Implementing this is one of the solutions suggested for "Make refactoring test suite portable across machines" (#27).

axch · 2017-03-05T15:10:20Z

Somehow it didn't connect before that NetCDF is a binary format with embedded strings, rather than a text format. That means it shouldn't be materially larger than the native stuff, and, given the wide array of software that exists for working with it, I now think the product that is MIM should definitely produce output in NetCDF. The presence of metadata in the format also makes me less antsy about storing (some) model outputs in git, as they will be relatively interpretable.

This leaves open these follow-on questions:

Which version of the NetCDF standard to follow? Both v3 and v4 are allegedly actively supported. This probably comes down to tool support available to the users, and whether v4 has any ability to express metadata that is important for MIM to express. Apparently, NetCDF 4 incorporates (parts of?) the HDF5 format; do we care about that? Which formats are more prevalent in oceanography?
What metadata conventions to follow? Are the Climate and Forecast conventions http://cfconventions.org/ widely enough accepted in physical oceanography to be a no-brainer?
Whether to implement NetCDF output in the Fortran core or in Python post-processing?
- Pro core: Fewer moving parts and exposed edges at run time (namely, the raw dumps never touch the disk, so there is no chance of mishandling them).
- Pro core: Less total work at runtime (avoids one disk read/write cycle per output file, and the Fortran NetCDF library is very unlikely to be slower than the Python one), though this is unlikely to matter much, b/c output is linear-time and not done at every time-step anyway.
- Pro core: The output format is defined in one place, regardless of how many frontends we have; the core becomes a more complete product in its own right, and additional frontends (e.g., Matlab) become easier to implement.
- Pro core: The post-processing code will not need to read the Fortran-native format. The need to do that is currently our only dependency on scipy, which may be nice to get rid of. However, that module doesn't depend on the rest of scipy, so can be copied into MIM, reducing the dependency to being on just numpy.
- Pro post: Easier to implement.
- Pro post: Easier to make more user-controllable (e.g., allowing the user to turn the conversion off, or fiddle with some parameters of it (if it has parameters)).
- Pro post: Easier to auto-detect whether the NetCDF library is present on the system and change behavior accordingly, if that is desirable.
- Pro post: If the performance of format conversion becomes a problem, it would be possible to arrange for it to happen in parallel with the simulation proper, though that's fiddly, likely to be brittle, and not very useful if the simulation itself is parallelized enough to use up the machine's cores.
- Pro post: No need for the core to depend on a third-party NetCDF library. Who knows how Fortran handles package dependencies; such may complicate installation. In Python, scipy already has a NetCDF module, and there are presumably stand-alone ones as well, if avoiding a scipy dependency is desirable.
- Side note: I don't yet understand the set of Fortran I/O options, but it may be possible to use something less system-dependent than "unformatted" but more built-in than NetCDF to communicate between the core and the post-processing.
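As a rough illustration of the "auto-detect" point for post-processing, the sketch below checks whether a NetCDF-capable Python package is importable and falls back to leaving the raw dumps alone if none is. The function names and the `("netcdf", ...)` / `("raw", None)` return values are invented for this example. The package names `netCDF4` and `scipy` are the usual import names, but which ones to prefer is an assumption.

```python
import importlib.util


def available(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_output_format(preferred=("netCDF4", "scipy")):
    """Pick the first NetCDF-capable package that is installed.

    Returns ("netcdf", package_name) if one is found, otherwise
    ("raw", None) to signal that the raw dumps should be kept as they are.
    """
    for name in preferred:
        if available(name):
            return "netcdf", name
    return "raw", None


output_format, backend = choose_output_format()
```

Only top-level package names are passed to `find_spec` here, because looking up a dotted name such as `scipy.io` raises an error rather than returning `None` when the parent package is missing.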

edoddridge · 2017-03-06T21:21:26Z

I agree that using NetCDF for the output is a highly desirable behaviour. Seems to me we are agreed on this, and now we just need to decide how to implement it.

You've laid out the pros and cons of each choice pretty well.

Placing the NetCDF functionality in the core will complicate the build process and introduce an external dependency for the model. At the moment the compilation is very straightforward, and while I'm loath to sacrifice that simplicity, NetCDF output is one of the few reasons I would (the other main one being parallelisation).

As far as which version, I would lean towards NetCDF4, even though scipy.io only supports V3. I've seen both versions being used in the wild, but I don't see a strong case for choosing the older standard. There are very mature python libraries for dealing with NetCDF4 files, and MATLAB can read both formats.

I think that making the output CF compliant is indeed a no-brainer. There are a number of data analysis suites that more or less assume this (see e.g. Iris), though some of them can deal with non-CF compliant input data.

Here's some information about the Fortran NetCDF library. It will definitely complicate the build process, but I think the trade-off is probably worth it - provided the user manual has a sufficiently helpful walkthrough.

In summary - my preference is for CF compliant, NetCDF4 output produced by the Fortran core, but I'm willing to be talked out of those preferences.

edoddridge · 2017-03-07T15:44:45Z

There's also another option - wrap the Fortran program in python. Does this option have downsides in terms of speed or complicating any future desires to make it run in parallel? I don't know much about the process.

axch · 2017-03-08T14:35:25Z

Any sort of Python wrapping should have a negligible effect on performance, or on parallelism, provided the chunks of work done by the wrappee are large enough. For instance, I expect it wouldn't be materially slower to have the integration loop be managed by Python, so long as computing all the tendencies were still one big chunk of Fortran code. In fact, if numpy or scipy has optimized loops for stencil computations (which it may?), it may not be a terrible exercise to rewrite (a simplified version of) the model entirely in Python+numpy and see what performance looks like. Even if the verdict is that it's terrible, that version can be used as a sanity check on the results from the Fortran. Or, perhaps, we could construct a very simple benchmark program to test this hypothesis before doing a rewrite.
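A minimal sketch of that structure, under loud assumptions: `compute_tendencies` stands in for the one big chunk of Fortran and is written here in plain Python, computing a toy periodic 1-D diffusion tendency rather than anything from MIM. The `on_output` hook is a hypothetical place where a NetCDF writer could be plugged in. Only the integration loop and the output decision live in the driver.

```python
def compute_tendencies(h, kappa, dx):
    """Stand-in for the compiled kernel: periodic 1-D diffusion tendency."""
    n = len(h)
    return [
        kappa * (h[(i - 1) % n] - 2.0 * h[i] + h[(i + 1) % n]) / dx**2
        for i in range(n)
    ]


def integrate(h, n_steps, dt, kappa=1.0, dx=1.0, output_every=10, on_output=None):
    """Forward-Euler driver loop managed by Python.

    Each step makes a single call into the kernel, so the per-step
    overhead of the wrapper is one function call plus the state update.
    """
    for step in range(1, n_steps + 1):
        dhdt = compute_tendencies(h, kappa, dx)
        h = [hi + dt * di for hi, di in zip(h, dhdt)]
        if on_output is not None and step % output_every == 0:
            on_output(step, h)  # e.g. hand the snapshot to a file writer
    return h


# Toy run: a single spike diffusing on a periodic grid of 16 points.
snapshots = []
initial = [0.0] * 16
initial[8] = 1.0
final = integrate(initial, n_steps=100, dt=0.1,
                  on_output=lambda step, h: snapshots.append((step, list(h))))
```

Timing a loop like this against the all-Fortran version, with the kernel swapped for the real compiled routine, would be one way to build the simple benchmark mentioned above.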

edoddridge · 2017-03-08T20:44:00Z

That's good to know.

Given that python dependencies should be easier to solve than Fortran dependencies, perhaps a python wrapping is the best option?

axch added the usability label Mar 14, 2017

axch modified the milestone: Draft the MIM paper for Geoscientific Modeling Mar 14, 2017

This was referenced Apr 19, 2017

Product shape #101

Merged

Incrementally complete the packaging plan #102

Open
