
Analysis-ready chunking of diagnostic output files #203

Open

aekiss opened this issue Aug 15, 2024 · 6 comments
aekiss (Contributor) commented Aug 15, 2024

Following from @Thomas-Moore-Creative's talk today, we should think about the NetCDF chunking we use when writing to disk, so that the native chunking works well for typical analysis workflows.

Note that in a compressed, chunked NetCDF file, accessing any data in a chunk requires reading and uncompressing the whole chunk. That can be a pitfall if the chunking doesn't match the access pattern, e.g. if chunks are too big in the wrong dimensions. We had that problem with ERA5 forcing in ACCESS-OM2: COSIMA/access-om2#242
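For example (a minimal sketch, assuming Python with the netCDF4 package; the filename is hypothetical), you can check a file's native chunking before designing a workflow around it:

    import netCDF4

    # "ocean_daily.nc" is a placeholder for any compressed, chunked output file.
    with netCDF4.Dataset("ocean_daily.nc") as ds:
        for name, var in ds.variables.items():
            # chunking() returns "contiguous" or a list of per-dimension chunk sizes
            print(name, var.dimensions, var.chunking())

Comparing those per-dimension chunk sizes against the slices a typical workflow takes is a quick way to spot the mismatch described above.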

Maybe we should set up a discussion/poll on the forum?

Related:

@Thomas-Moore-Creative

@aekiss - after all my bluster about how important the choice of "native chunking" for the raw output is, what do we know about the limitations (if any) on different models' ability to control output chunking at run-time? Where do modellers have that control in, say, MOM6? Is that dependent on / limited by how the model tiling is set up?

A recent conversation I had with @dougiesquire mused about choosing a native chunking that is suited to, and facilitates, easier rechunking later. One problem that comes up is if you have, for example, very large chunks and are forced to load most or all of the dataset into memory to rechunk it into another arrangement.

That said, I'm not clear what the current COSIMA native chunking is, and whether it would need or benefit from a change (other products I've come across very much do).
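For illustration, here's a minimal rechunking sketch assuming xarray + dask (the filename, dimension names, and chunk sizes are hypothetical, not the actual COSIMA layout). Opening with dask chunks aligned to the file's native chunking lets the rechunk stream through the data rather than loading it all into memory:

    import xarray as xr

    # Open lazily; align dask chunks with the file's native chunking.
    ds = xr.open_dataset("ocean_daily.nc", chunks={"time": 1})

    # Rechunk to an analysis-friendly layout: longer in time, smaller in space.
    ds = ds.chunk({"time": 365, "yt_ocean": 300, "xt_ocean": 400})

    # Write with matching on-disk chunking via the netCDF4 "chunksizes" encoding.
    encoding = {v: {"chunksizes": ds[v].data.chunksize}
                for v in ds.data_vars if ds[v].ndim > 1}
    ds.to_netcdf("ocean_daily_rechunked.nc", encoding=encoding)

For strictly bounded memory use there is also the rechunker package, which does this as a two-pass operation through an intermediate store.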

aekiss (Contributor, Author) commented Aug 15, 2024

Good questions.

In terms of output directly from the model components:

Model runs are broken into short segments to fit into queue limits (segments are shortest at high resolution, e.g. a few months), so post-processing would be required to change the chunking in time.
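As a sketch of that post-processing step (assuming xarray + dask; the glob pattern, variable name, and chunk sizes are hypothetical), the per-segment files can be concatenated lazily along time and rewritten with larger time chunks:

    import xarray as xr

    # Lazily concatenate the per-segment output files along time.
    ds = xr.open_mfdataset("output*/ocean/ocean_daily.nc",
                           combine="by_coords", chunks={"time": 1})

    # Rewrite with a time chunk spanning many segments.
    ds = ds.chunk({"time": 90})
    ds.to_netcdf("ocean_daily_all.nc",
                 encoding={"temp": {"chunksizes": (90, 300, 400)}})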

aekiss (Contributor, Author) commented Aug 15, 2024

The other consideration is the impact of chunking on IO performance of the model itself (which can become a bottleneck at high resolution). There's a lot of discussion of this in https://gmd.copernicus.org/articles/13/1885/2020/

It would be nice if there were a compromise that worked well both for runtime performance and for analysis, but maybe these are incompatible and raw model outputs would require post-processing to suit analysis.

@anton-seaice (Contributor)

I believe MOM chunk sizes are set in the FMS namelist:

&fms2_io_nml
    ncchksz = 4194304
...

which is 4 MiB. I think part of the goal of keeping that size quite small is to avoid splitting chunks during analysis as much as practical (and perhaps something to do with cache sizes?)

It's hard to imagine model output having a chunk size in time of anything other than 1. Either it needs:

  • keeping the model output in memory for multiple time averages (maybe possible, as we don't seem to be very memory limited),
  • or writing the chunks "out of order", e.g. if the time chunk size is 31, writing output at the end of each model day would mean the model writes to every 31st place in the output ... which sounds slow.

So I think it's a question of how much extra time we want to spend running the model vs. how much extra time we spend in analysis.

angus-g commented Aug 15, 2024

I think that is a poorly-named parameter that refers only to the internal library chunking (and maybe even only to NetCDF classic files, rather than the HDF5-backed NetCDF4 files). The per-dimension chunking is defined in the netcdf var_def calls, which take an array of chunk sizes rather than deriving them from an overall chunk size. I think it is indeed the case that it depends on the IO_LAYOUT in the case of diagnostic output.
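For reference, here's the Python (netCDF4-python) analogue of those per-variable definition calls: a sketch with illustrative names and sizes, not MOM6's actual Fortran (which would use nf90_def_var and nf90_def_var_chunking):

    import numpy as np
    import netCDF4

    ds = netCDF4.Dataset("example.nc", "w")
    ds.createDimension("time", None)   # unlimited
    ds.createDimension("y", 1080)
    ds.createDimension("x", 1440)

    # Chunking is fixed per variable at definition time, as an array of
    # per-dimension chunk sizes rather than a single overall byte size.
    temp = ds.createVariable("temp", np.float32, ("time", "y", "x"),
                             zlib=True, chunksizes=(1, 540, 720))
    ds.close()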

@anton-seaice (Contributor)

Thanks Angus! We might need to revisit ncchksz, which is more of a cache size, when we tune the IO_LAYOUT. And it makes sense that the chunk size is related to IO_LAYOUT in x/y.
