CDF5 documentation lacking and a question #463
Need to clarify a bit: is the CDF5 file being accessed using the NC_PNETCDF flag or the NC_CDF5 flag?
Via the NC_CDF5 flag.
Do you mean that unexpected values silently appear in the file on write, and vice versa on read? My understanding of netCDF is that it first finds the number of contiguous requests (in file layout) from the arguments start and count, and then runs a loop to write one request at a time. I assume your case is writing a contiguous chunk of size > 2 GB. I have never tried this before. It is likely that some internal code needs to be adjusted to handle this kind of request.
Yes, it would explain the problem my users are experiencing if requests for more than 2 GB from a CDF5 file do not return the requested amount. There is no problem doing this with a netCDF4 file. Again, this is just a hypothesis; I'm trying to track down a mysterious bug, and this could be it. I need confirmation.
I have now replicated workflows that strongly suggest there are undocumented differences in put/get results for large (4-8 GB) variables between CDF5 and NETCDF4 files. I don't have proof that this is a netCDF issue rather than an NCO issue, though @wkliao suggests that CDF5 get/put in the netCDF library was never tested for requests > 2 GB (and PnetCDF explicitly does not support single requests > 2 GB). This issue prevents the DOE ACME MPAS model runs at high resolution (which are archived in CDF5 format) from being properly analyzed, so it affects users now. Is anyone (@wkliao ?) interested in verifying whether CDF5 has put/get limits? My circumstantial evidence is that a billion doubles, all equal to one, average to one for NETCDF4 data and do not for CDF5 data...
which results in
The first ncap2 command above requires the latest NCO snapshot to support CDF5. Note that these commands both place a few other variables, named "one" and "two", around the large variable "dta_1e9" in the output file, so dta_1e9 is not the only variable. When dta_1e9 is the only variable, the CDF5-based workflow yields the correct answer! So, if my hypothesis is correct, CDF5 variables larger than some threshold size (possibly 2 GB?) are not written and/or read correctly when the nc_put/get_var() call is one request for the entire variable and there are other variables in the dataset. The behavior is identical on netCDF 4.4.1.1 and today's daily snapshot of 4.5.1-development.
Since the CDF5 code in libsrc came from pnetcdf originally, ...
I tested a short netCDF program to mimic the I/O @czender described (the code is shown below). It creates 3 variables, namely var, one, and two. var is a 1D array of type double of 2^9 elements. Variables one and two are scalars. The 3 variables are first written to a new file and then read back to calculate the average. However, I could not reproduce the problem with this program. @czender, could you take this program and modify it to be closer to what you are doing with the NCO operations?
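A minimal sketch along the lines of that description (not the original listing), assuming the hypothetical file name tst_cdf5.nc; LEN below is 2^9 as described, and would need to be on the order of 10^9 doubles to match the failing NCO case:

```c
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define LEN 512   /* 2^9 doubles as described; czender's failing case is ~1e9 doubles */
#define ERR(e) do { if ((e) != NC_NOERR) { fprintf(stderr, "%s\n", nc_strerror(e)); exit(1); } } while (0)

int main(void)
{
    int ncid, dimid, vid, oneid, twoid;
    size_t i;
    double one = 1.0, two = 2.0, sum = 0.0;
    double *buf = malloc(LEN * sizeof(double));
    if (buf == NULL) return 1;
    for (i = 0; i < LEN; i++) buf[i] = 1.0;

    /* Write phase: create a CDF-5 file and write var in a single put request,
       with the scalars "one" and "two" defined alongside it. */
    ERR(nc_create("tst_cdf5.nc", NC_CLOBBER | NC_64BIT_DATA, &ncid));
    ERR(nc_def_dim(ncid, "dim", LEN, &dimid));
    ERR(nc_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &vid));
    ERR(nc_def_var(ncid, "one", NC_DOUBLE, 0, NULL, &oneid));
    ERR(nc_def_var(ncid, "two", NC_DOUBLE, 0, NULL, &twoid));
    ERR(nc_enddef(ncid));
    ERR(nc_put_var_double(ncid, vid, buf));
    ERR(nc_put_var_double(ncid, oneid, &one));
    ERR(nc_put_var_double(ncid, twoid, &two));
    ERR(nc_close(ncid));

    /* Read phase: read var back in a single get request and average it. */
    ERR(nc_open("tst_cdf5.nc", NC_NOWRITE, &ncid));
    ERR(nc_inq_varid(ncid, "var", &vid));
    ERR(nc_get_var_double(ncid, vid, buf));
    ERR(nc_close(ncid));

    for (i = 0; i < LEN; i++) sum += buf[i];
    printf("average = %g (expect 1)\n", sum / LEN);
    free(buf);
    return 0;
}
```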
Thanks @wkliao. This is a good starting point. I made a few changes to more closely follow the NCO code path. However, I still get the same results you do. Will keep trying...
@wkliao I notice something I do not understand about CDF5: I have a netCDF4 file with two variables whose on-disk sizes compute as 9 GB and 3 GB, respectively. By "computes as" I mean multiplying the dimension sizes by the size of NC_DOUBLE. And the netCDF4 (uncompressed) file size is, indeed, 12 GB, as I expect. Yet when I convert that netCDF4 file to CDF5, the total size changes from 12 GB to 9 GB. In other words, it looks like CDF5 makes use of compression, or does not allocate wasted space for _FillValues, or something like that. If you understand what I'm talking about, please explain why CDF5 consumes less filespace than expected...
The file size should be 12 GB. Do you have a test program that can reproduce this?
Uh oh. It's hard to wrap my head around all this. The mysterious issues only appear with huge files that are hard to manipulate. The issues when processing with NCO may be due to NCO, but in order to verify that I have to use another toolkit. CDO does not yet support CDF5. And nccopy either fails (with -V) to extract the variables I want, or (with -v) extracts all the variables, so I can't mimic the NCO workflow...
Until I find another way to subset variables from a CDF5 file, I'm stuck. @WardF would you please instruct me on how to use nccopy to subset certain variables from a CDF5 file? I think the above output demonstrates that nccopy (yes, the latest 4.5.x) has some breakage with the -v and -V options.
I will take a look at -v/-V and see what's going on. The original files are obviously quite large; I'll see if I can recreate this locally with a file on hand.
The in.nc file on which the above commands were performed to demonstrate the nccopy -v/-V weirdness is tiny. The same behavior should occur with any file you like.
@DennisHeimbigner This feels like an issue we've seen and (I thought) addressed recently. Does this ring any bells for you? Maybe the fix is on a branch that I neglected to merge. Going to look now.
Ok. Similar issue, although that issue claims it is 64-bit offset only, and this is not the case. I'll update the original issue. I can copy ...
Found a code stanza in nccopy.c starting at line 1451. The comment seems of interest here.
Have you tried to disable this optimization to see if it then starts working ok? |
@wkliao This 9 GB CDF5 file contains two variables whose uncompressed sizes are 9 GB and 3 GB and so should require 12 GB of disk space to store. Inspection with ncdump/ncks shows there are data in both variables. When I convert it to netCDF4, the resulting file is, indeed, 12 GB. Can you tell me anything about whether the CDF5 file is legal, or corrupt, or when/where/how in the writing process it may have been truncated?
@DennisHeimbigner Yes, I found the issue; it is unrelated to the optimization (in terms of the nccopy issue, not what @czender has observed with file sizes). I'm working on a fix right now.
Ok, I think I have a fix for the ...
Regarding the file size issue: first of all, none of the classical formats, including CDF-5, performs compression. There is a PnetCDF utility program called ncoffsets. The command "ncoffsets file.nc" prints the starting and ending file offsets of individual variables defined in a classical file. When used on the above CDF-5 file, it should print an ending offset of 12 GB for the second variable, but the command "ls -l" can still show a number less than that. Please give it a try and let me know.
@WardF please let me know when the nccopy fixes are in master so I can check whether nccopy and NCO give the same answers when subsetting huge CDF5 files.
@wkliao I need to know whether you think intercepting nc_put_var?_*() to split a single CDF5 write request for a data buffer larger than N into multiple write requests with buffers smaller than N will avoid this bug. And, if so, what is N? And do you still think that CDF5 reads are not affected?
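For reference, a minimal sketch of the kind of interception being asked about, with a hypothetical helper name and an assumed N of 2^28 doubles (2 GiB); as the reply below explains, splitting alone would not avoid the bug:

```c
#include <netcdf.h>

/* Hypothetical wrapper: split one put request for a 1D double variable of
   "len" elements into sub-requests of at most max_elems elements each. */
static int put_double_split(int ncid, int varid, size_t len, const double *buf)
{
    const size_t max_elems = (size_t)1 << 28;   /* assumed N: 2^28 doubles = 2 GiB */
    size_t start[1], count[1];
    int err = NC_NOERR;

    for (start[0] = 0; start[0] < len && err == NC_NOERR; start[0] += count[0]) {
        count[0] = (len - start[0] < max_elems) ? len - start[0] : max_elems;
        err = nc_put_vara_double(ncid, varid, start, count, buf + start[0]);
    }
    return err;
}
```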
When running python tests against libnetcdf built with the patch from @wkliao, I see the following (on 64-bit systems only). This will need to be sorted out before merging this fix in or saying that it 'fixes' the problem.
It's possible it is a problem with the python test; I'll ask our in-house python guys and see what they say :) |
I've determined the python test which is failing is ...
The test is as follows; does anything leap out?
The bug appears when defining more than one large variable in a new file, so splitting a large put request into smaller ones will not fix the bug. If you are developing a workaround in NCO, then I suggest checking the number of large variables and creating a new file that contains only one large variable, making sure that large variable is defined last. I still believe the bug affects writes only, as the fixes I developed are in subroutines called only by the file header writer. However, it is better to have a test program to check.
@wkliao does "large" in your message above mean 2 GiB or 4 GiB or ...?
It is mentioned in one of my previous posts. Here it is, copied and pasted: "large" variables here means variables whose size is each > 2^31-3 bytes for CDF-1 and > 2^32-3 bytes for CDF-2. See NetCDF Format Limitations.
I don't understand. I'm talking about writing a CDF5 file with netCDF 4.4.x, not CDF1 or CDF2. What is the largest variable I can safely write as the last variable in a CDF5 file?
Sorry, let me re-phrase: when using netCDF 4.4.x to create a new CDF-5 file, the file can contain at most one large variable, and it must be defined last. A large variable here is of size > 2^32-3 bytes. To be honest, I really do not recommend a workaround for netCDF 4.4.x, because the above suggestion has never been fully tested. It is based on my understanding of the root of the bug.
That's OK. I'm not writing a workaround. I'm writing a diagnostic WARNING message for those who, in the future, with NCO 4.6.9+, attempt to write a CDF5 file with netCDF 4.4.x in a way that may trigger the bug.
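A minimal sketch of the kind of check such a warning could rest on (hypothetical function name; not the actual NCO code), flagging the layouts described above as risky, i.e. more than one variable larger than 2^32-3 bytes, or a single large variable that is not defined last:

```c
#include <stdio.h>
#include <netcdf.h>

/* Hypothetical check: warn when a file about to be written as CDF5 with
   netCDF 4.4.x has more than one "large" variable (> 2^32 - 3 bytes), or
   one large variable that is not defined last. */
static void warn_cdf5_layout(int ncid)
{
    const unsigned long long large = 4294967293ULL;   /* 2^32 - 3 bytes */
    int nvars = 0, nlarge = 0, last_large = -1;

    if (nc_inq_nvars(ncid, &nvars) != NC_NOERR) return;
    for (int varid = 0; varid < nvars; varid++) {
        int ndims, dimids[NC_MAX_VAR_DIMS];
        nc_type xtype;
        size_t typesize, dimlen;
        unsigned long long nbytes;

        if (nc_inq_var(ncid, varid, NULL, &xtype, &ndims, dimids, NULL) != NC_NOERR) return;
        if (nc_inq_type(ncid, xtype, NULL, &typesize) != NC_NOERR) return;
        nbytes = typesize;
        for (int d = 0; d < ndims; d++) {
            if (nc_inq_dimlen(ncid, dimids[d], &dimlen) != NC_NOERR) return;
            nbytes *= dimlen;   /* a record dimension contributes its current length */
        }
        if (nbytes > large) { nlarge++; last_large = varid; }
    }
    if (nlarge > 1 || (nlarge == 1 && last_large != nvars - 1))
        fprintf(stderr, "WARNING: %d large variable(s) in a CDF5 layout that "
                "netCDF 4.4.x may write incorrectly\n", nlarge);
}
```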
Ward- you should be able to create a C program equivalent to that python program.
I've added a couple of configure-time options for disabling cdf5 support. Eventually this will be set automatically for 32-bit platforms. I may turn it off by default for the next release candidate just for the sake of expediency, but I don't want to cause any problems for dependent packages. @czender does NCO assume cdf5 support? Or does it query the netcdf library for it?
Thanks for asking. NCO assumes CDF5 when linked to 4.4.x or greater, i.e., `if(NC_LIB_VERSION >= 440){ CDF5 stuff }else{ WARN no CDF5 support }`. We could shift to, e.g., an #ifdef HAVE_CDF5 method if given a heads-up.
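For illustration, a minimal sketch of the two approaches just mentioned, treating HAVE_CDF5 as an assumed NCO-side build symbol and NC_LIB_VERSION as the symbol from the pseudocode above, neither provided by netCDF itself:

```c
#include <stdio.h>

/* Hypothetical guard: prefer a build-system feature test, falling back to
   the library-version test quoted above. */
static void report_cdf5_support(void)
{
#if defined(HAVE_CDF5) || (defined(NC_LIB_VERSION) && NC_LIB_VERSION >= 440)
    printf("CDF5 stuff\n");               /* CDF5 code path */
#else
    printf("WARN: no CDF5 support\n");    /* warn and fall back */
#endif
}
```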
@czender By the time we're done, having CDF5 support will be the case in 99% of installations. However, being able to toggle it lets us craft a release while still working to sort out the issues outlined above. I'm going to refresh my memory on the above and also go review the @wkliao pull requests so that we can move forward.
Umm, sort of. We can cue from nc-config --has-cdf5 with v. 4.5.x. Earlier versions will not have this, and nc-config will crash if asked about CDF5. So NCO would need to implement a multi-stage rule where autoconf/cmake first finds the version, makes that machine-parseable, then sets HAVE_CDF5 to No for < 4.4.x, sets it to Yes for 4.4.x, and queries nc-config for >= 4.5.x. Or something like that. Or NCO could just do nothing and fail with UNKNOWN_FORMAT on CDF5 files.
It appears that netCDF 4.5.0 has been released without a fix for this CDF5 issue. My collaborators want netCDF with dependable CDF5 on 64-bit machines and do not care about 32-bit environments. Is it likely their issues will be addressed in 4.5.1? Or is future netCDF support for CDF5 uncertain? Or...? |
It will be addressed in 4.5.1 insofar as we will enforce no cdf5 writing on 32-bit. You can enable cdf5 in 4.5.0 at configure time with --enable-cdf5. The release had languished enough that disabling it (by default) and getting 4.5.0 out the door was necessary.
To be clear, 4.5.0 does not seem to address the problem I reported in this issue. Will some version of #478, which appears to solve the problem I reported and fixes the test I wrote, be in 4.5.1?
There have been no recent updates on this. Is a fix still planned for 4.6.0? In my other hat as a GCM developer, this is the most critical netCDF bug I am aware of, because it prevents analysis of high-resolution simulations conducted in CDF5 format, and there is no easy workaround. There is some resistance/inertia toward netCDF4 in the GCM community because PnetCDF and CDF5 have advantages in parallel speed and familiarity. However, this persistent bug is causing at least one GCM group to seriously consider alternatives to CDF5.
The fix is in #478. At least in my opinion it is ready. |
Getting #478 merged now. |
Had a couple bumps with #478 on ARM, re-evaluating now. |
I have merged this into HPC netCDF and it passes all tests. (But I am not testing on my ARM. I will turn that on...)
Fixed with #478 being merged. |
Upstream changes:

## 4.6.1 - March 15, 2018

* [Bug Fix] Corrected an issue which could result in a dap4 failure. See [Github #888](Unidata/netcdf-c#888) for more information.
* [Bug Fix][Enhancement] Allow `nccopy` to control output filter suppression. See [Github #894](Unidata/netcdf-c#894) for more information.
* [Enhancement] Reverted some new behaviors that, while in line with the netCDF specification, broke existing workflows. See [Github #843](Unidata/netcdf-c#843) for more information.
* [Bug Fix] Improved support for CRT builds with Visual Studio, improves zlib detection in the hdf5 library. See [Github #853](Unidata/netcdf-c#853) for more information.
* [Enhancement][Internal] Moved HDF4 into a distinct dispatch layer. See [Github #849](Unidata/netcdf-c#849) for more information.

## 4.6.0 - January 24, 2018

* [Enhancement] Full support for using HDF5 dynamic filters, both for reading and writing. See the file docs/filters.md.
* [Enhancement] Added an option to enable strict null-byte padding for headers; this padding was specified in the spec but was not enforced. Enabling this option will allow you to check your files, as it will return an E_NULLPAD error. It is possible for these files to have been written by older versions of libnetcdf. There is no effective problem caused by this lack of null padding, so enabling these options is informational only. The options for `configure` and `cmake` are `--enable-strict-null-byte-header-padding` and `-DENABLE_STRICT_NULL_BYTE_HEADER_PADDING`, respectively. See [Github #657](Unidata/netcdf-c#657) for more information.
* [Enhancement] Reverted behavior/handling of out-of-range attribute values to the pre-4.5.0 default. See [Github #512](Unidata/netcdf-c#512) for more information.
* [Bug] Fixed error in tst_parallel2.c. See [Github #545](Unidata/netcdf-c#545) for more information.
* [Bug] Fixed handling of corrupt files + proper offset handling for hdf5 files. See [Github #552](Unidata/netcdf-c#552) for more information.
* [Bug] Corrected a memory overflow in `tst_h_dimscales`; see [Github #511](Unidata/netcdf-c#511), [Github #505](Unidata/netcdf-c#505), [Github #363](Unidata/netcdf-c#363) and [Github #244](Unidata/netcdf-c#244) for more information.

## 4.5.0 - October 20, 2017

* Corrected an issue which could potentially result in a hang while using parallel file I/O. See [Github #449](Unidata/netcdf-c#449) for more information.
* Addressed an issue with `ncdump` not properly handling dates on a 366-day calendar. See [GitHub #359](Unidata/netcdf-c#359) for more information.

### 4.5.0-rc3 - September 29, 2017

* [Update] Due to ongoing issues, native CDF5 support has been disabled by **default**. You can use the options mentioned below (`--enable-cdf5` or `-DENABLE_CDF5=TRUE` for `configure` or `cmake`, respectively). Just be aware that, for the time being, reading/writing CDF5 files on 32-bit platforms may result in unexpected behavior when using extremely large variables. For 32-bit platforms it is best to continue using `NC_FORMAT_64BIT_OFFSET`.
* [Bug] Corrected an issue where older versions of curl might fail. See [GitHub #487](Unidata/netcdf-c#487) for more information.
* [Enhancement] Added options to enable/disable `CDF5` support at configure time for autotools and cmake-based builds. The options are `--enable/disable-cdf5` and `ENABLE_CDF5`, respectively. See [Github #484](Unidata/netcdf-c#484) for more information.
* [Bug Fix] Corrected an issue when subsetting a netcdf3 file via `nccopy -v/-V`. See [Github #425](Unidata/netcdf-c#425) and [Github #463](Unidata/netcdf-c#463) for more information.
* [Bug Fix] Corrected `--has-dap` and `--has-dap4` output for cmake-based builds. See [GitHub #473](Unidata/netcdf-c#473) for more information.
* [Bug Fix] Corrected an issue where `NC_64BIT_DATA` files were being read incorrectly by ncdump, despite the data having been written correctly. See [GitHub #457](Unidata/netcdf-c#457) for more information.
* [Bug Fix] Corrected a potential stack buffer overflow. See [GitHub #450](Unidata/netcdf-c#450) for more information.

### 4.5.0-rc2 - August 7, 2017

* [Bug Fix] Addressed an issue with how cmake was implementing large file support on 32-bit systems. See [GitHub #385](Unidata/netcdf-c#385) for more information.
* [Bug Fix] Addressed an issue where ncgen would not respect keyword case. See [GitHub #310](Unidata/netcdf-c#310) for more information.

### 4.5.0-rc1 - June 5, 2017

* [Enhancement] DAP4 is now included. Since dap2 is the default for urls, dap4 must be specified by (1) using "dap4:" as the url protocol, or (2) appending "#protocol=dap4" to the end of the url, or (3) appending "#dap4" to the end of the url. Note that dap4 is enabled by default but remote testing is disabled until the testserver situation is resolved.
* [Enhancement] The remote testing server can now be specified with the `--with-testserver` option to ./configure.
* [Enhancement] Modified netCDF4 to use ASCII for NC_CHAR. See [Github Pull request #316](Unidata/netcdf-c#316) for more information.
* [Bug Fix] Corrected an error with how dimsizes might be read. See [Github #410](Unidata/netcdf-c#410) for more information.
* [Bug Fix] Corrected an issue where 'make check' would fail if 'make' or 'make all' had not run first. See [Github #339](Unidata/netcdf-c#339) for more information.
* [Bug Fix] Corrected an issue on Windows with Large file tests. See [Github #385](Unidata/netcdf-c#385) for more information.
* [Bug Fix] Corrected an issue with diskless file access; see [Pull Request #400](Unidata/netcdf-c#400) and [Pull Request #403](Unidata/netcdf-c#403) for more information.
* [Upgrade] The bash-based test scripts have been upgraded to use a common test_common.sh include file that isolates build-specific information.
* [Refactor] The oc2 library is no longer independent of the main netcdf-c library. For example, it now uses ncuri, nclist, and ncbytes instead of its homegrown equivalents.
* [Bug Fix] `NC_EGLOBAL` is now properly returned when attempting to set a global `_FillValue` attribute. See [GitHub #388](Unidata/netcdf-c#388) and [GitHub #389](Unidata/netcdf-c#389) for more information.
* [Bug Fix] Corrected an issue where data loss would occur when `_FillValue` was mistakenly allowed to be redefined. See [Github #390](Unidata/netcdf-c#390) and [GitHub #387](Unidata/netcdf-c#387) for more information.
* [Upgrade][Bug] Corrected an issue regarding how "orphaned" DAS attributes were handled. See [GitHub #376](Unidata/netcdf-c#376) for more information.
* [Upgrade] Updated utf8proc.[ch] to use the version now maintained by the Julia Language project (https://github.com/JuliaLang/utf8proc/blob/master/LICENSE.md).
* [Bug] Addressed a conversion problem with Windows sscanf. This primarily affected some OPeNDAP URLs on Windows. See [GitHub #365](Unidata/netcdf-c#365) and [GitHub #366](Unidata/netcdf-c#366) for more information.
* [Enhancement] Added support for HDF5 collective metadata operations when available. Patch submitted by Greg Sjaardema; see [Pull request #335](Unidata/netcdf-c#335) for more information.
* [Bug] Addressed a potential type punning issue. See [GitHub #351](Unidata/netcdf-c#351) for more information.
* [Bug] Addressed an issue where netCDF wouldn't build on Windows systems using MSVC 2012. See [GitHub #304](Unidata/netcdf-c#304) for more information.
* [Bug] Fixed an issue related to potential type punning; see [GitHub #344](Unidata/netcdf-c#344) for more information.
* [Enhancement] Incorporated an enhancement provided by Greg Sjaardema, which may improve read/write times for some complex files. Basically, linked lists were replaced in some locations where it was safe to use an array/table. See [Pull request #328](Unidata/netcdf-c#328) for more information.
A user applying NCO arithmetic to large data files in the "new" CDF5 format is encountering issues that (silently) result in bad answers. I'm trying to track down the problem. I have not found netCDF documentation on CDF5 format limitations on this page, so one purpose of this issue is to request the addition of CDF5 limits there (assuming that's the right place for it). The table also needs reformatting.
This PnetCDF page says
Forgetting about MPI, and only considering serial netCDF environments, does the 2GiB put/get request limit above apply to any netCDF file formats, including the CDF5 format? NCO does not limit its request sizes, so I wonder if that could be the problem...