
CDF-5 fix: let NC_var.len be the true size of variable #478

Merged · 17 commits · Feb 4, 2018

Conversation

wkliao
Contributor

@wkliao wkliao commented Sep 7, 2017

This PR should fix the problem discussed #463

@DennisHeimbigner
Collaborator

Ok, after checking, this change is incorrect. UINT_MAX is used as a flag to
indicate 64-bit-offset, so we need to keep it. A better long-term solution is to change
all of the current size_t fields like ->len and ->recsize to be unsigned long long.
Then we can properly compute lengths without having to worry about that special flag.
However, we do need to be careful when writing out the file length to keep the flag in the
netcdf file indicating large file size.

@DennisHeimbigner
Collaborator

Whatever fix we get, I think it is important to test the following 6 configurations:
(64-bit, 32-bit) × (netcdf3, cdf5, 64-bit-offset).
We also need to make sure to enable the large-file tests and to test with
Charlie's ncap2 tests.

@WardF
Member

WardF commented Sep 11, 2017

When running python tests against libnetcdf built with this patch, I see the following (on 64-bit systems only).

This will need to be sorted out before merging this fix in or saying that it 'fixes' the problem.

netcdf4-python version: 1.3.0
HDF5 lib version:       1.8.19
netcdf lib version:     4.5.1-development
numpy version           1.11.0
...............................foo_bar
.http://remotetest.unidata.ucar.edu/thredds/dodsC/testdods/testData.nc => /tmp/occookieKmOOvt
..............................................python: /home/tester/netcdf-c/libsrc/nc3internal.c:794: NC_endef: Assertion `ncp->begin_rec >= ncp->old->begin_rec' failed.

@wkliao
Contributor Author

wkliao commented Sep 12, 2017

Where can I obtain netcdf4-python version 1.3.0?
I only see 1.2.9.

@wkliao
Contributor Author

wkliao commented Sep 12, 2017

@WardF

I do not have netcdf4-python installed locally.
Could you do a quick check of this additional patch against the python test program?

@wkliao
Contributor Author

wkliao commented Sep 12, 2017

@WardF

I think you were using branch gh478 to run this test, because the assert statement in line 794 of libsrc/nc3internal.c in gh478 has become line 797 in my patch.

It appears that gh478 does not include my patch (this pr 478).

@WardF
Member

WardF commented Sep 12, 2017

You are correct. gh478 does not contain your patch. I merged it locally to a temporary branch and observed the failures. I will test the updated patch against NetCDF python tomorrow and follow up here.

@wkliao
Contributor Author

wkliao commented Sep 12, 2017

Regarding developing test programs for 32-bit machines: I think NetCDF may not be able to support CDF-5 on 32-bit machines. The main obstacle is that "size_t" is a 4-byte integer on 32-bit platforms, and most of the netCDF APIs have arguments of type size_t. For instance, one cannot define a dimension of size > 2^32, because the argument "len" is of type size_t in "nc_def_dim":

     int nc_def_dim (int ncid, const char *name, size_t len, int *dimidp);

To support CDF-5 on 32-bit machines, NetCDF would need to change its APIs, which would not be backward compatible. Maybe the only option is to disable CDF-5 on 32-bit machines?

@WardF
Member

WardF commented Sep 12, 2017

@wkliao 1.3.0 is coming through from the master test being used. I'll check 1.2.9 as well.

@WardF
Member

WardF commented Sep 12, 2017

1.2.9 also fails in the same place. CDF-5 on 32-bit machines is problematic and I will have to think about it. In the short term we definitely need to add a way at configure time to make CDF-5 support optional. If we are on a 32-bit platform we can disable it by default, possibly, if that is the approach we decide to take. In any event it would give us an avenue for providing a release while working around CDF5-specific issues.

For what it is worth, this netcdf-python failure is only manifesting on 64-bit platforms, not 32-bit platforms.

@wkliao
Contributor Author

wkliao commented Sep 13, 2017

That python program, tst_cdf5.py, is itself also problematic. Line 7 sets "dimsize" to the maximum value of a signed 64-bit integer, 9223372036854775807 or NC_MAX_INT64. (By the way, the comment on that line should be fixed.) Then line 19 defines a 1-D variable of type unsigned int with NC_MAX_INT64 elements, which makes the variable size (= NC_MAX_INT64 x 4 bytes) bigger than an unsigned 64-bit integer can represent (overflow). In this case, NetCDF should throw an error code, such as NC_EINTOVERFLOW. By the way, in PnetCDF the limit is the max signed 64-bit integer (X_INT64_MAX - 3), not unsigned, as we use MPI_Offset, which is a signed long long.

...
  7 dimsize = np.iinfo(np.int64).max # max unsigned 64 bit integer
...
 17         d = nc.createDimension('dim',dimsize) # 64-bit dimension
 18         # create an 8-bit unsigned integer variable
 19         v = nc.createVariable('var',np.uint8,'dim')

I wrote a C program to mimic the python test program and it ran without error. However, the correct behavior should be throwing NC_EVARSIZE.

% cat big_dim.c
#include <stdio.h>
#include <netcdf.h>

#define ERR {if(err!=NC_NOERR){printf("Error at line %d in %s: %s\n", __LINE__,__FILE__, nc_strerror(err));nerrs++;}}

int main(int argc, char *argv[])
{
    int err, nerrs=0, ncid, dimid, varid;
    err = nc_create("cdf5.nc", NC_CLOBBER|NC_64BIT_DATA, &ncid); ERR;
    err = nc_def_dim(ncid, "dim", NC_MAX_INT64, &dimid); ERR;
    err = nc_def_var(ncid, "var1", NC_UINT, 1, &dimid, &varid); ERR;
    err = nc_set_fill(ncid, NC_NOFILL, NULL); ERR;
    err = nc_close(ncid); ERR;
    return (nerrs > 0);
}

@wkliao
Contributor Author

wkliao commented Sep 13, 2017

This patch 8b3d32c checks variable size against (X_INT64_MAX - 3) for CDF-5 files and throws NC_EVARSIZE. Below is a revised test program to check the expected error code.

#include <stdio.h>
#include <netcdf.h>

#define ERR {if(err!=NC_NOERR){printf("Error at line %d in %s: %s\n", __LINE__,__FILE__, nc_strerror(err));nerrs++;}}

#define EXP_ERR(exp,err) { \
    if (err != exp) { \
        nerrs++; \
        printf("Error at line %d in %s: expecting %s but got %d\n", \
        __LINE__,__FILE__,#exp, err); \
    } \
}

int main(int argc, char *argv[])
{
    int err, nerrs=0, ncid, dimid, varid;
    err = nc_create("cdf5.nc", NC_CLOBBER|NC_64BIT_DATA, &ncid); ERR;
    err = nc_def_dim(ncid, "dim", NC_MAX_INT64, &dimid); ERR;
    err = nc_def_var(ncid, "var1", NC_UINT, 1, &dimid, &varid); ERR;
    err = nc_set_fill(ncid, NC_NOFILL, NULL); ERR;
    err = nc_close(ncid); EXP_ERR(NC_EVARSIZE, err);
    return (nerrs > 0);
}

@wkliao
Contributor Author

wkliao commented Sep 13, 2017

I think throwing NC_EVARSIZE in nc_def_var is better than in nc_enddef.
The test program is revised accordingly.

#include <stdio.h>
#include <netcdf.h>

#define ERR {if(err!=NC_NOERR){printf("Error at line %d in %s: %s\n", __LINE__,__FILE__, nc_strerror(err));nerrs++;}}

#define EXP_ERR(exp,err) { \
    if (err != exp) { \
        nerrs++; \
        printf("Error at line %d in %s: expecting %s but got %d\n", \
        __LINE__,__FILE__,#exp, err); \
    } \
}

int main(int argc, char *argv[])
{
    int err, nerrs=0, ncid, dimid[2], varid[2];
    err = nc_create("cdf5.nc", NC_CLOBBER|NC_64BIT_DATA, &ncid); ERR;
    err = nc_def_dim(ncid, "dim0", NC_UNLIMITED, &dimid[0]); ERR;
    err = nc_def_dim(ncid, "dim1", NC_MAX_INT64, &dimid[1]); ERR;

    err = nc_def_var(ncid, "var0", NC_UINT, 1, &dimid[1], &varid[0]);
    EXP_ERR(NC_EVARSIZE, err);

    err = nc_def_var(ncid, "var1", NC_UINT, 2, &dimid[0], &varid[1]);
    EXP_ERR(NC_EVARSIZE, err);

    err = nc_set_fill(ncid, NC_NOFILL, NULL); ERR;
    err = nc_close(ncid); ERR;
    return (nerrs > 0);
}

@wkliao
Contributor Author

wkliao commented Sep 15, 2017

This patch 294c734 checks the file offsets of all variables and reports NC_ENOTNC when opening the corrupted CDF-5 file generated by the nccopy case in #463

@DennisHeimbigner
Collaborator

Can we get a summary comment about what this PR is doing?

@wkliao
Contributor Author

wkliao commented Sep 18, 2017

The last 5 commits added two test programs and a corrupted CDF-5 file for testing.
tst_open_cdf5.c opens the corrupted CDF-5 file, bad_cdf5_begin.nc, and expects NC_ENOTNC.
tst_cdf5_begin.c defines one big variable followed by a small variable, writes partial data to the (buggy) overlapped area, and reads it back for validation.

Without the patches from this pull request, NetCDF fails both test programs.
The commit messages provide more details.
Let me know if you have questions about a particular commit.

@WardF WardF added this to the 4.5.0 milestone Sep 25, 2017
@WardF WardF requested review from DennisHeimbigner and removed request for DennisHeimbigner February 2, 2018 18:09
@WardF
Member

WardF commented Feb 2, 2018

Failure building on Windows; should be a simple fix, once that's sorted will get it merged.

Collaborator

@DennisHeimbigner DennisHeimbigner left a comment


See my comment on the pr conversation

@DennisHeimbigner
Collaborator

I do not understand the
+AM_CONDITIONAL(ENABLE_CDF5, [test "$ac_cv_sizeof_size_t" -ge "8"])
in configure.ac. We do not need to use size_t internally in netcdf-c; rather, we
can directly use unsigned long long, which is guaranteed to be 8 bytes.
The only time we need to downsize from unsigned long long is at the point where
we write the length to the file (in v1hpg.c; that code change is already in this PR).

Also, much of this change should not be CDF-5 specific. There is no reason
not to properly handle the file size for netcdf-3 files in general, even if
the various max file sizes are different.

Also, if this is correct:
https://stackoverflow.com/questions/9073667/where-to-find-the-complete-definition-of-off-t-type
then off_t suffers from the same varying-size problems as size_t, so we should not use
it anywhere. If we want an 8-byte value, then we need to guarantee an 8-byte value
independent of the machine word size.

@wkliao
Contributor Author

wkliao commented Feb 2, 2018

The reason NetCDF-4 is unable to support CDF-5 is explained in my earlier post in this PR, on Sep 12, 2017.

The off_t issue was explained in my post in #632. That PR also applies to CDF-1 and 2.
I also discussed AC_SYS_LARGEFILE in #375.

@wkliao
Contributor Author

wkliao commented Feb 2, 2018

I meant to say NetCDF-4 is unable to support CDF-5 when size_t is 4 bytes.

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Feb 2, 2018

Sorry, but I am missing something. I need to see an explicit statement about why
using off_t is necessary as opposed to using long long or unsigned long long.
PR #632 does not seem to address that.

@DennisHeimbigner
Collaborator

I meant to say NetCDF-4 is unable to support CDF-5 when size_t is 4 bytes.

(You said netcdf-4, but you meant, I think, netcdf-3.)
In any case, I agree with this for the reasons stated. Perhaps I misread your changes.
My point is that internally the netcdf-c library should avoid using
size_t or off_t to represent file sizes (or offsets), because their size varies
with the machine word size. We need to internally use an explicit 8-byte
type to represent the file size and convert it only when necessary.

@DennisHeimbigner
Collaborator

There is probably a larger point here. We should also never be using the type "long"
anywhere. It is a useless type.

@wkliao
Contributor Author

wkliao commented Feb 2, 2018

I did mean NetCDF-4, specifically the NetCDF-4 APIs. Most of the arguments in the NetCDF-4 APIs are of type size_t, and size_t is always 4 bytes on 32-bit machines. This makes it impossible for NetCDF-4 to support CDF-5 properly on 32-bit machines.

Can you point out the internal variables that should be defined as long long?
I can take a look.

@DennisHeimbigner
Collaborator

At least these:
In nc3internal.h:
struct NC_var
fields: dsizes, begin, len

struct NC3_INFO:
fields: xsz, begin_var, begin_rec, recsize

In v1hpg.c:
struct v1hs
fields: offset, extent (note: will require casting when passing to nciop)
function: v1h_put_NC_var
variable vsize (added by your code)

In nc3internal.c:
function NC_check_vlens:
variables: vlen_max

Note: ncio.h has a number of these too, but that API probably needs fixing
as a separate effort.

@DennisHeimbigner
Collaborator

Also, if you compile with proper conversion warnings, you should
be able to catch some other required changes/casts.

@wkliao
Contributor Author

wkliao commented Feb 3, 2018

With AC_SYS_LARGEFILE in configure.ac, off_t will automatically be set to an 8-byte integer if the 32-bit machine supports large-file access. So the only question left is for those variables of type size_t. If we agree to disable CDF-5 wherever size_t is not 8 bytes, then the rest of the compile-time type-casting warnings become harmless.
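
For reference, the configure-time gate being discussed amounts to something like the following fragment; this is a sketch using standard autoconf macros, with the ENABLE_CDF5 conditional matching the line quoted earlier in this thread:

```m4
# Enable transparent large-file support (defines _FILE_OFFSET_BITS=64
# where needed, making off_t 8 bytes on 32-bit systems that support it).
AC_SYS_LARGEFILE

# Determine sizeof(size_t); the result lands in $ac_cv_sizeof_size_t.
AC_CHECK_SIZEOF([size_t])

# Build CDF-5 support only where size_t is at least 8 bytes.
AM_CONDITIONAL([ENABLE_CDF5], [test "$ac_cv_sizeof_size_t" -ge "8"])
```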

Adding -Wconversion to CFLAGS produces a long list of compile-time warnings about type casting, the majority of which are not CDF-5 related. I also believe they are probably harmless. Of course, removing all such warnings would be best, but that requires a big effort.

Collaborator

@DennisHeimbigner DennisHeimbigner left a comment


You should never rely on a flag to determine the type size; this introduces
a "hidden" dependency between the code and the type.
It is always safer to use a known size all the time.

@DennisHeimbigner
Collaborator

More to the point, if we are going to do this, then let's do it right and in the cleanest
possible fashion. Which, IMO means do not rely on types whose size can change.

@wkliao
Contributor Author

wkliao commented Feb 3, 2018

What you ask is outside the scope of this PR. Based on your argument, I wonder why you don't apply the same standard to all other PRs?

This PR passes all the tests on both 64-bit and 32-bit machines. I do not see the point of blocking this PR over a problem that has been in NetCDF for a long time.

I also disagree with your argument that one should never rely on a flag to determine the type size. That has been common practice with autotools; it really depends on whether you use it properly. We can debate this for a long time, but the most important question is whether you would like this PR in NetCDF to fix the CDF5 problem observed/reported by @czender or not.

@DennisHeimbigner
Collaborator

I wonder why you don't apply the same standard to all other PRs?
I do, sometimes.

IMO, the whole varying-type-size thing was a hack from when the long long
type was not well supported. Now it is supported and efficient,
so it is time to move on from the old way of doing things.

I will leave the call to Ward. If he takes your PR as is, then I am ok with that.
I will follow up later with another PR to complete the fix to get rid of off_t.
I am good either way.

@edhartnett
Contributor

@wkliao unfortunately I can't help get this PR merged into Unidata netcdf. If you find it useful, I am maintaining HPC NetCDF https://github.com/HPC-NetCDF/netcdf-c which has this PR merged, as well as most of the other pending PRs.

HPC Netcdf is a drop-in replacement for Unidata netcdf, with advanced HPC features. (It is intended that all new features from HPC Netcdf will also be back-ported to Unidata netcdf, but that may take a while due to limited Unidata resources.)

@WardF
Member

WardF commented Feb 3, 2018

Happy Saturday, all; I've just checked my work email and am getting caught up on this. Lacking a demonstrated issue, I think the best course of action is to accept this PR; if a potential issue needs to be addressed later, we can do it via a follow-up issue/pull request discussion.

Let me merge a PR that fixes a Windows test, and then I'll follow on with this one. I'll get that done now.

@DennisHeimbigner
Collaborator

@wkliao
I thought about it; you are correct, and I apologize.
You provided a perfectly reasonable PR to fix a specific
problem. I should not push you to solve a different problem
just because I perceive an overlap.

@DennisHeimbigner DennisHeimbigner dismissed their stale review February 4, 2018 00:48

It was ill-considered.

@wkliao
Contributor Author

wkliao commented Feb 4, 2018

We all want the best for NetCDF, just approaching it from different angles.

@edhartnett, HPC NetCDF is interesting work. You have recently contributed lots of patches to NetCDF. I wish I could be as energetic as you.

@WardF WardF merged commit b4a2947 into Unidata:master Feb 4, 2018
@edhartnett
Contributor

@wkliao I don't believe in letting the grass grow under my feet.

HPC NetCDF is about to get more interesting. ;-)

@wkliao wkliao deleted the cdf5_var_len branch September 15, 2018 16:25