Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ncgen crashes on particular cdl file #323

Closed
WardF opened this issue Oct 28, 2016 · 42 comments
Closed

ncgen crashes on particular cdl file #323

WardF opened this issue Oct 28, 2016 · 42 comments

Comments

@WardF
Copy link
Member

WardF commented Oct 28, 2016

The following CDL crashes when ncgen -k nc4 [filename] is executed on it:

netcdf stations {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
  n = UNLIMITED ;
variables:
  obs_t obs(n) ;

data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15},
       {40, -105, "KBOU", 7776000, 16} ;
}

Moving temperature before time resolves the crash. Changing the type from float to double fixes the crash. Running ncgen -k nc4 -l c test.cdl > test.c, compiling and running test.c generates a proper binary netcdf file as well.

Log files were generated for each run, and captured below. The issue appears to be that, in the failure case, the size of the compound type is incorrectly computed as 28 instead of 32.

ncgen_temp_moved_below_time.out.txt
ncgen.out.txt
test.out.txt

@WardF
Copy link
Member Author

WardF commented Oct 28, 2016

Starting diagnosis in init_types.

@WardF
Copy link
Member Author

WardF commented Oct 28, 2016

Size is being computed in semantics.c:computesize(), ~line 640.

@dopplershift
Copy link
Member

This reeks of some kind of padding issue. Like where the double field needs a 64-bit aligned address or something.

@WardF
Copy link
Member Author

WardF commented Oct 28, 2016

I had to switch gears after identifying the line in semantics.c, but I suspect you are correct.

@DennisHeimbigner
Copy link
Collaborator

On what os are you running this? Under cygwin on x86, it does not fail for me.
I can easily imagine that the compound type size is not being computed correctly.
What happens is that I have to analyze the compound type to figure out the alignment.
of the fields in the compound type. This has always been a source of potential bugs
since it is potentially dependent on os, compiler, and hardware.

@dopplershift
Copy link
Member

Running on macOS, 64-bit. Do you really mean x86, or is it x86-64?

@WardF
Copy link
Member Author

WardF commented Oct 28, 2016

I've tested on Linux and OSX, both 64-bit

@DennisHeimbigner
Copy link
Collaborator

The issue is probably compiler related and involves
32bit vs 64 bit. I will need access to a true 64 bit linux
installation to debug it.
=Dennis

On 10/28/2016 4:27 PM, Ryan May wrote:

Running on macOS, 64-bit. Do you really mean x86, or is it x86-64?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P23JlNInaD1q1anzG14eKMqz4jY41ks5q4nbngaJpZM4KjwtR.

@WardF
Copy link
Member Author

WardF commented Oct 31, 2016

Created a branch, gh323, and incorporated the cdl provided by Ryan into the ncgen netcdf4 tests.

WardF added a commit that referenced this issue Oct 31, 2016
@WardF
Copy link
Member Author

WardF commented Nov 14, 2016

@dmh were you able to get access to a 64-bit environment?

@DennisHeimbigner
Copy link
Collaborator

Not yet. If you want to take a stab at it, the probable
source of the error is ncgen/offset.c. This code attempts
to figure out the struct field alignments. The code
was taken from HDF5, and it may be time to revisit
that original HDF5 code to look for discrepancies.
=Dennis

On 11/14/2016 11:22 AM, Ward Fisher wrote:

@dmh https://github.com/dmh were you able to get access to a 64-bit
environment?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P24mn_d4UGodqmO4DinBn3n8GM8THks5q-KbmgaJpZM4KjwtR.

@WardF
Copy link
Member Author

WardF commented Nov 14, 2016

note-to-self: semantics.c:640.

@WardF
Copy link
Member Author

WardF commented Nov 14, 2016

Read: start, += getpadding(), += field->typesize

Loop offset (float-double) offset (double-float)
0 0,0,4 (float) 0,0,4 (float)
1 4,4,8 (float) 4,4,8 (float)
2 8,8,16 (string) 8,8,16 (string)
3 16,16,20 (float) 16,16,24 (double)
4 20,24,32 (double) 24,24,28 (float)

Clearly, order matters when computing padding/offset, and the latter order crashes hdf5 for some reason. Should it? Probably not.

@WardF
Copy link
Member Author

WardF commented Nov 14, 2016

Looking at the C code generated by ncgen, it uses the NC_COMPOUND_OFFSET macro which in turn invokes the offsetof C macro, to compute offsets. @DennisHeimbigner Is there any reason not to use the same/similar code for ncgen? I don't know the historic reason for using the hdf5-inspired offsets.c file.

@DennisHeimbigner
Copy link
Collaborator

The problem is compile-time vs run time. The c-code can use
the NC_COMPOUND_OFFSET because it gets compiled. Generating
the nc file directly (-lb) requires the run-time equivalent.
If memory serves, there is no run-time equivalent of NC_COMPOUND_OFFSET.
=Dennis

On 11/14/2016 3:24 PM, Ward Fisher wrote:

Looking at the C code generated by ncgen, it uses the
|NC_COMPOUND_OFFSET| macro which in turn invokes the |offsetof| C macro,
to compute offsets. @DennisHeimbigner
https://github.com/DennisHeimbigner Is there any reason not to use the
same/similar code for ncgen? I don't know the historic reason for using
the hdf5-inspired offsets.c file.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P29rP5OI1XtNHlgJtZD0siK37huO6ks5q-N-DgaJpZM4KjwtR.

@DennisHeimbigner
Copy link
Collaborator

BTW, order sometimes does matter (though rarely).
As a rule, the starting offset of a field is rounded up
to the nearest multiple of the size of the field. so given
struct { char x[3]; int y;} , the offset of x will be 0, but the offset of y will be 4, with a 1 char padding
between x and y.
On the other hand,
struct { char x[3]; double y;} , the offset of x will be 0, but the offset of y will be 8.
Technically, this is compiler dependent, but in practice, all of them adhere to the gcc rules.
Note also that using a packed structure changes things completely. I suspect that none of this is
documented anywhere outside the code.

The reason all of this is important is that netcdf-4 tried to make it possible to define a
C struct type with various fields, and then allow direct read/write of data between that
struct and the underlying HDF5 file. In HDF5 (and netcdf-4), defining a field in a compound
type requires specifying its offset from the beginning of the structure. So it was necessary for
the ncgen to properly determine the offsets at run-time.

I should note that there is undocumented code (ncaux.[ch]) that allows a user to
build a compound type without having to specify the offsets: they are computed using
essentially the same code as used by ncgen.

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Breadcrumbs: Section 4.3.2.1

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Technically, this is compiler dependent, but in practice, all of them adhere to the gcc rules.
Note also that using a packed structure changes things completely. I suspect that none of this is
documented anywhere outside the code.

I need to check to see what Visual Studio does; I hope it also adheres to gcc rules but who knows, considering all the other weird stuff they've done.

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Misc. notes:

gh323.c = C code generated by ncgen -l c -k nc4 gh323.cdl > gh323.c

  • Sample CDL where a compound data type including a double followed by a smaller data type such as int or char results in either a crash, such as seen in the CDL above, _or silently fails after writing garbage to the binary netcdf file_.
  • gh323.c (temporarily stored in ncgen/ while debugging this) contains diagnostic info, and confirms the offsets computed are the same as those in the table above, but the final struct size is 32 not 28.
  • Trying to nail down behavior and why expected behavior isn't manifesting lead me to this page on wikipedia re: Data structure alignment, where the following statement is made:

It is important to note that the last member is padded with the number of bytes required so that the total size of the structure should be a multiple of the largest alignment of any structure member (alignment(int) in this case, which = 4 on linux-32bit/gcc)[citation needed].

That squares with what my understanding but I'll have to turn to other sources for a definitive source. This also explains why size = 28 is failing. ~~~However, attempting to correct this in semantics.c:~654 with some computations fixes the test case for this issue, but causes a bunch of new failures in ncdump/tst_ncgen4.sh that have yet to be diagnosed/investigated. Before going down that rabbit hole I'll need to convince myself that the approach is correct.~~~

UPDATE: I corrected my fix (I was using the wrong variable to compute the "largest alignment multiplier"). Correcting this corrected the failures previously reported re: tst_ncgen4.sh. I still need to convince myself this fix is correct, but I'm optimistic.

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Cleanup todo list

Scratchpad for temporary files in netcdf-c/ that need to be addressed before merging into master.

  • ncgen/gh323_ncgen.c -> Remove
  • RELEASE_NOTES.md -> Update
  • gdb.txt -> Remove
  • align_test.c -> Remove
  • ncgen/gh323-c.cdl -> Remove
  • ncgen/gh323-b.cdl -> Remove
  • ncgen/gh323.cdl -> Rename
  • ncgen/run_nc4_tests.sh -> Update w/ gh323.cdl name, message.

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Testing on ARM, Windows (with visual studio) appear to be working so far. Assuming these work, I will see about issuing a pull request to make code review easier.

@DennisHeimbigner
Copy link
Collaborator

I pushed a branch called issue323.dmh.
can you please try the following.

  1. extract ncgen/offsets.[ch] from that branch
  2. compile on a 64-bit machine as: ${CC} -DOFFSETTEST offsets.c -o offsets.exe
  3. execute offsets.exe and send me the output.

@WardF
Copy link
Member Author

WardF commented Nov 15, 2016

Hi Dennis,

The output is below. I've also gotten something that appears to be fixing
the issue in, in branch gh323; I've outlined the fix I found on the github
page.

Here's the output:

null : size= 0 alignment= 0
char : size= 2 alignment= 1
unsigned char : size= 2 alignment= 1
short : size= 4 alignment= 2
unsigned short : size= 4 alignment= 2
int : size= 8 alignment= 4
unsigned int : size= 8 alignment= 4
long long : size=16 alignment= 8
unsigned long long : size=16 alignment= 8
float : size= 8 alignment= 4
double : size=16 alignment= 8
void* : size=16 alignment= 8
nc_vlen_t : size=24 alignment= 8
short vs null : size= 0 alignment= 0
short vs char : size= 4 alignment= 2
short vs unsigned char : size= 4 alignment= 2
short vs short : size= 4 alignment= 2
short vs unsigned short : size= 4 alignment= 2
short vs int : size= 8 alignment= 4
short vs unsigned int : size= 8 alignment= 4
short vs long long : size=16 alignment= 0
short vs unsigned long long : size=16 alignment= 0
short vs float : size= 8 alignment= 4
short vs double : size=16 alignment= 8
short vs void* : size=16 alignment= 8
short vs nc_vlen_t : size=16 alignment= 8
int vs null : size= 0 alignment= 0
int vs char : size= 8 alignment= 2
int vs unsigned char : size= 8 alignment= 2
int vs short : size= 8 alignment= 2
int vs unsigned short : size= 8 alignment= 2
int vs int : size= 8 alignment= 4
int vs unsigned int : size= 8 alignment= 4
int vs long long : size=16 alignment= 8
int vs unsigned long long : size=16 alignment= 8
int vs float : size= 8 alignment= 4
int vs double : size=16 alignment= 8
int vs void* : size=16 alignment= 8
int vs nc_vlen_t : size=16 alignment= 8

On Tue, Nov 15, 2016 at 3:20 PM, DennisHeimbigner notifications@github.com
wrote:


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AEH-UoxrHKcLC01bwdb2P46lOPr6Tnm9ks5q-jA5gaJpZM4KjwtR
.

@DennisHeimbigner
Copy link
Collaborator

I think the situation is a lot worse than I realized.
The tests that need to run are:

  1. write file on 32 bit machine, read on a 64 bit machine
  2. write on a 64 bit machine, read on a 32 bit machine.
    The think I am afraid of is this: the offsets encoded into
    the hdf5/netcdf-4 file are determined by the machine on which
    the file was written. If the machine doing the reading is 64-bit
    and has different offsets as set by the compiler, then the cross-
    machine reads will fail.
    =Dennis

On 11/15/2016 3:24 PM, Ward Fisher wrote:

Hi Dennis,

The output is below. I've also gotten something that appears to be fixing
the issue in, in branch gh323; I've outlined the fix I found on the github
page.

Here's the output:

null : size= 0 alignment= 0
char : size= 2 alignment= 1
unsigned char : size= 2 alignment= 1
short : size= 4 alignment= 2
unsigned short : size= 4 alignment= 2
int : size= 8 alignment= 4
unsigned int : size= 8 alignment= 4
long long : size=16 alignment= 8
unsigned long long : size=16 alignment= 8
float : size= 8 alignment= 4
double : size=16 alignment= 8
void* : size=16 alignment= 8
nc_vlen_t : size=24 alignment= 8
short vs null : size= 0 alignment= 0
short vs char : size= 4 alignment= 2
short vs unsigned char : size= 4 alignment= 2
short vs short : size= 4 alignment= 2
short vs unsigned short : size= 4 alignment= 2
short vs int : size= 8 alignment= 4
short vs unsigned int : size= 8 alignment= 4
short vs long long : size=16 alignment= 0
short vs unsigned long long : size=16 alignment= 0
short vs float : size= 8 alignment= 4
short vs double : size=16 alignment= 8
short vs void* : size=16 alignment= 8
short vs nc_vlen_t : size=16 alignment= 8
int vs null : size= 0 alignment= 0
int vs char : size= 8 alignment= 2
int vs unsigned char : size= 8 alignment= 2
int vs short : size= 8 alignment= 2
int vs unsigned short : size= 8 alignment= 2
int vs int : size= 8 alignment= 4
int vs unsigned int : size= 8 alignment= 4
int vs long long : size=16 alignment= 8
int vs unsigned long long : size=16 alignment= 8
int vs float : size= 8 alignment= 4
int vs double : size=16 alignment= 8
int vs void* : size=16 alignment= 8
int vs nc_vlen_t : size=16 alignment= 8

On Tue, Nov 15, 2016 at 3:20 PM, DennisHeimbigner notifications@github.com
wrote:


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread

https://github.com/notifications/unsubscribe-auth/AEH-UoxrHKcLC01bwdb2P46lOPr6Tnm9ks5q-jA5gaJpZM4KjwtR
.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P2-uRYJJJtpP_vdjXEC6w5y3gfb_sks5q-jEigaJpZM4KjwtR.

@WardF
Copy link
Member Author

WardF commented Nov 16, 2016

I've tested 1 & 2 and haven't had a problem reading either; I tested branch gh323 with the fix described above and there was no problem reading the files generated on 32/64-machines by their opposite. I've attached the files here, the filename indicates which platform the file was written on.

I thought the file offsets were handled by the I/O layer; otherwise, wouldn't netCDF4 files have never been machine-independent in the first place? I'll test these on Windows as well, and ARM when I'm in the office tomorrow.

@WardF
Copy link
Member Author

WardF commented Nov 16, 2016

Update: The files are reading just fine between platforms, but are being written differently. Specifically, it is being written incorrectly on the 64-bit platform. So, more work is needed.

Written on 32-bit platform:

netcdf compound_datasize_test_32 {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
    n = UNLIMITED ; // (2 currently)
variables:
    obs_t obs(n) ;
data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15}, {40, -105, "KBOU", 7776000, 16} ;

Written on 64-bit platform

netcdf compound_datasize_test_64 {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
    n = UNLIMITED ; // (2 currently)
variables:
    obs_t obs(n) ;
data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15}, {-105, 0, "KBOU", 7776000, 16} ;

The results remain the same whether I read from a 32-bit or 64-bit platform.

@DennisHeimbigner
Copy link
Collaborator

Note this netcdf API function:
EXTERNL int
nc_insert_compound(int ncid, nc_type xtype, const char *name,
size_t offset, nc_type field_typeid);
Note that in the C code, the offset is determined by the offset of the C struct definition.
Someone (the HDF5 layer) is translating the offsets, apparently.

@WardF
Copy link
Member Author

WardF commented Nov 16, 2016

Hi Dennis; I had noticed the nc_insert_compound() function; when I looked into what the HDF5 function was doing, they were also using the offsetof C macro for runtime scripts. Still not sure what's going on with ncgen but I've just sat down to try to debug this issue with data being written incorrectly on 64-bit systems. Do you have any insight into a fix?

@WardF
Copy link
Member Author

WardF commented Nov 16, 2016

Additionally, I notice the tests in run_nc4_tests.sh only run ncgen to see if it exits with 0; no check to ensure the data is correct is made. I'm updating these tests to make sure they generate what is expected.

@WardF
Copy link
Member Author

WardF commented Nov 16, 2016

Ok, current status; the crash is gone but, on 64-bit systems, data is getting scrambled in the CDL file as represented above. The fix is the same as it was for the crash; moving the float below the double solves the issue.

@DennisHeimbigner
Copy link
Collaborator

This is all beginning to come back. I did an H5Dump on a netcdf-4 file
containing a compound type and it looks like the offsets are not stored in the file.
Assuming that, then the HDF5 library translates between the file layout and the in-memory
layout on the fly. So the data in the file is machine and compiler independent but
the in-memory layout is not.
Since I took the offsets.c code from HDF5, I hope that what we are seeing is not also
a problem with the hdf5 library.

@DennisHeimbigner
Copy link
Collaborator

Also, let me make sure I understand. You seem to be saying that even when you round up
the size of the whole structure, 64-bit still fails, correct? does it matter if the variable
is dimensioned?

@DennisHeimbigner
Copy link
Collaborator

So I was looking at the code in semantics.c and it seems to do what you
want about bumping the size of the structure. I think the variable largsize
is unneeded, but whatever.

@WardF
Copy link
Member Author

WardF commented Nov 17, 2016

Note from meeting w/ Dennis: probably something in the byte buffer construction code. It looks like the first 4 bytes of the data array (starting at the second element) are being discarded.

@DennisHeimbigner
Copy link
Collaborator

Ok, I am testing a fix now. Sadly, I was not only mishandling padding between
struct instances, but also within in some case. I have added another test to
try out nested compounds. Seems to work. Let you know when the code passes
distcheck.

@WardF
Copy link
Member Author

WardF commented Nov 17, 2016

Thanks Dennis, I did a spot test on my end and, under cmake at least, everything appears to be fixed. I'll wait for travis to confirm the fix, and assuming everything passes I'll flesh out the release notes as well as making sure that all the test scripts are checking the output when ncgen is run.

@DennisHeimbigner
Copy link
Collaborator

distcheck passed, so I guess we are just waiting on travis.
BTW, we will need to consider adding more tests with various
kinds of compound nesting and intermixing with e.g. vlens.

@WardF
Copy link
Member Author

WardF commented Nov 17, 2016

Travis has passed, as has 32-bit ARM. Testing on Windows now with Visual Studio (32, 64 bit). Will continue to flesh out tests and such, but assuming I don't find any failures I will merge into master and create a maintenance release ASAP.

@dopplershift
Copy link
Member

As an aside, are you interested in AppVeyor as a Travis-like thing for windows? I can help get you set up--I'm just no good at windows building.

@dopplershift
Copy link
Member

(I have a Unidata org for it)

@WardF
Copy link
Member Author

WardF commented Nov 18, 2016

@dopplershift possibly! I'll investigate it and follow up with you; the required dependencies are the only thing I wonder about.

WardF added a commit that referenced this issue Nov 18, 2016
@WardF WardF modified the milestones: 4.4.2, 4.4.1.1 Dec 28, 2016
@WardF
Copy link
Member Author

WardF commented Mar 24, 2017

Apparently I never closed this; reviewing to ensure it's solved and then closing.

@WardF WardF closed this as completed May 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants