
Add per Dataset encoding support #655

Closed
wants to merge 14 commits

Conversation

@thehesiod (Contributor) commented Apr 30, 2017

@thehesiod (Contributor, Author) commented Apr 30, 2017

By the way, to make this more backwards compatible, we could instead change the default encoding to None, with a dynamic fallback to default_encoding in the encoding.__get__ lookup rather than fixing it during __init__. Thoughts?
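Roughly what I mean, as an illustrative sketch only (DatasetSketch and the module-level default_encoding stand-in below are made up, not the PR's actual code):

default_encoding = 'utf-8'  # stand-in for the module-level default

class DatasetSketch:
    def __init__(self, encoding=None):
        # None means "no per-Dataset override was given"
        self._encoding = encoding

    @property
    def encoding(self):
        # dynamic fallback resolved at lookup time, so later changes to
        # default_encoding are picked up instead of being frozen in __init__
        if self._encoding is not None:
            return self._encoding
        return default_encoding

    @encoding.setter
    def encoding(self, value):
        self._encoding = value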

@thehesiod (Contributor, Author) commented May 1, 2017

That recursion bug was a doozy; I put a warning in for future devs.

@thehesiod (Contributor, Author) commented

Cool, looks like this PR is ready for review!

@jswhit (Collaborator) commented May 9, 2017

I'm waiting on this to see how the discussion on netcdf-c plays out (Unidata/netcdf-c#402)

@jswhit (Collaborator) commented May 15, 2017

What if different NC_STRING variables within the same dataset contain data with different encodings? I suppose this could be handled with the proposed _Encoding attribute.

The whole thing is a mess, and any solution I can think of (including this one) seems fragile and kludgy. At least we now have a solution that works as long as you know the encoding and you are only reading data from one Dataset at a time.

@thehesiod (Contributor, Author) commented May 15, 2017

Yeah, we need per-variable encoding fallbacks. Let me know how you'd like to handle that. Some ideas (the first is sketched below):

  • a user-specified callback that takes one parameter (the variable name) and returns the encoding to use when _Encoding is not specified and the variable is a char type;
  • a Dataset.get_variable method which you can use to specify an encoding.

I can then code something up for you to look at. Right now I'm using a custom branch, since I need to be able to specify the encoding and I open multiple datasets in parallel.
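Here's a rough, hypothetical sketch of the first idea (resolve_encoding, fallback_cb, and my_fallback are made-up names, not real netCDF4 API):

def resolve_encoding(var_name, var_attrs, fallback_cb=None, default='utf-8'):
    # a per-variable _Encoding attribute wins if present
    if '_Encoding' in var_attrs:
        return var_attrs['_Encoding']
    # otherwise ask the user-supplied callback, if any
    if fallback_cb is not None:
        return fallback_cb(var_name)
    return default

# e.g. MADIS mesonet station names are CP1252, everything else UTF-8
def my_fallback(var_name):
    return 'cp1252' if var_name == 'stationName' else 'utf-8'

print(resolve_encoding('stationName', {}, my_fallback))  # -> cp1252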

@jswhit (Collaborator) commented May 16, 2017

Is your use case (for specifying an encoding) mainly for attributes, or variable data?

@thehesiod (Contributor, Author) commented

Variable data. Based on the conversation in the netcdf-c thread, it sounds like attributes should all be forced to UTF-8 (which I can change in this PR too).

@jswhit (Collaborator) commented May 16, 2017

The reason I ask is that if you are mainly concerned about attributes, we could add an encoding kwarg to getncattr. You would then have to use getncattr to retrieve string attributes, but you could specify the encoding on a per-attribute basis (no need for a Dataset encoding parameter).

@jswhit (Collaborator) commented May 16, 2017

Could you post one of the MADIS files that you are dealing with?

@thehesiod (Contributor, Author) commented May 16, 2017

https://madis-data.cprk.ncep.noaa.gov/madisPublic1/data/archive/2017/02/01/LDAD/mesonet/netCDF/20170201_0000.gz

Found it. For variables['stationName'][77] I get:

[b'F' b'1' b'L' b'X' b'J' b'-' b'1' b'3' b' ' b'A' b'n' b'n' b'\x9c' b'u'
 b'l' b'l' b'i' b'n' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' '
 b' ' b' ' b' ' b' ' b' ' b' ' b' ' b' ' b'F' b'R' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'']

which does not decode as UTF-8.

presumably: 'F1LXJ-13 Annœullin FR ' in CP1252 encoding
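A quick check (illustrative only) that the 0x9c byte is 'œ' in CP1252 while UTF-8 rejects it:

raw = b'F1LXJ-13 Ann\x9cullin'
print(raw.decode('cp1252'))  # -> F1LXJ-13 Annœullin
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 fails:', exc)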

@jswhit (Collaborator) commented May 16, 2017

Thanks @thehesiod. I've created an alternate 'solution' in the 'encoding' branch. Instead of setting the encoding as a Dataset init parameter, I look for an _Encoding attribute for character and vlen string variables. For string variables, if _Encoding is not set, 'utf-8' is used. For character arrays, if _Encoding is set, then a numpy array of fixed-length strings is returned by automatically calling chartostring (the rightmost dimension of the variable is assumed to be the length of the strings). If _Encoding is not set, you get the previous behavior (an array of single characters is returned). So, in your case you would have to add _Encoding="cp1252" as a variable attribute, either using NCO or by opening the file in append mode and adding the attribute before you read the data.

For attributes, I added an 'encoding' kwarg to getncattr.
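For example (the file and attribute names below are made up), reading a single string attribute with a non-default encoding on the encoding branch would look something like:

from netCDF4 import Dataset

nc = Dataset('example.nc')
# decode just this attribute as CP1252 instead of the default utf-8
title = nc.getncattr('title', encoding='cp1252')
print(title)
nc.close()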

I know adding a new attribute to all the files is probably not a good solution for you, but I'm still a bit confused about what problem you are trying to solve. With the current master, you can read the variable stationName and convert it to an array of strings using chartostring, but it will use the global module variable 'default_encoding'. Wouldn't simply adding an 'encoding' kwarg to chartostring solve your problem? (This is also done in the 'encoding' branch.)

To be specific, here's what I'm suggesting (using the encoding branch):

from netCDF4 import Dataset, chartostring
nc = Dataset('20170201_0000')
chararr = nc['stationName'][:]
# convert the char array to strings with an explicit encoding
strarr = chartostring(chararr, encoding='cp1252')
print(strarr[77])
nc.close()

or alternatively

from netCDF4 import Dataset
nc = Dataset('20170201_0000', 'a')
# set the per-variable encoding attribute, then read as usual
nc['stationName']._Encoding = 'cp1252'
strarr = nc['stationName'][:]
print(strarr[77])
nc.close()

@thehesiod (Contributor, Author) commented May 16, 2017

Yeah, thinking about it more, it doesn't make sense to need an encoding init param for char variable data; not sure why I thought I needed this. Closing this PR.

@thehesiod closed this May 16, 2017
@thehesiod deleted the encoding branch June 14, 2017 22:48
Linked issue: Dataset class should support encoding parameter to override global attribute