Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Py3k consistent unicode handling #1782

Closed
wants to merge 17 commits into from
Closed

Conversation

QuLogic
Copy link
Member

@QuLogic QuLogic commented Sep 11, 2015

Convert all strings to Unicode as soon as possible. I had to add in some decode/encode without a specified encoding because I am not familiar with all the formats. This is probably consistent with Python 2, but it's not good practice. If you are more familiar with the correct encoding for a specific format, please let me know so I can make it explicit.

@QuLogic QuLogic mentioned this pull request Sep 11, 2015
14 tasks
@@ -913,7 +913,7 @@ fc_extras
# Set the cube global attributes.
for attr_name, attr_value in six.iteritems(cf_var.cf_group.global_attributes):
try:
if isinstance(attr_value, unicode):
if isinstance(attr_value, six.text_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code path is for converting unicode to str to avoid print-outs using the u prefix syntax where possible. That problem doesn't apply in Python 3 so I'd be tempted to use if six.PY2 and isinstance(attr_value, unicode): to make that clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, good point. I sort of ignored what was in the if, but it makes sense to be explicit for when Python2 finally dies in a million years.

@QuLogic QuLogic modified the milestone: v1.9 Sep 12, 2015
@@ -2302,7 +2306,8 @@ def _as_list_of_coords(self, names_or_coords):
Convert a name, coord, or list of names/coords to a list of coords.
"""
# If not iterable, convert to list of a single item
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little misleading with py3k strings having gained an __iter__ method...

@pp-mo
Copy link
Member

pp-mo commented Nov 3, 2015

"Update:"
I've now checked out all the commits here independently, of which some are just fine and others raise potential problems with user-code backwards compatibility (mostly highlighted by @rhattersley above, but also maybe a few more).

I discussed offline with @rhattersley, and we conclude that it's best to tread softly on user code compatibility, so we propose :

  • for 1.9 : add version-specific code where needed to preserve backwards compatibility in Python 2 (i.e. no behavioural changes)
  • then for 2.0 : remove this extra complexity wherever possible : develop clean common code (i.e. essentially what we already have here) and document the Python2 behaviour changes in the release notes.

Note that we don't suggest adding any "deprecation" tests or notes. A major release means we can change things, as long as it is minor behavioural details and not the API.

So, what I think we should do with this PR :

  • First cherry-pick the already-ok changes and put them up for merge
    -- but n.b. that does not deliver something that will test under Python 3.
  • Then reconsider and fix all the rest. Combine with other pending changes currently in QuLogic:py3k, to fix all the tests. Then raise a secondary PR that resolves all that and implements full tests in Python 2+3.

I'm intending to make a start on organising that tomorrow (sorry for a slow day today!).

@pp-mo
Copy link
Member

pp-mo commented Nov 5, 2015

I hope I'm finally getting close to delivering something on this.
Firstly, I have scanned all the individual commits in this PR, and categorised them into those I think are good as-is, and those which need work to preserve existing Python 2 behaviour.
Which I think goes like this:

9619cdf ok  py3k: Decode subprocess output in license header check.
45328f6 ok  py3k: Always use characters for string types.
732f6bf ok  py3k: Handle Unicode in Groupby.
875178a ok  py3k: Handle Unicode in ProtoCube merging.
f9cc0f9     py3k: Use Unicode in coord collapse.
9717793     py3k: Correct Unicode usage in merge test.
33f2e50     py3k: Set strings to Unicode in NAME loader.
3ffd5de ok  py3k: Correctly handle Unicode coords in plots.
a02e863     py3k: Handle strings from netCDF consistently.
6682bce     py3k: Change coord categories to Unicode.
b759876     py3k: Correctly handle string input to slicers.
244ced6 ok  py3k: Skip encoding test on Python 3.
35d3547 ok  py3k: Decode grib_load output.
199bec1     py3k: Decode strings from NIMROD files.
f52aa8d     py3k: Change GRIB originating centres to Unicode.
37dc9f2     py3k: Change merge tests to Unicode.
8980645 ok  py3k: Use six to get the Unicode type.

@pp-mo pp-mo mentioned this pull request Nov 5, 2015
@pp-mo
Copy link
Member

pp-mo commented Nov 5, 2015

Future strategy for fixing this up :

@pp-mo
Copy link
Member

pp-mo commented Nov 5, 2015

(1) cherry-pick all the "ok" commits from the above onto a new branch + make a PR

See : #1825

If no-one objects, I may merge this anyway (as I've only grouped specific parts of what @QuLogic did)

@QuLogic
Copy link
Member Author

QuLogic commented Nov 5, 2015

I've been thinking about this a little bit more, and I remembered something that might cause a little trouble if you go native str on both 2 and 3. On Python 3, the CDL/CML (don't remember which) that's produced will say unicode (even though there's no such thing in Py3) but on Python 2 it will say string. There will be some trouble with the tests in that regard.

This was referenced Nov 6, 2015
@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

@QuLogic ... if you go native str on both 2 and 3 ...

Thanks for looking.
From this, I'm not sure if you were looking at the original version of of #1825 here ?
I included there a commit that I had actually meant to exclude. Then I re-published it, quite rapidly I thought but maybe not quick enough ?!

Anyway, that same proposal is now here : #1827

I thought that was working out ok, in that CDL files will now say "string" for 'ordinary' output in both cases - i.e. bytestrings in Python2 and unicode in Python3. It will however say "unicode" in Python 2 (only), and "bytes" in Python 3, if those less usual types are found.

@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

produce a branch containing all the non-"ok" commits

I've decided now it is simpler for everyone if I post separate PRs for different areas.
That way others can chip in more easily, if wanted ( @rhattersley ? 😉 )

So, I'll put up PRs for the various aspects I'm addressing as I go.
Then afterwards I'll make a master list of all your unmerged commits (= non-"ok"as listed above) and what I'm proposing for each.

@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

I'll put up PRs for the various aspects I'm addressing as I go.

In case you missed it, here's the first ...

replacement PR : #1827
Changes xml() output for coords with string datatype.
Addresses these original @QuLogic commits ...

37dc9f2     py3k: Change merge tests to Unicode.
6682bce     py3k: Change coord categories to Unicode.
9717793     py3k: Correct Unicode usage in merge test.

@pp-mo pp-mo mentioned this pull request Nov 6, 2015
@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

another "replacement" PR : #1828
Fixes detection of single/multiple args to coord/coords() in Python 3.
Reimplements the original @QuLogic commit :
b759876 py3k: Correctly handle string input to slicers.

This was referenced Nov 6, 2015
@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

another "replacement" PR : #1830
Fixes decoding of netcdf string data in Python 3.
Reimplements the original @QuLogic commit :
a02e863 py3k: Handle strings from netCDF consistently.

@pp-mo pp-mo mentioned this pull request Nov 6, 2015
@pp-mo
Copy link
Member

pp-mo commented Nov 6, 2015

another "replacement" PR : #1832
Fixes collapse of string coordinates in Python 3.
Reimplements the original @QuLogic commit :
f9cc0f9 py3k: Use Unicode in coord collapse.

@QuLogic
Copy link
Member Author

QuLogic commented Nov 6, 2015

say "string" for 'ordinary' output in both cases - i.e. bytestrings in Python2 and unicode in Python3.

OK, perhaps I am misremembering. I thought it would say unicode on Python 3 as well, but if it works, then I guess everything's good.

@QuLogic
Copy link
Member Author

QuLogic commented Nov 6, 2015

Ah, I hadn't read #1825/#1827 yet, but now I see why it works properly.

@pp-mo
Copy link
Member

pp-mo commented Nov 7, 2015

I missed one (#1829 = nimrod loading).
I have also made an additional PR which disables all grib tests for Python 3 : #1833

I think I have now addressed all the 'non-"ok"' commits on this original PR.
When all those are merged with #1833, all tests again pass under Python 3.

Here's a summary of what happened to all the original 'non-"ok"' commits :

sha descr result
6682bce py3k: Change coord categories to Unicode. #1827
37dc9f2 py3k: Change merge tests to Unicode. ( also #1827 )
9717793 py3k: Correct Unicode usage in merge test. ( also #1827 )
f52aa8d py3k: Change GRIB originating centres to Unicode. no longer needed
199bec1 py3k: Decode strings from NIMROD files. #1829
b759876 py3k: Correctly handle string input to slicers. #1828
a02e863 py3k: Handle strings from netCDF consistently. #1830
33f2e50 py3k: Set strings to Unicode in NAME loader. no longer needed -- see following comment
f9cc0f9 py3k: Use Unicode in coord collapse. #1832

@pp-mo
Copy link
Member

pp-mo commented Nov 7, 2015

Regarding 33f2e50 py3k: Set strings to Unicode in NAME loader. :

On revisiting this code, I found that the file-format is entirely text-based, so there is really no need to open the file in binary mode.
So, I am now proposing to leave this code completely unchanged.

Given the xml() fixes from #1827, the test iris.tests.test_name then passes.

@pp-mo
Copy link
Member

pp-mo commented Nov 13, 2015

Nearly there (i.e. back where @QuLogic was at the start of this!)
On his original branch https://github.com/QuLogic/iris/tree/py3k there are also 2 more commits to consider:

First is obvious.
The second is about...?

@pp-mo
Copy link
Member

pp-mo commented Nov 13, 2015

Re: previous comment on the older commit "py3k: Handle new naming of extension libraries."
I still can't work out what this was for.
@QuLogic can you explain what prompted this change and why/if we need it?

@rhattersley
Copy link
Member

@QuLogic can you explain what prompted this change and why/if we need it?

Is it to do with generating docs for iris.fileformats.pp_packing?

@QuLogic
Copy link
Member Author

QuLogic commented Nov 14, 2015

That's correct; sphinx requires the module name. Trimming the extension is fine on Python 2, but on Python 3, the name of a C extension is <module>.cpython-XYm.so. If you just trim the extension there, you get <module>.cpython-XYm as the module name which sphinx will of course fail to import.

@QuLogic
Copy link
Member Author

QuLogic commented Nov 18, 2015

Everything necessary from here has been split off into separate PRs, so I am going to close this one.

@QuLogic QuLogic closed this Nov 18, 2015
@QuLogic QuLogic deleted the py3k-unicode branch November 23, 2015 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants