Request: ncdump: Fix printing of UTF-8 in "character" data variables #2916
Comments
The interpretation of single characters has been discussed on several occasions.
Attributes print the raw character string without substituting octal codes. This is again with netcdf-c 4.9.2, and the netcdf-3 and netcdf-4 formats both seem to behave exactly the same way with UTF-8. Test file attached.
Good, so we don't have to make any adjustment for attributes.
I think the right solution for Labeling such as Notice that if the consumer assumes the correct character set in advance, and uses matching settings in terminal windows and text readers, then
That multi-line rendering has been traditional In my opinion, the only thing that needs adjusting is the gratuitous insertion of octal codes. There must have been a good reason for this in some case, but I do not recall that history. Possibly a new control for that behavior is needed, if there is still a need for octal codes. The rendering method for both character attributes and netcdf-4 strings sets a good precedent for how type character "strings" should be rendered by
Now that I understand this issue better, my title is incorrect. It should be something like "ncdump: Print raw character strings without character substitution". But let's leave it as is, for continuity. UTF-8 is the most common modern case.
I did some investigation. The reason that octal values are printed is because the ncdump code
@DennisHeimbigner Thanks for looking into that. Yes, I also saw that, in three different places for some reason. I agree: print the full sequence, possibly except for appropriate escape codes and line breaks. I have not looked into whether printing data type string adds any escape codes, such as for actual control characters.
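The idea of printing the full sequence while still escaping genuine control characters can be sketched as follows. This is only an illustration of the proposal in Python, not ncdump's code: the function name `render_char_row`, the choice of escapes, and the use of replacement characters for invalid UTF-8 are all assumptions made for the example.

```python
def render_char_row(raw: bytes) -> str:
    """Decode one row of char data as UTF-8, escaping only control characters.

    Hypothetical helper for illustration; ncdump itself is written in C and
    may make different choices.
    """
    # Assumption: invalid byte sequences become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    named = {"\n": "\\n", "\t": "\\t", "\r": "\\r", '"': '\\"', "\\": "\\\\"}
    out = []
    for ch in text:
        if ch in named:
            out.append(named[ch])            # readable escape for common cases
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"\\{ord(ch):03o}")   # octal only for real control chars
        else:
            out.append(ch)                   # everything else is printed as-is
    return "".join(out)


print(render_char_row("Zürich\n".encode("utf-8")))
```

With this approach, multi-byte UTF-8 characters pass through untouched, and octal codes remain only where a byte is an actual control character.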
I am testing the fix. One consequence is that some test cases fail
re: Issue Unidata#2916. Currently, ncdump prints char-valued variables as a mix of ASCII characters and octal escapes. The octal format is used for non-printable ASCII character values. This PR changes this to print the char variable values as raw binary. In practice, this means that UTF-8 characters are properly interpreted and printed as UTF-8.
See #2921
This is a feature request. I tested this with netcdf-c 4.9.2 on macOS and Linux, but I think it applies to all previous netcdf versions.
ncdump has a convenient feature: when printing variables of type character, it concatenates single chars along the rightmost dimension and prints them as strings. This is useful for embedding actual string data in the classic netcdf-3 format, where there is no actual "string" data type.
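The concatenation described above can be pictured with a small Python sketch. It is not ncdump's implementation: `rows_to_strings` is a hypothetical helper, the flat row-major buffer layout is a simplification, and stripping trailing NUL padding is an assumption made for the example.

```python
def rows_to_strings(flat: bytes, nrows: int, width: int) -> list:
    """Group a flat char buffer into one string per row.

    Hypothetical illustration: `width` is the size of the rightmost
    dimension, so each run of `width` bytes is treated as one string.
    """
    if len(flat) != nrows * width:
        raise ValueError("buffer size does not match nrows * width")
    rows = []
    for i in range(nrows):
        row = flat[i * width:(i + 1) * width]
        # Assumption: unused trailing positions are NUL padding and are dropped.
        rows.append(row.rstrip(b"\x00").decode("utf-8"))
    return rows


# A 2 x 8 char variable holding two padded names.
buf = b"alpha\x00\x00\x00" + b"beta\x00\x00\x00\x00"
print(rows_to_strings(buf, nrows=2, width=8))
```

Decoding each row as a whole, rather than one byte at a time, is what keeps multi-byte characters intact.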
However, ncdump does not correctly print UTF-8 strings when they are embedded like this in type character. UTF-8 multi-byte characters in the data are split into single bytes and printed as octal escape sequences. See the example printout below; the netcdf-4 file for this demo is attached here.
Can ncdump be updated to print such UTF-8 character strings correctly? I notice that "string" data type in netcdf-4 is handled differently, and UTF-8 is rendered correctly in that case.
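To make the byte-splitting concrete, here is a minimal Python sketch of per-byte octal escaping. It is an illustration of the reported symptom, not the printout from this report and not ncdump's actual code; the function name `octal_escape_per_byte` and the printable range used are assumptions.

```python
def octal_escape_per_byte(raw: bytes) -> str:
    """Render bytes one at a time, escaping anything outside printable ASCII.

    Hypothetical illustration of why multi-byte UTF-8 characters end up
    as runs of octal codes when each byte is handled independently.
    """
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))         # printable ASCII is emitted directly
        else:
            out.append(f"\\{b:03o}")   # every other byte becomes \ooo
    return "".join(out)


raw = "café".encode("utf-8")           # 'é' is the two bytes 0xC3 0xA9
print(octal_escape_per_byte(raw))      # prints caf\303\251
print(raw.decode("utf-8"))             # prints café
```

The second `print` shows that the same bytes render correctly once the sequence is treated as a whole, which is the behavior requested here for type character.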