Request: ncdump: Fix printing of UTF-8 in "character" data variables #2916

Closed
Dave-Allured opened this issue May 3, 2024 · 10 comments · Fixed by #2921

@Dave-Allured
Contributor

Dave-Allured commented May 3, 2024

This is a feature request. I tested this with netcdf-c 4.9.2 on macOS and Linux, but I think it applies to all previous netcdf versions.

Ncdump has a convenient feature: when printing variables of type character, it concatenates single chars along the rightmost dimension and prints them as strings. This is nice for embedding actual string data in classic netcdf-3 format, where there is no actual "string" data type.

However, ncdump does not correctly print UTF-8 strings when they are embedded like this in type character. UTF-8 multi-byte characters in the data are split into single bytes and printed as octal escape sequences. See the example printout below; the netcdf-4 file for this demo is attached.

Can ncdump be updated to print such UTF-8 character strings correctly? I notice that "string" data type in netcdf-4 is handled differently, and UTF-8 is rendered correctly in that case.

netcdf utf8.test {
dimensions:
	site = 4 ;
	len25 = 25 ;
variables:
	string name1(site) ;
	char name2(site, len25) ;
data:

 name1 = "Sainte-Anne-de-Bellevue  ",   // name1(0)
    "Aéroport de Montréal 1 ",   // name1(1)
    "Rivière-des-Prairies    ",   // name1(2)
    "Chenier                  ";  // name1(3)
    
 name2 =
  "Sainte-Anne-de-Bellevue  ",  // name2(0,24)
    "A\303\251roport de Montr\303\251al 1 ",  // name2(1,24)
    "Rivi\303\250re-des-Prairies    ",  // name2(2,24)
    "Chenier                  ";  // name2(3,24)
}
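
The octal output for name2 above can be reproduced with a short C sketch. This is only an illustration of per-byte escaping, not the actual ncdump source; `escape_octal` and its parameters are hypothetical names. It assumes every byte outside printable ASCII is written as a three-digit octal escape, which matches the printout: in UTF-8, "é" is the two bytes 0xC3 0xA9, which are 303 and 251 in octal.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT the actual ncdump code: copy `len` bytes from
 * `src` into `dst`, writing every byte outside printable ASCII as a
 * three-digit octal escape. Returns the number of characters written
 * (excluding the terminating NUL), or 0 if `dst` is too small.
 */
size_t escape_octal(const char *src, size_t len, char *dst, size_t dst_size)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c >= 0x20 && c < 0x7f) {
            /* Printable ASCII: need room for the byte plus the final NUL. */
            if (out + 2 > dst_size) return 0;
            dst[out++] = (char)c;
        } else {
            /* Anything else: backslash + 3 octal digits, plus the NUL. */
            if (out + 5 > dst_size) return 0;
            out += (size_t)snprintf(dst + out, dst_size - out,
                                    "\\%03o", (unsigned)c);
        }
    }
    if (out >= dst_size) return 0;
    dst[out] = '\0';
    return out;
}
```

Passing the 9 UTF-8 bytes of "Aéroport" to this function yields `A\303\251roport`, the same form as in name2(1,24). Each byte of a multi-byte character is treated on its own, so the terminal never sees the complete UTF-8 sequence.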
@DennisHeimbigner
Collaborator

The interpretation of single characters has been discussed on several occasions.
If I recall correctly, one idea was to add an "_encoding" attribute to variables to indicate
how to print them. Would this be an appropriate extension of your feature?
In any case, I think this could be done and seems like a reasonable change.
BTW, I forget: what do character valued attributes print?

@Dave-Allured
Contributor Author

> what do character valued attributes print?

Attributes print the raw character string without substituting octal codes. This is again with 4.9.2, and netcdf-3 and netcdf-4 formats both seem to work exactly the same way with UTF-8. Test file attached.

netcdf att-test {
variables:
	int x ;
		x:att1 = "Sainte-Anne-de-Bellevue" ;
		x:att2 = "Aéroport de Montréal 1" ;

// global attributes:
		:gatt1 = "Sainte-Anne-de-Bellevue" ;
		:gatt2 = "Aéroport de Montréal 1" ;
data:

 x = 99 ;
}

@DennisHeimbigner
Collaborator

Good, so we don't have to make any adjustment for attributes.
It occurs to me that there is one other issue.
If the char typed variable is a multi-dimensional array, then
as I recall, ncdump only dumps the last dimension as a
contiguous string of characters (again needs testing).
So an array would look like a sequence of individual strings.
Are you ok with that?

@Dave-Allured
Contributor Author

Dave-Allured commented May 3, 2024

> The interpretation of single characters has been discussed on several occasions.
> If I recall correctly, one idea was to add an "_encoding" attribute to variables to indicate how to print them. Would this be an appropriate extension of your feature?

I think the right solution for ncdump is to completely avoid character interpretation, and to print raw character strings directly to standard out. This appears to be the traditional behavior for both netcdf-4 string data type, and for type character attributes in all netcdf formats.

Labeling such as _encoding should be left as a signal for user consumption, and should not affect the raw output method used by ncdump.

Notice that if the consumer assumes the correct character set in advance, and uses matching settings in terminal windows and text readers, then ncdump raw character string output will be correct for ALL character sets, not just UTF-8 or ASCII.

@Dave-Allured
Contributor Author

> If the char typed variable is a multi-dimensional array, then
> as I recall, ncdump only dumps the last dimension as a
> contiguous string of characters (again needs testing).
> So an array would look like a sequence of individual strings.
> Are you ok with that?

That multi-line rendering has been traditional ncdump character behavior for decades, and I have made extensive use of that. Please do not change that.

In my opinion, the only thing that needs adjusting is the gratuitous insertion of octal codes. There must have been a good reason for this in some case, but I do not recall that history. Possibly a new control for that behavior is needed, if there is still a need for octal codes.

The rendering method for both character attributes and netcdf-4 strings sets a good precedent for how type character "strings" should be rendered by ncdump.
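
The row-per-string rendering discussed in this comment can be sketched in C as follows. This is a minimal illustration under stated assumptions, not ncdump's code: `print_char_rows` and its parameters are hypothetical names, the array is assumed to be stored row-major with the string length as the last dimension, and the layout only roughly follows the ncdump printout.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch, NOT ncdump's implementation: print a row-major
 * (nrows x rowlen) char array as one quoted string per row, writing each
 * row's bytes unchanged. Returns 0 on success, -1 on a write error.
 */
int print_char_rows(FILE *out, const char *name,
                    const char *data, size_t nrows, size_t rowlen)
{
    fprintf(out, " %s =\n", name);
    for (size_t r = 0; r < nrows; r++) {
        fputs("  \"", out);
        /* The last dimension is contiguous, so one row is one fwrite. */
        fwrite(data + r * rowlen, 1, rowlen, out);
        /* A comma separates rows; a semicolon closes the variable. */
        fputs(r + 1 < nrows ? "\",\n" : "\" ;\n", out);
    }
    return ferror(out) ? -1 : 0;
}
```

For a char variable shaped like name2(site, len25), a call would pass nrows = 4 and rowlen = 25, producing four strings of 25 bytes each.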

@Dave-Allured
Contributor Author

Now that I understand this issue better, my title is incorrect. It should be something like "ncdump: Print raw character strings without character substitution". But let's leave it as is, for continuity. UTF-8 is the most common modern case.

Dave-Allured changed the title from "Request: ncdump: Print character data type as UTF-8" to "Request: ncdump: Fix printing of UTF-8 in "character" data variables" on May 3, 2024
@DennisHeimbigner
Collaborator

I did some investigation. Octal values are printed because the ncdump code
walks the character data char by char and prints each char individually. To get
UTF-8 output, it needs to be changed to print the whole sequence at once.

@Dave-Allured
Contributor Author

Dave-Allured commented May 6, 2024

@DennisHeimbigner Thanks for looking into that. Yes, I also saw that, in three different places for some reason. I agree: print the full sequence, possibly except for appropriate escape codes and line breaks. I have not looked into whether printing the string data type adds any escape codes, for example for actual control characters.
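
One possible reading of "print the full sequence, except for appropriate escape codes" is sketched below in C. This is an assumption for illustration, not the change that was merged: `escape_controls_only` is a hypothetical name, and the choice of which bytes still get escaped (ASCII control bytes, the double quote, and the backslash) is a guess at what "appropriate" might mean. Bytes at or above 0x80 pass through unchanged, so UTF-8 sequences, or text in any other 8-bit character set, reach the terminal intact.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT the merged ncdump change: copy `len` bytes of
 * char data into `dst` as a NUL-terminated string. Bytes >= 0x80 are copied
 * unchanged; ASCII control bytes become octal escapes; '"' and '\\' are
 * preceded by a backslash. Returns the length written (excluding the NUL),
 * or 0 if `dst` is too small.
 */
size_t escape_controls_only(const char *src, size_t len,
                            char *dst, size_t dst_size)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x20 || c == 0x7f) {
            /* Control byte: backslash + 3 octal digits, plus the NUL. */
            if (out + 5 > dst_size) return 0;
            out += (size_t)snprintf(dst + out, dst_size - out,
                                    "\\%03o", (unsigned)c);
        } else {
            if (c == '"' || c == '\\') {
                if (out + 2 > dst_size) return 0;
                dst[out++] = '\\';
            }
            /* Printable ASCII or a high byte: copied as-is. */
            if (out + 2 > dst_size) return 0;
            dst[out++] = (char)c;
        }
    }
    if (out >= dst_size) return 0;
    dst[out] = '\0';
    return out;
}
```

With this approach the bytes of "Rivière" are copied verbatim, while a trailing newline byte would still appear as `\012`. Whether any escaping should remain at all is the open question raised in this comment.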

@DennisHeimbigner
Collaborator

I am testing the fix. One consequence is that some test cases fail
because they were expecting octal output. May need to consider
an ncdump option to enable this change.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue May 7, 2024
re: Issue Unidata#2916

Currently, ncdump prints char-valued variables as a mix
of ascii and octal characters. The octal format is used
for non-printable ascii character values.

This PR changes this to print the char variable values
as raw binary. This means in practice that utf-8 tags
are properly interpreted and printed as utf-8.
@DennisHeimbigner
Collaborator

See #2921
