Request: ncdump: Fix printing of UTF-8 in "character" data variables #2916

Dave-Allured · 2024-05-03T01:16:16Z

This is a feature request. I tested this with netcdf-c 4.9.2 on macOS and Linux, but I think it applies to all previous netCDF versions.

Ncdump has this convenient feature. When printing type character, it concatenates single chars along the right hand dimension, and prints them as strings. This is nice for embedding actual string data in classic netcdf-3 format where there is no actual "string" data type.

However, ncdump does not correctly print UTF-8 strings when embedded like this in type character. UTF-8 multi-byte characters in the data are split into single bytes, and printed as octal escape sequences. See example printout below, and the netcdf-4 file for this demo is attached here.

Can ncdump be updated to print such UTF-8 character strings correctly? I notice that "string" data type in netcdf-4 is handled differently, and UTF-8 is rendered correctly in that case.

netcdf utf8.test {
dimensions:
	site = 4 ;
	len25 = 25 ;
variables:
	string name1(site) ;
	char name2(site, len25) ;
data:

 name1 = "Sainte-Anne-de-Bellevue  ",   // name1(0)
    "Aéroport de Montréal 1 ",   // name1(1)
    "Rivière-des-Prairies    ",   // name1(2)
    "Chenier                  ";  // name1(3)
    
 name2 =
  "Sainte-Anne-de-Bellevue  ",  // name2(0,24)
    "A\303\251roport de Montr\303\251al 1 ",  // name2(1,24)
    "Rivi\303\250re-des-Prairies    ",  // name2(2,24)
    "Chenier                  ";  // name2(3,24)
    }
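To make the name2 output above concrete: "é" is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9, and those are 303 and 251 in octal. The following is a minimal sketch of per-byte escaping that reproduces that pattern. The function name `escape_bytes` and its buffer-based interface are hypothetical and are not taken from the ncdump source; it also ignores quote and backslash escaping for brevity.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, NOT ncdump's actual code: copy `len` bytes from `src`
 * into `dst`, writing printable ASCII as-is and every other byte as a
 * three-digit octal escape. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if `dst` is too small. */
int escape_bytes(const unsigned char *src, size_t len, char *dst, size_t dstsz)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c >= 0x20 && c < 0x7F) {
            /* Printable ASCII: one character plus room for the final NUL. */
            if (n + 2 > dstsz) return -1;
            dst[n++] = (char)c;
        } else {
            /* Anything else, including each byte of a multi-byte UTF-8
             * sequence: backslash + three octal digits + final NUL. */
            if (n + 5 > dstsz) return -1;
            n += (size_t)snprintf(dst + n, dstsz - n, "\\%03o", (unsigned)c);
        }
    }
    if (n + 1 > dstsz) return -1;
    dst[n] = '\0';
    return (int)n;
}
```

Passing the nine bytes of "Aéroport" (as UTF-8) through this function yields the text `A\303\251roport`, because each byte of the two-byte sequence is judged on its own and neither falls in the printable ASCII range.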


DennisHeimbigner · 2024-05-03T02:59:00Z

The interpretation of single characters has been discussed on several occasions.
If I recall correctly one idea was to add an "_encoding" attribute to variables to indicate
how to print them. Would this be an appropriate extension to your feature?
In any case, I think this could be done and seems like a reasonable change.
BTW, I forget; what do character valued attributes print?

Dave-Allured · 2024-05-03T14:55:46Z

> what do character valued attributes print?

Attributes print the raw character string without substituting octal codes. This is again with 4.9.2, and netcdf-3 and netcdf-4 formats both seem to work exactly the same way with UTF-8. Test file attached.

netcdf att-test {
variables:
	int x ;
		x:att1 = "Sainte-Anne-de-Bellevue" ;
		x:att2 = "Aéroport de Montréal 1" ;

// global attributes:
		:gatt1 = "Sainte-Anne-de-Bellevue" ;
		:gatt2 = "Aéroport de Montréal 1" ;
data:

 x = 99 ;
}

DennisHeimbigner · 2024-05-03T15:00:59Z

Good, so we don't have to make any adjustment for attributes.
It occurs to me that there is one other issue.
If the char typed variable is a multi-dimensional array, then
as I recall, ncdump only dumps the last dimension as a
contiguous string of characters (again needs testing).
So an array would look like a sequence of individual strings.
Are you ok with that?

Dave-Allured · 2024-05-03T15:20:30Z

> The interpretation of single characters has been discussed on several occasions.
> If I recall correctly one idea was to add an "_encoding" attribute to variables to indicate how to print them. Would this be an appropriate extension to your feature?

I think the right solution for ncdump is to completely avoid character interpretation, and to print raw character strings directly to standard out. This appears to be the traditional behavior for both netcdf-4 string data type, and for type character attributes in all netcdf formats.

Labeling such as _encoding should be left as a signal for user consumption, and should not affect the raw output method used by ncdump.

Notice that if the consumer assumes the correct character set in advance, and uses matching settings in terminal windows and text readers, then ncdump raw character string output will be correct for ALL character sets, not just UTF-8 or ASCII.

Dave-Allured · 2024-05-03T15:33:55Z

> If the char typed variable is a multi-dimensional array, then
> as I recall, ncdump only dumps the last dimension as a
> contiguous string of characters (again needs testing).
> So an array would look like a sequence of individual strings.
> Are you ok with that?

That multi-line rendering has been traditional ncdump character behavior for decades, and I have made extensive use of that. Please do not change that.

In my opinion, the only thing that needs adjusting is the gratuitous insertion of octal codes. There must have been a good reason for this in some case, but I do not recall that history. Possibly a new control for that behavior is needed, if there is still a need for octal codes.

The rendering method for both character attributes and netcdf-4 strings sets a good precedent for how type character "strings" should be rendered by ncdump.
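The multi-line rendering discussed in this comment can be sketched as follows: the char array is treated as row-major, the rightmost dimension becomes the string length, and all leading dimensions collapse into a count of rows. This is only an illustration under those assumptions. The name `format_char_rows` and the exact separators are hypothetical, and the per-row `// name2(i,24)` comments that ncdump prints are omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT ncdump's actual code: treat `data` as a row-major
 * nrows x rowlen char array and write each row into `dst` as one quoted
 * string, comma-separated, with " ;" after the last row. Row bytes are
 * copied unchanged. Returns the length written, or -1 if `dst` is too small. */
int format_char_rows(const char *data, size_t nrows, size_t rowlen,
                     char *dst, size_t dstsz)
{
    size_t n = 0;
    for (size_t r = 0; r < nrows; r++) {
        const char *sep = (r + 1 < nrows) ? "\",\n" : "\" ;\n";
        size_t seplen = strlen(sep);
        /* Opening quote + row bytes + separator + terminating NUL. */
        if (n + 1 + rowlen + seplen + 1 > dstsz) return -1;
        dst[n++] = '"';
        memcpy(dst + n, data + r * rowlen, rowlen);  /* one whole row at once */
        n += rowlen;
        memcpy(dst + n, sep, seplen);
        n += seplen;
    }
    if (n + 1 > dstsz) return -1;
    dst[n] = '\0';
    return (int)n;
}
```

For the `name2(site, len25)` variable in this issue, `nrows` would be 4 and `rowlen` would be 25. For a variable of higher rank, `nrows` would be the product of all dimensions except the last.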

Dave-Allured · 2024-05-03T15:42:17Z

Now that I understand this issue better, my title is incorrect. It should be something like "ncdump: Print raw character strings without character substitution". But let's leave it as is, for continuity. UTF-8 is the most common modern case.

DennisHeimbigner · 2024-05-06T20:36:22Z

I did some investigation. Octal values are printed because the ncdump code
walks the character data char by char and prints each char individually. To get
UTF-8, it needs to be changed to print the whole sequence at once.

Dave-Allured · 2024-05-06T23:15:10Z

@DennisHeimbigner Thanks for looking into that. Yes, I also saw that, in three different places for some reason. I agree: print the full sequence, possibly except for appropriate escape codes and line breaks. I have not looked into whether printing data type string adds any escape codes, for example for actual control characters.
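One way to read "print the full sequence, except for escape codes" is sketched below: bytes are written through unchanged in runs, so multi-byte UTF-8 sequences stay intact, and only quotes, backslashes, and ASCII control characters are escaped. This is an assumption-laden illustration, not the change made in #2921. The names `escape_for` and `print_raw_string` are hypothetical, as is the choice of which bytes to escape.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: the C-style escape for bytes this sketch gives a named
 * escape, or NULL if the byte has none. */
static const char *escape_for(unsigned char c)
{
    switch (c) {
    case '"':  return "\\\"";
    case '\\': return "\\\\";
    case '\n': return "\\n";
    case '\t': return "\\t";
    default:   return NULL;
    }
}

/* Hypothetical helper, NOT ncdump's actual code: write `len` bytes of `s`
 * to `out` as a quoted string. Bytes >= 0x80 are passed through raw; other
 * control characters are written as octal. Returns 0 on success, -1 on
 * an output error. */
int print_raw_string(FILE *out, const unsigned char *s, size_t len)
{
    size_t start = 0;  /* beginning of the current run of pass-through bytes */
    if (fputc('"', out) == EOF) return -1;
    for (size_t i = 0; i <= len; i++) {
        int is_ctrl = i < len && (s[i] < 0x20 || s[i] == 0x7F);
        const char *esc = i < len ? escape_for(s[i]) : NULL;
        if (i < len && !is_ctrl && esc == NULL) continue;  /* stays in the run */
        /* Flush the pending run in a single call. */
        if (i > start && fwrite(s + start, 1, i - start, out) != i - start)
            return -1;
        if (i == len) break;
        if (esc != NULL) {
            if (fputs(esc, out) == EOF) return -1;
        } else if (fprintf(out, "\\%03o", (unsigned)s[i]) < 0) {
            return -1;
        }
        start = i + 1;
    }
    return fputc('"', out) == EOF ? -1 : 0;
}
```

Calling `print_raw_string(stdout, ...)` on the UTF-8 bytes of "Montréal" followed by a tab would emit the é bytes untouched and the tab as `\t`. Whether the result displays correctly then depends only on the terminal's character set, which matches the behavior argued for earlier in the thread.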

DennisHeimbigner · 2024-05-07T15:07:31Z

I am testing the fix. One consequence is that some test cases fail
because they were expecting octal output. May need to consider
an ncdump option to enable this change.

re: Issue Unidata#2916 Currently, ncdump prints char-valued variables as a mix of ascii and octal characters. The octal format is used for non-printable ascii character values. This PR changes this to print the char variable values as raw binary. This means in practice that utf-8 tags are properly interpreted and printed as utf-8.

DennisHeimbigner · 2024-05-07T16:38:37Z

See #2921

Dave-Allured changed the title ~~Request: ncdump: Print character data type as UTF-8~~ Request: ncdump: Fix printing of UTF-8 in "character" data variables May 3, 2024

DennisHeimbigner mentioned this issue May 7, 2024

Modify ncdump to print char-valued variables as utf8. #2921

Merged

WardF closed this as completed in #2921 May 7, 2024
