od -c => od -tc: od -c is a compat-only XSI extension ~equivalent to LC_CTYPE=C od -tc and not universally available #2922

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

nabijaczleweli
Contributor

@nabijaczleweli nabijaczleweli commented Oct 4, 2023

Cf. http://ro.ws.co.ls/od#STANDARDS and your favourite POSIX PDF.

@nabijaczleweli nabijaczleweli changed the title from "od -c => od -tc: od -c is a compat-only XSI extension equivalent to LC_CTYPE=C od -tc and not universally available" to "od -c => od -tc: od -c is a compat-only XSI extension ~equivalent to LC_CTYPE=C od -tc and not universally available" Oct 4, 2023
@nicowilliams
Contributor

If all checks pass then LGTM.

@emanuele6
Member

It does not look good to me.

The title is just outright wrong. You can say that LC_CTYPE=C od -c is equivalent to od -tc, but not that LC_CTYPE=C od -tc is equivalent to od -c.

Also calling XSI compliance an extension hardly makes sense. What portability problem are you trying to fix?

@emanuele6
Member

emanuele6 commented Oct 4, 2023

The title is just outright wrong. You can say that LC_CTYPE=C od -c is equivalent to od -tc, but not that LC_CTYPE=C od -tc is equivalent to od -c.

Actually that is not true either.

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
As far as I can tell, the only difference between -c and -t c is that -c is only guaranteed to print backslash sequences for NUL 0x00 \0, BS 0x08 \b, FF 0x0c \f, NL 0x0a \n, CR 0x0d \r, and HT 0x09 \t. While -t c is guaranteed to print special backslash sequences for more characters. And that -c is marked XSI, while -t (-t c) is not.

With busybox, GNU, and (Free)BSD, they are exactly the same.

Both use the LC_CTYPE locale the same way: to process bytes into characters (POSIX only says that for the sake of unusual systems that handle text and binary files differently), and to decide which characters are non-graph and which are graph.
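
A quick way to check this claim on a particular system is to feed both spellings the same control bytes and compare the dumps. This is only a sketch: the `sample` and `compare_od_modes` names are made up for illustration, and it assumes nothing about the local od beyond support for -t c.

```shell
#!/bin/sh
# Emit NUL, BEL, BS, HT, NL, VT, FF, CR and a backslash as raw bytes.
sample() { printf '\000\007\010\011\012\013\014\015\\'; }

compare_od_modes() {
    # od -c is XSI-only, so treat a failure as "not supported here".
    if ! legacy=$(sample | od -c 2>/dev/null); then
        echo "od -c unsupported"
        return 0
    fi
    modern=$(sample | od -t c)
    if [ "$legacy" = "$modern" ]; then
        echo "identical"
    else
        echo "different"
    fi
}

compare_od_modes
```

On the implementations named above this would be expected to print `identical`; a `different` result would point at one of the backslash sequences that POSIX only guarantees for -t c.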

@emanuele6
Member

If you really want to replace -c with -t c because -c is considered legacy, I guess that's fine.

Do you actually have a system that has -t c but not -c? Probably not, so the change will at most have the opposite effect and reduce the portability of the od command; but this test suite likely cannot even run on systems that support od -c but not od -t c, so it does not matter.

But let's not misleadingly bring up LC_CTYPE=C: it has nothing to do with -c vs -t c, implies there is some deep difference between the two that just does not exist, and provides wrong information that confuses everyone. :/

If the only reason why we are making this change is because -c is obsolete syntax, I also want to bring up that -oarg over -o arg is also considered obsolete syntax by POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html#tag_12_01.

@nabijaczleweli
Contributor Author

Are they the same or is POSIX wrong or are implementations wrong?

The spec says they aren't the same because they backslashise a different set of characters. This is neither here nor there really.
It did also use to say they were the same even when they weren't.

POSIX is wrong: no implementation of -c has ever interpreted the data as characters. Confer http://ro.ws.co.ls/od#Standards. Or prowl through historical implementations yourself! Your pick.

In this case the XSI text says "characters" because this is what the sysvr4 manual says:
image

Some may also recognise the fact that previous sysv manuals also said the same thing, and those definitely weren't localised (don't worry, the sysvr4 od implementation isn't either; IIRC it tried to be but it's broken). Some would say this is because they're using "characters" to mean bytes because they're yanks and multi-byte encodings weren't quite in vogue yet.

Compare the sysiii manual, which says the same thing:
image
except that it specifies the encoding used instead of not doing that.

Implementations are wrong:

Compare coreutils:

$ echo ąęć | /bin/od -c
0000000 304 205 304 231 304 207  \n
0000007
$ echo ąęć | /bin/od -tc
0000000 304 205 304 231 304 207  \n
0000007

where -tc and -c are in LC_CTYPE=C.

This is also https://bugs.debian.org/1037048

Compare NetBSD:
image
where -tc and -c are in LC_CTYPE=C.

Compare FreeBSD:

# echo ąęć | od -tc
0000000    ą  **   ę  **   ć  **  \n
0000007
# echo ąęć | od -c
0000000    ą  **   ę  **   ć  **  \n
0000007

where -tc and -c are in the current LC_CTYPE (this is incompatible with other and historical od -c implementations).

Compare OpenBSD:
actually don't since it doesn't have locales. Some would say this also makes it conformant since it's always in LC_CTYPE=C.

Compare voreutils:

$ echo ąęć | od -c
od: invalid option -- 'c'
usage: od [-v] [-j skip] [-N max] [-w[idth]] [-A n|x|d|o] [-t type...]... [--endian={little|big}] [file]...

type: {a|c}                     [z[Z]]
      {x|u|d|o}[1|2|4|8|C|S|I|L][z[Z]]
      f        [4|8|16 |F|D|L]  [z[Z]]
$ echo ąęć | od -tc
0000000   ą  **   ę  **   ć  **  \n
0000007

There is a fundamental difference between -c and -tc: -c is, like many XSI-shaded segments, borderline-unchanged from the sysvr4 manual. The sysvr4 manual had an error and so POSIX has an error.

Actually, why am I telling you, here's what XPG3 says in 1989 (described as "identical to XPG2"):
image

Notice anything? Ah yeah, it's bytes.

In 1991 (final drafts, likely much earlier) POSIX 1003.2 defines -tc in its present-day form. Even then, the History of Decisions Made says
image
image

In 1994, XPG4 (SUSv1) merges into the POSIX usage the XPG3 usage and thus we get
image
image
how they invented this is anyone's guess (the CHANGE HISTORY is thus:
image
the editor probably thought about compatibility and saw the sysvr4 manual which says it operates on "characters" and didn't think to check whether it wasn't lying

just like they thought it was the same as -tc. for good measure since this line couldn't be any more fucked).

Note the EX(tension) shading (now we'd call it XSI, or, as 202x Draft 3 politely puts it,
image

so. y'know. it's an extension. considering your program is supposed to be portable to the BSD, you shouldn't be using these extensions, since the BSD is not compliant with the SVID^WX/Open Systems Interfaces).

Whereas -tc has been invented and defined in rigorous detail and this hasn't really changed.

On that note: how do you expect od -c to behave when it encounters multibyte characters? Because it doesn't actually say. If you take the modern POSIX description at face value, it may make Some sense in something like KOI-8. What would it do if the character is two bytes? Or six? It doesn't say. It's so deeply absent that one wonders if, perhaps, writing that it processes characters may be an error?

Oh and also, just to drive this again again again again:
image


Unclear to me where you're getting this. The USG quite clearly says that (a) -t c and -tc are both valid and mean the same thing and (b) you shouldn't make programs (like pr) that accept both -e and -eargument but not -e argument (what they call "optional option-arguments" –

The Utility Syntax Guidelines in Utility Syntax Guidelines require that the option be a separate argument from its option-argument and that option-arguments not be optional, but there are some exceptions in POSIX.1-2017 to ensure continued operation of historical applications

vs

Guideline 7:
Option-arguments should not be optional.

). This is why pr has an exemptions list for -e, -i, and -s.
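
That the joined and separated spellings are parsed alike is easy to check on whatever od is at hand. A minimal sketch, with `same_parse` as a made-up helper name:

```shell
#!/bin/sh
# Dump the same input with the option-argument joined and separated.
same_parse() {
    joined=$(printf 'a\tb\n' | od -An -tc)
    separate=$(printf 'a\tb\n' | od -An -t c)
    [ "$joined" = "$separate" ]
}

if same_parse; then
    echo "-tc and -t c agree"
else
    echo "-tc and -t c differ"
fi
```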

@nicowilliams
Contributor

nicowilliams commented Oct 4, 2023

We use od -c only in tests, and we don't depend on its output -- it's there just so that failures are easier to diagnose from the tests/test-suite.log. It suffices that od -c outputs something that would be meaningful to someone reviewing tests/test-suite.log.

Also, we don't use non-ASCII text in the cases where we do use od -c in tests/shtest.

The only objection I'd have is if od -tc is not universally implemented or not universally implemented as usefully as od -c.

Are there systems where od -c is not implemented or not implemented as usefully as od -tc?

EDIT: Oh, you're concerned that od -c can be removed at any time.
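
For illustration, a diagnostic dump of that kind could look like the following. `dump_for_log` is a hypothetical helper, not something taken from tests/shtest, and the temporary file name is likewise invented; the sketch only assumes an od that understands -t c.

```shell
#!/bin/sh
# Hypothetical helper: make the raw bytes of a file visible in a test log.
dump_for_log() {
    printf 'contents of %s:\n' "$1"
    od -t c "$1"
}

# Example: a trailing tab and a missing final newline are easy to overlook
# in plain output but show up clearly in the dump.
demo=${TMPDIR:-/tmp}/od-demo.$$
printf 'value\t' > "$demo"
dump_for_log "$demo"
rm -f "$demo"
```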

@nabijaczleweli
Contributor Author

There are, to my knowledge, no extant od implementations that don't have -tc.

It's been the baseline standard for 31 years so this is unsurprising.

@nicowilliams nicowilliams merged commit 0bce9fb into jqlang:master Oct 4, 2023
28 checks passed
@nicowilliams
Contributor

Thanks!

@emanuele6
Member

@nicowilliams Now we have a commit that says "od -c => od -tc: od -c is an XSI extension equivalent to LC_CTYPE=C od -tc and not universally available" even though "od -c is equivalent to LC_CTYPE=C od -tc" is nonsense.


"Compare voreutils:"

You wrote that set of tools.


POSIX is wrong: no implementation of -c has ever interpreted the data as characters. Confer http://ro.ws.co.ls/od#Standards. Or prowl through historical implementations yourself! Your pick.

No. Again, you are misinterpreting the badly worded text of POSIX. In some systems, binary and text files are encoded differently, and in those systems, the getchar() library function and similar functions may perform a conversion of the bytes read based on LC_CTYPE. Since od is a tool that deals with binary files and not text, POSIX spells this out.

POSIX literally says that "bytes shall be interpreted as characters specified by the current setting of the LC_CTYPE locale category" for -t a and -t c even though you insist it only does that for -c.

The type specifier character c specifies that bytes shall be interpreted as characters specified by the current setting of the LC_CTYPE locale category. Characters listed in the table in XBD File Format Notation ( '\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v' ) shall be written as the corresponding escape sequences, except that <backslash> shall be written as a single <backslash> and a NUL shall be written as '\0'. Other non-printable characters shall be written as one three-digit octal number for each byte in the character. Printable multi-byte characters shall be written in the area corresponding to the first byte of the character; the two-character sequence "**" shall be written in the area corresponding to each remaining byte in the character, as an indication that the character is continued. When either the -j skip or -N count option is specified along with the c type specifier, and this results in an attempt to start or finish in the middle of a multi-byte character, the result is implementation-defined.
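
To see how a given od applies this paragraph, the same multi-byte sequence can be dumped under different LC_CTYPE settings. The sketch below is only an illustration: `dump_in_locale` is a made-up name, the bytes are the UTF-8 encoding of ą, and the existence of a locale called C.UTF-8 on the system is an assumption.

```shell
#!/bin/sh
# Dump the UTF-8 bytes of "ą" (octal 304 205) plus a newline under a locale.
dump_in_locale() {
    printf '\304\205\n' | LC_ALL=$1 od -An -t c
}

# In the C locale the two bytes are not printable characters,
# so each should appear as a three-digit octal number.
dump_in_locale C

# Assumption: a locale named C.UTF-8 is installed. An od that decodes
# multi-byte characters would print the character followed by "**";
# one that works on bytes would print the same octal numbers as above.
dump_in_locale C.UTF-8
```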


POSIX is wrong: no implementation of -c has ever interpreted the data as characters. Confer http://ro.ws.co.ls/od#Standards. Or prowl through historical implementations yourself! Your pick.

Stop fooling. YOU WROTE THAT PAGE based on your incorrect interpretation of the POSIX text. You have nothing but self-fabricated pages and utilities to suggest a deep difference in behaviour between -t c (that interprets multibyte sequences based on the encoding specified by the locale and prints a codepoint) and -c (that reads bytes). That is not reality and it is also not what the POSIX text is trying to suggest.


Unclear to me where you're getting this. The USG quite clearly says that (a) -t c and -tc are both valid and mean the same thing and (b) you shouldn't make programs (like pr) that accept both -e and -eargument but not -e argument (what they call "optional option-arguments" –

YOUR documentation for od says

-cxdsbo and [+]skip[{.|b|B}…] are shaded XSI and are provided for compatibility with legacy applications only. Don't use them, guard against them by always specifying a flag (-A is well-suited). The POSIX requirements contradict Version 7 AT&T UNIX (imply only [+]skip[.][b] and don't mention ‘.’ implying -Ad). -f is provided for compatibility with AT&T System V Release 4 UNIX and 4.2BSD. -DFOSX aren't.

Likewise, POSIX says that -o arg, -t c, etc are only supported for historical applications even though support is required. As I said.


Again, I was aware that -c is only required by POSIX for XSI-conformance, but all the other reasoning you have given for the change, and your insistence that -c and -t c are fundamentally different, is just nonsense.

@nicowilliams
Contributor

I've reverted the commit.

emanuele6 added a commit that referenced this pull request Oct 5, 2023
This reverts commit 0e70f7a.

There is no reason to revert this change.

In #2922, I only disagreed with the commit message suggesting that
  LC_CTYPE=C od -t c    is   equivalent to   od -c

The only documented differences are that -tc is required to be
influenced by -N and -j, while -c is not, and that -c is required to
only support a subset of the backslash sequences that -tc should
support.
@emanuele6
Member

emanuele6 commented Oct 5, 2023

There was no reason to revert this. Removing the use of the XSI extension so tests can run correctly with their non-XSI conformant od is absolutely fair.

I was only disagreeing with the claim that od -tc vs od -c have any difference other than the set of backslash sequences they are required to support.

I re-reverted the commit.
