
Feature request: NUL-delimited output #1271

Closed
charles-dyfis-net opened this issue Nov 6, 2016 · 44 comments · Fixed by #1990

Comments

@charles-dyfis-net

Right now, the standard-practice way to read an array from jq into a shell script is to use raw output and parse on newlines.

However, JSON strings can contain literal newlines; this makes such parsing error-prone.

NUL-delimited output, allowing IFS= read -r -d '' string to read exactly one C string unambiguously, would resolve this.
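As a sketch of the consumer side (assuming the requested output mode existed; printf '%s\0' stands in here for jq producing NUL-delimited raw strings):

```shell
#!/usr/bin/env bash
# Consuming a NUL-delimited stream: read -d '' stops at each NUL byte,
# so embedded newlines in the values survive intact.
# printf '%s\0' stands in for the proposed jq NUL-delimited output.
count=0
while IFS= read -r -d '' string; do
  count=$((count + 1))
  printf 'item %d: %q\n' "$count" "$string"
done < <(printf '%s\0' $'multi\nline value' 'plain value')
```

The loop sees exactly two items, with the embedded newline preserved in the first.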

@eric-brechemier
Contributor

@charles-dyfis-net is it not simpler in this case to keep newline escaping instead of using raw output? That keeps a single item per line, which is easier to loop over in a shell script:

input.json

[
  "LF\nLF",
  "TAB\tTAB",
  "FF\fFF"
]

Filter

.[]

Command Line

$ jq '.[]' input.json

Output

"LF\nLF"
"TAB\tTAB"
"FF\fFF"

Otherwise, you can actually add a character of your choice at the end of each line, directly from your jq filter:

Filter + NUL

.[]
| ( . + "\u0000")

Command Line + NUL

$ jq '.[] | ( . + "\u0000")' input.json

Output + NUL

"LF\nLF\u0000"
"TAB\tTAB\u0000"
"FF\fFF\u0000"

Command Line + NUL as Raw (View as Hex)

$ jq -r '.[] | ( . + "\u0000")' input.json | xxd

Output + NUL as Raw (Viewed as Hex)

0000000: 4c46 0a4c 4600 0a54 4142 0954 4142 000a  LF.LF..TAB.TAB..
0000010: 4646 0c46 4600 0a                        FF.FF..

@charles-dyfis-net
Author

Thank you -- I actually have a few StackOverflow answers I'm going to want to amend in light of the patterns suggested in this ticket.

That said, this still would be a desirable feature to have.

Newline escaping requires the consumer's code to perform unescaping -- and while printf '%b' is POSIX-defined, it's hardly a common idiom, and without extensions such as bash's printf -v, the command substitutions used to invoke it are themselves side-effecting, stripping trailing newlines. Moreover, a missing unescaping step is only visible/obvious in the error case, whereas reading a NUL-delimited stream as a line-delimited one (or the inverse) is an easily detected failure. Finally, whereas common tools (xargs -0, sort -z, etc.) can deal with NUL-delimited streams, very few correctly grok "newline-delimited text, but with the specific correct set of escape sequences".
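For illustration, the %b mechanics and the command-substitution side effect mentioned above (a sketch; printf -v is the bash extension in question):

```shell
#!/usr/bin/env bash
# Unescaping with POSIX printf %b, and the trailing-newline stripping
# that a command substitution performs as a side effect.
escaped='line1\nline2\n'
printf -v decoded '%b' "$escaped"     # bash extension: no subshell involved
via_subst=$(printf '%b' "$escaped")   # $(...) strips the trailing newline
printf 'decoded keeps %d chars, substitution keeps %d\n' \
  "${#decoded}" "${#via_subst}"
```

The printf -v form retains all 12 characters of the decoded text; the command-substitution form silently loses the trailing newline, keeping 11.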

The patterns given here are helpful: though \x00\x0a is a bit harder to process on the consumer side than just \x00 (for purposes of xargs -0 &c), it's certainly better than where we were without them.

@thedward

@charles-dyfis-net

If you use -j instead of -r then it won't output the newline (\u000a) characters.

@wtlangford
Contributor

JSON (at least RFC 7159 JSON) does not permit unescaped ASCII control
characters (U+0000 through U+001F), which include the newline/linefeed
character. jq neither accepts nor outputs JSON strings containing literal newlines.

I'm not sure how you've come across this as an issue. Can you show me a
use case for this?


@charles-dyfis-net
Author

charles-dyfis-net commented Nov 12, 2016

@wtlangford, gladly.

Consider the following contrived example:

#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r item; do
  printf 'Shell script interpreted item as: %q\n' "$item"
  printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -r '.[] | .value' <<<"$input_json")

...where the intended output is something equivalent to the following (not all ksh-derivative shells implement printf %q in exactly the same way):

Shell script interpreted item as: $'I am\na multiline\nvalue\twith a tab'
...as a literal: <<<I am
a multiline
value   with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>

Instead, as given above, the actual output is:

Shell script interpreted item as: I\ am
...as a literal: <<<I am>>>
Shell script interpreted item as: a\ multiline
...as a literal: <<<a multiline>>>
Shell script interpreted item as: $'value\twith a tab'
...as a literal: <<<value       with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>

Now, to fix this, we can use NUL delimiters. That would modify our expression to be something like the following:

#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r -d '' item; do
  printf 'Shell script interpreted item as: %q\n' "$item"
  printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -j '.[] | .value | (. + "\u0000")' <<<"$input_json")

...and it does in fact work exactly as desired. The only problem is that it requires the user to use some idioms that aren't completely obvious unless they read this ticket. :)

@wtlangford
Contributor

Ah, I see -- you're using the raw output mode. It does, as you've found,
output unescaped newline characters, since it outputs the values of the JSON
strings and not the strings themselves. :)

I see your use case now. I'm not strictly averse to adding a new flag, but
at the same time, we try not to add new flags to the binary. I'd
definitely like to see some form of this added to the wiki, though.


@eric-brechemier
Contributor

@charles-dyfis-net you could also keep the list of values encoded as JSON, then use jq again within the loop to decode each JSON value into a raw string:

#!/bin/sh
{
  jq '.[] | .value' << INPUT_JSON
[
  {"value": "I am\na multiline\nvalue\twith a tab"},
  {"value": "I am a second value"}
]
INPUT_JSON
} | {
  while read -r jsonString
  do
    printf 'JSON Value: <<<%s>>>\n' "$jsonString"
    printf 'Text Value: <<<%s>>>\n' "$( jq -r -n "$jsonString")"
  done
}
JSON Value: <<<"I am\na multiline\nvalue\twith a tab">>>
Text Value: <<<I am
a multiline
value   with a tab>>>
JSON Value: <<<"I am a second value">>>
Text Value: <<<I am a second value>>>

The conversion from JSON to text is done by jq -r -n "$jsonString": the JSON string is provided as the filter (with the -n flag, so no input is read), and it prints itself as a raw string thanks to the -r flag.

@charles-dyfis-net
Author

@eric-brechemier, noted, though that's considerably less efficient than a single jq run.

I think I'm entirely happy with @wtlangford's suggestion of treating this as a doc enhancement rather than a software enhancement -- now it's just a question of whether and when I have the time to assign this to myself and generate a wiki edit incorporating the many suggestions given here. :)

@eric-brechemier
Contributor

@wtlangford without adding a new flag, you could repurpose the -j flag to accept an optional argument:

-j # join with empty character
--join-output='\u0000' # join with NUL

@pkoppstein
Contributor

pkoppstein commented Nov 17, 2016

It seems to me that the matter of enhancing jq to support "joining with NUL" is of rather low priority, and certainly much lower than several other issues (notably the release of jq 1.6).

In any case, I suspect that most users who actually have the need to join with NUL can simply use the idiom:

   jq -c ..... | tr '\n' '\0'

That is, I suspect that most such users are working in an environment that has tr.

If using tr is not an option, then chances are that using the -c option in some other way, perhaps in conjunction with jq's support for @TSV and/or "\u0000", will suffice to solve the problem at hand.

Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.

(Note: To convert a stream of JSON texts to json-seq, one could use the form: jq -n --seq --slurpfile in <(STREAM) '$in[]' )

@charles-dyfis-net
Author

@pkoppstein, tr does not address the use case given in the sample code above, where there is a need to distinguish literal newlines from delimiter newlines. Conflating the two (by converting all newlines to delimiters) reintroduces the very ambiguity this feature -- by selecting a delimiter not allowed in JSON strings even in escaped form -- is intended to address.

@pkoppstein
Contributor

@charles-dyfis-net - My point is that one can use jq -c (without the -r option) to insert the NULs, and then later on in the processing convert to "raw output" if that is really needed.
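That two-pass idiom might look like the following sketch (the input and the .value path are borrowed from the earlier example; note the extra jq process per item):

```shell
#!/usr/bin/env bash
# Sketch of the two-pass idiom: jq -c emits one JSON value per line
# (newlines inside strings stay escaped), tr swaps the line delimiters
# for NULs, and a second jq invocation decodes each item to raw text.
input_json='[{"value": "I am\na multiline\nvalue"}, {"value": "second"}]'
jq -c '.[] | .value' <<<"$input_json" | tr '\n' '\0' |
while IFS= read -r -d '' json; do
  value=$(jq -r . <<<"$json")   # one extra jq process per item
  printf '<<<%s>>>\n' "$value"
done
```

The tr step is what makes the NUL-delimited read loop possible; the per-item jq run is the overhead discussed in this thread.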

@charles-dyfis-net
Author

charles-dyfis-net commented Nov 17, 2016

@pkoppstein, ...so what you have then is essentially the same proposal @eric-brechemier offered of using multiple passes, with the same performance overhead -- namely, the need to invoke a separate jq instance for each output item to convert it to its final raw form.

@pkoppstein
Contributor

@charles-dyfis-net - My comments were mainly directed to the question of whether joining with NUL is really needed, not to the example which you yourself described as contrived.

For non-contrived problems, I suspect your concerns about efficiency are probably misplaced. Consider, for example, pipelines of the form:

while read -r line ; do MUNGE <<< "$line" | jq WHATEVER ; done < <(jq -c HEAVYLIFTING)

In realistic scenarios, the additional cost associated with the inner invocations of jq will almost certainly be relatively small, perhaps even to the point of insignificance if reasonable care is taken with the details.

The real issue here is probably #147

@eric-brechemier
Contributor

Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.

@pkoppstein are you referring to this?

@pkoppstein
Contributor

@eric-brechemier - That does seem to be related.

@nicowilliams
Contributor

So, yeah, a -0 would actually be nice.

@pvdb

pvdb commented Feb 12, 2018

So, yeah, a -0 would actually be nice.

Yes please... pretty, pretty please!

@pabs3
Contributor

pabs3 commented Oct 14, 2019

I was thinking of working on this (it looks pretty simple), which option do people want?

  • -0 / --nul-output
  • -j/--join-output '\u0000'

Personally I think I would prefer the first one.

@pabs3
Contributor

pabs3 commented Oct 14, 2019

I ended up implementing the first option, but I'll be happy to change the PR to the other option if people prefer that.

@eric-brechemier
Contributor

Thanks! I suggested the second option to address the reluctance to introduce a new flag.
But using -0 directly would make the usage simpler.

@andrii-pukhalevych

When can we expect a release with this --nul-output support?

@pcworld

pcworld commented Sep 16, 2021

Note that JSON strings can also contain null bytes ("\u0000"), which could break the --nul-output feature and in some cases might be a security issue. I'm not sure there's a way to split such values properly that would be supported by POSIX shells.
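A sketch of the hazard: if a value itself contains a NUL byte and the producer appends NUL delimiters blindly, the consumer sees two items where there was one (printf stands in for the producer here):

```shell
#!/usr/bin/env bash
# One logical value, "evil<NUL>payload", followed by the delimiter NUL:
# the consumer cannot distinguish data bytes from the delimiter.
produce() { printf 'evil\0payload\0'; }
count=0
while IFS= read -r -d '' item; do
  count=$((count + 1))
done < <(produce)
echo "$count"   # prints 2, not 1
```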

@pabs3
Contributor

pabs3 commented Sep 18, 2021 via email

@pabs3
Contributor

pabs3 commented Sep 18, 2021 via email

pabs3 added a commit to pabs3/jq that referenced this issue Sep 18, 2021
pabs3 added a commit to pabs3/jq that referenced this issue Sep 21, 2021
@vdukhovni

vdukhovni commented Oct 17, 2021

There is not, and cannot be, a solution or workaround for failing to properly encode data into the syntax of the consuming application. The only reason that -0 works with xargs et al. is that filenames found in the filesystem are NUL-terminated and can't contain ASCII NUL characters.

The -r option works correctly for output of properly encoded strings in some non-JSON format. These could be just raw lines if the output is plain text, but if it is expected to have some structure (consist of elements that are not necessarily "lines") then the jq program needs to generate that structure. This is fundamental in all applications that serialise and deserialise data. So no new features to try to paper over the problem are warranted or desirable.

As to documentation, the news won't reach the audience that most needs it, they'll just cargo-cult some naïve code and suffer the consequences.

If some guidance to the perplexed is to be delivered, it should be quite clear, that this is fundamentally a correctness issue that is germane to all programming languages and essentially all data formats. Yes, there can be security consequences to getting this wrong, but even absent a security issue, the result is liable to be wrong in various corner or even common cases.

In terms of working with shell commands, the jq interpreter has an @sh serialiser that robustly quotes strings as potential literal arguments for shell commands:

$ jq -nr '["echo","foo\nbar\nbaz", "$HOME"]| @sh'
'echo' 'foo
bar
baz' '$HOME'

and thus assuming the arguments are validated as part of building the shell command, one can be sure that the command is executed as intended, without deserialisation errors:

$ jq -nr '["echo","foo\nbar\nbaz", "$HOME"]| @sh' | sh -
foo
bar
baz $HOME

If the output is an SQL query, then the serialisation needs to be escaped correctly for the intended SQL dialect (perhaps not a job for JQ, and so one might pass JSON into some other tool that has an SQL API and can quote SQL data).

So while I am not ultimately opposed to some mention of the issues in the docs, I don't think the currently pending PR is the right way to handle this.

@pabs3
Contributor

pabs3 commented Oct 17, 2021 via email

@vdukhovni

I wonder if the -0 option should just get removed.

That would be my recommendation. IIRC it has not been released yet, and if so, it should not be released.

Probably also the -r option should be deprecated or removed too, in favour of external programs checking and transforming the JSON output of jq into the needed formats.

No, sorry, that would be completely unacceptable. It makes @csv and @sh, for example, completely useless, along with the various contexts where unstructured text output is fine, or where the user's jq program constructed a robust serialisation.

Just because some users are sloppy CANNOT mean that jq is then made unusable for everyone else. The cargo cultists can shoot themselves in the foot in any language, and jq is by far one of the safer choices.

They can also print raw strings in Python, Perl, ... and I don't see any warnings in those languages about the dangers of text output.

@pabs3
Contributor

pabs3 commented Oct 17, 2021 via email

@vaab

vaab commented Feb 8, 2023

@vdukhovni The @sh output still requires the equivalent of an evaluation in bash, which is costly (and can be risky, and needs to be treated with care). Second, neither YAML nor JSON, nor a lot of other data, contains the NUL char (or is expected to contain it). Most shell code would gain substantial time in most cases if jq could directly output raw strings (the -r case) and separate values with the NUL char via a -0: the shell glue wouldn't need to exist, and you could pipe jq directly to other processes. If the output of a jq query happens to contain a 'rogue' NUL char itself, I would expect -0 to bail out with an explicit error, so that the calling code can warn the user that its expectations were broken, like any normal syntax error.

You are suggesting that all data needs some formatting, when here it actually does not: shell variables can hold any binary data that doesn't contain the NUL char, and pipes handle any binary data. As long as a program can ensure that it properly uses NUL and that the values it separates contain no NUL chars, you'll be orders of magnitude faster and safer than going through converters, formatters, and re-interpretation of data.

As I see it, jq is for the command line -- not for the shell, but for processes. The command line is systems programming: it is all about binary data and NUL-separated values. It is crude, but efficient. You are talking directly to xargs, find, grep, git, etc.

For these reasons, and for what it is worth, I'm not in favor of removing the -0 option. But I am clearly in favor of bailing out with an error when this option is used and one of the separated values contains a NUL char.

@vdukhovni

What is the compelling use-case for extracting a stream of multi-line strings from a JSON document to feed into a program that supports NUL-separated inputs?

For xargs, cpio, ... the compelling use-case is that they can consume the output of find ... -print0. Where does jq enter into this picture?

I don't want to give users a false sense of security. Any "raw" output form (be it -r or the proposed -0) carries risks of various injection-style attacks, and the user should not assume safety.

The suggestion to fail if an item for raw output already contains a NUL does provide some safety, at the cost of throwing errors that should have been handled in some manner before attempting to serialise the data in question as a NUL-separated (terminated) stream.

If that's to be done, then one might argue that the same should be available (another option?) with newline-separated output, but even protecting against separator injection is not generally sufficient, sometimes injection of unexpected spaces or unexpected ../ path components, ... are also problematic.

So if such a feature is to be provided, it should be more general:

$ jq --raw-terminator <codepoint> ...

This would support -0 as well as the current -r, but with a guaranteed absence of newlines in each output item.
In such a case, it should also be configurable whether to skip the problem item or to terminate.

All that said, I am not convinced there are compelling practical and then sufficiently safe use-cases for this sort of feature.

@pabs3
Contributor

pabs3 commented Jul 10, 2023

The idea is that you have some JSON data and want to safely pass parts of it to other programs via either stdin or command-line arguments. So you process the data with jq, output the data with a safe separator (usually NUL) and use xargs to convert stdin to command-line arguments. For extra safety you pass an option processing terminator before the arguments.

curl https://example.com/foo.json | jq -0 '.[].foo' | xargs -0 foo -- | ...
curl https://example.com/foo.json | jq -0 '.[].foo' | sort -z | ...

Agreed that injection attacks are always possible. The existing documentation for -r and -j should mention this problem. Protecting against them can't be the sole responsibility of jq though, since you never know what people are passing the output of it to and how badly they are handling it. Even if jq only passes JSON along instead of raw data, subsequent commands could mishandle that too. The documentation probably should have a section on jq and safety listing all the possible attacks.

The handling of the failure when encountering output separators in the output data could be done by jq withholding all output until all of the input is processed. Or you could leave it to subsequent commands to handle the error exit code (likely via shell set -o pipefail) and partial output.

Without having the -0 feature, people are going to continue to use the jq -j '.foo + "\u0000"' workaround when they want some semblance of safety, but still be subject to injection attacks. Of course, it is more likely they will just use -r, not think about newlines in the input, and still be subject to the same attacks. It is unlikely they would bother with the alternative of writing a script/program to process the JSON output of jq; if they were going to do that, they would never have used jq in the first place.
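That workaround, end to end, looks something like this sketch (the .foo path and file name are illustrative; printf stands in below so the sketch runs without input data):

```shell
#!/usr/bin/env bash
# The -j workaround: append a NUL after each raw value (written as
# "\u0000" in the jq filter) and let xargs -0 split on the delimiters.
# With real data this would be:
#   jq -j '.[] | .foo + "\u0000"' foo.json | xargs -0 printf '<%s>\n'
printf '%s\0' $'two\nlines' 'plain' | xargs -0 printf '<%s>\n'
```

xargs -0 receives exactly two arguments, the first still containing its embedded newline.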

You can see here the original context where I personally wanted to use -0: getting some date/version data from an API, safely comparing versions using dpkg, saving the data to files and doing git bisect on the results. Probably more of that could be done within jq than what I wrote, but it wouldn't be possible to do the dpkg version comparison without having the data leave JSON mode, go into raw mode, and then into dpkg commands. Other folks probably have other use-cases.

@nicowilliams
Contributor

The idea is that you have some JSON data and want to safely pass parts of it to other programs via either stdin or command-line arguments. So you process the data with jq, output the data with a safe separator (usually NUL) and use xargs to convert stdin to command-line arguments. For extra safety you pass an option processing terminator before the arguments.

curl https://example.com/foo.json | jq -0 '.[].foo' | xargs -0 foo -- | ...
curl https://example.com/foo.json | jq -0 '.[].foo' | sort -z | ...

Ah, but if you just use jq -c then you don't need -0, because jq -c will not output newlines within the JSON text, only after each JSON text; therefore it is safe to run curl https://example.com/foo.json | jq -c '.[].foo' | xargs foo -- | ....

@vdukhovni

vdukhovni commented Jul 10, 2023

Thanks for the examples. FWIW, instead of attempting to carefully serialise whatever happened to come in, I'd have restricted the values to a known safe subset:

$ curl -s https://snapshot.debian.org/mr/binary/perl/ |
  jq -r '.result[].binary_version | select(test("^[-.+:~0-9a-zA-Z]+$"))'

This is then safe to newline separate, and easier to work with. And I'd probably also take care with positional arguments that might look like short or long options, thus make sure to include a -- at the appropriate point in constructed command-lines:

sh -c '
    dpkg --compare-versions -- "$1" ge 5.24.1-3 &&
    dpkg --compare-versions -- "$1" le 5.28.1-6 &&
    printf "%s\0" "$1"'

Finally, it is still not clear to me whether the correct thing to do with unexpected values is to abort, or just to skip that value.
Safe serialisation of untrusted data sadly requires attention to detail; there's no silver bullet.
So even if there's sufficient user-community support for -0, it would have to come with sufficient disclaimers not to create a false sense of security. I'd still recommend being explicit about validation, and documenting some examples (perhaps in a project wiki linked from the manpage, if too intrusive for the main reference document).

@pabs3
Contributor

pabs3 commented Jul 10, 2023

The jq -c option isn't useful here because the commands being passed data don't support JSON and -c outputs JSON.

I've updated the wiki page to include your dpkg -- suggestion, thanks. I'm not sure how I feel about the select suggestion.

There could be an option for choosing to skip or abort on separator bytes in the output items, that could be made mandatory for -0/-r/-j to ensure that people think about injection possibilities and corresponding error handling.

The manual page is reasonably long as-is, so it feels OK to add a new section about safety in general, then the -0/-r/-j documentation could refer to the subsection of that about separator injection. The wiki page idea sounds good for extra examples too.

@nicowilliams
Contributor

The jq -c option isn't useful here because the commands being passed data don't support JSON and -c outputs JSON.

Ah, then do this:

a) use jq -j,
b) in your jq program check whether inputs have embedded delimiters and reject or map those,
c) output whatever outputs and a delimiter.

This is much more general than --nul-output. It does put the onus on you to make sure that your jq program does the right thing, and I think that's quite fair.

@pabs3
Contributor

pabs3 commented Jul 10, 2023

That is a lot more complicated for folks who know shell much better than jq.
A command-line option for -0 would make it much easier for them.
A subset of them will just YOLO it, use -r, and skip (b) and (c) anyway.
The proposed mandatory skip/abort option would make the -r folks safer.

@nicowilliams
Contributor

That is a lot more complicated for folks who know shell much better than jq. A command-line option for -0 would make it much easier for them. A subset of them will just YOLO it, use -r, and skip (b) and (c) anyway. The proposed mandatory skip/abort option would make the -r folks safer.

I'm thinking we'll make -0 mean NUL-delimited input, and we might keep (but deprecate?) --nul-output.

@pabs3
Contributor

pabs3 commented Jul 10, 2023

Hmm, I thought jq always required JSON input, not randomly formatted input. Changing the meaning of an option is a major backwards compatibility issue too, so please don't do that.

PS: my request to update the documentation to mention injection issues was already rejected in #2350, I can resubmit that if it is wanted.

@nicowilliams
Contributor

nicowilliams commented Jul 10, 2023

Hmm, I thought jq always required JSON input, not randomly formatted input.

There's -R which means "raw input".

Changing the meaning of an option is a major backwards compatibility issue too, so please don't do that.

-0 hasn't shipped in any version of jq.

PS: my request to update the documentation to mention injection issues was already rejected in #2350, I can resubmit that if it is wanted.

There's no need to re-submit it. I'll review #2350.

@nicowilliams
Contributor

See also #2659.

@svdb0

svdb0 commented Jul 10, 2023

Repeating what I said in #2659, if --nul-output is retained, I suggest renaming it --raw-output0, for the following reasons:

  • It is intuitive; the added 0 suggests 'as --raw-output but with null bytes'.
  • Some standard tools use a similar naming: find -print0, rsync --from0, xz --files0, du --files0-from, wc --files0-from.
  • --nul-output suggests a symmetry with --null-input (-n), but they are completely different.
  • If Feature request: support for streaming input delimited by null characters #2659 is accepted, you'll have matching --raw-input0 and --raw-output0.

There is another option though: you could have a format filter @null similar to @json, @sh, @csv, etc.
It is after all just another way to format your output.
Or (and?), more general: @delimited("\u0000").

I also like the idea suggested here, of having jq raise an error when the value to be output contains the terminator character.
In particular when it is treated as just another output format filter (@null/@delimited("\u0000")), because in this case it is just another instance of 'the value cannot be encoded in the output format', which could happen for other formats too (depending on the format, and even more so if jq were ever to support arbitrary byte sequences).
Because almost always, when you use some character(s) as a delimiter, you do not intend them to occur in the fields themselves.
And if you really do mean to do that, there's still (. + "\u0000"), but then it's a conscious choice. Better to have the default be the more secure option.

I agree with @vdukhovni's comments in #1271 that if you're about to output a value containing your separator, something else is probably wrong (e.g. missing input validation).
But the fact is that people will make mistakes, and many may not even be aware of the issues, and for such cases, raising an error will add a welcome extra layer of protection.

@vdukhovni

I have no objections to @nul as an output format, to be used in combination with -j. The -0 option can then, as Nico suggested, be used more naturally as a parallel to -R on input, to read raw nul-delimited strings.

FWIW, I use NUL for the ASCII code point and NULL for the pointer, but if that's considered obscure/esoteric by others, I can live with @null (which to me also suggests the JSON null, which is unrelated).

Finally, I am not sure whether this should throw an error, or just drop non-conforming inputs. I'd be inclined to silently drop them, and if someone wants errors, they can arrange for that with explicit checks, or we could have two versions:

- @nul
- @enul 

With @enul throwing an error. Unlike command line flags, adding new conversion forms seems cleaner to me.
We could even add @nl and @enl. And then have safer (but usual disclaimers about residual syntax issues apply) new-line separated output (again via -j).

@svdb0

svdb0 commented Jul 12, 2023

In #2660 we were discussing methods for handling errors — in that case in the input.
I think some of the same considerations hold here.
In particular, there are more options for what to do when an error is encountered than just silently dropping it or raising an error. See #2660 (comment).

I would personally prefer a more general way to specify what to do with encoding errors, rather than having multiple versions of each relevant output format.

I'm also not in favour of silently dropping non-conforming characters as a default, as I'm against possibly surprising behaviour.
I hold the opinion that the default should be secure, and lowering security should be a conscious decision.

Perhaps this could be the way to override the default error handling behaviour, inspired by #2660 (comment) :

@null({unencodable:"skip"})

Regarding 'NUL'/'null':
I want to avoid making this a bikeshedding exercise, and I don't have a strong opinion on this, but for the consideration of the reader, some collected information:

The ASCII standard (also RFC 20) uses 'NUL' as an acronym for what it calls the 'null character'.

ISO/IEC 6429:1992 does the same thing, and so do UNICODE, and POSIX, referring to the ISO standard.
UNICODE also has U+2400 as the symbol for NULL, represented graphically as 'NUL' diagonally (␀).

The C standard does not mention 'NUL' at all, and only talks about the 'null character'.

ECMAScript mentions the code unit 0x0000 (NULL) and U+0000 (NULL), but also says that \0 represents the <NUL> character.
The JSON standard does not refer to the character/byte/code point at all.

And for what it's worth, Wikipedia currently calls it the 'null character', 'often abbreviated as NUL (or NULL, though in some contexts that term is used for the null pointer)'.

So I'd say 'null' is the name of the character/byte/code point, but 'NUL' is a common abbreviation, which has the advantage of being unambiguous.
Which is better for the purpose of specifying an output format? I don't know.
