@b_str removes backslashes twice #39092

Open
mgkuhn opened this issue Jan 4, 2021 · 14 comments
Labels
breaking This change will break code macros @macros strings "Strings!"

@mgkuhn
Contributor

mgkuhn commented Jan 4, 2021

The byte-array literals syntax

julia> @show b"hi\n";
b"hi\n" = UInt8[0x68, 0x69, 0x0a]

is currently implemented as

"""
    @b_str

Create an immutable byte (`UInt8`) vector using string syntax.
"""
macro b_str(s)
    v = codeunits(unescape_string(s))
    QuoteNode(v)
end

This implementation hides a rather counter-intuitive and undocumented property: in certain situations, the unescaping procedure that removes backslashes is applied twice. As a result, a user needs no fewer than five (5) backslashes to obtain the byte sequence of the ASCII string \":

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x22]

Julia's raw strings use the following escaping rule:

  • if a " is preceded by 2n+1 backslashes, these are replaced by n backslashes, and the " is passed through literally
  • if a " is preceded by 2n backslashes, these are replaced by n backslashes, and the " acts as the string terminator

(This is also the escaping mechanism that the Microsoft C runtime library uses when parsing quoted strings from the Windows command line into argv.)
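
As a quick illustration of the two rules above, here is a minimal sketch using the built-in raw string literal; the variable names are only illustrative:

```julia
# 2n+1 backslashes before a quote: n backslashes are kept and the quote is literal.
odd_case = raw"\\\""      # source text: 3 backslashes, then a quote
println(collect(odd_case))

# 2n backslashes before a quote: n backslashes are kept and the quote terminates.
even_case = raw"\\"       # source text: 2 backslashes right before the closing quote
println(collect(even_case))

# Backslashes that are not followed by a quote are passed through unchanged.
plain_case = raw"a\\b"
println(collect(plain_case))
```

Printing the characters with collect avoids having to read the result through yet another layer of string escaping.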

This removal of backslashes before " occurs not only in raw strings, but in all non-standard string literals, which are just macros ending in _str. This can be seen from the trivial implementation of the macro behind raw string literals, which is just the identity function:

macro raw_str(s); s; end

Therefore, when b"\\\\\"" is processed, backslashes are removed in the following two steps:

  1. The raw-string parser replaces 5 = 2×2+1 backslashes in front of the " with 2 backslashes
  2. The call to the unescape_string() function by macro @b_str() replaces the remaining \\ with \.
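
The two steps can be reproduced separately. This is only a sketch: it assumes that raw"…" hands back exactly the string that @b_str receives, which follows from @raw_str being the identity as shown above.

```julia
# Step 1: what the parser passes to the macro for the source text b"\\\\\""
received = raw"\\\\\""
println(collect(received))    # the characters left after the 2n+1 rule

# Step 2: what @b_str then does with that string
bytes = codeunits(unescape_string(received))
println(bytes)

# The combined result is the same as the literal itself
println(bytes == b"\\\\\"")
```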

This duplicate backslash reduction is entirely unnecessary in non-standard string literals where the corresponding macro calls unescape_string(), because that function already performs the same \\ → \ and \" → " mapping that is behind the 2n+1 rule of the raw-string processing. This redundant processing is also likely to surprise users, especially since the documentation does not warn about it at all. It certainly surprised me!

There is a simple workaround in the case of @b_str(), namely to undo the backslash removal performed by the raw-string processing, using Base.escape_raw_string:

import Base.@b_str
macro b_str(s)
    v = codeunits(unescape_string(Base.escape_raw_string(s)))
    QuoteNode(v)
end

Now we get

julia> @show b"\\\"";
b"\\\"" = UInt8[0x5c, 0x22]

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x5c, 0x22]

which seems much more intuitive and unsurprising.
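
The effect of the workaround can also be checked without redefining the macro, by applying the same transformation to the string a macro would receive. A minimal sketch; fixed_bytes is a hypothetical helper name, not something in Base:

```julia
# Hypothetical helper: same body as the revised macro, but as a plain function
# applied to the string that the parser would hand to a string macro.
fixed_bytes(received::AbstractString) =
    codeunits(unescape_string(Base.escape_raw_string(received)))

# Base.escape_raw_string restores the backslashes that raw-string parsing removed
restored = Base.escape_raw_string(raw"\\\\\"")
println(collect(restored))

println(fixed_bytes(raw"\\\""))      # source text had 3 backslashes
println(fixed_bytes(raw"\\\\\""))    # source text had 5 backslashes
```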

But @b_str() may be just one example of a type of non-standard string literal that further processes the string received with unescape_string(), or with any other function that uses backslashes as escape symbols, and therefore performs the same \\ → \ and \" → " mapping. If this is indeed the case, then perhaps the compiler mechanics behind non-standard string literals should not remove any backslashes at all, and leave this to the author of the macro? The 2n+1 vs 2n rule would then merely be used to identify the terminating quotation mark, but all characters before that would be passed through to the macro untouched.

@mgkuhn mgkuhn added macros @macros strings "Strings!" labels Jan 4, 2021
@mgkuhn mgkuhn changed the title @b_str removes backslahes twice @b_str removes backslashes twice Jan 5, 2021
@vtjnash
Member

vtjnash commented Jan 5, 2021

This is intentional and, I believe, documented.

@vtjnash vtjnash added the breaking This change will break code label Jan 5, 2021
@clarkevans
Member

clarkevans commented Jan 5, 2021

@vtjnash The help for @b_str makes no reference to the semantics of @raw_str, nor does it have examples that demonstrate these edge cases. Perhaps the only way to address this is to better document the unexpected behavior. More generally, should we ensure that all string macros, such as regex, document these edge cases?

@clarkevans
Member

@mgkuhn You are attempting to make this invariant hold?
@b_str("…") == b"…" for all "…"

@heetbeet

heetbeet commented Jan 5, 2021

Should the invariant also hold that
@b_str("""…""") == b"""…""" for all """…"""

Because I don't think you will be able to achieve both. I can't find a good example to explain my suspicion though.

@mgkuhn
Contributor Author

mgkuhn commented Jan 5, 2021

@clarkevans @heetbeet No, neither of your suggested invariants is a reasonable goal, nor is either achievable: "..." interprets backslashes and so does @b_str, so combining both in @b_str("...") will still interpret backslashes twice:

julia> @show @b_str("\x5c\x6e");
b"\n" = UInt8[0x0a]

julia> @show b"\x5c\x6e";
b"\x5c\x6e" = UInt8[0x5c, 0x6e]

(Same with triple quotes, which make no difference here.)

@heetbeet

heetbeet commented Jan 6, 2021

Okay I see I made a mistake in my code, and I expect the same happened to @clarkevans. Let's try again.

Should the invariant hold that

codeunits("…") == b"…" for all "…"
codeunits("""…""") == b"""…""" for all """…"""

@heetbeet

heetbeet commented Jan 6, 2021

In my initial post I expected that this cannot hold for " and """ syntax simultaneously. But after consideration I changed my position. I couldn't find any counterexample to support my claim. It seems both use the same escape semantics, except that """ allows for unescaped ". But since it still supports the escape \" -> ", I think a mapping can be built to support both, since any raw " that is received can be made escape-proof by turning it into \". I'll try to add code examples.

@heetbeet

heetbeet commented Jan 6, 2021

@mgkuhn It seems your revised code has exactly this property for the examples I tried:

Before fixing b_str

b"""" """ == codeunits("""" """) #true
b"""\" \\""" == codeunits("""\" \\""") #true
b"\" \\" == codeunits("\" \\") # true
b"\\\\\\" == codeunits("\\\\\\") #false
b"""\\\" \\""" == codeunits("""\\\" \\""") #false
b"""\\\\" \\""" == codeunits("""\\\\" \\""") #false

After fixing b_str

import Base.@b_str

macro b_str(s)
   v = codeunits(unescape_string(Base.escape_raw_string(s)))
   QuoteNode(v)
end

b"""" """ == codeunits("""" """) #true
b"""\" \\""" == codeunits("""\" \\""") #true
b"\" \\" == codeunits("\" \\") # true
b"\\\\\\" == codeunits("\\\\\\") #true
b"""\\\" \\""" == codeunits("""\\\" \\""") #true
b"""\\\\" \\""" == codeunits("""\\\\" \\""") #true

@mgkuhn
Contributor Author

mgkuhn commented Jan 6, 2021

@heetbeet None of your invariants can be true unless you exclude $, because "..." can also interpolate variable expressions (i.e., $ is a meta-character that splits what looks like a string literal into an array of values and wraps that with function calls that iterate over that array and join it with Base.print_to_string to a dynamically allocated string at runtime), whereas special-string literals do not interpolate (because they always are raw strings), and therefore can be processed by macros as compile-time string literals:

julia> a=1;

julia> @show b"$a";
b"$a" = UInt8[0x24, 0x61]

julia> @show codeunits("$a");
codeunits("$(a)") = UInt8[0x31]

Same for """, which again makes no difference here.

(I see how the discussion here evolves once more as evidence for widespread misunderstandings of how Julia's many different string literals work and relate to each other.)

@heetbeet

heetbeet commented Jan 6, 2021

I see what you mean, I forgot about those.

@mgkuhn
Contributor Author

mgkuhn commented Jan 6, 2021

@vtjnash What was the design rationale for the current behaviour?

Wouldn't it be cleaner to separate for special strings the following two operations:

  1. decide where the end delimiter is (done by the parser, using the 2n(+1) backslashes rule), and
  2. the interpretation and substitution of backslashes as escape characters (done by the special-string literal macro)

?

This separation could be introduced in a non-breaking way by offering a new, alternative interface for special-string literal macros, such that existing string literal macros continue to receive what they get at present (i.e., some backslashes removed).
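
A minimal sketch of the separation proposed in the two steps above, assuming a hypothetical helper find_terminator (it is not part of Julia's parser or Base). Step 1 only locates the closing quote with the 2n/2n+1 rule and leaves every backslash in place; step 2 is left to the macro, here plain unescape_string.

```julia
# Hypothetical helper: `src` is the source text that follows the opening quote.
# Returns the index of the terminating quote, or nothing if there is none.
function find_terminator(src::AbstractString)
    nbackslashes = 0
    for (i, c) in pairs(src)
        if c == '\\'
            nbackslashes += 1          # count the current run of backslashes
        elseif c == '"' && iseven(nbackslashes)
            return i                   # 2n backslashes: this quote terminates
        else
            nbackslashes = 0           # escaped quote or ordinary character
        end
    end
    return nothing
end

# Source text: 5 backslashes, a quote, the closing quote, then unrelated text
src = "\\"^5 * "\"\" tail"
stop = find_terminator(src)

# Step 1 result: the body is passed through untouched (all 5 backslashes kept)
body = src[1:prevind(src, stop)]

# Step 2: the macro alone decides how to interpret the backslashes
bytes = codeunits(unescape_string(body))
println(bytes)
```

Under this scheme the macro sees the text exactly as typed, so each pair of backslashes is reduced only once.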

@c42f
Member

c42f commented Sep 7, 2022

Perhaps a sensible and generic fix for these kinds of woes is to allow more flexibility in the string delimiters for custom string macros? (Also related: #41041)

Then individual string macros wouldn't need weird heuristics to avoid double escaping - the generic answer if the user is having escaping issues would be to use another set of delimiters. Which exact delimiters are available? One possibility could be that either `` or "" begins a string when it's followed by the opposite quote type, with the (reversed/same?) delimiter at the other end of the string. It's currently a syntax error to juxtapose string literals so this syntax is probably available unless I've forgotten something. For example, the string "hi"

julia> :(x``"hi"``)
ERROR: syntax: cannot juxtapose string literal
Stacktrace:
 [1] top-level scope
   @ none:1

I'm imagining this mixed delimiter parsing as @x_cmd "hi", given that the quote starts with a backtick and it could be @x_str for ".

The rule might be that mixed delimiters can be an arbitrarily long sequence of length at least 3, and the user can always arrange for those to not be present in the string they're trying to quote.

(This is just one idea - perhaps there are other delimiters available?)

@adkabo
Contributor

adkabo commented Sep 7, 2022

@c42f

Perhaps a sensible and generic fix for these kind of woes is to allow more flexibility in the string delimiters for custom string macros?

See #38948

@c42f
Member

c42f commented Sep 7, 2022

Ah yes thanks. I thought I'd seen a longer discussion of this somewhere but couldn't find it.

6 participants