Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Introduce concat_bytes!() to join [u8] and byte str analogous to concat! for str #2509

Merged
merged 6 commits into from
Jul 28, 2021

Conversation

estebank
Copy link
Contributor

@estebank estebank commented Aug 1, 2018

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 1, 2018
@SimonSapin
Copy link
Contributor

I’d like to see the Motivation section expanded. Why is this useful? Consider describing use cases and adding examples. Currently this section only paraphrases the summary.

The RFC’s title and motivation section mention “byte str” and “byte char”, which I find confusing because there is no such thing in my mental model. Rather, the language syntax contains: string literals of type &'static str, character literals of type char, byte string literals (or bytes literal) of type &'static [u8; N], and byte literals (singular byte) of type u8. In particular, a byte string literal is already an array of integers.

If this is added, I’d prefer it to be a separate macro. concat! today already accepts multiple inputs of different types but always returns &'static str. It would be surprising that adding another input can not only change the return type, but also completely change the handling of integer literals (from ASCII digits serialization to raw byte values).

For that separate macro, the name bconcat! is rather obscure. It is not typical of the standard library’s to do such aggressive abbreviations in public API naming. How about concat_bytes!?

@estebank
Copy link
Contributor Author

estebank commented Aug 1, 2018

@SimonSapin sounds reasonable!

I’d like to see the Motivation section expanded. Why is this useful? Consider describing use cases and adding examples. Currently this section only paraphrases the summary.

Will expand. I can definitely see the case where someone would want to do concat_bytes!(b"foo", 3, 2).

The RFC’s title and motivation section mention “byte str” and “byte char”, which I find confusing because there is no such thing in my mental model. Rather, the language syntax contains: string literals of type &'static str, character literals of type char, byte string literals (or bytes literal) of type &'static [u8; N], and byte literals (singular byte) of type u8. In particular, a byte string literal is already an array of integers.

The confusion seems to be around "byte char" (b'x'), which seems to be referred to as "byte literal" in the docs, but that seems slightly misleading to me as a single char can have multiple bytes.

If this is added, I’d prefer it to be a separate macro. concat! today already accepts multiple inputs of different types but always returns &'static str. It would be surprising that adding another input can not only change the return type, but also completely change the handling of integer literals (from ASCII digits serialization to raw byte values).

Very reasonable. It would be easy enough to point people in the right direction when extending the current error.

For that separate macro, the name bconcat! is rather obscure. It is not typical of the standard library’s to do such aggressive abbreviations in public API naming. How about concat_bytes!?

Works for me. The discoverability issue could be handled when someone attempts to use concat!(b"").

@fintelia

This comment has been minimized.

@joshtriplett
Copy link
Member

I agree with @SimonSapin's objection; I think this should use a separate macro concat_bytes!.

Apart from that, though, I'd love to see this.

@jsgf
Copy link

jsgf commented Aug 17, 2018

Whenever any of the arguments to concat!() is a byte literal, its output will be a byte literal, and the other arguments will be evaluated on their byte contents.

In general I agree with the suggestion of making the type with concat_bytes!, but I think it is also necessary - none of the parameters to concat!()/concat_bytes!() need be a string (concat!(1,2,3) => "123", for example), so there wouldn't be any type cue. You could consider trying to infer the type from the context, but that's getting complex.

Edit: I guess you could require a dummy b"" parameter, but that's pretty hacky.

@jsgf
Copy link

jsgf commented Aug 17, 2018

XX |     concat!(256, b"val");
   |             ^^^ this value is larger than `255`

I think this should have identical behaviour to string concat - ie emit b"256val" in this case. If you want an arbitrary byte then you could do b'\xff'.

@SimonSapin
Copy link
Contributor

Well, b'\xff' is 256_u8. It’s only another literal syntax for the same type, just like 0xFF_u8 is yet another. Technically since macro inputs are token tree concat_bytes! could tell the difference between the two, but since we make it look like a function that takes comma-separated values I think it would feel very weird for this "function" to behave so differently with two arguments that are spelled with different syntax but are the same at the type level and expression-value level.

@SimonSapin
Copy link
Contributor

Maybe, to avoid this confusion, concat_bytes! should not accept single-integer arguments. Only bytes arrays/slices (with either b"foo" byte string literal syntax or [b'f', 0x6F, 111] array literal syntax)

@jsgf
Copy link

jsgf commented Aug 18, 2018

My point is that concat!(123) -> "123": &str, and I think that concat_bytes!() should have identical behaviour, except that its result type is &[u8] - that is, they should accept the same inputs, and produce byte-for-byte identical output, albeit a slightly different type. The only extension is that concat_bytes!() should also accept binary string literals.

@estebank
Copy link
Contributor Author

@jsgf sounds reasonable for concat and concat_bytes to behave exactly the same with only a difference in the output type. The original PR opened without an RFCS did get concat to change behavior if an b"" was present, but it'd be too "magical" in practice for my taste.

@SimonSapin
Copy link
Contributor

That would be one way to define it and it’s not too terrible but:

  • It’s almost the same as concat!(…).as_bytes() (except it could return a known-size array rather than a slice)
  • It would implicitly encode Unicode characters as UTF-8, without a visible indication that this is happening. Compare with byte-string literals, where we’ve chosen to disallow non-ASCII characters entirely to "force" users to consider what character encodings they want. (Non-ASCII bytes can be included with hex backslash escapes like \xFF.)

I would prefer concat_bytes! to not accept "normal" (Unicode) string literals or character literals at all.

@estebank
Copy link
Contributor Author

It’s almost the same as concat!(…).as_bytes() (except it could return a known-size array rather than a slice)

For some reason I though arbitrary slices couldn't be const, alas they are. TIL.

I would prefer concat_bytes! to not accept "normal" (Unicode) string literals or character literals at all.

Given that the concat_bytes macro would be a builtin, we can perform very deep evaluation of the values being passed in. We could accept ascii only strings and fail otherwise, or (maybe more interestingly), we could warn on these and provide a way of doing it without having to rely on it, for example suggest concat_bytes!("felíz ", "año") -> concat_bytes!("fel\u{ED}z ", "a\u{F1}o") in the same way that b"" could have \xFF suggested. My reasoning being that unicode literals in strings make it clear how wide they are, making the concern less of a foot-gun.

@SimonSapin
Copy link
Contributor

"fel\u{ED}z " is an interesting case (like the rest of the U+0080 to U+00FF range) because the two-digit hexadecimal escape looks like it might represent a single \xED byte but it doesn’t. That string is represented as b"fel\xC3\xADz ", not b"fel\xEDz ".

This is exactly the kind of confusion I’d prefer to avoid.

We could have the proc macro inspect Unicode strings and reject those that contain non-ASCII characters, but at that point why not reject Unicode strings entirely and suggest using byte strings instead?

@estebank
Copy link
Contributor Author

@SimonSapin I'm ok with rejecting unicode strings, as long as we suggest the appropriate hex representation of values outside of the ascii range, as this seems like something easy to include and very useful for end users.

@SimonSapin
Copy link
Contributor

SimonSapin commented Aug 21, 2018

Sure, making a custom error message (that mentions UTF-8) for this case sounds good.

@Centril Centril added the A-macros-libstd Proposals that introduce new standard library macros label Nov 22, 2018
@jsgf

This comment has been minimized.

@whatisaphone
Copy link

I'd love to see this capability in the stdlib.

My motivation (which I don't think was mentioned yet) is building null-terminated strings at compile-time, for C interop.

macro_rules! zstr {
    ($str:literal) => {
        concat_bytes!($str, b"\0");
    };
}

zstr!(b"abc") == b"abc\0"

@Lokathor
Copy link
Contributor

Lokathor commented Jan 9, 2020

You can do that already with normal strings. I use a null_str! macro myself

@whatisaphone
Copy link

whatisaphone commented Jan 9, 2020

Very true! That only works with UTF-8 strings though. I'm unfortunately working with Windows-1252. I guess that was an important detail.

@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jan 10, 2020
@KodrAus KodrAus added the Libs-Tracked Libs issues that are tracked on the team's project board. label Jul 29, 2020
@joshtriplett
Copy link
Member

@nikomatsakis, @Mark-Simulacrum, and I talked about this one today. We generally think this is a good idea. We feel that, while as a builtin macro this is in the lang team's scope, we'd be comfortable with the @rust-lang/libs team evaluating this as if it were not a builtin macro (in which case it would be entirely within the scope of libs).

I do think this should be a separate macro like concat_bytes! rather than overloading concat!.

@Amanieu
Copy link
Member

Amanieu commented Aug 5, 2020

I'm happy to start an FCP on behalf of the libs team once the RFC text is updated to use concat_bytes!.

@jplatte
Copy link
Contributor

jplatte commented Oct 21, 2020

I'm interested in moving this forward. Should I open a PR on @estebank's branch to update the RFC text?

@joshtriplett
Copy link
Member

joshtriplett commented Oct 21, 2020

@jplatte You should coordinate with @estebank (here or via Zulip). A PR to help update the RFC text seems like a good step, though, after you've coordinated to confirm that'll work. And thank you!

@BurntSushi
Copy link
Member

This LGTM. Although, I don't think I'd mind seeing string/char literals supported in concat_bytes!. @SimonSapin mentioned above that this would just be similar to concat!("something").as_bytes(), but AIUI, it would enable things like, concat_bytes!("☃", b"\xFF", "foo"). That is, it would let you mix and match. An undoubtedly niche case, but is there any other way to do such a thing?

@m-ou-se
Copy link
Member

m-ou-se commented Mar 17, 2021

@rfcbot concern argument types

With concat!() it turned out that it accepting numbers/bools wasn't a great idea: println!(..) expanded to something like print!(concat!(.., "\n")), which caused println!(123) to be accepted, since concat!() converts it silently to a string. This proposed concat_bytes!() macro doesn't do number formatting, but would still accept integers as arguments, which are used as single bytes.

Considering this, I have two small related concerns about the types of arguments to accept:

  1. Inconsistency with concat!(): concat!(123, "\n") produces "123\n", but concat_bytes!(123, b"\n") produces b"\x7b\n".
  2. Making the same mistakes possible as with concat!(): a macro expanding to concat_bytes!($arg, b"\0") such as the zstr!() example above would also accept zstr!(123), which was probably not intended.

@SimonSapin
Copy link
Contributor

SimonSapin commented Mar 17, 2021

@BurntSushi If Unicode arguments are not supported directly you could still mix and match by converting arguments: concat_bytes!("☃".as_bytes(), b"\xFF")

@BurntSushi
Copy link
Member

@SimonSapin Ah thanks for that. I'm totally fine with that.

@SimonSapin
Copy link
Contributor

Oh sorry, I got confused with other macros whose expansion includes expression arguments. The point of this one is to expand to a single literal, so it needs to parse the token of its arguments which therefore can’t be arbitrary expressions.

@BurntSushi
Copy link
Member

@SimonSapin Ah, maybe that was why I thought the as_bytes() work-around wouldn't work when I wrote my first comment, but it was totally subconscious. Because of that limitation, how do you feel about allowing string literals in concat_bytes!? It's somewhat niche, but there doesn't seem to be a huge drawback to me for allowing it. I suppose we could also just punt on that since it seems like something that could be added backwards compatibly.

@SimonSapin
Copy link
Contributor

I’d prefer for the conversion and choice of encoding to be explicit when converting Unicode to bytes (even if the conversion is only type-level because the memory representation happens to match), but my pre-Rust-1.0 proposal to rename str::as_bytes to str::as_utf8 was not accepted so 🤷

@jplatte
Copy link
Contributor

jplatte commented Mar 18, 2021

Actually no, I don't see how that could possibly work. concat_bytes! operates on literals, not constant expressions.

@BurntSushi
Copy link
Member

Oh sorry, I got confused with other macros whose expansion includes expression arguments. The point of this one is to expand to a single literal, so it needs to parse the token of its arguments which therefore can’t be arbitrary expressions.

Yeah I share the sentiment. I think my retort would just be that it would otherwise be not possible to mix non-UTF-8 with UTF-8 via the nice syntax of Unicode strings. For me personally, I think that might outweigh the desire for encoding explicitness.

But I do not feel that strongly about this at all.

@jplatte
Copy link
Contributor

jplatte commented Mar 23, 2021

Is now a good time to also rename the file from 0000-byte-concat.md to 2509-byte-concat.md or is that automated and I don't need to do anything?

@joshtriplett
Copy link
Member

joshtriplett commented Apr 14, 2021

We discussed this in today's @rust-lang/libs meeting, and came to a resolution on the question of integer literals:

We'd like to reject bare integer literals, and only allow arrays of integer literals. So, concat_bytes!(b"hello", [1, 2, 3], "world") would be accepted, but concat_bytes!(b"hello", 1, 2, 3, "world") would not be.

It would also be nice to have a diagnostic with a suggestion, that suggests wrapping a series of integer literals with square brackets. So, concat_bytes!(b"hello", 1, 2, 3, "world") should get a rustfix-applicable suggestion to become concat_bytes!(b"hello", [1, 2, 3], "world").

@bstrie
Copy link
Contributor

bstrie commented May 5, 2021

Minor issue that doesn't need to block acceptance of the RFC, but recently there have been some discussions about whether or not new macros ought to be properly namespaced rather than just being effectively added to the prelude outright (I'm not even sure if macro namespacing existed or was mature when this RFC was originally written). Is this new item useful enough that it should live in the prelude, or should it rather be, say, std::array::concat_bytes?

@jplatte
Copy link
Contributor

jplatte commented May 27, 2021

@m-ou-se Is your concern resolved with the latest revision?

@m-ou-se
Copy link
Member

m-ou-se commented May 27, 2021

Is now a good time to also rename the file from 0000-byte-concat.md to 2509-byte-concat.md or is that automated and I don't need to do anything?

Usually that's done by whoever merges the RFC after the FCP is finished. But feel free to rename it yourself, to save that person one step in the rfc-merging process. ^^

Is your concern resolved with the latest revision?

Yes

@rfcbot resolve argument types

UTF-8

I'd personally prefer for this macro to accept regular string literals for UTF-8, but we can always add that later if that turns out to be useful, as it'd be backwards compatible.

@rfcbot rfcbot added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. labels May 27, 2021
@rfcbot
Copy link
Collaborator

rfcbot commented May 27, 2021

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this RFC. to-announce and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Jun 6, 2021
@rfcbot
Copy link
Collaborator

rfcbot commented Jun 6, 2021

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@jplatte
Copy link
Contributor

jplatte commented Jul 26, 2021

Ping @joshtriplett

@m-ou-se m-ou-se merged commit 431dadf into rust-lang:master Jul 28, 2021
@m-ou-se
Copy link
Member

m-ou-se commented Jul 28, 2021

🎉

This is now tracked in rust-lang/rust#87555

@Fishrock123
Copy link
Contributor

This RFC appears to contain a contradiction:

For example, concat_bytes!(42, b"va", b'l', [1, 2]) evaluates to
[42, 118, 97, 108, 1, 2].

And then later

An earlier version of this RFC proposed to support integer literals outside of
arrays, but that was rejected since it would make the output of
byte_concat!(123, b"\n") inconsistent with the equivalent concat!
invocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-macros-libstd Proposals that introduce new standard library macros disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this RFC. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the RFC. to-announce
Projects
None yet
Development

Successfully merging this pull request may close these issues.