Support \u in string literals #1982

etalian · 2022-08-10T21:12:04Z

While writing doctests for string_literals.md, this came up:
Input:

var fret: String = "I would 'twere something that would fret the string,\n" +
                   "The master-cord on's \u{2764}\u{FE0F}!";

Output:

Stack trace:
 #0 0x000055c57065d50b backtrace .../llvm-project-14.0.6.src/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4277:13
 #1 0x000055c570a2779b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:13
 #2 0x000055c5706eaa00 Carbon::Internal::ExitingStream::ExitingStream() /proc/self/cwd/./common/check_internal.h:35:3
 #3 0x000055c570925732 Carbon::UnescapeStringLiteral(llvm::StringRef, int, bool) /proc/self/cwd/common/string_helpers.cpp:85:11
 #4 0x000055c5708bd5e2 has_value .../llvm/14.0.6_1/bin/../include/c++/v1/optional:321:22
 #5 0x000055c5708bd5e2 operator bool .../llvm/14.0.6_1/bin/../include/c++/v1/optional:975:64
 #6 0x000055c5708bd5e2 operator==<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > .../llvm/14.0.6_1/bin/../include/c++/v1/optional:1277:31
 #7 0x000055c5708bd5e2 Carbon::ProcessSingleLineString(llvm::StringRef, Carbon::ParseAndLexContext&, unsigned long) /proc/self/cwd/explorer/syntax/lex_scan_helper.cpp:46:17
 #8 0x000055c5708ada92 yylex(Carbon::Arena*, void*, Carbon::ParseAndLexContext&) /proc/self/cwd/explorer/syntax/lexer.lpp:353:20
 #9 0x000055c5708c838a Carbon::Parser::parse() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:0:38
#10 0x000055c5708c604d Carbon::Parser::operator()() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:1098:5
#11 0x000055c5708a45e7 Carbon::ParseImpl(void*, Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:31:60
#12 0x000055c5708a3503 Carbon::Parse(Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:64:28
#13 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#14 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#15 0x000055c5706e316f __holds_alternative<1UL, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#16 0x000055c5706e316f holds_alternative<Carbon::AST, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#17 0x000055c5706e316f ok /proc/self/cwd/./common/error.h:91:36
#18 0x000055c5706e316f Carbon::Main(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:73:3
#19 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#20 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#21 0x000055c5706e214a __holds_alternative<1UL, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#22 0x000055c5706e214a holds_alternative<Carbon::Success, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#23 0x000055c5706e214a ok /proc/self/cwd/./common/error.h:91:36
#24 0x000055c5706e214a Carbon::ExplorerMain(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:91:69
#25 0x000055c5706e15a1 main /proc/self/cwd/explorer/main_bin.cpp:0:10
#26 0x00007f116997e0b3 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b3)
#27 0x000055c570621e0e _start (.../execroot/carbon/bazel-out/k8-fastbuild/bin/explorer/explorer+0x4e6e0e)
FATAL failure at common/string_helpers.cpp:85: \u is not yet supported in string literals

The text was updated successfully, but these errors were encountered:

jonmeow · 2022-08-10T21:38:19Z

Note, this is just explicitly failing on the unsupported code, it should be a small task to pick up.

vinsdragonis · 2022-08-10T23:57:37Z

Hey! Is this issue still open? If yes, I would like to give it a try. This is my first time working on open source. So any advice I can get on getting started with this would be welcome.

I have gone through the CONTRIBUTIONS and CODE_OF_CONDUCT.

jonmeow · 2022-08-11T01:17:50Z

I think @etalian is focused on #1971 so probably fine for you to grab, @vinsdragonis

vinsdragonis · 2022-08-11T11:47:43Z

So, after analyzing the files under /carbon-lang/common/, I understood that we don't want the stack trace to be printed for the case where a Unicode escape sequence is encountered. The stack trace is printed when we call ExitingStream() under check_internal.h. And this was also printed in the given case since we are calling the same from CARBON_RAW_EXITING_STREAM() under check.h.

So here's my approach to fixing it. We can create a standalone function that behaves like the CARBON_FATAL() method defined under check.h, except that it directly prints the FATAL failure message without having to print the stack trace only if necessary, print the message saying that escape sequence is not supported and then exits the program.

The function could go something like this.

#define CARBON_FATAL_UNSUPPORTED()
    << "FATAL failure at " << __FILE__ << ":" << __LINE__ << ": "

While this may not be the most efficient approach to patch up the problem, there are still a few other escape sequences and more features yet to be implemented. If there is a need to only print a message and exit at the time, this could prove helpful.

Please let me know if I could go ahead with it. And please excuse me if my approach doesn't seem up to the mark as is my first time trying open source.
Also, do let me know if there need to be any additional modifications to my approach if it seems partially correct.

etalian · 2022-08-11T13:18:53Z

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place.
The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case.
C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code.
Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

jonmeow · 2022-08-11T16:12:07Z

Agreed with @etalian -- while the message isn't ideal, the intent of this issue is to fix it by adding support, not just change the message. I think UTF-8 is the right solution too.

vinsdragonis · 2022-08-11T17:18:59Z

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

K, I will look into this.

vinsdragonis · 2022-08-11T21:14:34Z

Okay, so after a bit of research on the function mentioned by @etalian, an alternate idea came up. Why not consider writing a function which can manually convert the hex value to a UTF-8 string?
If this is something that seems fine, please do let me know.

vinsdragonis · 2022-08-11T21:25:55Z

This is the proposed function that could be used for hex to UTF-8 conversions:

#include <iomanip>

auto stoh(std::string const& in)
{
    std::ostringstream os;

    for(unsigned char const& c : in)
    {
        os << std::hex << std::setprecision(1) << std::setw(1)
           << std::setfill('0') << static_cast<char>(c);
    }

    return os.str();
}

No problems if this is not what is needed. I can always go with the UTF-8 function mentioned by @etalian if that is what's expected.

etalian · 2022-08-11T22:15:22Z

For a quick description of how UTF-8 works: https://fasterthanli.me/articles/working-with-strings-in-rust#a-very-quick-utf-8-primer
Let's say, to reuse the same example in the article, that our Carbon string contains the Unicode literal \u{E9}.
This value encodes the symbol é.
The UTF-8 representation of this letter is two bytes, 0xC3 and 0xA9 (but in general it can be anything between 1 and 4 bytes, it always depends on the specific symbol).
So in this example, the conversion function would take the integer 0xE9 as input, and produce a string containing the two bytes 0xC3 and 0xA9 as output.

As another example, the emoji 😁 corresponds to the Unicode literal \u{1F601}, which in UTF-8 is 4 bytes (0xf0 0x9f 0x98 0x81).

Pritjam · 2022-08-13T05:40:49Z

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

vinsdragonis · 2022-08-13T09:19:59Z

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

Sure, I'm happy to collaborate with anyone willing to help 😄

Pritjam · 2022-08-13T19:04:45Z

I added a first draft of the new code to my fork at https://github.com/Pritjam/carbon-lang. @vinsdragonis , could you take a look and see what you think/make changes? Specifically, I added code to string_helpers.cpp (in the UnescapeStringLiteral switch-case block) and a test in string_helpers_test.cpp.

vinsdragonis · 2022-08-13T19:12:02Z

@Pritjam, sure I will have a look at this.

Support \u in string literals as per issue carbon-language#1982

vinsdragonis · 2022-08-13T20:23:24Z

I have drafted test cases for the same. Unfortunately, this doesn't seem to recognize the escape sequence yet since we haven't parsed the string to get those literals.
We need to extract the literals in the ParseBlockStringLiteral() function under string_helpers.cpp. Then we can parse the literals individually.

While that didn't work, the code for the escape sequence seems logically correct to me, though I will need to confirm the same.

Pritjam · 2022-08-13T20:46:43Z

I think the reason the escape sequences in the new test case aren't being recognized is that they aren't formatted quite correctly. Looking at the docs for string literals (docs/design/lexical_conventions/string_literals.md), Unicode code points are specified with \u{HHHH...}, and the new test case omits these brackets.

As for the ParseBlockStringLiteral(), I don't think we need to extract the literals there, since that function calls UnescapeStringLiteral, so we just have to implement the Unicode literal parsing logic in UnescapeStringLiteral. This also makes sense because that's where the old code had a CARBON_FATAL() to show that Unicode wasn't implemented yet.

Pritjam · 2022-08-13T21:16:13Z

So I think we have 3 objectives we're trying to accomplish here:

Add the unicode functionality to the explorer
I want to have a commit incorporated as a contribution to Carbon
You want to have a commit incorporated as a contribution to Carbon

I think with what we have right now, we accomplish the first two objectives. Next, you should add a test case or some other way of testing the functionality (which you pretty much have, we just need to get your test case working) and then we can go ahead and make a pull request of the entire commit chain. What do you think of that?

vinsdragonis · 2022-08-13T21:19:09Z

Yeah, it seems good to me. As for the test case, I have rectified it by sticking to the format \u{HHHH}. Do check it out and let me know if it seems right.

Pritjam · 2022-08-13T21:28:12Z

I saw the change you made in the unicode.carbon test. It's almost correct! The one other change is, the string you're comparing against (on line 15 of that file) is still the old string, "str". I'll go ahead and fix that real quick, and then I think we should be good to make a PR.

Also, do you know how to run the tests? It took me a bit to figure out as well, but it's super useful to be able to iteratively test your code. To run the unicode.carbon test, you have to run bazel test //explorer/testdata:string/unicode.carbon.test. You can see that command contains the path //explorer/testdata:string/unicode.carbon.test. It starts with //explorer/testdata because that's a folder with a BUILD file in it, then specifies :string/unicode.carbon.test to say which test to run.

To run the string_helpers_test.cpp test, instead run bazel test common/string_helpers_test. This specifies a test in the common directory instead.

Pritjam · 2022-08-13T21:35:52Z

Alright, the code is good to go. We should be able to make a PR now. Do you want to go ahead and make that?

vinsdragonis · 2022-08-13T21:37:35Z

Sure, I'll do that

vinsdragonis · 2022-08-13T21:49:33Z

#2027

jonmeow · 2022-08-23T19:33:57Z

I think #2027 resolves this.

jonmeow added the explorer Action items related to Carbon explorer code label Aug 10, 2022

jonmeow added the good first issue Possibly a good first issue for newcomers label Aug 10, 2022

jonmeow assigned vinsdragonis Aug 11, 2022

jonmeow changed the title ~~FATAL: \u is not yet supported in string literals~~ Support \u in string literals Aug 11, 2022

Pritjam added a commit to Pritjam/carbon-lang that referenced this issue Aug 13, 2022

Support \u in string literals as per issue carbon-language#1982

2d32c4d

vinsdragonis added a commit to vinsdragonis/carbon-lang that referenced this issue Aug 13, 2022

Merge pull request #1 from Pritjam/trunk

d510dd0

Support \u in string literals as per issue carbon-language#1982

vinsdragonis mentioned this issue Aug 13, 2022

Added unicode support #2027

Merged

jonmeow closed this as completed Aug 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support \u in string literals #1982

Support \u in string literals #1982

etalian commented Aug 10, 2022

jonmeow commented Aug 10, 2022

vinsdragonis commented Aug 10, 2022

jonmeow commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022 •

edited

Loading

etalian commented Aug 11, 2022

jonmeow commented Aug 11, 2022 •

edited

Loading

vinsdragonis commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022 •

edited

Loading

etalian commented Aug 11, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022 •

edited

Loading

Pritjam commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

Pritjam commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

jonmeow commented Aug 23, 2022

Support \u in string literals #1982

Support \u in string literals #1982

Comments

etalian commented Aug 10, 2022

jonmeow commented Aug 10, 2022

vinsdragonis commented Aug 10, 2022

jonmeow commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022 • edited Loading

etalian commented Aug 11, 2022

jonmeow commented Aug 11, 2022 • edited Loading

vinsdragonis commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022

vinsdragonis commented Aug 11, 2022 • edited Loading

etalian commented Aug 11, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022 • edited Loading

Pritjam commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

Pritjam commented Aug 13, 2022

Pritjam commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

vinsdragonis commented Aug 13, 2022

jonmeow commented Aug 23, 2022

vinsdragonis commented Aug 11, 2022 •

edited

Loading

jonmeow commented Aug 11, 2022 •

edited

Loading

vinsdragonis commented Aug 11, 2022 •

edited

Loading

vinsdragonis commented Aug 13, 2022 •

edited

Loading