Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support \u in string literals #1982

Closed
etalian opened this issue Aug 10, 2022 · 23 comments
Closed

Support \u in string literals #1982

etalian opened this issue Aug 10, 2022 · 23 comments
Assignees
Labels
explorer Action items related to Carbon explorer code good first issue Possibly a good first issue for newcomers

Comments

@etalian
Copy link

etalian commented Aug 10, 2022

While writing doctests for string_literals.md, this came up:
Input:

var fret: String = "I would 'twere something that would fret the string,\n" +
                   "The master-cord on's \u{2764}\u{FE0F}!";

Output:

Stack trace:
 #0 0x000055c57065d50b backtrace .../llvm-project-14.0.6.src/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4277:13
 #1 0x000055c570a2779b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /proc/self/cwd/external/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:13
 #2 0x000055c5706eaa00 Carbon::Internal::ExitingStream::ExitingStream() /proc/self/cwd/./common/check_internal.h:35:3
 #3 0x000055c570925732 Carbon::UnescapeStringLiteral(llvm::StringRef, int, bool) /proc/self/cwd/common/string_helpers.cpp:85:11
 #4 0x000055c5708bd5e2 has_value .../llvm/14.0.6_1/bin/../include/c++/v1/optional:321:22
 #5 0x000055c5708bd5e2 operator bool .../llvm/14.0.6_1/bin/../include/c++/v1/optional:975:64
 #6 0x000055c5708bd5e2 operator==<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > .../llvm/14.0.6_1/bin/../include/c++/v1/optional:1277:31
 #7 0x000055c5708bd5e2 Carbon::ProcessSingleLineString(llvm::StringRef, Carbon::ParseAndLexContext&, unsigned long) /proc/self/cwd/explorer/syntax/lex_scan_helper.cpp:46:17
 #8 0x000055c5708ada92 yylex(Carbon::Arena*, void*, Carbon::ParseAndLexContext&) /proc/self/cwd/explorer/syntax/lexer.lpp:353:20
 #9 0x000055c5708c838a Carbon::Parser::parse() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:0:38
#10 0x000055c5708c604d Carbon::Parser::operator()() /proc/self/cwd/bazel-out/k8-fastbuild/bin/explorer/syntax/parser.cpp:1098:5
#11 0x000055c5708a45e7 Carbon::ParseImpl(void*, Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:31:60
#12 0x000055c5708a3503 Carbon::Parse(Carbon::Arena*, std::__1::basic_string_view<char, std::__1::char_traits<char> >, bool) /proc/self/cwd/explorer/syntax/parse.cpp:64:28
#13 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#14 0x000055c5706e316f index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#15 0x000055c5706e316f __holds_alternative<1UL, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#16 0x000055c5706e316f holds_alternative<Carbon::AST, Carbon::Error, Carbon::AST> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#17 0x000055c5706e316f ok /proc/self/cwd/./common/error.h:91:36
#18 0x000055c5706e316f Carbon::Main(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:73:3
#19 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:785:12
#20 0x000055c5706e214a index .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1435:59
#21 0x000055c5706e214a __holds_alternative<1UL, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1461:14
#22 0x000055c5706e214a holds_alternative<Carbon::Success, Carbon::Error, Carbon::Success> .../llvm/14.0.6_1/bin/../include/c++/v1/variant:1467:10
#23 0x000055c5706e214a ok /proc/self/cwd/./common/error.h:91:36
#24 0x000055c5706e214a Carbon::ExplorerMain(llvm::StringRef, int, char**) /proc/self/cwd/explorer/main.cpp:91:69
#25 0x000055c5706e15a1 main /proc/self/cwd/explorer/main_bin.cpp:0:10
#26 0x00007f116997e0b3 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b3)
#27 0x000055c570621e0e _start (.../execroot/carbon/bazel-out/k8-fastbuild/bin/explorer/explorer+0x4e6e0e)
FATAL failure at common/string_helpers.cpp:85: \u is not yet supported in string literals
@jonmeow jonmeow added the explorer Action items related to Carbon explorer code label Aug 10, 2022
@jonmeow
Copy link
Contributor

jonmeow commented Aug 10, 2022

Note, this is just explicitly failing on the unsupported code, it should be a small task to pick up.

@jonmeow jonmeow added the good first issue Possibly a good first issue for newcomers label Aug 10, 2022
@vinsdragonis
Copy link
Contributor

Hey! Is this issue still open? If yes, I would like to give it a try. This is my first time working on open source. So any advice I can get on getting started with this would be welcome.

I have gone through the CONTRIBUTIONS and CODE_OF_CONDUCT.

@jonmeow
Copy link
Contributor

jonmeow commented Aug 11, 2022

I think @etalian is focused on #1971 so probably fine for you to grab, @vinsdragonis

@vinsdragonis
Copy link
Contributor

vinsdragonis commented Aug 11, 2022

So, after analyzing the files under /carbon-lang/common/, I understood that we don't want the stack trace to be printed for the case where a Unicode escape sequence is encountered. The stack trace is printed when we call ExitingStream() under check_internal.h. And this was also printed in the given case since we are calling the same from CARBON_RAW_EXITING_STREAM() under check.h.

So here's my approach to fixing it. We can create a standalone function that behaves like the CARBON_FATAL() method defined under check.h, except that it directly prints the FATAL failure message without having to print the stack trace only if necessary, print the message saying that escape sequence is not supported and then exits the program.

The function could go something like this.

#define CARBON_FATAL_UNSUPPORTED()
    << "FATAL failure at " << __FILE__ << ":" << __LINE__ << ": "

While this may not be the most efficient approach to patch up the problem, there are still a few other escape sequences and more features yet to be implemented. If there is a need to only print a message and exit at the time, this could prove helpful.

Please let me know if I could go ahead with it. And please excuse me if my approach doesn't seem up to the mark as is my first time trying open source.
Also, do let me know if there need to be any additional modifications to my approach if it seems partially correct.

@etalian
Copy link
Author

etalian commented Aug 11, 2022

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place.
The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case.
C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code.
Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

@jonmeow
Copy link
Contributor

jonmeow commented Aug 11, 2022

Agreed with @etalian -- while the message isn't ideal, the intent of this issue is to fix it by adding support, not just change the message. I think UTF-8 is the right solution too.

@jonmeow jonmeow changed the title FATAL: \u is not yet supported in string literals Support \u in string literals Aug 11, 2022
@vinsdragonis
Copy link
Contributor

While that would likely make the error message less jarring, the issue here is that an error is raised in the first place. The Carbon documentation defines that the language should support Unicode literals, so the current implementation needs to cover this use case. C++ does provide Unicode conversion facilities, but unfortunately they are deprecated since C++17 and should not be relied upon for new code. Fortunately, Carbon is built on LLVM, which also already provides some Unicode support. Specifically, it looks to me that something like ConvertCodePointToUTF8 might be used in a first approximation.

NOTE: in C++, a u prefix encodes an UTF-16 literal but Carbon does not seem to specify anything of the sort, so I am assuming Swift/Rust behavior instead (in other words, strings contain UTF-8 code points, meaning that we can temporarily sort of get away with storing them in std::strings, as the current implementation does).

K, I will look into this.

@vinsdragonis
Copy link
Contributor

Okay, so after a bit of research on the function mentioned by @etalian, an alternate idea came up. Why not consider writing a function which can manually convert the hex value to a UTF-8 string?
If this is something that seems fine, please do let me know.

@vinsdragonis
Copy link
Contributor

vinsdragonis commented Aug 11, 2022

This is the proposed function that could be used for hex to UTF-8 conversions:

#include <iomanip>

auto stoh(std::string const& in)
{
    std::ostringstream os;

    for(unsigned char const& c : in)
    {
        os << std::hex << std::setprecision(1) << std::setw(1)
           << std::setfill('0') << static_cast<char>(c);
    }

    return os.str();
}

No problems if this is not what is needed. I can always go with the UTF-8 function mentioned by @etalian if that is what's expected.

@etalian
Copy link
Author

etalian commented Aug 11, 2022

For a quick description of how UTF-8 works: https://fasterthanli.me/articles/working-with-strings-in-rust#a-very-quick-utf-8-primer
Let's say, to reuse the same example in the article, that our Carbon string contains the Unicode literal \u{E9}.
This value encodes the symbol é.
The UTF-8 representation of this letter is two bytes, 0xC3 and 0xA9 (but in general it can be anything between 1 and 4 bytes, it always depends on the specific symbol).
So in this example, the conversion function would take the integer 0xE9 as input, and produce a string containing the two bytes 0xC3 and 0xA9 as output.

As another example, the emoji 😁 corresponds to the Unicode literal \u{1F601}, which in UTF-8 is 4 bytes (0xf0 0x9f 0x98 0x81).

@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

@vinsdragonis
Copy link
Contributor

Hi! Would it be OK if I also took a shot at this issue? Like @vinsdragonis , this is my first time contributing to open source code as well. I've gone through the code of conduct and contributing files.

Sure, I'm happy to collaborate with anyone willing to help 😄

Pritjam added a commit to Pritjam/carbon-lang that referenced this issue Aug 13, 2022
@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

I added a first draft of the new code to my fork at https://github.com/Pritjam/carbon-lang. @vinsdragonis , could you take a look and see what you think/make changes? Specifically, I added code to string_helpers.cpp (in the UnescapeStringLiteral switch-case block) and a test in string_helpers_test.cpp.

@vinsdragonis
Copy link
Contributor

@Pritjam, sure I will have a look at this.

vinsdragonis added a commit to vinsdragonis/carbon-lang that referenced this issue Aug 13, 2022
Support \u in string literals as per issue carbon-language#1982
@vinsdragonis
Copy link
Contributor

vinsdragonis commented Aug 13, 2022

I have drafted test cases for the same. Unfortunately, this doesn't seem to recognize the escape sequence yet since we haven't parsed the string to get those literals.
We need to extract the literals in the ParseBlockStringLiteral() function under string_helpers.cpp. Then we can parse the literals individually.

While that didn't work, the code for the escape sequence seems logically correct to me, though I will need to confirm the same.

@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

I think the reason the escape sequences in the new test case aren't being recognized is that they aren't formatted quite correctly. Looking at the docs for string literals (docs/design/lexical_conventions/string_literals.md), Unicode code points are specified with \u{HHHH...}, and the new test case omits these brackets.

As for the ParseBlockStringLiteral(), I don't think we need to extract the literals there, since that function calls UnescapeStringLiteral, so we just have to implement the Unicode literal parsing logic in UnescapeStringLiteral. This also makes sense because that's where the old code had a CARBON_FATAL() to show that Unicode wasn't implemented yet.

@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

So I think we have 3 objectives we're trying to accomplish here:

  1. Add the unicode functionality to the explorer
  2. I want to have a commit incorporated as a contribution to Carbon
  3. You want to have a commit incorporated as a contribution to Carbon

I think with what we have right now, we accomplish the first two objectives. Next, you should add a test case or some other way of testing the functionality (which you pretty much have, we just need to get your test case working) and then we can go ahead and make a pull request of the entire commit chain. What do you think of that?

@vinsdragonis
Copy link
Contributor

Yeah, it seems good to me. As for the test case, I have rectified it by sticking to the format \u{HHHH}. Do check it out and let me know if it seems right.

@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

I saw the change you made in the unicode.carbon test. It's almost correct! The one other change is, the string you're comparing against (on line 15 of that file) is still the old string, "str". I'll go ahead and fix that real quick, and then I think we should be good to make a PR.

Also, do you know how to run the tests? It took me a bit to figure out as well, but it's super useful to be able to iteratively test your code. To run the unicode.carbon test, you have to run bazel test //explorer/testdata:string/unicode.carbon.test. You can see that command contains the path //explorer/testdata:string/unicode.carbon.test. It starts with //explorer/testdata because that's a folder with a BUILD file in it, then specifies :string/unicode.carbon.test to say which test to run.

To run the string_helpers_test.cpp test, instead run bazel test common/string_helpers_test. This specifies a test in the common directory instead.

@Pritjam
Copy link
Contributor

Pritjam commented Aug 13, 2022

Alright, the code is good to go. We should be able to make a PR now. Do you want to go ahead and make that?

@vinsdragonis
Copy link
Contributor

Sure, I'll do that

@vinsdragonis
Copy link
Contributor

#2027

@jonmeow
Copy link
Contributor

jonmeow commented Aug 23, 2022

I think #2027 resolves this.

@jonmeow jonmeow closed this as completed Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
explorer Action items related to Carbon explorer code good first issue Possibly a good first issue for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants