Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not accept unicode escape characters in byte strings or as byte #23625

Merged
merged 1 commit into from
Mar 28, 2015

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Mar 22, 2015

closes #23620

This PR patches the issue mentioned in #23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the scan_byte function returns token::intern("??") for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't scan_byte fail when it encounters an invalid byte?

And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains string_body *, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by byte_body *. If this is valid, I can add this fix to this PR.

@rust-highfive
Copy link
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@@ -799,7 +799,15 @@ impl<'a> StringReader<'a> {
'n' | 'r' | 't' | '\\' | '\'' | '"' | '0' => true,
'x' => self.scan_byte_escape(delim, !ascii_only),
'u' if self.curr_is('{') => {
self.scan_unicode_escape(delim)
if self.scan_unicode_escape(delim) && ascii_only {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If scan_unicode_escape returns false, doesn't this whole branch return true? I think that only happens when the unicode escape is not a valid character (e.g. \u{ffffff}). Could you make sure that still emits an error?

@alexcrichton
Copy link
Member

If this is valid, I can add this fix to this PR.

I think it's even supposed to be byte_string_body, and sure, feel free to include it!

@fhahn fhahn force-pushed the issue-23620-ice-unicode-bytestring branch 3 times, most recently from 59e8c62 to fe64595 Compare March 24, 2015 23:03
@fhahn
Copy link
Contributor Author

fhahn commented Mar 24, 2015

I have updated the pull request and found other escape sequences that lead to ICEs. The problem seems to be that the parser expect the lexer to reject invalid byte literals, but the lexer just returns ?? tokens (e.g. https://github.com/rust-lang/rust/pull/23625/files#diff-d06ad31c547d2d94c8b8770006977767L1333), which lead to the ICEs.

In this PR, I just abort as soon as an invalid literal is encountered. Note that using fatal_span_ would be also possible, but the current version displays error messages for the complete literal before it aborts. Does this sound like a reasonable approach?

Unfortunately, this affects the following tests:

[parse-fail] parse-fail/ascii-only-character-escape.rs
[parse-fail] parse-fail/bad-char-literals.rs
[parse-fail] parse-fail/byte-literals.rs
[parse-fail] parse-fail/byte-string-literals.rs
[parse-fail] parse-fail/lex-bad-char-literals.rs
[parse-fail] parse-fail/lex-bare-cr-string-literal-doc-comment.rs

The fail, because they contain invalid string literals, which cause rustc to abort sooner as without the changes in the PR.

@alexcrichton
Copy link
Member

Hm would it be possible to intern a valid token instead of ??? If an error has already been emitted then we're guaranteed that it already won't finish compiling, so it could help paper over the later ICEs perhaps?

@fhahn fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from fe64595 to d43aebc Compare March 26, 2015 08:44
@fhahn
Copy link
Contributor Author

fhahn commented Mar 26, 2015

@alexcrichton Yes, scan_byte now returns token::intern("?") instead of token::intern("??") which is an invalid byte. I also changed some fatal_span_ to err_span_ in scan_hex_digits so the parsing can continue.
Now it only uses fatal_span_ when the escape sequence is not terminated (https://github.com/rust-lang/rust/pull/23625/files#diff-d06ad31c547d2d94c8b8770006977767R749), but maybe we should read until the literal is terminated in order to continue parsing in this case?

escaped_pos,
self.last_pos,
"Unicode escape sequences cannot be used as byte or in \
byte string."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error messages in Rust conventionally do not start with an uppercase letter and also don't end with a period

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, perhaps "as a byte" instead of "as byte"?

@alexcrichton
Copy link
Member

Looks great to me, thanks @fhahn! Just a small nit about the error message and a tidy failure on travis and I think this is otherwise good to go.

@fhahn
Copy link
Contributor Author

fhahn commented Mar 26, 2015

Thanks for the feedback, I've updated the PR.

@alexcrichton
Copy link
Member

Perhaps the push was forgotten? (looks the same)

@fhahn fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from d43aebc to 5ced095 Compare March 27, 2015 16:46
@fhahn fhahn force-pushed the issue-23620-ice-unicode-bytestring branch from 5ced095 to afaa3b6 Compare March 27, 2015 16:47
@fhahn
Copy link
Contributor Author

fhahn commented Mar 27, 2015

Damn, I pushed it to the wrong branch. It should be updated now

@alexcrichton
Copy link
Member

@bors: r+ afaa3b6

Manishearth added a commit to Manishearth/rust that referenced this pull request Mar 27, 2015
…tring, r=alexcrichton

 closes rust-lang#23620

This PR patches the issue mentioned in rust-lang#23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the `scan_byte` function returns ` token::intern(\"??\") ` for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't `scan_byte` fail when it encounters an invalid byte?

And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains `string_body *`, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by `byte_body *`. If this is valid, I can add this fix to this PR.
alexcrichton added a commit to alexcrichton/rust that referenced this pull request Mar 27, 2015
…ring

closes rust-lang#23620

This PR patches the issue mentioned in rust-lang#23620, but there is also an ICE for invalid escape sequences in byte literals. This is due to the fact that the `scan_byte` function returns ` token::intern("??") ` for invalid bytes, resulting in an ICE later on. Is there a reason for this behavior? Shouldn't `scan_byte` fail when it encounters an invalid byte?

And I noticed a small inconsistency in the documentation. According to the formal byte literal definition in http://doc.rust-lang.org/reference.html#byte-and-byte-string-literals , a byte string literal contains `string_body *`, but according to the text (and the behavior of the lexer) it should not accept unicode escape sequences. Hence it should be replaced by `byte_body *`. If this is valid, I can add this fix to this PR.
@bors bors merged commit afaa3b6 into rust-lang:master Mar 28, 2015
@fhahn fhahn deleted the issue-23620-ice-unicode-bytestring branch March 30, 2015 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ICE on Unicode escape in byte string literal
5 participants