Bad span computations with unicode characters, should be handling them as graphemes #8706

huonw · 2013-08-23T07:09:56Z

use std::io;

fn main() {
    let s = ~"ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
    io::println(s);
}

main.rs:4:46: 4:67 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
main.rs:4     let s = ~"ZͨA͑ͦ͒͋ͤ̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
                                                       ^~~~~~~~~~~~~~~~~~~~~

The proper fix for this requires grapheme handling (#7043), e.g. some graphemes are double width.

The positions are still not completely correct, since the calculation does not handle graphemes at all, just codepoints, but at least it handles the common case correctly. The calculation was previously very wrong (rather than just a little bit wrong): it wasn't accounting for the fact that not every character is 1 byte, and so multibyte characters were pretending to be zero width. cc rust-lang#8706

huonw · 2014-02-23T12:55:17Z

If #12489 lands, the compiler handles these slightly better, but it's just operating on codepoint counts and assuming they're all single width: i.e. still needs to be changed to work with graphemes.
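As a rough illustration of the byte-versus-codepoint distinction discussed here, the sketch below converts a byte offset within a line into a codepoint column using only the standard library. `char_column` is a hypothetical helper written for this example, not the compiler's actual code, and it shares the limitation described above: a combining mark still counts as a full column.

```rust
/// Column (0-based, counted in codepoints) of the character that starts at
/// `byte_offset` in `line`.
/// Hypothetical helper for illustration; not rustc's implementation.
fn char_column(line: &str, byte_offset: usize) -> usize {
    // Slicing a &str panics off a char boundary, so check explicitly first.
    assert!(line.is_char_boundary(byte_offset));
    // Count codepoints, not bytes, in the text before the offset.
    line[..byte_offset].chars().count()
}
```

For `"let \u{e9} = 1;"` the `=` sits at byte offset 7 but codepoint column 6, because `é` (U+00E9) takes two bytes. For `"e\u{301} ="` the function reports column 3 for `=`, even though a terminal normally draws `e` plus the combining acute accent in a single cell, which is the remaining inaccuracy.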

pzol · 2014-02-26T17:40:04Z

Visiting for triage, still requires #7043

alexcrichton · 2014-09-29T17:24:33Z

#7043 has been fixed, so this can actually make progress now!

ftxqxd · 2015-04-19T06:07:59Z

Because of #24428, this issue should be reopened.

pnkfelix · 2015-04-19T13:14:30Z

I agree with @huonw's comment on PR #24428: namely, we should revert the portions of that PR that regressed rustc

bltavares · 2016-02-20T21:56:36Z

Triaging:

Updating the code to rustc 1.8.0-nightly (57c357d89 2016-02-16)

fn main() {
    let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
    println!("{}", s);
}

The error still does not point to the correct location:

<anon>:4:45: 4:66 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
<anon>:4     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
                                                     ^~~~~~~~~~~~~~~~~~~~~

Mark-Simulacrum · 2017-04-28T00:08:47Z

Closing as fixed.

fn main() {
    let s = ~"ZͨA͑ͦL̄͑ĜͨOͥ͛!̏"; while true { break; }
}

Now gives:

error: expected expression, found `~`
 --> test.rs:2:13
  |
2 |     let s = ~"ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |             ^

error: aborting due to previous error

Mark-Simulacrum · 2017-04-28T00:10:02Z

Never mind, closed prematurely. The ~ was caught in an earlier pass, and since it's before the while true, I assumed this was fixed when it wasn't...

warning: unused variable: `s`
 --> test.rs:2:9
  |
2 |     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |         ^
  |
  = note: #[warn(unused_variables)] on by default

warning: denote infinite loops with loop { ... }
 --> test.rs:2:45
  |
2 |     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |                                             ^^^^^^^^^^^^^^^^^^^^^
  |
  = note: #[warn(while_true)] on by default

euclio · 2017-07-15T04:00:41Z

I'd like to work on this, but I could use some help. I've written a failing UI test, and I've pulled in the unicode-width crate to libsyntax. I'm having trouble finding the right place to change how the spans are calculated in the filemap. Should I be modifying bytepos_to_file_charpos...?

@estebank

Display spans correctly when there are zero-width or wide characters

Hopefully...

* fixes #45211
* fixes #8706

---

Before:

```
error: invalid width `7` for integer literal
  --> unicode_2.rs:12:25
   |
12 |     let _ = ("a̐éö̲", 0u7);
   |                         ^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: invalid width `42` for integer literal
  --> unicode_2.rs:13:20
   |
13 |     let _ = ("아あ", 1i42);
   |                    ^^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: aborting due to 2 previous errors
```

After:

```
error: invalid width `7` for integer literal
  --> unicode_2.rs:12:25
   |
12 |     let _ = ("a̐éö̲", 0u7);
   |                     ^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: invalid width `42` for integer literal
  --> unicode_2.rs:13:20
   |
13 |     let _ = ("아あ", 1i42);
   |                      ^^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: aborting due to 2 previous errors
```

Spans might display incorrectly on the browser.

r? @estebank
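The idea of padding the caret line by display width rather than by codepoint count can be sketched as follows. This is a minimal illustration, not the code from the PR: `approx_width`, `caret_padding`, and `underline` are hypothetical names, and the small hand-written range table is only a stand-in for a real width database such as the one in the unicode-width crate.

```rust
/// Rough terminal width of one codepoint.
/// Assumption: this tiny table is illustrative only and covers just a few
/// Unicode blocks; a real implementation would use full width data.
fn approx_width(c: char) -> usize {
    match c as u32 {
        // Combining Diacritical Marks are drawn over the previous character.
        0x0300..=0x036F => 0,
        // Hiragana and Katakana blocks, CJK Unified Ideographs, and
        // Hangul Syllables are treated as two columns wide.
        0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3 => 2,
        // Everything else is treated as one column.
        _ => 1,
    }
}

/// Number of spaces to print before the first `^` so that it lands under
/// the character starting at `byte_offset` in `line`.
fn caret_padding(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().map(approx_width).sum()
}

/// Renders the underline for the byte range `start..end` of `line`.
fn underline(line: &str, start: usize, end: usize) -> String {
    let pad = caret_padding(line, start);
    let len: usize = line[start..end].chars().map(approx_width).sum();
    // Always draw at least one caret, even for a zero-width span.
    format!("{}{}", " ".repeat(pad), "^".repeat(len.max(1)))
}
```

For the first line in the example above, the text before `0u7` contains 24 codepoints but occupies only 20 columns under this table, so the underline needs 20 spaces of padding. For the second line, the text before `1i42` contains 19 codepoints but occupies 21 columns, because each of the two wide characters takes two cells.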

est31 · 2018-01-12T11:23:28Z

Seems like a part of the issue is still present. See #47380.

estebank · 2018-01-17T02:04:37Z

@est31 this is a problem that might get introduced back every now and then by forgetting to test with unicode, but should only happen on new features (like the suggestions machinery that introduced the linked case).

ben0x539 mentioned this issue Sep 19, 2013

lexer: improve errors #9308

Merged

huonw mentioned this issue Feb 25, 2014

span information is mistaken in the presence of combining characters #3260

Closed

nrc mentioned this issue Feb 28, 2014

Fix bytepos_to_file_charpos. #12613

Closed

ftxqxd mentioned this issue Jan 22, 2015

Diagnostics' ^~~~ is not aligned properly when error contains 日本語 characters #21492

Closed

ftxqxd mentioned this issue Jan 22, 2015

Compute widths properly when displaying spans in error messages #21499

Merged

ftxqxd added a commit to ftxqxd/rust that referenced this issue Feb 3, 2015

Compute widths properly when displaying spans in error messages

d244f09

Closes rust-lang#8706.

bors added a commit that referenced this issue Feb 4, 2015

Auto merge of #21499 - P1start:issue-8706, r=huonw

ac134f7

Closes #8706.

bors closed this as completed in #21499 Feb 4, 2015

kwantam mentioned this issue Apr 19, 2015

deprecate Unicode functions that will be moved to crates.io #24428

Merged

huonw reopened this Apr 19, 2015

huonw mentioned this issue Oct 7, 2015

Spanning algorithm assumes all codepoints have width equal to 1 #28899

Closed

steveklabnik added the T-compiler label Mar 9, 2017

Mark-Simulacrum closed this as completed Apr 28, 2017

Mark-Simulacrum reopened this Apr 28, 2017

Mark-Simulacrum added C-enhancement, C-bug and removed C-enhancement, C-bug, I-wrong labels Jul 19, 2017

euclio mentioned this issue Aug 25, 2017

chars/bytes confusion in the error emitter #44080

Closed

hcpl mentioned this issue Oct 11, 2017

Error span is in incorrect place due to Unicode fullwidth characters #45211

Closed

tirr-c mentioned this issue Nov 2, 2017

Display spans correctly when there are zero-width or wide characters #45711

Merged

bors closed this as completed in #45711 Nov 5, 2017

est31 mentioned this issue Jan 12, 2018

Mispositioned help span indicator when emojis are involved #47380

Closed

est31 mentioned this issue Jun 13, 2018

Allow non-ASCII identifiers rust-lang/rfcs#2457

Merged
