Bad span computations with unicode characters, should be handling them as graphemes #8706

Closed
huonw opened this issue Aug 23, 2013 · 11 comments · Fixed by #21499 or #45711
Labels: A-diagnostics (Area: Messages for errors, warnings, and lints); A-Unicode (Area: Unicode); C-enhancement (Category: An issue proposing an enhancement or a PR with one); T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@huonw
Member

huonw commented Aug 23, 2013

use std::io;

fn main() {
    let s = ~"ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
    io::println(s);
}
main.rs:4:46: 4:67 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
main.rs:4     let s = ~"ZͨA͑ͦ͒͋ͤ̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
                                                       ^~~~~~~~~~~~~~~~~~~~~

The proper fix for this requires grapheme handling (#7043): for example, some graphemes are double width, and combining marks take up no column of their own.
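
To make the mismatch concrete, here is a minimal sketch (not the compiler's code) that measures the same short string three ways. The helper names `is_combining_mark` and `approx_columns` are made up for illustration, and treating only U+0300..=U+036F as zero width is a simplifying assumption, not real grapheme segmentation.

```rust
// Three ways to "measure" the same text. Only the last one is close to what a
// terminal shows, and it is a rough approximation, not real grapheme handling.

/// Returns true for code points in the Combining Diacritical Marks block
/// (U+0300..=U+036F). Assumption: this is the only zero-width range handled.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Approximate number of terminal columns: every code point counts as one
/// column except combining marks, which count as zero.
fn approx_columns(s: &str) -> usize {
    s.chars().filter(|c| !is_combining_mark(*c)).count()
}

fn main() {
    // "Z", then the combining mark U+0368, then "A".
    let s = "Z\u{0368}A";
    println!("bytes:   {}", s.len());           // 4
    println!("chars:   {}", s.chars().count()); // 3
    println!("columns: {}", approx_columns(s)); // 2
}
```

A span computed from bytes or code points therefore lands to the right of where the text is actually drawn, which is the drift visible in the warning above.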

huonw added a commit to huonw/rust that referenced this issue Feb 23, 2014
They are still not completely correct, since this does not handle
graphemes at all, just codepoints, but at least it handles the common
case correctly.

The calculation was previously very wrong (rather than just a little bit
wrong): it wasn't accounting for the fact that not every character is 1
byte, and so multibyte characters were pretending to be zero width.

cc rust-lang#8706
@huonw
Member Author

huonw commented Feb 23, 2014

If #12489 lands, the compiler handles these slightly better, but it just operates on codepoint counts and assumes every codepoint is single width, i.e. it still needs to be changed to work with graphemes.
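
A rough sketch of what counting codepoints buys and where it still falls short. This is not the code from #12489; `codepoint_column` is a hypothetical helper that maps a byte offset to a column by counting code points in the prefix.

```rust
/// Column (0-based) of `byte_pos` within `line`, counted in code points.
/// Returns None if `byte_pos` is out of range or not on a char boundary.
fn codepoint_column(line: &str, byte_pos: usize) -> Option<usize> {
    line.get(..byte_pos).map(|prefix| prefix.chars().count())
}

fn main() {
    // 'é' as a single precomposed code point (U+00E9, 2 bytes in UTF-8).
    let precomposed = "\u{00E9}x";
    // 'é' as 'e' plus the combining acute accent U+0301 (1 + 2 bytes).
    let decomposed = "e\u{0301}x";

    // Byte offset of 'x' in each line.
    let p = precomposed.find('x').unwrap(); // 2
    let d = decomposed.find('x').unwrap(); // 3

    // Both lines draw 'x' in the second column (index 1), but the code point
    // count only gets the precomposed form right.
    println!("{:?}", codepoint_column(precomposed, p)); // Some(1)
    println!("{:?}", codepoint_column(decomposed, d)); // Some(2)
}
```

Counting code points fixes the common multibyte case, while combining marks and wide characters still shift the caret.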

alexcrichton pushed a commit to alexcrichton/rust that referenced this issue Feb 25, 2014
@pzol
Contributor

pzol commented Feb 26, 2014

Visiting for triage, still requires #7043

@alexcrichton
Member

#7043 has been fixed, so this can actually make progress now!

@ftxqxd
Contributor

ftxqxd commented Apr 19, 2015

Because of #24428, this issue should be reopened.

@pnkfelix
Member

I agree with @huonw's comment on PR #24428: namely, we should revert the portions of that PR that regressed rustc.

@bltavares
Contributor

Triaging:

Updating the code to rustc 1.8.0-nightly (57c357d89 2016-02-16)

fn main() {
    let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
    println!("{}", s);
}

The warning still does not point to the correct location:

<anon>:4:45: 4:66 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
<anon>:4     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
                                                     ^~~~~~~~~~~~~~~~~~~~~

steveklabnik added the T-compiler label Mar 9, 2017
@Mark-Simulacrum
Member

Closing as fixed.

fn main() {
    let s = ~"ZͨA͑ͦL̄͑ĜͨOͥ͛!̏"; while true { break; }
}

Now gives:

error: expected expression, found `~`
 --> test.rs:2:13
  |
2 |     let s = ~"ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |             ^

error: aborting due to previous error

@Mark-Simulacrum
Member

Never mind, closed prematurely. The ~ was caught in an earlier pass, and since it's before the while true, I assumed this was fixed when it wasn't...

warning: unused variable: `s`
 --> test.rs:2:9
  |
2 |     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |         ^
  |
  = note: #[warn(unused_variables)] on by default

warning: denote infinite loops with loop { ... }
 --> test.rs:2:45
  |
2 |     let s = "ZͨA͑ͦ͒͋ͤ͑̚L̄͑͋Ĝͨͥ̿͒̽̈́Oͥ͛ͭ!̏"; while true { break; }
  |                                             ^^^^^^^^^^^^^^^^^^^^^
  |
  = note: #[warn(while_true)] on by default

@euclio
Contributor

euclio commented Jul 15, 2017

I'd like to work on this, but I could use some help. I've written a failing UI test, and I've pulled in the unicode-width crate to libsyntax. I'm having trouble finding the right place to change how the spans are calculated in the filemap. Should I be modifying bytepos_to_file_charpos...?

Mark-Simulacrum added the C-enhancement and C-bug labels and removed the C-enhancement, C-bug, and I-wrong labels Jul 19, 2017
bors added a commit that referenced this issue Nov 4, 2017
Display spans correctly when there are zero-width or wide characters

Hopefully...
* fixes #45211
* fixes #8706

---

Before:
```
error: invalid width `7` for integer literal
  --> unicode_2.rs:12:25
   |
12 |     let _ = ("a̐éö̲", 0u7);
   |                         ^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: invalid width `42` for integer literal
  --> unicode_2.rs:13:20
   |
13 |     let _ = ("아あ", 1i42);
   |                    ^^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: aborting due to 2 previous errors
```

After:
```
error: invalid width `7` for integer literal
  --> unicode_2.rs:12:25
   |
12 |     let _ = ("a̐éö̲", 0u7);
   |                     ^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: invalid width `42` for integer literal
  --> unicode_2.rs:13:20
   |
13 |     let _ = ("아あ", 1i42);
   |                      ^^^^
   |
   = help: valid widths are 8, 16, 32, 64 and 128

error: aborting due to 2 previous errors
```

Spans might display incorrectly in the browser.

r? @estebank
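
For readers who want to see the idea behind the "After" output, here is a minimal sketch under stated assumptions. It is not the code from this commit: `approx_char_width` and `caret_padding` are hypothetical helpers, and the width ranges are a hand-picked subset standing in for what the `unicode-width` crate mentioned earlier in the thread provides. The first test string is written in decomposed form (base letter plus combining marks), an assumption that is consistent with the reported column 25.

```rust
/// Rough terminal width of one code point. The ranges are a simplified
/// assumption for illustration, not the full Unicode East Asian Width data.
fn approx_char_width(c: char) -> usize {
    match c {
        // Combining Diacritical Marks: drawn on top of the previous character.
        '\u{0300}'..='\u{036F}' => 0,
        // Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables.
        '\u{3040}'..='\u{30FF}' | '\u{4E00}'..='\u{9FFF}' | '\u{AC00}'..='\u{D7A3}' => 2,
        _ => 1,
    }
}

/// Number of columns of padding needed before the caret for a span that
/// starts at `byte_pos` in `line`. Panics if `byte_pos` is not a char boundary.
fn caret_padding(line: &str, byte_pos: usize) -> usize {
    line[..byte_pos].chars().map(approx_char_width).sum()
}

fn main() {
    let lines = [
        ("    let _ = (\"a\u{0310}e\u{0301}o\u{0308}\u{0332}\", 0u7);", "0u7"),
        ("    let _ = (\"\u{C544}\u{3042}\", 1i42);", "1i42"),
    ];
    for (line, token) in lines {
        // Byte offset where the offending literal starts.
        let start = line.find(token).unwrap();
        println!("{}", line);
        // Pad by display width, not by byte or code point count.
        println!("{}{}", " ".repeat(caret_padding(line, start)), "^".repeat(token.len()));
    }
}
```

With these inputs the padding comes out as 20 and 21 columns, matching the caret positions in the "After" output, whereas counting code points gives 24 and 19 as in "Before".
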
@est31
Member

est31 commented Jan 12, 2018

Seems like a part of the issue is still present. See #47380.

@estebank
Contributor

@est31 this is a problem that might get introduced back every now and then by forgetting to test with unicode, but should only happen on new features (like the suggestions machinery that introduced the linked case).

flip1995 pushed a commit to flip1995/rust that referenced this issue Apr 21, 2022
Fix formatting of `cast_abs_to_unsigned` docs

The "use instead" section of the example was not being formatted as Rust code, and the "configuration" documentation was being formatted as Rust code.

changelog: `[cast_abs_to_unsigned]` Fix example/configuration formatting