Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup of eat_while() in lexer #77629

Merged
merged 1 commit into from
Oct 10, 2020

Conversation

Julian-Wollersberger
Copy link
Contributor

@Julian-Wollersberger Julian-Wollersberger commented Oct 6, 2020

The size of a lexer Token was inflated by the largest TokenKind variants LiteralKind::RawStr and RawByteStr, because

  • it used usize although u32 is sufficient in rustc, since crates must be smaller than 4GB,
  • and it stored the 20 bytes big RawStrError enum for error reporting.

If a raw string is invalid, it now needs to be reparsed to get the RawStrError data, but that is a very cold code path.

Technically this breaks other tools that depend on rustc_lexer because they are now also restricted to a max file size of 4GB. But this shouldn't matter in practice, and rustc_lexer isn't stable anyway.

Can I also get a perf run?

Edit: This makes no difference in performance. The PR now only contains a small cleanup.

@rust-highfive
Copy link
Collaborator

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 6, 2020
@jonas-schievink
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 6, 2020

⌛ Trying commit 2cc461ba86901295d9b706d7e39a15e3d4efa6cf with merge 42c8db8af9dcb6f65f03cbbdf104da7ed2fc2018...

@camelid camelid added A-parser Area: The parsing of Rust source code to an AST. I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. labels Oct 6, 2020
@bors
Copy link
Contributor

bors commented Oct 7, 2020

💥 Test timed out

@Julian-Wollersberger
Copy link
Contributor Author

I fixed the test failure and added @pickfire's suggestion.

I don't understand why bors timed out. Was it spurious?
Can I get another perf run?

@Xanewok
Copy link
Member

Xanewok commented Oct 7, 2020

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 7, 2020

⌛ Trying commit 2663c35bf86871e1849400eca6417c1fed02990c with merge b468e924ec58d34e078526586335258b57deb9a8...

@Julian-Wollersberger
Copy link
Contributor Author

Julian-Wollersberger commented Oct 7, 2020

I noticed a potential bug in eat_while(): it doesn't account for the number of UTF8 bytes.
Fixed it by inlining it in the two places where the count is used and also simplified the logic there.
It might be a little perf improvement too.

@Xanewok can you restart the perf-run? EDIT: Nevermind. Sorry for bothering you.

@jyn514
Copy link
Member

jyn514 commented Oct 7, 2020

@Julian-Wollersberger bug fixes don't need a perf run IMO, unless they strongly affect perf.

@jyn514 jyn514 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 7, 2020
@Julian-Wollersberger
Copy link
Contributor Author

I guess rust-timer got stuck. @Mark-Simulacrum do you know what happened here and can you restart the perf run?
Maybe rust-timer got confused because bors timed out?

@jyn514
Copy link
Member

jyn514 commented Oct 8, 2020

@rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@jyn514
Copy link
Member

jyn514 commented Oct 8, 2020

🤦 I think it got confused because you pushed more commits maybe?

@bors try

@bors
Copy link
Contributor

bors commented Oct 8, 2020

⌛ Trying commit 8caa6072eef69d6f4098ed7a04f2e674c22dbe57 with merge 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b...

@bors
Copy link
Contributor

bors commented Oct 8, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b)

@rust-timer
Copy link
Collaborator

Queued 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b with parent f1dab24, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

…er of UTF8 bytes.

Fixed it by inlining it in the two places where the count is used and simplified the logic there.
@Julian-Wollersberger
Copy link
Contributor Author

The perf results couldn't be more neutral. Oh well.
I've removed the first two commits and only the bugfix remains. This can be rollup=always I guess.

Changing usize to u32 might still be worth it as a cleanup so rustc_lexer is more consistent with the rest of rustc. Opinions?
@varkor this is ready for review again.

@pickfire
Copy link
Contributor

pickfire commented Oct 9, 2020

Oh, there doesn't seemed to be much performance gain but I wonder if memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

@Mark-Simulacrum
Copy link
Member

Please update the PR description and title if you significantly change the pull request, right now it is not clear that this is a bugfix. I would also expect to see a test case if there is a bug being fixed here.

@mati865
Copy link
Contributor

mati865 commented Oct 9, 2020

Oh, there doesn't seemed to be much performance gain but I wonder if memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

It's noisy as always but in terms of the memory it's neutral/slightly worse.

@Julian-Wollersberger Julian-Wollersberger changed the title Reduce size of lexer::Token from 72 to 16 bytes Cleanup of eat_while() in lexer Oct 10, 2020
@Julian-Wollersberger
Copy link
Contributor Author

Julian-Wollersberger commented Oct 10, 2020

Please update the PR description and title if you significantly change the pull request

Ok, makes sense. Done.

I would also expect to see a test case if there is a bug being fixed here.

Sorry, I wasn't precise. It could have been a bug if the count returned by eat_while() was used in a place where non-ASCII chars can be, but luckily it was only used to count the '#' in raw strings, which are ASCII chars.
So it's not really a bugfix but a cleanup to prevent a potential bug.

@Julian-Wollersberger
Copy link
Contributor Author

From what I see the memory footprint is not tracked in the benchmarks.

@pickfire At the bottom of the page you can select "max-rss" and click "submit" at the right edge of the page.
(There is PR rust-lang/rustc-perf#782 to improve the design.)

@varkor
Copy link
Member

varkor commented Oct 10, 2020

@bors r+ rollup

@bors
Copy link
Contributor

bors commented Oct 10, 2020

📌 Commit bd49ded has been approved by varkor

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 10, 2020
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Oct 10, 2020
…rError, r=varkor

Cleanup of `eat_while()` in lexer

The size of a lexer Token was inflated by the largest `TokenKind` variants `LiteralKind::RawStr` and `RawByteStr`, because
* it used `usize` although `u32` is sufficient in rustc, since crates must be smaller than 4GB,
* and it stored the 20 bytes big `RawStrError` enum for error reporting.

If a raw string is invalid, it now needs to be reparsed to get the `RawStrError` data, but that is a very cold code path.

Technically this breaks other tools that depend on rustc_lexer because they are now also restricted to a max file size of 4GB. But this shouldn't matter in practice, and rustc_lexer isn't stable anyway.

Can I also get a perf run?

Edit: This makes no difference in performance. The PR now only contains a small cleanup.
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 10, 2020
Rollup of 10 pull requests

Successful merges:

 - rust-lang#77195 (Link to documentation-specific guidelines.)
 - rust-lang#77629 (Cleanup of `eat_while()` in lexer)
 - rust-lang#77709 (Link Vec leak doc to Box)
 - rust-lang#77738 (fix __rust_alloc_error_handler comment)
 - rust-lang#77748 (Dead code cleanup in windows-gnu std)
 - rust-lang#77754 (Add TraitDef::find_map_relevant_impl)
 - rust-lang#77766 (Clarify the debug-related values should take boolean)
 - rust-lang#77777 (doc: disambiguate stat in MetadataExt::as_raw_stat)
 - rust-lang#77782 (Fix typo in error code description)
 - rust-lang#77787 (Update `changelog-seen` in config.toml.example)

Failed merges:

r? `@ghost`
@bors bors merged commit c14c9ba into rust-lang:master Oct 10, 2020
@rustbot rustbot added this to the 1.49.0 milestone Oct 10, 2020
nnethercote added a commit to nnethercote/rust that referenced this pull request Jul 31, 2022
From 72 bytes to 12 bytes (on x86-64).

There are two parts to this:
- Changing various source code offsets from 64-bit to 32-bit. This is
  not a problem because the rest of rustc also uses 32-bit source code
  offsets. This means `Token` is no longer `Copy` but this causes no
  problems.
- Removing the `RawStrError` from `LiteralKind`. Raw string literal
  invalidity is now indicated by a `None` value within
  `RawStr`/`RawByteStr`, and the new `validate_raw_str` function can be
  used to re-lex an invalid raw string literal to get the `RawStrError`.

There is one very small change in behaviour. Previously, if a raw string
literal matched both the `InvalidStarter` and `TooManyHashes` cases,
the latter would override the former. This has now changed, because
`raw_double_quoted_string` now uses `?` and so returns immediately upon
detecting the `InvalidStarter` case. I think this is a slight
improvement to report the earlier-detected error, and it explains the
change in the `test_too_many_hashes` test.

The commit also removes a couple of comments that refer to rust-lang#77629 and
say that the size of these types don't affect performance. These
comments are wrong, though the performance effect is small.
flip1995 pushed a commit to flip1995/rust that referenced this pull request Aug 11, 2022
From 72 bytes to 12 bytes (on x86-64).

There are two parts to this:
- Changing various source code offsets from 64-bit to 32-bit. This is
  not a problem because the rest of rustc also uses 32-bit source code
  offsets. This means `Token` is no longer `Copy` but this causes no
  problems.
- Removing the `RawStrError` from `LiteralKind`. Raw string literal
  invalidity is now indicated by a `None` value within
  `RawStr`/`RawByteStr`, and the new `validate_raw_str` function can be
  used to re-lex an invalid raw string literal to get the `RawStrError`.

There is one very small change in behaviour. Previously, if a raw string
literal matched both the `InvalidStarter` and `TooManyHashes` cases,
the latter would override the former. This has now changed, because
`raw_double_quoted_string` now uses `?` and so returns immediately upon
detecting the `InvalidStarter` case. I think this is a slight
improvement to report the earlier-detected error, and it explains the
change in the `test_too_many_hashes` test.

The commit also removes a couple of comments that refer to rust-lang#77629 and
say that the size of these types don't affect performance. These
comments are wrong, though the performance effect is small.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parser Area: The parsing of Rust source code to an AST. I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.