Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative formatting processing #42

Merged
merged 1 commit into from
Oct 15, 2024

Conversation

cdesaintguilhem
Copy link
Contributor

Hi there, as I was tinkering to figure out how to wrap closer to the given wrapping parameter one thing led to another and I arrived at an alternative logic for some of the steps that the formatting loop makes. I thought I'd share the code here in case you would be interested; if there are particular changes you like, I would be happy to cherry-pick them into their own PR and close this one up for the archives.

All the changes in this PR are to the logic only and not to the formatting style, all tests still pass unchanged (save for the addition of a \begin{env1} inside a comment to check that it is properly ignored). However, with these changes, I think that wrapping tightly to the wrapping parameter can be achieved by changing the wrapping_boundary to use args.wrap instead of args.wrap_min. I'm still testing this on another branch, but this requires changing all the test file to account for the tighter expected wrap.

Summary of changes

  1. The main formatting loop processes (splits and wraps) lines only by reference, allocating new memory only when applying the indentation, or when returning lines to the queue after wrapping them.
  2. To avoid allocating when indenting only to return the line to the queue without adding it to the text, calculating the indent is separated from applying the indent. This allows moving the indent application until after the wrapping computation.
  3. Since the indent is no longer applied to the line before wrapping it, the wrapping logic is expanded to instead take into account the calculated indent when deciding whether a line needs to be wrapped and, if so, where it should be wrapped.
  4. Since some languages have 0-width chars (such as the French accents `,´,ˆ, etc.), I suspect that going with .chars() could cause inconsistent wrapping in non-English languages. This code introduces the unicode-width dependency to obtain the on-screen width of each character in the string and calculate a wrapping point that visually matches the wrapping parameter.
  5. The splitting of lines is done with a new capturing regex from which the index of the matches can be extracted. This then allows returning two string slices: one containing everything before the split point, and the other containing the rest. This Regex could be easily expanded to capture other commands at which the line should be split (such as \section{...} or \[ for example).
  6. The wrapping of lines is also done by reference by returning a string slice for the line before the wrap point, a slice containing the start of the next line (to add "% " in front of comments), and a slice containing the rest of the line.

Misc changes

  1. Cleaning the text (removing tabs and trailing whitespace) is extracted into a function.
  2. Deciding whether to ignore a line (based on verbatim environments or tex-fmt ignore commands) is extracted into a function, and checked before proceeding to the formatting logic.
  3. Checking whether formatting returns to zero is done by checking the value of indent.actual after exiting the loop instead of accessing the lines of the text.
  4. The test_source() function is modified slightly to panic on the first file that formats incorrectly and to print the file name. The computation of the diff was taking a very long time when many files were failing because of many diffs.

Performance Report

As show from the output below, these changes bring a (63.0 -> 57.1 = ) 9.3% improvement in performance on my machine.

Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Building release binary
   Compiling tex-fmt v0.4.5 (/Users/cyprien/Code/tex-fmt)
    Finished `release` profile [optimized] target(s) in 11.74s
Running benchmark
Benchmark 1: tex-fmt-main
  Time (mean ± σ):      63.0 ms ±   0.7 ms    [User: 58.7 ms, System: 3.8 ms]
  Range (min … max):    62.4 ms …  72.6 ms    200 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
branch 'pr-refactor-logic' set up to track 'origin/pr-refactor-logic'.
Switched to a new branch 'pr-refactor-logic'
Building release binary
  Downloaded unicode-width v0.2.0
  Downloaded 1 crate (271.5 KB) in 0.21s
   Compiling unicode-width v0.2.0
   Compiling tex-fmt v0.4.5 (/Users/cyprien/Code/tex-fmt)
    Finished `release` profile [optimized] target(s) in 11.78s
Running benchmark
Benchmark 1: tex-fmt-branch
  Time (mean ± σ):      57.1 ms ±   0.5 ms    [User: 53.0 ms, System: 3.8 ms]
  Range (min … max):    56.5 ms …  61.6 ms    200 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
tex-fmt-main: 0.062972017995s
tex-fmt-branch: 0.05712572198499997s

@WGUNDERWOOD
Copy link
Owner

WGUNDERWOOD commented Oct 14, 2024

Thanks, this is very exciting! I will look through this carefully soon; your contributions are greatly appreciated.

src/format.rs Outdated Show resolved Hide resolved
@WGUNDERWOOD
Copy link
Owner

Other than the comment above, this all looks good to me. Perhaps you can rebase it onto main, noting that I have split parse.rs into cli.rs and read.rs to separate the CLI logic from the file reading logic.

@cdesaintguilhem cdesaintguilhem changed the base branch from develop to main October 15, 2024 12:07
@cdesaintguilhem
Copy link
Contributor Author

I've now merged your latest changes to main into this branch and addressed the typo.

I'm happy to squash all of these commits into one on my end, but I'm wondering whether this is something that you're also able to do on your side. Up to you if you want to decide what the merge commit should say or if you're happy for me come up with something.

@WGUNDERWOOD
Copy link
Owner

Yes I think it's best if you squash on your side before merging. Also if you ever want to dig deeper into performance issues, you can run the extra/prof.sh script to produce a flamegraph. Right now it seems that the .count() method in the get_diff function in indent.rs is the main candidate for improvement.

Squashed commit of the following:

commit ccbc433
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Tue Oct 15 14:13:35 2024 +0200

    Fix typo

commit 0e4ec8c
Merge: 52fe1f7 3f658fd
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Tue Oct 15 14:07:14 2024 +0200

    Merge remote-tracking branch 'upstream/main' into pr-refactor-logic

commit 52fe1f7
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Mon Oct 14 22:19:16 2024 +0200

    Document added regex

commit 1418812
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Mon Oct 14 18:15:44 2024 +0200

    Fix Clippy warnings

commit 6c248db
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Mon Oct 14 18:04:11 2024 +0200

    Use `char::width()` for checking wrapping need

commit 3f658fd
Author: William G Underwood <42812654+WGUNDERWOOD@users.noreply.github.com>
Date:   Mon Oct 14 15:28:00 2024 +0100

    Clippy

commit 0dcbc93
Author: William G Underwood <42812654+WGUNDERWOOD@users.noreply.github.com>
Date:   Mon Oct 14 15:03:15 2024 +0100

    Split cli and read files

commit 3ca0fd2
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Mon Oct 14 14:09:18 2024 +0200

    Remove unused function

commit b4e2130
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Mon Oct 14 14:08:45 2024 +0200

    Wrap lines based on indent length and only apply indentation after wrapping

commit 9bc0a3b
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Sun Oct 13 20:12:44 2024 +0200

    Move allocation outside of wrap application

commit 76d9ef0
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Sat Oct 12 11:17:43 2024 +0200

    Separate indent calculation from application

commit eb7971e
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Sat Oct 12 10:46:56 2024 +0200

    Extract indent calculation into function

commit e41a66d
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Sat Oct 12 10:31:58 2024 +0200

    Apply clippy suggestions

commit 21bb908
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 17:43:14 2024 +0200

    Rename ignore check and document

commit ac164ad
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 17:13:31 2024 +0200

    Simplify Option matching

commit 8a6e868
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 12:58:12 2024 +0200

    Remove unnecessary comment check and `Option` in return type when splitting lines

commit 1d15b86
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 12:57:17 2024 +0200

    Change comments around

commit e77b629
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 12:56:46 2024 +0200

    Refactor ignoring check as function

commit b10b010
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 11:37:18 2024 +0200

    Move environment line splitting before indenting

    This avoids re-adding the start of the split line to the queue.

commit 0424d0a
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 11:23:01 2024 +0200

    Move check for ignore outside of `apply_indent()`

commit 70ac614
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 09:29:55 2024 +0200

    Use new return type for `put_env_new_line`

commit fdea16b
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 09:28:54 2024 +0200

    Fix infinite loop issue when split points are in comments

commit 791c7f1
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 09:27:23 2024 +0200

    Correct to not expect `{` after `\item` but only after `\begin` and `\end`

commit 6e5ce6c
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Fri Oct 11 09:26:06 2024 +0200

    Match target tex file to source file

commit 9562af1
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 22:27:53 2024 +0200

    Remove debug print statement

commit 0f48547
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 18:10:00 2024 +0200

    Split environment and item lines without allocating

    This does not compile because the `format` function cannot yet deal with the new return type.

commit c49a08c
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 13:27:48 2024 +0200

    Check indentation return to 0 on `state` instead of `new_text`

commit 1b90f21
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 13:11:57 2024 +0200

    Split condition tree into distinct blocks with `continue`

commit e974225
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 11:07:08 2024 +0200

    Plan line formatting logic steps

commit 4bf802a
Author: Cyprien de Saint Guilhem <cyprien.desaintguilhem@3milabs.tech>
Date:   Thu Oct 10 10:53:42 2024 +0200

    Group lines to clean text into one function

commit 380380d
Author: William G Underwood <42812654+WGUNDERWOOD@users.noreply.github.com>
Date:   Thu Oct 10 14:33:04 2024 +0100

    Release process notes
@cdesaintguilhem
Copy link
Contributor Author

Squash completed.

I've kept the messages in the squash commit's message; and I've kept a branch on my fork with the individual commits.

@WGUNDERWOOD WGUNDERWOOD merged commit f73426c into WGUNDERWOOD:main Oct 15, 2024
9 checks passed
@cdesaintguilhem cdesaintguilhem deleted the pr-refactor-logic branch October 15, 2024 13:46
@WGUNDERWOOD
Copy link
Owner

Could you add an example test of where the old code would fail on zero-width chars? I tried an example with a French accent in tests/source/unicode.tex, but using .width() doesn't seem necessary for this one to pass.

@cdesaintguilhem
Copy link
Contributor Author

As a first attempt: the following line (with two spaces before "Aujourd'hui" and where the 70th visual column is between "vérifier" and "is")

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé vérifier si l'été était arrivé.

is wrapped by tex-fmt 0.4.5 to

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé
vérifier si l'été était arrivé.

So it's wrapped at the 61-st visual place, instead of the expected 70th (because of wrap_min)

However, running the current main with cargo run wraps it to

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé vérifier
si l'été était arrivé.

The issue with what I said in the PR is that French accents are not 0-width characters; indeed letters like é are single characters in their own right, rather than being e + ´. However, introducing the (Hungarian?) letter causes the difference in behaviour since this is represented with two separate characters.

If there is a performance loss in using the function from unicode-width it might be worth removing it given that it might not affect as many languages as I thought.

@WGUNDERWOOD
Copy link
Owner

I do see a slight performance degradation using unicode-width. I'll look into it a little more to see how likely we are to encounter problems if we remove it, but my guess is that we don't need to worry about this too much, as LaTeX is designed so that most accents can be expressed using standard characters, such as {\'e} for example.

@cdesaintguilhem
Copy link
Contributor Author

That's a good point about how LaTeX is meant to offer other ways for the user to express accents, without assuming that the source is UTF-8 encoded. Some of the scripts for which I suspect there could be more 0-width characters would include Mandarin, Hindi, Arabic or maybe Korean or Japanese but I don't know enough about these to check. (I also don't know how much LaTeX is used with these languages.)

One thought I had would be to extract the code that calculates a line's length into a function, and have a feature (something like true-unicode) which controls how that function is implemented (using unicode-width::width() when the feature is enabled, or just .char_indices() when not).

@WGUNDERWOOD
Copy link
Owner

A list of characters for which the width needs modifying is given in the unicode_width crate docs. My impression is that these will be rarely encountered, so I'll remove this dependency for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants