Reformulate wrapping in terms of words with whitespace and penalties #221

Merged
merged 1 commit into from
Nov 8, 2020


mgeisler
Owner

@mgeisler mgeisler commented Nov 8, 2020

This is a complete rewrite of the core word wrapping functionality. The user-visible change is that wrap now returns a Vec<Cow<'_, str>> instead of impl Iterator<Item = Cow<'_, str>>. In other words, you now get all lines returned to you at once instead of getting an iterator back. Code that simply iterated over the old return value only needs to add .iter(), and code that already collected the lines into a vector can drop the collect step. I looked around GitHub for code that uses textwrap, and most of it simply calls fill or wrap. An example is the clap crate.
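To make the migration concrete, here is a minimal sketch of the two kinds of call sites. The function `wrap_stand_in` is a hypothetical placeholder that only mimics the new return type (it splits on newlines and does no wrapping), and `longest_line` and `join_lines` are made-up callers for illustration.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in with the same return shape as the new `wrap`:
/// one `Cow` per line. This is NOT textwrap's implementation; it only
/// splits on newlines so that the call sites below have something to use.
fn wrap_stand_in(text: &str) -> Vec<Cow<'_, str>> {
    text.lines().map(Cow::Borrowed).collect()
}

/// A call site that used to chain adapters directly on the returned
/// iterator. With a `Vec`, it borrows the lines with `.iter()` first.
fn longest_line(text: &str) -> usize {
    let lines = wrap_stand_in(text);
    lines
        .iter()
        .map(|line| line.chars().count())
        .max()
        .unwrap_or(0)
}

/// A call site that used to collect into a `Vec` before joining.
/// The collect step is no longer needed.
fn join_lines(text: &str) -> String {
    let lines = wrap_stand_in(text);
    lines.join("\n")
}
```

Calling `join` directly on the vector works because `Cow<'_, str>` implements `Borrow<str>`.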


New algorithm: Before, we would step through the input string and (attempt to) keep track of all aspects of the state. This didn't always work (see at least #122, #158, and #193) and it was inflexible.

This commit replaces the old algorithm with a new one which works on a more abstract level. We now:

  1. Split the input string into "words". A word is a substring of the original string, including any trailing whitespace.

  2. Split each word according to the WordSplitter.

  3. Optionally, if break_words is true, further split each word so that it is no longer than the line width.

  4. Put the words into lines based on the display width.
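The steps above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: it covers only steps 1 and 4, skips the WordSplitter and break_words steps, treats only ASCII spaces as whitespace, and approximates display width by counting `char`s. The names `Word`, `find_words`, and `wrap_greedy` are chosen for this sketch.

```rust
/// A word plus its trailing whitespace, both borrowed from the input.
#[derive(Debug, PartialEq)]
struct Word<'a> {
    text: &'a str,
    whitespace: &'a str,
}

/// Step 1: split on spaces, keeping the trailing spaces attached to each word.
fn find_words(line: &str) -> Vec<Word<'_>> {
    let mut words = Vec::new();
    let mut rest = line;
    while !rest.is_empty() {
        // The word ends at the first space or at the end of the input.
        let word_end = rest.find(' ').unwrap_or(rest.len());
        // The whitespace run ends at the next non-space character.
        let space_end = rest[word_end..]
            .find(|c: char| c != ' ')
            .map_or(rest.len(), |i| word_end + i);
        words.push(Word {
            text: &rest[..word_end],
            whitespace: &rest[word_end..space_end],
        });
        rest = &rest[space_end..];
    }
    words
}

/// Step 4: greedily put words into lines. Width is approximated by the
/// number of `char`s (an assumption of this sketch; real display width
/// differs for wide and zero-width characters).
fn wrap_greedy(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new(); // the line built so far
    let mut current_width = 0;
    let mut pending = ""; // whitespace owed before the next word
    for word in find_words(text) {
        let word_width = word.text.chars().count();
        let pending_width = pending.chars().count();
        if !current.is_empty() && current_width + pending_width + word_width > width {
            // The word does not fit: finish the line and drop the whitespace.
            lines.push(std::mem::take(&mut current));
            current_width = 0;
        } else if !current.is_empty() {
            // The word fits: emit the whitespace that preceded it.
            current.push_str(pending);
            current_width += pending_width;
        }
        current.push_str(word.text);
        current_width += word_width;
        pending = word.whitespace;
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

Keeping the whitespace attached to the preceding word means a line break can simply discard it, so the wrapping loop never has to trim the lines afterwards.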

This is slower than the previous algorithm. The fill/1600 benchmark shows that it now takes ~19 microseconds to wrap a 1600 character long string (about 20 lines of terminal text). That is ~8 microseconds longer than before. I think this is still plenty fast, and the new structure makes it easier to reason about the logic.


This is a step towards #126: the wrap_fragments function could now in principle be used to wrap any kind of opaque "box", and this box could carry formatting information as needed. We can work on abstracting more functionality going forward, probably by making the Fragment trait more powerful, e.g., by moving the break_apart method from Word to Fragment.
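As a rough illustration of the "opaque box" idea, the sketch below wraps any type that can report three widths. The trait methods, the `StyledWord` type, and the function name `wrap_fragments_sketch` are assumptions made for this example and should not be read as the exact API added in this PR.

```rust
/// Hypothetical trait for an opaque "box". The method names are
/// assumptions of this sketch.
trait Fragment {
    /// Width of the content itself.
    fn width(&self) -> usize;
    /// Width of the whitespace that follows the fragment mid-line.
    fn whitespace_width(&self) -> usize;
    /// Width of the penalty (such as a hyphen) added if the line ends here.
    fn penalty_width(&self) -> usize;
}

/// Example fragment that carries formatting information with its text.
#[derive(Debug, PartialEq)]
struct StyledWord {
    text: &'static str,
    bold: bool,
    penalty: &'static str,
}

impl Fragment for StyledWord {
    fn width(&self) -> usize {
        self.text.chars().count()
    }
    fn whitespace_width(&self) -> usize {
        1 // this sketch assumes a single space after every word
    }
    fn penalty_width(&self) -> usize {
        self.penalty.chars().count()
    }
}

/// Greedy wrapping over any fragment type. Each returned slice is one line.
fn wrap_fragments_sketch<T: Fragment>(fragments: &[T], line_width: usize) -> Vec<&[T]> {
    let mut lines = Vec::new();
    let mut start = 0; // index of the first fragment on the current line
    let mut width = 0; // width used so far, including trailing whitespace
    for (idx, fragment) in fragments.iter().enumerate() {
        // A fragment fits if its content plus its end-of-line penalty fits.
        let needed = width + fragment.width() + fragment.penalty_width();
        if idx > start && needed > line_width {
            lines.push(&fragments[start..idx]);
            start = idx;
            width = 0;
        }
        width += fragment.width() + fragment.whitespace_width();
    }
    if start < fragments.len() {
        lines.push(&fragments[start..]);
    }
    lines
}
```

The wrapping function never looks at the text or the `bold` flag. It only sees widths, which is what would let the same loop handle colored or otherwise formatted content.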

This is a complete rewrite of the core word wrapping functionality.
Before, we would step through the input string and (attempt to) keep
track of all aspects of the state. This didn't always work (see at
least #122, #158, and #193) and it was inflexible.

This commit replaces the old algorithm with a new one which works on a
more abstract level. We now:

1. Split the input string into "words". A word is a substring of
   the original string, including any trailing whitespace.

2. Split each word according to the `WordSplitter`.

3. Put the words into lines based on the display width.

This is slower than the previous algorithm. The `fill/1600` benchmark
shows that it now takes ~18 microseconds to wrap a 1600 character long
string. That is around 8 microseconds longer than before.
@mgeisler mgeisler merged commit 52c39c3 into master Nov 8, 2020
@mgeisler
Owner Author

mgeisler commented Nov 8, 2020

Ah, I should probably explain the PR title a little... the model here is vaguely inspired by the concepts of boxes, glue, and penalties in TeX. The terms were introduced in the very readable article Breaking Paragraphs into Lines from 1981 by Donald E. Knuth and Michael F. Plass. In short, a box is an opaque rectangle on the page, glue is the stretchable whitespace between boxes, and penalties are the extra content inserted at line breaks (such as hyphens). The article describes a line-breaking algorithm which justifies text while minimizing the stretching of individual lines.

I first wanted to reuse the terminology from the article, but the word box is a reserved keyword and already has the meaning of "heap allocation". I could have used the word glue to refer to the whitespace between words, but since we don't (yet) support justified text, our glue would be rather inflexible. Lastly, the greedy algorithm implemented in textwrap does not try to minimize anything except the total number of lines, so reusing the terminology from the article would have been misleading.

@mgeisler mgeisler changed the title from "Reformulate wrapping in terms of boxes, glue, and penalties" to "Reformulate wrapping in terms of words with whitespace and penalties" Dec 5, 2020
mgeisler added a commit that referenced this pull request Dec 9, 2020
This was broken by the rewrite in #221 and we only had coverage for a
single case of wrapping colored text.

Fixes #248.
@mgeisler mgeisler deleted the fragments-and-words branch January 30, 2021 16:48