Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modularize library #244

Closed
wants to merge 2 commits into from
Closed

Modularize library #244

wants to merge 2 commits into from

Conversation

Kestrer
Copy link
Contributor

@Kestrer Kestrer commented Dec 5, 2020

First of all, apologies for the enormous PR - I initially tried to do incremental changes but it became clear that a rewrite from the ground up was required, so here we are.

This rewrite had several motivations that explain many of the design decisions:

  • The ability to support multiple splitting algorithms in the future, such as ICU.
  • The ability to support non-terminal contexts that have different width calculations than unicode-width.
  • Currently, the library assumes that terminals will display characters using their Unicode width, however this is an incorrect assumption: iTerm2 for example displays 👨‍👨‍👧‍👦 (a family emoji) in two columns instead of eight as unicode-width would report. This point ties into the last one: allowing swapping out unicode-width for other functions.
  • The possibility of supporting no_std in the future; none of the main algorithms currently allocate or use any std features.
  • Favouring modularity over configuration - having the core API be entirely modular allows it to be implemented in a far simpler way and allows it to be more extensible. However it can be fairly verbose, I can add a configuration layer on top of it if you wish.

To understand the precise changes that were made, it's probably best to just to open it in Rustdoc.

This is far from complete. I still have to re-add hyphenation support, tests, benchmarks, examples, shorthand functions et cetera, but the base functionality is there. By the way, I'm not expecting this API to be added as-is; this PR just represents one extreme of how modular this library could be. We could compromise between this and the current one.

@mgeisler
Copy link
Owner

mgeisler commented Dec 6, 2020

Hi @Koxiaet,
Wow, this is indeed a big change. There are definitely some nice ideas here — now the good question is how we can extract them and merge them.

@mgeisler
Copy link
Owner

mgeisler commented Dec 6, 2020

I think would need to discuss the problems with the current design first — when we agree on what those are, then we can look for solutions to them.

You mention above that this PR brings

  • The ability to support multiple splitting algorithms in the future, such as ICU.

Can this not be done today by preparing Fragments in a suitable fashion using ICU?

Right now, textwrap::wrap takes a &str and gives you a Vec. It finds and splits the Words internally, but I was thinking that we should change wrap to something like

pub fn wrap<'a, S, Opt>(text: &str, options: Opt) -> Vec<Cow<'_, str>>
where
    S: WordSplitter,
    Opt: Into<Options<'a, S>>,
{
    text.split('\n').flat_map(|line| {
        let words = core::find_word_and_measure_unicode_width(line, &options);
        core::wrap_fragments(words, &options);
    })
}

That is, make wrap really trivial. If someone has a different way of finding and measuring words, then it simply replace find_words_and_measure_unicode_width with another function. This function can be as complex as it wants, as long as it returns a Vec<Fragment> or Iterator<Item = Fragment> at the end of the day. We can then feed that into a generic wrapping function and we should be good?

@mgeisler
Copy link
Owner

mgeisler commented Dec 6, 2020

  • Currently, the library assumes that terminals will display characters using their Unicode width, however this is an incorrect assumption: iTerm2 for example displays family_man_man_girl_boy (a family emoji) in two columns instead of eight as unicode-width would report. This point ties into the last one: allowing swapping out unicode-width for other functions.

Right, we cannot actually know how the characters are displayed at the end of the day. Applications which know more than what unicode-width does, will have to prepare their Fragments themselves.

My goal was to make it easy for 99% of the applications to call

for line in textwrap::wrap(&some_text, some_width) {
    // do something with the line of text
}

while also making it possible for more advanced programs to do something like this:

let fragments = custom_way_of_finding_words(&some_text);
let wrapped_fragments = core::wrap_optimal_fit(&words, line_lengths);
for line in wrapped_fragments {
    // Use the fragments in line to refer back to font and styling information
    // so you can render them in your UI or put them into a PDF, or...
}

Perhaps an advanced program only needs to check if $TERM_PROGRAM is equal to iTerm.app after which it can run through the words returned by core::find_words and adjust the widths suitably. We could also do this automatically in find_words, which would be pretty awesome actually!

@mgeisler
Copy link
Owner

mgeisler commented Dec 6, 2020

The decoupling of Span from Fragment seems nice:

/// A span of text with content, glue and a penalty.
///
/// This type is intentionally agnostic over the algorithm used to calculate the width of each
/// string, and so does not implement [`Fragment`](super::Fragment).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span<'a> {

The Word struct I made has one and only one way to measure the text width (via unicode-width) and that might have been a mistake...

In any case, I look forward to seeing some benchmark results when you get that hooked up. I don't know precisely how fast textwrap needs to be, but my feeling is that people consider wrapping text to be "simple" or nearly "trivial", so I hope to keep things fairly lean here. The functions in textwrap::core are pretty lean to me right now and I hope to keep that.

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 7, 2020

Just to clear up some confusion - I'm not suggesting that all users should be forced to use the modular and verbose API. I was planning on adding simple functions like the ones we have now on top of it that use sensible defaults. So I absolutely share the goal "to make it easy for 99% of the applications".

Perhaps an advanced program only needs to check if $TERM_PROGRAM is equal to iTerm.app after which it can run through the words returned by core::find_words and adjust the widths suitably. We could also do this automatically in find_words, which would be pretty awesome actually!

This is a good idea, the trouble is I don't know how to "adjust the widths suitably". Once an application doesn't use Unicode width, it's hard to say anything about string widths without hard-coding special cases ourselves, which isn't really a good solution.

people consider wrapping text to be "simple" or nearly "trivial", so I hope to keep things fairly lean here.

One improvement we could make is to only build in support for text splitting with ICU. In theory this would reduce performance for simple cases, but it's not by much and really users should always be using internationalized algorithms anyway. What do you think?

@mgeisler
Copy link
Owner

mgeisler commented Dec 8, 2020

Just to clear up some confusion - I'm not suggesting that all users should be forced to use the modular and verbose API. I was planning on adding simple functions like the ones we have now on top of it that use sensible defaults. So I absolutely share the goal "to make it easy for 99% of the applications".

Definitely! I saw that you talked about putting a layer on top of the new building blocks you've created. This is great and necessary.

I've been trying to figure out what the problem is with the current building blocks? Why can users of textwrap not use the Fragments instead? I saw that @robinkrahl was able to wrap lines of text for a PDF library (see #126 (comment)), so that shows some promise to the current abstraction.

This is a good idea, the trouble is I don't know how to "adjust the widths suitably". Once an application doesn't use Unicode width, it's hard to say anything about string widths without hard-coding special cases ourselves, which isn't really a good solution.

Yeah, I also don't know how to know that the family emoji should be counted as taking up 2 columns or something else... However, I think that is okay: textwrap doesn't need to solve this problem in the first attempt. But we should enable other programs to solve it, if they want.

So I'm imagining that a more comprehensive program would have code like

let mut words = custom_way_of_finding_words(&some_text);
for word in words {
  if has_zwj_emoji(word) {
    adjust_width(&mut word);
  }
}
let wrapped_fragments = core::wrap_optimal_fit(&words, line_lengths);

(plus/minus mutability)

Does that goal make sense?

If so, I think a first next step would be to make it very simple for external programs to do what wrap does today. This means extracting the functionality of wrap into helper functions which we can put into core.

@mgeisler
Copy link
Owner

mgeisler commented Dec 8, 2020

One improvement we could make is to only build in support for text splitting with ICU. In theory this would reduce performance for simple cases, but it's not by much and really users should always be using internationalized algorithms anyway. What do you think?

I'm skeptical of the ICU dependency — simply because it's a dependency on a C library, which means that there are a number of new requirements, such a Clang being installed. However, it could still be a non-default dependency for people who need it and who has the C library installed on their system.

I see the splitting as the first step of a kind of "pipeline", as described here: textwrap::core. So I'm totally onboard with making it configurable. Does the Fragment approach not satisfy this?

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 8, 2020

Why can users of textwrap not use the Fragments instead?

Does the Fragment approach not satisfy this?

This API is also based on Fragments, the only difference is that it separates text splitting and text width finding via plain::Span and plain::Fragment. This means that users who want to change the width-finding function will not have to reimplement the text splitting functionality, they only have to replace the width finding layer.

But we should enable other programs to solve it, if they want.

This was the motivation behind the Width trait - users can implement the Width trait for their own type, and plug it in to the existing logic and everything should work. So your example would look roughly like:

struct UserWidth {
    inner: width::Unicode,
}
impl Width for UserWidth {
    fn width_char(&self, c: char) -> usize {
        self.inner.width_char(c)
    }
    fn width_str(&self, s: &str) -> usize {
        // Magic ZWJ stuff
    }
}

and then

core::wrap_optimal_fit(
    custom_way_of_finding_words(&some_text)
        .map(|s| s.width(UserWidth::new())),
    line_lengths,
)

However, it could still be a non-default dependency for people who need it

That's a good idea, I think a feature flag is the way to go, since there's never any case where you wouldn't want ICU to be used if you have it. Otherwise we should probably use unicode-linebreak, which is slightly better than splitting on spaces.

@mgeisler
Copy link
Owner

Does the Fragment approach not satisfy this?

This API is also based on Fragments, the only difference is that it separates text splitting and text width finding via plain::Span and plain::Fragment. This means that users who want to change the width-finding function will not have to reimplement the text splitting functionality, they only have to replace the width finding layer.

Okay, but we need to break the change down into smaller steps. Otherwise I cannot review it and we won't make forward progress :-) In other words, it's great that you've create a goal, but please try to start with the existing code.

My plan is still to basically rip out all the logic of wrap so that it only calls some helper functions. This way it becomes a trivial wrapper around the helper functions. In Rust pseudo code I imagine it would look like this:

fn wrap(text: &str, options: Opt) -> Vec<&str> {
    text.split('\n').flat_map(|line| wrap_single_line(line, options))
}

fn wrap_single_line(line: &str, options: Opt) -> Vec<&str> {
    let words = core::find_words(line);
    let split_words = core::split_words(words);
    let broken_wrods = core::break_words(split_words);
    let break_points = core::wrap_fragments(broken_words);
    core::wrap_line_by_break_points(line, break_points)
}

// in core
fn wrap_text_by_break_points(line: &str, wrapped_words: &[[Fragment]]) -> Vec<&str> {
    // The second half of wrap today:
    let mut idx = 0;
    for words in wrapped_words {
        // ...
    }
}

The problem with such an approach is probably that I'll run into all sorts of problems with borrowing: I'll need to pass the Options around and I also need to make sure that I have the original &str at hand at the right place so I can get a substring out of it.

Have you tried writing the full wrap function with your approach — and verified that it can borrow correctly from the input string? Last I tested it, there was a clear performance win from using the Cow<str> type as the result for each line (a win when there is no indentation so that we can borrow from the input). if you haven't already, then please see if this also works smoothly with your approach.

@mgeisler
Copy link
Owner

This was the motivation behind the Width trait - users can implement the Width trait for their own type, and plug it in to the existing logic and everything should work. So your example would look roughly like:

I'm not opposed to this at all — as I wrote somewhere above, the existing Word struct is actually measuring things twice and thus doing some extra work.

So we should perhaps let core::find_words return something else than Word, or let the measurement be delayed like you suggest. Both would be fine, I think, but please start attack the problem from that starting point.

Looking at names, I think we cannot have both Fragment and Span: it's not intuitively clear and obvious what the difference is between two such abstract entities. Does a Span have a width or is it Fragment? We could perhaps rename the Word struct to TextFragment so we have

  • Fragment: a trait for something which can be wrapped
  • TextFragment: a concrete type of Fragment, what I called Word before.

We could do the same with Span and TextSpan — I like the shorter name :-) I don't think anybody will start depending on the inner workings of the library anytime soon, so I'm fine with making a breaking release sometime in the new year.

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 11, 2020

Otherwise I cannot review it and we won't make forward progress :-) In other words, it's great that you've create a goal, but please try to start with the existing code.

Yes, I was planning on doing so - this was mostly an experiment so I could better understand the problem space and spark discussion. Should have made that more clear in the original PR, sorry.

Have you tried writing the full wrap function with your approach — and verified that it can borrow correctly from the input string?

I have not - the next thing I'll do is to add the iterator-based wrapping functions to the library I think, and I'll see how it goes.

let the measurement be delayed like you suggest

The big advantage of delaying measurement is allowing for chaining of multiple splitting functions (e.g. word splitting -> hyphenation splitting) without recalculating widths.

We could do the same with Span and TextSpan — I like the shorter name :-)

As in, have a Span trait? Or rename Span to TextSpan? I'm fine with all of text::Fragment, TextFragment, Span and TextSpan for the struct names, whichever you think is clearer :).

Something I wanted to discuss is the use of configuation (Options); if wrap is trivialised to the pseudocode you wrote, then is there much benefit for users to use options as opposed to simply writing out the steps manually?

@mgeisler
Copy link
Owner

Otherwise I cannot review it and we won't make forward progress :-) In other words, it's great that you've create a goal, but please try to start with the existing code.

Yes, I was planning on doing so - this was mostly an experiment so I could better understand the problem space and spark discussion. Should have made that more clear in the original PR, sorry.

Awesome, thanks! It is definitely valuable to explore the design, so thanks for putting up the experiment.

I've been playing with the ideas in #221 for about two years now — it was just a background thing that I tinkered with when I found some free time. Originally, the code would step through the input string char by char and keep track of everything in a bunch of local variables. It was a mess...and various bugs kept cropping up over time 😬

I had a hunch that it should be possible to simplify the code while also making it more flexible. So I'm quite happy with the new "pipeline" and with the reformulation in terms of Words or Fragments or whatever we should call them.

For the user of textwrap, nothing much changed with #221, but that's where great decoupling ideas like you have come into the picture 😄

let the measurement be delayed like you suggest

The big advantage of delaying measurement is allowing for chaining of multiple splitting functions (e.g. word splitting -> hyphenation splitting) without recalculating widths.

Yeah, that is indeed cool! Definitely something we need.

We could do the same with Span and TextSpan — I like the shorter name :-)

As in, have a Span trait? Or rename Span to TextSpan? I'm fine with all of text::Fragment, TextFragment, Span and TextSpan for the struct names, whichever you think is clearer :).

Okay, cool! If at all possible, then I would prefer to have just one abstract term. That is, we can have a trait named Fragment, Segment, Span or something similar. I tried looking at the definitions of those words:

  • Fragment: a part broken off, detached, or incomplete.

  • Segment: a separate piece of something or one of the constituent parts into which a body, entity, or quantity is divided or marked off by or as if by natural boundaries.

  • Span: an extent, stretch, reach, or spread between two limits: such as a: a limited space (as of time), the spread or extent between abutments or supports (as of a bridge).

Based on these definitions, as in particular the synonym discussion for fragment, I think I like fragment more than span. I guess the word span came to mind because I've seen it used to describe regions of Rust code. It's also HTML where the <span> element is used to mark regions of text.

This would imply renaming Word to TextFragment, or perhaps have a text module with a Fragment struct which implements core::Fragment. I find that a bit confusing, though, since the two names are the same.

Something I wanted to discuss is the use of configuation (Options); if wrap is trivialised to the pseudocode you wrote, then is there much benefit for users to use options as opposed to simply writing out the steps manually?

No, in some sense there isn't — the steps would be simple and users could have written them themselves. However, from looking at the reverse dependencies, I conclude that 99% of all users simply want to call textwrap::fill(some_string, some_width) or textwrap::wrap(some_string, some_width). It seems very rare that people actually customize the wrapping behavior. So from that perspective, the top-level function provides value by giving people a good default. It's less maintenance for people to call the functions over having to understand anything about fragments and whatnot.

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 12, 2020

I think I like fragment more than span.

Me too; I think Span has also has connotations of text (<span>, Rust's spans) whereas Fragment is more generic.

I find that a bit confusing, though, since the two names are the same.

Ok, in that case we'll use TextFragment; but we also do need a name for a bit of text without a width, and I think Span is a good name for it, since it has the aforementioned text connotations and doesn't include the word Fragment (which implies it would implement Fragment).

So from that perspective, the top-level function provides value by giving people a good default.

Agreed, we should absolute keep the top-level functions; I was wondering whether it is better to just take a usize instead of impl Into<Options>. Since 99% of users are just using a usize and those that aren't can use the modular API, Options might not be necessary at all.

@mgeisler
Copy link
Owner

Ok, in that case we'll use TextFragment; but we also do need a name for a bit of text without a width, and I think Span is a good name for it, since it has the aforementioned text connotations and doesn't include the word Fragment (which implies it would implement Fragment).

Ahh.... that's why you need two types: one for (content, glue, penalty) and one for the same but with a cached width? And you need two types because you want to cache the width? Otherwise you could just compute the width on the fly — wrap_first_fit and wrap_optimal_fit both compute the width of each Fragment once. However, the break_words function also computes the width to see if it should forcibly break the word.

I don't think struct Span fits in well with struct TextFragment. Perhaps we could use struct UnmeasuredText as a name for a type which simply holds content, glue, and whitespace? We could even go with struct Text.

The Breaking Paragraphs into Lines paper talks about "items" as the generic term for something that is put into a line. These items can then either be a box, glue, or whitespace. However, I think Fragment is a better term than Item — the latter is too generic.

So from that perspective, the top-level function provides value by giving people a good default.

Agreed, we should absolute keep the top-level functions; I was wondering whether it is better to just take a usize instead of impl Into<Options>. Since 99% of users are just using a usize and those that aren't can use the modular API, Options might not be necessary at all.

I think Options is useful because it gives people a central overview of the configuration options available to them when using fill or wrap. Otherwise this knowledge would be spread across the individual building-block functions that wrap calls. As an example, toggling break_words would require a user to look at the implementation of wrap, notice that it calls a core::break_words function and then decide to not call this function.

So for me, Options gives people some ready-made recipies. For example, it allows for indentation of the first or subsequent lines. One could easily use the functions in core to indent every odd line differently. But this is not supported via Options and I think that is fine. Similarly, the wrapping functions can in principle use different line widths for every line — so you can shape text into triangles, circles and whatnot. I've been playing with that and it's a lot of fun :-) But it's also not exposed and I'll probably simply end up adding a demo to the examples/ folder.

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 13, 2020

one for (content, glue, penalty) and one for the same but with a cached width?

Yes. If the wrapping functions take in iterators and the splitting function outputs are passed directly into the line wrapping inputs, the widths will be lazily computed on the fly anyway, but conceptually caching them will allow users that have static (unchanging) text to only compute the width of each fragment once.

However, the break_words function also computes the width to see if it should forcibly break the word.

I actually want to move the force-breaking into the line wrapping functions themselves eventually, so it would only have to be calculated once. Another advantage of having the force breaking in the wrapping functions is that we don't have to pass around the line widths everywhere - they only have to be managed by one part, the line wrappers.

Perhaps we could use struct UnmeasuredText as a name for a type which simply holds content, glue, and whitespace? We could even go with struct Text.

I don't really like UnmeasuredText or Text; UnmeasuredText is very long and it's not quite clear what "measure" means, and they both use the word Text which implies it's a complete string of text rather than a small range of it.

However, I think Fragment is a better term than Item — the latter is too generic.

Agreed, and our fragments are not the same as their item; our fragments are a combination of a box, glue and penalty item. As a side note this means that we don't support multiple glue items in a row, which the paper does support, but I don't think it's that important since we also do box-breaking based on words rather than characters.

I think Options is useful because it gives people a central overview of the configuration options available to them when using fill or wrap.

I think that this purpose is better served by documentation and examples; but if you think Options is worth keeping that's probably a better idea.

I'm currently reading the line breaking paper, so once I'm done with that I'll start working on actual code again.

@mgeisler
Copy link
Owner

one for (content, glue, penalty) and one for the same but with a cached width?

Yes. If the wrapping functions take in iterators and the splitting function outputs are passed directly into the line wrapping inputs, the widths will be lazily computed on the fly anyway, but conceptually caching them will allow users that have static (unchanging) text to only compute the width of each fragment once.

That's a very good point!

Perhaps we could use struct UnmeasuredText as a name for a type which simply holds content, glue, and whitespace? We could even go with struct Text.

I don't really like UnmeasuredText or Text; UnmeasuredText is very long and it's not quite clear what "measure" means, and they both use the word Text which implies it's a complete string of text rather than a small range of it.

Hmm... yeah, I guess you're right 😄 Let's revisit this later, perhaps Span is good enough after all.

Agreed, and our fragments are not the same as their item; our fragments are a combination of a box, glue and penalty item. As a side note this means that we don't support multiple glue items in a row, which the paper does support, but I don't think it's that important since we also do box-breaking based on words rather than characters.

Yeah! I'm quite happy to have a simpler model t han the full model in the paper. I've seen people do all sorts of tricks in (La)TeX using the full expressive power of glue and penalties — I would prefer not to see such tricks again if I can avoid it.

So the paper can serve as inspiration, but it's a non-goal to implement the exact same algorithm.

I'm currently reading the line breaking paper, so once I'm done with that I'll start working on actual code again.

Sounds good 👍

@Kestrer
Copy link
Contributor Author

Kestrer commented Dec 13, 2020

I've been thinking a lot about whether to favour consistency or efficiency/flexibility. This comes up for example in the return types of wrap_first_fit and wrap_optimal_fit:

  • If they both return Vecs, wrap_first_fit will always cause an allocation even when one isn't necessary. This also eliminates the possibility of using it in no-std.
  • If they both return iterators, and a user calling wrap_optimal_fit wants a Vec they'll have to .collect() the result of wrap_optimal_fit even though it is already a vector, causing an unnecessary allocation. This approach does avoid .reverse()ing the vec if we do .into_iter().rev() so maybe it's a good idea anyway? Still, collecting a reversed vector iterator is slower than reversing an existing vector.
  • If wrap_first_fit returns an iterator but wrap_optimal_fit returns a Vec that's inconsistent and could be confusing.

Similar problems occur in other places too (e.g. wrap_first_fit can take the more flexible iterator of line lengths whereas wrap_optimal_fit must take the less flexible random-access line lengths with Fn(usize) -> usize).

Do you have any thoughts on how to solve this? Do you mind having a slightly inconsistent API if it allows for optimal efficiency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants