Modularize library #244
Hi @Koxiaet, |
I think we would need to discuss the problems with the current design first; once we agree on what those are, we can look for solutions to them. You mention above that this PR brings
Can this not be done today by preparing Right now,
pub fn wrap<'a, S, Opt>(text: &str, options: Opt) -> Vec<Cow<'_, str>>
where
S: WordSplitter,
Opt: Into<Options<'a, S>>,
{
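    let options = options.into();
    text.split('\n')
        .flat_map(|line| {
            let words = core::find_word_and_measure_unicode_width(line, &options);
            // Return the wrapped lines so that flat_map can flatten them.
            core::wrap_fragments(words, &options)
        })
        .collect()
}
That is, make |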
Right, we cannot actually know how the characters are displayed at the end of the day. Applications which know more than what My goal was to make it easy for 99% of the applications to call
for line in textwrap::wrap(&some_text, some_width) {
// do something with the line of text
}
while also making it possible for more advanced programs to do something like this:
let fragments = custom_way_of_finding_words(&some_text);
let wrapped_fragments = core::wrap_optimal_fit(&fragments, line_lengths);
for line in wrapped_fragments {
// Use the fragments in line to refer back to font and styling information
// so you can render them in your UI or put them into a PDF, or...
}
Perhaps an advanced program only needs to check if |
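The two-step flow described above (find and measure words, then wrap the measured fragments) can be sketched as follows. This is only an illustration under stated assumptions: `Fragment`, `find_words` and `wrap_first_fit` are hypothetical names, not textwrap's API. Widths are plain `char` counts, and a simple greedy first-fit pass stands in for `core::wrap_optimal_fit`.

```rust
/// Hypothetical fragment type: a word plus the width the caller measured.
#[derive(Debug, Clone, PartialEq)]
struct Fragment<'a> {
    text: &'a str,
    width: usize,
}

/// Stand-in for an application-specific word finder. It splits on
/// whitespace and uses the number of chars as the width (an assumption;
/// a real program would plug in its own measurement here).
fn find_words(text: &str) -> Vec<Fragment<'_>> {
    text.split_whitespace()
        .map(|w| Fragment { text: w, width: w.chars().count() })
        .collect()
}

/// Greedy first-fit wrapping over pre-measured fragments. One column is
/// charged for the space between two fragments on the same line. A fragment
/// wider than `line_width` is placed on a line of its own.
fn wrap_first_fit<'a, 'b>(
    fragments: &'b [Fragment<'a>],
    line_width: usize,
) -> Vec<&'b [Fragment<'a>]> {
    let mut lines = Vec::new();
    let mut start = 0; // index of the first fragment on the current line
    let mut used = 0; // columns used on the current line
    for (i, frag) in fragments.iter().enumerate() {
        let needed = if i == start { frag.width } else { used + 1 + frag.width };
        if i > start && needed > line_width {
            lines.push(&fragments[start..i]);
            start = i;
            used = frag.width;
        } else {
            used = needed;
        }
    }
    if start < fragments.len() {
        lines.push(&fragments[start..]);
    }
    lines
}
```

Because each returned line is a slice of the caller's own fragments, the caller can map every line back to whatever font or styling data it keeps alongside them.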
The decoupling of
The In any case, I look forward to seeing some benchmark results when you get that hooked up. I don't know precisely how fast textwrap needs to be, but my feeling is that people consider wrapping text to be "simple" or nearly "trivial", so I hope to keep things fairly lean here. The functions in |
Just to clear up some confusion - I'm not suggesting that all users should be forced to use the modular and verbose API. I was planning on adding simple functions like the ones we have now on top of it that use sensible defaults. So I absolutely share the goal "to make it easy for 99% of the applications".
This is a good idea, the trouble is I don't know how to "adjust the widths suitably". Once an application doesn't use Unicode width, it's hard to say anything about string widths without hard-coding special cases ourselves, which isn't really a good solution.
One improvement we could make is to only build in support for text splitting with ICU. In theory this would reduce performance for simple cases, but it's not by much and really users should always be using internationalized algorithms anyway. What do you think? |
Definitely! I saw that you talked about putting a layer on top of the new building blocks you've created. This is great and necessary. I've been trying to figure out what the problem is with the current building blocks? Why can users of textwrap not use the
Yeah, I also don't know how to know that the family emoji should be counted as taking up 2 columns or something else... However, I think that is okay: textwrap doesn't need to solve this problem in the first attempt. But we should enable other programs to solve it, if they want. So I'm imagining that a more comprehensive program would have code like
let mut words = custom_way_of_finding_words(&some_text);
for word in words {
if has_zwj_emoji(word) {
adjust_width(&mut word);
}
}
let wrapped_fragments = core::wrap_optimal_fit(&words, line_lengths);
(plus/minus mutability) Does that goal make sense? If so, I think a first next step would be to make it very simple for external programs to do what |
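A minimal sketch of the "measure first, then let the application correct the widths" idea from the comment above, with the mutability spelled out. `Word`, `measure_words` and `adjust_widths` are hypothetical names. The initial width is a plain `char` count, and the choice of 2 columns for a ZWJ sequence is an assumption made for illustration only.

```rust
/// U+200D ZERO WIDTH JOINER, used to glue emoji into one glyph.
const ZWJ: char = '\u{200D}';

/// Hypothetical measured word.
#[derive(Debug)]
struct Word<'a> {
    text: &'a str,
    width: usize,
}

/// Naive first pass: split on whitespace and count chars as columns.
fn measure_words(text: &str) -> Vec<Word<'_>> {
    text.split_whitespace()
        .map(|w| Word { text: w, width: w.chars().count() })
        .collect()
}

fn has_zwj_emoji(word: &str) -> bool {
    word.contains(ZWJ)
}

/// Application-specific correction pass. The words are borrowed mutably,
/// so they are still available for wrapping afterwards.
fn adjust_widths(words: &mut [Word<'_>]) {
    for word in words.iter_mut() {
        if has_zwj_emoji(word.text) {
            // Assumption: the terminal renders the sequence in 2 columns.
            word.width = 2;
        }
    }
}
```

After `adjust_widths` runs, the same vector can be handed to whichever wrapping function the program uses.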
I'm skeptical of the ICU dependency, simply because it's a dependency on a C library, which means that there are a number of new requirements, such as Clang being installed. However, it could still be a non-default dependency for people who need it and who have the C library installed on their system. I see the splitting as the first step of a kind of "pipeline", as described here: |
This API is also based on
This was the motivation behind the
struct UserWidth {
inner: width::Unicode,
}
impl Width for UserWidth {
fn width_char(&self, c: char) -> usize {
self.inner.width_char(c)
}
fn width_str(&self, s: &str) -> usize {
// Magic ZWJ stuff
}
}
and then
core::wrap_optimal_fit(
custom_way_of_finding_words(&some_text)
.map(|s| s.width(UserWidth::new())),
line_lengths,
)
That's a good idea, I think a feature flag is the way to go, since there's never any case where you wouldn't want ICU to be used if you have it. Otherwise we should probably use |
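The `UserWidth` wrapper shown earlier in this comment can be fleshed out into a compilable sketch. Everything here is an assumption for illustration: the `Width` trait is re-declared locally, `OneColumn` is a stand-in for `width::Unicode` that counts one column per non-control char, and the ZWJ rule (a joined sequence takes the width of its first char) is just one possible policy.

```rust
/// Hypothetical width-measuring trait, modelled on the one discussed above.
trait Width {
    fn width_char(&self, c: char) -> usize;

    /// Default: a string is as wide as the sum of its chars.
    fn width_str(&self, s: &str) -> usize {
        s.chars().map(|c| self.width_char(c)).sum()
    }
}

/// Stand-in for a Unicode-aware measurer: one column per non-control char.
struct OneColumn;

impl Width for OneColumn {
    fn width_char(&self, c: char) -> usize {
        if c.is_control() { 0 } else { 1 }
    }
}

/// Wrapper that delegates per-char widths but overrides string widths so
/// that a ZWJ-joined sequence counts only its first char.
struct ZwjAware<W> {
    inner: W,
}

impl<W: Width> Width for ZwjAware<W> {
    fn width_char(&self, c: char) -> usize {
        self.inner.width_char(c)
    }

    fn width_str(&self, s: &str) -> usize {
        let mut total = 0;
        let mut joined = false; // true right after a ZWJ
        for c in s.chars() {
            if c == '\u{200D}' {
                joined = true; // the joiner itself takes no space
                continue;
            }
            if joined {
                joined = false; // char glued to the previous one: skip it
                continue;
            }
            total += self.inner.width_char(c);
        }
        total
    }
}
```

The wrapper is generic over the inner measurer, so the same override would work on top of any other `Width` implementation.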
Okay, but we need to break the change down into smaller steps. Otherwise I cannot review it and we won't make forward progress :-) In other words, it's great that you've created a goal, but please try to start with the existing code. My plan is still to basically rip out all the logic of
fn wrap(text: &str, options: Opt) -> Vec<&str> {
text.split('\n').flat_map(|line| wrap_single_line(line, options))
}
fn wrap_single_line(line: &str, options: Opt) -> Vec<&str> {
let words = core::find_words(line);
let split_words = core::split_words(words);
let broken_words = core::break_words(split_words);
let break_points = core::wrap_fragments(broken_words);
core::wrap_line_by_break_points(line, break_points)
}
// in core
fn wrap_line_by_break_points(line: &str, wrapped_words: &[&[Fragment]]) -> Vec<&str> {
// The second half of wrap today:
let mut idx = 0;
for words in wrapped_words {
// ...
}
}
The problem with such an approach is probably that I'll run into all sorts of problems with borrowing: I'll need to pass the Have you tried writing the full |
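One way to sidestep the borrowing concern in the sketch above is to carry byte ranges instead of borrowed fragments through the pipeline and slice the original line only at the end. The following is a minimal illustration under stated assumptions: the function names mirror the pseudocode but are hypothetical, widths are `char` counts, and a greedy pass replaces the real wrapping algorithm.

```rust
/// Byte ranges of the whitespace-separated words in `line`.
fn find_words(line: &str) -> Vec<(usize, usize)> {
    let mut words = Vec::new();
    let mut start = None;
    for (idx, ch) in line.char_indices() {
        match (ch.is_whitespace(), start) {
            (false, None) => start = Some(idx),
            (true, Some(s)) => {
                words.push((s, idx));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        words.push((s, line.len()));
    }
    words
}

/// Greedily wraps one line and returns slices borrowed from `line`.
/// The width of a candidate line is the char count of the slice, which
/// includes the original whitespace between its words.
fn wrap_single_line(line: &str, width: usize) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut current: Option<(usize, usize)> = None; // range of the line being built
    for (start, end) in find_words(line) {
        current = match current {
            None => Some((start, end)),
            Some((line_start, _)) if line[line_start..end].chars().count() <= width => {
                Some((line_start, end))
            }
            Some((line_start, line_end)) => {
                lines.push(&line[line_start..line_end]);
                Some((start, end))
            }
        };
    }
    match current {
        Some((s, e)) => lines.push(&line[s..e]),
        None => lines.push(""), // keep empty input lines
    }
    lines
}

/// Splits on '\n' and wraps every line independently.
fn wrap(text: &str, width: usize) -> Vec<&str> {
    text.split('\n')
        .flat_map(|line| wrap_single_line(line, width))
        .collect()
}
```

Since only `usize` pairs move between the steps, every returned `&str` borrows directly from the input text and no intermediate allocation of strings is needed.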
I'm not opposed to this at all — as I wrote somewhere above, the existing So we should perhaps let Looking at names, I think we cannot have both
We could do the same with |
Yes, I was planning on doing so - this was mostly an experiment so I could better understand the problem space and spark discussion. Should have made that more clear in the original PR, sorry.
I have not - the next thing I'll do is to add the iterator-based wrapping functions to the library I think, and I'll see how it goes.
The big advantage of delaying measurement is allowing for chaining of multiple splitting functions (e.g. word splitting -> hyphenation splitting) without recalculating widths.
As in, have a Span trait? Or rename Span to TextSpan? I'm fine with all of Something I wanted to discuss is the use of configuration (Options); if |
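The point about delaying measurement can be illustrated with plain iterator adapters: the splitting stages pass string slices along, and widths are computed once, on the final fragments. This is a sketch under stated assumptions; `split_words`, `split_hyphens` and `measure` are hypothetical names, splitting at existing hyphens stands in for real hyphenation, and widths are `char` counts.

```rust
/// Stage 1: split text into words. Nothing is measured here.
fn split_words<'a>(text: &'a str) -> impl Iterator<Item = &'a str> {
    text.split_whitespace()
}

/// Stage 2: split words further after each existing hyphen, keeping the
/// hyphen attached to the left-hand piece. Still nothing is measured.
fn split_hyphens<'a>(
    words: impl Iterator<Item = &'a str>,
) -> impl Iterator<Item = &'a str> {
    words.flat_map(|w| w.split_inclusive('-'))
}

/// Final stage: measure each fragment exactly once.
fn measure<'a>(
    fragments: impl Iterator<Item = &'a str>,
) -> impl Iterator<Item = (&'a str, usize)> {
    fragments.map(|f| (f, f.chars().count()))
}
```

Chained as `measure(split_hyphens(split_words(text)))`, the pipeline stays lazy, and adding another splitting stage does not cause any width to be recalculated.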
Awesome, thanks! It is definitely valuable to explore the design, so thanks for putting up the experiment. I've been playing with the ideas in #221 for about two years now — it was just a background thing that I tinkered with when I found some free time. Originally, the code would step through the input string I had a hunch that it should be possible to simplify the code while also making it more flexible. So I'm quite happy with the new "pipeline" and with the reformulation in terms of For the user of textwrap, nothing much changed with #221, but that's where great decoupling ideas like you have come into the picture 😄
Yeah, that is indeed cool! Definitely something we need.
Okay, cool! If at all possible, then I would prefer to have just one abstract term. That is, we can have a trait named
Based on these definitions, and in particular the synonym discussion for fragment, I think I like fragment more than span. I guess the word span came to mind because I've seen it used to describe regions of Rust code. It's also HTML where the This would imply renaming
No, in some sense there isn't — the steps would be simple and users could have written them themselves. However, from looking at the reverse dependencies, I conclude that 99% of all users simply want to call |
Me too; I think
Ok, in that case we'll use
Agreed, we should absolutely keep the top-level functions; I was wondering whether it is better to just take a |
Ahh.... that's why you need two types: one for I don't think The Breaking Paragraphs into Lines paper talks about "items" as the generic term for something that is put into a line. These items can then either be a box, glue, or whitespace. However, I think
I think So for me, |
Yes. If the wrapping functions take in iterators and the splitting function outputs are passed directly into the line wrapping inputs, the widths will be lazily computed on the fly anyway, but conceptually caching them will allow users that have static (unchanging) text to only compute the width of each fragment once.
I actually want to move the force-breaking into the line wrapping functions themselves eventually, so it would only have to be calculated once. Another advantage of having the force breaking in the wrapping functions is that we don't have to pass around the line widths everywhere - they only have to be managed by one part, the line wrappers.
I don't really like
Agreed, and our fragments are not the same as their item; our fragments are a combination of a box, glue and penalty item. As a side note this means that we don't support multiple glue items in a row, which the paper does support, but I don't think it's that important since we also do box-breaking based on words rather than characters.
I think that this purpose is better served by documentation and examples; but if you think I'm currently reading the line breaking paper, so once I'm done with that I'll start working on actual code again. |
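The caching argument made earlier in this comment (static text only needs each fragment measured once) can be sketched like this. All names are hypothetical, and the line counter is a simple greedy pass used only to show that re-wrapping at a new width reuses the stored widths.

```rust
/// Hypothetical fragment whose width was computed once and stored.
struct Measured<'a> {
    text: &'a str,
    width: usize,
}

/// Measures every whitespace-separated word exactly once using the
/// caller-supplied `measure` function.
fn measure_all<'a, F: Fn(&str) -> usize>(text: &'a str, measure: F) -> Vec<Measured<'a>> {
    text.split_whitespace()
        .map(|w| Measured { text: w, width: measure(w) })
        .collect()
}

/// Counts the lines a greedy wrap would produce at `line_width`, reading
/// only the cached widths. One column is charged per inter-word space.
fn count_lines(words: &[Measured<'_>], line_width: usize) -> usize {
    let mut lines = 0;
    let mut used = 0;
    for word in words {
        if lines == 0 || used + 1 + word.width > line_width {
            lines += 1;
            used = word.width;
        } else {
            used += 1 + word.width;
        }
    }
    lines
}
```

A UI that re-wraps on every resize could call `measure_all` once and then call the wrapping step repeatedly with different widths.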
That's a very good point!
Hmm... yeah, I guess you're right 😄 Let's revisit this later, perhaps
Yeah! I'm quite happy to have a simpler model than the full model in the paper. I've seen people do all sorts of tricks in (La)TeX using the full expressive power of glue and penalties, and I would prefer not to see such tricks again if I can avoid it. So the paper can serve as inspiration, but it's a non-goal to implement the exact same algorithm.
Sounds good 👍 |
I've been thinking a lot about whether to favour consistency or efficiency/flexibility. This comes up for example in the return types of
Similar problems occur in other places too (e.g. Do you have any thoughts on how to solve this? Do you mind having a slightly inconsistent API if it allows for optimal efficiency? |
First of all, apologies for the enormous PR - I initially tried to do incremental changes but it became clear that a rewrite from the ground up was required, so here we are.
This rewrite had several motivations that explain many of the design decisions:
unicode-width
.unicode-width
would report. This point ties into the last one: allowing swapping out unicode-width for other functions.
no_std in the future; none of the main algorithms currently allocate or use any std features.
To understand the precise changes that were made, it's probably best to just open it in Rustdoc.
This is far from complete. I still have to re-add hyphenation support, tests, benchmarks, examples, shorthand functions et cetera, but the base functionality is there. By the way, I'm not expecting this API to be added as-is; this PR just represents one extreme of how modular this library could be. We could compromise between this and the current one.