Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] text.wrap now returns non-empty list for one character strings #400

Merged
merged 2 commits into from
Jan 5, 2022

Conversation

Sygmei
Copy link
Contributor

@Sygmei Sygmei commented Dec 28, 2021

text.wrap returns empty list for one-character strings, this PR fixes this by using <= instead of < for the iterator

Copy link
Member

@alerque alerque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to contribute!

This looks reasonable, but can you clarify ... if it was returning the wrong result for one character strings before doesn't that imply it was returning the wrong result all along for all lengths because it was failing to evaluate the last position? Or does the nxt shenanigans mean that isn't true? If the latter, isn't this now looping one too many times for other cases? Perhaps a do ... until loop be more appropriate here.

Also how can we test this so that it is demonstrated to work as expected and we don't break it again?

@Sygmei
Copy link
Contributor Author

Sygmei commented Dec 29, 2021

@alerque sorry for yesterday's incomplete PR

I added more content :

  • Unit tests that checks that it works with old behaviour
  • New unit tests that tests for mono-letters and accented characters
  • Fixed old unit tests that did not respect the width limit (for wrap / fill)
  • Changed while <= to repeat until as recommended
  • wrap now uses %s as a "next word search" since old %W did not work with accented characters

@@ -46,18 +46,19 @@ batteries. That is because
the goal of Lua is to
produce a lean expressive
language that will be
used on all sorts of machines,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length of used on all sorts of = 20
should have jumped instead of putting machines, on same line

@Tieske
Copy link
Member

Tieske commented Dec 30, 2021

This thing has more problems, and I think the proposed solution is just shifting them (didn't test).

Small test;

local ww = require("pl.text").wrap

local t = [[
hello "world" 'this' -is- a bb ccc dddd test... but wouldn't it pass??? final. word-that-can-be-broken
]]

for _, v in ipairs(ww(t,3)) do print(v) end

Output (without the changes in this PR):

hello
"world"
'this'
-is-
a bb              --> 'bb' on the next line?
ccc dddd          --> 'dddd' on the next line?
test.             --> three dots on this line
.. but
wouldn'           --> the issue @Sygmei is fixing
t it
pass?             --> three ? on this line
?? final.
word-
that-
can-be-           --> broken into 'can-' and 'be-'
broken

@Tieske
Copy link
Member

Tieske commented Dec 30, 2021

this does remind me of this: #141

@Sygmei
Copy link
Contributor Author

Sygmei commented Dec 30, 2021

We can change the separator but %w and %W clearly are not ideal solutions
User might want to keep hyphened-words together as well ?

@alerque
Copy link
Member

alerque commented Dec 30, 2021

I agree with @Tieske that there are lots of problems and this is just shifting them, but I'm inclined to think that shifting them has some benefit. As long as we have a naive implementation the closer it works to what it says on the tin the better.

On the other hand I have lots of experience with typesetting and wrapping text context is such a hopelessly complex task I do think this naive implementation needs a disclaimer. In fact it should have been called naive_wrap() or naive_chunk() or something like that.

If we wanted to actually fix this we would have to hook into Unicode data –perhaps by using ICU– to actually determine what were word characters and where breaks are allowed in a locale dependent way. There really isn't any way to fake this and get it right more than a small percentage of the time. Given that we don't want to introduce external dependencies, I'm not sure there is anything for it but to just keep faking it or deprecate this entirely. Which brings us back to "we might as well fake it as best we can".

@Sygmei
Copy link
Contributor Author

Sygmei commented Dec 31, 2021

One advantage on splitting on space characters only is that the result is usually closer to what the user might expect.
Splitting on %w breaks unicode characters whereas breaking on space might make the line a little longer than expected (better downside imho)

@Tieske
Copy link
Member

Tieske commented Jan 3, 2022

I rebased and force-pushed your branch @Sygmei

Added the test from a previous comment as well.

@Tieske
Copy link
Member

Tieske commented Jan 3, 2022

The only problem left imo is that the function takes the width parameter, and treats it as the minimum. Whereas user expectations probably are that it would be the maximum width.

@Sygmei
Copy link
Contributor Author

Sygmei commented Jan 3, 2022

I did not change this behaviour since it was clearly documented :
"It will not break long words, so lines can be over the length to that extent."
Should that behaviour change ?

@Tieske
Copy link
Member

Tieske commented Jan 3, 2022

I read that in a way that words longer than the limit would extend beyond it (they would not be forcefully broken). But that's different from the current behaviour of a minimum length.

Though that might as well be a different pr.

Opinions @alerque ?

@Tieske
Copy link
Member

Tieske commented Jan 4, 2022

Added a commit that will honour the width given, and adds an option to forcefully break words longer than width.
This is what (I think) a user would expect from these functions.

It was harder than anticipated actually, it has soo many edge cases... added tests for all of them

@Tieske Tieske merged commit 70eb8f6 into lunarmodules:master Jan 5, 2022
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Feb 13, 2022
## 1.12.0 (2022-Jan-10)
 - deprecate: module `pl.text` the contents have moved to `pl.stringx` (removal later)
   [#407](lunarmodules/Penlight#407)
 - deprecate: module `pl.xml`, please switch to a more specialized library (removal later)
   [#409](lunarmodules/Penlight#409)
 - feat: `utils.npairs` added. An iterator with a range that honours the `n` field
   [#387](lunarmodules/Penlight#387)
 - fix: `xml.maptags` would hang if it encountered text-nodes
   [#396](lunarmodules/Penlight#396)
 - fix: `text.dedent` didn't handle declining indents nor empty lines
   [#402](lunarmodules/Penlight#402)
 - fix: `dir.getfiles`, `dir.getdirectories`, and `dir.getallfiles` now have the
   directory optional, as was already documented
   [#405](lunarmodules/Penlight#405)
 - feat: `array2d.default_range` now also takes a spreadsheet range, which means
   also other functions now take a range. [#404](lunarmodules/Penlight#404)
 - fix: `lapp` enums allow [patterns magic characters](https://www.lua.org/pil/20.2.html)
   [#393](lunarmodules/Penlight#393)
 - fix: `text.wrap` and `text.fill` numerous fixes for handling whitespace,
   accented characters, honouring width, etc.
   [#400](lunarmodules/Penlight#400)
 - feat: `text.wrap` and `text.fill` have a new parameter to forcefully break words
   longer than the width given.
   [#400](lunarmodules/Penlight#400)
 - fix: `stringx.expandtabs` could error out on Lua 5.3+
   [#406](lunarmodules/Penlight#406)
 - fix: `pl` the module would not properly forward the `newindex` metamethod
   on the global table.
   [#395](lunarmodules/Penlight#395)
 - feat: `utils.enum` added to create enums and prevent magic strings
   [#408](lunarmodules/Penlight#408)
 - change: `xml.new` added some sanity checks on input
   [#397](lunarmodules/Penlight#397)
 - added: `xml.xml_escape` and `xml.xml_unescape` functions (previously private)
   [#397](lunarmodules/Penlight#397)
 - feat: `xml.tostring` now also takes numeric indents (previously only strings)
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.walk` now detects recursion (errors out)
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.clone` now detects recursion (errors out)
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.compare` now detects recursion (errors out)
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.compare` text compares now work
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.compare` attribute order compares now only compare if both inputs provide an order
   [#397](lunarmodules/Penlight#397)
 - fix: `xml.compare` child comparisons failing now report proper error
   [#397](lunarmodules/Penlight#397)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants