Improve the text tokenizer #1885

vkbo · 2024-05-21T20:55:02Z

Summary:

This PR makes the following improvements to the Tokenizer class:

Paragraphs consisting of multiple lines are now combined in the tokenizer instead of being passed through to the converter classes. This means that each token list entry with the block format T_TEXT is a whole paragraph. A setting flag determines whether line breaks are preserved or replaced with a single space. Each format class must ensure that any line breaks in this text are handled correctly.
T_EMPTY token entries are no longer passed on to the format classes. They are now stripped in the second pass over the token list, when paragraph lines are combined.
Any non-text block is now considered a paragraph separator, in addition to an empty line. This makes more sense in corner cases where, for instance, two text lines are separated by a heading but no empty lines. Previously, the heading would be processed fine, but the surrounding text would be combined into a paragraph after the heading. This is no longer the case.
All of this is achieved by extending the second pass over the token list, which can look both behind and ahead since the first pass has already processed the text. Since there was already a second pass, there is little additional cost aside from some extra memory use. Performance-wise, the work now done in the tokenizer was previously done in each of the format classes, so there should be little difference there as well.
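
The second-pass behaviour described above can be illustrated with a minimal sketch. It is not the code from this PR: the `Block` enum, the `(block, text)` token shape, and the `combine_paragraphs` function are hypothetical stand-ins, and a simple line buffer is used in place of explicit look behind and look ahead.

```python
from enum import Enum, auto


class Block(Enum):
    # Hypothetical stand-ins for the tokenizer's block types
    EMPTY = auto()
    HEAD = auto()
    TEXT = auto()


def combine_paragraphs(tokens, keep_breaks=False):
    """Merge adjacent TEXT tokens into paragraphs and drop EMPTY tokens.

    ``tokens`` is a list of ``(block, text)`` tuples (an assumed shape).
    Any non-TEXT block, including EMPTY, ends the current paragraph.
    """
    sep = "\n" if keep_breaks else " "
    result = []
    lines = []  # text lines of the paragraph currently being collected

    def flush():
        # Emit the collected lines as one paragraph token
        if lines:
            result.append((Block.TEXT, sep.join(lines)))
            lines.clear()

    for block, text in tokens:
        if block is Block.TEXT:
            lines.append(text)
        else:
            flush()
            if block is not Block.EMPTY:
                # Headings and other blocks pass through unchanged
                result.append((block, text))
    flush()
    return result
```

With this sketch, two text lines separated only by a heading stay as two paragraphs on either side of the heading, and no empty tokens reach the caller.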

This text line combining logic is identical to the one that was used in the ODT writer class, with a single optimisation: the recalculation of formatting positions is skipped if the paragraph contains a single line.
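
As a rough illustration of the position recalculation and the single-line shortcut, the sketch below assumes each line carries a list of `(position, code)` format markers relative to that line. The `merge_lines` name and the data shape are hypothetical and not taken from the ODT writer or the tokenizer.

```python
def merge_lines(lines, sep=" "):
    """Join paragraph lines and shift each line's format positions.

    Each entry in ``lines`` is ``(text, formats)``, where ``formats`` is a
    list of ``(position, code)`` tuples relative to that line (assumed shape).
    """
    if len(lines) == 1:
        # Single line: positions are already correct, so skip recalculation
        return lines[0]

    parts = []
    merged = []
    offset = 0  # start of the current line within the combined text
    for text, formats in lines:
        parts.append(text)
        merged.extend((pos + offset, code) for pos, code in formats)
        offset += len(text) + len(sep)
    return sep.join(parts), merged
```

Each line's markers are shifted by the combined length of the preceding lines plus one separator per join, which is why a one-line paragraph needs no adjustment at all.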

Related Issue(s):

Needed for #1882

Reviewer's Checklist:

The headers of all files contain a reference to the repository license
The overall test coverage is increased or remains the same as before
All tests are passing
All flake8 checks are passing and the style guide is followed
Documentation (as docstrings) is complete and understandable
Only files that have been actively changed are committed

vkbo added 7 commits May 21, 2024 16:33

Rename a few files

8ac0440

Define header sizes as constants

aadad6a

Strip consecutive empty paragraphs when tokenizing

afe4a29

Merge lines of the same paragraph already in the tokenizer

d8bb2ad

Update converter classes to handle only single paragraphs

ba6259f

Update tests

adf0550

Drop empty tokens from the tokenizer data

7a13b14

vkbo added this to the Release 2.5 Beta 1 milestone May 21, 2024

vkbo added 2 commits May 21, 2024 22:59

Move text blocks first also in stats counter as an optimisation

dc0ddcf

Remove context menu test from main menu tests

3fd62d0

vkbo merged commit e95d330 into main May 21, 2024
8 checks passed

vkbo deleted the feature/tokenizer_improvements branch May 21, 2024 21:11

vkbo mentioned this pull request May 21, 2024

Add first line indent to HTML #1858

Closed
