Improve the text tokenizer #1885

Merged
merged 9 commits into from
May 21, 2024

vkbo
Owner

@vkbo vkbo commented May 21, 2024

Summary:

This PR makes the following improvements to the Tokenizer class:

  • Paragraphs consisting of multiple lines are now combined in the tokenizer instead of being passed through to the converter classes. This means that each token list entry with the block format T_TEXT is a whole paragraph. A setting flag determines whether line breaks are preserved or replaced with a single space. Each format class must ensure that any line breaks in this text are handled correctly.
  • T_EMPTY token entries are no longer passed on to the format classes. They are now stripped in the second pass over the token list, when paragraph lines are combined.
  • Any non-text block is now considered a paragraph separator, in addition to an empty line. This makes more sense in corner cases where, for instance, two text lines are separated by a heading but no empty lines. Previously, the heading would be processed fine, but the surrounding text lines would be combined into a single paragraph after the heading. This is no longer the case.
  • All of this is achieved by extending the second pass over the token list, which can look both behind and ahead since the first pass over the text has already been completed. Since there was already a second pass, there is little additional cost aside from some extra memory use. Performance-wise, the work now done in the tokenizer was previously done in each of the format classes anyway, so there should be little difference here as well.
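
The behaviour described in the list above can be illustrated with a minimal sketch. It is not the actual Tokenizer code: the `Token` tuple, the constant values, the `combine_paragraphs` function and the `keep_breaks` flag are all hypothetical names chosen for illustration, and the sketch buffers lines rather than using look behind and look ahead as the real second pass does.

```python
from typing import NamedTuple

# Hypothetical block types; the names follow the PR text, the values are made up.
T_EMPTY, T_TEXT, T_HEAD = "empty", "text", "head"


class Token(NamedTuple):
    kind: str
    text: str


def combine_paragraphs(tokens, keep_breaks=False):
    """Merge consecutive text lines into paragraphs and drop empty tokens."""
    joiner = "\n" if keep_breaks else " "
    result = []
    pending = []  # Text lines of the paragraph currently being built

    def flush():
        # Emit the buffered lines as a single T_TEXT token
        if pending:
            result.append(Token(T_TEXT, joiner.join(pending)))
            pending.clear()

    for token in tokens:
        if token.kind == T_TEXT:
            pending.append(token.text)
        else:
            # Any non-text block ends the paragraph, not only an empty line
            flush()
            if token.kind != T_EMPTY:
                result.append(token)  # Empty tokens are stripped here
    flush()
    return result
```

In this sketch, two text lines on either side of a heading end up as two separate paragraphs even when no empty line separates them, and no T_EMPTY entries appear in the output.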

This text line combining logic is identical to the one that was used in the ODT writer class, with a single optimisation: the recalculation of formatting positions is skipped if the paragraph contains a single line.
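
As a rough illustration of why formatting positions need recalculating, the sketch below merges lines that each carry a list of `(position, code)` pairs. The data layout, the `merge_lines` name and the `keep_breaks` flag are assumptions made for this example and do not reflect the actual token structure in the code base.

```python
def merge_lines(lines, keep_breaks=False):
    """Merge (text, formats) pairs into one paragraph.

    Each entry in formats is a hypothetical (position, code) pair, where
    position is a character offset into that line's own text.
    """
    if len(lines) == 1:
        # Single line: positions are already correct, so skip the recalculation
        return lines[0]

    joiner = "\n" if keep_breaks else " "
    parts = []
    formats = []
    offset = 0  # Start of the current line within the combined text
    for text, line_formats in lines:
        parts.append(text)
        # Shift each position by the length of everything that precedes this line
        formats.extend((pos + offset, code) for pos, code in line_formats)
        offset += len(text) + len(joiner)
    return joiner.join(parts), formats
```

Because the joiner is always a single character in this sketch, the offsets are the same whether line breaks are preserved or replaced with spaces.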

Related Issue(s):

Needed for #1882

Reviewer's Checklist:

  • The headers of all files contain a reference to the repository license
  • The overall test coverage is increased or remains the same as before
  • All tests are passing
  • All flake8 checks are passing and the style guide is followed
  • Documentation (as docstrings) is complete and understandable
  • Only files that have been actively changed are committed

@vkbo vkbo added this to the Release 2.5 Beta 1 milestone May 21, 2024
@vkbo vkbo merged commit e95d330 into main May 21, 2024
8 checks passed
@vkbo vkbo deleted the feature/tokenizer_improvements branch May 21, 2024 21:11