Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: invalid cm/tm in visitor functions #2206

Merged
merged 15 commits into from
Oct 8, 2023
Merged

Conversation

pubpub-zz
Copy link
Collaborator

@pubpub-zz pubpub-zz commented Sep 19, 2023

closes #2200
closes #2075
reworks and is still valid to close #2059

@codecov
Copy link

codecov bot commented Sep 19, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (4b090ba) 94.40% compared to head (8a6a0e5) 94.44%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2206      +/-   ##
==========================================
+ Coverage   94.40%   94.44%   +0.04%     
==========================================
  Files          43       43              
  Lines        7599     7621      +22     
  Branches     1501     1502       +1     
==========================================
+ Hits         7174     7198      +24     
+ Misses        262      261       -1     
+ Partials      163      162       -1     
Files Coverage Δ
pypdf/__init__.py 87.50% <100.00%> (ø)
pypdf/_text_extraction/__init__.py 93.70% <100.00%> (+0.36%) ⬆️
pypdf/_page.py 94.67% <88.00%> (+0.31%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@troethe,
From my analysis, the change you've proposed was not suffcient as the prev position may be overriden during the process.
I've created a PR where I've storing the cm/tm matrices when resetting the text variable (this normally occurs when meeting the Td which will occurs/set the position before restarting a new line
Can you have a review of my change?

@pubpub-zz pubpub-zz marked this pull request as ready for review September 20, 2023 19:10
Copy link
Contributor

@troethe troethe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I think this approach is currently the best way to handle #2075. The two comments I added are minutiae and not necessarily important.

m_prev = mult(tm_prev, cm_prev)
m = mult(tm_matrix, cm_matrix)
orientation = orient(m)
delta_x = m[4] - m_prev[4]
delta_y = m[5] - m_prev[5]
k = math.sqrt(abs(m[0] * m[3]) + abs(m[1] * m[2]))
f = font_size * k
cm_prev = m
Copy link
Contributor

@troethe troethe Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right. I think this line can however be safely removed, because cm_prev doesn't get accessed until it is being assigned to again at the end of this function (i.e. this is currently a noop).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree.
Good point to have noticed it . Actually I've forgot to pass the multiplied cm_matrix to the visitor

pypdf/_page.py Outdated
@@ -1946,7 +1946,10 @@ def _extract_text(
1.0,
0.0,
0.0,
] # will store previous tm_matrix
] # will store cm_matrix * tm_matrix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old comment was more correct, right? This change was probably a side effect of the revert.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct we store inhere only tm_matrix

@pubpub-zz pubpub-zz changed the title Terminal BUG: invalid cm/tm in visitor functions BUG: invalid cm/tm in visitor functions Sep 21, 2023
@pubpub-zz
Copy link
Collaborator Author

@troethe
I've taken into account your comments

@srogmann
I would like to have your opinion too:
I've also completed the PR passing the product of cm_matrix . tm_matrix as the users may not have an easy access to mult() function and because cm_matrix alone has no mean (for me)
also for me it is better to recommand to work on the cm_matrix as it will take into account all transformations()
finally looking at the code I currently see no difference between using visitor_operand_before() and visitor_operand_after().
Shouldn't we await from visitor_operand_before() to return the tuple operator,operands to allow the user to modify/cancel the processing of some operations ?

@srogmann
Copy link
Contributor

@pubpub-zz I also have no example of cm_matrix without tm_matrix at hand. PDF-reference, "5.3.3 Text Space Details", uses the product of cm_matrix and tm_matrix you mentioned.

An argument to stay at cm_matrix instead of the product might be the performance if the product is computed but not used. In that case the publish of a mult()-function might be useful.

In text-extraction-test 3 of tests/test_page.py I used visitor_operand_after to verify the Td-operation (see visitor_td).

The return of the operator by visit_operand_before() can support patches or studies like the one in #1389 (comment). Can or should the operand-list be modified inside visit_operand_before() by using append(...), clear(...), ...? Otherwise it would be better to return the tuple (operator, operands).

@pubpub-zz
Copy link
Collaborator Author

@pubpub-zz I also have no example of cm_matrix without tm_matrix at hand. PDF-reference, "5.3.3 Text Space Details", uses the product of cm_matrix and tm_matrix you mentioned.

An argument to stay at cm_matrix instead of the product might be the performance if the product is computed but not used. In that case the publish of a mult()-function might be useful.

In text-extraction-test 3 of tests/test_page.py I used visitor_operand_after to verify the Td-operation (see visitor_td).

The return of the operator by visit_operand_before() can support patches or studies like the one in #1389 (comment). Can or should the operand-list be modified inside visit_operand_before() by using append(...), clear(...), ...? Otherwise it would be better to return the tuple (operator, operands).

@MartinThoma
your opinion?

@MartinThoma
Copy link
Member

I don't have an opinion on this topic. To be honest, I haven't read the specs so I need to trust you on those changes.

If two of you agree that this PR is good (and CI is green), then I would merge :-)

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma / @srogmann
the PR for me is mature.
I left the change of cm_matrix which now reflects the position/orientation/scale of the text in the page user space :
this may be considered as a change in the interface but I don't think many people are using the existing matrix and the change makes things easier to handle (the matrix description was very obscure)
I did not implement the new feature to use visit_operand_before to modify the operations. I will do this in an independant PR for traceability.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Sep 26, 2023
@pubpub-zz
Copy link
Collaborator Author

Plus stdby

@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Oct 1, 2023
@MartinThoma
Copy link
Member

Plus stdby

I take this as "please standby", so I should not merge?

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Oct 1, 2023

Plus stdby

I take this as "please standby", so I should not merge?

Correct interpretation
@jianfan2012 used the visitor in an interesting (cf #2163 (reply in thread)) way where the multiplication of cm.tm is preventing the use
we need to keep then separate. I'm reverting my change update the documentation

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
All should be good now.

Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@MartinThoma MartinThoma merged commit bcd85c4 into py-pdf:main Oct 8, 2023
@MartinThoma
Copy link
Member

The cm/tm parts of the visitor functions are really hard to get right. Thank you for taking care of it @pubpub-zz 🙏

MartinThoma added a commit that referenced this pull request Oct 8, 2023
## What's new

### Bug Fixes (BUG)
-  invalid cm/tm in visitor functions (#2206) by @pubpub-zz
-  Encrypt / decrypt Stream object dictionaries (#2228) by @pilotandy
-  Support nested color spaces for the /DeviceN color space (#2241) by @Stefan
-  images property fails if NullObject in list (#2215) by @pubpub-zz

### Robustness (ROB)
-  XYZ destination to cope with missing left and top param (#2237) by @pubpub-zz

### Documentation (DOC)
-  Add pilotandy for #2228 as a contributor by @MartinThoma
-  Link to CONTRIBUTING.md in README.md (#2232) by @markpfeifle
-  Changelog by @MartinThoma

### Developer Experience (DEV)
-  Exclude tests from mypy checks by @MartinThoma
-  Unify mypy options and warn redundant workarounds (#2223) by @exiledkingcc
-  Stabilize Pillow test with Pillow missing (#2229) by @Stefan

### Maintenance (MAINT)
-  Update pinned packages (#2243) by @MartinThoma

### Testing (TST)
-  Regression test against non-deterministic accidental object reuse (#2244) by @MartinThoma

[Full Changelog](3.16.2...3.16.3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants