ENH: Add "layout" mode for text extraction #2388
Commits on Jan 3, 2024
ENH: text extraction "layout" mode
- add _text_extraction/_layout_mode subpackage (initial version)
- expose new subpackage functionality via new PageObject methods _layout_mode_fonts() and _layout_mode_text()
- add "extraction_mode" parameter and layout_mode kwargs to existing PageObject.extract_text() method for experimental usage
Commit 86ed974
BUG: bad refactor in _layout_mode/_fonts.py
Remove unnecessary "any()" wrapper after refactoring for python 3.7
Commit f43b84e
Commit 220de15
Commit 21d9f1b
Commit 9fa3b5f
Commit 1545a27
Commit 81b6a83
Commit bb9190b
Commits on Jan 4, 2024
Commit cefbfc6
MAINT: Address PR review comments
- DOC: standardize language. use "layout", not "structure/structural".
- BUG: address bug introduced by ruff refactoring (remove "TYPE_CHECKING" block for Literal import)
- DEV: use sys.version_info based import switch (not try/except) for Literal and TypedDict to correct vscode colors and prevent odd mypy errors
- TST: add test created by @MartinThoma in py-pdf#2390
- ENH: add remaining standard fonts and aliases
Commit ff7e40f
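The `sys.version_info` import switch mentioned in commit ff7e40f can be sketched roughly as below. This is an illustration of the pattern only, not the PR's actual code: the `ExtractionMode` alias, the `LayoutOptions` TypedDict, the `"plain"` value, and the `describe()` helper are hypothetical names, and the `typing_extensions` fallback is an assumption about which backport would be used on Python 3.7.

```python
import sys

# Type checkers evaluate sys.version_info comparisons statically, so each
# branch resolves to a single known import. A try/except ImportError
# fallback is harder for tools such as mypy to follow.
if sys.version_info >= (3, 8):
    from typing import Literal, TypedDict
else:
    # Assumption: the typing_extensions backport is installed on Python 3.7.
    from typing_extensions import Literal, TypedDict

# Hypothetical alias; "plain" is assumed here purely for illustration.
ExtractionMode = Literal["plain", "layout"]


class LayoutOptions(TypedDict):
    """Hypothetical option bundle, not part of the PR."""

    mode: ExtractionMode
    strip_rotated: bool


def describe(options: LayoutOptions) -> str:
    """Return a short human-readable summary of the options."""
    rotated = "stripped" if options["strip_rotated"] else "kept"
    return f"{options['mode']} mode, rotated text {rotated}"
```

Calling `describe({"mode": "layout", "strip_rotated": True})` returns a summary string; the point of the sketch is only that `Literal` and `TypedDict` are imported through a version check rather than an exception handler.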
Commit f37909c
Commit 48e971e
Commits on Jan 5, 2024
MAINT: Address review comments
- PI: move json imports (debug only)
- DEV: move `_set_state_param()` definition nearer to usage
- MAINT: use `PdfReadError` vs `ValueError`
- DOC: Comment/docstring improvements per review
Commit 8742bcc
Commit 8e9d879
- DEV: add LAYOUT_NEW_BT_GROUP_SPACE_WIDTHS constant to _text_extraction __init__.py
Commit 4dc3250
Commits on Jan 6, 2024
Commit d1d85a0
Commit 70b2f31
Commit e7d5edd
Commit 64d1df0
Commits on Jan 7, 2024
ENH: TJ spacing and rotation handling
- DEV: disambiguate "XformStack" and "xform" language in layout mode from existing extract_xform_text:
  - "xform" --> "transform"
  - "XformStack" --> "TextStateManager"
  - "xform_stack" --> "text_state_mgr" or "state_mgr"
- DEV: move "TextStateParams" to its own file for easier discoverability and cross referencing during development
- PI: reduce overhead of TextStateParams by eliminating unnecessary dataclass fields
- DEV: rename _fonts.py to _font.py to properly reflect internal class name (Font)
- DEV: rename "opands" to "operands" for all usages in _fixed_width_page.py
- ENH: Use font.space_width * 2 as a fallback for assigning space_tx during TextStateParams.__post_init__()
  - applies when a font assigns width 0 for the " " char and uses TJ int operators for fine grained inter-word spacing as shown in crazyones.pdf
- ENH: correct calculation for triggering a new BTGroup in recurs_to_target_op(). Remove abs() to prevent triggering on TJ spacing operators
- ENH: rotation handling:
  - add "layout_mode_strip_rotated" kwarg to PageObject.extract_text() to assign new layout mode "strip_rotated" parameter.
  - produce a logger_warning if rotated text is found.
  - if strip_rotated == True, remove text that is rotated with respect to the page with warning "Rotated text discovered. Output will be incomplete."
  - if strip_rotated == False, include all text, rotated or not, and warn with "Rotated text discovered. Layout will be degraded."
Commit 377bbd1
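The rotation handling described in commit 377bbd1 could look something like the sketch below. It is not the PR's implementation: `TextRun`, `is_rotated()`, and `filter_rotated()` are hypothetical names, the rotation test (nonzero `b` or `c` terms in a PDF-style `[a, b, c, d, e, f]` matrix) is an assumption, and page-level rotation is ignored. Only the two warning strings are taken from the commit message.

```python
import logging
from typing import List, Sequence, Tuple

logger = logging.getLogger(__name__)

# (text, transform), where transform is a PDF-style matrix [a, b, c, d, e, f].
TextRun = Tuple[str, Sequence[float]]


def is_rotated(transform: Sequence[float], tol: float = 1e-9) -> bool:
    """Assumption: text is treated as rotated when b or c is nonzero.

    A 180-degree rotation (a = d = -1, b = c = 0) is not detected
    by this simple check.
    """
    return abs(transform[1]) > tol or abs(transform[2]) > tol


def filter_rotated(runs: List[TextRun], strip_rotated: bool = True) -> List[TextRun]:
    """Warn once if rotated text exists; drop it only when strip_rotated is True."""
    if not any(is_rotated(tm) for _, tm in runs):
        return list(runs)
    if strip_rotated:
        logger.warning("Rotated text discovered. Output will be incomplete.")
        return [(text, tm) for text, tm in runs if not is_rotated(tm)]
    logger.warning("Rotated text discovered. Layout will be degraded.")
    return list(runs)


runs = [
    ("upright", [1, 0, 0, 1, 72, 700]),
    ("sideways", [0, 1, -1, 0, 300, 400]),  # rotated 90 degrees
]
kept = [text for text, _ in filter_rotated(runs, strip_rotated=True)]
```

With `strip_rotated=True` the sideways run is dropped and `kept` holds only the upright text; with `strip_rotated=False` both runs are returned and only the warning differs.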
Commits on Jan 8, 2024
Commit 3a0fc89
Commit fe7bb69
Commit cec0be3
Commit 955bd38
BUG: address bugs caused by rename/refactor
- correct submodule name in test_text_extraction.py
- resolve indirect objects in /DescendantFonts
- add "layout_mode_strip_rotated" explanation to extract-text.md
- prevent double spacing for the first tj element of a bt group
Commit 579692a
Merge branch 'text-layout-mode' of https://github.com/shartzog/pypdf into text-layout-mode
Commit 4402caa
Commit 41417eb
Commit 778f3c7
Commit 8279c79
Tests and bug fixes for "uncommon" operators
- add toy.pdf and toy.layout.pdf and associated test case for handling T*, ', ", TD, Tc, Tw, Tz, TL, and Ts operators
- correct bugs associated with TL impacting T*, ', and " (sign is reversed from 1.7 standard, side effect of layout mode algorithm)
- make "_set_state_param" and "decode_tj" methods of the TextStateManager class rather than passing the text state manager to them manually
Commit 75aec12
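For context on the TL-related fix in commit 75aec12, the sketch below shows how the PDF 1.7 standard defines the interaction between TL, Td, TD, and T*. It follows the standard's sign convention, not pypdf's layout-mode internals (which, per the commit message, use a reversed sign). `LinePosition` and its method names are hypothetical, and the sketch assumes an unscaled, unrotated text line matrix so that only the translation needs tracking.

```python
from dataclasses import dataclass


@dataclass
class LinePosition:
    """Tracks only the translation of the text line matrix.

    Assumption: no scaling or rotation, so Td is a plain offset.
    """

    x: float = 0.0
    y: float = 0.0
    leading: float = 0.0  # the TL text-state parameter

    def set_leading(self, leading: float) -> None:
        """`leading TL`: set the text leading."""
        self.leading = leading

    def move(self, tx: float, ty: float) -> None:
        """`tx ty Td`: offset the start of the line by (tx, ty)."""
        self.x += tx
        self.y += ty

    def move_and_set_leading(self, tx: float, ty: float) -> None:
        """`tx ty TD`: equivalent to `-ty TL` followed by `tx ty Td`."""
        self.set_leading(-ty)
        self.move(tx, ty)

    def next_line(self) -> None:
        """`T*`: equivalent to `0 -leading Td` (y decreases down the page).

        The ' and " operators perform this same move before showing text.
        """
        self.move(0.0, -self.leading)


pos = LinePosition(x=72.0, y=700.0)
pos.set_leading(12.0)
pos.next_line()                       # y drops by the leading of 12
pos.move_and_set_leading(0.0, -14.0)  # leading becomes 14, y drops by 14
pos.next_line()                       # y drops by the new leading of 14
```

The design point is that leading is stored as a positive number while moving to the next line decreases y, which is where a sign error can easily creep in.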
Commit f25e9d5
Commits on Jan 9, 2024
- cover both Type0 DescendantFonts /W formats in tests
- add `set_font()` to TextStateManager instead of setting font/font_size attributes directly
Commit 744a6db
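The two /W formats referred to in commit 744a6db are the ones the PDF specification defines for CIDFont width arrays: `c [w1 w2 ... wn]` (consecutive CIDs starting at `c`) and `c_first c_last w` (one width for a whole range). A minimal sketch of expanding both into a lookup table follows. `parse_cid_widths()` is a hypothetical helper, not pypdf's code, and it assumes the array has already been resolved into plain Python numbers and lists.

```python
from typing import Any, Dict, Sequence


def parse_cid_widths(w_array: Sequence[Any]) -> Dict[int, float]:
    """Expand a CIDFont /W array into a {cid: width} mapping."""
    widths: Dict[int, float] = {}
    i = 0
    while i < len(w_array):
        first = int(w_array[i])
        second = w_array[i + 1]
        if isinstance(second, (list, tuple)):
            # Format 1: c [w1 w2 ... wn] gives widths for CIDs c, c+1, ...
            for offset, width in enumerate(second):
                widths[first + offset] = width
            i += 2
        else:
            # Format 2: c_first c_last w gives one width for the whole range.
            last = int(second)
            width = w_array[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


# One entry of each format: CIDs 1-2 listed individually, CIDs 10-12 as a range.
widths = parse_cid_widths([1, [500, 600], 10, 12, 250])
```

Here CIDs 1 and 2 receive individual widths from the list form, while CIDs 10 through 12 share the single width given by the range form.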
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Commit c5f0cd8
Commit 1b65085
Commit 373025d
Commit cdaa9ca
Commit 64e4c83
Commit 878e407
Commit e9962b3
Commit 06e79d3
Commit f43201a