Absorb spaces after 'stream' declarations #642

Merged (1 commit) on Sep 28, 2023


GreyWyvern
Contributor

@GreyWyvern GreyWyvern commented Sep 26, 2023

Type of pull request

  • Bug fix (involves code and configuration changes)

About

When detecting the start of a stream, PdfParser currently expects the character immediately after the stream keyword to be either a carriage return (\r) or a newline (\n). If a space sits between the stream keyword and the \r or \n, the data is not detected as a stream and is discarded.

Adjust the regexp in RawDataParser.php to absorb spaces after stream.
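
To illustrate the behaviour described above, here is a minimal sketch. The helper streamEolLength() and both patterns are assumptions made for this example; the exact expression in RawDataParser.php may differ. It only shows how allowing leading spaces changes whether the end-of-line marker after the stream keyword is recognised.

```php
<?php
// Hypothetical helper, NOT part of PdfParser: given the bytes that follow the
// `stream` keyword, return the length of the end-of-line marker, or null if
// no marker is recognised (in which case the stream would be skipped).
function streamEolLength(string $afterKeyword, bool $absorbSpaces): ?int
{
    // Strict (assumed old behaviour): optional \r then \n, directly after the keyword.
    // Lenient (assumed new behaviour): the same, preceded by any number of spaces.
    $pattern = $absorbSpaces ? '/^( *\r?\n)/' : '/^(\r?\n)/';

    if (1 === preg_match($pattern, $afterKeyword, $matches)) {
        return strlen($matches[1]);
    }

    return null;
}

$clean  = "\r\nBT /F1 12 Tf ET";  // keyword followed directly by CRLF
$padded = " \r\nBT /F1 12 Tf ET"; // keyword followed by a space, then CRLF

var_dump(streamEolLength($clean, false));  // int(2)
var_dump(streamEolLength($padded, false)); // NULL
var_dump(streamEolLength($padded, true));  // int(3)
```

With the lenient pattern, the returned length includes the absorbed spaces, so a caller can skip past them and start reading the stream data at the correct offset.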

Resolves #641. Note that the sample files provided by the original reporter of #641 still show font decoding issues in the output; those are outside the scope of this fix.

Checklist for code / configuration changes

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
  • Please run PHP-CS-Fixer before committing, to conform to our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.
  • In case you fix an existing issue, please do one of the following:
    • Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

There appear to be some additional font issues remaining, but the text content is now read by getText().
@k00ni k00ni added the fix label Sep 27, 2023
@k00ni k00ni merged commit 051ec84 into smalot:master Sep 28, 2023
29 checks passed
@GreyWyvern GreyWyvern deleted the stream-whitespace branch February 14, 2024 15:35
Successfully merging this pull request may close these issues.

Can't extract text from pdf, Returns empty string