Improve performance by not extracting compressed image data if retainImageContent was set to false #590

Merged on Apr 13, 2023 (12 commits)

Conversation

Contributor

@se-ti se-ti commented Mar 27, 2023

An earlier pull request made it possible to skip decompressing image data when the user sets the retainImageContent flag to false.
This pull request goes a step further: when the flag is false, the compressed image data is not extracted at all.

Minor optimization: when extracting stream data, the parser now tries to use the stream length given in the header to speed up preg_match.
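The idea behind the length-based optimization can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `extractStream` is a hypothetical helper, and the caller is assumed to have already read the /Length value from the object header.

```php
<?php
/**
 * Hypothetical helper (not part of the pdfparser API).
 * Returns the raw bytes of the first stream found at or after $offset,
 * or null if no stream is found.
 */
function extractStream(string $pdf, int $offset, ?int $length): ?string
{
    // Locate the "stream" keyword (but not "endstream") and its end-of-line marker.
    if (!preg_match('/(?<!end)stream\r?\n/', $pdf, $m, PREG_OFFSET_CAPTURE, $offset)) {
        return null;
    }
    $start = $m[0][1] + strlen($m[0][0]);

    // Fast path: trust the declared length only if "endstream" really follows it.
    if (null !== $length && $length >= 0) {
        $tail = (string) substr($pdf, $start + $length, 20);
        if (preg_match('/^\s*endstream/', $tail)) {
            return substr($pdf, $start, $length);
        }
    }

    // Slow path: the length is missing or wrong, so scan for "endstream".
    $end = strpos($pdf, 'endstream', $start);
    if (false === $end) {
        return null;
    }

    // Drop the single end-of-line marker that precedes "endstream".
    return preg_replace('/\r?\n\z/', '', substr($pdf, $start, $end - $start));
}

$pdf = "1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n";
echo extractStream($pdf, 0, 5)."\n";
```

With a correct length the payload is cut out with a single substr call, so large image streams never have to be scanned byte by byte. The fallback keeps the sketch tolerant of files whose declared length is wrong.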

Compared with the memory usage achieved in pull request 582, this PR reduces it
from 45M on the first run and 25M on the best subsequent run
to 43M on the first run and 18M on the best subsequent run.

https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf

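<?php
// Cap the memory available to the script.
$lim = '128M';
ini_set("memory_limit", $lim);
include 'alt_autoload.php-dist';

// Test document (downloaded from the URL above).
$path = 'danilina-1el2-arhyz-2021.pdf';

// Tell the parser not to keep image content.
$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$parser = new \Smalot\PdfParser\Parser([], $config);

$content = file_get_contents($path);
$pdf = $parser->parseContent($content);

$text = $pdf->getText();

// Print the effective memory limit, then the extracted text.
echo ini_get("memory_limit")."\n";
echo $text;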
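
The PR does not say how the memory figures above were measured. One possible way, shown here only as a sketch, is to print PHP's peak memory usage at the end of the script; `formatMegabytes` is a hypothetical helper, and the `str_repeat` call is a stand-in for the parsing work.

```php
<?php
/**
 * Hypothetical helper: formats a byte count as megabytes, e.g. "45.0M".
 */
function formatMegabytes(int $bytes): string
{
    // %F is the locale-independent float format.
    return sprintf('%.1FM', $bytes / 1048576);
}

// Stand-in workload; in the script above this would be parseContent() and getText().
$buffer = str_repeat('x', 5 * 1048576);

// true = report memory actually reserved from the system, not just what is in use.
echo 'Peak memory: '.formatMegabytes(memory_get_peak_usage(true))."\n";
```

Appending the last line to the test script would report one figure per run, which is enough to compare a first launch with later ones.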

Contributor Author

se-ti commented Mar 27, 2023

Sorry, I don't have PHPStan installed and set up.
What is the problem with it?

How can I interpret its output?

Loaded config default from "/home/runner/work/pdfparser/pdfparser/.php-cs-fixer.php".
................F................................................ 65 / 72 ( 90%)
.......                                                           72 / 72 (100%)
Legend: .-no changes, F-fixed, S-skipped (cached or empty file), I-invalid file syntax (file ignored), E-error
   1) src/Smalot/PdfParser/RawData/RawDataParser.php (modernize_types_casting, method_argument_space, no_unneeded_control_parentheses, no_superfluous_phpdoc_tags, native_function_invocation, yoda_style, phpdoc_align)

Collaborator

k00ni commented Mar 27, 2023

How can I interpret its output?

PHP-CS-Fixer has complaints about the file. Give me a day or two if you don't want to install PHP-CS-Fixer locally to fix it yourself.

Made getHeaderValue private, because it is only used internally. We don't wanna expand our API, if it can be avoided.
@k00ni changed the title from "Do not extract compressed image data if retainImageContent was set to false" to "Improve performance by not extracting compressed image data if retainImageContent was set to false" on Mar 30, 2023
Collaborator

@k00ni k00ni left a comment


Sorry, I didn't have the time to answer properly. Now I do.

First of all, thank you again for your work @se-ti.

I pushed a few changes fixing coding style issues (the ones you mentioned) and made the new method getRawObject private.

Further remarks in code.

Contributor Author

se-ti commented Mar 30, 2023

I pushed a few changes fixing coding style issues...

Thanks a lot, @k00ni!

Co-authored-by: Konrad Abicht <hi@inspirito.de>
@se-ti se-ti requested a review from k00ni April 1, 2023 21:01
Contributor Author

se-ti commented Apr 4, 2023

I think I have made all the requested changes.
Should I do anything else?
Any other requests, @k00ni?

Contributor Author

se-ti commented Apr 10, 2023

Any news, @k00ni?

Collaborator

@k00ni k00ni left a comment


@se-ti I am very busy currently, but I will try to squeeze in a few minutes here and there. Thank you for bearing with me.

I added a few comments to the code based on your latest feedback. My knowledge of this topic is limited, so I will have to trust you on this one. It looks clean and the results speak for themselves.

I will leave this open for a few days to give others a chance to object.

Contributor Author

se-ti commented Apr 11, 2023

Thank you!
I am waiting for this PR to be merged so that I can suggest another one.
