Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce excessive memory allocations #582

Merged
merged 4 commits into from
Mar 14, 2023
Merged

Reduce excessive memory allocations #582

merged 4 commits into from
Mar 14, 2023

Conversation

se-ti
Copy link
Contributor

@se-ti se-ti commented Mar 8, 2023

Hi!
I ran into issue, that extracting text from ~20Mb pdfs (~300K text, lots of images), fails with memory allocation error at 128M limit.
I looked through the code and found several unlimited substring allocations and fixed them.
I've replaced couple of regexps with a bit more effective code also.

Hope, You'd accept my PR.

You can test my fixes with such a pdf https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf and this code:

$lim = '128M';
ini_set("memory_limit", $lim);
include 'alt_autoload.php-dist';

$path = 'danilina-1el2-arhyz-2021.pdf';

$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$parser = new \Smalot\PdfParser\Parser([], $config);

$content = file_get_contents($path);
$pdf = $parser->parseContent($content);

$text = $pdf->getText();

echo ini_get("memory_limit")."\n";
echo $text;

Given file required ~90Mb RAM for the first launch and ~45Mb for the best subsequent ones.
After the fixes it requires 45Mb for the first launch and just ~25Mb for the best subsequent ones.

@k00ni
Copy link
Collaborator

k00ni commented Mar 8, 2023

@se-ti Thank you for your pull request.

I am especially interested in optimization-related pull requests. Could you elaborate at bit what your changes do?

@k00ni
Copy link
Collaborator

k00ni commented Mar 8, 2023

Thank you for your fast response. I didn't check your code in detail, but was wondering in which instances your adaptions change code semantics (and implications for existing programs using it).

Are there any edge cases we have to keep in mind here?

if (('<' == $char) && 1 == $pregResult) {

$span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
if (('<' == $char) && $span > 0 && @$pdfData[$offset+$span] == '>') {
Copy link
Collaborator

@k00ni k00ni Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using the @ character here? I know it is being used with functions to suppress PHP warnings/errors.

Copy link
Contributor Author

@se-ti se-ti Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$pdfData is a string.
It shouldn't be the case, but what if $offset + $span exceeds string length?
let it be silent null, that would fail comparison with '>'.

But if you don't like it, you can remove it.
No problem.

@se-ti
Copy link
Contributor Author

se-ti commented Mar 8, 2023

I didn't check your code in detail, but was wondering in which instances your adaptions change code semantics (and implications for existing programs using it).

Are there any edge cases we have to keep in mind here?

I didn't change semantics of any method.
Therefore i suppose my changes are well-isolated and won't change behaviour of existsing programs.

Except for reduced memory consumption ^)

@k00ni
Copy link
Collaborator

k00ni commented Mar 8, 2023

I didn't change semantics of any method. Therefore i suppose my changes are well-isolated and won't change behaviour of existsing programs.

Thank you for these detailled and quick answers! It will help to better understand these changes. There are over 1000 projects using this library, thats why I wanted to check these points.
I am currently busy, but will try to find some time to get back to you. If someone else wants to bring this PR forward, feel free though!

@k00ni
Copy link
Collaborator

k00ni commented Mar 10, 2023

Can you please merge in master branch? Latest changes should fix all the failing tests.

@se-ti
Copy link
Contributor Author

se-ti commented Mar 10, 2023

@k00ni, is current problem PR problem, or tests problem?

It doesn't like non-yoda-style condition, no space around +, @, or what?
$span > 0 && @$pdfData[$offset+$span] == '>'

@se-ti
Copy link
Contributor Author

se-ti commented Mar 10, 2023

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?
There seem to be 2 more RAM & CPU performance improvements if it is.

@k00ni
Copy link
Collaborator

k00ni commented Mar 13, 2023

[...] is current problem PR problem, or tests problem?

To my knowledge, PHPUnit 10.0 "wrecked" some tests which also impedes this PR. It was not your fault.

It doesn't like non-yoda-style condition, no space around +, @, or what? $span > 0 && @$pdfData[$offset+$span] == '>'

I have to check that. Usually it is enough to run our PHP-CS-Fixer instance and it fixes all. Did you run make run-php-cs-fixer?

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?

I can't answer that. Can you provide more context?

@se-ti
Copy link
Contributor Author

se-ti commented Mar 13, 2023

I have to check that. Usually it is enough to run our PHP-CS-Fixer instance and it fixes all. Did you run make run-php-cs-fixer?

No, i didn't.

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?

I can't answer that. Can you provide more context?

I've already found its usages in existing code :), so i'll use it too.

if ($decoding && ('stream' === $element[0]) && (isset($objContentArr[$i - 1][0])) && ('<<' === $objContentArr[$i - 1][0])) {

if (('Length' == $v[1]) && (isset($sdic[$k + 1])) && ('numeric' == $sdic[$k + 1][0])) {

Waiting for this PR to be merged in master to suggest another one.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, only have a few remarks.

Any feedback @j0k3r @smalot?

src/Smalot/PdfParser/RawData/RawDataParser.php Outdated Show resolved Hide resolved
@k00ni
Copy link
Collaborator

k00ni commented Mar 13, 2023

@k00ni k00ni changed the title reduce excessive memory allocations Reduce excessive memory allocations Mar 13, 2023
Co-authored-by: Konrad Abicht <hi@inspirito.de>
@se-ti
Copy link
Contributor Author

se-ti commented Mar 13, 2023

Just curios, did you also try with https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setdecodememorylimit--setretainimagecontent-manage-memory-usage?

Tried, but didn't find it useful.
I set 1M limit, got a pdf with larger image in it, received a warning
gzuncompress(): insufficient memory

What should i do next?
What's the profit?

It seems to be useful only if i'm sure, i would never receive pdf with large enough objects in it.
I'm not sure in my case.

@k00ni
Copy link
Collaborator

k00ni commented Mar 13, 2023

What's the profit?

We introduced these optimizations with #476 and #441, whereas #441 is about the images. As far as I understood these PRs, these options allow to restrict certain (PHP-) functions to use fewer ressources.

What should i do next?

I just wanted to check if you knew about these options and in case you did, if you experienced anything problematic.

@k00ni k00ni self-assigned this Mar 13, 2023
Copy link
Collaborator

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the code was already too complex for me to understand it properly.
If tests are green and OP fixed a memory leak, I'm 👍🏼

@k00ni k00ni merged commit 43a1c14 into smalot:master Mar 14, 2023
@k00ni
Copy link
Collaborator

k00ni commented Mar 14, 2023

Thank you for your work here @se-ti.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants