Reduce excessive memory allocations #582

se-ti · 2023-03-08T08:13:46Z

Hi!
I ran into issue, that extracting text from ~20Mb pdfs (~300K text, lots of images), fails with memory allocation error at 128M limit.
I looked through the code and found several unlimited substring allocations and fixed them.
I've replaced couple of regexps with a bit more effective code also.

Hope, You'd accept my PR.

You can test my fixes with such a pdf https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf and this code:

$lim = '128M';
ini_set("memory_limit", $lim);
include 'alt_autoload.php-dist';

$path = 'danilina-1el2-arhyz-2021.pdf';

$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$parser = new \Smalot\PdfParser\Parser([], $config);

$content = file_get_contents($path);
$pdf = $parser->parseContent($content);

$text = $pdf->getText();

echo ini_get("memory_limit")."\n";
echo $text;

Given file required ~90Mb RAM for the first launch and ~45Mb for the best subsequent ones.
After the fixes it requires 45Mb for the first launch and just ~25Mb for the best subsequent ones.

k00ni · 2023-03-08T11:42:47Z

@se-ti Thank you for your pull request.

I am especially interested in optimization-related pull requests. Could you elaborate at bit what your changes do?

src/Smalot/PdfParser/RawData/RawDataParser.php

k00ni · 2023-03-08T12:14:08Z

Thank you for your fast response. I didn't check your code in detail, but was wondering in which instances your adaptions change code semantics (and implications for existing programs using it).

Are there any edge cases we have to keep in mind here?

src/Smalot/PdfParser/RawData/RawDataParser.php

k00ni · 2023-03-08T12:16:39Z

src/Smalot/PdfParser/RawData/RawDataParser.php

-                    if (('<' == $char) && 1 == $pregResult) {
+
+                    $span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
+                    if (('<' == $char) && $span > 0 && @$pdfData[$offset+$span] == '>') {


Why using the @ character here? I know it is being used with functions to suppress PHP warnings/errors.

$pdfData is a string.
It shouldn't be the case, but what if $offset + $span exceeds string length?
let it be silent null, that would fail comparison with '>'.

But if you don't like it, you can remove it.
No problem.

src/Smalot/PdfParser/RawData/RawDataParser.php

se-ti · 2023-03-08T13:41:56Z

I didn't check your code in detail, but was wondering in which instances your adaptions change code semantics (and implications for existing programs using it).

Are there any edge cases we have to keep in mind here?

I didn't change semantics of any method.
Therefore i suppose my changes are well-isolated and won't change behaviour of existsing programs.

Except for reduced memory consumption ^)

k00ni · 2023-03-08T14:41:57Z

I didn't change semantics of any method. Therefore i suppose my changes are well-isolated and won't change behaviour of existsing programs.

Thank you for these detailled and quick answers! It will help to better understand these changes. There are over 1000 projects using this library, thats why I wanted to check these points.
I am currently busy, but will try to find some time to get back to you. If someone else wants to bring this PR forward, feel free though!

k00ni · 2023-03-10T10:17:58Z

Can you please merge in master branch? Latest changes should fix all the failing tests.

se-ti · 2023-03-10T16:45:53Z

@k00ni, is current problem PR problem, or tests problem?

It doesn't like non-yoda-style condition, no space around +, @, or what?
$span > 0 && @$pdfData[$offset+$span] == '>'

se-ti · 2023-03-10T16:56:15Z

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?
There seem to be 2 more RAM & CPU performance improvements if it is.

k00ni · 2023-03-13T07:45:28Z

[...] is current problem PR problem, or tests problem?

To my knowledge, PHPUnit 10.0 "wrecked" some tests which also impedes this PR. It was not your fault.

It doesn't like non-yoda-style condition, no space around +, @, or what? $span > 0 && @$pdfData[$offset+$span] == '>'

I have to check that. Usually it is enough to run our PHP-CS-Fixer instance and it fixes all. Did you run make run-php-cs-fixer?

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?

I can't answer that. Can you provide more context?

se-ti · 2023-03-13T08:32:36Z

I have to check that. Usually it is enough to run our PHP-CS-Fixer instance and it fixes all. Did you run make run-php-cs-fixer?

No, i didn't.

P.S. Is it reliable to use object header's field /Length when parsing stream-endstream section?

I can't answer that. Can you provide more context?

I've already found its usages in existing code :), so i'll use it too.

pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php

Line 562 in 3ef8bc5

    
           if ($decoding && ('stream' === $element[0]) && (isset($objContentArr[$i - 1][0])) && ('<<' === $objContentArr[$i - 1][0])) {

pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php

Line 100 in 3ef8bc5

    
           if (('Length' == $v[1]) && (isset($sdic[$k + 1])) && ('numeric' == $sdic[$k + 1][0])) {

Waiting for this PR to be merged in master to suggest another one.

k00ni

Looks good to me, only have a few remarks.

Any feedback @j0k3r @smalot?

src/Smalot/PdfParser/RawData/RawDataParser.php

k00ni · 2023-03-13T09:28:37Z

Just curios, did you also try with https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setdecodememorylimit--setretainimagecontent-manage-memory-usage?

Co-authored-by: Konrad Abicht <hi@inspirito.de>

se-ti · 2023-03-13T10:42:04Z

Just curios, did you also try with https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setdecodememorylimit--setretainimagecontent-manage-memory-usage?

Tried, but didn't find it useful.
I set 1M limit, got a pdf with larger image in it, received a warning
gzuncompress(): insufficient memory

What should i do next?
What's the profit?

It seems to be useful only if i'm sure, i would never receive pdf with large enough objects in it.
I'm not sure in my case.

k00ni · 2023-03-13T12:10:09Z

What's the profit?

We introduced these optimizations with #476 and #441, whereas #441 is about the images. As far as I understood these PRs, these options allow to restrict certain (PHP-) functions to use fewer ressources.

What should i do next?

I just wanted to check if you knew about these options and in case you did, if you experienced anything problematic.

j0k3r

AFAIK the code was already too complex for me to understand it properly.
If tests are green and OP fixed a memory leak, I'm 👍🏼

k00ni · 2023-03-14T08:14:33Z

Thank you for your work here @se-ti.

reduce excessive memory allocations

a2c3168

k00ni added the enhancement label Mar 8, 2023

se-ti commented Mar 8, 2023

View reviewed changes