Reduce excessive memory allocations #582
@se-ti Thank you for your pull request. I am especially interested in optimization-related pull requests. Could you elaborate a bit on what your changes do?
Thank you for your fast response. I didn't check your code in detail, but I was wondering in which cases your adaptations change code semantics (and what the implications are for existing programs using it). Are there any edge cases we have to keep in mind here?
if (('<' == $char) && 1 == $pregResult) {
$span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
if (('<' == $char) && $span > 0 && @$pdfData[$offset+$span] == '>') {
Why use the @ character here? I know it is used with functions to suppress PHP warnings/errors.
$pdfData is a string.
It shouldn't happen, but what if $offset + $span exceeds the string length?
In that case let it be a silent null, which would fail the comparison with '>'.
But if you don't like it, you can remove it.
No problem.
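For reference, a minimal sketch of the same check without `@`, assuming `$offset` points at the first byte after the opening `<`. The helper name `hexStringSpan` is hypothetical and not part of this PR; it only illustrates replacing the error suppression with an explicit bounds check.

```php
<?php
// Hypothetical helper for illustration only; not code from this PR.
// Returns the length of the hex-string body starting at $offset,
// or null if the bytes there do not form "hex digits/whitespace" + '>'.
function hexStringSpan(string $pdfData, int $offset): ?int
{
    // strspn() counts matching bytes in place, so no substring is copied.
    $span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
    $end = $offset + $span;

    // Explicit bounds check: never read past the end of the string,
    // so there is no warning that would need to be suppressed.
    if ($span > 0 && $end < strlen($pdfData) && '>' === $pdfData[$end]) {
        return $span;
    }

    return null;
}
```

With this shape the condition in the diff could become something like `'<' == $char && null !== hexStringSpan($pdfData, $offset)`, though whether that fits the surrounding parser code is an assumption.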
I didn't change the semantics of any method, except for reduced memory consumption ^)
Thank you for these detailed and quick answers! They will help to better understand these changes. There are over 1000 projects using this library, that's why I wanted to check these points.
Can you please merge in the master branch? The latest changes should fix all the failing tests.
@k00ni, is the current failure a problem with the PR or with the tests? Does it dislike the non-Yoda-style condition, the missing spaces around +, the @, or what?
P.S. Is it reliable to use the object header's /Length field when parsing the stream-endstream section?
To my knowledge, PHPUnit 10.0 "wrecked" some tests which also impedes this PR. It was not your fault.
I have to check that. Usually it is enough to run our PHP-CS-Fixer instance and it fixes all. Did you run
I can't answer that. Can you provide more context?
No, I didn't.
I've already found its usages in the existing code :), so I'll use it too.
I'm waiting for this PR to be merged into master before suggesting another one.
Just curious, did you also try with https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md#option-setdecodememorylimit--setretainimagecontent-manage-memory-usage?
Co-authored-by: Konrad Abicht <hi@inspirito.de>
Tried, but didn't find it useful. What should I do next? It seems to be useful only if I'm sure I will never receive a PDF with large enough objects in it.
We introduced these optimizations with #476 and #441, where #441 is about the images. As far as I understood these PRs, these options allow restricting certain (PHP) functions so that they use fewer resources.
I just wanted to check whether you knew about these options and, if you did, whether you experienced anything problematic.
AFAIK the code was already too complex for me to understand it properly.
If tests are green and OP fixed a memory leak, I'm 👍🏼
Thank you for your work here @se-ti.
Hi!
I ran into an issue: extracting text from ~20 MB PDFs (~300K of text, lots of images) fails with a memory allocation error at a 128M limit.
I looked through the code, found several unbounded substring allocations, and fixed them.
I've also replaced a couple of regexps with somewhat more efficient code.
I hope you'll accept my PR.
You can test my fixes with a PDF like https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf and this code:
The given file required ~90 MB of RAM for the first run and ~45 MB for the best subsequent ones.
After the fixes it requires 45 MB for the first run and just ~25 MB for the best subsequent ones.
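To illustrate what avoiding an unbounded substring allocation can look like, here is a minimal sketch. It is not taken from this PR: the helper names `findFrom` and `matchNumberAt` are hypothetical, and the exact call sites changed in the library are not shown here. The idea is to pass an offset to the search function instead of copying the tail of a large buffer with `substr()` first.

```php
<?php
// Hypothetical helpers for illustration only; not code from this PR.

// Finds $needle at or after $offset without copying the tail of $data.
function findFrom(string $data, string $needle, int $offset): ?int
{
    // Allocating variant to avoid: strpos(substr($data, $offset), $needle)
    // copies everything from $offset to the end before searching.
    $pos = strpos($data, $needle, $offset);

    return false === $pos ? null : $pos;
}

// Matches a run of digits starting exactly at $offset.
function matchNumberAt(string $data, int $offset): ?string
{
    // preg_match() takes an offset as its fifth argument;
    // \G anchors the match at that offset, so no substr() is needed.
    if (1 === preg_match('/\G\d+/', $data, $matches, 0, $offset)) {
        return $matches[0];
    }

    return null;
}
```

For a multi-megabyte stream, each `substr($data, $offset)` call can allocate a copy nearly as large as the input, so offset-based searching may account for part of the kind of savings reported above; that link is an assumption rather than a measurement.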