Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce excessive substring allocation, using preg_match's offset parameter #595

Merged
merged 4 commits into from
May 10, 2023

Conversation

se-ti
Copy link
Contributor

@se-ti se-ti commented Apr 14, 2023

This PR proposes to use preg_match's offset parameter instead of preparing special string for it.
This change requires minor changes of regexps: "match start of string" meta-character ^ should be replaced with \G escape sequence https://www.php.net/manual/en/regexp.reference.escape.php

This PR slightly improves memory allocation, achieved with pulls #582, #590, but would be useful for all PDF's and not only image-rich ones.

This PR shows
42M for the first launch, 18M for the best subsequent vs
43M for the first launch, 18M for the best subsequent for pull 590

https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf

$lim = '128M';
ini_set("memory_limit", $lim);
include 'alt_autoload.php-dist';

$path = 'danilina-1el2-arhyz-2021.pdf';

$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$parser = new \Smalot\PdfParser\Parser([], $config);

$content = file_get_contents($path);
$pdf = $parser->parseContent($content);

$text = $pdf->getText();

echo ini_get("memory_limit")."\n";
echo $text;

@k00ni
Copy link
Collaborator

k00ni commented Apr 17, 2023

Wow another PR already. Thanks for investing the time. I will try to get back to you asap!

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@se-ti In the following example you replaced the part begin of the string (^) with first appearance (\G).

/^\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
/\G\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si

You altered the scope of the regex, because they can produce different results. As of now, it always checks directly at the beginning of the string and if nothing is to be found, the if-clause is dismissed.

My question is, is this kind of information you check with the regex always at the beginning of the string anyway? Can there multiple instances of the this kind of information in the string at question?

@se-ti
Copy link
Contributor Author

se-ti commented Apr 25, 2023

Sorry, i don't understand You. :(
As of now, it always checks directly at the beginning of the string...
It was a /^\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
It DID check (see ^) just very beginning of the string.

I have replaced it with \G, which is "first matching position in subject". And matching position is changed because of the $offset parameter.

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match. It differs from \A when the value of offset is non-zero.
https://www.php.net/manual/en/regexp.reference.escape.php

Multiple instances are not the issue, because of regexp's greedyness -- see .* in the end.
And this code definitly works not with the whole document, but with it's small part, substr-ed at higher level.

@se-ti
Copy link
Contributor Author

se-ti commented Apr 28, 2023

Have i answered your question, @k00ni?
Do i miss something?

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanna make sure we don't change parser behavior by accident.

My question about \G was due to the fact that it behaves differently if $offset is set. Quote:

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

And about \A it says:

start of subject (independent of multiline mode)

After reading the docs again I am for merging this.


Offtopic: We have a few "performance" tests in https://github.com/smalot/pdfparser/tree/master/tests/Performance. Their purpose is to check time or RAM usage when processing some more-or-less heavy data. You wrote that you did some performance tests yourself. Do you think we should integrate them?

@se-ti
Copy link
Contributor Author

se-ti commented May 1, 2023

My performance tests were pretty naive: just launching code from above with different memory limits, testing whether it would fail or not.

@se-ti
Copy link
Contributor Author

se-ti commented May 1, 2023

Looked at existing performance tests. They are about time performance.
I'd think, but right now i do not see reliable way to test memory performance.

One of reasons are these results "X Mb for the first launch, X/2.5 for the best subsequent one".
I hope, that was OpCache warm up, but won't bet it. :(

@k00ni k00ni merged commit fafe2f2 into smalot:master May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants