-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce excessive substring allocation, using preg_match's offset parameter #595
Conversation
Wow another PR already. Thanks for investing the time. I will try to get back to you asap! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@se-ti In the following example you replaced the part begin of the string
(^
) with first appearance
(\G
).
/^\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
/\G\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
You altered the scope of the regex, because they can produce different results. As of now, it always checks directly at the beginning of the string and if nothing is to be found, the if-clause is dismissed.
My question is, is this kind of information you check with the regex always at the beginning of the string anyway? Can there multiple instances of the this kind of information in the string at question?
Sorry, i don't understand You. :( I have replaced it with \G, which is "first matching position in subject". And matching position is changed because of the $offset parameter.
Multiple instances are not the issue, because of regexp's greedyness -- see .* in the end. |
Have i answered your question, @k00ni? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanna make sure we don't change parser behavior by accident.
My question about \G
was due to the fact that it behaves differently if $offset
is set. Quote:
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.
And about \A
it says:
start of subject (independent of multiline mode)
After reading the docs again I am for merging this.
Offtopic: We have a few "performance" tests in https://github.com/smalot/pdfparser/tree/master/tests/Performance. Their purpose is to check time or RAM usage when processing some more-or-less heavy data. You wrote that you did some performance tests yourself. Do you think we should integrate them?
My performance tests were pretty naive: just launching code from above with different memory limits, testing whether it would fail or not. |
Looked at existing performance tests. They are about time performance. One of reasons are these results "X Mb for the first launch, X/2.5 for the best subsequent one". |
This PR proposes to use preg_match's offset parameter instead of preparing special string for it.
This change requires minor changes of regexps: "match start of string" meta-character ^ should be replaced with \G escape sequence https://www.php.net/manual/en/regexp.reference.escape.php
This PR slightly improves memory allocation, achieved with pulls #582, #590, but would be useful for all PDF's and not only image-rich ones.
This PR shows
42M for the first launch, 18M for the best subsequent vs
43M for the first launch, 18M for the best subsequent for pull 590
https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf