Improve performance by relying on a native string instead of InputStream #146
I just pushed an additional performance gain achieved by removing some calls to the […]. I updated the script to run 1000 iterations in order to get more precise results. Here are the results for the whole PR now:
Before:
After removing the InputStream system:
After (whole PR):
@tgalopin while reducing parsing time is very good, did you measure memory usage too? I think streams were used in order to reduce the memory footprint.
Thanks for the feedback :) ! InputStream instances were not actually streams, they were just wrappers around a plain string :) . The memory footprint is even lower in the new version (1.25 MB to 1.17 MB, as you can see in the Blackfire profiles), probably because of the removal of the objects :) . I do think though that using real streams may be a great idea for further improvements. I'll perhaps have a look if I find out how to do it properly :) .
Hi! It will take me some time to review it and to try it, but it is definitely a 👍
@mattfarina @technosophos Anything against the general approach in this PR?
Force-pushed from 9999951 to 321ed96.
I removed the tokenizer improvements from this PR to focus on the InputStream here. I'm still working on the tokenizer, I'll open a new PR once ready. Note that the two improvements are fully independent, no need to wait for the other if we want to merge this one :) .
Here it is: #147 :) !
Thanks a lot for this! Results on my laptop:
Great work!
When I used the library in the context of https://github.com/tgalopin/html-sanitizer, I ran into speed issues. I ran a Blackfire analysis and saw that they mainly came from this library.
I started to investigate and quickly found out that the architecture with an InputStream and a Scanner was the biggest part of the problem: calling 3 methods for each character of the input slowed the parsing process down a lot.
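To make the per-character overhead concrete, here is a minimal sketch comparing the two access patterns. `WrappedInput` and both counting functions are hypothetical stand-ins written for illustration; they are not the library's actual classes or method names.

```php
<?php
// Hypothetical wrapper: every character access goes through method calls.
final class WrappedInput
{
    private $data;
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function valid(): bool
    {
        return $this->pos < strlen($this->data);
    }

    public function current(): string
    {
        return $this->data[$this->pos];
    }

    public function next(): void
    {
        ++$this->pos;
    }
}

// Wrapper-based loop: three method calls per input character.
function countTagOpenersViaWrapper(string $html): int
{
    $input = new WrappedInput($html);
    $count = 0;
    while ($input->valid()) {            // call 1
        if ($input->current() === '<') { // call 2
            ++$count;
        }
        $input->next();                  // call 3
    }
    return $count;
}

// Native-string loop: plain offset access, no method calls.
function countTagOpenersNative(string $html): int
{
    $count = 0;
    $length = strlen($html);
    for ($i = 0; $i < $length; ++$i) {
        if ($html[$i] === '<') {
            ++$count;
        }
    }
    return $count;
}
```

Both functions return the same result; only the cost per character differs, which is what adds up on large documents.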
In addition to that, I realized that the InputStream architecture, while being very extensible, is not used in the library itself: the FileStreamInput is a plain StringInput with a `file_get_contents`, and the StringInput is a simple wrapper around a native string.
Finally, I think it is unlikely that anyone using this library does so in a way where they can't transform their input into a string.
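As a small illustration of why a file-backed input reduces to a string input, the sketch below reads a file into a native string before any parsing happens. `readHtmlFile` is a hypothetical helper name, not part of the library.

```php
<?php
// Hypothetical helper: load a file into a native string up front,
// so the parser only ever has to deal with a plain string.
function readHtmlFile(string $path): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new RuntimeException("Unable to read file: $path");
    }

    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Unable to read file: $path");
    }

    return $html;
}
```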
Thus, I would like to propose in this pull request to simplify the library's critical execution path by deprecating (and removing in 3.0) the InputStream architecture and relying on a native string directly in the Scanner instead.
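A rough sketch of what a scanner working directly on a native string could look like is shown below. `NativeStringScanner` and its method names are assumptions made for illustration and do not reproduce the code in this PR; the point is that PHP's built-in `strspn` and `strcspn` can consume whole runs of characters in one call instead of advancing one character at a time.

```php
<?php
// Hypothetical scanner that keeps the input as a native string plus an offset.
final class NativeStringScanner
{
    private $data;
    private $length;
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
        $this->length = strlen($data);
    }

    // Return the current byte, or false at the end of the input.
    public function current()
    {
        return $this->pos < $this->length ? $this->data[$this->pos] : false;
    }

    // Advance by one byte.
    public function next(): void
    {
        ++$this->pos;
    }

    // Consume and return the run of bytes that are all in $mask.
    public function charsWhile(string $mask): string
    {
        $len = strspn($this->data, $mask, $this->pos);
        $run = (string) substr($this->data, $this->pos, $len);
        $this->pos += $len;
        return $run;
    }

    // Consume and return the run of bytes up to (not including) any byte in $mask.
    public function charsUntil(string $mask): string
    {
        $len = strcspn($this->data, $mask, $this->pos);
        $run = (string) substr($this->data, $this->pos, $len);
        $this->pos += $len;
        return $run;
    }
}

// Usage: read a tag name and its attribute text from a small fragment.
$scanner = new NativeStringScanner('<div class="a">');
$scanner->next();                                               // skip '<'
$tagName = $scanner->charsWhile('abcdefghijklmnopqrstuvwxyz');  // "div"
$attributes = $scanner->charsUntil('>');                        // ' class="a"'
```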
This implementation shows great speed improvements:
Test script
Simple time measurement
Before:
After:
Blackfire analysis
I am fully aware this could be a somewhat controversial PR, but I really think it would be a great improvement for this library, especially in the context of Drupal. In many cases, we can expect a 22% speed improvement, which is worth the removal of InputStream in my opinion :) !