-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mitigate memory leak #188
Mitigate memory leak #188
Conversation
Since `Traverser` and `OutputRules` classes have a cyclic reference, they can not be removed from the memory by PHP directly. Unless the GC does its job. And theses classes holds a reference (stream) to the HTML and it can be very huge (> 1Mb). So if one uses this lib to parse lots of HTML, the leak lead to a **massive** use of memory (>2Gb in less than 10s with my current workload). By "manually" removing the cyclic reference, the GC is not involved anymore. PHP can "normally" free the two instances, and so the reference to the HTML (stream) is also cleared. Reproducer (with symfony/dom-crawler): ```php $content = file_get_contents('https://www.php.net/'); $count = $argv[1] ?? 151; $s = microtime(true); for ($i = 0; $i < $count; $i++) { $crawler = new Crawler($content); $nodes = $crawler->filterXPath('descendant-or-self::head/descendant-or-self::*/title'); $nodes->each(static function ($node): void { $node->html(); // The faulty line }); if (0 == $i % 10) { preg_match('/^VmRSS:\s(.*)/m', file_get_contents('/proc/self/status'), $m); printf("%03d - %.2fMb - %s - %.3fs\n", $i, memory_get_usage(true) / 1024 / 1024, trim($m[1]), microtime(true) - $s); } } ``` Results without the patch: ``` 000 - 4.00Mb - 38052 kB - 0.085s 010 - 4.00Mb - 45732 kB - 0.513s 020 - 4.00Mb - 53124 kB - 0.941s 030 - 4.00Mb - 60780 kB - 1.370s 040 - 4.00Mb - 68436 kB - 1.798s 050 - 4.00Mb - 76092 kB - 2.224s 060 - 4.00Mb - 83484 kB - 2.659s 070 - 4.00Mb - 91140 kB - 3.086s 080 - 4.00Mb - 98796 kB - 3.514s 090 - 4.00Mb - 106452 kB - 3.944s 100 - 4.00Mb - 113844 kB - 4.384s 110 - 4.00Mb - 121500 kB - 4.816s 120 - 4.00Mb - 129156 kB - 5.246s 130 - 4.00Mb - 136812 kB - 5.680s 140 - 4.00Mb - 144468 kB - 6.114s 150 - 4.00Mb - 151860 kB - 6.550s ``` Results with the patch: ``` 000 - 4.00Mb - 38020 kB - 0.086s 010 - 4.00Mb - 38868 kB - 0.551s 020 - 4.00Mb - 38868 kB - 0.987s 030 - 4.00Mb - 38868 kB - 1.422s 040 - 4.00Mb - 38868 kB - 1.860s 050 - 4.00Mb - 38868 kB - 2.295s 060 - 4.00Mb - 38868 kB - 2.734s 070 - 4.00Mb - 38868 kB - 3.167s 080 - 4.00Mb - 38868 kB - 3.603s 090 - 4.00Mb - 38868 kB - 4.040s 100 - 4.00Mb - 38868 kB - 4.476s 110 - 4.00Mb - 38868 kB - 4.919s 120 - 4.00Mb - 38868 kB - 5.373s 130 - 4.00Mb - 38868 kB - 5.820s 140 - 4.00Mb - 38868 kB - 6.270s 150 - 4.00Mb - 38868 kB - 6.715s ```
I believe that this is not a memory leak. If you call As per doc (https://www.php.net/manual/en/features.gc.collecting-cycles.php):
With your approach, you remove manually the cycle, in this way PHP is able remove the object immediately without waiting for I think that this teaches us to call To be hones I'm happy with your solution, but I would ask you to add a comment on why that call is being done (and that by design is not needed). |
I think a clearer solution would be to leave the |
That was also my first idea, but that will be a BC break since adding a method to an interface will break compatibility for sure. |
@goetas calling |
What do you think about the approach in #190, is substantially the same but using a separate method in the concrete class... |
Closing in favor of #190 |
Since
Traverser
andOutputRules
classes have a cyclic reference,they can not be removed from the memory by PHP directly. Unless the GC
does its job. And theses classes holds a reference (stream) to the HTML
and it can be very huge (> 1Mb).
So if one uses this lib to parse lots of HTML, the leak lead to a
massive use of memory (>2Gb in less than 10s with my current
workload).
By "manually" removing the cyclic reference, the GC is not involved
anymore. PHP can "normally" free the two instances, and so the reference
to the HTML (stream) is also cleared.
Reproducer (with symfony/dom-crawler):
Results without the patch:
Results with the patch:
the full story: https://jolicode.com/blog/a-journey-to-find-a-memory-leak