Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate memory leak #188

Closed
wants to merge 1 commit into from
Closed

Mitigate memory leak #188

wants to merge 1 commit into from

Conversation

lyrixx
Copy link
Contributor

@lyrixx lyrixx commented Jul 1, 2020

Since Traverser and OutputRules classes have a cyclic reference,
they can not be removed from the memory by PHP directly. Unless the GC
does its job. And theses classes holds a reference (stream) to the HTML
and it can be very huge (> 1Mb).

So if one uses this lib to parse lots of HTML, the leak lead to a
massive use of memory (>2Gb in less than 10s with my current
workload).

By "manually" removing the cyclic reference, the GC is not involved
anymore. PHP can "normally" free the two instances, and so the reference
to the HTML (stream) is also cleared.

Reproducer (with symfony/dom-crawler):

$content = file_get_contents('https://www.php.net/');
$count = $argv[1] ?? 151;

$s = microtime(true);
for ($i = 0; $i < $count; $i++) {
    $crawler = new Crawler($content);
    $nodes = $crawler->filterXPath('descendant-or-self::head/descendant-or-self::*/title');

    $nodes->each(static function ($node): void {
        $node->html(); // The faulty line
    });

    if (0 == $i % 10) {
        preg_match('/^VmRSS:\s(.*)/m', file_get_contents('/proc/self/status'), $m);
        printf("%03d - %.2fMb - %s - %.3fs\n", $i, memory_get_usage(true) / 1024 / 1024, trim($m[1]), microtime(true) - $s);
    }
}

Results without the patch:

000 - 4.00Mb - 38052 kB - 0.085s
010 - 4.00Mb - 45732 kB - 0.513s
020 - 4.00Mb - 53124 kB - 0.941s
030 - 4.00Mb - 60780 kB - 1.370s
040 - 4.00Mb - 68436 kB - 1.798s
050 - 4.00Mb - 76092 kB - 2.224s
060 - 4.00Mb - 83484 kB - 2.659s
070 - 4.00Mb - 91140 kB - 3.086s
080 - 4.00Mb - 98796 kB - 3.514s
090 - 4.00Mb - 106452 kB - 3.944s
100 - 4.00Mb - 113844 kB - 4.384s
110 - 4.00Mb - 121500 kB - 4.816s
120 - 4.00Mb - 129156 kB - 5.246s
130 - 4.00Mb - 136812 kB - 5.680s
140 - 4.00Mb - 144468 kB - 6.114s
150 - 4.00Mb - 151860 kB - 6.550s

Results with the patch:

000 - 4.00Mb - 38020 kB - 0.086s
010 - 4.00Mb - 38868 kB - 0.551s
020 - 4.00Mb - 38868 kB - 0.987s
030 - 4.00Mb - 38868 kB - 1.422s
040 - 4.00Mb - 38868 kB - 1.860s
050 - 4.00Mb - 38868 kB - 2.295s
060 - 4.00Mb - 38868 kB - 2.734s
070 - 4.00Mb - 38868 kB - 3.167s
080 - 4.00Mb - 38868 kB - 3.603s
090 - 4.00Mb - 38868 kB - 4.040s
100 - 4.00Mb - 38868 kB - 4.476s
110 - 4.00Mb - 38868 kB - 4.919s
120 - 4.00Mb - 38868 kB - 5.373s
130 - 4.00Mb - 38868 kB - 5.820s
140 - 4.00Mb - 38868 kB - 6.270s
150 - 4.00Mb - 38868 kB - 6.715s

the full story: https://jolicode.com/blog/a-journey-to-find-a-memory-leak

Since `Traverser` and `OutputRules` classes  have a cyclic reference,
they can not be removed from the memory by PHP directly. Unless the GC
does its job. And theses classes holds a reference (stream) to the HTML
and it can be very huge (> 1Mb).

So if one uses this lib to parse lots of HTML, the leak lead to a
**massive** use of memory (>2Gb in less than 10s with my current
workload).

By "manually" removing the cyclic reference, the GC is not involved
anymore. PHP can "normally" free the two instances, and so the reference
to the HTML (stream) is also cleared.

Reproducer (with symfony/dom-crawler):

```php
$content = file_get_contents('https://www.php.net/');
$count = $argv[1] ?? 151;

$s = microtime(true);
for ($i = 0; $i < $count; $i++) {
    $crawler = new Crawler($content);
    $nodes =
$crawler->filterXPath('descendant-or-self::head/descendant-or-self::*/title');

    $nodes->each(static function ($node): void {
        $node->html(); // The faulty line
    });

    if (0 == $i % 10) {
        preg_match('/^VmRSS:\s(.*)/m', file_get_contents('/proc/self/status'), $m);
        printf("%03d - %.2fMb - %s - %.3fs\n", $i, memory_get_usage(true) / 1024 / 1024, trim($m[1]), microtime(true) - $s);
    }
}
```

Results without the patch:

```
000 - 4.00Mb - 38052 kB - 0.085s
010 - 4.00Mb - 45732 kB - 0.513s
020 - 4.00Mb - 53124 kB - 0.941s
030 - 4.00Mb - 60780 kB - 1.370s
040 - 4.00Mb - 68436 kB - 1.798s
050 - 4.00Mb - 76092 kB - 2.224s
060 - 4.00Mb - 83484 kB - 2.659s
070 - 4.00Mb - 91140 kB - 3.086s
080 - 4.00Mb - 98796 kB - 3.514s
090 - 4.00Mb - 106452 kB - 3.944s
100 - 4.00Mb - 113844 kB - 4.384s
110 - 4.00Mb - 121500 kB - 4.816s
120 - 4.00Mb - 129156 kB - 5.246s
130 - 4.00Mb - 136812 kB - 5.680s
140 - 4.00Mb - 144468 kB - 6.114s
150 - 4.00Mb - 151860 kB - 6.550s
```

Results with the patch:

```
000 - 4.00Mb - 38020 kB - 0.086s
010 - 4.00Mb - 38868 kB - 0.551s
020 - 4.00Mb - 38868 kB - 0.987s
030 - 4.00Mb - 38868 kB - 1.422s
040 - 4.00Mb - 38868 kB - 1.860s
050 - 4.00Mb - 38868 kB - 2.295s
060 - 4.00Mb - 38868 kB - 2.734s
070 - 4.00Mb - 38868 kB - 3.167s
080 - 4.00Mb - 38868 kB - 3.603s
090 - 4.00Mb - 38868 kB - 4.040s
100 - 4.00Mb - 38868 kB - 4.476s
110 - 4.00Mb - 38868 kB - 4.919s
120 - 4.00Mb - 38868 kB - 5.373s
130 - 4.00Mb - 38868 kB - 5.820s
140 - 4.00Mb - 38868 kB - 6.270s
150 - 4.00Mb - 38868 kB - 6.715s
```
@goetas
Copy link
Member

goetas commented Jul 2, 2020

I believe that this is not a memory leak. If you call gc_collect_cycles() in the loop, the memory usage stays very stable. PHP does not invoke the cycle collector very ofthen.

As per doc (https://www.php.net/manual/en/features.gc.collecting-cycles.php):

When the garbage collector is turned on, the cycle-finding algorithm as described above is executed whenever the root buffer runs full. The root buffer has a fixed size of 10,000 possible roots (although you can alter this by changing the GC_ROOT_BUFFER_MAX_ENTRIES constant in Zend/zend_gc.c in the PHP source code, and re-compiling PHP)

With your approach, you remove manually the cycle, in this way PHP is able remove the object immediately without waiting for gc_collect_cycles() .

I think that this teaches us to call gc_collect_cycles() at least once every iteration in queue workers and other background processes (waiting for PHP to do so can waste a lot of ram).

To be hones I'm happy with your solution, but I would ask you to add a comment on why that call is being done (and that by design is not needed).
Another question, since i'm not an expert on BC breaks, is making a parameter nullable considered a BC break? (even if I think that almost nobody implemented that interface out of this project.

@mundschenk-at
Copy link
Contributor

Another question, since i'm not an expert on BC breaks, is making a parameter nullable considered a BC break? (even if I think that almost nobody implemented that interface out of this project.

I think a clearer solution would be to leave the setTraverser method as-is and add a clearTraverser method that nulls the property. Then there would also be less need for an additional comment (although a note about breaking circular references is still a good idea).

@goetas
Copy link
Member

goetas commented Jul 2, 2020

add a clearTraverser method that nulls the property

That was also my first idea, but that will be a BC break since adding a method to an interface will break compatibility for sure.
Adding the clearTraverser only on the concrete implementation would work....

@stof
Copy link
Contributor

stof commented Jul 2, 2020

@goetas calling gc_collect_cycles all the time would have a performance impact. And here, the cycle involves only 2 objects, which is far from reaching the PHP threshold. But the issue is that this cycle references the stream, which retains lots of memory in the stream. So this is still a memory leak IMO (it is leaked only until the GC runs, but this would require leaking 5000 such cycles to trigger the GC; at which point you have probably exhausted the system memory before that when the HTML is big)

@goetas
Copy link
Member

goetas commented Jul 3, 2020

What do you think about the approach in #190, is substantially the same but using a separate method in the concrete class...

@lyrixx
Copy link
Contributor Author

lyrixx commented Jul 3, 2020

Closing in favor of #190

@lyrixx lyrixx closed this Jul 3, 2020
@lyrixx lyrixx deleted the mem-leak branch July 3, 2020 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants