
Poor Memory Management with Search Indexes (Searchables) #8854

Closed
electroflame opened this issue Oct 16, 2023 · 13 comments · Fixed by #9171

Comments

@electroflame

electroflame commented Oct 16, 2023

Bug description

When you have a large site with many entries (50k+), updating search indexes can easily lead to running out of memory.

This is largely due to the use of searchables()->all(), which loads every entry in an index into memory before any of it can be released. I initially broke collections up into smaller ones (which helps with the Stache), but the search index will always fail: loading all of the searchables for the index (i.e. all collections tied to that index) before letting the garbage collector at them is a poor use of memory.

I'm using the Meilisearch addon, which also makes use of this call, but I've modified it slightly to fix this issue.

Changing this:

// Prepare documents for update
$searchables = $this->searchables()->all()->map(function ($entry) {
    return array_merge(
        $this->searchables()->fields($entry),
        $this->getDefaultFields($entry),
    );
});

// Update documents
$documents = new Documents($searchables);
$this->insertDocuments($documents);

To this:

$searchableProvidersReflection = new \ReflectionProperty(get_class($this->searchables()), 'providers');
$searchableProvidersReflection->setAccessible(true);

// Get the collection containing the provider data.
$providers = $searchableProvidersReflection->getValue($this->searchables());

// Now get the underlying collection of entries.
$collection = $providers['collection'];

// Now get the keys, which should be the array of searchable collections.
$collectionKeysReflection = new \ReflectionProperty(get_class($collection), 'keys');
$collectionKeysReflection->setAccessible(true);

$keys = $collectionKeysReflection->getValue($collection);

foreach ($keys as $key) {
    $coll = $providers['collection'];
    $this->updateProviderIndex($coll, $key);

    unset($coll);

    echo 'Indexed '.$key.PHP_EOL;
    sleep(1);
    gc_collect_cycles();
}

With updateProviderIndex looking like this:

private function updateProviderIndex($collection, $key)
{
    // Restrict the provider to a single searchable collection, then
    // build and insert only that collection's documents.
    $collection->setKeys([$key]);

    $entries = $collection->provide()->map(function ($entry) {
        return array_merge(
            $this->searchables()->fields($entry),
            $this->getDefaultFields($entry),
        );
    });

    $documents = new Documents($entries);
    $this->insertDocuments($documents);

    // Drop references so the garbage collector can reclaim this
    // collection's memory before the next one is indexed.
    unset($entries, $documents);
}

As you can probably tell, PHP is not my forte. I used reflection to get at the underlying data structures and index them one by one instead of loading everything into memory. It's a very wrong and ugly approach (I didn't want to edit core files, so I had to work from the outside), but it works and fixes the issue. I'm not sure how the actual fix should be handled, but something like this ensures memory usage only climbs as high as your largest collection (so, if needed, you can split collections into smaller ones and add them to the same search index).

How to reproduce

  • Install Statamic
  • Generate a bunch of entries (thousands)
  • Stick them into one or multiple collections
  • Slap those collections into a search index
  • Run search:update
  • Watch memory utilization steadily increase until all of the collections have been indexed, instead of rising and falling as each collection is indexed separately

Logs

No response

Environment

(This is my local machine, but it happens locally and on a live site)
Environment
Laravel Version: 9.52.8
PHP Version: 8.1.9
Composer Version: 2.3.10
Environment: local
Maintenance Mode: OFF

Config: NOT CACHED
Events: NOT CACHED
Routes: NOT CACHED
Views: CACHED

Broadcasting: log
Cache: redis
Database: mysql
Logs: stack / single
Mail: smtp
Queue: redis
Session: file

Antlers: runtime
Stache Watcher: Disabled
Static Caching: Disabled
Version: 4.20.0

Installation

Fresh statamic/statamic site via CLI

Antlers Parser

runtime (new)

Additional details

No response

@electroflame
Author

As an aside, I added the index logging for debugging so I could see the progress when the search index was being generated and really appreciated it. By indexing separately it's possible to log when a collection inside of an index has been successfully added to the index, which is really nice visual feedback (and gives you an idea of how far along the indexing is).

@jasonvarga
Member

Thank you for this detailed explanation. 👍

@ryanmitchell
Contributor

@electroflame do the changes in #9072, available since 4.37.0 help your use case at all?

@electroflame
Author

@ryanmitchell Let me check. I'll try to get into this next week since I'll have to play around with the Meilisearch addon to verify.

Thanks for what you've done so far though!

@electroflame
Author

Hey @ryanmitchell, I took a look and the code looks great, but when trying to adapt it with the Meilisearch addon I'm getting poor results (which might be related to me doing something wrong). Meilisearch doesn't handle updating the index the same as the Algolia addon, so it's not a one-to-one conversion. Anything special I should be doing?

@ryanmitchell
Contributor

After a bit of chat on Discord it seems the effect of this has been minimal. The next obvious step would be to allow the searchable documents to be returned lazily.
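Lazy here means yielding one document at a time instead of materializing the whole set up front. A minimal plain-PHP sketch of the idea, not Statamic's actual API (provideAll and provideLazily are hypothetical names):

```php
<?php

// Eager: builds the full array of documents in memory at once,
// which is what searchables()->all() effectively does.
function provideAll(array $ids): array
{
    $docs = [];
    foreach ($ids as $id) {
        $docs[] = ['id' => $id, 'title' => "Entry {$id}"];
    }
    return $docs;
}

// Lazy: a generator yields one document at a time, so peak memory
// stays at roughly one document regardless of how many entries exist.
function provideLazily(array $ids): \Generator
{
    foreach ($ids as $id) {
        yield ['id' => $id, 'title' => "Entry {$id}"];
    }
}

// A consumer iterates either the same way, but only the lazy version
// lets already-indexed documents be garbage collected as it goes.
foreach (provideLazily([1, 2, 3]) as $doc) {
    echo $doc['title'], PHP_EOL;
}
```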

@ryanmitchell
Contributor

ryanmitchell commented Dec 11, 2023

@electroflame I've opened a PR here to provide the entries and terms lazily. Do you want to pull in the changes using a composer patch and see if that gets you close to the performance you were seeing through your reflection method?

@electroflame
Author

@ryanmitchell I'll give it a shot soon -- sometime this week probably. Thanks for working on this!

@ryanmitchell
Contributor

ryanmitchell commented Dec 11, 2023

@electroflame I've added a PR to meilisearch too, to make things easier for you: statamic-rad-pack/meilisearch#31

@electroflame
Author

Thanks for the hard work @ryanmitchell!

I think this might've solved the memory usage issue. This basically reduces the memory usage down to a negligible amount, and it seems to stay right around there the whole time (i.e. no steadily increasing memory creep like before). This is great -- a huge deal considering where memory usage started.

The downside is that it absolutely tanks performance (i.e. about 45-55 minutes to update search indexes when it usually takes less than two minutes).

That seems to be down to the difference in how the indexing is handled, though. My reflection method essentially chunks it per-collection (provider, etc.) so, in the case of Meilisearch, single documents aren't being sent. What you've got right now is peak memory efficiency (which rocks!), but you lose speed (at least with Meilisearch, although I'd imagine any search index where you have to send data is going to bottleneck in a similar fashion). I'd imagine most installations won't be bothered too much, but in my (admittedly extreme) case, the speed penalty is pretty brutal.
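The trade-off described above — per-document inserts are memory-light but slow, whole-index inserts are fast but memory-heavy — is exactly what batching splits. A plain-PHP sketch of the middle ground (insertInChunks and the $insert callable are illustrative names, not the addon's API):

```php
<?php

// Insert documents in fixed-size batches: each network round trip
// carries up to $chunkSize documents, and only one chunk is ever
// held in memory at a time.
function insertInChunks(iterable $documents, int $chunkSize, callable $insert): int
{
    $chunk = [];
    $batches = 0;

    foreach ($documents as $doc) {
        $chunk[] = $doc;
        if (count($chunk) === $chunkSize) {
            $insert($chunk); // one request per full chunk
            $chunk = [];
            $batches++;
        }
    }

    if ($chunk !== []) { // flush the final partial chunk
        $insert($chunk);
        $batches++;
    }

    return $batches;
}
```

With 250 documents and a chunk size of 100, this makes three requests instead of 250 single-document ones (or one enormous one).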

As an aside, it might be useful to expose the providers and their collections via a public method (getProviders() and then getCollection, etc.). It wouldn't necessarily help with memory for this, but it would allow me (and others like me) to get at the underlying collections and process them without resorting to reflection (so it'd be safer long-term).
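A sketch of what such an accessor could look like (getProviders() is a proposed name, not an existing Statamic method, and this stripped-down Searchables class is only for illustration):

```php
<?php

// Hypothetical: expose read access to the providers a Searchables
// instance was built with, so callers don't need ReflectionProperty.
class Searchables
{
    public function __construct(private array $providers) {}

    // Proposed public accessor: returns the provider map, keyed by
    // provider handle (e.g. 'collection').
    public function getProviders(): array
    {
        return $this->providers;
    }
}

$searchables = new Searchables(['collection' => ['blog', 'pages']]);

// Callers can now iterate providers and index each one separately.
foreach ($searchables->getProviders() as $name => $provider) {
    echo $name, ': ', implode(', ', $provider), PHP_EOL;
}
```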

@ryanmitchell
Contributor

Ok - I've updated the chunk size to 100 and reconfigured the Meilisearch PR to insert in chunks instead of one by one: statamic-rad-pack/meilisearch#31

Let me know if that's any better for you.

@electroflame
Author

Just wanted to update that this PR and the linked one for Meilisearch work great. After a bit of Discord-debugging @ryanmitchell was able to track down the few lingering issues I found and they're included in the now-merged PRs.

The upshot is that the speed is roughly on par with my reflection method, while the memory usage is even lower: after everything's said and done I'm seeing about 10% of the old default, and about 40% of my reflection method.

It can't be overstated -- this is some great work by Ryan that should help out a lot.

Thanks again, Ryan!
