
Poor Memory Management with Search Indexes (Searchables) #8854

Closed
electroflame opened this issue Oct 16, 2023 · 13 comments · Fixed by #9171

Comments

@electroflame

electroflame commented Oct 16, 2023

Bug description

When you have a large site with many entries (50k+), updating search indexes can easily lead to running out of memory.

This is largely due to the use of searchables()->all(), which loads every entry in an index into memory before any of it can be released. I initially broke collections up into smaller ones (which helps with the Stache), but the search index will always fail: loading all of the searchables for the index (i.e. all collections tied to that index) before letting the garbage collector at them is a poor use of memory.

I'm using the Meilisearch addon, which also makes use of this call, but I've modified it slightly to fix this issue.

Changing this:

// Prepare documents for update
$searchables = $this->searchables()->all()->map(function ($entry) {
    return array_merge(
        $this->searchables()->fields($entry),
        $this->getDefaultFields($entry),
    );
});

// Update documents
$documents = new Documents($searchables);
$this->insertDocuments($documents);

To this:

$searchableProvidersReflection = new \ReflectionProperty(get_class($this->searchables()), 'providers');
$searchableProvidersReflection->setAccessible(true);

// Get the collection containing the provider data.
$providers = $searchableProvidersReflection->getValue($this->searchables());

// Now get the underlying collection of entries.
$collection = $providers['collection'];

// Now get the keys, which should be the array of searchable collections.
$collectionKeysReflection = new \ReflectionProperty(get_class($collection), 'keys');
$collectionKeysReflection->setAccessible(true);

$keys = $collectionKeysReflection->getValue($collection);

foreach ($keys as $key) {
    $coll = $providers['collection'];
    $this->updateProviderIndex($coll, $key);

    unset($coll);

    echo 'Indexed '.$key.PHP_EOL;
    sleep(1);
    gc_collect_cycles();
}

With updateProviderIndex looking like this:

private function updateProviderIndex($collection, $key)
{
    // Restrict the provider to a single searchable collection, then
    // build and insert only that collection's documents.
    $collection->setKeys([$key]);

    $entries = $collection->provide()->map(function ($entry) {
        return array_merge(
            $this->searchables()->fields($entry),
            $this->getDefaultFields($entry),
        );
    });

    $documents = new Documents($entries);
    $this->insertDocuments($documents);

    // Drop references so the garbage collector can reclaim this
    // collection's memory before the next one is indexed.
    unset($entries, $documents);
}

As you can probably tell, PHP is not my forte. I used reflection to get at the underlying data structures and index them one by one instead of loading everything into memory. It's a very wrong and ugly approach (I didn't want to edit core files, so I had to work from the outside), but it works and fixes the issue. I'm not sure how the actual fix should be handled, but something like this ensures memory usage only climbs as high as your largest collection (so, if needed, you can split collections into smaller ones and add them to the same search index).

How to reproduce

  • Install Statamic
  • Generate a bunch of entries (thousands)
  • Stick them into one or multiple collections
  • Slap those collections into a search index
  • Run search:update
  • Watch memory utilization steadily increase until all of the collections have been indexed, instead of rising and falling as each collection is indexed separately

Logs

No response

Environment

(This is my local machine, but it happens locally and on a live site)
Environment
Laravel Version: 9.52.8
PHP Version: 8.1.9
Composer Version: 2.3.10
Environment: local
Maintenance Mode: OFF

Config: NOT CACHED
Events: NOT CACHED
Routes: NOT CACHED
Views: CACHED

Broadcasting: log
Cache: redis
Database: mysql
Logs: stack / single
Mail: smtp
Queue: redis
Session: file

Antlers: runtime
Stache Watcher: Disabled
Static Caching: Disabled
Version: 4.20.0

Installation

Fresh statamic/statamic site via CLI

Antlers Parser

runtime (new)

Additional details

No response

@electroflame
Author

As an aside, I added the index logging for debugging so I could see the progress when the search index was being generated and really appreciated it. By indexing separately it's possible to log when a collection inside of an index has been successfully added to the index, which is really nice visual feedback (and gives you an idea of how far along the indexing is).

@jasonvarga
Member

Thank you for this detailed explanation. 👍

@ryanmitchell
Contributor

@electroflame do the changes in #9072, available since 4.37.0 help your use case at all?

@electroflame
Author

@ryanmitchell Let me check. I'll try to get into this next week since I'll have to play around with the Meilisearch addon to verify.

Thanks for what you've done so far though!

@electroflame
Author

Hey @ryanmitchell, I took a look and the code looks great, but when trying to adapt it with the Meilisearch addon I'm getting poor results (which might be related to me doing something wrong). Meilisearch doesn't handle updating the index the same as the Algolia addon, so it's not a one-to-one conversion. Anything special I should be doing?

@ryanmitchell
Contributor

After a bit of chat on Discord it seems the effect of this has been minimal. The next obvious step would be to allow the searchable documents to be returned lazily.
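Lazy here means yielding one document at a time instead of materializing the whole set up front. A minimal plain-PHP sketch of the idea, not Statamic's actual API (provideAll and provideLazily are hypothetical names):

```php
<?php

// Eager: builds the full array of documents in memory at once,
// which is what searchables()->all() effectively does.
function provideAll(array $ids): array
{
    $docs = [];
    foreach ($ids as $id) {
        $docs[] = ['id' => $id, 'title' => "Entry {$id}"];
    }
    return $docs;
}

// Lazy: a generator yields one document at a time, so peak memory
// stays at roughly one document regardless of how many entries exist.
function provideLazily(array $ids): \Generator
{
    foreach ($ids as $id) {
        yield ['id' => $id, 'title' => "Entry {$id}"];
    }
}

// A consumer iterates either the same way, but only the lazy version
// lets already-indexed documents be garbage collected as it goes.
foreach (provideLazily([1, 2, 3]) as $doc) {
    echo $doc['title'], PHP_EOL;
}
```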

@ryanmitchell
Contributor

ryanmitchell commented Dec 11, 2023

@electroflame I've opened a PR here to provide the entries and terms lazily. Do you want to pull in the changes using a composer patch and see if that gets you close to the performance you were seeing through your reflection method?

@electroflame
Author

@ryanmitchell I'll give it a shot soon -- sometime this week probably. Thanks for working on this!

@ryanmitchell
Contributor

ryanmitchell commented Dec 11, 2023

@electroflame I've added a PR to meilisearch too, to make things easier for you: statamic-rad-pack/meilisearch#31

@electroflame
Author

Thanks for the hard work @ryanmitchell!

I think this might've solved the memory usage issue. This basically reduces the memory usage down to a negligible amount, and it seems to stay right around there the whole time (i.e. no steadily increasing memory creep like before). This is great -- a huge deal considering where memory usage started.

The downside is that it absolutely tanks performance (i.e. about 45-55 minutes to update search indexes when it usually takes less than two minutes).

That seems to be down to the difference in how the indexing is handled, though. My reflection method essentially chunks it per-collection (provider, etc.) so, in the case of Meilisearch, single documents aren't being sent. What you've got right now is peak memory efficiency (which rocks!), but you lose speed (at least with Meilisearch, although I'd imagine any search index where you have to send data is going to bottleneck in a similar fashion). I'd imagine most installations won't be bothered too much, but in my (admittedly extreme) case, the speed penalty is pretty brutal.
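The trade-off described above — per-document inserts are memory-light but slow, whole-index inserts are fast but memory-heavy — is exactly what batching splits. A plain-PHP sketch of the middle ground (insertInChunks and the $insert callable are illustrative names, not the addon's API):

```php
<?php

// Insert documents in fixed-size batches: each network round trip
// carries up to $chunkSize documents, and only one chunk is ever
// held in memory at a time.
function insertInChunks(iterable $documents, int $chunkSize, callable $insert): int
{
    $chunk = [];
    $batches = 0;

    foreach ($documents as $doc) {
        $chunk[] = $doc;
        if (count($chunk) === $chunkSize) {
            $insert($chunk); // one request per full chunk
            $chunk = [];
            $batches++;
        }
    }

    if ($chunk !== []) { // flush the final partial chunk
        $insert($chunk);
        $batches++;
    }

    return $batches;
}
```

With 250 documents and a chunk size of 100, this makes three requests instead of 250 single-document ones (or one enormous one).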

As an aside, it might be useful to expose the providers and their collections via a public method (getProviders() and then getCollection, etc.). It wouldn't necessarily help with memory for this, but it would allow me (and others like me) to get at the underlying collections and process them without resorting to reflection (so it'd be safer long-term).
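A sketch of what such an accessor could look like (getProviders() is a proposed name, not an existing Statamic method, and this stripped-down Searchables class is only for illustration):

```php
<?php

// Hypothetical: expose read access to the providers a Searchables
// instance was built with, so callers don't need ReflectionProperty.
class Searchables
{
    public function __construct(private array $providers) {}

    // Proposed public accessor: returns the provider map, keyed by
    // provider handle (e.g. 'collection').
    public function getProviders(): array
    {
        return $this->providers;
    }
}

$searchables = new Searchables(['collection' => ['blog', 'pages']]);

// Callers can now iterate providers and index each one separately.
foreach ($searchables->getProviders() as $name => $provider) {
    echo $name, ': ', implode(', ', $provider), PHP_EOL;
}
```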

@ryanmitchell
Contributor

Ok - I've updated the chunk size to 100 and reconfigured the Meilisearch PR to insert in chunks instead of one by one: statamic-rad-pack/meilisearch#31

Let me know if that's any better for you.

@electroflame
Author

Just wanted to update that this PR and the linked one for Meilisearch work great. After a bit of Discord-debugging @ryanmitchell was able to track down the few lingering issues I found and they're included in the now-merged PRs.

The upshot is that the speed is roughly on par with my reflection method, while the memory usage is even lower: after everything's said and done I'm seeing about 10% of the old default, and about 40% of my reflection method.

It can't be overstated -- this is some great work by Ryan that should help out a lot.

Thanks again, Ryan!
