
Remove orphaned states after deleting or updating documents #117

Merged
19 commits merged into loupe-php:develop on Dec 2, 2024

Conversation

daun
Contributor

@daun daun commented Nov 26, 2024

Try and tackle #112 by removing unused terms from the state set index after deletion of documents.

Currently blocked by #119. Orphaned terms are currently only deleted after deleting documents. We'll need to find a way to remove them after updating documents as well. The failing test in this PR should pass successfully once there is a solution for that.

Drive-by changes, feel free to revert:

  • Reference fixtures from a helper method
  • Add tests for Loupe factory in-memory instance

daun added 6 commits November 26, 2024 19:07
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar
Contributor

Toflar commented Nov 28, 2024

Thanks for working on this! I still think we should also implement this logic when deleting only one or a few documents, not just when deleting all of them. It's going to make deletion slower, but imho it's better to have a correct state than a fast one.

Basically, what would need to happen is: the state set needs to be deleted, and then we have to loop over all terms and call stateSet->index() again. We should try to improve this in the future, but again, I guess it's better to work correctly now than to be efficient. Wdyt?
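A brute-force version of that rebuild could look like the following Python sketch (table names and the index callback are hypothetical stand-ins, not Loupe's actual schema or API):

```python
import sqlite3

def rebuild_state_set(conn: sqlite3.Connection, index_term) -> None:
    """Truncate the stored state set, then re-derive it from all surviving terms."""
    conn.execute("DELETE FROM state_set")  # drop every stored state
    for (term,) in conn.execute("SELECT term FROM terms"):
        index_term(term)  # stands in for $stateSet->index($term)

# Tiny demo with a stand-in indexer that just records which terms it re-indexed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_set (state INTEGER)")
conn.execute("CREATE TABLE terms (term TEXT)")
conn.executemany("INSERT INTO terms VALUES (?)", [("foo",), ("bar",)])
conn.execute("INSERT INTO state_set VALUES (1)")
reindexed = []
rebuild_state_set(conn, reindexed.append)
```

As the follow-up comments note, this is correct but O(all terms) on every deletion, which is exactly the performance concern discussed next.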

@daun
Contributor Author

daun commented Nov 28, 2024

@Toflar Makes sense to go for the correct implementation! So basically one would need to truncate the state_set table whenever any documents are deleted, and then rebuild all states? And then check how to make it performant in a future iteration.

@Toflar
Contributor

Toflar commented Nov 28, 2024

Yeah, I'm still not sure it makes sense. If you have 30k documents and some 400k terms in your database, removing one document would mean that you have to update the state set for 400k terms (minus the ones you deleted - which can be possibly none). That doesn't sound like a valid solution either.

I mean, there is no problem with keeping the state set - it's not causing any false-positives but if you update your index often and contents change often, then you might end up having a huge state set where half of the states are just useless and obsolete. The question is when to get rid of those 🤔

@daun
Contributor Author

daun commented Nov 28, 2024

Fascinating :) I have a feeling it can be done but I'm very probably just missing something about how the algorithm works. Just for comprehension, are my assumptions below true?

  1. The state_set table holds a "compressed" representation of the actual states
  2. There are as many rows as there are known states, let's say 0 to 340 that map to all 340 states in the terms table
  3. The actual long state numbers are stored in state_set.php
  4. This mapping between compressed number and state is an optimization layer
  5. This mapping between compressed number and state (the array from the php file) is always fully loaded into memory

Assuming the above points are true (?), a few naive questions:

  1. What is the mapping layer optimizing? Is it that requiring a state_set.php is always faster than an sql query?
  2. Could the state_set table be replaced by just counting the number of items in the array from state_set.php and storing it somewhere as a single number?
  3. Could the state set be queried live from all terms once on startup, instead of reading the php file and state_set table? I.e. select all unique state columns from the terms table, sort in ascending order. Might be slow as hell.

I think I'm missing a big piece of the puzzle at the moment. I feel like the state_set table isn't required at all if the state_set.php array is already loaded into memory. In that case, one could just unset those keys from the php array whenever a term is deleted and doesn't exist anywhere else. But that would obviously be too easy, so I must be missing something here :)

@Toflar
Contributor

Toflar commented Nov 29, 2024

The state_set table holds a "compressed" representation of the actual states

What do you mean by "compressed"? It just holds all states that have been calculated.

There are as many rows as there are known states, let's say 0 to 340 that map to all 340 states in the terms table

No. All the intermediate terms as well. So your foobar term has 6 states that are stored: to get from f to o to o and so on, all those intermediate states are stored in state_set. The algorithm works by looking at the query term and determining all possible states you could reach with a configurable cost (number of typos). So when you search for foobar, the algorithm takes f and calculates all the possible target states it could reach with e.g. 2 typos. In order to do that, it needs to check which target states exist ($stateSet->has()). Those are thousands of calls.
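To make the one-state-per-prefix idea concrete, here is a toy Python sketch. The character mapping and the transition rule are simplified assumptions for illustration only, not the library's exact formula:

```python
ALPHABET_SIZE = 4  # small alphabet: many letters share the same character class

def char_class(ch: str) -> int:
    # Hypothetical mapping that folds a-z onto the classes 1..ALPHABET_SIZE.
    return (ord(ch.lower()) - ord("a")) % ALPHABET_SIZE + 1

def prefix_states(term: str) -> list[int]:
    """One state per prefix: f, fo, foo, foob, fooba, foobar for 'foobar'."""
    states, state = [], 0
    for ch in term:
        # Assumed transition rule: each character walks the automaton to a new state.
        state = state * (ALPHABET_SIZE + 1) + char_class(ch)
        states.append(state)
    return states
```

This is why a 6-letter term contributes 6 entries to the state set, and why states grow quickly for longer terms.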

The actual long state numbers are stored in state_set.php
This mapping between compressed number and state is an optimization layer
This mapping between compressed number and state (the array from the php file) is always fully loaded into memory

No, it's the same data as in the state_set table. It's just dumped so that it can be cached in OPcache which makes it a lot faster than querying the database. And yes, it's loaded into memory because those thousands of has() calls would end up in thousands of SELECT queries making search very slow. Hence, it's loaded into memory.

What is the mapping layer optimizing? Is it that requiring a state_set.php is always faster than an sql query?

I think I have answered this now 😊

Could the state_set table be replaced by just counting the number of items in the array from state_set.php and storing it somewhere as a single number?

The table there is redundant but to me it felt better to have sqlite as the source of truth for all the data and then the state_set.php is just a cache layer that can be recreated whenever needed from the db.
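That cache-layer relationship (SQLite as source of truth, a dumped file as a disposable cache) can be sketched like this in Python. The JSON file format and the function names are illustrative; Loupe actually dumps a PHP array so it can live in OPcache:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def dump_state_set_cache(conn: sqlite3.Connection, path: Path) -> None:
    """Recreate the cache file from the database, the source of truth."""
    states = [s for (s,) in conn.execute("SELECT state FROM state_set ORDER BY state")]
    path.write_text(json.dumps(states))  # stands in for the generated state_set.php

def load_state_set_cache(path: Path) -> set[int]:
    """Load the cached states into memory for fast has() membership checks."""
    return set(json.loads(path.read_text()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_set (state INTEGER)")
conn.executemany("INSERT INTO state_set VALUES (?)", [(2,), (13,), (68,)])
cache = Path(tempfile.mkstemp(suffix=".json")[1])
dump_state_set_cache(conn, cache)
```

The key property is that the cache can be thrown away and rebuilt from the database at any time.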

Could the state set be queried live from all terms once on startup, instead of reading the php file and state_set table? I.e. select all unique state columns from the terms table, sort in ascending order. Might be slow as hell.

That should be answered as well now, right?

With @ausi's work, we can now use v3 of https://github.com/Toflar/state-set-index/releases/tag/3.0.0, so removing terms from the state set should now be possible 🥳
Note: The default also changed from Levenshtein to Damerau-Levenshtein, which will require adjustments as well. We can configure the $transpositionCost to 2 for the time being so it's still regular Levenshtein. I will probably have to create a separate PR where we update to v3 first 😊
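Why a transposition cost of 2 falls back to plain Levenshtein: a swap then costs the same as the two substitutions Levenshtein would use anyway, so it never wins. A rough Python sketch of the restricted (OSA) variant, not the library's implementation:

```python
def damerau_levenshtein(a: str, b: str, transposition_cost: int = 1) -> int:
    """Restricted (OSA) Damerau-Levenshtein with a configurable transposition cost."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition, charged at the configurable cost.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transposition_cost)
    return d[m][n]
```

With the default cost, "ab" to "ba" is one edit; with a cost of 2 it is scored like plain Levenshtein, i.e. two substitutions.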

@daun
Contributor Author

daun commented Nov 29, 2024

Thanks for the thorough explanation! Makes perfect sense.

With @ausi's work, we can now use v3 of state-set-index, so removing terms from the state set should now be possible 🥳

Fantastic 🤠 Should we leave this PR open and integrate that, or do you prefer to create a new, separate PR?

What do you mean by "compressed"? It just holds all states that have been calculated.

No, it's the same data as in the state_set table.

I must be using it wrong, then 😵‍💫 The state_set.php in my index holds an array of state numbers, whereas the state_set table holds numbers going incrementally from 0 to the count of states. Hence my assumption of it being a compression layer that refers to the index of the original state stored in the php file. Can this be some config issue, or an issue with how I load documents into the index?

The one on the left is the terms table, sorted by state. They go 6, 9, 10, 11, 16, 28, 42, etc.
The one on the right is the set table. These go 0, 1, 2, 3, 4, 5, 6, 7, 8, etc.

[Screenshots: terms table (left) and state_set table (right)]

@Toflar
Contributor

Toflar commented Nov 29, 2024

Looks correct and normal to me. Your state_set is not incremental; it just happens that on the very low end almost all states will exist (depending on the alphabet size, but with 4, which we use, that's normal). But you will notice that higher numbers have more space between each other :) Sort your state_set descending and you'll see.

Every term gets its end state assigned. So your sea is 74. But to get there you need the state of s and se as well, those are stored in state_set :)

@daun
Contributor Author

daun commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

[Screenshot: state_set table]

@Toflar
Contributor

Toflar commented Nov 29, 2024

Fantastic 🤠 Should we leave this PR open and integrate that, or do you prefer to create a new, separate PR?

PR for the general v3 update is here: #118
And then we can keep your PR. We need to adjust reviseStorage() or removeOrphans() so that it does not just execute a DELETE query for terms and prefixes_terms (which seems to be missing anyway at the moment - bug), but instead selects those terms and passes them to the new $stateSetIndex->removeFromIndex($termsToRemove). The rest of your PR is perfectly fine (clearing it all on deleteAllDocuments() is way more efficient, so we should keep that).

@Toflar
Contributor

Toflar commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

Wtf, that would be a bug then. Let me check that.

@daun
Contributor Author

daun commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

Wtf, that would be a bug then. Let me check that.

From a quick look at the implementation, it seems to be saving the current item's index to the database, rather than the value. I'm getting the results you're describing by making a slight change to the foreach loop in StateSet::persist. With that, I'm seeing the states from the terms table.

Tests seem to be passing either way :)

```diff
 public function persist(): void
 {
     $this->initialize();

-    foreach ($this->inMemoryStateSet->all() as $state => $data) {
+    foreach ($this->inMemoryStateSet->all() as $state) {
         $this->engine->upsert(IndexInfo::TABLE_NAME_STATE_SET, [
             'state' => $state,
         ], ['state']);
     }

     $all = $this->inMemoryStateSet->all();
     $all = array_combine($this->inMemoryStateSet->all(), array_fill(0, \count($all), true));
     $this->dumpStateSetCache($all);
 }
```
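The same class of bug, shown in Python terms: iterating a plain list with both a "key" and a "value" binds the key to the positional index, and persisting that stores 0, 1, 2, ... instead of the states themselves (illustration only):

```python
states = [2, 13, 68]  # the in-memory state set: a plain list of state numbers

# Buggy shape, mirroring `foreach ($all as $state => $data)` over a list:
# the variable bound to the "key" is just the positional index.
buggy = [key for key, _value in enumerate(states)]

# Fixed shape, mirroring `foreach ($all as $state)`: the state values themselves.
fixed = list(states)
```

This also explains the symptom above: a state_set table counting 0..308 while the real states in terms go up to 22070891.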

@Toflar
Contributor

Toflar commented Nov 29, 2024

Indeed, funny nobody ever found that - persistence of the state was completely wrong 🤦 9306930
I'll have to backport this to 0.8 and release a fix.
EDIT: Done, 0.8.2 is published.

@daun
Contributor Author

daun commented Nov 29, 2024

Indeed, funny nobody ever found that - persistence of the state was completely wrong 🤦 9306930

It wasn't preventing Loupe from working, so no harm done :) Funny enough, I fed the SSI paper and the table structures to an LLM, and it told me about a compression layer which sounds super reasonable 🙃 Hence my naive assumptions.

@Toflar
Contributor

Toflar commented Nov 29, 2024

Sounds intriguing - maybe something we can consider in the future 🤣

@ausi

ausi commented Nov 29, 2024

Technically the states in the index are lossily compressed, as more than one letter maps to the same integer. So the lower you configure the alphabet size, the more compressed the states are. But I'm not sure how this translates to the storage in SQLite here ☺️

@daun daun changed the title Clear all states after deleting all documents Remove orphaned states after deleting documents Nov 30, 2024
daun added 2 commits November 30, 2024 11:00
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@daun
Contributor Author

daun commented Nov 30, 2024

@Toflar I've updated the PR with logic for removing orphaned terms from the state set index. During testing, issue #119 came up where orphaned terms are not removed after updating existing documents. We'll need to solve that before we can verify this PR. Technically, it should be working, but will rely on the indexer removing orphaned terms after updates as well.

@daun daun changed the title Remove orphaned states after deleting documents Remove orphaned states after deleting or updating documents Nov 30, 2024
daun added 2 commits November 30, 2024 12:40
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar
Contributor

Toflar commented Dec 2, 2024

Merged main into develop now!

@daun
Contributor Author

daun commented Dec 2, 2024

@Toflar Nice, I'll check if this PR needs more work and report back :)

Review threads on src/Internal/Index/Indexer.php (outdated, resolved)
daun added 3 commits December 2, 2024 10:54
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@daun
Contributor Author

daun commented Dec 2, 2024

@Toflar Switched to chunked iterators for removing terms, and added logic for cleaning up the prefixes tables as well. Good to go from my end: tests are passing, but it could do with a quick manual test on your side as well :)
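Chunked iteration of this kind can be sketched as follows (generic Python helper, not Loupe's actual code):

```python
from itertools import islice

def chunked(iterable, size: int):
    """Yield lists of at most `size` items, so a huge term set
    never has to sit in memory (or in one SQL statement) at once."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

terms = [f"term{i}" for i in range(10)]
batches = list(chunked(terms, 4))
```

Each batch can then be passed to the index removal and DELETE statements separately, keeping query sizes bounded.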

daun added 2 commits December 2, 2024 11:52
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar Toflar merged commit 90c8062 into loupe-php:develop Dec 2, 2024
18 checks passed
@Toflar
Contributor

Toflar commented Dec 2, 2024

Thanks a lot for sticking with me @daun!

@daun daun deleted the feat/clear-states branch December 2, 2024 12:04