
Remove orphaned states after deleting or updating documents #117

Merged
19 commits merged into loupe-php:develop on Dec 2, 2024

Conversation

daun
Contributor

@daun daun commented Nov 26, 2024

Try and tackle #112 by removing unused terms from the state set index after deletion of documents.

Currently blocked by #119. Orphaned terms are currently only deleted after deleting documents. We'll need to find a way to remove them after updating documents as well. The failing test in this PR should pass successfully once there is a solution for that.

Drive-by changes, feel free to revert:

  • Reference fixtures from a helper method
  • Add tests for Loupe factory in-memory instance

daun added 6 commits November 26, 2024 19:07
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar
Contributor

Toflar commented Nov 28, 2024

Thanks for working on this! I still think we should also implement this logic when deleting only one or a few documents, not just when deleting all of them. It's going to make deletion slower, but imho it's better to have a correct state than a fast one.

Basically, what would need to happen is: the state set needs to be deleted, and then we have to loop over all terms and call stateSet->index() again. We should try to improve this in the future, but again, I guess it's better to work correctly now than to be efficient. Wdyt?
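A brute-force version of that rebuild could look like the following Python sketch (table names and the index callback are hypothetical stand-ins, not Loupe's actual schema or API):

```python
import sqlite3

def rebuild_state_set(conn: sqlite3.Connection, index_term) -> None:
    """Truncate the stored state set, then re-derive it from all surviving terms."""
    conn.execute("DELETE FROM state_set")  # drop every stored state
    for (term,) in conn.execute("SELECT term FROM terms"):
        index_term(term)  # stands in for $stateSet->index($term)

# Tiny demo with a stand-in indexer that just records which terms it re-indexed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_set (state INTEGER)")
conn.execute("CREATE TABLE terms (term TEXT)")
conn.executemany("INSERT INTO terms VALUES (?)", [("foo",), ("bar",)])
conn.execute("INSERT INTO state_set VALUES (1)")
reindexed = []
rebuild_state_set(conn, reindexed.append)
```

As the follow-up comments note, this is correct but O(all terms) on every deletion, which is exactly the performance concern discussed next.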

@daun
Contributor Author

daun commented Nov 28, 2024

@Toflar Makes sense to go for the correct implementation! So basically one would need to truncate the state_set table whenever any documents are deleted, and then rebuild all states? And then check how to make it performant in a future iteration.

@Toflar
Contributor

Toflar commented Nov 28, 2024

Yeah, I'm still not sure it makes sense. If you have 30k documents and some 400k terms in your database, removing one document would mean that you have to update the state set for 400k terms (minus the ones you deleted - which can be possibly none). That doesn't sound like a valid solution either.

I mean, there is no problem with keeping the state set - it's not causing any false-positives but if you update your index often and contents change often, then you might end up having a huge state set where half of the states are just useless and obsolete. The question is when to get rid of those 🤔

@daun
Contributor Author

daun commented Nov 28, 2024

Fascinating :) I have a feeling it can be done but I'm very probably just missing something about how the algorithm works. Just for comprehension, are my assumptions below true?

  1. The state_set table holds a "compressed" representation of the actual states
  2. There are as many rows as there are known states, let's say 0 to 340 that map to all 340 states in the terms table
  3. The actual long state numbers are stored in state_set.php
  4. This mapping between compressed number and state is an optimization layer
  5. This mapping between compressed number and state (the array from the php file) is always fully loaded into memory

Assuming the above points are true (?), a few naive questions:

  1. What is the mapping layer optimizing? Is it that requiring a state_set.php is always faster than an sql query?
  2. Could the state_set table be replaced by just counting the number of items in the array from state_set.php and storing it somewhere as a single number?
  3. Could the state set be queried live from all terms once on startup, instead of reading the php file and state_set table? I.e. select all unique state columns from the terms table, sort in ascending order. Might be slow as hell.

I think I'm missing a big piece of the puzzle at the moment. I feel like the state_set table isn't required at all if the state_set.php array is already loaded into memory. In that case, one could just unset those keys from the php array whenever a term is deleted and doesn't exist anywhere else. But that would obviously be too easy, so I must be missing something here :)

@Toflar
Contributor

Toflar commented Nov 29, 2024

The state_set table holds a "compressed" representation of the actual states

What do you mean by "compressed"? It just holds all states that have been calculated.

There are as many rows as there are known states, let's say 0 to 340 that map to all 340 states in the terms table

No. All the intermediate terms as well. So your foobar term has 6 states that are stored: to get from f to o to o and so on, all those intermediate states are stored in state_set. The algorithm works by looking at the query term and determining all possible states you could reach with a configurable cost (number of typos). So when you search for foobar, the algorithm takes f and calculates all the possible target states it could reach with e.g. 2 typos. In order to do that, it needs to check which target states exist ($stateSet->has()). Those are thousands of calls.
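To make the one-state-per-prefix idea concrete, here is a toy Python sketch. The character mapping and the transition rule are simplified assumptions for illustration only, not the library's exact formula:

```python
ALPHABET_SIZE = 4  # small alphabet: many letters share the same character class

def char_class(ch: str) -> int:
    # Hypothetical mapping that folds a-z onto the classes 1..ALPHABET_SIZE.
    return (ord(ch.lower()) - ord("a")) % ALPHABET_SIZE + 1

def prefix_states(term: str) -> list[int]:
    """One state per prefix: f, fo, foo, foob, fooba, foobar for 'foobar'."""
    states, state = [], 0
    for ch in term:
        # Assumed transition rule: each character walks the automaton to a new state.
        state = state * (ALPHABET_SIZE + 1) + char_class(ch)
        states.append(state)
    return states
```

This is why a 6-letter term contributes 6 entries to the state set, and why states grow quickly for longer terms.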

The actual long state numbers are stored in state_set.php
This mapping between compressed number and state is an optimization layer
This mapping between compressed number and state (the array from the php file) is always fully loaded into memory

No, it's the same data as in the state_set table. It's just dumped so that it can be cached in OPcache which makes it a lot faster than querying the database. And yes, it's loaded into memory because those thousands of has() calls would end up in thousands of SELECT queries making search very slow. Hence, it's loaded into memory.

What is the mapping layer optimizing? Is it that requiring a state_set.php is always faster than an sql query?

I think I have answered this now 😊

Could the state_set table be replaced by just counting the number of items in the array from state_set.php and storing it somewhere as a single number?

The table there is redundant but to me it felt better to have sqlite as the source of truth for all the data and then the state_set.php is just a cache layer that can be recreated whenever needed from the db.
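That cache-layer relationship (SQLite as source of truth, a dumped file as a disposable cache) can be sketched like this in Python. The JSON file format and the function names are illustrative; Loupe actually dumps a PHP array so it can live in OPcache:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def dump_state_set_cache(conn: sqlite3.Connection, path: Path) -> None:
    """Recreate the cache file from the database, the source of truth."""
    states = [s for (s,) in conn.execute("SELECT state FROM state_set ORDER BY state")]
    path.write_text(json.dumps(states))  # stands in for the generated state_set.php

def load_state_set_cache(path: Path) -> set[int]:
    """Load the cached states into memory for fast has() membership checks."""
    return set(json.loads(path.read_text()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_set (state INTEGER)")
conn.executemany("INSERT INTO state_set VALUES (?)", [(2,), (13,), (68,)])
cache = Path(tempfile.mkstemp(suffix=".json")[1])
dump_state_set_cache(conn, cache)
```

The key property is that the cache can be thrown away and rebuilt from the database at any time.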

Could the state set be queried live from all terms once on startup, instead of reading the php file and state_set table? I.e. select all unique state columns from the terms table, sort in ascending order. Might be slow as hell.

That should be answered as well now, right?

With @ausi's work, we can now use v3 of https://github.com/Toflar/state-set-index/releases/tag/3.0.0, so removing terms from the state set should now be possible 🥳
Note: The default also changed from Levenshtein to Damerau-Levenshtein, which will require adjustments as well. We can configure the $transpositionCost to 2 for the time being so it's still regular Levenshtein. I will probably have to create a separate PR where we update to v3 first 😊
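Why a transposition cost of 2 falls back to plain Levenshtein: a swap then costs the same as the two substitutions Levenshtein would use anyway, so it never wins. A rough Python sketch of the restricted (OSA) variant, not the library's implementation:

```python
def damerau_levenshtein(a: str, b: str, transposition_cost: int = 1) -> int:
    """Restricted (OSA) Damerau-Levenshtein with a configurable transposition cost."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition, charged at the configurable cost.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + transposition_cost)
    return d[m][n]
```

With the default cost, "ab" to "ba" is one edit; with a cost of 2 it is scored like plain Levenshtein, i.e. two substitutions.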

@daun
Contributor Author

daun commented Nov 29, 2024

Thanks for the thorough explanation! Makes perfect sense.

With @ausi's work, we can now use v3 of state-set-index, so removing terms from the state set should now be possible 🥳

Fantastic 🤠 Should we leave this PR open and integrate that, or do you prefer to create a new, separate PR?

What do you mean by "compressed"? It just holds all states that have been calculated.

No, it's the same data as in the state_set table.

I must be using it wrong, then 😵‍💫 The state_set.php in my index holds an array of state numbers, whereas the state_set table holds numbers going incrementally from 0 to the count of states. Hence my assumption of it being a compression layer that refers to the index of the original state stored in the php file. Can this be some config issue, or an issue with how I load documents into the index?

The one on the left is the terms table, sorted by state. They go 6, 9, 10, 11, 16, 28, 42, etc.
The one on the right is the set table. These go 0, 1, 2, 3, 4, 5, 6, 7, 8, etc.

[Screenshots: terms table (left) and state_set table (right)]

@Toflar
Contributor

Toflar commented Nov 29, 2024

Looks correct and normal to me. Your state_set is not incremental; it just happens that on the very low end almost all states will exist (depending on the alphabet size, but with 4, which we use, that's normal). But you will notice that higher numbers have more space between each other :) Sort your state_set descending and you'll see.

Every term gets its end state assigned. So your sea is 74. But to get there you need the state of s and se as well, those are stored in state_set :)

@daun
Contributor Author

daun commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

[Screenshot: state_set table]

@Toflar
Contributor

Toflar commented Nov 29, 2024

Fantastic 🤠 Should we leave this PR open and integrate that, or do you prefer to create a new, separate PR?

PR for the general v3 update is here: #118
And then we can keep your PR. We need to adjust reviseStorage() or removeOrphans() so that it does not just execute a DELETE query for terms and prefixes_terms (which seems to be missing anyway at the moment - bug), but instead selects those terms and passes them to the new $stateSetIndex->removeFromIndex($termsToRemove). The rest of your PR is perfectly fine (clearing it all on deleteAllDocuments() is way more efficient, so we should keep that).

@Toflar
Contributor

Toflar commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

Wtf, that would be a bug then. Let me check that.

@daun
Contributor Author

daun commented Nov 29, 2024

The highest state in the table is 308, though, and they just increment by 1 up until the end. While the highest state in the terms table is 22070891.

Wtf, that would be a bug then. Let me check that.

From a quick look at the implementation, it seems to be saving the current item's index to the database, rather than the value. I'm getting the results you're describing by making a slight change to the foreach loop in StateSet::persist. With that, I'm seeing the states from the terms table.

Tests seem to be passing either way :)

```diff
 public function persist(): void
 {
     $this->initialize();

-    foreach ($this->inMemoryStateSet->all() as $state => $data) {
+    foreach ($this->inMemoryStateSet->all() as $state) {
         $this->engine->upsert(IndexInfo::TABLE_NAME_STATE_SET, [
             'state' => $state,
         ], ['state']);
     }

     $all = $this->inMemoryStateSet->all();
     $all = array_combine($this->inMemoryStateSet->all(), array_fill(0, \count($all), true));
     $this->dumpStateSetCache($all);
 }
```
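The same class of bug, shown in Python terms: iterating a plain list with both a "key" and a "value" binds the key to the positional index, and persisting that stores 0, 1, 2, ... instead of the states themselves (illustration only):

```python
states = [2, 13, 68]  # the in-memory state set: a plain list of state numbers

# Buggy shape, mirroring `foreach ($all as $state => $data)` over a list:
# the variable bound to the "key" is just the positional index.
buggy = [key for key, _value in enumerate(states)]

# Fixed shape, mirroring `foreach ($all as $state)`: the state values themselves.
fixed = list(states)
```

This also explains the symptom above: a state_set table counting 0..308 while the real states in terms go up to 22070891.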

@Toflar
Contributor

Toflar commented Nov 29, 2024

Indeed, funny nobody ever found that - persistence of the state was completely wrong 🤦 9306930
I'll have to backport this to 0.8 and release a fix.
EDIT: Done, 0.8.2 is published.

@daun
Contributor Author

daun commented Nov 29, 2024

Indeed, funny nobody ever found that - persistence of the state was completely wrong 🤦 9306930

It wasn't preventing Loupe from working, so no harm done :) Funny enough, I fed the SSI paper and the table structures to an LLM, and it told me about a compression layer which sounds super reasonable 🙃 Hence my naive assumptions.

@Toflar
Contributor

Toflar commented Nov 29, 2024

Sounds intriguing - maybe something we can consider in the future 🤣

@ausi

ausi commented Nov 29, 2024

Technically the states in the index are lossily compressed, as more than one letter maps to the same integer. So the lower you configure the alphabet size, the more compressed the states are. But I'm not sure how this translates to the storage in SQLite here ☺️

@daun daun changed the title Clear all states after deleting all documents Remove orphaned states after deleting documents Nov 30, 2024
daun added 2 commits November 30, 2024 11:00
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@daun
Contributor Author

daun commented Nov 30, 2024

@Toflar I've updated the PR with logic for removing orphaned terms from the state set index. During testing, issue #119 came up where orphaned terms are not removed after updating existing documents. We'll need to solve that before we can verify this PR. Technically, it should be working, but will rely on the indexer removing orphaned terms after updates as well.

@daun daun changed the title Remove orphaned states after deleting documents Remove orphaned states after deleting or updating documents Nov 30, 2024
daun added 2 commits November 30, 2024 12:40
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar
Contributor

Toflar commented Dec 2, 2024

Merged main into develop now!

@daun
Contributor Author

daun commented Dec 2, 2024

@Toflar Nice, I'll check if this PR needs more work and report back :)

Review threads on src/Internal/Index/Indexer.php (outdated, resolved)
daun added 3 commits December 2, 2024 10:54
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@daun
Contributor Author

daun commented Dec 2, 2024

@Toflar Switched to chunked iterators for removing terms, and added logic for cleaning up the prefixes tables as well. Good to go from my end: tests are passing, but it could do with a quick manual test on your side as well :)
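Chunked iteration of this kind can be sketched as follows (generic Python helper, not Loupe's actual code):

```python
from itertools import islice

def chunked(iterable, size: int):
    """Yield lists of at most `size` items, so a huge term set
    never has to sit in memory (or in one SQL statement) at once."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

terms = [f"term{i}" for i in range(10)]
batches = list(chunked(terms, 4))
```

Each batch can then be passed to the index removal and DELETE statements separately, keeping query sizes bounded.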

daun added 2 commits December 2, 2024 11:52
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
@Toflar Toflar merged commit 90c8062 into loupe-php:develop Dec 2, 2024
18 checks passed
@Toflar
Contributor

Toflar commented Dec 2, 2024

Thanks a lot for sticking with me @daun!

@daun daun deleted the feat/clear-states branch December 2, 2024 12:04