Remove orphaned states after deleting or updating documents #117
Signed-off-by: Philipp Daun <post@philippdaun.net>
Thanks for working on this! I still think we should also implement the logic for deleting only one or multiple documents, not just all of them. It's going to make deletion slower, but imho it's better to have a correct state than a fast one. Basically, what would need to happen is: the state set needs to be deleted, and then we have to loop over all terms and call …

@Toflar Makes sense to go for the correct implementation! So basically one would need to truncate the `state_set` table whenever any documents are deleted, and then rebuild all states? And then check how to make it performant in a future iteration.

Yeah, I'm still not sure it makes sense. If you have 30k documents and some 400k terms in your database, removing one document would mean that you have to update the state set for 400k terms (minus the ones you deleted, which can possibly be none). That doesn't sound like a valid solution either. I mean, there is no problem with keeping the state set: it's not causing any false positives. But if you update your index often and contents change often, then you might end up with a huge state set where half of the states are just useless and obsolete. The question is when to get rid of those 🤔
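For illustration, the "truncate and rebuild" idea discussed above could look roughly like the sketch below. Note that `statesForTerm()` is a toy stand-in, not the real state set computation, and the data shapes are assumptions:

```php
<?php

// Toy stand-in for the real term-to-states computation (hypothetical):
// here every prefix of the term yields one state.
function statesForTerm(string $term): array
{
    $states = [];
    for ($i = 1; $i <= strlen($term); $i++) {
        $states[] = crc32(substr($term, 0, $i));
    }

    return $states;
}

// The "correct but slow" approach: throw the state set away and
// recompute it from every term that remains after deletion, so no
// orphaned state can survive.
function rebuildStateSet(array $remainingTerms): array
{
    $stateSet = [];
    foreach ($remainingTerms as $term) {
        foreach (statesForTerm($term) as $state) {
            $stateSet[$state] = true; // set semantics: duplicates collapse
        }
    }

    return array_keys($stateSet);
}
```

This is correct by construction, but as noted above it scales with the total number of terms, not with the number of deleted documents.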
Fascinating :) I have a feeling it can be done but I'm very probably just missing something about how the algorithm works. Just for comprehension, are my assumptions below true?
Assuming the above points are true (?), a few naive questions:
I think I'm missing a big piece of the puzzle at the moment. I feel like the …
What do you mean by "compressed"? It just holds all states that have been calculated.
No. All the intermediate terms as well. So your …

No, it's the same data as in the …
I think I have answered this now 😊
The table there is redundant, but to me it felt better to have SQLite as the source of truth for all the data, and then the …

That should be answered as well now, right? With @ausi's work, we can now use v3 of https://github.com/Toflar/state-set-index/releases/tag/3.0.0, so removing terms from the state set should now be possible 🥳
Thanks for the thorough explanation! Makes perfect sense.
Fantastic 🤠 Should we leave this PR open and integrate that, or do you prefer to create a new, separate PR?
I must be using it wrong, then 😵‍💫 The one on the left is the terms table, sorted by state. They go 6, 9, 10, 11, 16, 28, 42, etc.

[screenshots of the two database tables]

Looks correct and normal to me. Your … Every term gets its end state assigned. So your …

PR for the general v3 update is here: #118

Wtf, that would be a bug then. Let me check that.
From a quick look at the implementation, it seems to be saving the current item's index to the database rather than the value. I'm getting the results you're describing by making a slight change to the foreach loop in `persist()`. Tests seem to be passing either way :)

```diff
 public function persist(): void
 {
     $this->initialize();

-    foreach ($this->inMemoryStateSet->all() as $state => $data) {
+    foreach ($this->inMemoryStateSet->all() as $state) {
         $this->engine->upsert(IndexInfo::TABLE_NAME_STATE_SET, [
             'state' => $state,
         ], ['state']);
     }

     $all = $this->inMemoryStateSet->all();
     $all = array_combine($this->inMemoryStateSet->all(), array_fill(0, \count($all), true));
     $this->dumpStateSetCache($all);
 }
```
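The underlying pitfall can be reproduced in isolation. A minimal sketch, assuming `all()` returns a plain PHP list of state integers (hypothetical data, not Loupe's actual API):

```php
<?php

// Minimal reproduction of the key-vs-value foreach pitfall.
$states = [6, 9, 10, 11, 16]; // a plain list of state integers

// Buggy variant: `$key => $value` iteration collects the list index.
$buggy = [];
foreach ($states as $key => $value) {
    $buggy[] = $key; // oops: 0, 1, 2, ... instead of the actual states
}

// Fixed variant: iterate the values directly.
$fixed = [];
foreach ($states as $state) {
    $fixed[] = $state; // 6, 9, 10, ...
}
```

With a plain list, the foreach key is just the running index, which is why the buggy version persisted small sequential integers instead of the real states.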
Indeed, funny nobody ever found that: persistence of the state was completely wrong 🤦 9306930

It wasn't preventing Loupe from working, so no harm done :) Funnily enough, I fed the SSI paper and the table structures to an LLM, and it told me about a compression layer, which sounds super reasonable 🙃 Hence my naive assumptions.

Sounds intriguing, maybe something we can consider in the future 🤣

Technically, the states in the index are lossily compressed, as more than one letter maps to the same integer. So the lower you configure the alphabet size, the more compressed the states are. But I'm not sure how this translates to the storage in SQLite here.
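To illustrate that lossy mapping, here is a hypothetical sketch. The modulo scheme below is an assumption for illustration only; the actual state-set-index library may map characters differently:

```php
<?php

// Hypothetical sketch of folding codepoints into a fixed-size alphabet.
// Distinct letters can land on the same integer, which is the lossy
// compression described above. Modulo mapping is assumed here, not
// taken from the real library.
function mapToAlphabet(string $char, int $alphabetSize): int
{
    return (ord($char) % $alphabetSize) + 1;
}

// With a deliberately tiny alphabet of 4, 'a' (97) and 'e' (101)
// collide: both fold to the same integer. A larger alphabet size
// produces fewer collisions, i.e. less compression.
```

The trade-off mirrors the comment above: a smaller alphabet means more letters share an integer, so the state set stays smaller but matching gets fuzzier.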
@Toflar I've updated the PR with logic for removing orphaned terms from the state set index. During testing, issue #119 came up, where orphaned terms are not removed after updating existing documents. We'll need to solve that before we can verify this PR. Technically, it should be working, but it will rely on the indexer removing orphaned terms after updates as well.
Merged

@Toflar Nice, I'll check if this PR needs more work and report back :)
@Toflar Switched to chunked iterators for removing terms, and added logic for cleaning up the prefixes tables as well. Good to go from my end: tests are passing, but it might be worth a quick manual test on your side as well :)
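Chunked removal can be sketched like this. Hedged: the `state_set` table name, schema, and PDO usage are illustrative assumptions, not Loupe's actual internals:

```php
<?php

// Illustrative chunked deletion: removing orphaned states in batches
// keeps each statement's bound-parameter count below SQLite's limit
// and bounds memory use for large orphan sets.
function deleteStatesChunked(PDO $pdo, array $orphanedStates, int $chunkSize = 500): void
{
    foreach (array_chunk($orphanedStates, $chunkSize) as $chunk) {
        $placeholders = implode(',', array_fill(0, count($chunk), '?'));
        $pdo->prepare("DELETE FROM state_set WHERE state IN ($placeholders)")
            ->execute($chunk);
    }
}
```

Chunking trades a single huge `IN (...)` clause for several small ones, which is usually the safer default against SQLite's per-statement variable limit.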
Thanks a lot for sticking with me @daun!
Tries to tackle #112 by removing unused terms from the state set index after deletion of documents.

Currently blocked by #119: orphaned terms are only deleted after deleting documents. We'll need to find a way to remove them after updating documents as well. The failing test in this PR should pass once there is a solution for that.
Drive-by changes, feel free to revert: