Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove by word should consider other doc ids #45

Merged
merged 2 commits into from
Jul 25, 2022
Merged

Remove by word should consider other doc ids #45

merged 2 commits into from
Jul 25, 2022

Conversation

mateonunez
Copy link
Collaborator

@mateonunez mateonunez commented Jul 24, 2022

This PR would fix #40.

Thanks a lot to @lucaong for supporting me in this bug.

The removeByWord now considers if the docID exists in docIDs Set, then set the node.end as false.
The exact param will stay in the function for future implementations.

@mateonunez mateonunez requested a review from micheleriva July 24, 2022 17:29
@thomscoder
Copy link
Contributor

LGTM 💯 Bug #40 seems to be fixed.

@micheleriva
Copy link
Member

Thank you @lucaong, @thomscoder, and @mateonunez for working on this one

@micheleriva micheleriva merged commit 1cb8f3f into oramasearch:main Jul 25, 2022
@mateonunez mateonunez deleted the fix/remove-should-check-doc-ids branch July 25, 2022 07:27
Copy link

@lucaong lucaong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the issue, but for the "wrong reason" :) See the comment.


if (node.end || (exact && node.end && nodeWord === word)) {
node.removeDoc(docID);

if (node.children?.size) {
if (node.children?.size && docIDs.has(docID)) {
Copy link

@lucaong lucaong Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the issue, but for the wrong reason :)

docIDs.has(docID) will always return false here, because in the previous line node.removeDoc(docID) removed the document from the set. As a result, node.end = false is never executed, so the whole if block here could be completely removed.

Never setting node.end = false on one hand fixes the issue, and does not cause any correctness problem. On the other hand, it leaves a term in the trie, even when all documents associated to it were removed. This is fine, but will have performance impacts on large tries if a lot of documents are deleted: even when the set of documents associated to a term is empty, the term is never removed, so it is still considered in prefix and fuzzy search.

My recommendation: change the if condition to something like node.children?.size && docIDs.size === 0. That means: if after deleting this one document there are no more documents in the set pointed by this word, remove the word from the trie by making the node non-terminal.

Note that this still leaves one case uncovered, namely when the node is terminal but has no children. In that case, in order to really delete the term from the trie, one has to backtrack to the closest branching, and adjust the nodes to remove the "orphaned" suffix. This could be addressed in another issue, since again it does not affect correctness, but rather only performance in case of a large trie with many deleted terms (that without the proposed refactoring stay in the trie, even though they point to empty document sets).

I am confident that @micheleriva understands what I am pointing at, but otherwise I can offer some more explanation.

@@ -112,12 +112,12 @@ export class Trie {
if (!word) return false;

function removeWord(node: TrieNode, _word: string, docID: string): boolean {
const [nodeWord /**_docs*/] = node.getWord();
const [nodeWord, docIDs] = node.getWord();

if (node.end || (exact && node.end && nodeWord === word)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR (rather to #41), but still worth noting: this condition is completely equivalent to just node.end: that's because x || (x && y) is the same as just x. In the specific case, if node.end is true, the condition short circuits before the ||. If it is false, then the part after the || is also false, because it's a chain of && including node.end. Therefore, the whole exact logic can be removed with no effect.

Whether the resulting condition after removing the part after || is correct, I am not 100% sure (I would advise having more extensive tests), but for sure it is equivalent to the current one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deleting one document also causes other documents to disappear from searches
4 participants