Support deleting elements? #4
This works mostly fine for SW-graph: Doing something similar for HNSW is a bit more work, but not too much. My understanding is that HNSW will actually be much more robust to deletion. One disadvantage is that you can't use an optimized static index any more (but that's the cost of all dynamic data structures): At least, something
Hi @ThomasDelteil,
Agree with @searchivarius. HNSW should be much more robust, as it does not depend on the data order. SW-graph (NSW) can be easily crippled on low-dim data if the first inserted elements are removed.
Any chance this will be added, either here or in nmslib, some time soon?
@phdowling sometime, it will. SW-graph in NMSLIB has an experimental deletion feature, but it's not exposed to Python.
@phdowling I do not think I will have an opportunity to add deletions to hnswlib for the next several months.
Has any hero implemented the deletion feature? We are looking forward to it.
@yurymalkov could you describe the process of reassigning the edges? Would a naive approach of simply removing the node from its connected nodes on all layers work? Those should be easy to find, since the links are bi-directional. Edit: on second thought, the links are not bi-directional, because only the strongest links on each level are preserved. So would the only option be to make a pass through all nodes in the graph and rewrite those that have connections to the item being deleted?
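The full pass asked about above can be sketched as follows. This is a minimal illustration, not hnswlib code: the `layers` layout (one dict of outgoing neighbor lists per level) and the name `hard_delete` are assumptions made for the example. It only drops edges; it does not reconnect the nodes that lose a neighbor, which a real index would also have to do.

```python
from typing import Dict, List

# Hypothetical in-memory layout: layers[level][node_id] -> outgoing neighbor ids.
Layers = List[Dict[int, List[int]]]


def hard_delete(layers: Layers, victim: int) -> int:
    """Remove `victim` and every edge pointing at it; return the number of edges dropped."""
    dropped = 0
    for level in layers:
        # Drop the victim's own outgoing links on this level, if it exists here.
        level.pop(victim, None)
        # Links are directed, so every remaining node has to be inspected.
        for node, neighbors in level.items():
            kept = [n for n in neighbors if n != victim]
            dropped += len(neighbors) - len(kept)
            level[node] = kept
    return dropped
```

The cost is one scan of every adjacency list per deletion, which is why the approach does not scale unless deletions are batched.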
I have not read all the code yet, but is there a workaround for this issue, or is saving and loading the index (or recreating it) the only way to do a "hard delete"?
I ran into a similar issue building the deletion feature for the HNSW implementation in Weaviate (which is written in Golang). Weaviate is a general-purpose database/search engine, so we can't predict in what order or how frequently users will delete items. The "flagging-only" approach therefore isn't feasible for us, for the reasons @yurymalkov outlined very well above. I'd like to share my findings and also present an idea of how I'm planning to tackle this issue. Feedback much appreciated.
Naive approach
I first tried following all edges of the to-be-deleted items and simply reassigning those, i.e. what you described as the "naive approach". However, even at a very small dataset (1000 nodes,
Brute force approach
So next I tried the brute-force approach of:
This works well: all the edges are cleaned up and the quality of the index isn't harmed. However, it is terribly slow. Even on my tiny graph, deleting all the items took about 10 times as long as inserting them. I don't see this being feasible on a graph with a few million or billion nodes.
Idea: Tombstones with periodic cleanup
My idea to solve this issue is to combine the "flagging approach" with a periodic reassignment of edges. It would look roughly like this:
On Delete:
On Search:
Periodically (with configurable frequency):
Possible Benefits
Any thoughts on my outlined approach? Is there any way we can avoid a brute-force cleanup which needs to read every single node in the graph?
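A rough sketch of the tombstone idea, for readers who want something concrete. The individual steps under "On Delete", "On Search" and "Periodically" are not preserved in this copy of the thread, so everything below is an assumption rather than Weaviate's implementation: the class name `TombstoneGraph`, the single-layer `edges` dict, and the rewiring rule (replace an edge to a dead node with that node's live neighbors) are all hypothetical.

```python
class TombstoneGraph:
    """Single-layer toy graph: edges[node] -> list of outgoing neighbor ids."""

    def __init__(self, edges):
        self.edges = {node: list(nbrs) for node, nbrs in edges.items()}
        self.tombstones = set()

    def delete(self, node):
        # Cheap delete: only flag the node, leave all edges in place.
        if node in self.edges:
            self.tombstones.add(node)

    def filter_results(self, candidates):
        # Search may still traverse flagged nodes, but never returns them.
        return [c for c in candidates if c not in self.tombstones]

    def cleanup(self):
        """One pass over all nodes that removes every flagged node at once."""
        dead = self.tombstones
        for node in list(self.edges):
            if node in dead:
                continue
            rewired = []
            for nbr in self.edges[node]:
                # Assumed rule: inherit the live neighbors of a dead neighbor.
                if nbr in dead:
                    targets = [m for m in self.edges[nbr] if m not in dead]
                else:
                    targets = [nbr]
                for target in targets:
                    if target != node and target not in rewired:
                        rewired.append(target)
            self.edges[node] = rewired
        for node in dead:
            del self.edges[node]
        cleaned = len(dead)
        self.tombstones = set()
        return cleaned
```

The point of the design is that the full scan is paid once per cleanup cycle instead of once per deleted item. A real HNSW index would re-select neighbors by distance and cap the list at its connection limit instead of using the simple inheritance rule shown here.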
Hi @etiennedi, thanks for the detailed analysis. I agree that this might be a very good solution! Another option is, when a new element is added, to pick one of the deleted (tombstoned) elements and update it to the position of the new element. The downside is that this cannot shrink the index at any point (e.g. it makes sense to remove unused elements when saving the index).
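The slot-reuse option can be illustrated with a small sketch. It covers only the bookkeeping; in a real index the reused slot's edges would also have to be recomputed for the new vector. The class `SlotStore` and its fields are hypothetical names for this example, not part of hnswlib.

```python
class SlotStore:
    """Toy storage where deleted slots are recycled by later inserts."""

    def __init__(self):
        self.vectors = []     # slot index -> vector
        self.label_of = []    # slot index -> external label, None if tombstoned
        self.free_slots = []  # tombstoned slots waiting to be reused

    def delete(self, label):
        slot = self.label_of.index(label)  # raises ValueError if label is unknown
        self.label_of[slot] = None
        self.free_slots.append(slot)

    def insert(self, label, vector):
        if self.free_slots:
            # Overwrite a tombstoned element instead of growing the storage.
            slot = self.free_slots.pop()
            self.vectors[slot] = vector
            self.label_of[slot] = label
        else:
            slot = len(self.vectors)
            self.vectors.append(vector)
            self.label_of.append(label)
        return slot
```

As noted above, storage never shrinks under this scheme: if many elements are deleted and few are added, the tombstoned slots simply stay allocated.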
Hi @etiennedi, your suggestion sounds interesting! Did you end up implementing this solution? I would appreciate it if you could share your experience.
Hi @alonre24, yes, the approach outlined above is what we implemented for async cleanup of deleted objects in Weaviate. You can find the implementation here. At first, we encountered a couple of new bugs during concurrent writes and deletions which could lead to issues down the line, especially around handling entrypoints. For example, in Weaviate we support kNN-style classifications where each classification is an update (the object is updated with the label), leading to a ton of concurrent writes/deletes. This "helped" us surface any issues rather quickly. We have since added a lot of integration tests for the edge cases around deletes, which can be found here. Since then I haven't been aware of any other bugs related to this, so I'm pretty confident that it works reliably now, and I can definitely recommend this approach.
Thanks @etiennedi for your detailed answer!
@yurymalkov I see that hnswlib supports
Hi @kishorenc, it is not implemented. An easy solution is to just rebuild the index from scratch (which can be done in C++). That has ~O(1) amortized complexity per deletion (if the rebuild is triggered once a fixed fraction of the dataset has been deleted).
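The amortized argument can be checked with a small counter. This is a sketch under stated assumptions: `RebuildOnDelete`, `threshold` and `rebuild_work` are hypothetical names, and "work" is counted in re-insertions, so the cost of each individual insertion is left out. If a rebuild is triggered after a fraction f of the indexed elements has been deleted, each rebuild re-inserts (1 - f)·n survivors for f·n deletions, i.e. (1 - f)/f re-insertions per deletion, independent of n.

```python
class RebuildOnDelete:
    """Counts rebuild work when the index is rebuilt after a fraction of deletions."""

    def __init__(self, items, threshold=0.5):
        self.live = set(items)
        self.indexed = len(self.live)   # index size at the last rebuild
        self.deleted_since_rebuild = 0
        self.rebuild_work = 0           # total re-insertions performed so far
        self.threshold = threshold

    def delete(self, item):
        if item not in self.live:
            return
        self.live.remove(item)
        self.deleted_since_rebuild += 1
        if self.deleted_since_rebuild >= self.threshold * self.indexed:
            self._rebuild()

    def _rebuild(self):
        # Rebuilding from scratch costs one insertion per surviving element.
        self.rebuild_work += len(self.live)
        self.indexed = len(self.live)
        self.deleted_since_rebuild = 0
```

With `threshold=0.5` the ratio of re-insertions to deletions stays at 1 no matter how large the index is, which is the sense in which the per-deletion cost is constant.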
This PR adds a batch persist functionality via the persistDirty() method to the graph, which only persists dirtied elements in the graph. We store data in four files: a header, the data_level_0, the length, and the link lists. The latter three files map to the in-memory representation. Data is never read from disk except on load, so the in-memory structure serves as a write-through cache. Callers are expected to periodically call persistDirty() in a thread-compatible way. This storage scheme is extremely naive, and is only meant as an improvement over serializing the whole index. We can make many improvements in terms of disk access, layout, caching, and durability.
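The dirty-tracking idea behind persistDirty() can be sketched as below. This is not the PR's code: the class `DirtyTrackingStore`, the `put` and `persist_dirty` methods, and the use of a plain dict as a stand-in for the on-disk files are all assumptions made to keep the example self-contained.

```python
class DirtyTrackingStore:
    """In-memory store that persists only the elements changed since the last flush."""

    def __init__(self):
        self.memory = {}    # authoritative in-memory representation
        self.disk = {}      # stand-in for the on-disk files
        self.dirty = set()  # ids modified since the last persist_dirty() call

    def put(self, element_id, value):
        self.memory[element_id] = value
        self.dirty.add(element_id)

    def persist_dirty(self):
        """Write only the dirtied elements and return how many were written."""
        written = len(self.dirty)
        for element_id in self.dirty:
            self.disk[element_id] = self.memory[element_id]
        self.dirty.clear()
        return written
```

Like the scheme described above, the sketch is not thread-safe on its own: callers would have to make sure that `put` and `persist_dirty` do not run concurrently.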
Thanks for the great implementation; the performance is impressive. I was wondering what it would take to add the ability to delete an element from the graph without degrading performance too much. Would it be a special flag on the label, or re-assigning the lost edges?