ngram indexing implementation details #1518
Comments
Could you say more about how the use of a bloom filter avoids the large file problem? Specifically, the large file problem occurs when a large file winds up being a frequent false positive that ripgrep then needs to exhaustively search. Since it's a large file, that search will take a long time, potentially negating the benefits of indexed search in the first place.
That problem persists. It avoids the problem of large files producing lots of ngrams and filling up your index. That is usually a problem with inverted indices, but I suppose that might not be as much of an issue for n-gram-only inverted indices. You could partition large files into chunks and build indices/bloom filters at that level instead. N-grams at the boundaries could be problematic. Might be more of a v2 feature than an MVP one.
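For what it's worth, the boundary issue can be sidestepped by overlapping adjacent chunks by n-1 bytes, so every n-gram falls entirely inside at least one chunk. A rough sketch of that idea (the chunk size and n-gram length are arbitrary example values, not anything ripgrep defines):

```rust
// Sketch: split a large file into fixed-size chunks that overlap by NGRAM - 1
// bytes, so an n-gram straddling a chunk boundary still appears whole in the
// next chunk. CHUNK_SIZE and NGRAM are arbitrary example values.
const CHUNK_SIZE: usize = 1 << 20; // 1 MiB
const NGRAM: usize = 3;

fn chunks(data: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + CHUNK_SIZE).min(data.len());
        out.push(&data[start..end]);
        if end == data.len() {
            break;
        }
        // Back up so the next chunk re-covers the last NGRAM - 1 bytes.
        start = end - (NGRAM - 1);
    }
    out
}
```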
Yeah, at least initially, I'm trying to avoid going down the road of having to modify the existing search architecture to support searching anything more granular than a single file. Once you go down that route, you run into all sorts of thorny issues, mostly around presenting the results back to the user. e.g., Supporting things like line numbers and the various context related flags. But yes, large files will definitely fill up the postings lists disproportionately. That in and of itself isn't so much of a problem. The main problem in this space is very common ngrams that themselves have very large postings lists. Traversing those lists ends up taking quite a bit of time. I'm hoping to mitigate that somewhat:
Popping up a level, it seems to me like a bloom-filter based index requires a completely different architecture than an inverted index. It seems to me like it almost requires building up a bunch of chunks of the corpus, where each chunk has a bloom filter (or similar) index. If your query passes the bloom filter, then you must search the entire chunk. And then it seems like you need to repeat this for every chunk. (At least, I think this is how qgrep works. The section "Filtering searched chunks" seems to hint that this is the case.) That would hamper the best case because you always need to search the index for every chunk. Now, the n-gram indexing story I have in my head also uses a chunking strategy. The difference is that the chunks can be merged with low memory usage in linear time. This is essential to the incremental indexing story. The incremental indexing story for the bloom filter approach looks a bit less clear to me, although I haven't thought about it too carefully. Anyway, I will continue to noodle on this. Thanks for the thoughts. Keep them coming if you have them. :-)
@omac777 I don't have any plans to use a roaring bitmap, so I don't think that work applies. Also, none of your gitlab links work for me. It seems I need an account to see the repos?
I don't really see why that is the case. In the simplest case, you just have a bloom filter for each file. When querying, you build a bloom filter of all your query n-grams and use bitwise operations to check whether all of your query is in the bloom filter of the file. You then just linearly search all the filters and search the files if there is a match. Linear searching the filters might sound slow, but since it's just bitwise ops, easily SIMD-izable, and scans memory linearly, it's usually quite fast. Also consider that the cost is constant in the number of n-grams in the query. There are hierarchical filtering schemes, but it seems unclear if they are necessary here.
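As a rough illustration of that per-file filtering step (the filter size, hash scheme, and trigram length are made-up example choices, not anything Humio or ripgrep actually uses):

```rust
// Sketch: one bloom filter per file, queried with pure bitwise ops.
const FILTER_WORDS: usize = 1024; // 65,536 bits per file (example size)

struct Filter {
    words: Vec<u64>,
}

impl Filter {
    fn new() -> Filter {
        Filter { words: vec![0; FILTER_WORDS] }
    }

    // Two bit positions derived from a single FNV-1a hash of the gram.
    fn positions(gram: &[u8]) -> (usize, usize) {
        let mut h: u64 = 0xcbf29ce484222325;
        for &b in gram {
            h ^= u64::from(b);
            h = h.wrapping_mul(0x100000001b3);
        }
        let bits = FILTER_WORDS * 64;
        (h as usize % bits, (h >> 32) as usize % bits)
    }

    fn insert(&mut self, gram: &[u8]) {
        let (a, b) = Filter::positions(gram);
        self.words[a / 64] |= 1u64 << (a % 64);
        self.words[b / 64] |= 1u64 << (b % 64);
    }

    // True iff every bit set in `query` is also set in `self`.
    fn contains_all(&self, query: &Filter) -> bool {
        self.words.iter().zip(&query.words).all(|(f, q)| (f & q) == *q)
    }
}

// Build a filter over every byte trigram of `data`.
fn filter_for(data: &[u8]) -> Filter {
    let mut f = Filter::new();
    for gram in data.windows(3) {
        f.insert(gram);
    }
    f
}

// Linear scan over per-file filters: indices of files worth actually searching.
fn candidate_files(file_filters: &[Filter], query: &[u8]) -> Vec<usize> {
    let qf = filter_for(query);
    file_filters
        .iter()
        .enumerate()
        .filter(|(_, f)| f.contains_all(&qf))
        .map(|(i, _)| i)
        .collect()
}
```

The inner check is just AND-and-compare over the filter words, which is why the linear scan over filters tends to be memory-bandwidth bound rather than compute bound.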
Yes. That sounds exactly like building a bunch of chunks of the corpus, where each chunk is a single file. The best case performance for an inverted index sounds much better. A hierarchical scheme for bloom filters appears required to achieve that. Otherwise, it doesn't seem like it scales.
Yes, I understand that. But you still need to do this for every chunk. It looks like Lucene has a bloom filter postings format, but it appears to be specific to fields that have low frequency values (like primary keys). Do you know of any other established IR systems that use a bloom filter instead of an inverted index?
Okay, but I don't see the difference vs. an inverted index. An inverted index could also refer to a "chunk" of documents instead of referring to a single document. The best case for the inverted index is a single n-gram that matches a single document. The best case for the bloom filter is a combination of n-grams that only matches a single document. That seems like a more likely case to me, due to the nature of n-grams. Inverted indices might not help here, because the n-grams might be individually common but rare in combination. Also consider that inverted indices have a bigger construction and maintenance cost. But I do guess it helps that this is unlikely to be a write-heavy situation, and the inverted indices don't have to store the position in the document as well.
Heh, so this is when I admit my bias. :) I work at @humio where we use a bloom-filter based search system to do regex search (and more) over log data. We have just 2 levels of bloom filters for petabyte sized clusters (though we use more than just bloom filters for filtering).
As an aside, this problem is very similar to k-mer indexing in bioinformatics, and there is a recent literature review which brings up many of the points of this inverted index/hierarchical BF discussion going on here.
I think we're misunderstanding each other. An "optimal" inverted index involves these operations:
In the "best" case---pretty much any time you have an uncommon ngram---you do a very small number of operations. But the bloom filter chunking mechanism you're proposing requires visiting every chunk. There is no "best" case that lets you visit a very small subset of the corpus.
Right. The problem there is traversing large postings lists. That can largely be mitigated by chunking the posting lists into blocks. So long as there is a rare ngram somewhere, the intersection can make traversing the longer posting lists very fast. Lucene does this. Each block has its maximum document ID; if you know the document ID you're seeking is greater than a block's maximum, then you can skip the entire block without decoding it.

My speculation is that the best case with an inverted index is much better than the best case with a bloom filter. And even in the bloom filter's best case, the inverted index is going to do well.
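Roughly, the block skipping looks like this. This is a sketch under my own assumptions about the block layout (flat `Vec<u32>` per block instead of a real compressed encoding), not how Lucene literally stores it:

```rust
// Sketch: a postings list split into blocks, each tagged with its maximum
// document ID, so "seek to the first doc >= target" can skip whole blocks
// without decoding them.
struct Block {
    max_doc: u32,
    docs: Vec<u32>, // sorted doc IDs
}

struct Postings {
    blocks: Vec<Block>,
}

impl Postings {
    // First doc ID >= target, skipping blocks whose max_doc is below target.
    fn seek(&self, target: u32) -> Option<u32> {
        for block in &self.blocks {
            if block.max_doc < target {
                continue; // nothing in this block can match; never decode it
            }
            let i = block.docs.binary_search(&target).unwrap_or_else(|i| i);
            if i < block.docs.len() {
                return Some(block.docs[i]);
            }
        }
        None
    }
}

// Drive the intersection with the rare term: for each of its doc IDs, probe
// the common term's postings, touching only blocks that could contain it.
fn intersect(rare_docs: &[u32], common: &Postings) -> Vec<u32> {
    rare_docs
        .iter()
        .copied()
        .filter(|&doc| common.seek(doc) == Some(doc))
        .collect()
}
```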
Hah okay. Fair. I guess I should revise my question to: do you know of any other established IR systems that I can study in detail that use a bloom filter instead of an inverted index? It sounds like this is kind of boiling down into a speculative thing. I'm biased because I'm most familiar with inverted index style IR systems. I know them well. I know how to do incremental updating with them and I know where they do poorly. I'm much less familiar with how to build an IR system using bloom filters, which is just a totally different architecture with fundamentally different trade-offs. As is expected, these trade-offs are deeply rooted in what kinds of queries are being performed, and my best guess at this point is that an inverted index gives better coverage there.
@luizirber Interesting paper. I worked in that field in a former life. I found it a bit difficult to pick out any specific details that help this discussion though. Parts of it were quite dense. I also wonder what impact the alphabet size has, although that's partially mitigated by just picking a larger ngram size.
Well, yes, unless you have a hierarchical filter. But a hierarchical filter using, e.g., the file hierarchy wouldn't be difficult to implement.
Bing replaced their inverted indices with bloom filters, supposedly with 10x query capacity: https://danluu.com/bitfunnel-sigir.pdf. Granted, what makes sense in a large scale context may not make as much sense at the scale of ripgrep.
Thanks, I'll take a look at that paper.
I am not a pro, but I was thinking you could store a hash of each file in its metadata. This makes it possible to detect duplicate files, and so gives a smaller index and faster search (if you hit a duplicate file). I don't know how likely this is in normal/large codebases. Maybe it's not common, or not common enough to take into account, as this needs further logic to maintain links between identical files. Sorry for the rant, just an amateur that has always liked to play with IR.
There are two strategies for dealing with long posting lists. One is to indeed use integer compression via some method of delta encoding. The other is to encode it as a skip list. This comes in handy when intersecting a common ngram with an uncommon one. You can skip through the vast majority of things in the common posting list without having to decode every item in the list.
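To make the first of those strategies concrete, here is a minimal sketch of delta-encoding a sorted postings list with an LEB128-style varint; the exact byte format is just an example, not what any particular index uses:

```rust
// Sketch: delta-encode a sorted postings list, then varint-encode the gaps.
// Gaps between consecutive doc IDs are small, so most fit in a single byte.
fn encode_postings(docs: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &doc in docs {
        let mut gap = doc - prev;
        prev = doc;
        // 7 bits per byte; the high bit marks a continuation byte.
        loop {
            let byte = (gap & 0x7f) as u8;
            gap >>= 7;
            if gap == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

fn decode_postings(bytes: &[u8]) -> Vec<u32> {
    let mut docs = Vec::new();
    let mut prev = 0u32;
    let mut gap = 0u32;
    let mut shift = 0;
    for &b in bytes {
        gap |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            prev += gap;
            docs.push(prev);
            gap = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    docs
}
```

A skip list layered on top (like the block maxima sketched above) is what lets an intersection jump over most of the compressed data without decoding it.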
There is a lot of work hiding in that "could."
Hey @BurntSushi,
With respect to line breaks, I suspect we need them for multi-line search. So I would include them. With respect to Unicode, for ripgrep in particular, we want bytes. This doesn't mean we ignore "characters outside of ASCII." It just means that we operate at the byte level. If we choose anything other than bytes, then we lose the ability for search to work on data that isn't valid UTF-8. I haven't thought carefully about these things yet, so I could be wrong. They are just my instincts.
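For illustration, operating at the byte level keeps gram extraction trivial and encoding-agnostic; a sketch, with the trigram length chosen arbitrarily:

```rust
// Sketch: extract overlapping byte trigrams from arbitrary bytes. No UTF-8
// decoding happens, so invalid UTF-8 and line breaks are treated like any
// other bytes and simply become part of whatever grams they fall into.
fn byte_trigrams(data: &[u8]) -> impl Iterator<Item = [u8; 3]> + '_ {
    data.windows(3).map(|w| [w[0], w[1], w[2]])
}
```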
If you are doing a lot of work where the main goal is to get that work merged into ripgrep, please first discuss your plan with me by opening a new issue.
This is split off from #1497 as a space to discuss specific implementation details of the indexing strategy. I'd like to keep this discussion separate so that folks won't feel put off from participating in a higher level discussion around the UX.
TODO: I'll fill in a sketch of my high level plan when I get a chance.