How to customize freeze function? #131
Comments
Why not just pick the prefixes yourself ahead of time? I think I really need more details here. An example would be helpful.
The direct answer to your question is that there is no way to customize this in the current API.
Because I can't know which prefixes are needed (unless I traverse all documents once before building the FST).
For example, suppose we have the set: {"ab", "abc", "abd", "abe", "abf", "ac", "ad"}
You have to do this anyway, because an FST requires providing the keys in lexicographic order. So you need to know all of your keys before you start building the FST.
It still isn't clear to me why this has to be in FST construction and not in a pre-processing phase. Even if FST construction did this automatically, you'd still need to know what the suffixes that weren't written are so that you can write them somewhere else (as you mention). If I'm still not understanding you, then I think talking about this at a conceptual level isn't going to work. In that case, I'd suggest that you put forward a proposal that adds a new API along with a sketch of how the implementation of that new API might work.
Yes, it requires providing the keys in order. But I don't have just one FST. I will generate many FSTs, one per segment, so within a particular segment I can only know the prefixes of that segment. I will merge the FSTs at the end, and only during that final FST build can I know the prefixes across all documents.
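To make the segment-merge step concrete, here is a minimal sketch using only the standard library. It assumes each segment's keys are already available as a sorted list; `merge_sorted_segments` is a hypothetical helper, not part of the `fst` API. It produces one sorted, de-duplicated key sequence, which is the form a final FST build would need.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: k-way merge of per-segment key lists.
/// Each inner Vec is assumed to be sorted already.
pub fn merge_sorted_segments(segments: &[Vec<String>]) -> Vec<String> {
    // Heap entries are (key, segment index, position in segment).
    // `Reverse` turns the max-heap into a min-heap on the key.
    let mut heap = BinaryHeap::new();
    for (seg, keys) in segments.iter().enumerate() {
        if let Some(first) = keys.first() {
            heap.push(Reverse((first.as_str(), seg, 0usize)));
        }
    }

    let mut merged: Vec<String> = Vec::new();
    while let Some(Reverse((key, seg, pos))) = heap.pop() {
        // Skip keys that also appeared in an earlier-popped segment.
        if merged.last().map(|s| s.as_str()) != Some(key) {
            merged.push(key.to_string());
        }
        // Advance within the segment the key came from.
        if let Some(next) = segments[seg].get(pos + 1) {
            heap.push(Reverse((next.as_str(), seg, pos + 1)));
        }
    }
    merged
}
```

Because the merged output is in lexicographic order, any prefix analysis can run over it in a single pass before (or while) the keys are handed to the final builder.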
Once a prefix is frozen, all of its suffixes are determined and immutable, so we know exactly what they are. We can record all input that arrives after "ab" and before "ac". Once "ac" is inserted and we find that the number of recorded suffixes ("", "c", "d", "e", "f") is >= the MIN constant, we can write "ab" to the FST and write all of its suffixes elsewhere.
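The idea can also be sketched as a pre-processing pass over the sorted keys, outside of FST construction. This is only an illustration under simplifying assumptions: the prefix length is fixed (`prefix_len`), the threshold is applied as `>= min`, and `Split` and `split_by_prefix` are hypothetical names, not part of the `fst` API. Groups below the threshold are kept as whole keys.

```rust
/// Result of the hypothetical pre-processing pass.
#[derive(Debug, PartialEq)]
pub struct Split {
    /// Prefixes with at least `min` keys under them, each with its suffixes.
    /// Only the prefix would go into the FST; the suffixes go elsewhere.
    pub frozen: Vec<(String, Vec<String>)>,
    /// Keys whose prefix group was too small; kept whole.
    pub leftover: Vec<String>,
}

/// `sorted_keys` must be in lexicographic order, so that all keys
/// sharing a prefix are contiguous.
pub fn split_by_prefix(sorted_keys: &[&str], prefix_len: usize, min: usize) -> Split {
    let mut out = Split { frozen: Vec::new(), leftover: Vec::new() };
    let mut i = 0;
    while i < sorted_keys.len() {
        // `get` returns None if the key is shorter than `prefix_len`
        // or the cut would fall inside a multi-byte character.
        let prefix = match sorted_keys[i].get(..prefix_len) {
            Some(p) => p,
            None => {
                out.leftover.push(sorted_keys[i].to_string());
                i += 1;
                continue;
            }
        };
        // Scan to the first key that no longer starts with the prefix.
        // At that point the prefix is "frozen": no more suffixes can arrive.
        let mut j = i;
        while j < sorted_keys.len() && sorted_keys[j].starts_with(prefix) {
            j += 1;
        }
        let group = &sorted_keys[i..j];
        if group.len() >= min {
            let suffixes: Vec<String> =
                group.iter().map(|k| k[prefix_len..].to_string()).collect();
            out.frozen.push((prefix.to_string(), suffixes));
        } else {
            out.leftover.extend(group.iter().map(|k| k.to_string()));
        }
        i = j;
    }
    out
}
```

For the example set with `prefix_len = 2` and `min = 3`, this yields the frozen prefix "ab" with suffixes "", "c", "d", "e", "f", while "ac" and "ad" stay in `leftover`.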
Like Lucene, it could use a custom freeze function that only writes a node if its number of child nodes is > MIN; otherwise the node is dropped.
This allows saving only part of the data (the prefix) in the FST and saving the rest (the suffix) on disk, which makes the FST small enough to be fully loaded into memory.