Support DelimitedTermFrequencyTokenFilter #27552
We discussed this internally during FixIt Friday. This is an expert token filter that must be used under certain conditions (a tokenizer that does not split on punctuation, a similarity that takes the frequency into account, ...). For this reason we'd like to understand the use cases that this new filter could help with, to see if we can provide a better integration than just a direct mapping.
I have a use case. Given a document, I want to detect keywords and keyphrases using ML and then use them to improve relevance. I could store them as an array in …
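For example (field name and weights invented), the extracted keywords and their ML scores could be encoded using the delimiter syntax this issue is about:

```json
{
  "ml_keywords": "relevance|8 ranking|5 bm25|3"
}
```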
The use case I'm attempting is a slightly unorthodox use of Elastic, but I believe it's a valid one. I have tags on a document that are tokenized and indexed, but that also have vote counts. I'm providing a search facility for these tags that uses the vote count to rank documents above other documents tagged with the same tag but with lower counts. I achieved this with Lucene locally by writing my own duplicating token filter and feeding each term through n times, where n is the vote count. I'm able to do a similar thing with Elastic by pre-filtering the token stream, duplicating the tokens in the request I send to Elastic. This obviously won't work well for high values of n, so it's a temporary solution. In fact, with Lucene I just used the whitespace tokenizer fed into the aforementioned token filter - there's no real text analysis required for my usage, as I'm storing and indexing the tags verbatim; I just want the tf-idf ranking. Thanks for listening!
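Concretely (tag names and counts invented), the duplication workaround means sending something like:

```json
{ "tags": "scifi scifi scifi scifi drama" }
```

whereas with the delimited filter the same weights would be encoded compactly as:

```json
{ "tags": "scifi|4 drama|1" }
```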
Another use case would be to add related terms to a document if the document is short, also known as document expansion. I created a plugin for now: …
cc @elastic/es-search-aggs
One problem this solves is allowing users to store distributed representations of text in a field and use them in ranking. By overwriting term frequency, you can create arbitrary vector representations and have them used efficiently in ranking. For example, if you have a 5-dimensional embedding out of word2vec/doc2vec, you might store that in a field:
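Something like this (a sketch; the field name and values are illustrative):

```json
{
  "doc2vec_embedding": "dim0|23 dim1|67 dim2|12 dim3|54 dim4|35"
}
```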
Here the vector output of doc2vec has been scaled to 0-100. Now computing the similarity between this doc and query terms passed through the same word2vec model becomes rather efficient, as it uses the search engine's built-in TF*IDF scoring. I think Lucene and ES need to do a better job of supporting arbitrary vector math like this; I'm finding in my work that clients regularly need this kind of functionality. Granted, I don't know if this is the absolute best way to implement it. Perhaps a new field type that could store multidimensional data would be better, as syntactic sugar around this approach. But this is one way I see this being used.
@softwaredoug that's a great use case, and another example of where this feature could reduce the work done at index/query time while increasing the flexibility of the API. Imagine if you could write code in your own app to produce a term frequency map for a field and then pass that to ES. It would be a game changer! I'm currently still hacking around the fact that I can't customise term freq by doing term stuffing.
I have @gkinsman's use case as well (vote counts or impressions).
Ah interesting @jpountz, I haven't seen those before - I might have to take a look. I have two fields per doc for tags: one is the original stored text field containing the tag and count, e.g. "tag1|3 tag2|4 tag3|1". I should also mention that both fields use just a whitespace analyzer, as the terms are indexed verbatim.
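The mapping is roughly this (field names assumed; the second field is the one stuffed with repeated tokens for scoring):

```json
{
  "mappings": {
    "properties": {
      "tags_original": { "type": "text", "analyzer": "whitespace", "store": true },
      "tags_stuffed":  { "type": "text", "analyzer": "whitespace" }
    }
  }
}
```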
Out of curiosity, what is the number of unique tags that you have in your index? Is it low-cardinality and conceptually closer to a field (e.g. …), or is it high-cardinality?
Sorry, one more question: do you disable norms on this term-stuffing field or not?
The latter - I have high-cardinality tags :). I don't currently disable norms on the field, but it looks like I should, which would save disk space.
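If so, it's a one-line mapping change (field name carried over from the sketch above):

```json
{
  "properties": {
    "tags_stuffed": { "type": "text", "analyzer": "whitespace", "norms": false }
  }
}
```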
@jpountz thanks, that looks promising! I'm wondering, though, why I couldn't feed the value I index directly into scoring? It seems I have to apply saturation, log, or sigmoid. I don't think I could implement a dot product using that feature very easily. To do a dot product, I would need to take the term freq directly and multiply it at query time by a query-time boost.
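Spelled out (my notation, not an existing API), the score I'm after is just:

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{boost}_q(t) \cdot \mathrm{tf}(t, d)
```

i.e. the dot product of query-side weights and indexed term frequencies.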
It was designed this way because such functionality is easier to evolve over time when it has few, constrained parameters and its use case is well understood. I do agree that it doesn't address all use cases. @gkinsman needs a way to handle a feature vector with arbitrary (and potentially many) dimensions (the tags), and I have heard about similar use cases in the past, so I think we should do something there as well. For instance, @jimczi mentioned to me that in order to boost transactions in e-commerce, it is possible to attach to documents the search keywords that led to a transaction and store the likelihood of a transaction as a custom term frequency. Then searching this field in addition to the usual designation/description fields is expected to help relevance and conversion rate. I am hopeful that adding something like the new `feature` field would help here.

Once you had your dot product, what would you do with it? Just sum it up with the BM25 score?
Some of this is a tangent, probably good for another issue related to working with embeddings or other vector representations in the search engine.

On what I'd do with it: certainly use it as a feature for learning to rank - embeddings are often fed as input to models. I'm sort of inspired by what Vespa does in giving a very math-looking interface that takes arbitrary tensors or other values as input and implements a direct scoring function. In part they do this because they can't anticipate whatever exotic vector-based model you'll want to use. I'm not sure I'd advocate Elasticsearch going that radically in that direction. For more manual cases, certainly a summation would be useful. Honestly, it could be hard to anticipate completely. In a way it's a different kind of text similarity, comparable to a …

The problem is that word2vec models are basically a giant matrix of every word/document and their corresponding vector representation - very memory-intensive to keep in the search engine. Perhaps there are alternate embedding methods that are more memory efficient?

Alternatively, or in addition, I wonder if an explicit mapping type that took a vector would be useful? In your example you use object notation, but could you also do just a list?
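For instance (just a sketch, reusing the field name from the example above):

```json
{
  "doc2vec_embedding": [23, 67, 12, 54, 35]
}
```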
Or, if you truly are looking to support vectors ranging over [-1.0, 1.0], then hide the details in Elasticsearch entirely (whether it's term freq or a multivalued numerical field underneath):
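Something like this hypothetical mapping and document (the `vector` type and `dims` parameter are invented for illustration; nothing like this exists yet):

```json
{
  "mappings": {
    "properties": {
      "doc2vec_embedding": { "type": "vector", "dims": 5 }
    }
  }
}
```

```json
{ "doc2vec_embedding": [-0.23, 0.67, 0.12, -0.54, 0.35] }
```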
Then a "feature" query (or whatever other 'dot product' query) could do the rest.

Finally, it's important to recognize these vectors could mean more than text; they could be:

…

So a general solution that lets you store and compute vector similarities could be useful for a lot of interesting 'search' applications.
There are two end use cases that immediately come to mind.
1. We are often trying to boost content based on some metric of popularity. Often it is a list of users who have liked or commented on something. By adding the user ids as a list of terms, we can give some weighting, and IDF somewhat corrects for users who like tons of content. Controlling the TF, though, would allow us to give extra weight to particular users who tend to do a good job of recommending content within a specific category or on certain sites. We use this sort of ranking for search, related posts, and content recommendations.

2. We've also looked at classifying sites by topic - something like the Chinese Restaurant Process to group together similar sites. Each site could then be weighted by how much it belongs to a particular topic.
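As a sketch, both signals could be encoded as custom term frequencies (user ids, counts, and topic names invented):

```json
{
  "liked_by": "user_42|5 user_7|1 user_19|2",
  "site_topics": "cooking|80 photography|15"
}
```

Here `user_42|5` would mean that user's like counts five times as much as a baseline like, and the topic weights express degree of topic membership.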
I recently convinced our team to move from developing on Neo4j to ES, but it was not a simple mapping because of the NLP they were applying. In particular, they were doing:

…
We had to make a number of compromises to get it to work in ES. We got it working, but it was not easy to line it up with how data scientists really think about such problems.

Also, if you want my very open source, opinionated reason for wanting this for the long term:

> …
>
> -- Me

Ultimately, Lucene is tuned to process term frequencies very fast and efficiently. Controlling those term frequencies provides a lot of flexibility and speed versus the other workarounds.
This field is similar to the `feature` field but is better suited to indexing sparse feature vectors. A use case for this field could be to record topics associated with every document alongside a metric that quantifies how well the topic is connected to that document, and then boost queries based on the topics that the logged-in user is interested in. Relates elastic#27552
@jpountz Is there an advantage or difference between using …
Yes. Scoring functions are black boxes to …
Feature Request
I'd like to use the DelimitedTermFrequencyTokenFilter that was added to Lucene 7, but it doesn't appear to be available via Elastic's API yet.
AFAIK it just needs to be added to CommonAnalysisPlugin.java, as a factory and token filter already exist within Lucene. I'd prefer not to have to use a custom plugin for this.
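For what it's worth, if it were exposed in the analysis API (the filter name `delimited_term_freq` below is a guess, not a committed API), configuring it might look something like:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tags_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["delimited_term_freq"]
        }
      }
    }
  }
}
```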
Thanks!