Domain clustering #7
Comments
After discussing this with Doug Cutting, it seems that the top search engines use a 2-step query process: they first fetch the top 100+ results from the index, then post-process them to aggregate/filter them down to 10 results. This is obviously more complicated, but it enables domain clustering and also makes it easier for developers to tweak results without a full re-index. A large number of rules could be implemented in that step, and they would be quite easy to unit test properly. Pagination will break, though. One solution for fetching page 3, for instance, would be to always compute page 1 (or fetch it from cache?) and then serve the remaining results sorted only by Elasticsearch score from page 2 onward (excluding documents already shown on page 1).
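To make the 2-step idea concrete, here is a minimal sketch of the post-processing step and the pagination workaround. It is only an illustration under assumptions: the hit format (dicts with a `url` key, already sorted by Elasticsearch score), the function names, and the `max_per_domain` cap are all hypothetical, and the clustering key is simply the URL hostname.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased hostname of a URL (assumed clustering key)."""
    return (urlsplit(url).hostname or "").lower()


def first_page(hits, page_size=10, max_per_domain=2):
    """Pick page 1 from score-ordered hits, capping results per domain."""
    seen = Counter()
    page = []
    for hit in hits:
        domain = domain_of(hit["url"])
        if seen[domain] >= max_per_domain:
            continue  # this domain already has enough results on page 1
        seen[domain] += 1
        page.append(hit)
        if len(page) == page_size:
            break
    return page


def get_page(hits, page_number, page_size=10, max_per_domain=2):
    """Page 1 is clustered; later pages are plain score order minus page 1."""
    page_one = first_page(hits, page_size, max_per_domain)
    if page_number == 1:
        return page_one
    shown = {hit["url"] for hit in page_one}
    rest = [hit for hit in hits if hit["url"] not in shown]
    start = (page_number - 2) * page_size
    return rest[start:start + page_size]
```

Since these are pure functions over a list of hits, each rule can be unit tested without touching the index. Note that page 1 can come back shorter than `page_size` when the fetched hits do not cover enough distinct domains, and that deep pages would need more hits fetched than the initial 100+.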
Now I remember that I once worked on a lightweight fingerprinting technique (I did this for plagiarism detection), and in my tests it was robust, albeit on a relatively small test set. Here goes:
The current UIDemo is limited to the homepage, so this issue is not very visible yet. With the full index, though, clustering results by domain will be a big enhancement: it improves the diversity of the results and avoids showing results from just a couple of popular domains.
Google has often changed their algorithm for that: http://searchengineland.com/google-domain-clustering-change-159997
How to implement it on our side? I don't have an easy answer with our current Elasticsearch setup.