
Domain clustering #7

Open
sylvinus opened this issue Feb 26, 2016 · 3 comments

Comments

@sylvinus
Contributor

The current UI demo is limited to homepages, so this issue is not very visible yet, but with the full index it will be a big enhancement to cluster results by domain, both to improve the diversity of the results and to avoid showing results from just a couple of popular domains.

Google has changed its algorithm for this several times: http://searchengineland.com/google-domain-clustering-change-159997

How should we implement it on our side? I don't have an easy answer with our current Elasticsearch setup.

@sylvinus
Contributor Author

sylvinus commented Mar 5, 2016

After discussing this with Doug Cutting, it seems that the top search engines use a 2-step query process: they first fetch the top 100+ results from the index, then post-process them to aggregate/filter them down to 10 results.

This is obviously more complicated, but it enables domain clustering and also makes it easier for developers to tweak results without a full re-index. A large number of rules could be implemented in this post-processing step, and they would be quite easy to unit test properly.

Pagination will break, though. One solution for fetching page 3, for instance, would be to always compute page 1 (or fetch it from a cache?), and then sort results only by the Elasticsearch score from page 2 onwards (excluding documents already shown on page 1). A sketch of this two-step approach is below.
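A minimal sketch of what this two-step process could look like, assuming step 1 has already over-fetched hits from Elasticsearch as (url, score) pairs sorted by score; the `max_per_domain` cap, the page-size constants, and the function names are illustrative assumptions, not part of our current setup:

```python
from urllib.parse import urlparse

PAGE_SIZE = 10    # results shown per page
FETCH_SIZE = 100  # step 1: over-fetch this many hits from the index

def domain_of(url):
    """Clustering key: the host of the result URL (assumes full URLs with scheme)."""
    return urlparse(url).netloc.lower()

def page_one(hits, max_per_domain=2):
    """Step 2: post-process the over-fetched hits into page 1,
    capping how many results any single domain can contribute."""
    per_domain = {}
    page = []
    for url, score in hits:  # hits are assumed sorted by score, descending
        d = domain_of(url)
        if per_domain.get(d, 0) >= max_per_domain:
            continue  # this domain already used up its slots on page 1
        per_domain[d] = per_domain.get(d, 0) + 1
        page.append((url, score))
        if len(page) == PAGE_SIZE:
            break
    return page

def page_n(hits, n):
    """Pages >= 2: plain score order, excluding documents already shown on page 1."""
    shown = {url for url, _ in page_one(hits)}
    rest = [(url, s) for url, s in hits if url not in shown]
    start = (n - 2) * PAGE_SIZE
    return rest[start:start + PAGE_SIZE]
```

Additional re-ranking rules would slot into `page_one`, which also keeps them easy to unit test in isolation from the index.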

@OriPekelman

1. You should consider aggregating by URL prefixes and not by domain. Some domains can be huge, and the significance of subdomains vs. prefixes is not stable (blog.example.com vs. example.com/blog); a sketch of such a prefix key follows this list.
2. Pagination should not be that much of an issue as long as the clustering is more or less predictable. Nobody cares about results below the 300th; who cares what is on page 30, or about the precise ordering of the 400th result? So it's enough to pre-fetch enough results to build three nice initial pages and use whatever method to display the others (either you don't care and accept duplicates there, or we do some cleanup).
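A minimal sketch of such a prefix key, using a naive "last two labels are the registered domain" heuristic (a real implementation would need the public-suffix list to handle hosts like example.co.uk); the `depth` parameter and the function name are hypothetical:

```python
from urllib.parse import urlparse

def prefix_key(url, depth=1):
    """Clustering key that treats blog.example.com and example.com/blog alike:
    the registered domain plus the first `depth` subdomain/path segments."""
    parts = urlparse(url)
    labels = parts.netloc.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Naive heuristic: last two labels are the registered domain; any leading
    # subdomain is folded into the path (blog.example.com ~ example.com/blog).
    if len(labels) > 2:
        host, extra = ".".join(labels[-2:]), labels[:-2]
    else:
        host, extra = ".".join(labels), []
    segments = extra + [s for s in parts.path.split("/") if s]
    return "/".join([host] + segments[:depth])
```

The clustering/capping logic from the earlier sketch can then be reused unchanged by swapping `domain_of` for `prefix_key`.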

@OriPekelman

Now I remember: I once worked on a lightweight fingerprinting technique (for plagiarism detection), and I tested it to be robust, albeit on a relatively small test set. Here goes:

1. We do whatever initial cleanup is needed to extract the text.
2. We take the most infrequent terms from the document with something like https://www.elastic.co/guide/en/elasticsearch/reference/2.x/docs-termvectors.html, possibly with some filtering to avoid tokens that might be junk.
3. We take the most infrequent term, find its first occurrence, and construct a trigram (the word preceding it, the term itself, and the word following it); we do the same with the second most infrequent term, then the third, and concatenate those trigrams. This is our fingerprint (its length should be tested against the corpus), which we index in a field. Both strict matching and MLT (more-like-this) queries are going to give you very interesting results on this one. Even with a short fingerprint you will find very few collisions. (A sketch follows this list.)
4. The assumption here is that term frequency is mostly stable across a large corpus, so the fingerprint remains stable. Of course, it would be better to have something totally deterministic later.
5. This also mostly works with simpler mechanisms for choosing your trigrams, as long as they are mostly deterministic and rarely fall on very frequent terms, although that will just make for longer fingerprints.
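A minimal sketch of steps 2 and 3, assuming we already have the document's tokens and a corpus-wide document-frequency map (in practice the term vectors API linked above would supply the frequencies); `fingerprint`, `corpus_doc_freq` and `num_terms` are illustrative names, not existing code:

```python
def fingerprint(tokens, corpus_doc_freq, num_terms=3):
    """Build a fingerprint from trigrams around the document's rarest terms."""
    # Step 2: rank the document's distinct terms by corpus rarity,
    # filtering out tokens likely to be junk (short or non-alphabetic).
    candidates = {t for t in tokens if t.isalpha() and len(t) > 2}
    rarest = sorted(candidates, key=lambda t: corpus_doc_freq.get(t, 0))[:num_terms]

    # Step 3: for each rare term, take the trigram around its first occurrence
    # (preceding word, the term itself, following word), then concatenate.
    trigrams = []
    for term in rarest:
        i = tokens.index(term)  # first occurrence
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[i + 1] if i + 1 < len(tokens) else ""
        trigrams.append(" ".join((before, term, after)).strip())

    # The concatenated trigrams are the fingerprint, to be indexed in its own
    # field and queried with strict matching or more-like-this.
    return " | ".join(trigrams)
```

The result is deterministic for a fixed frequency map, which is exactly the stability assumption in point 4; if the corpus-wide frequencies drift, the fingerprint can drift with them.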
