
Domain clustering #7

Open
sylvinus opened this issue Feb 26, 2016 · 3 comments

Comments

@sylvinus
Contributor

The current UI demo is limited to homepages, so this issue is not very visible yet, but with the full index it will be a big enhancement to cluster results by domain, both to improve the diversity of the results and to avoid showing results from just a couple of popular domains.

Google has changed its algorithm for this several times: http://searchengineland.com/google-domain-clustering-change-159997

How should we implement it on our side? I don't have an easy answer with our current Elasticsearch setup.

@sylvinus
Contributor Author

sylvinus commented Mar 5, 2016

After discussing this with Doug Cutting, it seems that the top search engines use a 2-step query process: they first fetch the top 100+ results from the index, then post-process them to aggregate/filter them down to 10 results.

This is obviously more complicated, but it enables domain clustering and also makes it easier for developers to tweak results without a full re-index. A large number of rules could be implemented in this post-processing step, and they would be quite easy to unit test properly.

Pagination will break, though. One solution for fetching page 3, for instance, would be to always compute page 1 (or fetch it from a cache?), and then sort results only by the Elasticsearch score from page 2 onwards (excluding documents already shown on page 1). A sketch of this two-step approach is below.
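A minimal sketch of what this two-step process could look like, assuming step 1 has already over-fetched hits from Elasticsearch as (url, score) pairs sorted by score; the `max_per_domain` cap, the page-size constants, and the function names are illustrative assumptions, not part of our current setup:

```python
from urllib.parse import urlparse

PAGE_SIZE = 10    # results shown per page
FETCH_SIZE = 100  # step 1: over-fetch this many hits from the index

def domain_of(url):
    """Clustering key: the host of the result URL (assumes full URLs with scheme)."""
    return urlparse(url).netloc.lower()

def page_one(hits, max_per_domain=2):
    """Step 2: post-process the over-fetched hits into page 1,
    capping how many results any single domain can contribute."""
    per_domain = {}
    page = []
    for url, score in hits:  # hits are assumed sorted by score, descending
        d = domain_of(url)
        if per_domain.get(d, 0) >= max_per_domain:
            continue  # this domain already used up its slots on page 1
        per_domain[d] = per_domain.get(d, 0) + 1
        page.append((url, score))
        if len(page) == PAGE_SIZE:
            break
    return page

def page_n(hits, n):
    """Pages >= 2: plain score order, excluding documents already shown on page 1."""
    shown = {url for url, _ in page_one(hits)}
    rest = [(url, s) for url, s in hits if url not in shown]
    start = (n - 2) * PAGE_SIZE
    return rest[start:start + PAGE_SIZE]
```

Additional re-ranking rules would slot into `page_one`, which also keeps them easy to unit test in isolation from the index.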

@OriPekelman

1. You should consider aggregating by URL prefixes and not by domain. Some domains can be huge, and the significance of subdomains vs. prefixes is not stable (blog.example.com vs. example.com/blog); a sketch of such a prefix key follows this list.
2. Pagination should not be that much of an issue as long as the clustering is more or less predictable. Nobody cares about results below the 300th; who cares what is on page 30, or about the precise ordering of the 400th result? So it's enough to pre-fetch enough results to build three nice initial pages and use whatever method to display the others (either you don't care and accept duplicates there, or we do some cleanup).
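A minimal sketch of such a prefix key, using a naive "last two labels are the registered domain" heuristic (a real implementation would need the public-suffix list to handle hosts like example.co.uk); the `depth` parameter and the function name are hypothetical:

```python
from urllib.parse import urlparse

def prefix_key(url, depth=1):
    """Clustering key that treats blog.example.com and example.com/blog alike:
    the registered domain plus the first `depth` subdomain/path segments."""
    parts = urlparse(url)
    labels = parts.netloc.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Naive heuristic: last two labels are the registered domain; any leading
    # subdomain is folded into the path (blog.example.com ~ example.com/blog).
    if len(labels) > 2:
        host, extra = ".".join(labels[-2:]), labels[:-2]
    else:
        host, extra = ".".join(labels), []
    segments = extra + [s for s in parts.path.split("/") if s]
    return "/".join([host] + segments[:depth])
```

The clustering/capping logic from the earlier sketch can then be reused unchanged by swapping `domain_of` for `prefix_key`.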

@OriPekelman

Now I remember: I once worked on a lightweight fingerprinting technique (for plagiarism detection), and I tested it to be robust, albeit on a relatively small test set. Here goes:

1. We do whatever initial cleanup is needed to extract the text.
2. We take the most infrequent terms from the document with something like https://www.elastic.co/guide/en/elasticsearch/reference/2.x/docs-termvectors.html, possibly with some filtering to avoid tokens that might be junk.
3. We take the most infrequent term, find its first occurrence, and construct a trigram (the word preceding it, the term itself, and the word following it); we do the same with the second most infrequent term, then the third, and concatenate those trigrams. This is our fingerprint (its length should be tested against the corpus), which we index in a field. Both strict matching and MLT (more-like-this) queries are going to give you very interesting results on this one. Even with a short fingerprint you will find very few collisions. (A sketch follows this list.)
4. The assumption here is that term frequency is mostly stable across a large corpus, so the fingerprint remains stable. Of course, it would be better to have something totally deterministic later.
5. This also mostly works with simpler mechanisms for choosing your trigrams, as long as they are mostly deterministic and rarely fall on very frequent terms, although that will just make for longer fingerprints.
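A minimal sketch of steps 2 and 3, assuming we already have the document's tokens and a corpus-wide document-frequency map (in practice the term vectors API linked above would supply the frequencies); `fingerprint`, `corpus_doc_freq` and `num_terms` are illustrative names, not existing code:

```python
def fingerprint(tokens, corpus_doc_freq, num_terms=3):
    """Build a fingerprint from trigrams around the document's rarest terms."""
    # Step 2: rank the document's distinct terms by corpus rarity,
    # filtering out tokens likely to be junk (short or non-alphabetic).
    candidates = {t for t in tokens if t.isalpha() and len(t) > 2}
    rarest = sorted(candidates, key=lambda t: corpus_doc_freq.get(t, 0))[:num_terms]

    # Step 3: for each rare term, take the trigram around its first occurrence
    # (preceding word, the term itself, following word), then concatenate.
    trigrams = []
    for term in rarest:
        i = tokens.index(term)  # first occurrence
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[i + 1] if i + 1 < len(tokens) else ""
        trigrams.append(" ".join((before, term, after)).strip())

    # The concatenated trigrams are the fingerprint, to be indexed in its own
    # field and queried with strict matching or more-like-this.
    return " | ".join(trigrams)
```

The result is deterministic for a fixed frequency map, which is exactly the stability assumption in point 4; if the corpus-wide frequencies drift, the fingerprint can drift with them.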
