Skip to content

IntelCompH2020/solr-ewb-sims

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

solr-ewb-sims

This code implements a Java plugin for Solr to retrieve pairs of documents with a semantic similarity larger than a certain threshold in a given topic model.

This plugin is a modification of the "Vector Scoring Plugin for Solr : Dot Product and Cosine Similarity" to be adequate for Solr 9 and the purpose stated above.

For it to work, similarities among each pair of documents for a given topic model (model_name) should be indexed in each document in a corpus collection using payload notation (i.e., 1312|99.99 23453|98.7...) in the field sim_{model_name}.

Here, for each document id_i, each component id_x|score_x within its sim_{model_name} field corresponds to the similarity among the document id_i and id_x.

Considering the plugin has been correctly added to Solr, the query can be used as follows:

{!vs f=sim_mallet-10 vector="80,90"}

This query launches the Solr plugin and searches the sim_mallet-10 field of each document in the Cordis collection for documents with a similarity between 80% and 90%. For each document, it returns a double, which indicates the range of those documents that are between this similarity percentage.

Let's consider that for the document with id = 83456, its sim_mallet-10 field is:

sim_mallet-10 = 1312|99.99 23453|98.7 34534|96.5 34534|94.5 4564|93.75 45645|92.24 785|91.14 6876|89.88 45645|88.15 4564|85.47 53456|83.12 35645|82.08 4564|81.45 13563|80.12 13563|78.92….

If the Solr query returns the score 7.13, this means that it will keep the documents between positions 7 and 13 that are within this range. In the example they would be:

6876|89.88 45645|88.15 4564|85.47 53456|83.12 35645|82.08 4564|81.45 13563|80.12

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages