Expose Lucene's FeatureField. #30618

Merged
merged 9 commits into from
May 23, 2018

jpountz
Contributor

@jpountz jpountz commented May 15, 2018

Lucene has a new `FeatureField` which gives the ability to record numeric
features as term frequencies. Its main benefit is that it allows boosting
queries with the values of these features while efficiently skipping
non-competitive documents using block-max WAND and indexed impacts.

@jpountz jpountz added the >feature, release highlight, :Search Foundations/Mapping, and v7.0.0 labels May 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova
Contributor

mayya-sharipova commented May 16, 2018

@jpountz Thanks Adrien, this is a very interesting and necessary feature. Excited to have it in Elasticsearch!

I am wondering if there is an intention to index multiple values for a feature. With your current PR, if I try to index multiple values:

{ "index" : { "_index" : "findex", "_type" : "_doc", "_id" : "2" } }
{  "text" : "newspaper", "pagerank": [100.0, 200] }

I am getting the following in the explanation of query score (looks like multiple values got converted to the max float value):

"_explanation": {
	"value": 88.72284,
	"description": "Log function on the _feature field for the pagerank feature, computed as w * log(a + S) from:",
	"details": [
		{
			"value": 1.0,
			"description": "w, weight of this function",
			"details": []
		},
		{
			"value": 4.0,
			"description": "a, scaling factor",
			"details": []
		},
		{
			"value": 3.4028235E38,
			"description": "S, feature value",
			"details": []
		}
	]
}
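
As a sanity check on the explanation above, here is a minimal Python sketch that recomputes `w * log(a + S)` from the reported values. The helper name `log_feature_score` is hypothetical and not part of the PR, and the sketch assumes the natural logarithm, which is consistent with the reported score.

```python
import math


def log_feature_score(w: float, a: float, s: float) -> float:
    """Hypothetical helper: w * ln(a + S), following the explanation's description."""
    return w * math.log(a + s)


# Values copied from the explanation output above.
# S = 3.4028235E38 is the largest finite 32-bit float, as noted in the comment.
score = log_feature_score(w=1.0, a=4.0, s=3.4028235e38)
print(score)
```

The result agrees with the reported `88.72284` to within float rounding, which supports the reading that the two indexed values ended up as the maximum float value.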

@jpountz
Contributor Author

jpountz commented May 16, 2018

Thanks @mayya-sharipova, this is a very good observation, we should reject multi-valued fields explicitly!

Contributor

@jimczi jimczi left a comment

LGTM

ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
spectacular.
Contributor

🥇

@mayya-sharipova
Contributor

@jpountz Thanks. Just one thing I want to clarify for myself. What does this phrase mean in the explanation below? What is w/2 here?

"k, pivot feature value that would give a score contribution equal to w/2"

"_explanation": {
	"value": 0.13026857,
	"description": "Saturation function on the _feature field for the url_length feature, computed as w * S / (S + k) from:",
	"details": [
		{
			"value": 1.0,
			"description": "w, weight of this function",
			"details": []
		},
		{
			"value": 0.33333334,
			"description": "k, pivot feature value that would give a score contribution equal to w/2",
			"details": []
		},
		{
			"value": 0.049926758,
			"description": "S, feature value",
			"details": []
		}
	]
}

@jpountz
Contributor Author

jpountz commented May 17, 2018

Thanks for testing @mayya-sharipova. It should be S/2 indeed, w doesn't make sense.

@jpountz
Contributor Author

jpountz commented May 18, 2018

Actually I read too quickly: the current explanation is correct. The score is computed as w * S / (S + k), so when S is equal to k, this becomes w * k / (k + k) = w / 2. I suspect it might be a bit confusing because on the Lucene side we give users an explicit way to configure the boost (w), since from a Lucene perspective that would otherwise be another query wrapper to use. However, I don't think we need to expose it to Elasticsearch users since all queries already support configuring a boost. So if you needed to boost the impact of the feature query by 2, you could just do:

{
  "query": {
    "feature": {
      "field": "pagerank",
      "boost": 2
    }
  }
}
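
The pivot behaviour described above can be checked numerically. The following is a minimal Python sketch, not code from the PR: the helper name `saturation_score` is hypothetical, and it assumes that a query boost simply multiplies the weight `w`.

```python
def saturation_score(w: float, s: float, k: float) -> float:
    """Hypothetical helper: w * S / (S + k), following the explanation's description."""
    return w * s / (s + k)


# Values copied from the url_length explanation earlier in the thread.
w, k = 1.0, 0.33333334
reported = saturation_score(w, 0.049926758, k)

# When the feature value S equals the pivot k, the contribution is w / 2.
at_pivot = saturation_score(w, k, k)

# Assumption: "boost": 2 acts like doubling w, so the pivot contribution doubles too.
boosted_at_pivot = saturation_score(2.0 * w, k, k)

print(reported, at_pivot, boosted_at_pivot)
```

With these inputs, `reported` matches the `0.13026857` from the explanation to within float rounding, and `at_pivot` is half of `w`.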

@mayya-sharipova
Contributor

@jpountz Thanks for the detailed explanation

@jpountz jpountz merged commit 886db84 into elastic:master May 23, 2018
@jpountz jpountz deleted the feature_field branch May 23, 2018 06:55
dnhatn added a commit that referenced this pull request May 24, 2018
* master:
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (#30820)
  Painless: Types Section Clean Up (#30283)
  Add support for indexed shape routing in geo_shape query (#30760)
  [test] java tests for archive packaging (#30734)
  Revert "Make http pipelining support mandatory (#30695)" (#30813)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Expose Lucene's FeatureField. (#30618)
  Fix a grammatical error in the 'search types' documentation.
  Remove http pipelining from integration test case (#30788)
Labels
>feature release highlight :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0-beta1
5 participants