-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add randomScore function in script_score query #40186
Add randomScore function in script_score query #40186
Conversation
To make script_score query to have the same features as function_score query, we need to add randomScore function. This function should be able to produce different random scores on different index shards. It also needs to be able to produce random scores based on the internal Lucene Document Ids. To achieve this three variables have been added to the score script context: - _doc for the internal Lucene doc id - _shard for the shard id - _indexName for the index name Closes elastic#31461
Pinging @elastic/es-search |
Hey @mayya-sharipova! @rjernst and I were just discussing how it's not great for the user to have to enter something like shard_id and doc_id into this method. I believe we can make this easier for you and our users with a modified version of ClassBinding, but need a bit more time to think about it. I realize this blocking some other things, so I will focus on an answer for you by tomorrow. |
@jdconrad Thanks Jack, any other way to access share_id or doc_id through ClassBinding or any other way will also work for us. Looking forward to learning more from you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova Thanks for using class bindings here! I'm hopeful this will be easier on the user for random scores. I'm happy to have things (doc id, shard id, etc.) be added to the base script in this case since the information is readily available as part of the script already -- it definitely makes sense to me to do this. I left a few comments and questions.
@rjernst Would you please also take a look at this PR?
} | ||
|
||
public double randomNotReproducible() { | ||
return rnd.nextDouble(); | ||
public double randomScore(String seedValue, int seed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my clarification the seedValue here can change between documents? Or rather is even expected to? Is the seed also expected to change between documents or would it make more sense to consider the seed as read only for a script and have it be part of the constructor?
Also, now that this class binding has direct access to the ScoreScript reference, would it make more sense to have the user just pass in the field name for the seedValue and resolve it from the doc within this class to prevent the user from passing in an arbitrary String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdconrad Right, seedValue
is expected to be changed between documents, so different scores are produced for different documents. And you are right about seed
value, it would make more sense to have it as a part of the constructor. I will modify the code accordingly.
would . it make more sense to have the user just pass in the field name for the seedValue and resolve it from the doc within this class to prevent the user from passing in an arbitrary String?
Good point! I will experiment with this.
public RandomScoreDoc(ScoreScript scoreScript) { | ||
this.scoreScript = scoreScript; | ||
} | ||
public double randomScore(int seed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as the seed above? Should this be considered read-only?
/** | ||
* @return the internal document ID | ||
*/ | ||
public int getDocId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should rename these methods to something like _getDocId(). This way the user cannot access them directly through a script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdconrad Thanks Jack, I will rename these method
} | ||
} | ||
|
||
public void setShard(int shardId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing we probably don't want the user setting the shard id in a script so maybe we can rename this to _setShard? @rjernst What do you think here?
/** | ||
* @return shard id or throws an exception if shard is not set up for this script instance | ||
*/ | ||
public int getShardId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
/** | ||
* @return index name or throws an exception if the index name is not set up for this script instance | ||
*/ | ||
public String getIndex() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
this.shardId = shardId; | ||
} | ||
|
||
public void setIndexName(String indexName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
@elasticmachine run CLA |
@jdconrad Thanks for the review, Jack. I have addressed you feedback in the subsequent commit. This is ready for another round of review. One thing I have noticed that ScriptDocValues::get(index) produces wrong results in a document is missing field. We need to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for adding this!
@elasticmachine run elasticsearch-ci/bwc |
@rjernst Ryan, would like to review this PR as well? Or I can merge it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit. I actually not like having the underscore methods, but they are necessary for now until some refactoring can be done in painless to not automatically expose all getters on the script classes.
@@ -53,26 +51,49 @@ public static double sigmoid(double value, double k, double a){ | |||
return Math.pow(value,a) / (Math.pow(k,a) + Math.pow(value,a)); | |||
} | |||
|
|||
public static final class RandomScore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a comment here like the one on RandomScoreDoc briefly describing how this differs? Maybe call it RandomScoreField?
To make script_score query to have the same features as function_score query, we need to add randomScore function. This function produces different random scores on different index shards. It is also able to produce random scores based on the internal Lucene Document Ids.
* master: (25 commits) [DOCS] Correct keystore commands for Email and Jira actions in Watcher (elastic#40417) [DOCS] Document common settings for snapshot repository plugins (elastic#40475) Remove with(out)-system-key tests (elastic#40547) Geo Point parse error fix (elastic#40447) Handle null retention leases in WaitForNoFollowersStep (elastic#40477) [DOCS] Adds anchors for ruby client (elastic#39867) Mute DataFrameAuditorIT#testAuditorWritesAudits Disable integTest when Docker is not available (elastic#40585) Add randomScore function in script_score query (elastic#40186) Test fixtures krb5 (elastic#40297) Correct ILM metadata minimum compatibility version (elastic#40569) SQL: Centralize SQL test dependencies version handling (elastic#40551) Mute testTracerLog Mute testHttpInput Include functions' aliases in the list of functions (elastic#40584) Optimise rejection of out-of-range `long` values (elastic#40325) Add docs for cluster.remote.*.proxy setting (elastic#40281) Migrate systemd packaging tests from bats to java (elastic#39954) Move top-level pipeline aggs out of QuerySearchResult (elastic#40319) Use openjdk 12 in packer cache script (elastic#40498) ...
With this commit we align our benchmarks with elastic/elasticsearch#40186 which has removed the function `randomReproducible` and added a new function `randomScore`. Relates elastic/elasticsearch#40186
We would like to deprecate function_score query for script_score.
To make script_score query to have the same features
as function_score query, we need to add randomScore
function.
This function should be able to produce different
random scores on different index shards.
It also needs to be able to produce random scores
based on the internal Lucene Document Ids.
To achieve this, three variables have been added to
the score script context:
Closes #31461