Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle missing and multiple values in script #29611

Conversation

mayya-sharipova
Copy link
Contributor

Previously in script for numeric fields, there was no way to check if
a document is missing a value. Also certain operations on multiple-
values fields were missing.

This PR adds the following:

  • return null for doc['field'] if a document is missing a 'field1':
    Now we can do this:
    if (doc['field'] == null) {return -1;} return doc['field'].value; or
    doc['field']?.value ?: -1

  • add the following functions for multiple-valued numeric fields:
    doc['field'].min returns the minumum amoung values
    doc['field'].max returns the maximum amoung values
    doc['field'].sum returns the sum of amoung values
    doc['field'].avg returns the average of values

Closes #29286

@mayya-sharipova mayya-sharipova force-pushed the missing-and-multiple-values-in-script branch from fda0be2 to cec0419 Compare April 19, 2018 16:02
Previously in script for numeric fields, there was no way to check if
a document is missing a value. Also certain operations on multiple-
values fields were missing.

This PR adds the following:

- return null for doc['field'] if a document is missing a 'field1':
    Now we can do this:
    if (doc['field'] == null) {return -1;} return doc['field'].value;  or
    doc['field']?.value ?: -1

- add the following functions for multiple-valued numeric fields:
    doc['field'].min returns the minumum amoung values
    doc['field'].max returns the maximum amoung values
    doc['field'].sum returns the sum of amoung values
    doc['field'].avg returns the average of values

Closes elastic#29286
@mayya-sharipova mayya-sharipova force-pushed the missing-and-multiple-values-in-script branch from cec0419 to 5aeab57 Compare April 20, 2018 16:34
@mayya-sharipova mayya-sharipova added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Apr 20, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Mayya, thanks for tackling this. The changes seem good. My one concern is this is a backwards incompatible change. Users currently may not be checking the fields for existence and will suddenly get NPE's if they are using the default values for docs with missing fields. One approach to solve this problem would be to have a setting that allows users to turn this missing field return null on in 6.x. If the flag is not set, we give a deprecation warning. Then remove the flag altogether in 7.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Apr 25, 2018

@jdconrad Thanks Jack for your review. Your concern makes sense.

Do you think we should add an index-level setting for 6.x:

"painless.missing_value.null" : true 

The default will be false.
And are you suggesting if this setting is not set, while executing script for all null values, we will output a deprecation warning?

@rjernst what do you think?

@rjernst
Copy link
Member

rjernst commented Apr 25, 2018

I think doing a deprecation either through a setting or a sysprop is a good idea. Deprecating getDate() has been a pain and we have ended up with sometimes flakey tests because we did deprecation logging inside ScriptDocValues.

If doing through a sysprop, we have the advantage of the code change being very simple. We would just need one place somewhere during node startup that checked what value we statically read, and emitting a deprecation log message if it is set to the old behavior.

For a setting, I think we would want this as a cluster setting, since scripts can be run across many indices, and it would not make sense (or be possible, since scripts are compiled once for the node) to vary this per index/shard. However, passing the cluster setting down and updating may be tricky. At least, it will require a few levels of signature changes to IndicesService, QueryShardContext and SearchLookup classes. I think we would want to get a BooleanSupplier down into LeafDocLookup, for which the return value is tied to this cluster setting. This is tricky because we don't currently have ClusterSettings (which is needed to hook up a settings listener) in IndicesService.

As far as when we should emit the deprecation warning, I think in either case (sysprop or cluster setting) it should be during startup.

@jdconrad
Copy link
Contributor

I agree on the one time deprecation warning on start up. Otherwise the logs get filled with a huge amount of spam whenever a script is used.

Previously in script for numeric fields, there was no way to check if
a document is missing a value. Also certain operations on multiple-
values fields were missing.

This PR adds the following:

- add the following functions for multiple-valued numeric fields:
    doc['field'].min returns the minumum amoung values
    doc['field'].max returns the maximum amoung values
    doc['field'].sum returns the sum of amoung values
    doc['field'].avg returns the average of values

- return null for doc['field'] if a document is missing a 'field1':
    Now we can do this:
    if (doc['field'] == null) {return -1;} return doc['field'].value; or
    doc['field']?.value ?: -1
    This new behaviour will only work if the following system property is set:
    `export ES_JAVA_OPTS="-Des.script.null_for_missing_value=true"'

Closes elastic#29286
Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks again for doing this!

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I have some requests.

@@ -19,6 +19,7 @@
=== Breaking Java Changes

=== Deprecations
Returning 0 for missing numeric fields in script is deprecated. PR: 29611
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this, since the changelog is being taken another direction.

about this forthcoming change.


===== Multiple values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that the getting started guide is the place for either of these things to be documented. Maybe @debadair has a better suggestion for where they could fit in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the "examples" section is really the "getting started" guide, and the other pieces should probably be split out. Not sure we want to do that as part of this change, though. For now, I'd rename this Multi-value field operations , put it after the regular expressions section, and bump it up to a level 3 heading so it's a separate page.

will be changed in the next major version of elasticsearch:
if a document is missing a field `field`, `doc['field']` for this document
will return `null`. You can set a system property
`export ES_JAVA_OPTS="-Des.script.null_for_missing_value=true"' on a node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should link to setting jvm.options, since that is OS agnostic.

[float]
===== Missing values

Currently by default, if a document is missing a numeric field `field`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would describe this a bit differently. I suggest using 2 different paragraphs. The first should describe the current behavior (but it is not necessary to say "currently by default"). This should also not state "returns 0" as it is field type dependent what is returned (and all field types behavior should be described).

The next paragraph can then describe the new option. The last sentence can mention the default behavior will change in a future release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to keep the missing values info in the Accessing doc values section. I'd probably pare it down to something like:

If you request the value from a numeric field that isn’t in the document, Painless returns zero.

IMPORTANT: Starting in 7.0, Painless returns a value of null instead of zero if the field does not exist. To enable this behavior now, set the following system property:
+
export ES_JAVA_OPTS="-Des.script.null_for_missing_value=true”
+
If you do not enable this behavior, a deprecation warning is logged on start up.

@@ -25,6 +25,7 @@ esplugin {
}

integTestCluster {
systemProperty 'es.script.null_for_missing_value', 'true'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have another integ test where this value is left out, to ensure the old behavior still works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst Thanks for the feedback. What is a proper way to create another integ test?

I have got the following code, but I am not sure how to add module mapper-extras to additionalClusterTestCluster as well.

integTestCluster {
  module project.project(':modules:mapper-extras')
}

Task additionalClusterTest = tasks.create(
        name: "additionalClusterTest", type: RestIntegTestTask)
additionalClusterTestCluster {
  systemProperty 'es.script.null_for_missing_value', 'true'
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how to add module mapper-extras to additionalClusterTestCluster

There is not need to add it, since rest integ tests run against the entire distribution (ie it will have all modules).

Copy link
Member

@rjernst rjernst May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it would be better to do it. So, use this for the cluster config:

systemProperty 'es.script.null_for_missing_value', 'true'
distribution = 'integ-test-zip'
module project // add the lang-painless module itself
module project.project(':modules:mapper-extras')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst thanks a lot for the suggestion

@@ -718,6 +718,10 @@ public void onTimeout(TimeValue timeout) {
writePortsFile("transport", transport.boundAddress());
}

if (!ScriptModule.NULL_FOR_MISSING_VALUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== false please

@@ -718,6 +718,10 @@ public void onTimeout(TimeValue timeout) {
writePortsFile("transport", transport.boundAddress());
}

if (!ScriptModule.NULL_FOR_MISSING_VALUE)
logger.warn("Script: returning 0 for missing numeric values is deprecated. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be using a DeprecationLogger, so the message goes to the deprecation log.

@@ -91,7 +92,8 @@ public void setDocument(int docId) {
localCacheFieldData.put(fieldName, scriptValues);
}
try {
scriptValues.setNextDocId(docId);
boolean docHasValues = scriptValues.setNextDocId(docId);
if (!docHasValues && ScriptModule.NULL_FOR_MISSING_VALUE) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== false please

@@ -83,7 +83,7 @@
scripts.put("doc['random_score']", vars -> {
Map<?, ?> doc = (Map) vars.get("doc");
ScriptDocValues.Doubles randomScore = (ScriptDocValues.Doubles) doc.get("random_score");
return randomScore.getValue();
if (randomScore != null) {return randomScore.getValue();} else return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to be on a single line, please use a ternary statement. The same applies to the other tests here that were changed in a similar way.

[float]
===== Missing values

Currently by default, if a document is missing a numeric field `field`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to keep the missing values info in the Accessing doc values section. I'd probably pare it down to something like:

If you request the value from a numeric field that isn’t in the document, Painless returns zero.

IMPORTANT: Starting in 7.0, Painless returns a value of null instead of zero if the field does not exist. To enable this behavior now, set the following system property:
+
export ES_JAVA_OPTS="-Des.script.null_for_missing_value=true”
+
If you do not enable this behavior, a deprecation warning is logged on start up.

about this forthcoming change.


===== Multiple values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the "examples" section is really the "getting started" guide, and the other pieces should probably be split out. Not sure we want to do that as part of this change, though. For now, I'd rename this Multi-value field operations , put it after the regular expressions section, and bump it up to a level 3 heading so it's a separate page.


There is a number of operations designed for numeric fields,
if a document has multiple values in such a field:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd edit the intro to:

Painless supports the following operations for multi-value numeric fields:

In the list, I'd drop "among values" from min & max. For sum and average, I'd probably say "of the values" rather than "of all values".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@debadair thanks for the suggestion, noted

Previously in script for numeric fields, there was no way to check if
a document is missing a value. Also certain operations on multiple-
values fields were missing.

This PR adds the following:

- add the following functions for multiple-valued numeric fields:
    doc['field'].min returns the minumum amoung values
    doc['field'].max returns the maximum amoung values
    doc['field'].sum returns the sum of amoung values
    doc['field'].avg returns the average of values

- return null for doc['field'] if a document is missing a 'field1':
    Now we can do this:
    if (doc['field'] == null) {return -1;} return doc['field'].value; or
    doc['field']?.value ?: -1
    This new behaviour will only work if the following system property is set:
    `export ES_JAVA_OPTS="-Des.script.null_for_missing_value=true"'
@mayya-sharipova mayya-sharipova force-pushed the missing-and-multiple-values-in-script branch from 3b58d22 to 89a3bd6 Compare May 18, 2018 20:01
@mayya-sharipova mayya-sharipova force-pushed the missing-and-multiple-values-in-script branch from 89a3bd6 to 3b5f421 Compare May 18, 2018 20:26
@mayya-sharipova
Copy link
Contributor Author

@rjernst Ryan, I have addressed all your and Deb's comments. Can you please continue the review when you have available time. Thanks.

@rjernst rjernst removed the review label Oct 10, 2018
@mayya-sharipova
Copy link
Contributor Author

Closed in favour of #30975

Multiple values still need to be addressed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants