Add WeightedAvg metric aggregation #31037

Merged
merged 15 commits into from
Jul 23, 2018

polyfractal
Contributor

@polyfractal polyfractal commented Jun 1, 2018

WIP, but putting this up to see how @colings86 feels about the MultiValueSource stuff. Still needs loads of tests, comments and documentation.

Notable changes in this PR:

  • A new MultiValueSource and its associated classes have been added. These allow aggs to define multiple sources of values, and to define a script, format, multi-value mode, etc. for each one independently.
  • The old MultiValueSource and related kin (from the Matrix Agg module) are renamed to ArrayValueSource, because it takes multiple fields in an array. I couldn't find a good way to refactor the matrix aggs to use the new multi-value style, but didn't want to leave the name the same (it also caused conflicts).
  • A new overload for MultiValueMode was added. I wanted to reuse the capabilities of MultiValueMode, but all the existing selectors always return true for advanceDoc() and set a default value. I wanted the normal advanceDoc() behavior, plus the multi-mode avg/sum/min/max functionality when a field has multiple values.
  • Adds a WeightedAvg metric agg which uses the new functionality.

The new multivalue stuff tries to be reasonably generic, allowing the agg to define how fields are exposed via helpers. For example, the weighted_avg defines two fields like this:

{
  "weighted_avg": {
    "value": {        // first defined field
      "field": "value_field",
      "script": {...},
      "missing": "..."
    },
    "weight": {       // second defined field
      "field": "weight_field",
      "script": {...},
      "missing": "..."
    },
    "format": "...",      //common fields
    "value_type": "..."
  }
}

Closes #15731

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

Contributor

@jpountz jpountz left a comment

I left some comments.

new overload for MultiValueMode was added. I wanted to reuse the capabilities of MultiValueMode, but all the existing selectors always returned true for advanceDoc() and set a default value

Agreed, we should refactor MultiValueMode to decouple selection from applying a default value.

docWeights.advanceExact(doc);
final double weight = docWeights.doubleValue();

weights.increment(bucket, weight);
Contributor

should we sum up weights using kahan summation too?

Contributor Author

++ good point, should definitely have kahan summation too.
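
For reference, the difference Kahan summation makes can be sketched in a few lines. This is a standalone illustration and not the aggregator code from this PR: the class and method names are made up, and a plain `double[]` stands in for the per-bucket doc values.

```java
// Illustrative sketch only: KahanSketch and its methods are hypothetical names,
// not classes from this PR.
final class KahanSketch {

    /** Naive left-to-right summation; low-order bits can be lost at each step. */
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    /** Kahan (compensated) summation. */
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0; // running estimate of the rounding error so far
        for (double v : values) {
            double corrected = v - compensation;       // re-apply the error lost earlier
            double newSum = sum + corrected;            // this addition may round
            compensation = (newSum - sum) - corrected;  // recover what was rounded away
            sum = newSum;
        }
        return sum;
    }
}
```

With the input `{1e16, 1.0, 1.0, -1e16}`, the naive loop returns `0.0` because each `1.0` is lost against `1e16`, while the compensated loop returns `2.0`. The same loss can affect a running sum of weights, which is why compensating both sums seems reasonable.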

compensations = bigArrays.grow(compensations, bucket + 1);

if (docValues.advanceExact(doc)) {
docWeights.advanceExact(doc);
Contributor

can you assert that it returns true?

Contributor Author

👍

@polyfractal polyfractal added review and removed WIP labels Jun 19, 2018
@polyfractal
Contributor Author

Ok, added some more tests, documentation and some small fixes. I think this is ready for a review now.

Contributor

@colings86 colings86 left a comment

@polyfractal I left some changes but I like where this is going.

|Parameter Name |Description |Required |Default Value
|`field` | The field that weights should be extracted from |Required |
|`missing` | A weight to use if the field is missing entirely |Optional |
|`multi` | If a document has multiple values for the field, how should the values be combined |Optional | `avg`
Contributor

Should this say weights instead of values?

|`field` | The field that weights should be extracted from |Required |
|`missing` | A weight to use if the field is missing entirely |Optional |
|`multi` | If a document has multiple values for the field, how should the values be combined |Optional | `avg`
|`script` | A script which provides the values for the document. This is mutually exclusive with `field` |Optional
Contributor

Should this say weights instead of values?

double newSum = sum + corrected;
sumCompensation = (newSum - sum) - corrected;
sum = newSum;
}
Contributor

Could you add comments to the loop to explain why we need each of the conditions?

Contributor Author

I'm actually not entirely sure. @jpountz, are the conditionals there to keep the naive behavior with infinities? E.g. if an infinite value is added, the final value becomes infinite, whereas Kahan summation would do something different?

So it's basically bwc with how we did things before?
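
One way to check this hunch is to run plain Kahan summation over an input that contains an infinity. The sketch below is a standalone illustration with hypothetical class and method names, not the code under review. Once an infinite value is added, the compensation term becomes `Infinity - Infinity = NaN` and poisons every later step, whereas a naive sum stays at `Infinity`. The `guardedSum` method shows one plausible shape for a guard that keeps the naive result; whether it matches the intent of the conditionals in this PR is an assumption.

```java
// Illustrative sketch only: all names here are hypothetical.
final class KahanInfinitySketch {

    /** Naive summation: an infinite input simply makes the sum infinite. */
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    /** Unguarded Kahan summation: an infinite input turns the compensation into NaN. */
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            double corrected = v - compensation;
            double newSum = sum + corrected;
            compensation = (newSum - sum) - corrected; // Infinity - Infinity = NaN
            sum = newSum;
        }
        return sum;
    }

    /** Kahan summation that falls back to naive addition for non-finite inputs. */
    static double guardedSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            if (!Double.isFinite(v)) {
                // Infinity or NaN: add naively so the compensation is never touched.
                sum += v;
            } else if (Double.isFinite(sum)) {
                // Normal case: compensated step on finite operands only.
                double corrected = v - compensation;
                double newSum = sum + corrected;
                compensation = (newSum - sum) - corrected;
                sum = newSum;
            }
            // Otherwise the sum is already non-finite and a finite value cannot change it.
        }
        return sum;
    }
}
```

For `{1.0, Double.POSITIVE_INFINITY, 2.0}`, `naiveSum` and `guardedSum` both return `Infinity`, while `kahanSum` returns `NaN`.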

DoubleArray sums;
DoubleArray sumCompensations;
DoubleArray weightCompensations;
DocValueFormat format;
Contributor

can these be made private and final?

Contributor Author

Private yes, but not final. They are grown down in the collector (e.g. weights = bigArrays.grow(weights, bucket + 1);)
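
To illustrate why the fields cannot be final, here is a minimal sketch in which a plain `double[]` stands in for a `DoubleArray`. The class and its methods are hypothetical. The point is only that growing returns a new array that has to be assigned back to the field, the same pattern as the `weights = bigArrays.grow(...)` call quoted above.

```java
import java.util.Arrays;

// Illustrative sketch only: GrowableSumsSketch is a hypothetical stand-in,
// using a plain array instead of the DoubleArray used by the aggregator.
final class GrowableSumsSketch {

    // Private, but not final: growing replaces the array reference.
    private double[] sums = new double[1];

    /** Adds a value to the given bucket, growing the backing array if needed. */
    void add(int bucket, double value) {
        if (bucket >= sums.length) {
            // copyOf returns a new, larger array; the field must be reassigned.
            sums = Arrays.copyOf(sums, Math.max(bucket + 1, sums.length * 2));
        }
        sums[bucket] += value;
    }

    /** Returns the running sum for a bucket, or 0.0 if the bucket was never touched. */
    double get(int bucket) {
        return bucket < sums.length ? sums[bucket] : 0.0;
    }
}
```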

import java.util.Map;
import java.util.Objects;

public abstract class MultiValuesSourceAggregationBuilder<VS extends ValuesSource, AB extends MultiValuesSourceAggregationBuilder<VS, AB>>
Contributor

I think it's worth adding a JavaDoc to this class. Additionally, I would point out in the JavaDoc that this class assumes all ValuesSources are of the same value type. I think this is a fine assumption to make, at least for now, but it's worth pointing out.

Contributor Author

Ah good point, I didn't even think of that limitation. Will document.

If/when we need multiple value source types... that's gonna get fun :/

import java.io.IOException;
import java.util.function.BiFunction;

public class MultiValuesSourceFieldConfig implements Writeable, ToXContentFragment {
Contributor

Could this maybe wrap ValuesSourceConfig, so we ensure that as features are added to one they are also added to the other?

Contributor Author

So the layout is a bit tricky, and naming is maybe (probably) confusing. Open to suggestions.

MultiValuesSourceFieldConfig is spiritually related to ValuesSourceParseHelper#declareFields(), in that it is basically the parser and builder object for the commonly shared fields.

I think MultiValuesSourceConfig is closer to what you were expecting, which is the final object. This contains a map of fields, where each entry's value is a Wrapper object containing a ValuesSourceConfig and a MultiValueMode.

So the underlying features should be shared, but the parsing is indeed still different. I can see if there's a way to share the parsing, but it may be tricky since the regular ValuesSourceConfig also defines a targetValueType as part of the common fields, but that only applies to the total MultiValuesSource, not the individual fields.

I'll poke at it a bit. Might be easier to zoom about this when you're back too.

@polyfractal
Contributor Author

Jenkins, run gradle build tests

Fixes an issue where assertions were being tripped on REST tests due
to using the wrong stream ctor
@polyfractal
Contributor Author

@colings86 this should be good to go for another review whenever you have time, no rush :)

@polyfractal
Contributor Author

@colings86 Removed the multi-value mode as discussed, but decided to also remove MultiValuesSourceConfig and just use a Map everywhere. Seemed silly to have a wrapper around the map without any additional functionality, and it didn't save much in the way of typing due to the length of its name vs. a map :)

This is good to go for another review whenever you have a few minutes.

Contributor

@colings86 colings86 left a comment

I left a couple of minor comments but LGTM

If you have this situation, you will need to specify a `script` for the weight field, and use the script
to combine the multiple values into a single value to be used.

This single weight will be applied independently to each value extracted from the `value` field.
Contributor

I wonder if it's worth having an example of a single weight being applied to each value independently, to help solidify what we mean?
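
A small worked example may help here. Assuming a document whose `value` field holds `[1, 2, 3]` and whose weight is `2`, applying the weight to each value independently gives `(1*2 + 2*2 + 3*2) / (2 + 2 + 2) = 2.0`. The sketch below mirrors that arithmetic for a single document. The class and method are hypothetical and are not part of the aggregation code.

```java
// Illustrative sketch only: MultiValueWeightSketch is a hypothetical name.
final class MultiValueWeightSketch {

    /**
     * Weighted average for one document with several values and a single weight.
     * The weight is applied to every value, and counted once per value in the divisor.
     */
    static double weightedAvg(double[] values, double weight) {
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (double v : values) {
            weightedSum += v * weight;
            weightSum += weight;
        }
        return weightedSum / weightSum;
    }
}
```

Calling `weightedAvg(new double[] {1, 2, 3}, 2)` returns `2.0`. Within a single document the weight cancels out, so it only matters once several documents with different weights are combined.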

"single document. Use a script to combine multiple weights-per-doc into a single value.");
}
// There should always be one weight if advanceExact lands us here, either
// a real weight or a `missing` value
Contributor

nit: missing value -> missing weight

@polyfractal polyfractal added v6.4.0 and removed review labels Jul 23, 2018
@polyfractal polyfractal merged commit 6ba144a into elastic:master Jul 23, 2018
polyfractal added a commit that referenced this pull request Jul 23, 2018
Adds a new single-value metrics aggregation that computes the weighted
average of numeric values that are extracted from the aggregated
documents. These values can be extracted from specific numeric
fields in the documents.

When calculating a regular average, each datapoint has an equal "weight"; it
contributes equally to the final value.  In contrast, weighted averages
scale each datapoint differently.  The amount that each datapoint contributes
to the final value is extracted from the document, or provided by a script.

As a formula, a weighted average is `∑(value * weight) / ∑(weight)`.

A regular average can be thought of as a weighted average where every value has
an implicit weight of `1`.

Closes #15731
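
The formula in the commit message above can be sketched directly. This is a minimal illustration under stated assumptions: one value and one weight per document, plain arrays instead of doc values, and hypothetical names throughout. It is not the aggregator implementation and it skips the compensated summation discussed in the review.

```java
// Illustrative sketch only: WeightedAvgSketch is a hypothetical name.
final class WeightedAvgSketch {

    /** Computes sum(value * weight) / sum(weight) over parallel arrays. */
    static double weightedAverage(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must have the same length");
        }
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < values.length; i++) {
            weightedSum += values[i] * weights[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }
}
```

For values `{10, 20}` with weights `{1, 3}` this returns `(10*1 + 20*3) / (1 + 3) = 17.5`. With weights `{1, 1}` it returns `15.0`, the regular average, which matches the note that a regular average is a weighted average with an implicit weight of `1`.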
dnhatn added a commit that referenced this pull request Jul 25, 2018
* 6.x:
  Security: revert to old way of merging automata (#32254)
  Fix a test bug in RangeQueryBuilderTests introduced in the field aliases backport.
  Introduce Application Privileges with support for Kibana RBAC (#32309)
  Undo a debugging change that snuck in during the field aliases merge.
  [test] port linux package packaging tests (#31943)
  Painless: Update More Methods to New Naming Scheme (#32305)
  Tribe: Add error with secure settings copied to tribe (#32298)
  Add V_6_3_3 version constant
  Add ERR to ranking evaluation documentation (#32314)
  [DOCS] Added link to 6.3.2 RNs
  [DOCS] Updates 6.3.2 release notes with PRs from ml-cpp repo (#32334)
  [Kerberos] Add Kerberos authentication support (#32263)
  [ML] Extract persistent task methods from MlMetadata (#32319)
  Backport - Add Snapshots Status API to High Level Rest Client (#32295)
  Make release notes ignore the `>test-failure` label. (#31309)
  [DOCS] Adds release highlights for search for 6.4 (#32095)
  Allow Integ Tests to run in a FIPS-140 JVM (#32316)
  Add support for field aliases to 6.x. (#32184)
  Register ERR metric with NamedXContentRegistry (#32320)
  fixes broken build for third-party-tests (#32315) Relates #31918 / Closes infra/issues/6085
  [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280)
  Rest HL client: Add put watch action (#32026) (#32191)
  Add WeightedAvg metric aggregation (#31037)
  Consistent encoder names (#29492)
  Switch monitoring to new style Requests (#32255)
  specify subdirs of lib, bin, modules in package (#32253)
  Rename ranking evaluation `quality_level` to `metric_score` (#32168)
  Add new permission for JDK11 to load JAAS libraries (#32132)
  Switch x-pack:core to new style Requests (#32252)
  Watcher: Store username on watch execution (#31873)
  Silence SSL reload test that fails on JDK 11
  Painless: Clean up add methods in PainlessLookup (#32258)
  CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185)
  Fail shard if IndexShard#storeStats runs into an IOException (#32241)
  Fix `range` queries on `_type` field for singe type indices (#31756) (#32161)
  AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated
  Add new fields to monitoring template for Beats state (#32085) (#32273)
  [TEST] improve REST high-level client naming conventions check (#32244)
  Check that client methods match API defined in the REST spec (#31825)
dnhatn added a commit that referenced this pull request Jul 25, 2018
* master:
  Security: revert to old way of merging automata (#32254)
  Networking: Fix test leaking buffer (#32296)
  Undo a debugging change that snuck in during the field aliases merge.
  Painless: Update More Methods to New Naming Scheme (#32305)
  [TEST] Fix assumeFalse -> assumeTrue in SSLReloadIntegTests
  Ingest: Support integer and long hex values in convert (#32213)
  Introduce fips_mode setting and associated checks (#32326)
  Add V_6_3_3 version constant
  [DOCS] Removed extraneous callout number.
  Rest HL client: Add put license action (#32214)
  Add ERR to ranking evaluation documentation (#32314)
  Introduce Application Privileges with support for Kibana RBAC (#32309)
  Build: Shadow x-pack:protocol into x-pack:plugin:core (#32240)
  [Kerberos] Add Kerberos authentication support (#32263)
  [ML] Extract persistent task methods from MlMetadata (#32319)
  Add Restore Snapshot High Level REST API
  Register ERR metric with NamedXContentRegistry (#32320)
  fixes broken build for third-party-tests (#32315)
  Allow Integ Tests to run in a FIPS-140 JVM (#31989)
  [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280)
  awaitsfix testRandomClusterStateUpdates
  [TEST] add version skip to weighted_avg tests
  Consistent encoder names (#29492)
  Add WeightedAvg metric aggregation (#31037)
  Switch monitoring to new style Requests (#32255)
  Rename ranking evaluation `quality_level` to `metric_score` (#32168)
  Fix a test bug around nested aggregations and field aliases. (#32287)
  Add new permission for JDK11 to load JAAS libraries (#32132)
  Silence SSL reload test that fails on JDK 11
  [test] package pre-install java check (#32259)
  specify subdirs of lib, bin, modules in package (#32253)
  Switch x-pack:core to new style Requests (#32252)
  awaitsfix SSLConfigurationReloaderTests
  Painless: Clean up add methods in PainlessLookup (#32258)
  Fail shard if IndexShard#storeStats runs into an IOException (#32241)
  AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated
  Remove unnecessary warning supressions (#32250)
  CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185)
  Add new fields to monitoring template for Beats state (#32085)
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019