Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector/Formula Aggregation Inconsistency #4023

Closed
stanbrub opened this issue Jun 16, 2023 · 8 comments · Fixed by #5368 or #5371
Closed

Vector/Formula Aggregation Inconsistency #4023

stanbrub opened this issue Jun 16, 2023 · 8 comments · Fixed by #5368 or #5371
Assignees
Labels
bug Something isn't working core Core development tasks query engine
Milestone

Comments

@stanbrub
Copy link
Contributor

stanbrub commented Jun 16, 2023

The behavior of vector aggregations compared to in-line formulas or agg_by aggregations that do the same thing is wildly different in the handling of return types and type overflows.

Here are some observations:

  • For integer overflows, no error is thrown to alert the user (Assume savvy users to avoid performance hit of checks?)
  • Summing 2 billion and 1 billion
    • Vector Sum: Returns max signed 32 bit integer value
    • In-line Formula Sum: Returns a negative integer
    • AggBy Sum: Returns a long with the correct value

This could be confusing users. Fixing the Inline Formula is simple; just cast the operands to long. But even if there are workarounds for all scenarios, is it not possible to handle overflows consistently?

vector_function_inconsistency.py.txt

@stanbrub stanbrub added bug Something isn't working triage labels Jun 16, 2023
@rcaudy rcaudy added query engine core Core development tasks and removed triage labels Jun 17, 2023
@rcaudy rcaudy added this to the Backlog milestone Jun 17, 2023
@rcaudy
Copy link
Member

rcaudy commented Jun 17, 2023

Some of these differences or issues are indeed deliberate. Others are worth discussing.

@chipkent
Copy link
Member

chipkent commented Apr 15, 2024

Contents of the example file:

from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'int_vector_sum = sum(values)',
    'int_formula_sum = values[0] + values[1]',
    'long_formula_sum = (long)values[0] + values[1]',
    'int_vector_avg = avg(values)',
    'int_formula_avg = (values[0] + values[1]) / 2',
    'long_formula_avg = ((long)values[0] + values[1]) / 2'
])

results2 = tbl.agg_by([
    agg.avg('int_aggby_avg=values'), agg.sum_('int_aggby_sum=values')], 
    ['group'])

image

@chipkent
Copy link
Member

chipkent commented Apr 15, 2024

More ...

from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'int_vector_sum = sum(values)',
    'int_formula_sum = values[0] + values[1]',
    'long_formula_sum = (long)values[0] + values[1]',
    'int_vector_avg = avg(values)',
    'int_formula_avg = (values[0] + values[1]) / 2',
    'long_formula_avg = ((long)values[0] + values[1]) / 2'
])

results2 = tbl.agg_by([
    agg.avg('int_aggby_avg=values'), agg.sum_('int_aggby_sum=values')], 
    ['group'])

m = results.meta_table
m2 = results2.meta_table

image

image

@chipkent
Copy link
Member

Changing to a long column results in consistent results:

image

Going to short also results in overflow.

image

@chipkent
Copy link
Member

Looking through Numeric.ftl, there are a few methods that may need to change:

Plain old bug:

  • percentile: should return the base type. The implementation is just dumb and returns double.

Widen return type because of overflow to be consistent with other aggs:

  • sum
  • product
  • cumsum
  • cumprod

Kind of weird:

  • wsum: currently returns double for all types, but if both inputs are integer types, it might make sense to widen to a long.
  • median: currently returns double for all types. This allows it to handle the average case. I'm not sure there is a better option.
from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table, updateby as uby

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'pct = percentile(0.5, values)',
    'sm = sum(values)',
    'csum = cumsum(values)',
    'cprod = cumprod(values)'
])

results2 = tbl.agg_by([
    agg.pct(0.5, 'pct=values'), agg.sum_('sm=values')], 
    ['group'])

result3 = tbl.update_by([uby.cum_sum("csum=values"), uby.cum_prod("cprod=values")])

m = results.meta_table
m2 = results2.meta_table
m3 = result3.meta_table
image image

@chipkent
Copy link
Member

This illustrates the "weird" wsum case:

from deephaven import empty_table
from deephaven import agg, new_table, updateby as uby

t = empty_table(3).update(["X = i", "Y=2*i"])

results = t.group_by().update(["ws = wsum(X,Y)"])

results2 = t.agg_by([agg.weighted_sum("X", "ws=Y")])

m = results.meta_table
m2 = results2.meta_table
image image

Here the aggregator is using long while the query library is using double.

@lbooker42
Copy link
Contributor

Regarding:

Widen return type because of overflow to be consistent with other aggs:

  • sum
  • product
  • cumsum
  • cumprod

I encountered the type-preserving behavior in these numeric functions when using them to verify UpdateBy operations, but assumed intention and the users would manually widen if overflow was undesirable. I support the idea of default-widening the return types.

Pointing out more inconsistencies, update_by is consistent in widening integral primitives to long but handles float input types variously. Some examples:

  • ema(float) => double
  • cum_sum(float) => float
  • cum_prod(float) => float
  • rolling_sum(float) => float
  • rolling_prod(float) => double

@lbooker42
Copy link
Contributor

This illustrates the "weird" wsum case:

agg.weighted_sum adaptively returns a long or double based on the input type. Is that the behavior we'd like in the library functions?

chipkent added a commit that referenced this issue May 6, 2024
Aggregation operations in query library functions and built-in query aggregations are inconsistent.  This PR makes them consistent.  Query library functions were changed.

* `percentile` now returns the primitive type.
* `sum` returns a widened type of `double` for floating point inputs or `long` for integer inputs.
* `product` returns a widened type of `double` for floating point inputs or `long` for integer inputs.
* `cumsum` returns a widened type of `double[]` for floating point inputs or `long[]` for integer inputs.
* `cumprod` returns a widened type of `double[]` for floating point inputs or `long[]` for integer inputs.
* `wsum` returns a widened type of `long` for all integer inputs and `double` for inputs containing floating points.

Note:  Because the types have changed, the NULL return values have changed as well.
 
Resolves #4023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment