Vector/Formula Aggregation Inconsistency #4023

stanbrub · 2023-06-16T23:48:59Z

The behavior of vector aggregations compared to in-line formulas or agg_by aggregations that do the same thing is wildly different in the handling of return types and type overflows.

Here are some observations:

For integer overflows, no error is thrown to alert the user (Assume savvy users to avoid performance hit of checks?)
Summing 2 billion and 1 billion
- Vector Sum: Returns max signed 32 bit integer value
- In-line Formula Sum: Returns a negative integer
- AggBy Sum: Returns a long with the correct value

This could be confusing users. Fixing the Inline Formula is simple; just cast the operands to long. But even if there are workarounds for all scenarios, is it not possible to handle overflows consistently?

vector_function_inconsistency.py.txt

rcaudy · 2023-06-17T00:21:37Z

Some of these differences or issues are indeed deliberate. Others are worth discussing.

chipkent · 2024-04-15T17:39:30Z

Contents of the example file:

from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'int_vector_sum = sum(values)',
    'int_formula_sum = values[0] + values[1]',
    'long_formula_sum = (long)values[0] + values[1]',
    'int_vector_avg = avg(values)',
    'int_formula_avg = (values[0] + values[1]) / 2',
    'long_formula_avg = ((long)values[0] + values[1]) / 2'
])

results2 = tbl.agg_by([
    agg.avg('int_aggby_avg=values'), agg.sum_('int_aggby_sum=values')], 
    ['group'])

chipkent · 2024-04-15T17:56:12Z

More ...

from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'int_vector_sum = sum(values)',
    'int_formula_sum = values[0] + values[1]',
    'long_formula_sum = (long)values[0] + values[1]',
    'int_vector_avg = avg(values)',
    'int_formula_avg = (values[0] + values[1]) / 2',
    'long_formula_avg = ((long)values[0] + values[1]) / 2'
])

results2 = tbl.agg_by([
    agg.avg('int_aggby_avg=values'), agg.sum_('int_aggby_sum=values')], 
    ['group'])

m = results.meta_table
m2 = results2.meta_table

chipkent · 2024-04-15T18:31:03Z

Changing to a long column results in consistent results:

Going to short also results in overflow.

chipkent · 2024-04-15T19:13:49Z

Looking through Numeric.ftl, there are a few methods that may need to change:

Plain old bug:

percentile: should return the base type. The implementation is just dumb and returns double.

Widen return type because of overflow to be consistent with other aggs:

sum
product
cumsum
cumprod

Kind of weird:

wsum: currently returns double for all types, but if both inputs are integer types, it might make sense to widen to a long.
median: currently returns double for all types. This allows it to handle the average case. I'm not sure there is a better option.

from deephaven.column import int_col, long_col, double_col, float_col, string_col
from deephaven import agg, new_table, updateby as uby

tbl = new_table([
    string_col('group', ['1B', '1B']),
    int_col('values', [1_000_000_000, 2_000_000_000])
])
grouped = tbl.group_by(['group'])
results = grouped.update([
    'pct = percentile(0.5, values)',
    'sm = sum(values)',
    'csum = cumsum(values)',
    'cprod = cumprod(values)'
])

results2 = tbl.agg_by([
    agg.pct(0.5, 'pct=values'), agg.sum_('sm=values')], 
    ['group'])

result3 = tbl.update_by([uby.cum_sum("csum=values"), uby.cum_prod("cprod=values")])

m = results.meta_table
m2 = results2.meta_table
m3 = result3.meta_table

chipkent · 2024-04-15T19:22:01Z

This illustrates the "weird" wsum case:

from deephaven import empty_table
from deephaven import agg, new_table, updateby as uby

t = empty_table(3).update(["X = i", "Y=2*i"])

results = t.group_by().update(["ws = wsum(X,Y)"])

results2 = t.agg_by([agg.weighted_sum("X", "ws=Y")])

m = results.meta_table
m2 = results2.meta_table

Here the aggregator is using long while the query library is using double.

lbooker42 · 2024-04-15T20:54:41Z

Regarding:

Widen return type because of overflow to be consistent with other aggs:

sum

product

cumsum

cumprod

I encountered the type-preserving behavior in these numeric functions when using them to verify UpdateBy operations, but assumed intention and the users would manually widen if overflow was undesirable. I support the idea of default-widening the return types.

Pointing out more inconsistencies, update_by is consistent in widening integral primitives to long but handles float input types variously. Some examples:

ema(float) => double
cum_sum(float) => float
cum_prod(float) => float
rolling_sum(float) => float
rolling_prod(float) => double

lbooker42 · 2024-04-15T20:57:52Z

This illustrates the "weird" wsum case:

agg.weighted_sum adaptively returns a long or double based on the input type. Is that the behavior we'd like in the library functions?

Aggregation operations in query library functions and built-in query aggregations are inconsistent. This PR makes them consistent. Query library functions were changed. * `percentile` now returns the primitive type. * `sum` returns a widened type of `double` for floating point inputs or `long` for integer inputs. * `product` returns a widened type of `double` for floating point inputs or `long` for integer inputs. * `cumsum` returns a widened type of `double[]` for floating point inputs or `long[]` for integer inputs. * `cumprod` returns a widened type of `double[]` for floating point inputs or `long[]` for integer inputs. * `wsum` returns a widened type of `long` for all integer inputs and `double` for inputs containing floating points. Note: Because the types have changed, the NULL return values have changed as well. Resolves #4023

stanbrub added bug Something isn't working triage labels Jun 16, 2023

rcaudy added query engine core Core development tasks and removed triage labels Jun 17, 2023

rcaudy added this to the Backlog milestone Jun 17, 2023

rcaudy assigned rcaudy, cpwright, chipkent and rbasralian Jun 17, 2023

chipkent mentioned this issue Apr 16, 2024

Fix inconsistencies in aggregations and query library functions. #5368

Merged

lbooker42 mentioned this issue Apr 16, 2024

Widen returned types for UpdateBy floating point operations. #5371

Merged

chipkent closed this as completed in #5368 May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vector/Formula Aggregation Inconsistency #4023

Vector/Formula Aggregation Inconsistency #4023

stanbrub commented Jun 16, 2023 •

edited

Loading

rcaudy commented Jun 17, 2023

chipkent commented Apr 15, 2024 •

edited

Loading

chipkent commented Apr 15, 2024 •

edited

Loading

chipkent commented Apr 15, 2024

chipkent commented Apr 15, 2024

chipkent commented Apr 15, 2024

lbooker42 commented Apr 15, 2024

lbooker42 commented Apr 15, 2024

Vector/Formula Aggregation Inconsistency #4023

Vector/Formula Aggregation Inconsistency #4023

Comments

stanbrub commented Jun 16, 2023 • edited Loading

rcaudy commented Jun 17, 2023

chipkent commented Apr 15, 2024 • edited Loading

chipkent commented Apr 15, 2024 • edited Loading

chipkent commented Apr 15, 2024

chipkent commented Apr 15, 2024

chipkent commented Apr 15, 2024

lbooker42 commented Apr 15, 2024

lbooker42 commented Apr 15, 2024

stanbrub commented Jun 16, 2023 •

edited

Loading

chipkent commented Apr 15, 2024 •

edited

Loading

chipkent commented Apr 15, 2024 •

edited

Loading