Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect avg calculation from vector runtime #5516

Closed
philrz opened this issue Dec 4, 2024 · 1 comment · Fixed by #5528
Closed

Incorrect avg calculation from vector runtime #5516

philrz opened this issue Dec 4, 2024 · 1 comment · Fixed by #5528
Labels
bug Something isn't working

Comments

@philrz
Copy link
Contributor

philrz commented Dec 4, 2024

Repro is with super commit 200f373.

Test data is the attached data.bsup.gz. This is a simplification of an issue that first surfaced when running the mgbench bench3/q5 query.

I start by confirming the expected result using the original BSUP data in our sequential runtime and seeing it matches with the equivalent query in DuckDB (modulo the typical/tiny floating point differences).

$ super -version
Version: v1.18.0-181-g200f3732

$ super -c 'from data.bsup.gz | avg(event_value)'
1058.523401720017

$ super -f parquet -o data.parquet data.bsup.gz 

$ duckdb -c "SELECT AVG(event_value) FROM data.parquet"
┌────────────────────┐
│  avg(event_value)  │
│       double       │
├────────────────────┤
│ 1058.5234017218875 │
└────────────────────┘

Running the same through the vector runtime when querying as either Parquet or CSUP, I get a significantly different result.

$ SUPER_VAM=1 super -c 'from data.parquet | avg(event_value)'
821.8078935529076

$ super -f csup -o data.csup data.bsup.gz 

$ SUPER_VAM=1 super -c 'from data.csup | avg(event_value)'
821.5508514472242

If I run the query through the sequential runtime with either of those vector-friendly formats, I once again get the correct result.

$ super -c 'from data.csup | avg(event_value)'
1058.523401720017

$ super -c 'from data.parquet | avg(event_value)'
1058.523401720017
@philrz philrz added the bug Something isn't working label Dec 4, 2024
mattnibs added a commit that referenced this issue Dec 11, 2024
Fix a bug in the vector runtime for the avg aggregation where null
values were improperly being counted

Closes #5516
mattnibs added a commit that referenced this issue Dec 11, 2024
Fix a bug in the vector runtime for the avg aggregation where null
values were improperly being counted

Closes #5516
@philrz
Copy link
Contributor Author

philrz commented Dec 12, 2024

Verified in super commit 883ffd2.

I now see avg results from Parquet and CSUP files in the vector runtime that are similar to those from the original BSUP in the sequential runtime.

$ super -version
Version: v1.18.0-191-g883ffd2b

$ super -c 'from data.bsup.gz | avg(event_value)'
1058.523401720017

$ super -f parquet -o data.parquet data.bsup.gz

$ super -f csup -o data.csup data.bsup.gz

$ SUPER_VAM=1 super -c 'from data.parquet | avg(event_value)'
1058.5297770606794

$ SUPER_VAM=1 super -c 'from data.csup | avg(event_value)'
1058.5234017218877

It's interesting that even though both the Parquet and CSUP files are generated from the same original BSUP data, the calculations with these in the vector runtime differ from each other and also differ from the calculation with BSUP in the sequential runtime. This doesn't come across as a bug necessarily since users are accustomed to seeing small differences in precision with floating point math, so I expect this might all be explained by something about parallel operations. But I point it out in case it's at all surprising.

FWIW, if we use the same Parquet and CSUP files to do the calculation with the sequential runtime, now the result does match what we saw with BSUP in the sequential runtime.

$ super -c 'from data.parquet | avg(event_value)'
1058.523401720017

$ super -c 'from data.csup | avg(event_value)'
1058.523401720017

I've opened separate issue #5530 in case that's worthy of closer scrutiny.

Thanks @mattnibs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant