Implement more vectorized aggregate functions #7200

akuzm · 2024-08-14T20:02:58Z

Vectorize common aggregate functions such as min, max, sum, avg, stddev, and variance for arithmetic types, both without grouping and with grouping on segmentby columns.

Tsbench shows up to 11x improvement and 2x on average on affected queries: https://grafana.ops.savannah-dev.timescale.com/d/fasYic_4z/compare-akuzm?orgId=1&var-branch=All&var-run1=3730&var-run2=3728&var-threshold=0&var-use_historical_thresholds=true&var-threshold_expression=2.5%20%2A%20percentile_cont%280.90%29&var-exact_suite_version=false&from=now-2d&to=now

Depends on:

erimatnor

Submitting the comments I have so far. Haven't finished a full review, but I haven't found anything major yet. But I figured I shouldn't sit on this for too long, so let's start with some feedback.

tsl/src/nodes/vector_agg/function/agg_vector_validity_helper.c

tsl/src/nodes/vector_agg/function/float48_accum_single.c

erimatnor · 2024-09-17T07:26:42Z

tsl/src/nodes/vector_agg/function/float48_accum_types.c

+#undef AGG_NAME
+#undef NEED_SXX


Not sure why these two are undef'ed here. I don't see them being defined anywhere above or in the included files. Is it just a precaution?

akuzm · 2024-09-18

They are defined in float48_accum_templates.c before including these files. Generally, the idea is that a template file undefines the macros it is parameterized by. float48_accum_types.c is parameterized by these two macros to generate the two families of transition functions: those that require the Sxx state and those that do not. I'm following this approach for template files in vectorized filters as well. Not sure where best to describe this.

erimatnor · 2024-09-17T07:32:42Z

tsl/src/nodes/vector_agg/function/int24_avg_accum_single.c

+#undef PG_TYPE
+#undef CTYPE
+#undef DATUM_TO_CTYPE


There's some inconsistency w.r.t. where these macros are defined and undefined.

You first define the macros listed here outside the file, including AGG_NAME, but then you don't undef all of the macros here. For example, AGG_NAME is still defined.

It would be good to have some kind of consistent approach/rule for this. For example, let's say all the defines and undefs would be handled outside. Otherwise it is a bit difficult to understand the intention.

akuzm · 2024-09-18

Same as in the comment above: there is only a single instantiation of this template for a particular type, so the type macros are undefined at the end, but there are multiple instantiations for the particular aggregate (AVG), so that macro is undefined above.
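The rule described in these two threads (the template undefines what changes with every instantiation, while a macro shared by several instantiations is undefined once by the includer) can be sketched in a single file. This is only an approximation under stated assumptions: the PR uses separate template files that are `#include`d repeatedly, whereas here an object-like macro stands in for the template file so that the sketch compiles on its own. The names `TEMPLATE_BODY`, `PASTE`, `TYPE_TAG` and `demo_sum` are hypothetical and do not come from the PR.

```c
#include <stddef.h>

/* Two-level paste so that macro arguments are expanded before joining. */
#define PASTE_HELPER(a, b) a##_##b
#define PASTE(a, b) PASTE_HELPER(a, b)

/*
 * Hypothetical stand-in for a template file. It is parameterized by
 * AGG_NAME, TYPE_TAG and CTYPE, which are looked up when TEMPLATE_BODY
 * is expanded, not when it is defined.
 */
#define TEMPLATE_BODY \
	CTYPE PASTE(AGG_NAME, TYPE_TAG)(const CTYPE *values, size_t n) \
	{ \
		CTYPE result = 0; \
		for (size_t i = 0; i < n; i++) \
			result += values[i]; \
		return result; \
	}

/* Shared by both instantiations, so it is undefined once, after them. */
#define AGG_NAME demo_sum

/* Instantiation for float: generates demo_sum_float4(). */
#define CTYPE float
#define TYPE_TAG float4
TEMPLATE_BODY
/* In an include-based template these undefs would sit at the end of the template file. */
#undef CTYPE
#undef TYPE_TAG

/* Instantiation for double: generates demo_sum_float8(). */
#define CTYPE double
#define TYPE_TAG float8
TEMPLATE_BODY
#undef CTYPE
#undef TYPE_TAG

#undef AGG_NAME
```

After this block no parameter macro is left defined, which is the property the review asked to make consistent: each macro is undefined at the level that knows when its last use has happened.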

erimatnor · 2024-09-17T07:37:24Z

tsl/src/nodes/vector_agg/function/float48_accum_single.c

+	/*
+	 * Vector registers can be up to 512 bits wide.
+	 */
+#define UNROLL_SIZE ((int) (512 / 8 / sizeof(CTYPE)))


This seems like a global, hardware-specific limit. It is used in multiple places so I suggest moving it into a header where you can define it only once, as a macro on CTYPE.

Also wondering if there's a compiler header file that defines the vector register size available that can be used instead of hard-coding 512 here?

#define UNROLL_SIZE(CTYPE) ((int) (512 / 8 / sizeof(CTYPE)))

akuzm · 2024-09-18

This is just a guesstimate specific to a particular unrolled loop, with no deep meaning, so I'd keep it local. Maybe we can make it more architecture-specific; for now I just wrote a "least-effort" version that is vectorized at least somehow.

erimatnor · 2024-09-23

I am just wondering if there's a standard header file we can include that tells us AVX-512 is available, with some define related to the unroll size that says "512".

akuzm · 2024-09-26

Not sure if there's an easy cross-platform way to do it. Generally we want to unroll to the biggest available size across architectures; if only a smaller width is available, it will still lead to vectorizable code.
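To make the role of `UNROLL_SIZE` concrete, here is a minimal sketch of a sum that keeps `UNROLL_SIZE` independent partial accumulators. It is not the PR's code: the function name `unrolled_sum` and the `partial` array are hypothetical, and `CTYPE` is fixed to `double` only so the example compiles standalone. The fixed-trip-count inner loop is the kind of shape a compiler can typically turn into vector instructions, whatever register width the target offers.

```c
#include <assert.h>

#define CTYPE double

/* Number of CTYPE lanes in a 512-bit register, as in the reviewed snippet. */
#define UNROLL_SIZE ((int) (512 / 8 / sizeof(CTYPE)))

/* Hypothetical helper: sum n values using UNROLL_SIZE independent accumulators. */
CTYPE
unrolled_sum(const CTYPE *values, int n)
{
	assert(n >= 0);

	CTYPE partial[UNROLL_SIZE] = { 0 };
	const int full = n - n % UNROLL_SIZE;

	/* Main part: each inner iteration touches a different accumulator. */
	for (int outer = 0; outer < full; outer += UNROLL_SIZE)
	{
		for (int inner = 0; inner < UNROLL_SIZE; inner++)
			partial[inner] += values[outer + inner];
	}

	/* Tail: fewer than UNROLL_SIZE values remain. */
	for (int i = full; i < n; i++)
		partial[i - full] += values[i];

	/* Combine the partial sums. */
	CTYPE total = 0;
	for (int i = 0; i < UNROLL_SIZE; i++)
		total += partial[i];
	return total;
}
```

Note that for floating-point types this changes the order of additions compared with a plain sequential loop, so the rounding of the result can differ slightly.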

erimatnor · 2024-09-17T08:07:08Z

tsl/src/nodes/vector_agg/function/functions.c

+	const uint64 *restrict validity = (uint64 *) vector->buffers[0];
+	/* First, process the full words. */
+	for (int i = 0; i < n / 64; i++)


What happens here on 32-bit platforms? I guess it is just slower?

akuzm · 2024-09-18

They still have uint64 support, so it should work just as well. Probably slower; I didn't test.

erimatnor · 2024-09-23

Yes, the compiler has uint64 support, but it is not a native word-length type on those machines, so they need to emulate 64-bit integers.

We do build for 32-bit, but it is probably not a platform we optimize for in terms of performance. It might be good to point out that we currently don't care about optimizing for 32-bit specifically.

akuzm · 2024-09-26

I'm not sure where best to do it. This logic applies to every place in the code that uses 64-bit integers.
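For readers unfamiliar with the loop under discussion, the following is a hedged sketch of processing a validity bitmap in 64-bit words: full words first, then the remaining rows. It assumes an Arrow-style bitmap in which bit `i` of word `w` (counting from the least significant bit) describes row `w * 64 + i` and a set bit means the value is valid. The function name `sum_valid` is hypothetical and the body is not taken from the PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sum the int32 values whose validity bit is set. */
int64_t
sum_valid(const int32_t *values, const uint64_t *validity, int n)
{
	assert(n >= 0);

	int64_t sum = 0;

	/* First, process the full 64-row words. */
	for (int word = 0; word < n / 64; word++)
	{
		const uint64_t bits = validity[word];
		for (int bit = 0; bit < 64; bit++)
		{
			/* Multiply by the 0/1 validity bit instead of branching. */
			const int64_t valid = (int64_t) ((bits >> bit) & 1);
			sum += (int64_t) values[word * 64 + bit] * valid;
		}
	}

	/* Then the tail rows that do not fill a whole word. */
	for (int row = (n / 64) * 64; row < n; row++)
	{
		const int64_t valid = (int64_t) ((validity[row / 64] >> (row % 64)) & 1);
		sum += (int64_t) values[row] * valid;
	}

	return sum;
}
```

The code behaves the same on 32-bit targets, because `uint64_t` arithmetic has the same semantics wherever the type exists; the compiler simply emits several instructions per 64-bit operation there, which is the slowdown the thread refers to.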

tsl/src/nodes/vector_agg/function/sum_float_single.c

erimatnor

Approving. This seems mostly like boilerplate stuff. I didn't deeply check the correctness of the aggregate function implementations. I guess tests should reveal any issue there.

The macro and pre-processor stuff is a little bit hard to follow sometimes. I am wondering if that could be cleaned up, but I guess it is what we have for now.

@MiguelTubio

This release contains performance improvements and bug fixes since the 2.16.1 release. We recommend that you upgrade at the next available opportunity.

**Features**
* timescale#6882: Allow DELETE on the compressed chunks without decompression.
* timescale#7033: Use MERGE statement on CAgg Refresh.
* timescale#7126: Add functions to show the compression information.
* timescale#7147: Vectorize partial aggregation for `sum(int4)` with grouping on `segment by` columns.
* timescale#7200: Vectorize common aggregate functions like `min`, `max`, `sum`, `avg`, `stddev`, `variance` for compressed columns of arithmetic types, when there is grouping on segmentby columns or no grouping.
* timescale#7204: Track additional extensions in telemetry.
* timescale#7207: Refactor the `decompress_batches_scan` functions for easier maintenance.
* timescale#7209: Add a function to drop the `osm` chunk.
* timescale#7275: Add support for RETURNING clause for MERGE.
* timescale#7295: Support ALTER TABLE SET ACCESS METHOD on hypertable.

**Bugfixes**
* timescale#7187: Fix the string literal length for the `compressed_data_info` function.
* timescale#7191: Fix creating default indexes on chunks when migrating the data.
* timescale#7195: Fix the `segment by` and `order by` checks when dropping a column from a compressed hypertable.
* timescale#7201: Use the generic extension description when building `apt` and `rpm` loader packages.
* timescale#7227: Add an index to the `compression_chunk_size` catalog table.
* timescale#7229: Fix the foreign key constraints where the index and the constraint column order are different.
* timescale#7230: Do not propagate the foreign key constraints to the `osm` chunk.
* timescale#7234: Release the cache after accessing the cache entry.
* timescale#7258: Force English in the pg_config command executed by cmake to avoid unexpected building errors.
* timescale#7270: Fix memory leak in compressed DML batch filtering.
* timescale#7286: Fix index column check while searching for index.
* timescale#7290: Add check for NULL offset for caggs built on top of caggs.
* timescale#7301: Make foreign key behaviour for hypertables consistent.
* timescale#7318: Fix chunk skipping range filtering.
* timescale#7320: Set license specific extension comment in install script.

**Thanks**
* @MiguelTubio for reporting and fixing a Windows build error.
* @posuch for reporting the misleading extension description in the generic loader packages.
* @snyrkill for discovering and reporting the issue.

@MiguelTubio

This release adds support for PostgreSQL 17, significantly improves the performance of continuous aggregate refreshes, and contains performance improvements for analytical queries and delete operations over compressed hypertables. We recommend that you upgrade at the next available opportunity.

**Highlighted features in TimescaleDB v2.17.0**
* Full PostgreSQL 17 support for all existing features. TimescaleDB v2.17 is available for PostgreSQL 14, 15, 16, and 17.
* Significant performance improvements for continuous aggregate policies: continuous aggregate refresh is now using `merge` instead of deleting old materialized data and re-inserting. This update can decrease dramatically the amount of data that must be written on the continuous aggregate in the presence of a small number of changes, reduce the `i/o` cost of refreshing a continuous aggregate, and generate fewer Write-Ahead Logs (`WAL`). Overall, continuous aggregate policies will be more lightweight, use less system resources, and complete faster.
* Increased performance for real-time analytical queries over compressed hypertables: we are excited to introduce additional Single Instruction, Multiple Data (`SIMD`) vectorization optimization to our engine by supporting vectorized execution for queries that group by using the `segment_by` column(s) and aggregate using the basic aggregate functions (`sum`, `count`, `avg`, `min`, `max`). Stay tuned for more to come in follow-up releases! Support for grouping on additional columns, filtered aggregation, vectorized expressions, and `time_bucket` is coming soon.
* Improved performance of deletes on compressed hypertables when a large amount of data is affected. This improvement speeds up operations that delete whole segments by skipping the decompression step. It is enabled for all deletes that filter by the `segment_by` column(s).

**PostgreSQL 14 deprecation announcement**

We will continue supporting PostgreSQL 14 until April 2025. Closer to that time, we will announce the specific version of TimescaleDB in which PostgreSQL 14 support will not be included going forward.

**Features**
* #6882: Allow delete of full segments on compressed chunks without decompression.
* #7033: Use `merge` statement on continuous aggregates refresh.
* #7126: Add functions to show the compression information.
* #7147: Vectorize partial aggregation for `sum(int4)` with grouping on `segment by` columns.
* #7204: Track additional extensions in telemetry.
* #7207: Refactor the `decompress_batches_scan` functions for easier maintenance.
* #7209: Add a function to drop the `osm` chunk.
* #7275: Add support for the `returning` clause for `merge`.
* #7200: Vectorize common aggregate functions like `min`, `max`, `sum`, `avg`, `stddev`, `variance` for compressed columns of arithmetic types, when there is grouping on `segment by` columns or no grouping.

**Bug fixes**
* #7187: Fix the string literal length for the `compressed_data_info` function.
* #7191: Fix creating default indexes on chunks when migrating the data.
* #7195: Fix the `segment by` and `order by` checks when dropping a column from a compressed hypertable.
* #7201: Use the generic extension description when building `apt` and `rpm` loader packages.
* #7227: Add an index to the `compression_chunk_size` catalog table.
* #7229: Fix the foreign key constraints where the index and the constraint column order are different.
* #7230: Do not propagate the foreign key constraints to the `osm` chunk.
* #7234: Release the cache after accessing the cache entry.
* #7258: Force English in the `pg_config` command executed by `cmake` to avoid the unexpected building errors.
* #7270: Fix the memory leak in compressed DML batch filtering.
* #7286: Fix the index column check while searching for the index.
* #7290: Add check for null offset for continuous aggregates built on top of continuous aggregates.
* #7301: Make foreign key behavior for hypertables consistent.
* #7318: Fix chunk skipping range filtering.
* #7320: Set the license specific extension comment in the install script.

**Thanks**
* @MiguelTubio for reporting and fixing the Windows build error.
* @posuch for reporting the misleading extension description in the generic loader packages.
* @snyrkill for discovering and reporting the issue with continuous aggregates built on top of continuous aggregates.


akuzm added 30 commits March 28, 2024 14:32

full switch

13ba173

fix the build

be203fd

remove the old planning approach

ee8b1f4

remove more of old planning

e146937

typos

753bf0d

use enum indexes for settings

4a4f20b

cleanup

beba737

benchmark separate vectorized agg (2024-03-28 no. 1)

2dbda15

split out common code

175cbf2

show costs in explain

21faf6e

wrong prefix

fa2fb4d

Merge remote-tracking branch 'akuzm/vector-separate' into HEAD

0ed166f

Remove temporary data from DecompressChunkPath

30a6069

It's a little confusing because they only live during the creation of decompression plan. Put them into a separate struct instead.

rename

e25267d

typo

5e6221d

benchmark separate vectorized agg (2024-03-29 no. 2)

5c4af48

produce partials for each batch

4130683

also unroll the loop

benchmark separate vectorized agg (2024-03-29 no. 3)

e7f01ab

more generic interface

cff844d

fix outer_var resolution

209838e

Revert "disable filters"

dfb92af

This reverts commit 7852d55f3061b82a3ce1cf8d7575f64c2d14aa0b.

support filters?

6ef84c1

fix outer_var resolution

4db7cea

fix ref

287f3b4

Merge remote-tracking branch 'akuzm/vector-separate' into HEAD

a50069f

fix for filtered out batches

9bdae30

benchmark vectorized agg with filter (2024-03-29 no. 4)

398f317

fix build on windows

eaca282

something that doesn't work

fdca7a7

fix

60c6eab

akuzm added 10 commits September 10, 2024 16:42

int128avgstate

fafeb6a

fix 14

f8d11d9

Merge remote-tracking branch 'origin/main' into HEAD

79828b3

benchmark aggregate functions (2024-09-10 no. 1)

153d4cc

some improvements to float sum

71ec5ae

benchmark aggregate functions (2024-09-11 no. 2)

71a99ca

separate translation units

087193a

cleanup

5f2a5d4

Merge remote-tracking branch 'origin/main' into HEAD

2591e61

changelog

b84dd50

akuzm marked this pull request as ready for review September 12, 2024 13:25

akuzm added 2 commits September 12, 2024 15:35

changelog

8b3cab1

fix

15cc10d

erimatnor reviewed Sep 18, 2024

review comments

00716dc

erimatnor approved these changes Sep 24, 2024

fabriziomello assigned akuzm Sep 24, 2024

akuzm added this to the TimescaleDB 2.17.0 milestone Sep 25, 2024

svenklemm approved these changes Sep 25, 2024

akuzm added 2 commits September 26, 2024 16:07

Merge remote-tracking branch 'origin/main' into HEAD

a414e26

changelog

f77f0db

akuzm enabled auto-merge (squash) September 26, 2024 14:09

akuzm merged commit 0cc00e7 into timescale:main Sep 26, 2024
40 of 41 checks passed

akuzm deleted the agg-functions branch September 26, 2024 14:24

pallavisontakke mentioned this pull request Oct 3, 2024

Release 2.17.0 #7285

Merged

pallavisontakke mentioned this pull request Oct 8, 2024

Release 2.17.0 #7328

Merged
