Dedicated `ranges::minmax` vectorization that does not unnecessarily track element pointer #4384

AlexGuteniev · 2024-02-10T18:58:40Z

Towards #2803

New header

`extern "C"` functions can't use templates, so the new header provides non-template variations of the minmax result structure.
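A minimal sketch of that constraint, assuming hypothetical names (`MinMaxI8`, `MinMaxU8`, `minmax_i8`, `minmax_u8`) that are not the declarations in `<__msvc_minmax.hpp>`: the exported `extern "C"` signatures use plain structs, one per element type, while a template can still be used internally where C linkage does not apply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, illustration-only result types: one plain struct per
// element type, because a C-linkage signature cannot be a template.
struct MinMaxI8 {
    std::int8_t min;
    std::int8_t max;
};

struct MinMaxU8 {
    std::uint8_t min;
    std::uint8_t max;
};

namespace {
    // Templates are still fine internally; only the exported names need C linkage.
    template <class Result, class T>
    Result minmax_impl(const T* first, const T* last) {
        assert(first != last); // non-empty input is a precondition
        Result r{*first, *first};
        for (++first; first != last; ++first) {
            if (*first < r.min) {
                r.min = *first;
            }
            if (r.max < *first) {
                r.max = *first;
            }
        }
        return r;
    }
} // namespace

extern "C" MinMaxI8 minmax_i8(const std::int8_t* first, const std::int8_t* last) {
    return minmax_impl<MinMaxI8>(first, last);
}

extern "C" MinMaxU8 minmax_u8(const std::uint8_t* first, const std::uint8_t* last) {
    return minmax_impl<MinMaxU8>(first, last);
}
```

A caller picks the entry point that matches its element type and reads `.min` and `.max` from the returned struct.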

Sign handling

Unlike the `_element` counterparts, integer signedness affects function selection rather than being passed as a runtime parameter.
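The two styles can be contrasted in a small sketch. All names here (`min_element_4`, `min_value_4i`, `min_value_4u`, `min_value`) are hypothetical and only model the idea for 32-bit integers; they are not the STL's entry points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <type_traits>

// "_element" style (hypothetical): one entry point, signedness is runtime data.
inline const void* min_element_4(const void* first, const void* last, bool is_signed) {
    assert(first != last);
    const auto* f = static_cast<const std::uint32_t*>(first);
    const auto* l = static_cast<const std::uint32_t*>(last);
    // Flipping the sign bit lets a single unsigned comparison order signed data.
    const std::uint32_t bias = is_signed ? 0x80000000u : 0u;
    const std::uint32_t* best = f;
    for (const std::uint32_t* p = f + 1; p != l; ++p) {
        if ((*p ^ bias) < (*best ^ bias)) {
            best = p;
        }
    }
    return best;
}

// Value style (hypothetical): signedness is encoded in which function is called.
inline std::int32_t min_value_4i(const std::int32_t* first, const std::int32_t* last) {
    assert(first != last);
    return *std::min_element(first, last);
}

inline std::uint32_t min_value_4u(const std::uint32_t* first, const std::uint32_t* last) {
    assert(first != last);
    return *std::min_element(first, last);
}

// The choice between the two is made at compile time from the element type.
template <class T>
T min_value(const T* first, const T* last) {
    static_assert(std::is_integral_v<T> && sizeof(T) == 4, "sketch covers 32-bit integers only");
    if constexpr (std::is_signed_v<T>) {
        return min_value_4i(first, last);
    } else {
        return min_value_4u(first, last);
    }
}
```

With this shape, `min_value` on `std::int32_t` data resolves to the signed function and on `std::uint32_t` data to the unsigned one, with no signedness flag passed at run time.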

For `minmax_element`, computing the element pointer requires the index, so a cmpgt + blendv pair is needed to capture it. cmpgt is only available in signed form, so unsigned types are supported by adjusting the input values.
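As a scalar model of that adjustment (not the vectorized code in this PR), the sketch below builds an unsigned comparison from a signed-only one by biasing both operands by 128, which has the same effect as flipping the top bit. The function names are hypothetical; `signed_gt` stands in for a cmpgt-style instruction and the conditional select stands in for blendv.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a signed-only "greater than" instruction.
inline bool signed_gt(std::int8_t a, std::int8_t b) {
    return a > b;
}

// Unsigned comparison built from the signed one: subtracting 128 maps
// 0..255 onto -128..127 and preserves the ordering.
inline bool unsigned_gt(std::uint8_t a, std::uint8_t b) {
    const auto sa = static_cast<std::int8_t>(static_cast<int>(a) - 128);
    const auto sb = static_cast<std::int8_t>(static_cast<int>(b) - 128);
    return signed_gt(sa, sb);
}

// Index-tracking minimum: the comparison result acts as a mask that
// selects between the old and the new index, like a blend.
inline std::size_t min_index_u8(const std::uint8_t* data, std::size_t n) {
    assert(n != 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        const bool take = unsigned_gt(data[best], data[i]);
        best = take ? i : best;
    }
    return best;
}
```

Because the comparison is strict, ties keep the earlier index, so the first occurrence of the minimum is reported.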

For minmax values, computing only the vertical min/max without the index doesn't need the cmpgt + blendv pair for most types, as there are signed and unsigned min/max operations. The only exception is 64-bit integers: there are no min/max ops, so blendv is still used. For consistency, also enforced by the template use, the sign is still passed via function selection even for 64-bit integers.
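A portable, scalar model of the value-only approach, assuming 16 lanes of `uint8_t` to mimic one 128-bit register: each lane keeps a running min and max (the "vertical" step), the lanes are reduced once at the end (the "horizontal" step), and no index is tracked. The name `minmax_u8_lanes` and the struct `MinMaxU8Model` are hypothetical, and the inner loops stand in for single min/max instructions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16; // models sixteen 8-bit lanes

struct MinMaxU8Model {
    std::uint8_t min;
    std::uint8_t max;
};

inline MinMaxU8Model minmax_u8_lanes(const std::uint8_t* data, std::size_t n) {
    assert(n != 0);
    MinMaxU8Model result{data[0], data[0]};
    std::size_t i = 0;
    if (n >= kLanes) {
        std::array<std::uint8_t, kLanes> lo{};
        std::copy_n(data, kLanes, lo.begin());
        std::array<std::uint8_t, kLanes> hi = lo;
        // Vertical step: per-lane min and max, no comparison mask or blend needed.
        for (i = kLanes; i + kLanes <= n; i += kLanes) {
            for (std::size_t lane = 0; lane < kLanes; ++lane) {
                lo[lane] = std::min(lo[lane], data[i + lane]);
                hi[lane] = std::max(hi[lane], data[i + lane]);
            }
        }
        // Horizontal step: reduce the lanes to a single min and max.
        result.min = *std::min_element(lo.begin(), lo.end());
        result.max = *std::max_element(hi.begin(), hi.end());
    }
    // Scalar tail for the elements that do not fill a whole chunk.
    for (; i < n; ++i) {
        result.min = std::min(result.min, data[i]);
        result.max = std::max(result.max, data[i]);
    }
    return result;
}
```

Since only values are combined, the order in which lanes are visited does not matter, which is what removes the need to carry positions through the loop.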

Benchmark

The `_val` rows are the new results; we indirect for the value there.

I put the benchmark output from before and after into a single table:

--------------------------------------------------------------------------------------------------
Benchmark                                old time        old CPU        new time       new CPU   
--------------------------------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>                1345 ns         1350 ns       1336 ns         1350 ns   
bm<uint8_t, 8021, Op::Max>                1192 ns         1193 ns       2146 ns         1842 ns   
bm<uint8_t, 8021, Op::Both>               1675 ns         1674 ns       1883 ns         1918 ns   
bm<uint8_t, 8021, Op::Min_val>            8482 ns         8370 ns        665 ns          666 ns   
bm<uint8_t, 8021, Op::Max_val>            7954 ns         8022 ns        567 ns          578 ns   
bm<uint8_t, 8021, Op::Both_val>          27377 ns        27623 ns      30814 ns        30691 ns   
bm<uint16_t, 8021, Op::Min>               1841 ns         1814 ns       2091 ns         2100 ns   
bm<uint16_t, 8021, Op::Max>               1840 ns         1859 ns       1826 ns         1859 ns   
bm<uint16_t, 8021, Op::Both>              1819 ns         1842 ns       1999 ns         2009 ns   
bm<uint16_t, 8021, Op::Min_val>          10433 ns        10498 ns        591 ns          600 ns   
bm<uint16_t, 8021, Op::Max_val>          10183 ns        10254 ns        619 ns          614 ns   
bm<uint16_t, 8021, Op::Both_val>         22395 ns        21973 ns      23348 ns        23438 ns   
bm<uint32_t, 8021, Op::Min>               2455 ns         2455 ns       2283 ns         2288 ns   
bm<uint32_t, 8021, Op::Max>               3746 ns         3749 ns       3977 ns         4011 ns   
bm<uint32_t, 8021, Op::Both>              4417 ns         4443 ns       4784 ns         4649 ns   
bm<uint32_t, 8021, Op::Min_val>           2182 ns         2176 ns       1107 ns         1099 ns   
bm<uint32_t, 8021, Op::Max_val>           3453 ns         3453 ns       1688 ns         1674 ns   
bm<uint32_t, 8021, Op::Both_val>          4251 ns         4248 ns       1075 ns         1099 ns   
bm<uint64_t, 8021, Op::Min>               6909 ns         6801 ns       6827 ns         6801 ns   
bm<uint64_t, 8021, Op::Max>               6867 ns         6801 ns       6985 ns         6975 ns   
bm<uint64_t, 8021, Op::Both>              7646 ns         7499 ns       7570 ns         7499 ns   
bm<uint64_t, 8021, Op::Min_val>           6619 ns         6696 ns       5925 ns         5938 ns   
bm<uint64_t, 8021, Op::Max_val>           6942 ns         6975 ns       5912 ns         5859 ns   
bm<uint64_t, 8021, Op::Both_val>          8901 ns         8789 ns       6103 ns         5999 ns   
bm<int8_t, 8021, Op::Min>                  901 ns          907 ns        915 ns          921 ns   
bm<int8_t, 8021, Op::Max>                  833 ns          837 ns        885 ns          889 ns   
bm<int8_t, 8021, Op::Both>                1091 ns         1099 ns       1148 ns         1147 ns   
bm<int8_t, 8021, Op::Min_val>             6150 ns         6278 ns        398 ns          394 ns   
bm<int8_t, 8021, Op::Max_val>             6979 ns         6975 ns        414 ns          419 ns   
bm<int8_t, 8021, Op::Both_val>           26893 ns        26681 ns      23398 ns        23542 ns   
bm<int16_t, 8021, Op::Min>                1626 ns         1639 ns       1764 ns         1765 ns   
bm<int16_t, 8021, Op::Max>                1701 ns         1660 ns       1729 ns         1709 ns   
bm<int16_t, 8021, Op::Both>               1690 ns         1709 ns       1720 ns         1744 ns   
bm<int16_t, 8021, Op::Min_val>            9462 ns         9277 ns        858 ns          865 ns   
bm<int16_t, 8021, Op::Max_val>            9000 ns         8998 ns        856 ns          858 ns   
bm<int16_t, 8021, Op::Both_val>          19110 ns        19043 ns      18287 ns        18415 ns   
bm<int32_t, 8021, Op::Min>                2164 ns         2176 ns       3264 ns         3191 ns   
bm<int32_t, 8021, Op::Max>                3370 ns         3418 ns       3489 ns         3488 ns   
bm<int32_t, 8021, Op::Both>               4296 ns         4238 ns       4274 ns         4332 ns   
bm<int32_t, 8021, Op::Min_val>            2115 ns         2086 ns       1619 ns         1639 ns   
bm<int32_t, 8021, Op::Max_val>            3439 ns         3376 ns        928 ns          924 ns   
bm<int32_t, 8021, Op::Both_val>           4079 ns         4049 ns       1025 ns         1025 ns   
bm<int64_t, 8021, Op::Min>                6861 ns         6836 ns       6842 ns         6801 ns   
bm<int64_t, 8021, Op::Max>                6898 ns         6801 ns       6989 ns         6975 ns   
bm<int64_t, 8021, Op::Both>               8925 ns         8789 ns       8976 ns         8998 ns   
bm<int64_t, 8021, Op::Min_val>            6714 ns         6696 ns       5947 ns         5859 ns   
bm<int64_t, 8021, Op::Max_val>            6853 ns         6975 ns       5860 ns         5859 ns   
bm<int64_t, 8021, Op::Both_val>           7852 ns         7743 ns       6108 ns         6138 ns   
bm<float, 8021, Op::Min>                  3662 ns         3683 ns       3260 ns         3296 ns   
bm<float, 8021, Op::Max>                  3477 ns         3449 ns       3548 ns         3530 ns   
bm<float, 8021, Op::Both>                 5127 ns         5190 ns       5140 ns         5190 ns   
bm<float, 8021, Op::Min_val>              3375 ns         3376 ns       2873 ns         2916 ns   
bm<float, 8021, Op::Max_val>              3534 ns         3610 ns       2848 ns         2888 ns   
bm<float, 8021, Op::Both_val>             4958 ns         4743 ns       2916 ns         2916 ns   
bm<double, 8021, Op::Min>                 6103 ns         6138 ns       6017 ns         5999 ns   
bm<double, 8021, Op::Max>                 6001 ns         5999 ns       6292 ns         6278 ns   
bm<double, 8021, Op::Both>                7307 ns         7324 ns       7773 ns         7847 ns   
bm<double, 8021, Op::Min_val>             6021 ns         5999 ns       5963 ns         5999 ns   
bm<double, 8021, Op::Max_val>             5984 ns         5999 ns       5768 ns         5781 ns   
bm<double, 8021, Op::Both_val>            7120 ns         7150 ns       6024 ns         5999 ns

Some rows show an order-of-magnitude improvement.
Some show differences within the run-to-run variation.

I interpret these results as the change being a pure improvement: the rows that don't show it are memory-bound rather than CPU-bound, and this can change on a different machine or with a smaller input.

stl/inc/__msvc_minmax.hpp

Co-authored-by: A. Jiang <de34@live.cn>

stl/inc/algorithm

stl/inc/xutility

stl/src/vector_algorithms.cpp

tests/std/include/test_min_max_element_support.hpp


StephanTLavavej · 2024-02-12T00:53:59Z

Thanks! I resolved the remaining feedback comments, and decided to simply require SSE 4.2 for the new optimization, consistent with the previous optimization. This is a lot easier to reason about.

- Don't need _To_address for initializer_list.
- _STL_ASSERT non-empty inputs, update citations.
- Cleanup comment punctuation.
- Avoid risk by always guarding with _Use_sse42().
- Make room for comments by renaming lambda _First and _Second to _Val1 and _Val2.
  - Regex renamed: `\[\]\((__m128[id]?) _First, \1 _Second\) \{ return (\w+)\(_First, _Second\); \}`
  - To: `[]($1 _Val1, $1 _Val2) { return $2(_Val1, _Val2); }`
- Comment all _Minmax_traits_MEOW intrinsics above SSE2.
  - Remove comments from _mm_min_epi16 and _mm_max_epi16 - they're SSE2, not SSE4.1.

⚠️ Note to self:

Need MSVC-internal changes to add <__msvc_minmax.hpp>.

StephanTLavavej · 2024-02-15T23:09:25Z

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-02-16T21:12:00Z

Thanks for maximizing performance! 😹 🚀 😻

AlexGuteniev added 6 commits February 10, 2024 17:36

initial implementation

2082d8f

benchmark

e38454c

move vectorization out

108ed21

missing max

838da2a

bemchmark copypaste error

1b61bb0

_val is less confusing

15cd2d4

AlexGuteniev requested a review from a team as a code owner February 10, 2024 18:58

AlexGuteniev added 10 commits February 10, 2024 21:00

valement

6ea297a

format

ddedb4a

format

09469ab

no top level const in declaration

461ca2b

header unit

80e2470

ADL

5448180

check projection

0eb5322

copypasta cleanup

939dc33

Improve fallback

be2aa9b

format

1ecc70e

frederick-vs-ja reviewed Feb 11, 2024

AlexGuteniev and others added 2 commits February 11, 2024 15:29

Update stl/inc/__msvc_minmax.hpp

b5be1e4

Co-authored-by: A. Jiang <de34@live.cn>

turn tails

c01bf7d

StephanTLavavej added the performance Must go faster label Feb 11, 2024

StephanTLavavej requested changes Feb 11, 2024

AlexGuteniev added 6 commits February 11, 2024 22:40

Use SSE responsible

99917b4

empty vector check

e55e7f2

Don't mimic _Minmax_element fallback

35626ce

clear pointers

47bc608

no horizontal position

c3ba612

unload extra _Load

def1e7a

AlexGuteniev and others added 12 commits February 11, 2024 23:42

We'll hide Slavic accent

f149a83

non-type template param is already const enough

5a91a03

constant result

459ae03

range of is_constant_evaluated()

e9185a9

<!> scope for _M_ARM64EC <!>

cf1e3da

This is a bug fix, not only a change to address a pedantic comment!

the who understands English articles

1743e20

Don't need _To_address for initializer_list.

1603f0b

_STL_ASSERT non-empty inputs, update citations.

13155b3

Cleanup comment punctuation.

7c678db

Avoid risk by always guarding with _Use_sse42().

c201a3f

Make room for comments by renaming lambda _First and _Second to `…

35fd4b7

…_Val1` and `_Val2`. Regex renamed: `\[\]\((__m128[id]?) _First, \1 _Second\) \{ return (\w+)\(_First, _Second\); \}` to: `[]($1 _Val1, $1 _Val2) { return $2(_Val1, _Val2); }`

Comment all _Minmax_traits_MEOW intrinsics above SSE2.

d604bfd

Remove comments from `_mm_min_epi16` and `_mm_max_epi16` - they're SSE2, not SSE4.1.

StephanTLavavej approved these changes Feb 12, 2024

StephanTLavavej assigned CaseyCarter and StephanTLavavej Feb 12, 2024

CaseyCarter approved these changes Feb 16, 2024

CaseyCarter removed their assignment Feb 16, 2024

StephanTLavavej merged commit c53ac59 into microsoft:main Feb 16, 2024
35 checks passed

AlexGuteniev deleted the simple_max branch February 16, 2024 21:37

AlexGuteniev mentioned this pull request Feb 17, 2024

vector_algorithms.cpp, *minmax*: invert the condition to improve *_element cases a bit more #4401

Merged

AlexGuteniev mentioned this pull request Mar 6, 2024

Should ranges::minmax be auto-vectorized instead of manual vectorization. #4453

Closed

StephanTLavavej mentioned this pull request Mar 27, 2024

vector_algorithms.cpp: Remove the distinction between SSE2 and SSE4.2 #4536

Closed

AlexGuteniev mentioned this pull request May 7, 2024

minmax 8 and 16 bit elements are not vectorized #4660

Closed
