Dedicated ranges::minmax vectorization that does not unnecessarily track element pointer #4384

Merged
merged 36 commits into from
Feb 16, 2024

Contributor

@AlexGuteniev AlexGuteniev commented Feb 10, 2024

Towards #2803

New header

Can't have extern "C" functions that use templates, so the new header accommodates non-template minmax structure variations.
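A minimal sketch of that constraint, assuming C++17. Every name here (`minmax_i32`, `demo_minmax_i32`, `demo_minmax`, and so on) is hypothetical and not taken from the actual header. The C-linkage entry points cannot be templates, so each element type gets its own plain result struct and its own function, and a thin template on the C++ side chooses between them:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical non-template result structs, one per element type.
struct minmax_i32 { std::int32_t min_val; std::int32_t max_val; };
struct minmax_u32 { std::uint32_t min_val; std::uint32_t max_val; };

namespace {
    // Internal helper; it may be a template because it has C++ linkage.
    template <class Result, class T>
    Result scan(const T* first, const T* last) noexcept {
        assert(first != last); // non-empty input is a precondition
        Result r{*first, *first};
        for (++first; first != last; ++first) {
            if (*first < r.min_val) r.min_val = *first;
            if (r.max_val < *first) r.max_val = *first;
        }
        return r;
    }
}

// The exported entry points: plain functions, one per type.
extern "C" {
    minmax_i32 demo_minmax_i32(const std::int32_t* first, const std::int32_t* last) noexcept {
        return scan<minmax_i32>(first, last);
    }
    minmax_u32 demo_minmax_u32(const std::uint32_t* first, const std::uint32_t* last) noexcept {
        return scan<minmax_u32>(first, last);
    }
}

// Header-side template wrapper that selects the matching entry point.
template <class T>
auto demo_minmax(const T* first, const T* last) noexcept {
    if constexpr (std::is_same_v<T, std::int32_t>) {
        return demo_minmax_i32(first, last);
    } else {
        static_assert(std::is_same_v<T, std::uint32_t>, "sketch covers two types only");
        return demo_minmax_u32(first, last);
    }
}
```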

Sign handling

Unlike the _element counterparts, integer signedness here affects which function is selected rather than being passed as a runtime parameter.
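To illustrate the difference, here is a rough sketch of the two styles for 1-byte elements, assuming C++17. All function names are made up for illustration and the bodies are plain scalar loops, not the vectorized code from this PR:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Style A (hypothetical): one entry point, signedness passed at run time.
const void* demo_min_element_1(const void* first, const void* last, bool is_signed) noexcept {
    assert(first != last);
    const auto* p = static_cast<const unsigned char*>(first);
    const auto* e = static_cast<const unsigned char*>(last);
    // Flipping the top bit maps signed order onto unsigned order.
    const unsigned char bias = is_signed ? 0x80 : 0x00;
    const unsigned char* best = p;
    for (const unsigned char* it = p + 1; it != e; ++it) {
        if ((*it ^ bias) < (*best ^ bias)) best = it;
    }
    return best;
}

// Style B (hypothetical): signedness is part of the function name.
std::int8_t demo_min_1i(const std::int8_t* first, const std::int8_t* last) noexcept {
    assert(first != last);
    std::int8_t m = *first;
    for (++first; first != last; ++first) if (*first < m) m = *first;
    return m;
}
std::uint8_t demo_min_1u(const std::uint8_t* first, const std::uint8_t* last) noexcept {
    assert(first != last);
    std::uint8_t m = *first;
    for (++first; first != last; ++first) if (*first < m) m = *first;
    return m;
}

// The caller resolves the signedness at compile time.
template <class T>
T demo_min(const T* first, const T* last) noexcept {
    if constexpr (std::is_same_v<T, std::int8_t>) {
        return demo_min_1i(first, last);
    } else {
        static_assert(std::is_same_v<T, std::uint8_t>, "sketch covers 1-byte integers only");
        return demo_min_1u(first, last);
    }
}
```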

For minmax_element, computing the element pointer requires the index, so a cmpgt + blendv pair is needed to capture it. cmpgt is available only in signed form, so unsigned types are supported by adjusting the input values.
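The idea can be modeled in portable scalar code. This is only a lane-by-lane sketch with hypothetical names (`model_cmpgt`, `model_blendv`, `demo_min_index_u32`) and does not use real intrinsics. It assumes 4 lanes of 32 bits, an input length that is a nonzero multiple of 4, and indices that fit in 32 bits:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t lanes = 4;
using vec = std::array<std::int32_t, lanes>;

// Models a signed per-lane greater-than: all-ones mask where a > b.
vec model_cmpgt(const vec& a, const vec& b) {
    vec mask{};
    for (std::size_t i = 0; i < lanes; ++i) mask[i] = a[i] > b[i] ? -1 : 0;
    return mask;
}

// Models a variable blend: takes b where the mask is set, otherwise a.
vec model_blendv(const vec& a, const vec& b, const vec& mask) {
    vec out{};
    for (std::size_t i = 0; i < lanes; ++i) out[i] = mask[i] < 0 ? b[i] : a[i];
    return out;
}

// Input adjustment for unsigned values: subtracting 2^31 is equivalent to
// flipping the top bit, so unsigned order becomes signed order.
std::int32_t bias(std::uint32_t x) {
    return static_cast<std::int32_t>(static_cast<std::int64_t>(x) - 2147483648LL);
}

// Index of the first minimum.
std::size_t demo_min_index_u32(const std::uint32_t* data, std::size_t n) {
    assert(n != 0 && n % lanes == 0);
    vec cur_min{};
    vec cur_idx{};
    for (std::size_t i = 0; i < lanes; ++i) {
        cur_min[i] = bias(data[i]);
        cur_idx[i] = static_cast<std::int32_t>(i);
    }
    for (std::size_t base = lanes; base < n; base += lanes) {
        vec vals{};
        vec idx{};
        for (std::size_t i = 0; i < lanes; ++i) {
            vals[i] = bias(data[base + i]);
            idx[i]  = static_cast<std::int32_t>(base + i);
        }
        // Strict compare keeps the earliest occurrence within each lane.
        const vec is_less = model_cmpgt(cur_min, vals);
        cur_min = model_blendv(cur_min, vals, is_less);
        cur_idx = model_blendv(cur_idx, idx, is_less); // the same mask captures the index
    }
    // Horizontal step: smallest value wins, ties go to the smallest index.
    std::size_t best = 0;
    for (std::size_t i = 1; i < lanes; ++i) {
        if (cur_min[i] < cur_min[best]
            || (cur_min[i] == cur_min[best] && cur_idx[i] < cur_idx[best])) {
            best = i;
        }
    }
    return static_cast<std::size_t>(cur_idx[best]);
}
```

The mask has to be applied twice, once to the values and once to the indices, which is the extra work the value-only functions avoid.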

For minmax values, computing only the vertical min/max without the index does not need the cmpgt + blendv pair for most types, as there are both signed and unsigned min/max operations. The only exception is 64-bit integers: there are no min/max ops for them, so blendv is still used. For consistency, which the template use also enforces, the sign is passed via function selection even for 64-bit integers.
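A rough scalar model of the value-only path, assuming 16 lanes of `uint8_t`. The name `demo_min_value_u8` is hypothetical, and the inner loop stands in for a single unsigned vertical-min operation rather than reproducing the PR's code:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t lanes = 16;

std::uint8_t demo_min_value_u8(const std::uint8_t* data, std::size_t n) {
    assert(n != 0);
    std::uint8_t result = data[0];
    std::size_t i = 0;
    if (n >= lanes) {
        // One accumulator per lane; no index is tracked anywhere.
        std::array<std::uint8_t, lanes> acc{};
        std::copy(data, data + lanes, acc.begin());
        for (i = lanes; i + lanes <= n; i += lanes) {
            for (std::size_t l = 0; l < lanes; ++l) {
                acc[l] = std::min(acc[l], data[i + l]); // vertical min, lane by lane
            }
        }
        // Horizontal reduction across the lanes.
        result = *std::min_element(acc.begin(), acc.end());
    }
    // Scalar tail for the elements that do not fill a whole vector.
    for (; i < n; ++i) result = std::min(result, data[i]);
    return result;
}
```

Because the comparison is already signed or unsigned as needed, no input adjustment and no mask are required here.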

Benchmark

The _val rows are the new results; for those we indirect to get the value.

I combined the benchmark output from before and after into a single table:
--------------------------------------------------------------------------------------------------
Benchmark                                old time        old CPU        new time       new CPU   
--------------------------------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>                1345 ns         1350 ns       1336 ns         1350 ns   
bm<uint8_t, 8021, Op::Max>                1192 ns         1193 ns       2146 ns         1842 ns   
bm<uint8_t, 8021, Op::Both>               1675 ns         1674 ns       1883 ns         1918 ns   
bm<uint8_t, 8021, Op::Min_val>            8482 ns         8370 ns        665 ns          666 ns   
bm<uint8_t, 8021, Op::Max_val>            7954 ns         8022 ns        567 ns          578 ns   
bm<uint8_t, 8021, Op::Both_val>          27377 ns        27623 ns      30814 ns        30691 ns   
bm<uint16_t, 8021, Op::Min>               1841 ns         1814 ns       2091 ns         2100 ns   
bm<uint16_t, 8021, Op::Max>               1840 ns         1859 ns       1826 ns         1859 ns   
bm<uint16_t, 8021, Op::Both>              1819 ns         1842 ns       1999 ns         2009 ns   
bm<uint16_t, 8021, Op::Min_val>          10433 ns        10498 ns        591 ns          600 ns   
bm<uint16_t, 8021, Op::Max_val>          10183 ns        10254 ns        619 ns          614 ns   
bm<uint16_t, 8021, Op::Both_val>         22395 ns        21973 ns      23348 ns        23438 ns   
bm<uint32_t, 8021, Op::Min>               2455 ns         2455 ns       2283 ns         2288 ns   
bm<uint32_t, 8021, Op::Max>               3746 ns         3749 ns       3977 ns         4011 ns   
bm<uint32_t, 8021, Op::Both>              4417 ns         4443 ns       4784 ns         4649 ns   
bm<uint32_t, 8021, Op::Min_val>           2182 ns         2176 ns       1107 ns         1099 ns   
bm<uint32_t, 8021, Op::Max_val>           3453 ns         3453 ns       1688 ns         1674 ns   
bm<uint32_t, 8021, Op::Both_val>          4251 ns         4248 ns       1075 ns         1099 ns   
bm<uint64_t, 8021, Op::Min>               6909 ns         6801 ns       6827 ns         6801 ns   
bm<uint64_t, 8021, Op::Max>               6867 ns         6801 ns       6985 ns         6975 ns   
bm<uint64_t, 8021, Op::Both>              7646 ns         7499 ns       7570 ns         7499 ns   
bm<uint64_t, 8021, Op::Min_val>           6619 ns         6696 ns       5925 ns         5938 ns   
bm<uint64_t, 8021, Op::Max_val>           6942 ns         6975 ns       5912 ns         5859 ns   
bm<uint64_t, 8021, Op::Both_val>          8901 ns         8789 ns       6103 ns         5999 ns   
bm<int8_t, 8021, Op::Min>                  901 ns          907 ns        915 ns          921 ns   
bm<int8_t, 8021, Op::Max>                  833 ns          837 ns        885 ns          889 ns   
bm<int8_t, 8021, Op::Both>                1091 ns         1099 ns       1148 ns         1147 ns   
bm<int8_t, 8021, Op::Min_val>             6150 ns         6278 ns        398 ns          394 ns   
bm<int8_t, 8021, Op::Max_val>             6979 ns         6975 ns        414 ns          419 ns   
bm<int8_t, 8021, Op::Both_val>           26893 ns        26681 ns      23398 ns        23542 ns   
bm<int16_t, 8021, Op::Min>                1626 ns         1639 ns       1764 ns         1765 ns   
bm<int16_t, 8021, Op::Max>                1701 ns         1660 ns       1729 ns         1709 ns   
bm<int16_t, 8021, Op::Both>               1690 ns         1709 ns       1720 ns         1744 ns   
bm<int16_t, 8021, Op::Min_val>            9462 ns         9277 ns        858 ns          865 ns   
bm<int16_t, 8021, Op::Max_val>            9000 ns         8998 ns        856 ns          858 ns   
bm<int16_t, 8021, Op::Both_val>          19110 ns        19043 ns      18287 ns        18415 ns   
bm<int32_t, 8021, Op::Min>                2164 ns         2176 ns       3264 ns         3191 ns   
bm<int32_t, 8021, Op::Max>                3370 ns         3418 ns       3489 ns         3488 ns   
bm<int32_t, 8021, Op::Both>               4296 ns         4238 ns       4274 ns         4332 ns   
bm<int32_t, 8021, Op::Min_val>            2115 ns         2086 ns       1619 ns         1639 ns   
bm<int32_t, 8021, Op::Max_val>            3439 ns         3376 ns        928 ns          924 ns   
bm<int32_t, 8021, Op::Both_val>           4079 ns         4049 ns       1025 ns         1025 ns   
bm<int64_t, 8021, Op::Min>                6861 ns         6836 ns       6842 ns         6801 ns   
bm<int64_t, 8021, Op::Max>                6898 ns         6801 ns       6989 ns         6975 ns   
bm<int64_t, 8021, Op::Both>               8925 ns         8789 ns       8976 ns         8998 ns   
bm<int64_t, 8021, Op::Min_val>            6714 ns         6696 ns       5947 ns         5859 ns   
bm<int64_t, 8021, Op::Max_val>            6853 ns         6975 ns       5860 ns         5859 ns   
bm<int64_t, 8021, Op::Both_val>           7852 ns         7743 ns       6108 ns         6138 ns   
bm<float, 8021, Op::Min>                  3662 ns         3683 ns       3260 ns         3296 ns   
bm<float, 8021, Op::Max>                  3477 ns         3449 ns       3548 ns         3530 ns   
bm<float, 8021, Op::Both>                 5127 ns         5190 ns       5140 ns         5190 ns   
bm<float, 8021, Op::Min_val>              3375 ns         3376 ns       2873 ns         2916 ns   
bm<float, 8021, Op::Max_val>              3534 ns         3610 ns       2848 ns         2888 ns   
bm<float, 8021, Op::Both_val>             4958 ns         4743 ns       2916 ns         2916 ns   
bm<double, 8021, Op::Min>                 6103 ns         6138 ns       6017 ns         5999 ns   
bm<double, 8021, Op::Max>                 6001 ns         5999 ns       6292 ns         6278 ns   
bm<double, 8021, Op::Both>                7307 ns         7324 ns       7773 ns         7847 ns   
bm<double, 8021, Op::Min_val>             6021 ns         5999 ns       5963 ns         5999 ns   
bm<double, 8021, Op::Max_val>             5984 ns         5999 ns       5768 ns         5781 ns   
bm<double, 8021, Op::Both_val>            7120 ns         7150 ns       6024 ns         5999 ns   

Some rows show an order-of-magnitude improvement (for example, uint8_t Min_val drops from 8482 ns to 665 ns).
Others differ only within the variation between runs.

I interpret these results as the change being a pure improvement: the rows that don't show it are memory-bound rather than CPU-bound, which could change on a different machine or with a smaller input.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner February 10, 2024 18:58
AlexGuteniev and others added 2 commits February 11, 2024 15:29
Co-authored-by: A. Jiang <de34@live.cn>
@StephanTLavavej StephanTLavavej added the performance Must go faster label Feb 11, 2024
@StephanTLavavej
Member

StephanTLavavej commented Feb 12, 2024

Thanks! I resolved the remaining feedback comments, and decided to simply require SSE 4.2 for the new optimization, consistent with the previous optimization. This is a lot easier to reason about.

  • Don't need _To_address for initializer_list.
  • _STL_ASSERT non-empty inputs, update citations.
  • Cleanup comment punctuation.
  • Avoid risk by always guarding with _Use_sse42().
  • Make room for comments by renaming lambda _First and _Second to _Val1 and _Val2.
    • Regex renamed: \[\]\((__m128[id]?) _First, \1 _Second\) \{ return (\w+)\(_First, _Second\); \}
    • To: []($1 _Val1, $1 _Val2) { return $2(_Val1, _Val2); }
  • Comment all _Minmax_traits_MEOW intrinsics above SSE2.
    • Remove comments from _mm_min_epi16 and _mm_max_epi16 - they're SSE2, not SSE4.1.
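For reference, the rename above can be reproduced with `std::regex` in its default ECMAScript mode, which supports the `\1` backreference in the pattern and `$1`/`$2` in the replacement. This is only a sketch; the wrapper name `demo_rename_lambda_params` is made up, and the actual rename was presumably done in an editor:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Applies the rename from the list above to every matching lambda in `source`.
std::string demo_rename_lambda_params(const std::string& source) {
    static const std::regex pattern(
        R"re(\[\]\((__m128[id]?) _First, \1 _Second\) \{ return (\w+)\(_First, _Second\); \})re");
    // Group 1 is the vector type, group 2 is the intrinsic name.
    assert(pattern.mark_count() == 2);
    return std::regex_replace(
        source, pattern, "[]($1 _Val1, $1 _Val2) { return $2(_Val1, _Val2); }");
}
```

The backreference `\1` means a lambda is rewritten only when both parameters have the same vector type.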

⚠️ Note to self:

Need MSVC-internal changes to add <__msvc_minmax.hpp>.

@StephanTLavavej
Member

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@CaseyCarter CaseyCarter removed their assignment Feb 16, 2024
@StephanTLavavej StephanTLavavej merged commit c53ac59 into microsoft:main Feb 16, 2024
35 checks passed
@StephanTLavavej
Member

Thanks for maximizing performance! 😹 🚀 😻
