NNUE: Use ranges to avoid read/write to memory? #5547

octopus-prime · 2024-08-22T12:34:07Z

A very naive question again...

        struct alignas(CacheLineSize) Buffer {
            alignas(CacheLineSize) typename decltype(fc_0)::OutputBuffer fc_0_out;
            alignas(CacheLineSize) typename decltype(ac_sqr_0)::OutputType
              ac_sqr_0_out[ceil_to_multiple<IndexType>(FC_0_OUTPUTS * 2, 32)];
            alignas(CacheLineSize) typename decltype(ac_0)::OutputBuffer ac_0_out;
            alignas(CacheLineSize) typename decltype(fc_1)::OutputBuffer fc_1_out;
            alignas(CacheLineSize) typename decltype(ac_1)::OutputBuffer ac_1_out;
            alignas(CacheLineSize) typename decltype(fc_2)::OutputBuffer fc_2_out;

            Buffer() { std::memset(this, 0, sizeof(*this)); }
        };

There is a buffer for every propagate, right?

            {
                const __m256i words0 =
                  _mm256_srli_epi16(_mm256_packus_epi32(_mm256_load_si256(&in[i * 4 + 0]),
                                                        _mm256_load_si256(&in[i * 4 + 1])),
                                    WeightScaleBits);
                const __m256i words1 =
                  _mm256_srli_epi16(_mm256_packus_epi32(_mm256_load_si256(&in[i * 4 + 2]),
                                                        _mm256_load_si256(&in[i * 4 + 3])),
                                    WeightScaleBits);
                _mm256_store_si256(&out[i], _mm256_permutevar8x32_epi32(
                                              _mm256_packs_epi16(words0, words1), Offsets));
            }

So there are lots of reads and writes to these buffers...

Can we avoid some of these buffers and read/writes?

Let's focus on these buffers.

            alignas(CacheLineSize) typename decltype(ac_sqr_0)::OutputType
              ac_sqr_0_out[ceil_to_multiple<IndexType>(FC_0_OUTPUTS * 2, 32)];
            alignas(CacheLineSize) typename decltype(ac_0)::OutputBuffer ac_0_out;
            alignas(CacheLineSize) typename decltype(ac_1)::OutputBuffer ac_1_out;

Could we try to work with ranges instead of these buffers?

// simd_r = input range of simd

// 4 int32x8_t -> 1 int8x32_t
simd_r auto sqr_clipped_relu(simd_r auto input);

// 4 int32x8_t -> 1 int8x32_t
simd_r auto clipped_relu(simd_r auto input);

// change in eval
// old:
        fc_0.propagate(transformedFeatures, buffer.fc_0_out);
        ac_sqr_0.propagate(buffer.fc_0_out, buffer.ac_sqr_0_out);
        ac_0.propagate(buffer.fc_0_out, buffer.ac_0_out);
        std::memcpy(buffer.ac_sqr_0_out + FC_0_OUTPUTS, buffer.ac_0_out,
                    FC_0_OUTPUTS * sizeof(typename decltype(ac_0)::OutputType));
        fc_1.propagate(buffer.ac_sqr_0_out, buffer.fc_1_out);
        ac_1.propagate(buffer.fc_1_out, buffer.ac_1_out);
        fc_2.propagate(buffer.ac_1_out, buffer.fc_2_out);

// new:
        fc_0.propagate(transformedFeatures, buffer.fc_0_out);
        simd_r auto x = sqr_clipped_relu(buffer.fc_0_out);
        simd_r auto y = clipped_relu(buffer.fc_0_out);
        simd_r auto z = std::views::concat(x, y);
        fc_1.propagate(z, buffer.fc_1_out);
        simd_r auto a = clipped_relu(buffer.fc_1_out);
        fc_2.propagate(a, buffer.fc_2_out);

Did someone try?

cj5716 · 2024-08-22T13:20:40Z

These changes are very possible, but will effectively have 0 speed difference, given that most of the time spent in eval is fc_0. I do think it is neater to implement fc_0 + sqr_clipped_relu + clipped_relu as something like an fac_0 rather than the form you have here, though.

I also have a patch that saves a store before fc_0, which should make the gain of decent magnitude. However, I've been unable to debug it (https://github.com/cj5716/Stockfish/tree/ill-just-git-gud/). Perhaps looking into this part of the code would be better?

P.S. feel free to test these speedups on fishtest! I wish you luck for your first contribution!

3 replies

octopus-prime Aug 22, 2024
Author

To make sure I get your idea right...

If sqr_clipped_relu and clipped_relu are part of fac_0, can I just add the elements of sqr_clipped_relu to the elements of clipped_relu, to produce only one range of output elements?

Later we would just multiply the larger (summed) inputs with the weights, but less often (L2 instead of 2 * L2).

octopus-prime Aug 22, 2024
Author

Oh no... int8 is too small to hold the sum of two int8 values.
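The range problem can be checked with a few lines of arithmetic. This is a small sketch that assumes both activations produce values in [0, 127]; the helper names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Sum of two activation outputs, computed in a wider type so nothing is lost.
constexpr int widened_sum(std::int8_t a, std::int8_t b) {
    return int(a) + int(b);
}

// True if a value is representable as int8_t.
constexpr bool fits_in_int8(int v) {
    return v >= std::numeric_limits<std::int8_t>::min()
        && v <= std::numeric_limits<std::int8_t>::max();
}

// Saturating add, comparable to what _mm256_adds_epi8 does per lane:
// the result is clamped to [-128, 127] instead of wrapping around.
constexpr std::int8_t saturating_add(std::int8_t a, std::int8_t b) {
    const int s = widened_sum(a, b);
    return static_cast<std::int8_t>(s > 127 ? 127 : (s < -128 ? -128 : s));
}
```

With both inputs at 127 the true sum is 254, which does not fit, and a saturating add would report 127 instead. Separately from the range issue, the old code feeds fc_1 the two activations side by side, so each half gets its own weights; adding the halves would only be equivalent if those two sets of weights were identical.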

cj5716 Aug 22, 2024

What I mean is, it's neater to group

    fc_0.propagate(transformedFeatures, buffer.fc_0_out);
    simd_r auto x = sqr_clipped_relu(buffer.fc_0_out);
    simd_r auto y = clipped_relu(buffer.fc_0_out);
    simd_r auto z = std::views::concat(x, y);

as just

    fac_0.propagate(transformedFeatures, buffer.fac_0_out);

MinetaS · 2024-08-22T14:38:16Z

NNUE code is designed to be flexible to NN architecture changes (different L1/L2/L3 size or more/fewer layers). All layer buffers after FT are likely to be placed within L1 cache so the memory read/write speed is already fast enough. Even if it's slightly faster (with ridiculously short TC), it's another question whether maintainers would accept it or not.

Sopel97 · 2024-08-22T15:09:07Z

There are probably a few stores that could be omitted (I assume loads are omitted where possible with O3), but ultimately the vast majority of the cost is in the first 2 layers, where the outputs are too large to fit in registers and are needed as a whole before evaluation of the next layer can proceed, due to the nature of fully connected layers.
