Optimize Span<T>.Fill implementation #51365

Merged: 5 commits into dotnet:main from span_fill, Apr 17, 2021

@GrabYourPitchforks (Member) commented Apr 16, 2021

This optimizes Span<T>.Fill via three primary mechanisms:

  • For T = byte, forwards directly to the initblk (memset) implementation
  • For T = <primitive>, uses a SIMD-optimized worker loop if feasible
  • Removes requirement for caller to stack-spill the span argument before calling main worker API
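As a rough illustration of the first and third mechanisms, the sketch below forwards the byte case to `Unsafe.InitBlockUnaligned` and passes a `ref T` plus a length to the worker instead of the span itself. The class and method names (`FillDispatchSketch`, `FillCore`) are hypothetical and this is not the runtime's actual code; the SIMD path is omitted, and the empty-span guard is a conservative choice of this sketch.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class FillDispatchSketch
{
    // Hypothetical entry point; mirrors the shape of Span<T>.Fill only loosely.
    public static void Fill<T>(Span<T> span, T value)
    {
        if (typeof(T) == typeof(byte))
        {
            // byte: hand the whole range to initblk (memset).
            // The length guard is this sketch's choice, not a claim about the runtime.
            if (!span.IsEmpty)
            {
                Unsafe.InitBlockUnaligned(
                    ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
                    Unsafe.As<T, byte>(ref value),
                    (uint)span.Length);
            }
            return;
        }

        // Pass a reference and a length so the worker never needs the span struct itself.
        FillCore(ref MemoryMarshal.GetReference(span), (nuint)span.Length, value);
    }

    // Hypothetical worker: plain scalar loop over (ref, length).
    private static void FillCore<T>(ref T first, nuint count, T value)
    {
        for (nuint i = 0; i < count; i++)
        {
            Unsafe.Add(ref first, (nint)i) = value;
        }
    }
}
```

Calling `FillDispatchSketch.Fill(buffer.AsSpan(), (byte)7)` takes the initblk branch, while any other element type falls through to the scalar worker.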

The central SIMD loop doesn't attempt to perform any type of alignment optimization. We can consider adding this in the future if benchmarking shows this to be a worthwhile addition.
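For readers unfamiliar with the pattern, an unaligned SIMD fill loop can be sketched with the public `Vector<T>` API as follows. This is a simplified stand-in, not the PR's implementation: `VectorFillSketch` is a hypothetical name, the sketch assumes `T` is a primitive that `Vector<T>` supports (other types throw `NotSupportedException`), and it finishes with a scalar tail loop for simplicity.

```csharp
using System;
using System.Numerics;

public static class VectorFillSketch
{
    // Hypothetical helper. Assumes T is a Vector<T>-supported primitive (int, float, long, ...).
    public static void Fill<T>(Span<T> span, T value) where T : unmanaged
    {
        int i = 0;
        int width = Vector<T>.Count; // elements per vector

        if (Vector.IsHardwareAccelerated && span.Length >= width)
        {
            // Broadcast the value into every lane once, outside the loop.
            var broadcast = new Vector<T>(value);

            // Store whole vectors with no attempt to align the destination address.
            for (; i <= span.Length - width; i += width)
            {
                broadcast.CopyTo(span.Slice(i));
            }
        }

        // Scalar tail for whatever the vector loop did not cover.
        for (; i < span.Length; i++)
        {
            span[i] = value;
        }
    }
}
```

For example, `VectorFillSketch.Fill(new int[37].AsSpan(), 42)` writes as many full vectors as fit and fills the remaining elements one at a time.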

I also didn't investigate any other call sites throughout the runtime + libraries to see if they should be migrated from whatever existing code they might have to this Span<T>.Fill implementation. That can come as a future commit to this PR or as a future PR.

Benchmark code:

using System;
using BenchmarkDotNet.Attributes;

// One benchmark instantiation per element type.
[GenericTypeArguments(typeof(byte))]
[GenericTypeArguments(typeof(char))]
[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(long))]
[GenericTypeArguments(typeof(float))]
[GenericTypeArguments(typeof(double))]
[GenericTypeArguments(typeof(decimal))]
[GenericTypeArguments(typeof(string))]
public class SpanFillRunner<T>
{
    private T[] _arr;

    private T _value;

    // Number of elements in the array being filled.
    [Params(0, 3, 7, 15, 16, 24, 128, 512)]
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        _arr = new T[Size];

        // Convert 42 to T (for string this yields "42").
        _value = (T)((IConvertible)42).ToType(typeof(T), null);
    }

    [Benchmark]
    public void Fill()
    {
        var arr = _arr;
        _ = arr.Length; // prove not null
        arr.AsSpan().Fill(_value);
    }
}

Benchmark results:

byte

(Note: The internal memset routine uses nontemporal stores, which could explain the blazing fast runtime.)

Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.882 ns 0.0093 ns 0.0083 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.382 ns 0.0320 ns 0.0300 ns 0.74 0.01
Fill Job-NKOJQM main 3 3.449 ns 0.0172 ns 0.0144 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 1.865 ns 0.0445 ns 0.0395 ns 0.54 0.01
Fill Job-NKOJQM main 7 3.439 ns 0.0112 ns 0.0100 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 1.854 ns 0.0195 ns 0.0163 ns 0.54 0.01
Fill Job-NKOJQM main 15 3.386 ns 0.0115 ns 0.0102 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 1.857 ns 0.0219 ns 0.0183 ns 0.55 0.01
Fill Job-NKOJQM main 16 3.424 ns 0.0115 ns 0.0096 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 1.845 ns 0.0181 ns 0.0151 ns 0.54 0.01
Fill Job-NKOJQM main 24 3.211 ns 0.0173 ns 0.0153 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 1.918 ns 0.0093 ns 0.0087 ns 0.60 0.00
Fill Job-NKOJQM main 128 5.484 ns 0.0543 ns 0.0454 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 4.321 ns 0.1117 ns 0.1602 ns 0.79 0.03
Fill Job-NKOJQM main 512 7.521 ns 0.0423 ns 0.0395 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 6.451 ns 0.0442 ns 0.0345 ns 0.86 0.01
char
Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.880 ns 0.0569 ns 0.0532 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.355 ns 0.0039 ns 0.0036 ns 0.72 0.02
Fill Job-NKOJQM main 3 3.436 ns 0.0137 ns 0.0128 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.511 ns 0.0167 ns 0.0131 ns 0.73 0.00
Fill Job-NKOJQM main 7 4.199 ns 0.0524 ns 0.0490 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 2.300 ns 0.0121 ns 0.0113 ns 0.55 0.01
Fill Job-NKOJQM main 15 5.261 ns 0.0325 ns 0.0304 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 2.591 ns 0.0353 ns 0.0330 ns 0.49 0.01
Fill Job-NKOJQM main 16 5.263 ns 0.0192 ns 0.0170 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 2.516 ns 0.0150 ns 0.0140 ns 0.48 0.00
Fill Job-NKOJQM main 24 6.809 ns 0.0126 ns 0.0106 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 2.296 ns 0.0105 ns 0.0098 ns 0.34 0.00
Fill Job-NKOJQM main 128 30.955 ns 0.1179 ns 0.1103 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 3.300 ns 0.0942 ns 0.1723 ns 0.10 0.01
Fill Job-NKOJQM main 512 130.535 ns 1.7414 ns 1.5437 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 7.081 ns 0.1504 ns 0.1407 ns 0.05 0.00
int
Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.603 ns 0.0630 ns 0.1183 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.720 ns 0.1739 ns 0.5127 ns 0.98 0.23
Fill Job-NKOJQM main 3 2.611 ns 0.0225 ns 0.0211 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.030 ns 0.0258 ns 0.0241 ns 0.78 0.01
Fill Job-NKOJQM main 7 2.638 ns 0.0194 ns 0.0181 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 2.062 ns 0.0059 ns 0.0055 ns 0.78 0.01
Fill Job-NKOJQM main 15 4.415 ns 0.0084 ns 0.0075 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 1.614 ns 0.0045 ns 0.0042 ns 0.37 0.00
Fill Job-NKOJQM main 16 5.104 ns 0.0169 ns 0.0149 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 2.060 ns 0.0046 ns 0.0041 ns 0.40 0.00
Fill Job-NKOJQM main 24 7.028 ns 0.0173 ns 0.0162 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 1.815 ns 0.0157 ns 0.0147 ns 0.26 0.00
Fill Job-NKOJQM main 128 29.413 ns 0.0861 ns 0.0805 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 5.971 ns 0.0835 ns 0.1609 ns 0.21 0.01
Fill Job-NKOJQM main 512 117.060 ns 0.8512 ns 0.7962 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 17.274 ns 0.0655 ns 0.0512 ns 0.15 0.00
long
Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.367 ns 0.0170 ns 0.0159 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.151 ns 0.0086 ns 0.0080 ns 0.84 0.01
Fill Job-NKOJQM main 3 2.940 ns 0.0849 ns 0.0977 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.055 ns 0.0052 ns 0.0044 ns 0.69 0.02
Fill Job-NKOJQM main 7 2.635 ns 0.0131 ns 0.0122 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 1.598 ns 0.0203 ns 0.0189 ns 0.61 0.01
Fill Job-NKOJQM main 15 4.427 ns 0.0123 ns 0.0115 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 1.662 ns 0.0103 ns 0.0097 ns 0.38 0.00
Fill Job-NKOJQM main 16 5.116 ns 0.0095 ns 0.0079 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 2.526 ns 0.0072 ns 0.0067 ns 0.49 0.00
Fill Job-NKOJQM main 24 6.957 ns 0.0298 ns 0.0279 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 2.974 ns 0.0103 ns 0.0096 ns 0.43 0.00
Fill Job-NKOJQM main 128 29.398 ns 0.0854 ns 0.0799 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 13.634 ns 0.0477 ns 0.0423 ns 0.46 0.00
Fill Job-NKOJQM main 512 116.306 ns 0.1551 ns 0.1295 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 60.719 ns 0.1973 ns 0.1845 ns 0.52 0.00
float
Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.367 ns 0.0189 ns 0.0177 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.141 ns 0.0078 ns 0.0073 ns 0.83 0.01
Fill Job-NKOJQM main 3 3.011 ns 0.0395 ns 0.0369 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.055 ns 0.0073 ns 0.0069 ns 0.68 0.01
Fill Job-NKOJQM main 7 2.742 ns 0.0432 ns 0.0404 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 2.054 ns 0.0233 ns 0.0218 ns 0.75 0.01
Fill Job-NKOJQM main 15 4.565 ns 0.1167 ns 0.1949 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 1.590 ns 0.0297 ns 0.0264 ns 0.34 0.02
Fill Job-NKOJQM main 16 4.693 ns 0.0448 ns 0.0419 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 2.078 ns 0.0031 ns 0.0027 ns 0.44 0.00
Fill Job-NKOJQM main 24 6.848 ns 0.0843 ns 0.0789 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 1.815 ns 0.0172 ns 0.0161 ns 0.27 0.00
Fill Job-NKOJQM main 128 29.045 ns 0.0866 ns 0.0810 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 6.326 ns 0.0459 ns 0.0429 ns 0.22 0.00
Fill Job-NKOJQM main 512 115.432 ns 0.4247 ns 0.3973 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 32.200 ns 0.1476 ns 0.1381 ns 0.28 0.00
double
Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.358 ns 0.0233 ns 0.0207 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.131 ns 0.0163 ns 0.0153 ns 0.83 0.01
Fill Job-NKOJQM main 3 2.956 ns 0.0491 ns 0.0459 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.024 ns 0.0366 ns 0.0342 ns 0.68 0.02
Fill Job-NKOJQM main 7 2.931 ns 0.0346 ns 0.0323 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 1.625 ns 0.0335 ns 0.0314 ns 0.55 0.01
Fill Job-NKOJQM main 15 4.455 ns 0.0533 ns 0.0499 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 1.678 ns 0.0621 ns 0.0829 ns 0.36 0.01
Fill Job-NKOJQM main 16 5.098 ns 0.0513 ns 0.0480 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 2.508 ns 0.0802 ns 0.1150 ns 0.51 0.02
Fill Job-NKOJQM main 24 6.853 ns 0.0835 ns 0.0781 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 2.943 ns 0.0378 ns 0.0353 ns 0.43 0.00
Fill Job-NKOJQM main 128 28.694 ns 0.2794 ns 0.2613 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 13.549 ns 0.0357 ns 0.0298 ns 0.47 0.00
Fill Job-NKOJQM main 512 115.105 ns 1.1767 ns 1.1007 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 60.911 ns 0.7672 ns 0.7177 ns 0.53 0.01
decimal

(This exercises non-SIMD code paths.)

Method Job Toolchain Size Mean Error StdDev Ratio RatioSD
Fill Job-NKOJQM main 0 1.387 ns 0.0146 ns 0.0136 ns 1.00 0.00
Fill Job-DYLJDS spanfill 0 1.350 ns 0.0227 ns 0.0212 ns 0.97 0.02
Fill Job-NKOJQM main 3 3.848 ns 0.0300 ns 0.0281 ns 1.00 0.00
Fill Job-DYLJDS spanfill 3 2.878 ns 0.0370 ns 0.0346 ns 0.75 0.01
Fill Job-NKOJQM main 7 5.386 ns 0.0591 ns 0.0553 ns 1.00 0.00
Fill Job-DYLJDS spanfill 7 4.827 ns 0.0740 ns 0.0692 ns 0.90 0.01
Fill Job-NKOJQM main 15 10.885 ns 0.0963 ns 0.0804 ns 1.00 0.00
Fill Job-DYLJDS spanfill 15 10.343 ns 0.0469 ns 0.0391 ns 0.95 0.01
Fill Job-NKOJQM main 16 11.545 ns 0.0925 ns 0.0866 ns 1.00 0.00
Fill Job-DYLJDS spanfill 16 10.919 ns 0.0234 ns 0.0207 ns 0.95 0.01
Fill Job-NKOJQM main 24 16.824 ns 0.0822 ns 0.0769 ns 1.00 0.00
Fill Job-DYLJDS spanfill 24 16.434 ns 0.0904 ns 0.0846 ns 0.98 0.01
Fill Job-NKOJQM main 128 87.277 ns 0.2652 ns 0.2480 ns 1.00 0.00
Fill Job-DYLJDS spanfill 128 86.800 ns 0.1626 ns 0.1358 ns 0.99 0.00
Fill Job-NKOJQM main 512 347.978 ns 0.8310 ns 0.7773 ns 1.00 0.00
Fill Job-DYLJDS spanfill 512 349.540 ns 1.1889 ns 1.1121 ns 1.00 0.00
string

(This exercises non-SIMD code paths.)

Method Job Toolchain Size Mean Error StdDev Ratio
Fill Job-NKOJQM main 0 9.361 ns 0.0327 ns 0.0306 ns 1.00
Fill Job-DYLJDS spanfill 0 11.414 ns 0.0994 ns 0.0930 ns 1.22
Fill Job-NKOJQM main 3 22.621 ns 0.1303 ns 0.1088 ns 1.00
Fill Job-DYLJDS spanfill 3 15.888 ns 0.0810 ns 0.0758 ns 0.70
Fill Job-NKOJQM main 7 37.180 ns 0.4089 ns 0.3825 ns 1.00
Fill Job-DYLJDS spanfill 7 21.691 ns 0.3115 ns 0.2762 ns 0.58
Fill Job-NKOJQM main 15 62.400 ns 0.4737 ns 0.4431 ns 1.00
Fill Job-DYLJDS spanfill 15 32.190 ns 0.0847 ns 0.0792 ns 0.52
Fill Job-NKOJQM main 16 66.787 ns 0.2268 ns 0.2121 ns 1.00
Fill Job-DYLJDS spanfill 16 34.926 ns 0.2742 ns 0.2565 ns 0.52
Fill Job-NKOJQM main 24 97.488 ns 0.2009 ns 0.1781 ns 1.00
Fill Job-DYLJDS spanfill 24 46.627 ns 0.1166 ns 0.1034 ns 0.48
Fill Job-NKOJQM main 128 448.163 ns 0.8712 ns 0.8149 ns 1.00
Fill Job-DYLJDS spanfill 128 200.046 ns 0.5331 ns 0.4452 ns 0.45
Fill Job-NKOJQM main 512 1,744.556 ns 3.5163 ns 3.2891 ns 1.00
Fill Job-DYLJDS spanfill 512 748.678 ns 1.5795 ns 1.4002 ns 0.43
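
To read the tables above, it can help to turn a pair of Mean values into a ratio or a speedup factor. The helper below is only illustrative arithmetic on the reported means (the `RatioSketch` name is hypothetical); the Ratio column that BenchmarkDotNet prints is computed by the tool and need not match the plain ratio of means exactly.

```csharp
using System;

public static class RatioSketch
{
    // spanfill mean divided by main mean (both in ns); below 1.0 means the new code is faster.
    public static double Ratio(double mainNs, double spanfillNs) => spanfillNs / mainNs;

    // Inverse view: how many times faster the new code is.
    public static double Speedup(double mainNs, double spanfillNs) => mainNs / spanfillNs;
}
```

Using the char results at Size = 512 (130.535 ns on main, 7.081 ns on spanfill), the ratio is about 0.05 and the speedup about 18x; for string at Size = 512 (1,744.556 ns vs. 748.678 ns) the ratio is about 0.43.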

Resolves #24806.
Resolves #7049.

/cc @carlossanlop @jozkee

@GrabYourPitchforks added the enhancement, area-System.Memory, and tenet-performance labels Apr 16, 2021
@GrabYourPitchforks added this to the 6.0.0 milestone Apr 16, 2021
ghost commented Apr 16, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop
See info in area-owners.md if you want to be subscribed.

@GrabYourPitchforks (Member, Author)

System.Linq.Expressions.Tests failures are known issue #51346.

@stephentoub (Member) left a comment

Nice!

@SingleAccretion (Contributor)

A note on the memset flavor used by the runtime: it is quite sensitive to alignment, at least for x64. There is a comment about it in runtime source.

@GrabYourPitchforks (Member, Author)

@SingleAccretion good observation. I stepped through the memset implementation (used by initblk) and noticed that it performed alignment fixup before entering the main loop. So I wonder if this comment is stale?

@SingleAccretion (Contributor) commented Apr 16, 2021

> So I wonder if this comment is stale?

Very possible. My comment was the result of me measuring this a few months ago, and seeing the expected 2x regression (Ivy Bridge). If it is no longer accurate, that's great!

@GrabYourPitchforks (Member, Author)

Notable changes in latest commit:

  • The code now allows any non-ref-containing type whose size is a power of 2 and that fits into a vector, including Half, Guid, decimal, etc.
  • Cleaned up some of the write logic per feedback from @gfoidl
  • Cleaned up unit test logic to avoid reflection
  • Dropped the "zero elements?" early-exit check at the beginning of the method, as we generally shouldn't optimize for uncommon scenarios, and it'll eventually skip all the branches and hit method exit soon enough anyway
  • Changed the SpanHelpers routine to take nuint instead of uint since it's a cheap enough change and it makes this code path more resilient to any future NativeSpan additions
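
The eligibility rule in the first bullet can be approximated with public APIs as shown below. This is a sketch of the stated condition only, not the check the PR actually uses: `VectorEligibilitySketch` and `CanUseVectorFill` are hypothetical names, and using `Vector<byte>.Count` as "the size of a vector" is an assumption.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class VectorEligibilitySketch
{
    // Hypothetical predicate: true when T has no references, its size is a
    // power of 2, and one element fits inside a single Vector<byte>.
    public static bool CanUseVectorFill<T>()
    {
        int size = Unsafe.SizeOf<T>();
        bool isPowerOfTwo = size > 0 && (size & (size - 1)) == 0;

        return !RuntimeHelpers.IsReferenceOrContainsReferences<T>()
            && isPowerOfTwo
            && size <= Vector<byte>.Count; // vector width in bytes
    }
}
```

Under this rule 16-byte types such as Guid and decimal qualify, a 12-byte struct does not, and reference types such as string are excluded by the first condition.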

No major changes to the overall design behind the logic.

@GrabYourPitchforks
Copy link
Member Author

Notable change in latest commit:

Mono interpreter's implementation of initblk performs a null check on the incoming address before delegating to memset. Restoring the "don't call initblk for empty spans" logic seemed like a more surgical solution than trying to change the interpreter's initblk logic, which might have unintentional ripple effects. I'm only restoring this logic for mono, as coreclr's initblk logic handles the zero-length case just fine, and a zero-length input should be very rare.
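
The shape of that guard can be sketched as follows for a byte span. This is an illustration rather than the merged code: `ByteFillSketch` and `FillBytes` are hypothetical names, and the `MONO` compilation symbol is assumed to be defined only when building for the Mono runtime.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class ByteFillSketch
{
    // Hypothetical helper showing a Mono-only length check around initblk.
    public static void FillBytes(Span<byte> span, byte value)
    {
#if MONO
        // Skip the call for empty spans so a null address never reaches initblk.
        if (span.Length != 0)
#endif
        {
            Unsafe.InitBlockUnaligned(
                ref MemoryMarshal.GetReference(span),
                value,
                (uint)span.Length);
        }
    }
}
```

When `MONO` is not defined, the braces form a plain block and the call is made unconditionally, which matches the statement that the zero-length case needs no special handling on coreclr.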

@GrabYourPitchforks merged commit fbd3b98 into dotnet:main Apr 17, 2021
@GrabYourPitchforks deleted the span_fill branch April 17, 2021 08:29
@GrabYourPitchforks (Member, Author)

Thanks all for the feedback! If there are any additional comments we can send a cleanup PR.

Unsafe.InitBlockUnaligned(ref Unsafe.As<T, byte>(ref _pointer.Value), Unsafe.As<T, byte>(ref value), (uint)_length);
#if MONO
// Mono runtime's implementation of initblk performs a null check on the address.
// We'll perform a length check here to avoid passing a null address in the empty span case.
Member

Is that a compatibility thing? Should we change it?

Member Author

See comment at #51365 (comment). I think I know where to make the change, but I was concerned that it would have broader implications and that I'd risk inadvertently breaking some other component. Putting the check here seemed the safest option. (This code had a check originally, so it's not a behavioral change from before this PR.)

Member

File an issue? Making the runtimes behave the same would be best if possible.

Contributor

Already filed - #51411 😄.

Member

Is that the same bug?
