Should unrolled solutions not be considered base? #693
Replies: 2 comments 7 replies
-
It's a grey area with compilers: the only way to really guarantee that compilers don't optimise code is to write it in assembly. The way I understand the rules is that we're not allowed to set multiple bits in the source code: i.e. the algorithm must be single-bit as written in the language concerned. I don't think the rules extend as far as ensuring that the optimised machine code clears only a single bit at a time in all cases. This would be quite difficult to ensure (and verify) across languages, processors, etc.
-
As Mike says, @rbergen has made it clear that as long as the source code doesn't combine the individual bit markings into a common mask, we are fine (several solutions have been rejected on exactly those grounds). None of my solutions, even the top Nim one, rely on compilers that are smart enough to combine all of the individual marking operations. I have left writing C# solutions up to you, but if you say that its JIT can combine the pseudo-random modulo marking patterns into single masks, that would be amazing! It is not surprising that compilers can combine the setting of a variable and its immediate use, as in:

```csharp
while (ptrStart <= ptrEnd - factor)
{
    ptrStart[0] |= 0x1;
    var (o0, m0) = lut[0];
    ptrStart[o0] |= m0;
    var (o1, m1) = lut[1];
    ptrStart[o1] |= m1;
    var (o2, m2) = lut[2];
    ptrStart[o2] |= m2;
    var (o3, m3) = lut[3];
    ptrStart[o3] |= m3;
    var (o4, m4) = lut[4];
    ptrStart[o4] |= m4;
    var (o5, m5) = lut[5];
    ptrStart[o5] |= m5;
    var (o6, m6) = lut[6];
    ptrStart[o6] |= m6;
    var (o7, m7) = lut[7];
    ptrStart[o7] |= m7;
    ptrStart += factor;
}
```

I would be very surprised if the C# compiler could optimize that down to just over eight CPU clock cycles per loop, as my manually generated/templated/macro-generated solutions do. As I say, if the C# compiler can do this while JIT compiling, that is completely amazing. This is not really the "dense" algorithm that my fastest solutions use: using a macro or code generation, my solutions pull in the word (preferably 64 bits at a time) to a variable, then do all the marking on that variable by immediate operations before writing it back.
What performance are you getting? I would be surprised if the current version is faster than your current "stride8-block32K", which takes about 1.4 CPU clock cycles per marking operation; if the C# compiler can take the above code and optimize it so that the average marking speed is something like even 1.25 CPU clock cycles per operation, that would be incredible. My macro-generated Nim code takes about 0.54 CPU clock cycles per operation, but that is a considerably more tuned algorithm, as explained above. All calculations are based on the Intel i7-8750H.
-
Does this mean that the code needs to clear the bits individually semantically or mechanically? If it only needs to clear the bits individually semantically, you just need to have two statements that the optimizer can easily convert into a single operation.
I've been implementing an unrolled-hybrid solution with T4 templates in C#, and it's fast: C# solution 4. When looking at the generated code, you see that it's very trivial for the compiler to convert the separate masks into a single mask and clear multiple bits with a single operation. I'm only using a byte pointer, but if I bumped it up to a long instead we would probably get even more performance.
@GordonBGood @mike-barber Is there a difference between my unrolled version and yours that makes your version clear the bits individually mechanically? If so how do you guarantee that when there are constants for the factor?
Should we even allow specialized versions of clearing the bits? Clearing the bits in a dense and sparse manner doesn't feel too egregious (obviously I'm a little biased here), but having specialized code for a single factor definitely feels sketchier.