Should unrolled solutions not be considered base? #693
Replies: 2 comments 7 replies
-
It's a grey area with compilers: the only way to really guarantee that compilers don't optimise code is to write it in assembly. The way I understand the rules is that we're not allowed to set multiple bits in the source code: i.e. the algorithm must be single-bit as written in the language concerned. I don't think the rules extend as far as ensuring that the optimised machine code clears only a single bit at a time in all cases. This would be quite difficult to ensure (and verify) across languages, processors, etc.
-
As Mike says, @rbergen has made it clear that as long as the source code doesn't combine the individual bit markings into a common mask, we are fine (several solutions have been rejected on exactly those grounds). None of my solutions, even the top Nim one, rely on compilers that are smart enough to combine all of the individual marking operations. I have left writing C# solutions up to you, but if you say that its JIT can combine the pseudo-random modulo marking patterns into single masks, that would be amazing! It is not surprising that compilers can combine the setting of a variable and its immediate use, as in:

```csharp
while (ptrStart <= ptrEnd - factor)
{
    ptrStart[0] |= 0x1;
    var (o0, m0) = lut[0];
    ptrStart[o0] |= m0;
    var (o1, m1) = lut[1];
    ptrStart[o1] |= m1;
    var (o2, m2) = lut[2];
    ptrStart[o2] |= m2;
    var (o3, m3) = lut[3];
    ptrStart[o3] |= m3;
    var (o4, m4) = lut[4];
    ptrStart[o4] |= m4;
    var (o5, m5) = lut[5];
    ptrStart[o5] |= m5;
    var (o6, m6) = lut[6];
    ptrStart[o6] |= m6;
    var (o7, m7) = lut[7];
    ptrStart[o7] |= m7;
    ptrStart += factor;
}
```

I would be very surprised if the C# compiler could optimize that down to just over eight CPU clock cycles per loop, as my manually generated/templated/macro-generated solutions do. As I say, if the C# compiler can do this while JIT compiling, that is completely amazing. This is not really the "dense" algorithm that my fastest solutions use: using a macro or code generation, my solutions pull in the word (preferably 64 bits at a time) to a variable, then do all the marking on that variable by immediate operations before writing it back.
What performance are you getting? I would be surprised if the current version is faster than your current "stride8-block32K", which takes about 1.4 CPU clock cycles per marking operation; if the C# compiler can take the above code and optimize it so that the average marking speed is something like even 1.25 CPU clock cycles per operation, that would be incredible. My macro-generated Nim code takes about 0.54 CPU clock cycles per operation, but that is a considerably more tuned algorithm, as explained above. All calculations are based on the Intel i7-8750H.
-
Does this mean that the code needs to clear the bits individually semantically or mechanically? If it only needs to clear the bits individually semantically, you just need to have two statements that the optimizer can easily convert into a single operation.
I've been implementing an unrolled-hybrid solution with T4 templates in C#, and it's fast: C# solution 4. When looking at the generated code, you see that it's very trivial for the compiler to convert the separate masks into a single mask and clear multiple bits with a single operation. I'm only using a byte pointer, but if I bumped it up to a long instead we would probably get even more performance.
@GordonBGood @mike-barber Is there a difference between my unrolled version and yours that makes your version clear the bits individually mechanically? If so how do you guarantee that when there are constants for the factor?
Should we even allow specialized versions of clearing the bits? Clearing the bits in a dense and sparse manner doesn't feel too egregious (obviously I'm a little biased here), but having specialized code for a single factor definitely feels sketchier.