Prime Counting Functions #6

GordonBGood · 2021-10-22T20:35:06Z

GordonBGood
Oct 22, 2021

Since Stack Overflow frowns on extended dialogs in comment sections, especially when they move outside the subject of the question, I'll open a discussion post in your actual project here.

Your goal of making a web page to do what the nth prime page does with a server is quite a good one. I wouldn't use JavaScript for this, though; I would likely use Fable/F# to generate the JavaScript, so I won't contribute to your code, but I can give you some advice and pointers.

I hope you realize that the page-segmented optimizations I made on the SO thread also apply here: to count the primes to 10^15, you need to sieve to 10^10, which will take about five seconds with my techniques but minutes conventionally. However, that is not your current bottleneck, which is the huge number of integer divisions in the Meissel-Lehmer implementation, at something like 50 CPU clock cycles each. This doesn't matter much with the optimizations that Lehmer used and for his smallish ranges, but it becomes a huge overhead as the ranges get larger.

In contrast, modern algorithms use a partial sieving technique that requires almost no integer divisions. Also, the Lagarias-Miller-Odlyzko (LMO) algorithm greatly reduces the number of operations, so the empirical growth in execution time comes close to the theoretical order: if it takes one second to compute to 10^12, it will take about a hundred seconds to compute to 10^15.
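As a rough check of those two figures, assuming the running time T(n) of LMO scales as n^(2/3) with the logarithmic factors ignored (an idealization, not a measurement):

```latex
\[
\frac{T(10^{15})}{T(10^{12})}
  \approx \left(\frac{10^{15}}{10^{12}}\right)^{2/3}
  = \left(10^{3}\right)^{2/3}
  = 10^{2},
\qquad
\left(10^{15}\right)^{2/3} = 10^{10}.
\]
```

The second identity is the sieving limit mentioned earlier: counting the primes to 10^15 needs a sieve to about 10^10.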

LMO is a little more complex to use than Meissel-Lehmer but not that much more, especially in its first basic form. The most modern techniques are about ten times faster than LMO, but a lot more complex and probably not worth it to you if one can get your time to count the primes to the 53-bit number range down to minutes; once you have LMO working, you can look into these as a way of reducing maximum computation time to a number of seconds.

vitaly-t · 2021-10-23T20:45:04Z

vitaly-t
Oct 23, 2021
Maintainer

@GordonBGood

Hi Gordon! Welcome to my project, I'm very happy to have you here! :)

Your goal of making a web page to do what the nth prime page does with a server

No, my goal is to offer a prime library that's easy to use in NodeJS + Web, using TypeScript. That's why I focus in my code examples more on use of RXJS, because it is the best today for handling sequences inside NodeJS and Web. Also, I want something that's actually efficient and fast, so I ended up adding quite a few functions to it at this point - see the API at the bottom of the main page.

I agree with all you say there about possible further optimizations. This library is still very young, and I keep adding those optimizations as I find them. The last things I added, though, were the functions countPrimes and countPrimesApprox, which I think are quite good, at least in their first iteration :)

Also worth noting: the library has just been renamed to the more appropriate prime-lib.

I am currently looking into a possible implementation for nthPrimeApprox. I believe that Christian Axler's work is the best in this area today.

2 replies

GordonBGood Oct 23, 2021
Author

@vitaly-t:

Hi,

Your goal of making a web page to do what the nth prime page does with a server

No, my goal is to offer a prime library that's easy to use in NodeJS + Web, using TypeScript. That's why I focus in my code examples more on use of RXJS, because it is the best today for handling sequences inside NodeJS and Web. Also, I want something that's actually efficient and fast, so I ended up adding quite a few functions to it at this point - see the API at the bottom of the main page.

Ah, but as I said, if you are using JavaScript and TypeScript, then I won't be doing much in the way of helping with coding, as I avoid those. As I said somewhere in the SO question thread, I've switched over to using functional programming paradigms whenever I can, and Fable/F# would be my choice for this. In the SO thread, as an Appendix, there is an implementation of the quite optimized page-segmented, maximally wheel factorized SoE that would be quite useful in this project. The fact that I haven't done the final tweaks to make it more efficient for sieving from 10^ up to its limit of the 63-bit number range doesn't matter if you are going to use LMO, as 10^11 is more than enough range for the sieving part.

One would have to make some modifications or additions to a function or two to allow the partial sieving that LMO needs for best speed, but that is quite minor. It would seem to me that being able to produce the prime count and the nth prime within this range in a reasonably short time could be useful for your target audience.

I agree with all you say there about possible further optimizations. This library is still very young, and I keep adding those optimizations as I find them. The last things I added, though, were the functions countPrimes and countPrimesApprox, which I think are quite good, at least in their first iteration :)

I am currently looking into a possible implementation for nthPrimeApprox. I believe that Christian Axler's work is the best in this area today.

I can't help you much there, as I deal mostly with methods of determining exact counts and values, not approximations...

GordonBGood Oct 25, 2021
Author

@vitaly-t:

Hi, I have found some time to do a basic review of your code here and observe that it's pretty simplistic as it currently stands, with the available functions just a few lines of code each. If you are serious about making this a fast library as per your stated goals, your current versions just aren't going to cut it.

Ah, but as I said, if you are using JavaScript and TypeScript, then I won't be doing much in the way of helping with coding, as I avoid those. As I said somewhere in the SO question thread, I've switched over to using functional programming paradigms whenever I can, and Fable/F# would be my choice for this. In the SO thread, as an Appendix, there is an implementation of the quite optimized page-segmented, maximally wheel factorized SoE that would be quite useful in this project. The fact that I haven't done the final tweaks to make it more efficient for sieving from 10^ up to its limit of the 63-bit number range doesn't matter if you are going to use LMO, as 10^11 is more than enough range for the sieving part.

One would have to make some modifications or additions to a function or two to allow the partial sieving that LMO needs for best speed, but that is quite minor. It would seem to me that being able to produce the prime count and the nth prime within this range in a reasonably short time could be useful for your target audience.

I said above that I don't do JavaScript/TypeScript, but that's obviously not quite true, as you found me as an author of a JavaScript answer on SO; probably what I should have said is that I do write JavaScript code as long as it doesn't take more than about 200 to 300 LOC; for example, my fully optimized answer on the SO thread where you found me is about 300 LOC. You've used an earlier base version further up that thread, but you haven't used that final version, I suppose because it isn't a direct primes generator. I guess it wouldn't take much for me to take that final answer, which has been built to run as a code snippet in SO, and strip out the HTML interface things so it becomes optionally just an "infinite" prime generator or a counter of primes to a limit; that wouldn't take very much work on my part as it is my code. I could then submit it as a PR to this library if you would like?

As to your prime counting function, the one that you found really isn't very good and doesn't really reveal the full potential of what these "combinational numeric algorithms" such as LMO can do; I have done some work on this in other languages, and think that I could easily factor the above work to be used as the required prime sieve plus add a translation of that other work on LMO into JavaScript to add the extra functions so that it could be used as a basic LMO prime counting implementation in less additional LOC than my personal JavaScript limit. The basic version is then likely to be able to determine the count of primes to the 53-bit limit in a couple of minutes on a modern desktop, even in JavaScript. I would then leave it up to you to add the tests, documentation, etc. that would bring it up to the standards of your project. I can likely get to that in a week or two if you are interested?

vitaly-t · 2021-10-25T05:31:15Z

vitaly-t
Oct 25, 2021
Maintainer

You've used an earlier base version further up that thread, but you haven't used that final version, I suppose because it isn't a direct primes generator

I did try to convert it into a generator, but got lost in the end.

I could then submit it as a PR to this library if you would like?

Absolutely, thank you!!!

As to your prime counting function, the one that you found really isn't very good and doesn't really reveal the full potential of what these "combinational numeric algorithms" such as LMO can do;

And yet it nailed the competition, with blazing-fast results (see the bottom benchmark) 😃 It was only beaten by some heavily optimized C++ code, which used 128-bit arithmetic (cheater) 😄

The basic version is then likely to be able to determine the count of primes to the 53-bit limit in a couple of minutes on a modern desktop, even in JavaScript. I would then leave it up to you to add the tests, documentation, etc. that would bring it up to the standards of your project. I can likely get to that in a week or two if you are interested?

Yes, certainly! My library includes 100% test coverage, so if you also want to tweak something and PR it - you won't break things, it's all thoroughly tested ;)

1 reply

GordonBGood Nov 1, 2021
Author

@vitaly-t:

You've used an earlier base version further up that thread, but you haven't used that final version, I suppose because it isn't a direct primes generator

I did try to convert it into a generator, but got lost in the end.

I could then submit it as a PR to this library if you would like?

Absolutely, thank you!!!

I have refactored the code for the full 53-bit range capable page-segmented maximally-wheel-factorized version from the SO thread so that it can be run as a node.js app, either as a primes generator or as a prime counting application, and added the following features:

I've made it now encode the storage of the base prime values in one byte each, with the upper two bits being the wheel index delta from the last and the least significant six bits the totative index (0 to 47); this works because the highest prime gap within the required base prime value range is 234 so a three wheel jump of 210 each is more than enough to span any gap. This reduces the memory used for base prime value storage by a factor of four from four bytes per value to only one.
I've made the sieve buffer bit plane sizes automatically adjust as required so that the sieve can be reasonably fast for quite small ranges such as ten million but is still quite efficient as the range is increased towards 1e12, which may be about the highest practical range as it takes almost ten minutes, even just counting primes. Above about 4e12, the efficiency will gradually decrease as the bit plane buffers exceed the CPU L2 cache size (for most common desktop computers), but it is still not too bad up to about 1e14.
I've added the ability to start the count or iteration at any given point below the maximum limit, which then allows one to work with small segments toward the high end of the usable range of up to the 53-bit limit (about 9e15).
I've added the ability to iterate over the primes in the given range, although this adds considerable overhead and is about six times slower than just counting the primes.
I've changed the first number represented in the sieve from 23 to 11 so that prime K-tuples are all within the same wheel factor and thus can be easily filtered for if one wanted to add the function that scanned for these.
The sieve has an overhead in setup for Look Up Tables and Wheel Pattern of about a tenth of a second, so it becomes competitive with the simple Will Ness hash table based sieve at somewhere about one to ten million, and is also competitive with simple naive one-bit-sieving array like my version you are already using at about this same point.
The program can count the primes to a billion in about a half a second, although it takes about three seconds to enumerate over the primes in this range, it counts the primes to 1e10 in about four seconds, and to 1e11 in about 50 seconds. Although it starts to lose efficiency above 1e12 (about ten minutes), I estimate that it can count the primes to 1e14 in well under a day on a reasonably up-to-date desktop computer. This is interesting as until 1985 this count was not known since the best "super computers" of the day couldn't handle that range in any kind of a reasonable time, yet here we can do it in an almost reasonable time in JavaScript on a desktop computer!
The code can be used as a library to write other functions such as finding or counting all the prime K-tuples, or finding the maximum prime gaps, etc. I considered adding a prime summing function that is faster than summing the iteration of primes over a range, but it would only work to a range of about half a billion (5e8), as beyond that the sum overflows the 53-bit accurate integer range unless one were to add a "BigInt" library or at least write a basic one (only summing and displaying are required for this). If you would like the library extended to include summing primes, I could work on it and likely produce it in a few hours to a day.
I will be using this code to do the partial sieving required for a version of a prime counting function using combinational numeric analysis techniques such as LMO, although I will likely need to add a function or two.
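The one-byte base prime encoding described in the list above can be sketched roughly as follows. This is only an illustration in C++, not the attached JavaScript: it assumes the wheel index is simply `p / 210` and the totative is `p % 210`, and the names `totatives210`, `encodeBasePrimes`, `decodeBasePrimes` and `primesFrom11` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The 48 residues modulo 210 that are coprime to 2, 3, 5 and 7 (the totatives).
std::vector<std::uint32_t> totatives210() {
    std::vector<std::uint32_t> t;
    for (std::uint32_t r = 1; r < 210; ++r)
        if (r % 2 && r % 3 && r % 5 && r % 7) t.push_back(r);
    return t;
}

// Pack ascending primes >= 11 into one byte each:
// top two bits = wheel delta from the previous prime (0..3),
// low six bits = totative index (0..47).
std::vector<std::uint8_t> encodeBasePrimes(const std::vector<std::uint32_t>& primes) {
    const std::vector<std::uint32_t> tots = totatives210();
    std::vector<int> indexOf(210, -1);
    for (std::size_t i = 0; i < tots.size(); ++i) indexOf[tots[i]] = static_cast<int>(i);

    std::vector<std::uint8_t> out;
    std::uint32_t lastWheel = 0;
    for (std::uint32_t p : primes) {
        const std::uint32_t wheel = p / 210;   // assumption: wheel k covers [210k, 210k + 209]
        const int idx = indexOf[p % 210];
        assert(idx >= 0);                                      // p must be coprime to 210
        assert(wheel >= lastWheel && wheel - lastWheel <= 3);  // delta must fit in two bits
        out.push_back(static_cast<std::uint8_t>(
            ((wheel - lastWheel) << 6) | static_cast<std::uint32_t>(idx)));
        lastWheel = wheel;
    }
    return out;
}

// Reverse the packing by accumulating the wheel deltas.
std::vector<std::uint32_t> decodeBasePrimes(const std::vector<std::uint8_t>& bytes) {
    const std::vector<std::uint32_t> tots = totatives210();
    std::vector<std::uint32_t> out;
    std::uint32_t wheel = 0;
    for (std::uint8_t b : bytes) {
        wheel += b >> 6;
        out.push_back(wheel * 210 + tots[b & 63]);
    }
    return out;
}

// Helper for trying it out only: primes in [11, limit] by trial division.
std::vector<std::uint32_t> primesFrom11(std::uint32_t limit) {
    std::vector<std::uint32_t> ps;
    for (std::uint32_t n = 11; n <= limit; n += 2) {
        bool prime = true;
        for (std::uint32_t d = 3; d * d <= n; d += 2)
            if (n % d == 0) { prime = false; break; }
        if (prime) ps.push_back(n);
    }
    return ps;
}
```

Calling `decodeBasePrimes(encodeBasePrimes(primesFrom11(100000)))` should give back the same list, at one byte per prime instead of four.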

Yes, certainly! My library includes 100% test coverage, so if you also want to tweak something and PR it - you won't break things, it's all thoroughly tested ;)

You will see that I have included some tests, commented out, at the bottom of the code; you may wish to add more and incorporate them into your library tests...

As I'm too lazy to want to fork this repo and do an official PR, the code is attached as Primes.zip...

vitaly-t · 2021-11-01T04:49:31Z

vitaly-t
Nov 1, 2021
Maintainer

The program can count the primes to a billion in about a half a second, although it takes about three seconds to enumerate over the primes in this range, it counts the primes to 1e10 in about four seconds, and to 1e11 in about 50 seconds. Although it starts to lose efficiency above 1e12 (about ten minutes)

That's much slower than my current implementation - see the bottom of the benchmarks:

1e10 => 166ms
1e11 => 1,545ms
1e12 => 16s


GordonBGood · 2021-11-01T05:53:48Z

GordonBGood
Nov 1, 2021
Author

As interspersed below:

On Mon, Nov 1, 2021, 11:49 Vitaly Tomilov wrote:

The program can count the primes to a billion in about a half a second, although it takes about three seconds to enumerate over the primes in this range, it counts the primes to 1e10 in about four seconds, and to 1e11 in about 50 seconds. Although it starts to lose efficiency above 1e12 (about ten minutes)

That's much slower than my current implementation - see the bottom of the benchmarks <https://github.com/vitaly-t/prime-lib/tree/main/benchmarks>:

1e10 => 166ms
1e11 => 1,545ms
1e12 => 16s

You realize that the bottom benchmarks aren't a sieve but a prime counting function and thus can never enumerate over the given range of primes, don't you? My posted code is about as fast as a **sieve** can run in JavaScript and as posted can be a prime generator.

My next post will be a prime counting function that will be much faster than what you have, especially as the upper limit goes up, as the time cost will only grow at less than a hundred times for every thousand times the upper limit is increased.

If you look at the code for your current basic prime counting function, you'll notice it contains a basic SoE which is used to sieve to the range of the upper limit to the two thirds power, and part of the reason it doesn't gain the performance it should at high ranges is that this basic sieve loses its efficiency very quickly with range. Thus, one needs a fast sieve in order to implement a fast prime counting function, which is why I have provided the fast sieve first.

As the code I just provided can sieve to ten billion (1e10) in just something like ten seconds, it opens the door to running the rest of the LMO prime counting function in just a little more, so being able to count the primes to 1e15 in under a minute. **Your current prime counting function takes something like eight hours to do that!**

Your current prime counting function is alright for smallish upper limits as it has low overheads, but as the upper limit grows above about 1e10, it will lose out to a more advanced technique such as I will use in the code I will send you in a few days to a week. This is analogous to the simple sieves that your library currently contains being faster for ranges up to a million or ten million due to low overhead, but then losing out for ranges above that. If you are serious about supporting non-trivial ranges, you need to move past the basic solutions.

Regards,
Gordon



GordonBGood · 2021-11-21T02:25:35Z

GordonBGood
Nov 21, 2021
Author

@vitaly-t:

The program can count the primes to a billion in about a half a second, although it takes about three seconds to enumerate over the primes in this range, it counts the primes to 1e10 in about four seconds, and to 1e11 in about 50 seconds. Although it starts to lose efficiency above 1e12 (about ten minutes).

That's much slower than my current implementation - see the bottom of the benchmarks https://github.com/vitaly-t/prime-lib/tree/main/benchmarks:

1e10 => 166ms

1e11 => 1,545ms

1e12 => 16s

Rather than comparing those results with my true wheel-factorized page-segmented bit-packed Sieve of Eratosthenes, you should be comparing them with a properly implemented prime counting function, which my sieve is not.

My answer on the same SO thread where you found that code is a properly implemented prime counting function and runs in the following times on my machine, which is quite a bit slower than yours if the above results are indicative: it took 50 seconds to count the primes to 1e12 with your prime counting function on an Intel Skylake i5-6500 running at 3.6 GHz (single-threaded boost):

range	count	time (seconds)
1e9	50847534	0.012
1e10	455052511	0.025
1e11	4118054813	0.076
1e12	37607912018	0.368
1e13	346065536839	1.134
1e14	3204941750802	4.765
1e15	29844570422669	25.574
2**53-1 (about 9e15)	252252704148404	128.322

There are several reasons my prime counting function is so much faster, as follows:

It has a proper page-segmented bit-packed Sieve of Eratosthenes as required for prime counting functions to provide the base "seed" primes. Due to it being page segmented and bit packed, the sieve requires a negligible amount of memory as compared to the naive SoE implementation in your version, and thus the prime counting function only uses memory space as O(n^(1/3)) rather than huge exponential factors of that for your version. As much of the remaining computation time is used in sieving for the base primes, a faster version of the SoE such as my wheel factorized version I sent you will likely make this at least twice as fast.
It uses the basic LMO algorithm rather than the straight Meissel algorithm as used by your version, and thus has about O(n^(2/3)(log (log n))) time complexity rather than the O(n/((log n)^3)) for Meissel. It not only reduces the number of operations, but also reduces the number of expensive integer division operations by the same ratio across the range rather than the only constant offset optimizations using caching for your version.
It reduces the time for even the remaining operations by using partial sieving and advanced buffered counting techniques, so that the time per operation is reduced as well as the total number of operations.
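To make the role of the integer divisions concrete, here is a minimal C++ sketch of Legendre's formula, the simplest ancestor of the Meissel, Lehmer and LMO methods. It is not the code from either implementation discussed here, and the names `primesUpTo`, `phi` and `countPrimesLegendre` are hypothetical; it is only practical for small limits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All primes <= limit from a plain Sieve of Eratosthenes (fine for small limits).
std::vector<std::uint64_t> primesUpTo(std::uint64_t limit) {
    std::vector<bool> composite(limit + 1, false);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t i = 2; i <= limit; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (std::uint64_t j = i * i; j <= limit; j += i) composite[j] = true;
    }
    return primes;
}

// phi(x, a): how many integers in [1, x] are divisible by none of the first a primes.
// Uses phi(x, a) = x - sum over i < a of phi(x / p[i], i), which is the usual
// recurrence phi(x, a) = phi(x, a - 1) - phi(x / p[a - 1], a - 1) unrolled.
// Every recursive call costs one integer division.
std::uint64_t phi(std::uint64_t x, std::size_t a, const std::vector<std::uint64_t>& primes) {
    std::uint64_t result = x;
    for (std::size_t i = 0; i < a && primes[i] <= x; ++i)
        result -= phi(x / primes[i], i, primes);
    return result;
}

// Legendre: pi(n) = phi(n, a) + a - 1, where a = pi(floor(sqrt(n))).
std::uint64_t countPrimesLegendre(std::uint64_t n) {
    if (n < 2) return 0;
    std::uint64_t root = 1;
    while ((root + 1) * (root + 1) <= n) ++root;  // integer square root
    const std::vector<std::uint64_t> primes = primesUpTo(root);
    const std::size_t a = primes.size();
    return phi(n, a, primes) + a - 1;
}
```

For example, `countPrimesLegendre(1000000)` gives 78498. The number of `phi` calls, and with it the number of divisions, grows quickly with n, which is the cost that the later algorithms cut down.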

I developed the code as a node.js console application and then converted it to a web page snippet to make it convenient for casual SO browsers to run the code inside the answer; it would be easy to strip the web page stuff out of the code to get back to a node.js console application version, or I can provide that from my original source.

Before doing that, I want to spend another few days trying to adapt the more sophisticated SoE to the code to see if it speeds it up by about a factor of two as I expect (and perhaps provide that new code as an alternate answer on the SO thread, as that answer is full).

That is what I've been trying to teach you: if you need to actually enumerate the found primes, you need a sieve such as the SoE version I provided previously; if you only need the total count of primes over a range, you can use a properly implemented prime counting function, which is exponential powers faster and thus usable to much larger limits. Actually, even if you need the sum of the primes, you can also use these prime counting function numerical analysis techniques to obtain that sum in times exponentially faster than doing it by sieving, but the code is even more complex to implement.

Once one has a fast prime counting function, finding the nth prime is about as fast: one estimates the point to which one has to calculate to obtain the nth prime by using an inverse prime estimation function such as the inverse logarithmic integral or the inverse Riemann counting function, computes the exact number of primes to that point, and then just counts upward or downward from there to the nth prime desired. I may look into providing that as well.
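A minimal C++ sketch of that estimate, count, then walk procedure follows. It makes several simplifying assumptions: the estimate is the simpler asymptotic n(ln n + ln ln n - 1) rather than an inverse logarithmic integral, a plain sieve stands in for the fast prime counting function, and trial division stands in for sieving the short final stretch. The names `countPrimesUpTo`, `isPrime` and `nthPrime` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Stand-in for a fast prime counting function (plain sieve, small limits only).
std::uint64_t countPrimesUpTo(std::uint64_t limit) {
    if (limit < 2) return 0;
    std::vector<bool> composite(limit + 1, false);
    std::uint64_t count = 0;
    for (std::uint64_t i = 2; i <= limit; ++i) {
        if (composite[i]) continue;
        ++count;
        for (std::uint64_t j = i * i; j <= limit; j += i) composite[j] = true;
    }
    return count;
}

// Stand-in for sieving the short stretch between the estimate and the answer.
bool isPrime(std::uint64_t x) {
    if (x < 2) return false;
    for (std::uint64_t d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

std::uint64_t nthPrime(std::uint64_t n) {
    assert(n >= 1);
    const double x = static_cast<double>(n);
    // Step 1: estimate where the nth prime lies (fixed guess for tiny n).
    std::uint64_t p = n < 6 ? 13
        : static_cast<std::uint64_t>(x * (std::log(x) + std::log(std::log(x)) - 1.0));
    // Step 2: exact count of primes up to the estimate.
    std::uint64_t count = countPrimesUpTo(p);
    // Step 3: walk down or up until exactly n primes are at or below p.
    if (count >= n) {
        while (!isPrime(p)) --p;  // p is now prime number `count`
        while (count > n) {
            do { --p; } while (!isPrime(p));
            --count;
        }
    } else {
        while (count < n) {
            do { ++p; } while (!isPrime(p));
            ++count;
        }
    }
    return p;
}
```

For example, `nthPrime(1000)` returns 7919. The better the estimate, the shorter the walk in step 3, which is why the more accurate inverse functions matter at large n.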

I think that if I get you to that point, I probably don't have more to contribute as I really don't enjoy coding in JavaScript.


ishandutta2007 · 2022-09-18T10:26:00Z

ishandutta2007
Sep 18, 2022

Hi @vitaly-t @GordonBGood, just curious if you guys understand what's going on in this implementation of Meissel-Lehmer. It's about 10x faster than more comprehensible implementations of Meissel-Lehmer like, say, this one.

12 replies

GordonBGood Sep 21, 2022
Author

My library is focused on competitive programming. ... Since your LMO code is already open source and already written in a single file without any third-party library, you might want to submit it and test where it ranks.

My JavaScript (or whatever language it is translated to or from) LMO implementation doesn't break much new ground as far as complexity or memory requirements go; it is just a fairly straightforward implementation from the paper with some programming tricks borrowed from Kim Walisch's implementation, adapted to whatever language is used (here JavaScript), along with some forms of code suited to my preference for Functional Programming style.

As to length, most of my implementations fit within the about 500 LOC that is about the maximum practical limit to be included within a StackOverflow answer without any other concerns whether this might be within a competitive programming limit...

The reason I asked about the sieve is because the 100 line sieve in my library seems to be as efficient as yours or Gordon's or Kim Walisch's large multi-file libraries. I took it from min_25's blog. Just like that "magic" Meissel-Lehmer is a very common implementation in most competitive programming libraries, so is the "min_25" sieve.

I had a quick look at your repositories and implementations and found several "min_25" sieves, I don't know to which one you refer, but all of them are short as I suppose is a requirement for your submission for competitive programming. From what I can see, all of these use a Sieve of Eratosthenes (SoE) as their root and then use some numerical analysis to determine whatever the requirement (such as counts or sums), although from what I can see none of them are strictly "sieves" in that they use polynomial forms of code just as Meissel derived algorithms do. Due to none of these being strictly "sieves", it is very possible and even likely that they are faster than my or Kim Walisch's SoE implementations as we aren't comparing like with like. It seems to me that competitive programming exercises don't concern themselves with solving "large" problems where the ranges would push execution and memory consumption to the limits, as evidenced in the "magic" prime counting function which consumes huge amounts of memory in order to get its non-recursive performance (although, as you say, that is a common trade-off). I look for solutions that optimize this trade off while not being too complex, which is why the basic LMO is interesting to me and somewhat useful for a reasonable range.

As to deficiencies in implementations of the SoE, it is quite common for users to take the basic monolithic (one huge array) and expand it to huge ranges as in the following odds-only implementation:

#include <vector>
using namespace std;

using ull = unsigned long long;
vector<unsigned char> mask = { 1, 2, 4, 8, 16, 32, 64, 128 }; // faster than bit twiddling
unsigned char *maskp = (unsigned char *)(&(mask[0])); // raw pointer into the mask table

// counts the primes up to n with a monolithic odds-only bit-packed SoE;
// bit index i represents the odd number i + i + 3
int countSoE(ull n) {
    if (n < 3) { return (n < 2) ? 0 : 1; }
    int mxbiti = (n - 3) / 2;
    int szwrds = (mxbiti + 64) / 64; // round up to nearest containing word
    vector<ull> cmpsts(szwrds);
    unsigned char * cmpstsp = (unsigned char *)(&(cmpsts[0]));
    for (int i = 0; i <= mxbiti; ++i) {
        int s = (i + i) * (i + 3) + 3; // bit index of the square of the base prime
        if (s > mxbiti) break;
        if (*(cmpstsp + (i >> 3)) & mask[i & 7]) continue; // already marked composite
        int bp = i + i + 3;
        for (; s <= mxbiti; s += bp) *(cmpstsp + (s >> 3)) |= mask[s & 7];
    }
    int cnt = 2 + mxbiti; // all represented odds plus the prime two
    ull * vp = (ull *)cmpstsp;
    for (int i = 0; i < szwrds; ++i) cnt -= __builtin_popcountll(*(vp + i));
    return cnt;
}

These couple dozen lines of code are sufficient to sieve to a billion and count the remaining primes in a few seconds, and are similar to what is used as a base for many of the implementations (if they even use odds-only sieving). The exception is the "magic" implementation, where half of the skip bit vector is unused as those bits represent even values; it implements "odds-only" by skipping past those values and then using a fairly complex indexing scheme to adjust the indexes actually used. The above code is not a good example of an efficient SoE for larger ranges such as a billion because of the large bit vector used (about 62 Megabytes for sieving to a billion, growing linearly with range), which is well beyond any CPU cache size. However, for many of these algorithms such as the "magic" one, sieving speed is not the main speed limitation, so use of such a sieve can be overlooked. Algorithms such as LMO, on the other hand, reduce the time used by everything other than sieving so that sieving time is again a factor - in my JavaScript implementation I maximize the use of the sieving by only passing through the page-segmented sieve once, using the results for both the P2 and the S2 calculations, ...

Page segmentation fixes the memory use and caching problem, as then one only has to store the base primes less than the square root of the range, with page segment sizes selected to be less than the CPU (preferably the L1) cache size, and this is sufficient to reduce the time to sieve to a billion to sub second times at the cost of some hundred or so LOC...
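As a rough illustration of page segmentation only (no wheel factorization, and one byte per odd number rather than bit packing, to keep it short), a counting sieve might look like the sketch below. The name `countPrimesSegmented` and the default page size of 16384 entries are assumptions for the example, not taken from any of the implementations discussed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count primes <= n with an odds-only, page-segmented Sieve of Eratosthenes.
// Index k stands for the odd number 2k + 3; only one small page is live at a time.
std::uint64_t countPrimesSegmented(std::uint64_t n, std::size_t pageSize = 1u << 14) {
    assert(pageSize > 0);
    if (n < 2) return 0;
    if (n < 3) return 1;

    // Base primes: the odd primes up to floor(sqrt(n)), from a small plain sieve.
    std::uint64_t root = 1;
    while ((root + 1) * (root + 1) <= n) ++root;
    std::vector<bool> composite(root + 1, false);
    std::vector<std::uint64_t> basePrimes;
    for (std::uint64_t i = 3; i <= root; i += 2) {
        if (composite[i]) continue;
        basePrimes.push_back(i);
        for (std::uint64_t j = i * i; j <= root; j += 2 * i) composite[j] = true;
    }

    const std::uint64_t maxIndex = (n - 3) / 2;
    std::vector<unsigned char> page(pageSize);
    std::uint64_t count = 1;  // the prime 2
    for (std::uint64_t low = 0; low <= maxIndex; low += pageSize) {
        const std::uint64_t high = std::min<std::uint64_t>(low + pageSize - 1, maxIndex);
        std::fill(page.begin(), page.end(), 0);
        for (std::uint64_t p : basePrimes) {
            std::uint64_t start = (p * p - 3) / 2;  // index of p * p
            if (start > high) break;                // larger primes start even later
            if (start < low) {
                // Advance to the first cull index at or after the page start.
                const std::uint64_t r = (low - start) % p;
                start = (r == 0) ? low : low + (p - r);
            }
            // A step of p in index space is a step of 2p in value space.
            for (std::uint64_t s = start; s <= high; s += p) page[s - low] = 1;
        }
        for (std::uint64_t k = low; k <= high; ++k) count += (page[k - low] == 0);
    }
    return count;
}
```

For example, `countPrimesSegmented(1000000)` gives 78498 whatever page size is chosen; memory use stays at the page plus the base primes however large n gets.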

Wheel factorization and pre-culling by the small base primes up to 19 is sufficient to make this about four times faster, but one has to be careful how this is formulated as many schemes spend about as much time with more complicated wheel indexing as is saved in reduced number of operations, which is why my JavaScript Stack Overflow answer splits the computations by residual bit planes, treating each bit plane as for the odds-only solutions...

Further gains to actual sieving (not numerical analysis techniques) must pay attention to the time spent per composite number cull by such techniques as extreme loop unrolling as used in the following example benchmark applied as to a page segment bit plane as in the first base page:

#define unroll(n) { \
    for (; cmpstsp < cmpstsplmt; cmpstsp += bp) { \
        *(cmpstsp) |=  (1 << (n & 7)); \
        *(cmpstsp + r1) |= (1 << ((n + ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r2) |= (1 << ((n + 2 * ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r3) |= (1 << ((n + 3 * ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r4) |= (1 << ((n + 4 * ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r5) |= (1 << ((n + 5 * ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r6) |= (1 << ((n + 6 * ((n >> 2) | 1)) & 7)); \
        *(cmpstsp + r7) |= (1 << ((n + 7 * ((n >> 2) | 1)) & 7)); \
    } \
    s = (cmpstsp - &cmpsts[0]) * 8 + (n & 7); \
    break; \
}
int benchmarkSoE() {
    int sz = 16384;
    int bitsz = 16384 * 8;
    vector<unsigned char> cmpsts(sz);
    unsigned char * cmpstsp = (unsigned char *)(&(cmpsts[0]));
    for (int n = 0; n < 1000; ++n) // to get it slow enough to time...
        for (int i = 0; ; ++i) {
            int s = (i + i) * (i + 3) + 3;
            if (s >= bitsz) break;
            if (*(cmpstsp + (i >> 3)) & *(maskp + (i & 7))) continue;
            int bp = i + i + 3;
            int r0 = s >> 3;
            int r1 = ((s + bp) >> 3) - r0;
            int r2 = ((s + 2 * bp) >> 3) - r0;
            int r3 = ((s + 3 * bp) >> 3) - r0;
            int r4 = ((s + 4 * bp) >> 3) - r0;
            int r5 = ((s + 5 * bp) >> 3) - r0;
            int r6 = ((s + 6 * bp) >> 3) - r0;
            int r7 = ((s + 7 * bp) >> 3) - r0;
            unsigned char * cmpstsplmt = cmpstsp + sz - r7;
            cmpstsp += r0;
            switch (((bp & 6) << 2) | (s & 7)) {
                case 0: unroll(0);
                case 1: unroll(1);
                case 2: unroll(2);
                case 3: unroll(3);
                case 4: unroll(4);
                case 5: unroll(5);
                case 6: unroll(6);
                case 7: unroll(7);
                case 8: unroll(8);
                case 9: unroll(9);
                case 10: unroll(10);
                case 11: unroll(11);
                case 12: unroll(12);
                case 13: unroll(13);
                case 14: unroll(14);
                case 15: unroll(15);
                case 16: unroll(16);
                case 17: unroll(17);
                case 18: unroll(18);
                case 19: unroll(19);
                case 20: unroll(20);
                case 21: unroll(21);
                case 22: unroll(22);
                case 23: unroll(23);
                case 24: unroll(24);
                case 25: unroll(25);
                case 26: unroll(26);
                case 27: unroll(27);
                case 28: unroll(28);
                case 29: unroll(29);
                case 30: unroll(30);
                case 31: unroll(31);
            }
            cmpstsp = &cmpsts[0];
            for (; s < bitsz; s += bp) *(cmpstsp + (s >> 3)) |= *(maskp + (s & 7));
        }
    int cnt = 1 + bitsz;
    ull * vp = (ull *)(&cmpsts[0]);
    for (int i = 0; i < sz / 8; ++i) cnt -= __builtin_popcountll(*(vp + i));
    return cnt;
}

This can cull one 16 Kilobyte cache in about 200 thousand CPU clock cycles or about one CPU clock cycle per composite cull bit (about 44 microseconds on a 4.5 GHz modern CPU sieving to 262146) and although it isn't quite this fast as the range increases when extended to a wheel-factorized page-segmented solution combining the above ideas, it can get Kim Walisch "primesieve" types of speed of about an eighth of a second to sieve to a billion single-threaded...

Further gains can be made by using dense culling for base primes less than say 128 where GCC or clang will turn the sub culling in a single word into auto-vectorized SIMD instructions, which can gain another ten percent or so to the above ideas...

However, you'll note that all of the above quoted speeds are when counting the number of primes as per the enclosed code: when using an enumeration function to actually enumerate the values of the primes, the overhead of calling a function per prime takes at least as long as even the fastest culling of the composite numbers. This doesn't matter much when the SoE is a small part of the overall time, or if the use of the determined primes can be combined in the same loop as the sieving loop, but it would be a factor for competitive programming problems which require the enumeration of the results. In my JavaScript LMO implementation, I do enumerate the primes found by the page-segmented SoE in reverse order, and I suppose that the implementation could be sped up by embedding the LMO logic in the SoE loop, but it would be messier...

Every time we add new improvements to the execution speed, we must add some LOC, such that with all of the above combined plus multi-threading, one ends up with something in the close order of 500 LOC...

ishandutta2007 Sep 21, 2022

It seems to me that competitive programming exercises don't concern themselves with solving "large" problems where the ranges would push execution and memory consumption to the limits, as evidenced in the "magic" prime counting function which consumes huge amounts of memory in order to get its non-recursive performance.

As I explained, it's a combination of all three. If a problem has a 1 sec time limit, a 500MB memory limit and a 5KB source limit and asks you to solve a problem of the order of 1e13, then that magic Lehmer suffices, or, with the time limit increased to 10 secs, even for 1e14. But if it asks you to solve a problem of say 1e15 or 1e16, it won't suffice even if you set a time limit of 100 sec, because it would run out of memory. In such a case we would need LMO or Deleglise-Rivat. So it all depends on how the problem setter defines the time, memory, and source size constraints. In that tool, as you can see, the N specified is N<=1e11 and time<=5 sec (memory and source-length are kept as default), which is why that "magic" Lehmer suffices (in fact even a simpler technique like recursive Meissel-Lehmer or Legendre suffices, but the magic one scores better). So you are right, in this area of competitive programming none of the problem setters have challenged a programmer enough to require applying state-of-the-art research papers (at least for the problems that I have come across so far). Once I have the LMO/Deleglise-Rivat consolidated C++ script ready and properly benchmarked, I can set a new problem in some contest or some online judge with a larger range.

Every time we add new improvements to the execution speed, we must add some LOC, such that with all of the above combined plus multi-threading, one ends up with something in the close order of 500 LOC...

That's the point I was trying to make earlier. Even if it's an algorithmic improvement, source-code size definitely increases. Time complexity may have decreased from the 200-year-old Legendre's to 21st century techniques like Xavier Gourdon's, but source-code size would definitely increase. In most areas of computer science there is no big mathematical breakthrough as such, just that newer researchers find more and more smaller blocks to pre-compute or decouple or store in an additional data structure and so on; all of which adds more steps and hence more LOC even though the time complexity improves with each passing generation.

500LOC is small enough to be tolerated in competitive programming. Quite a lot of the time, implementing a recent 21st century paper goes beyond 1000LOC.

vitaly-t Sep 21, 2022
Maintainer

The best-performing algos that I saw were C++ optimized for use of 64-bit computing, which you cannot do in other languages. In JavaScript, for example, 64-bit computing is low-performing, and far from being native. At the same time, my implementation here was able to outperform most of C++ implementations that were 32-bit (ones without deep optimization).

GordonBGood Sep 23, 2022
Author

@vitaly-t:

The best-performing algos that I saw were C++ optimized for use of 64-bit computing, which you cannot do in other languages. In JavaScript, for example, 64-bit computing is low-performing, and far from being native. At the same time, my implementation here was able to outperform most of C++ implementations that were 32-bit (ones without deep optimization).

Strictly speaking, one doesn't need 64-bit registers to get excellent performance for the SoE: with a page-segmented SoE, 64-bit values are only needed to keep track of the start index of each segment page (seldom) and to express the final prime outputs (not necessary for a prime count, and mostly avoidable for enumerations by computing prime deltas rather than absolute values). The reason that there were very slow C++ competitive programming implementations, whether 32-bit or 64-bit, was that they were even more naive: they didn't even optimize to use odds-only (making them about two and a half times as slow), and they made poor choices of data structures for the sieving, such as not using a bit-packed vector<bool>, or doing the bit-packing themselves poorly (done properly, it can be faster than direct use of vector<bool>)...
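To make the odds-only and bit-packing points concrete, here is a minimal sketch in C. It is not page-segmented and is not the code from the StackOverflow answer; the function name `count_primes_odds_only` is made up for this illustration. It only shows that one bit per odd number and 32-bit indices are enough for a modest range:

```c
#include <stdint.h>
#include <stdlib.h>

// Count the primes <= limit with an odds-only, bit-packed Sieve of Eratosthenes.
// Bit i of the buffer represents the odd number 2*i + 1; a set bit means composite.
// Returns 0 if the buffer cannot be allocated (illustration only, no error code).
uint32_t count_primes_odds_only(uint32_t limit) {
    if (limit < 2) return 0;
    if (limit < 3) return 1;                      // only the prime 2
    uint32_t maxidx = (limit - 1) / 2;            // index of the largest odd <= limit
    uint8_t *cmpsts = calloc((size_t)maxidx / 8 + 1, 1);
    if (cmpsts == NULL) return 0;
    for (uint64_t i = 1; ; ++i) {
        uint64_t sq = 2 * i * (i + 1);            // index of (2*i + 1) squared
        if (sq > maxidx) break;
        if (cmpsts[i >> 3] & (1u << (i & 7))) continue;   // 2*i + 1 is composite
        uint64_t bp = 2 * i + 1;                  // base prime
        for (uint64_t c = sq; c <= maxidx; c += bp)       // cull odd multiples of bp
            cmpsts[c >> 3] |= (uint8_t)(1u << (c & 7));
    }
    uint32_t count = 1;                           // account for the even prime 2
    for (uint32_t i = 1; i <= maxidx; ++i)        // index 0 is the number 1, skipped
        if (!(cmpsts[i >> 3] & (1u << (i & 7)))) ++count;
    free(cmpsts);
    return count;
}
```

A page-segmented version would run the same inner culling loop over one cache-sized buffer at a time, needing a 64-bit value only for the start of each page.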

While it is true that one can't implement my "extreme loop unrolling technique" on the 32-bit x86 architecture, it isn't because it is only 32-bit but rather for lack of general purpose registers; it works fine on 32-bit ARMv7 with its 16 general purpose registers (twice as many as 32-bit x86)...

GordonBGood Sep 25, 2022
Author

@ishandutta2007:

As you can see from this opensource public library testing tool almost all the top performers here have tested that "magic" meissel-lehmer. Since your LMO code is already opensource and already written in a single file without any third party library you might want to submit and test where it ranks. My guess is you might top the list...

I had a look at the testing link above and, as you say, lots have tested the "magic" Lehmer solution. I think that I could likely top the list using my LMO solution translated to a faster compiled language such as C++ or Rust in something under 1000 LOC, but all that work wouldn't produce an astounding difference (perhaps about 30 percent faster) because the problem is too small at "only" counting the primes to 1e11; LMO only starts to come into its own as compared to the "magic" solution at about 1e14 or 1e15. Where the translated LMO would really shine is in its use of memory: even for the small problem of counting primes to 1e11, it would consume negligible memory as compared to the "magic" algorithm's use of three Megabytes, and for a larger problem such as 1e15 it would only use a hundred Kilobytes of memory as compared to the "magic" algorithm's use of over 250 Megabytes...
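As a rough cross-check of those memory figures, here is a small C sketch that estimates the working memory of the "magic" algorithm. It assumes the array layout of the Nim listing later in this thread (one uint32 `smalls` entry, one uint32 `roughs` entry and one int64 `larges` entry per odd number up to the square root of the range, plus one sieve bit each); the C++ submissions may lay things out differently, and the helper names are made up for this illustration:

```c
#include <stdint.h>

// Integer square root by binary search, avoiding floating point rounding.
static uint64_t isqrt_u64(uint64_t n) {
    uint64_t lo = 0, hi = 4294967295ULL;          // sqrt(2^64 - 1) < 2^32
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;    // mid >= 1 here
        if (mid <= n / mid) lo = mid; else hi = mid - 1;
    }
    return lo;
}

// Estimated bytes used by smalls + roughs + larges + the bit-packed sieve,
// under the assumed layout described above; requires n >= 3.
uint64_t magic_memory_bytes(uint64_t n) {
    uint64_t rtlmt = isqrt_u64(n);                // sqrt of the range
    uint64_t entries = (rtlmt - 1) / 2 + 1;       // odd numbers 1, 3, ... up to rtlmt
    return entries * (4 + 4 + 8) + (entries + 7) / 8;
}
```

Under these assumptions the estimate comes to roughly 2.5 Megabytes for 1e11 and roughly 255 Megabytes for 1e15, which is in line with the figures quoted above; memory grows with the square root of the range.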

I considered submitting GHC Haskell solutions to a couple of problems, but performance would be quite limited because the judge server doesn't have the LLVM backend installed for the GHC compiler, which severely cripples GHC Haskell's ability to optimize code, especially as to register allocation, upon which many solutions depend. With LLVM, GHC is often able to make a decent showing for these problems, even keeping up with C++ solutions using the same algorithm for such problems as counting or enumerating. My GHC Haskell translation of the "magic" algorithm runs at 65 milliseconds, which makes it about four times slower than the fastest C++ version but, as explained, it would very likely be about twice as fast at about 32 milliseconds if the judge server made the LLVM backend available. As mentioned in the code header comments, GHC Haskell is sometimes slower because all values are "lifted"/potentially "boxed", so it might help to re-write the code using GHC Haskell "unlifted" primitives...

Having done this work, I now see why your "simple" Meissel-Lehmer version is so much slower than the "magic" algorithm, as follows:

1. The "simple" algorithm is the pure Meissel-Lehmer algorithm, so it has that algorithm's execution complexity of O(n / ((log n)^4)), although it is helped (it would otherwise be completely unacceptable) by Look Up Tables (LUT's) for the end "leaves" of the "phi" tree (sometimes called the "TinyPhi" LUT tables).
2. Once beyond the scope of the LUT's, the "simple" algorithm has a huge constant factor running cost due to many nested levels of recursive function calls that cannot be tail call optimized.
3. Further, the "magic" algorithm uses the trick of using floating point divisions rather than integer divisions for about four times greater speed at the cost of a limited accuracy/usability range; however, that limit hardly matters, as one wouldn't want to use this algorithm for the larger ranges where the accuracy would be a problem anyway, due to the huge memory consumption for those ranges.
4. Finally, while the "magic" algorithm is not LMO (with O(n^(2/3) / (log n)) execution complexity), it is better than standard Meissel-Lehmer as per the above at about O(n^(3/4)/ ((log n)^2)) execution complexity.
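The floating point division trick can be sketched in C as follows. The names `divide_fp` and `divide_fp_agrees` are made up for this illustration, and the speed gain depends on the CPU, so only the accuracy side is shown. As far as I can tell, the truncated double quotient should match integer division as long as the numerator is positive and stays below 2^53 (about 9e15), which fits the "maybe 1e16" limit noted in the Nim listing later in this thread:

```c
#include <stdint.h>

// Quotient n / d computed with a double-precision division and then truncated.
// Assumption: n and d are positive and n is below 2^53, so both convert exactly.
static inline int64_t divide_fp(int64_t n, int64_t d) {
    return (int64_t)((double)n / (double)d);
}

// Brute-force check: returns 1 if divide_fp matches integer division of n
// by every divisor from 1 to dmax, otherwise 0.
int divide_fp_agrees(int64_t n, int64_t dmax) {
    for (int64_t d = 1; d <= dmax; ++d)
        if (divide_fp(n, d) != n / d) return 0;
    return 1;
}
```

A check such as `divide_fp_agrees(100000000000LL, 200000)` exercises the kind of quotients the algorithm needs for a range of 1e11; it is a spot check under the stated assumption, not a proof for every input.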

Since a three quarters power isn't all that much different than a two thirds power, performance is reasonably close between LMO and the "magic" algorithm until the memory use for the latter gets too big to be really practical...

vitaly-t · 2022-09-19T17:41:16Z

vitaly-t
Sep 19, 2022
Maintainer

I'm not doing anything for this project presently, but I did update it just now ;)

8 replies

vitaly-t Sep 20, 2022
Maintainer

Can you test it on this tool and see where it ranks.

That list doesn't even allow JavaScript. Anyway, I did all the necessary benchmarking here.

ishandutta2007 Sep 20, 2022

You have benchmarked till 100m primes. The top algo there generates 26m primes in 266ms; your time for the first 26m primes using sieveintBoost would be 6814*26/100 ms = 1771 ms, which is about 7x slower than the top algos in that list. I should not draw conclusions like this, as we need to test on the same host and the same data. Maybe when I have some free time I will try converting yours to C++ and make a submission on your behalf and check.

vitaly-t Sep 20, 2022
Maintainer

Here's my test specifically for 26mln primes:

So it is 4 times slower than the top C++ solution there, but that's because it is C++, and mine is in JavaScript.

GordonBGood Sep 22, 2022
Author

@vitaly-t:

So it is 4 times slower than the top C++ solution there, but that's because it is C++, and mine is in JavaScript.

No, it is four times slower because of the naive "Chapter 2" one-big-array algorithm, and it will be about the same speed no matter what compiled or JIT compiled language is used, as the bottleneck for this algorithm is memory access in any of them. I tested and confirmed this on a couple of machines, with a C version only a tiny bit faster than the JavaScript version. In the competition, there are C++ submissions using this naive algorithm and they are also about this slow...

Actually, saying it is four times slower assumes that the test machine upon which the competition is compiled and run is comparable to the AMD 5900X machine you used here, which seems to have a very good memory subsystem and fast RAM in order to run this fast; on my older Intel i5-6500, with a single-threaded boosted CPU clock speed of about three quarters of yours, it runs about twice as slow, but that is with older DDR3 RAM and less cache. So if the competition test machine isn't as efficient at large array memory access as your AMD, this algorithm could be more than four times slower, although a quick scan through the submissions seems to indicate that the test machine is about as fast as yours...

In order to run faster on these machines, one needs a better algorithm that reduces the amount of work it does, which is the whole point of my "Chapter 4.a" StackOverflow answer, which reduces the number of operations by about a factor of four through wheel-factorization and pre-cull filling for small base primes up to 19. However, in order to be able to enumerate the resulting primes, the program will need to be changed from the code that just counts the number of found primes, so it will lose some time, in that enumeration will then take at least as long as the culling does. As the resulting program can take advantage of faster compiled languages, a JavaScript version will then likely be half again to twice as slow as a C++ version. The fastest of the submissions in the competition seem to use the same techniques as in my "Chapter 4.a" StackOverflow answer...

Even the current fastest C++ version in the competition doesn't use the "extreme loop unrolling" technique as I outlined with a benchmark example further up this thread, so one could likely make a submission using that to almost double the speed of culling, but as it won't affect the speed of enumeration the gain won't be a factor of two but something less; this technique can't be applied in JavaScript, which doesn't have the primitive pointer operations nor the speed using JIT compilation even if it did...

vitaly-t Sep 22, 2022
Maintainer

@GordonBGood Cheers! I'll stand corrected :)

GordonBGood · 2022-09-29T10:26:22Z

GordonBGood
Sep 29, 2022
Author

@ishandutta2007:

4. Finally, while the "magic" algorithm is not LMO (with O(n^(2/3) / (log n)) execution complexity), it is better than standard Meissel-Lehmer as per the above at about O(n^(3/4)/ ((log n)^2)) execution complexity.

I've worked some more with the "magic" algorithm and see that it is not Meissel, Meissel-Lehmer, or LMO, as otherwise it would require a calculation of the number of primes to the cube root of the range. It also isn't exactly Legendre, as the execution complexity doesn't match that of the usual implementation; rather, it is an "improved" Legendre algorithm in a similar way that LMO is an improved Meissel algorithm, differing in its greatly improved computational complexity. The nice thing about this algorithm is that it is relatively easy to implement as to required LOC, but it still uses memory at the same rate as Legendre and Meissel-Lehmer.

The way that recursive determination of the "phi" function is eliminated is by using the principle of "partial sieving". In the first sieving loop, the base primes from three up to the quad root of the range are culled. After culling for a given base prime, the loop scans across all the values that remain non-culled at that point; these are the q values, which are the remaining primes and products of primes up to the square root of the range. For each of these "k-roughs" values, the count for the range divided by these non-culled values (and by the base prime) is subtracted/excluded from the accumulated count of the included values at this point, and the result, the included minus the excluded counts, is stored in larges. In this same "partial sieving" loop, the "k-rough" values are continuously moved back as culled values are dropped; the "k-roughs" are the (potential) primes starting at some prime offset, which in this case is the next value after the current base prime (going up to the quad root of the range). Also in this first loop, the "smalls" current count of potential primes up to the given "k-rough" for the given index is adjusted according to the values culled in the current "partial sieve" loop pass.

So at the end of the first major loop: the sieve of odd values from one up to the square root of the range has been culled (although it will never be used again after the first major loop); the "smalls" array contains the counts minus one up to each of the indices for the odd numbers from 1 up to the square root of the range, as in 0 for 1, 1 for 3, 2 for 5, 3 for 7, 3 for 9 (no increase as 9 is not prime), 4 for 11, etc.; the "roughs" array contains 1, the first prime after the last prime up to the quad root of the range, and the succeeding primes up to the square root of the range; and the "larges" array contains the included/excluded counts up to this point. A new size variable contains the usable length of the "roughs" and "larges" arrays, and a prime count variable contains one less than the number of primes found up to the quad root of the range.
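The final state of the "smalls" table described above can be reproduced with a short C sketch. This builds the table directly from a completed odds-only sieve rather than by the incremental adjustments that the "magic" algorithm makes, and the names `fill_smalls` and `odd_primes_upto` are made up for this illustration:

```c
#include <stdint.h>
#include <stdlib.h>

// Fill smalls[0..maxidx] so that smalls[i] is the number of odd primes <= 2*i + 1,
// i.e. the prime count minus one (the even prime 2 is left out).
// Returns 0 on success, -1 if the scratch buffer could not be allocated.
int fill_smalls(uint32_t *smalls, uint32_t maxidx) {
    uint8_t *composite = calloc((size_t)maxidx + 1, 1);  // one byte per odd, for clarity
    if (composite == NULL) return -1;
    for (uint64_t i = 1; 2 * i * (i + 1) <= maxidx; ++i) {
        if (composite[i]) continue;
        for (uint64_t c = 2 * i * (i + 1); c <= maxidx; c += 2 * i + 1)
            composite[c] = 1;                    // cull odd multiples of 2*i + 1
    }
    uint32_t count = 0;
    smalls[0] = 0;                               // the number 1: no odd primes yet
    for (uint32_t i = 1; i <= maxidx; ++i) {
        if (!composite[i]) ++count;
        smalls[i] = count;
    }
    free(composite);
    return 0;
}

// Convenience wrapper for small demos: number of odd primes <= value (value >= 1).
// Returns 0 if memory could not be allocated.
uint32_t odd_primes_upto(uint32_t value) {
    uint32_t maxidx = (value - 1) / 2;
    uint32_t *smalls = malloc(((size_t)maxidx + 1) * sizeof *smalls);
    if (smalls == NULL) return 0;
    uint32_t result = 0;
    if (fill_smalls(smalls, maxidx) == 0) result = smalls[maxidx];
    free(smalls);
    return result;
}
```

Calling `odd_primes_upto` with 1, 3, 5, 7, 9 and 11 gives the sequence 0, 1, 2, 3, 3, 4 listed in the description above.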

At this point all of the succeeding "larges" values are subtracted from the first value and the "roughs" array never needs to be used again. The result can then be adjusted slightly for the exact number of values and primes found to get from a "phi" to a "pi". Finally, the second major loop adds the counts and subtracts the small adjustment for those combinations of the product of the "roughs" primes above the quad root and less than the square root whose product is larger than the square root of the range but with only the product of two different primes used (and can never be the same prime).

The reason that this algorithm reduces execution complexity as compared to the standard Legendre algorithm is the use of partial sieving, which finds the products of many primes in one loop pass instead of through many successive recursions as in the conventional implementation, just as LMO uses partial sieving for a similar (but greater) gain. The somewhat complex logic determining which products of (just two) primes qualify is the reason this algorithm also saves operations for loops that can't contribute to the included/excluded sums, just as a similar simple optimization can make the Legendre algorithm usable even without memoization, where it would otherwise have an exponential time complexity.

This algorithm should really be written up and published as per the LMO format, although the algorithm is considerably simpler than that of LMO; OTOH, the only use this algorithm will likely ever get is for the purposes of competitive programming where contributors don't want to submit the more complex LMO or better algorithms for quite trivial ranges such as 1e11 as used for such contributions...

5 replies

ishandutta2007 Sep 30, 2022

First of all I should apologise for not being active enough on this thread despite initiating it. It's great that you have put in so much time and effort reverse engineering it. From the names smalls and larges I had initially sensed that he might have split the same array of Legendre's approach into two and doing something similar. k-roughness thing I haven't managed to fully grasp yet.

Anyways as a next step I look to generalise it to find sum of primes or sum of k-th power of primes. I have managed to do it for Legendre's and recursive Lehmer's, now I intend to do the same for this magic one. Can you help me with where I should plug in my func() and accfunc() functions like I have done in the earlier scripts .

GordonBGood Oct 1, 2022
Author

@ishandutta2007:

First of all I should apologise for not being active enough on this thread despite initiating it. It's great that you have put in so much time and effort reverse engineering it.

You're welcome, it's been fun!

From the names smalls and larges I had initially sensed that he might have split the same array of Legendre's approach into two and doing something similar.

k-roughness thing I haven't managed to fully grasped yet.

According to the Wikipedia article: "has alternately been defined as requiring all prime factors to strictly exceed k"; In this case, the roughs array is initialized with the k-roughs for odd positive integers as the algorithm is "odds-only" and then the k in the k-roughs is increased by one prime for every loop as in 3, 5, 7, 11, ..., with the final resulting k-rough having a k of the first prime higher than the quad root of the range, meaning that for the used k-rough range up to the square root of the range, all used k-rough numbers must then be primes at the end of the first major loop, at which point none of the arrays are modified further...
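A small C sketch may help with the k-rough idea. It uses the definition in which every prime factor must be greater than or equal to k (the alternate definition quoted above uses strictly greater instead, which just shifts k by one prime), and all function names are made up for this illustration:

```c
#include <stdint.h>

// Returns 1 if n (n >= 1) has no prime factor below k, i.e. n is "k-rough"
// under the "all prime factors >= k" definition; 1 itself always qualifies.
int is_rough(uint32_t n, uint32_t k) {
    for (uint32_t d = 2; d < k && d <= n; ++d)
        if (n % d == 0) return 0;                // a divisor below k was found
    return 1;
}

// Slow primality test, good enough for a small demonstration.
int is_prime_slow(uint32_t n) {
    return n > 1 && is_rough(n, n);              // no divisor in 2 .. n - 1
}

// Number of composite k-rough numbers in 2 .. limit.
uint32_t count_composite_roughs(uint32_t limit, uint32_t k) {
    uint32_t count = 0;
    for (uint32_t n = 2; n <= limit; ++n)
        if (is_rough(n, k) && !is_prime_slow(n)) ++count;
    return count;
}
```

For a range of 10^4, the square root is 100 and the quad root is 10. With k = 7 there are still composite roughs up to 100 (49, 77 and 91), but with k = 11, the first prime above the quad root, none remain, because the smallest composite 11-rough number is 121. That is why the roughs array holds only 1 and primes at the end of the first major loop.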

Anyways as a next step I look to generalise it to find sum of primes or sum of k-th power of primes. I have managed to do it for Legendre's and recursive Lehmer's, now I intend to do the same for this magic one. Can you help me with where I should plug in my func() and accfunc() functions like I have done in the earlier scripts .

Hmm, I think I'll pass on doing this, as it seems to me that your "generalized" func() and accfunc() aren't really general in that they only work for values of k of zero through two...

I'll make a few observations, as these aren't exactly the example codes with which you opened this sub-thread:

These are not recursive algorithms such as the example Lehmer you originally posted, to which you have added Lehmer's idea of a "TinyPhi" Look Up Table although it doesn't seem to be implemented as he did for a "degree" of five.
In not using recursion but rather loops, your Legendre algorithm starts to more resemble the "magic" algorithm, lacking the optimizations for "odds-only" and "partial sieving" as well as using a prime count Look Up Table (called smalls in "magic") for quick determination of prime counts for values less than the square root of the range.
Your non-recursive Legendre implementation doesn't use sieving whatsoever, but it gets the effect of "partial-sieving" in the way it uses the values in the v and s arrays, which is how it gets the stated O(n^(3/4)) time complexity; it misses out on the extra division by (log n)^2 because it doesn't use sieving to work with only the prime values as base factors as does the "magic" algorithm.
In short, your Legendre algorithm is just a poor implementation of what the "magic" algorithm becomes.
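The linked non-recursive code is not reproduced in this thread, so the following C sketch is not that code. It is a commonly seen loop-based formulation with O(n^(3/4)) running time (often attributed to Lucy_Hedgehog on Project Euler) that keeps counts for the values up to sqrt(n) and for the quotients n / k in two tables. The names `lo`, `hi` and `count_primes_quotient_table` are made up for this illustration, and unlike the code described above, it skips non-prime values of p:

```c
#include <stdint.h>
#include <stdlib.h>

// Count the primes <= n without recursion, using two tables:
//   lo[v] tracks the value v        for 1 <= v <= r,
//   hi[k] tracks the value n / k    for 1 <= k <= r,   where r = floor(sqrt(n)).
// Each entry starts as "count of integers 2..value" and, after processing all
// primes p <= r, holds the number of primes <= value.
// Returns 0 if the tables cannot be allocated (illustration only).
uint64_t count_primes_quotient_table(uint64_t n) {
    if (n < 2) return 0;
    uint64_t r = 1;
    while ((r + 1) * (r + 1) <= n) ++r;          // integer square root
    uint64_t *lo = malloc((r + 1) * sizeof *lo);
    uint64_t *hi = malloc((r + 1) * sizeof *hi);
    if (lo == NULL || hi == NULL) { free(lo); free(hi); return 0; }
    lo[0] = 0;
    for (uint64_t v = 1; v <= r; ++v) { lo[v] = v - 1; hi[v] = n / v - 1; }
    for (uint64_t p = 2; p <= r; ++p) {
        if (lo[p] == lo[p - 1]) continue;        // count did not grow: p is composite
        uint64_t below = lo[p - 1];              // number of primes less than p
        uint64_t p2 = p * p;
        uint64_t kmax = n / p2 < r ? n / p2 : r; // only values >= p^2 change
        for (uint64_t k = 1; k <= kmax; ++k) {   // large values n / k
            uint64_t kp = k * p;                 // (n / k) / p == n / (k * p)
            uint64_t sub = kp <= r ? hi[kp] : lo[n / kp];
            hi[k] -= sub - below;                // remove numbers whose least factor is p
        }
        for (uint64_t v = r; v >= p2; --v)       // small values, descending
            lo[v] -= lo[v / p] - below;
    }
    uint64_t answer = hi[1];                     // entry for the value n itself
    free(lo);
    free(hi);
    return answer;
}
```

The tables take memory proportional to sqrt(n), the same growth rate as the "magic" algorithm, but nothing is sieved here and the tables never shrink, which matches the observation above about what the loop-based version leaves out.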

It may help if I post a fully commented listing of some sort of code or pseudo-code for the "magic" algorithm that explains what is going on, from which I think it will be obvious where you inject your func() and accfunc() to get the effect you desire; especially if I can explain k-roughs and larges in terms of your v and s arrays, it should be obvious...

ishandutta2007 Oct 1, 2022

aren't really general in that they only work for values of k of zero through two.

It's a general structure for any k. I haven't tested for higher k as it would overflow 128 bits. So for higher k I would have to change the entire code to bigint. func() = p^k and accfunc() = Faulhaber's formula for the kth power.
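For readers following along, here is a minimal C sketch of what such a pair of functions could look like. The signatures are assumptions, since the linked scripts are not shown in this thread: `func(p, k)` returns p^k and `accfunc(n, k)` returns 1^k + 2^k + ... + n^k in closed form for k from 0 to 3. It uses plain 64-bit arithmetic, so it overflows far sooner than a 128-bit version would. `prime_power_sum` is a brute-force reference that a generalized counting function should agree with:

```c
#include <stdint.h>

// func(p, k) = p^k (hypothetical signature, 64-bit only).
uint64_t func(uint64_t p, unsigned k) {
    uint64_t result = 1;
    while (k-- > 0) result *= p;
    return result;
}

// accfunc(n, k) = 1^k + 2^k + ... + n^k via Faulhaber's closed forms, k = 0..3.
// Returns 0 for unsupported k.
uint64_t accfunc(uint64_t n, unsigned k) {
    uint64_t t = n * (n + 1) / 2;                // triangular number
    switch (k) {
        case 0: return n;
        case 1: return t;
        case 2: return t * (2 * n + 1) / 3;      // n(n+1)(2n+1)/6
        case 3: return t * t;                    // (n(n+1)/2)^2
        default: return 0;
    }
}

// Brute-force reference: sum of p^k over all primes p <= limit (trial division).
uint64_t prime_power_sum(uint64_t limit, unsigned k) {
    uint64_t sum = 0;
    for (uint64_t n = 2; n <= limit; ++n) {
        int is_prime = 1;
        for (uint64_t d = 2; d * d <= n; ++d)
            if (n % d == 0) { is_prime = 0; break; }
        if (is_prime) sum += func(n, k);
    }
    return sum;
}
```

With k = 0 the reference is the ordinary prime count, with k = 1 it is the sum of the primes, and so on; the closed form stands in wherever the counting algorithm starts from "the count of integers up to some value".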

GordonBGood Oct 1, 2022
Author

@ishandutta2007:

aren't really general in that they only work for values of k of zero through two.

Its a general structure for any k. I haven't tested for higher k as it would overflow 128bit . So for higher K i would have to change the entire code to bigint. func() = p^k and accfunc() = Faulhaber's formula for kth power

Ah, okay, I see your limitation; at any rate, I think you will be able to add these functions for yourself if I explain the "magic" algorithm in terms of the current non-recursive Legendre algorithm you linked, which would likely be more useful than just telling you where they would be injected...

GordonBGood Oct 2, 2022
Author

@ishandutta2007:

I think you will be able to add these functions for yourself if I explain the "magic" algorithm in terms of the current non-recursive Legendre algorithm you linked, which would likely be for useful than just telling your where they would be injected...

I have looked at the code as per your last two links (with the generalizations), although not in great depth, as I don't take much interest in Lehmer given that there are much better prime counting algorithms available that aren't that much more complicated; but you have opened my eyes to the original work by Legendre, although I doubt he thought of the extra optimizations we are looking at now. In point of fact, your "generalized" Legendre code isn't really Legendre, as it doesn't use only primes as determined through sieving, although it resembles Legendre in using the inclusion/exclusion principle; not using only primes is why it has an O(n^(3/4)) time complexity instead of being divided by (log n)^2 as for a Legendre algorithm. The "magic" algorithm is more closely related to the Legendre algorithm in using only primes as a base, so it keeps the (log n)^2 factor, but it improves on the "regular" Legendre algorithm's O(n / (log n)^2) time complexity to reach O(n^(3/4) / (log n)^2) due to the partial sieving and "splitting".

In comparing your code to the commented Nim code that follows: the lower half of the v array corresponds to the smalls array and the top half to the roughs array, with extra space used so that conversion from values to indices is not necessary. The s array corresponds to the larges array, except that the same array contains both results of the "split", with the lower half containing the smalls counts and the upper half containing the larges counts; the geti function takes care of when to use the bottom or top half of s, and the algorithm continually moves the focus/combining upward until, after all the loops are complete, the answer is left in the topmost index of s. Since no sieving takes place, there is no need to ever change the v array after initialization, and no values are ever eliminated from the s array. There is no need for a roughs equivalent, as no sieving takes place and the p values (which are not necessarily prime) are just an iteration of the odd values starting at 3 up to the square root of the range. I don't take much interest in this algorithm either, as the jump to the "magic" algorithm is so small in extra code for quite a great improvement in execution time.

The commented Nim code is as follows; it should be easy to translate to any language, and you may see where to inject the extra functions by comparing it against your "generalized" version using the above description of the arrays:

# This file is a "magic" prime counting function for Nim...
# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let n = 100_000_000_000'i64

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

# non-"regular" Legendre prime counting function...
# unlike the "regular" recursive Legendre algorithm with O(n/((log n)^2));
# this has O(n^(3/4)/((log n)^2)) time complexity.
# this "magic" algorithm is to Legendre as LMO is to the Meissel algorithm,
# substituting partial sieving for function recursion and use of LUT's for
# counting of sub primes per partial sieving, with the difference that the
# "splitting" is at the sqrt of the range rather than the cube root.
# It is much simpler than LMO but at the cost of memory consumption at
# O(sqrt n) and time complexity as above rather than LMO with
# O(n^(1/3)) memory consumption and O(n^(2/3) log log n) time complexity.
proc countPrimes(n: int64): int64 =
  if n < 3: # can't odd sieve for value less than 3!
    return if n < 2: 0 else: 1
  else:
    proc half(n: int): int {.inline.} = (n - 1) shr 1 # convenience
    # dividing using float64 is faster than int64 for some CPU's...
    # precision limits range to maybe 1e16!
    proc divide(nm, d: int64): int {.inline.} = (nm.float64 / d.float64).int
    let rtlmt = n.float64.sqrt.int # precision limits range to maybe 1e16!
    let mxndx = (rtlmt - 1) div 2
    var smalls = # current accumulated counts of odd primes 1 to sqrt range
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized for no sieving whatsoever:
    #   0 odd primes to 1; 1 odd prime to 3, etc....
    for i in 0 .. mxndx: smalls[i] = i.uint32
    var roughs = # current k-rough numbers up to sqrt of range
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized to all odd positive numbers 1, 3, 5, ... sqrt range...
    for i in 0 .. mxndx: roughs[i] = (i + i + 1).uint32
    # array of current phi counts for above roughs...
    # these are not strictly `phi`'s since they also include the
    # count of base primes in order to match the above `smalls` definition!
    var larges = # starts as size of counts just as `roughs` so they align!
      cast[ptr[UncheckedArray[int64]]](alloc(sizeof(int64) * (mxndx + 1)))
    # initialized for current roughs after accounting for even prime of two...
    for i in 0 .. mxndx: larges[i] = ((n div (i + i + 1) - 1) div 2).int64
    # cmpsts is a bit-packed boolean area representing
    # odd composite numbers from 1 up to rtlmt used for sieving...
    # initialized as "zeros" meaning all odd positives are potentially prime
    # note that this array starts at (and keeps) 1 to match the algorithm even
    # though 1 is not a prime, as 1 is important in computation of phi...
    var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))

    # number of found base primes and current highest used rough index...
    var npc = 0; var mxri = mxndx
    for i in 1 .. mxndx: # i will never reach mxndx
      let sqri = (i + i) * (i + 1) # computation of square index!
      if sqri > mxndx: break # because of this square index limit!
      if (cmpsts[i shr 3] and masksp[i and 7]) != 0'u8: continue # if not prime
      # culling the base prime from cmpsts means it will never be found again
      cmpsts[i shr 3] = cmpsts[i shr 3] or masksp[i and 7] # cull base prime
      let bp = i + i + 1 # base prime from index!
      for c in countup(sqri, mxndx, bp): # SoE culling of all bp multiples...
        let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
      # partial sieving to current base prime is now completed!

      var ri = 0 # to keep track of current used roughs index!
      for k in 0 .. mxri: # processing over old roughs size...
        # q is not necessarily a prime but may be a
        # product of primes not yet culled by partial sieving;
        # this is what saves operations compared to "regular" Legendre:
        let q = roughs[k].int; let qi = (q - 1) shr 1 # index of q!
        # skip over values of `q` already culled in the last partial sieve:
        if (cmpsts[qi shr 3] and masksp[qi and 7]) != 0'u8: continue
        # since `q` cannot be equal to bp due to cull of bp and above skip;
        let d = bp * q # `d` is a product of some combination of odd primes!
        # the following computation is essential to the algorithm's speed:
        # the sub value of "phi" can be determined by a split, just as for LMO,
        # so that if `d` is less than the sqrt of the range, the count of odd
        # primes to `d` can be obtained from the `smalls` LUT after conversion
        # to an index, but if `d` is bigger than sqrt range, then "phi" is
        # obtained by dividing the range by `d`, which quotient must then
        # be less than the sqrt of the range and the sub "phi" can be looked up
        # directly from the `smalls` LUT after converting to an index.
        # This is subtracted from the old left "phi" value but as both include
        # the count of base primes, the count of base primes cancels out and
        # the count of base prime needs to be added back in!
        # `larges`'s are also "moved-back", according to culled rough values:
        larges[ri] = larges[k] -
                     (if d <= rtlmt: larges[smalls[d shr 1].int - npc]
                      else: smalls[half(divide(n, d.int64))].int64) + npc.int64
        # eliminate rough values that have been culled in partial sieve:
        # note that `larges` and `roughs` indices relate to each other!
        roughs[ri] = q.uint32; ri += 1 # update rough value; advance rough index

      var m = mxndx # adjust counts for the newly culled odds...
      # this is faster than recounting over the `cmpsts` array for each loop...
      for k in countdown(((rtlmt div bp) - 1) or 1, bp, 2): # k always odd!
        # `c` is correction from current count to prime count...
        # `e` is end limit index where corrections are the same...
        let c = smalls[k shr 1] - npc.uint32; let e = (k * bp) shr 1
        while m >= e: smalls[m] -= c; m -= 1 # correct over range down to `e`
      mxri = ri - 1; npc += 1 # set next loop max roughs size; adv prime count
    # now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
    # `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` the
    #    next prime above the quad root of the range;
    # `larges` is the partial prime counts for each of the `roughs` values...
    # note that `larges` values include the count of the odd base primes!!!
    # `cmpsts` are never used again!

    # the following does the top most "phi tree" calculation:
    var ans = larges[0] # the answer to here is all valid `phis`
    for i in 1 .. mxri: ans -= larges[i] # combined here by subtraction
    # compensate for the included odd base prime counts over-subtracted above:
    ans += ((mxri + 1 + 2 * (npc - 1)) * mxri div 2).int64
    # now we have calculated the count for the prime products with
    # the first prime up to the quad root of the range!

    # This loop adds the counts due to the products of the `roughs` primes,
    # of which we only use two different ones at a time, as all the
    # combinations with lower primes than the cube root of the range have
    # already been computed and included with the previous major loop...
    for j in 1 .. mxri:  # for all `roughs` (now prime) not including one:
      let p = roughs[j].int64; let m = n div p # `m` is the `p` quotient
      # so that the end limit `e` can be calculated based on `n`/(`p`^2)
      let e = smalls[half((m div p).int)].int - npc
      if e <= j: break # never use a product of `p` with `p` or less!
      for k in j + 1 .. e: # for all `roughs` greater than `p` to end limit:
        # since `p` * `roughs[k]` is always greater than sqrt range,
        # always use the second of the "splits" from the first loop;
        # these are always added, as there are always exactly two primes used,
        # and the first subtraction plus the second subtraction results in
        # an addition...
        ans += smalls[half(divide(m, roughs[k].int64))].int64
      # compensate for all the extra base prime counts just added!
      ans -= ((e - j) * (npc + j - 1)).int64

    smalls.dealloc; roughs.dealloc; larges.dealloc; cmpsts.dealloc
    return ans + 1 # include the count for the only even prime of two

let strt = getMonoTime()
let rslt = n.countPrimes
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "Found ", rslt, " primes up to ", n, " in ", elpsd, " milliseconds."

The above code has improved sieving that doesn't waste half of the sieve buffer on even numbers that are never used, and it implements bit-packed sieving directly for ease of translation to languages (including Nim and Rust) that don't have a specialized bit-packed boolean array. Since sieving is such a minor part of the overall execution time, these tweaks won't have much effect on the algorithm's performance.
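As a rough illustration of the odds-only, bit-packed layout described above, here is a minimal JavaScript sketch. It is not the sieve used in any of the implementations in this thread; the name `countPrimesOddsOnly` is made up for the example, and it assumes limits well below 2^31 so that the 32-bit shifts stay valid.

```javascript
// Count primes <= limit with an odds-only, bit-packed Sieve of Eratosthenes.
// Bit i of the buffer represents the odd number 3 + 2*i; a set bit = composite.
// Assumption: limit is an integer well below 2**31 (32-bit shifts are used).
function countPrimesOddsOnly(limit) {
  if (limit < 2) return 0;
  if (limit < 3) return 1; // only the even prime 2
  const topIndex = (limit - 3) >> 1; // index of the largest odd <= limit
  const sieve = new Uint8Array((topIndex >> 3) + 1); // eight odds per byte
  for (let i = 0; ; ++i) {
    const p = 3 + 2 * i;
    let c = (p * p - 3) >> 1; // index of p squared
    if (c > topIndex) break; // nothing left to cull
    if (sieve[i >> 3] & (1 << (i & 7))) continue; // p itself is composite
    for (; c <= topIndex; c += p) // a step of p indices is 2p in value
      sieve[c >> 3] |= 1 << (c & 7);
  }
  let count = 1; // account for the prime 2, which has no bit
  for (let i = 0; i <= topIndex; ++i)
    if ((sieve[i >> 3] & (1 << (i & 7))) === 0) ++count;
  return count;
}

console.log(countPrimesOddsOnly(1000000));
```

The buffer needs only one bit per odd number, so a limit of a million takes about 62 KB rather than the megabyte a byte-per-number array would use.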

GordonBGood · 2022-11-29T03:58:19Z

GordonBGood
Nov 29, 2022
Author

@vitaly-t, @ishandutta2007, there has been some activity on the StackOverflow thread where my JavaScript snippet is posted, identifying a bug that shows up when the counting range is about the exact cube of a prime; the following is the Node.js code that fixes this, for use here in the same way:

"use strict";

const MAXVALUE = 1e11; // 9007199254740991; // 2**53 - 1

// creates a function returning a lazily memoized value from a thunk...
function lazy(thunk) {
  let value = undefined;
  return function() {
    if (value === undefined) { value = thunk(); thunk = null; }
    return value;
  }
}

// a page-segmented odds-only bit-packed Sieve of Eratosthenes;

const PGSZBITS = 262144; // about CPU l1 cache size in bits (power of two)

const CLUT = function () { // fast "pop count" Counting Look Up Table...
  const arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0 | 0; let v = i;
    while (v > 0) { ++nmbts; v &= (v - 1) | 0; }
    arr[i] = nmbts | 0; }
  return arr;
}();

function countPageFromTo(bitstrt, bitlmt, sb) {
  const fst = bitstrt >> 5; const lst = bitlmt >> 5;
  const pg = new Uint32Array(sb.buffer);
  let v0 = (pg[fst] | ((0xFFFFFFFF >>> 0) << (bitstrt & 31))) >>> 0;
  let cnt = ((lst - fst) << 5) + CLUT[v0 & 0xFFFF]; cnt += CLUT[v0 >>> 16];
  for (let i = fst | 0; i < lst; ++i) {
    let v = pg[i] >>> 0;
    cnt -= CLUT[v & 0xFFFF]; cnt -= CLUT[v >>> 16];
  }
  let v1 = (pg[lst] | ((0xFFFFFFFE >>> 0) << (bitlmt & 31))) >>> 0;
  cnt -= CLUT[v1 & 0xFFFF]; cnt -= CLUT[v1 >>> 16]; return cnt | 0;
}

function partialSievePage(lwi, bp, sb) {
  const btsz = sb.length << 3;
  let s = Math.trunc((bp * bp - 3) / 2); // compute the start index...
  if (s >= lwi) s -= lwi; // adjust start index based on page lower limit...   
  else { // for the case where this isn't the first prime squared instance
    let r = ((lwi - s) % bp) >>> 0;
    s = (r != (0 >>> 0) ? bp - r : 0) >>> 0; }
  if (bp <= 32) {
    for (let slmt = Math.min(btsz, s + (bp << 3)); s < slmt; s += bp) {
      const shft = s & 7; const msk = ((1 >>> 0) << shft) >>> 0;
      for (let c = s >> 3, clmt = sb.length; c < clmt | 0; c += bp)
        sb[c] |= msk; } }
  else
    for (let slmt = sb.length << 3; s < slmt; s += bp)
      sb[s >> 3] |= ((1 >>> 0) << (s & 7)) >>> 0;
}

function partialSieveCountPage(lwi, bp, cntarr, sb) {
  const btsz = sb.length << 3; let cullcnt = 0;
  let s = Math.trunc((bp * bp - 3) / 2); // compute the start index...
  if (s >= lwi) // adjust start index based on page lower limit...
    s -= lwi;
  else { // for the case where this isn't the first prime squared instance
    let r = ((lwi - s) % bp) >>> 0;
    s = (r != (0 >>> 0) ? bp - r : 0) >>> 0; }
  if (bp <= 32) {
    for (let slmt = Math.min(btsz, s + (bp << 3)); s < slmt; s += bp) {
      const shft = s & 7; const msk = ((1 >>> 0) << shft) >>> 0;
      for (let c = s >>> 3, clmt = sb.length; c < clmt | 0; c += bp) {
        const isbit = ((sb[c] >>> shft) ^ 1) & 1;
        cntarr[c >> 6] -= isbit; cullcnt += isbit; sb[c] |= msk; }
    }
  }
  else
    for (let slmt = sb.length << 3; s < slmt; s += bp) {
      const sba = s >>> 3; const shft = s & 7;
      const isbit = ((sb[sba] >>> shft) ^ 1) & 1;
      cntarr[s >> 9] -= isbit; cullcnt += isbit;
      sb[sba] |= ((1 >>> 0) << shft) >>> 0; }
  return cullcnt;
}

// pre-culled pattern of small wheel primes...
const WHLPRMS = [ 2, 3, 5, 7, 11, 13, 17 ];
const WHLPTRNLEN = WHLPRMS.reduce((s, v) => s * v, 1) >>> 1; // odds only!
const WHLPTRN = function() { // larger than WHLPTRNLEN by one page buffer for overflow
  const len = (WHLPTRNLEN + (PGSZBITS >>> 3) + 3) & (-4); // round up to whole 32-bit words!
  const arr = new Uint8Array(len);
  for (let bp of WHLPRMS.slice(1)) partialSievePage(0, bp, arr);
  arr[0] |= ~(-2 << ((WHLPRMS[WHLPRMS.length - 1] - 3) >> 1)) >>> 0; return arr;
}();

function fillPage(lwi, sb) {
  const mod = (lwi / 8) % WHLPTRNLEN;
  sb.set(new Uint8Array(WHLPTRN.buffer, mod, sb.length));
}

function cullPage(lwi, bpras, sb) {
  const btsz = sb.length << 3; let bp = 3;
  const nxti = lwi + btsz; // just beyond the current page 
  for (let bpra of bpras()) {
    for (let bpri = 0; bpri < bpra.length; ++bpri) {
      const bpr = bpra[bpri]; bp += bpr + bpr;
      let s = (bp * bp - 3) / 2; // compute start index of prime squared
      if (s >= nxti) return; // enough bp's
      partialSievePage(lwi, bp, sb);
    }
  }
}

function soePages(bitsz, bpras) {
  const buf =  new Uint8Array(bitsz >> 3); let lowi = 0;
  const gen = bpras === undefined ? makeBasePrimeRepArrs() : bpras;
  return function*() {
    while (true) {
      fillPage(lowi, buf); cullPage(lowi, gen, buf);
      yield { lwi: lowi, sb: buf }; lowi += bitsz; }
  };
}

function makeBasePrimeRepArrs() {
  const buf = new Uint8Array(128); let gen = undefined; // avoid data race!
  fillPage(0, buf);
  for (let i = 8, bp = 19, sqr = bp * bp; sqr < 2048+3;
                                          ++i, bp += 2, sqr = bp * bp)
    if (((buf[i >> 3] >>> 0) & ((1 << (i & 7)) >>> 0)) === 0 >>> 0)
      for (let c = (sqr - 3) >> 1; c < 1024; c += bp)
        buf[c >> 3] |= (1 << (c & 7)) >>> 0; // init zeroth buf
  function sb2bprs(sb) {
    const btsz = sb.length << 3; let oi = 0;
    const arr = new Uint8Array(countPageFromTo(0, btsz - 1, sb));
    for (let i = 0, j = 0; i < btsz; ++i)
      if (((sb[i >> 3] >>> 0) & ((1 << (i & 7)) >>> 0)) === 0 >>> 0) {
        arr[j++] = (i - oi) >>> 0; oi = i; }
    return { bpra: arr, lastgap: (btsz - oi) | 0 };
  }
  let { bpra, lastgap } = sb2bprs(buf);
  function next() {
    const nxtpg = sb2bprs(gen.next().value.sb);
    nxtpg.bpra[0] += lastgap; lastgap = nxtpg.lastgap;
    return { head: nxtpg.bpra, tail: lazy(next) };
  }
  const lazylist = { head: bpra, tail: lazy(function() {
    if (gen === undefined) {
      gen = soePages(1024)(); gen.next() } // past first page
    return next();
  }) };
  return function*() { // return a generator of rep pages...
    let ll = lazylist; while (true) {  yield ll.head; ll = ll.tail(); }
  };
}

function *revPrimesFrom(top, bpras) {
  const topndx = (top - 3) >>> 1;
  const buf = new Uint8Array(PGSZBITS >>> 3);
  let lwi = (((topndx / PGSZBITS) >>> 0) * PGSZBITS) >>> 0;
  let si = (topndx - lwi) >>> 0;
  for (; lwi >= 0; lwi -= PGSZBITS) { // usually external limit!
    const base = 3 + lwi + lwi;
    fillPage(lwi, buf); cullPage(lwi, bpras, buf);
    for (; si >= 0 >>> 0; --si)
      if (((buf[si >> 3] >>> 0) & ((1 << (si & 7)) >>> 0)) === (0 >>> 0))
        yield base + si + si;
    si = PGSZBITS - 1;
  }
};

const TinyPrimes = [ 2, 3, 5, 7, 11, 13, 17, 19 ]; // degree eight
const TinyPhiDegree = TinyPrimes.length;
const TinyProduct = TinyPrimes.reduce((s, v) => s * v) >>> 0;
const TinyHalfProduct = TinyProduct >>> 1;
const TinyTotient = TinyPrimes.reduce((s, v) => s * (v - 1), 1) >>> 0;
const TinyLength = (TinyProduct + 8) >>> 2; // include zero and half point!
const TinyTotients = function() {
  const arr = new Uint32Array(TinyLength | 0);
  arr[TinyLength - 1] = 1; // mark mid point value as not prime - never is
  let spn = 3 * 5 * 7; arr[0] = 1; // mark zeroth value as not prime!
  for (let bp of [ 3, 5, 7 ]) // cull small base prime values...
    for (let c = (bp + 1) >>> 1; c <= spn; c += bp) arr[c] |= 1;
  for (let bp of [ 11, 13, 17, 19 ]) {
    for (let i = 1 + spn; i < TinyLength; i += spn) {
      const rng = i + spn > TinyLength ? spn >> 1 : spn;
      arr.set(new  Uint32Array(arr.buffer, 4, rng), i); }
    spn *= bp;
    for (let c = (bp + 1) >>> 1; // eliminate prime in pattern!
           c < (spn > TinyLength ? TinyLength : spn + 1); c += bp)
      arr[c] |= 1;
  }
  arr.reduce((s, v, i) => { // accumulate sums...
    const ns = s + (v ^ 1); arr[i] = ns; return ns; }, 0);
  return arr;
}();  

function tinyPhi(m) {
  const d = Math.trunc(m / TinyProduct);
  const ti = (m - d * TinyProduct + 1) >>> 1;
  const t = ti < TinyLength
              ? TinyTotients[ti]
              : TinyTotient - TinyTotients[TinyHalfProduct - ti];
  return d * TinyTotient + t;
}

function *primeCountTo(limit) {
//  if (limit > MAXVALUE) {
//    console.error("Maximum integer size of", MAXVALUE, "exceeded!!!"); return; }
  if (limit <= WHLPRMS[WHLPRMS.length - 1]) {
    let cnt = 0; for (let p of WHLPRMS) { if (p > limit) break; else ++cnt; }
    return cnt; }

  const bpras = makeBasePrimeRepArrs();
  if (limit < 1024**2 + 3) { // for limit < about a million, just sieve...
    let p = 3; let cnt = WHLPRMS.length;
    for (let bpra of bpras())
      for (let bpr of bpra) { // just count base prime values to limit
        p += bpr + bpr; if (p > limit) return cnt; ++cnt; }
  }

  if (limit <= 32 * 2 * PGSZBITS + 3) { // count sieve to about 32 million...
    const lmti = (limit - 3) / 2;
    let cnt = WHLPRMS.length; // just use page counting to limit as per usual...
    for (let pg of soePages(PGSZBITS, bpras)()) {
      const nxti = pg.lwi + (pg.sb.length << 3);
      if (nxti > lmti) { cnt += countPageFromTo(0, lmti - pg.lwi, pg.sb); break; }
      cnt += countPageFromTo(0, PGSZBITS - 1, pg.sb);
    }
    return cnt;
  }

  // Actual LMO prime counting code starts here...
  const sqrt = Math.trunc(Math.sqrt(limit)) >>> 0;
  const cbrt = Math.trunc(Math.cbrt(limit)) >>> 0;
  const sqrtcbrt = Math.trunc(Math.sqrt(cbrt)) >>> 0;
  const top = cbrt * cbrt - 1; //  Math.trunc(limit / cbrt) - 1; // sized for maximum required!
  const bsprms = function() {
    let bp = 3; let cnt = WHLPRMS.length + 1; for (let bpra of bpras())
      for (let bpr of bpra) {
        bp += bpr + bpr; if (bp > cbrt) return new Uint32Array(cnt); ++cnt; }
  }();
  bsprms.set(WHLPRMS, 1); // index zero not used == 0!
  const pisqrtcbrt = function() {
    let cnt = WHLPRMS.length; let i = cnt + 1; let bp = 3;
    stop: for (let bpra of bpras())
      for (let bpr of bpra) {
        bp += bpr + bpr; if (bp > cbrt) break stop;
        if (bp <= sqrtcbrt) ++cnt; bsprms[i++] = bp >>> 0; }
    return cnt;
  }();
  const pis = function() { // starts with index 0!
    const arr = new Uint32Array(cbrt + 2); let j = 0;
    for (let i = 1; i < bsprms.length; ++i)
      for (; j < bsprms[i]; ) arr[j++] = (i - 1) >>> 0;
    for (; j < arr.length; ) arr[j++] = (bsprms.length - 1) >>> 0;
    return arr;
  }();
  const phis = function() { // index below TinyPhi degree never used...
    const arr = (new Array(bsprms.length)).fill(1);
    arr[0] = 0; arr[1] = 3; arr[2] = 2; // unused
    for (let i = WHLPRMS.length + 2; i < arr.length; ++i) {
      arr[i] -= i - WHLPRMS.length - 1; } // account for non phi primes!
    return arr;
  }();
  // indexed by `m`, contains greatest prime factor and
  // Moebius value bit; Moebius one is negative...
  const specialroots = new Uint16Array(cbrt + 1); // filled in with S1 below...
  const S1 = function() { // it is very easy to calculate S1 recursively...
    let s1acc = tinyPhi(limit);
    function level(lmtlpfni, mfv, m) {
      for (let lpfni = 9; lpfni < lmtlpfni; ++lpfni) {
        const pn = bsprms[lpfni]; const nm = m * pn;
        if (nm > cbrt) { // don't split, found S2 root leaf...
          specialroots[m] = (lmtlpfni << 1) | (mfv < 0 ? 1 : 0); return; }
        else { // recurse for S1; never more than 11 levels deep...
          s1acc += mfv * tinyPhi(Math.trunc(limit / nm)); // down level...
          level(lpfni, -mfv, nm); // Moebius sign change on down level!
        } // up prime value, same level!
      }
    }
    level(bsprms.length, -1, 1); return s1acc;
  }();

  // at last, calculate the more complex parts of the final answer:
  function *complex() {
    let s2acc = 0; let p2acc = 0; let p2cnt = 0; // for "P2" calculation
    const buf = new Uint8Array(PGSZBITS >>> 3); let ttlcnt = 0;
    const cnts = new Uint8Array(PGSZBITS >>> 9);
    const cntaccs = new Uint32Array(cnts.length);
    const revgen = revPrimesFrom(sqrt, bpras);
    let p2v = Math.trunc(limit / revgen.next().value);
    const lwilmt = Math.trunc((top - 3) / 2);

    for (let lwi = 0; lwi <= lwilmt; lwi += PGSZBITS) { // for all pages
      if ((lwi & 63) == 0) yield lwi / lwilmt * 100.0;
      let pgcnt = 0; const low = 3 + lwi + lwi;
      const high = Math.min(low + (PGSZBITS << 1) - 1, top);
      let cntstrti = 0 >>> 0;
      function countTo(stop) {
        const cntwrd = stop >>> 9; const bsndx = stop & (-512);
        const xtr = countPageFromTo(bsndx, stop, buf);
        while (cntstrti < cntwrd) {
          const ncnt = cntaccs[cntstrti] + cnts[cntstrti];
          cntaccs[++cntstrti] = ncnt; }
        return cntaccs[cntwrd] + xtr;
      }
      const bpilmt = pis[Math.trunc(Math.sqrt(high)) >>> 0] >>> 0;
      const maxbpi = pis[Math.min(cbrt, Math.trunc(Math.sqrt(limit/low)))]>>>0;
      const tminbpi = pis[Math.min(Math.trunc(top / (high + 1)),
                                   bsprms[maxbpi]) >>> 0];
      const minbpi = Math.max(TinyPrimes.length, tminbpi) + 1;
      fillPage(lwi, buf); let bpi = (WHLPRMS.length + 1) >>> 0;
     
      if (minbpi <= maxbpi) { // jump to doing P2 if not range

        // for bpi < minbpi there are no special leaves...
        for (; bpi < minbpi; ++bpi) { // eliminate all Tiny Phi primes...
          const bp = bsprms[bpi]; const i = (bp - 3) >>> 1; // cull base primes!
          phis[bpi] += countPageFromTo(0, PGSZBITS - 1, buf);
          partialSievePage(lwi, bp, buf); }
        for (let i = 0; i < cnts.length; ++i) { // init cnts arr...
          const s = i << 9; const c = countPageFromTo(s, s + 511, buf);
          cnts[i] = c; pgcnt += c; }

        // for all base prime values up to limit**(1/6) in the page,
        // add all special leaves composed of this base prime value and
        // any number of other higher base primes, all different,
        // that qualify as special leaves...
        let brkchkr = false;
        for (; bpi <= Math.min(pisqrtcbrt, maxbpi) >>> 0; ++bpi) {
          const bp = bsprms[bpi];
          const minm = Math.max(Math.trunc(limit / (bp * (high + 1))),
                                Math.trunc(cbrt / bp)) >>> 0;
          const maxm = Math.min(Math.trunc(limit / (bp * low)), cbrt) >>> 0;
          if (bp >= maxm) { brkchkr = true; break; }
          for (let m = maxm; m > minm; --m) {
            const rt = specialroots[m];
            if (rt != 0 && bpi < rt >>> 1) {
              const stop = Math.trunc(limit / (bp * m) - low) >>> 1;
              const mu = ((rt & 1) << 1) - 1; // one bit means negative!
              s2acc -= mu * (phis[bpi] + countTo(stop));
            } }
          phis[bpi] += pgcnt; // update intermediate base prime counters
          pgcnt -= partialSieveCountPage(lwi, bp, cnts, buf);
          cntstrti = 0; cntaccs[0] = 0;
        }
        // for all base prime values > limit**(1/6) in the page,
        // add results of all special leaves composed using only two primes...
        if (!brkchkr)
        for (; bpi <= maxbpi; ++bpi) {
          const bp = bsprms[bpi];
          let l = pis[Math.min(Math.trunc(limit / (bp * low)), cbrt)>>>0]>>>0;
          if (bp >= bsprms[l]) break;
          const piminm = pis[Math.max(Math.trunc(limit / (bp * (high + 1))),
                                      bp) >>> 0] >>> 0;
          for (; l > piminm; --l) {          
            const stop = Math.trunc(limit / (bp * bsprms[l]) - low) >>> 1;
            s2acc += phis[bpi] + countTo(stop);
          }
          phis[bpi] += pgcnt; // update intermediate base prime counters
          if (bpi <= bpilmt) {
            pgcnt -= partialSieveCountPage(lwi, bp, cnts, buf);
            cntstrti = 0; cntaccs[0] = 0; }
        }
      }

      // complete cull page segment, then count up "P2" terms in range...
      for (; bpi <= bpilmt; ++bpi) partialSievePage(lwi, bsprms[bpi], buf);
      let ndx = 0 >>> 0;
      while (p2v >= low && p2v <= high) {
        const nndx = (p2v - low) >>> 1;
        ++p2cnt; ttlcnt += countPageFromTo(ndx, nndx, buf);
        p2acc += ttlcnt;
        ndx = (nndx + 1) >>> 0; p2v = Math.trunc(limit / revgen.next().value);
      }
      if (ndx < PGSZBITS) ttlcnt += countPageFromTo(ndx, PGSZBITS - 1, buf);
    }
    const Piy = bsprms.length - 1;
    // adjust for now known delta picbrt to pisqrt!
    p2acc -= p2cnt * ((p2cnt - 1) / 2 + (Piy - WHLPRMS.length));
//    console.log("S1: ", S1);
//    console.log("S2: ", s2acc);
//    console.log("P2: ", p2acc);
//    console.log("Piy", Piy);
//    console.log("p2cnt:  ", p2cnt);
    yield S1 + s2acc - p2acc + Piy - 1;
  }
  yield* complex();
}

const limit = 1e11; // sieve to this limit...

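function last(gen) { // final yielded value, or the return value if none yielded
  let lst; let r = gen.next();
  for (; !r.done; lst = r, r = gen.next()) ;
  return lst === undefined ? r.value : lst.value; // small limits `return` early
}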

const start = Date.now();
const answr = last(primeCountTo(limit));
const elpsd = Date.now() - start;

console.log("The number of primes to", limit, "is", answr, "in", elpsd, "milliseconds.");

Note that the code has the capability of outputting a progress indication if one cares to hook it up...
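One possible way to hook that up, as a sketch only: the generator yields percentage values while it works and yields the answer last, so a driver can report every yielded value except the final one. The names `runWithProgress` and `fakeCount` are hypothetical; `fakeCount` merely stands in for `primeCountTo` so that the snippet runs on its own.

```javascript
// Drive a generator that yields progress percentages and then the answer.
// Every yielded value except the last is passed to `onProgress`.
function runWithProgress(gen, onProgress) {
  let prev; // most recent yielded result, not yet known to be the last
  let r = gen.next();
  while (!r.done) {
    if (prev !== undefined) onProgress(prev.value); // prev was not the last
    prev = r;
    r = gen.next();
  }
  // a generator that `return`s without yielding hands back its return value
  return prev === undefined ? r.value : prev.value;
}

// Hypothetical stand-in for `primeCountTo`: three progress values, then answer.
function* fakeCount() {
  yield 0; yield 50; yield 100;
  yield 4118054813; // the known value of pi(1e11)
}

const seen = [];
const answer = runWithProgress(fakeCount(), (pct) => seen.push(pct));
console.log("progress values:", seen, "answer:", answer);
```

In a web page the callback would typically update a progress bar, and the loop would need to hand control back to the browser between steps (for example by running in a worker) so that the page can repaint.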


ishandutta2007 Nov 30, 2022

Also let me know if you know of any single-file converter for C++. I tried with gcc -e but didn't get usable code.
The reason I asked about a transpiler is that lots of such advanced AI-based tools are coming out these days. I tried TransCoder AI by Meta but that simply isn't smart enough.
Single-file conversion is a lot simpler task than transpiling. I was hoping to find a tool for it easily, but since that's not a front-line use case in the software industry, no one seems to have bothered to write one.

ishandutta2007 Dec 1, 2022

Time complexity wise, magic Legendre beats LMO as $O(n^{3/4} / (\log_2 n)^2) \lt O(n^{2/3} \log_2(\log_2 n))$ for n > 3.
So for the existing competitive programming problems we won't "blow anything away" with LMO, but we can design problems with higher constraints for sure once I understand the memory requirements.
Memory wise, magic Legendre can only go up to 10^16 with 1 GB of RAM. In competitive programming you can go up and down in setting time constraints as much as you wish, but for memory there is a hard limit of ~1 GB for most servers, and they have no provision to read/write from the file system, which number theory researchers might have.

GordonBGood Dec 1, 2022
Author

@ishandutta2007:

Also let me know if you know of any single file converter for C++. I tried with gcc -e...

I think I've mentioned before that my programming focus is not C/C++ and I generally look for a language that generates C/C++ with a better syntax (at least to me), so if I needed something C'ish, I would use Nim or the new language V (somewhat a GoLang look-alike but better in some ways in my estimation). As also mentioned, since I use some advanced features of JavaScript such as iterator/generator abstractions, which are just barely making it into the new C++ standards, you are unlikely to find an automatic tool that can directly convert those. As further mentioned, the tool will have to guess at the types to use, but if it just made everything into doubles, it wouldn't be hard to fix that to what is actually required later...

I probably wouldn't bother with such tools for only a few hundred lines of code, and would manually "write around" the iterator/generator abstractions by using a lazy list since their use is for elegance and wouldn't affect performance...

Tell me one thing, when does all these algos overflow ~1GB memory. The Magic Legendre works fine till 10^16...

Yes, you mentioned previously that the limit for most competitive programming servers is 1 GB; as you found, the "magic" Legendre implementation is within that limit to 1e16 (it takes about 800 MB to that limit so won't go much higher). The Meissel based algorithms take less memory than that being generally O(n^(1/3)) based instead of O(n^(1/2)) so any of them should be usable to the 64-bit number limit using less memory than 1 GB; total running time limits may be more of a problem for higher limits than 1e16, as even Gourdon takes about twelve minutes to count to the 64-bit number limit range, and about a minute and a half to 1e18 when single-threaded...

Time complexity wise magic Legendre beats LMO as $O(n^{3/4} / (\log_2 n)^2) \lt O(n^{2/3} \log_2(\log_2 n))$ for n>3; So for the existing competitive programming problems we wont "blow anything away" with LMO...

I think you have got your limits wrong, as the log factors change quite slowly with range as the ranges get large, meaning that the exponent factor is the controlling one and the power of 2/3 is less than the power of 3/4. This can be measured empirically with Kim Walisch's "primecount", although he doesn't have a "magic" Legendre implementation. On my machine, "magic" Legendre runs at about 20 milliseconds to 1e11 and just under a minute to just under 1e16 in C++/Nim/V; single-threaded LMO with "primecount" takes about 15 milliseconds to 1e11 and about 28 seconds to 1e16 (a growing advantage with range), and would gain a constant factor of about half again in performance if he combined the "S2" and "P2" calculations into one sieving pass as I do instead of using two sieving passes. Deleglise with "primecount" has even slightly better asymptotic complexity, taking about 14 seconds to 1e16, but isn't all that much better than straight basic LMO with the constant factor gain of combining "S2" and "P2". Gourdon is quite a bit faster at 9 milliseconds to 1e11 and about 6.5 seconds to 1e16 (again an increasing advantage with range), so asymptotic complexity does favour the Meissel based algorithms.

So a Gourdon based implementation really would "blow away" the competition using "magic" Legendre by several times, although Meissel based algorithms are highly dependent on a very fast sieving implementation as I previously noted, and using just an odds-only SoE as in my current JavaScript implementation would give up most of the advantage in constant factor losses...

I don't normally work much with C++ and am not too interested in joining competitive programming races, but am working at converting LMO to Haskell for a StackOverflow answer just out of interest's sake; I am looking into implementing the "residual bit plane" wheel-factorization SoE that I have linked previously in order to not give up these constant factor advantages, and expect that my LMO implementation in Haskell using the LLVM back-end will be about as fast as Kim Walisch's "primecount" LMO algorithm (actually a little faster because I combine the "S2" and "P2" counts). I think I could do the same for the Gourdon algorithm if I took the time. All of my implementations are single file...

ishandutta2007 Dec 1, 2022

I think you have got your limits wrong, as the log factors change quite slowly with range as the ranges get large, meaning that the exponent factor is the controlling one and the power of 2/3 is less than the power of 3/4.

I am speaking in terms of what the expression evaluates to. Maybe in practice the benchmarked time may not be proportional.
I couldn't prove it mathematically, so I evaluated both the expressions $n^{3/4} / (\log_2 n)^2$ and $n^{2/3} \log_2(\log_2 n)$ for different values of $n$ and found that $$n^{3/4} / (\log_2 n)^2 \lt n^{2/3} \log_2(\log_2 n)$$ for any $3 < n < 10^{68}$.
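The kind of numeric check described here can be sketched as follows. This only evaluates the two expressions as written, with every constant factor ignored; the names `legendreExpr`, `lmoExpr` and `crossoverExponent` are made up for the example.

```javascript
// The two growth expressions quoted above, with all constant factors ignored.
const legendreExpr = (n) => Math.pow(n, 3 / 4) / Math.log2(n) ** 2;
const lmoExpr = (n) => Math.pow(n, 2 / 3) * Math.log2(Math.log2(n));

// Scan powers of ten for the first exponent at which the 3/4-power
// expression overtakes the 2/3-power one; returns -1 if none is found.
function crossoverExponent(maxExp) {
  for (let e = 1; e <= maxExp; ++e) {
    const n = Math.pow(10, e);
    if (legendreExpr(n) > lmoExpr(n)) return e;
  }
  return -1;
}

console.log("first power of ten where the order flips: 10^" + crossoverExponent(100));
```

The scan only samples powers of ten, so it locates the crossover to within one decade, and it says nothing about actual running times since the hidden constants are not part of either expression.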

The Meissel based algorithms take less memory than that being generally O(n^(1/3)) based instead of O(n^(1/2))

Yes, many people in competitive programming do use Lehmer or Meissel-Lehmer. They beat the magic Legendre guys in memory but not in time.

So a Gourdon based implementation really would "blow away" the competition using "magic"

For more advanced algorithms I would have to worry about source code size limits as well. Magic Legendre is just 3KB, Meissel-Lehmer is about 4KB. As long as I can code Gourdon under 10KB I should be fine.

if I needed something C'ish, I would use Nim or the new language V

Tell me something, if they are C/C++'ish can they match C/C++ time as well? In competitive programming I haven't seen any other language come close to C/C++'s execution time. If they can, then there is no reason not to use them in competitive programming as well, especially for advanced algorithms like Gourdon, as for advanced algorithms we have to squeeze the source code to as small as possible and new languages are very good at that.
Is the lack of an LLVM back-end the only reason for Nim/Haskell not matching C/C++ performance?

my programming focus is not C/C++

My programming focus is not C/C++ either; my focus is "whatever it takes to get the best time and memory". I am open to writing in assembly language too if required.

GordonBGood Dec 1, 2022
Author

@ishandutta2007:

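I am speaking in terms of what the formula evaluates to. Maybe in practice the benchmarked time may not be proportional.
I couldn't prove it mathematically, so evaluated both $n^{3/4} / (\log_2 n)^2$ and $n^{2/3} \log_2(\log_2 n)$ for different values of $n$ and found that $n^{3/4} / (\log_2 n)^2 \lt n^{2/3} \log_2(\log_2 n)$ for any $3 < n < 10^{68}$.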

What you seem to be forgetting is that Big O formulas leave out all kinds of constant factors, so they only describe how performance grows with range, assuming the "operations" take a constant time each. Those "log2" terms aren't necessarily base two logarithms but just generic logarithmic terms, so they may have different constants applied, and the "operations" could differ by ratios of hundreds of times without affecting the correctness of the Big O notation. Thus Big O notation is not useful for comparing the relative performance of different algorithms; it only describes how a given algorithm's execution time changes with range, provided the assumption holds that the "operations" don't change in execution time with range. The "magic" Legendre algorithm depends on division and counting operations, which take longer than the Meissel type operations, these being very fast composite number culling operations when highly optimized as in implementations such as Kim Walisch's "primecount", so comparing these only by the Big O formula is meaningless. The Big O formulas do correctly predict that the Meissel execution time grows more slowly with range than that of the "magic" Legendre algorithm, which is all we can ask of Big O...
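To put numbers on "growth with range", here is a small sketch that takes the two expressions from the discussion at face value and computes how much each grows between 1e11 and 1e16. The helper names are invented for the example, and the base-two logarithms are kept only because the quoted formulas use them.

```javascript
// Ratio of a cost expression between two limits; any constant factor that
// multiplies the expression cancels out of this ratio.
function growth(cost, from, to) { return cost(to) / cost(from); }

const magicLegendreCost = (n) => n ** (3 / 4) / Math.log2(n) ** 2;
const lmoCost = (n) => n ** (2 / 3) * Math.log2(Math.log2(n));

const gLegendre = growth(magicLegendreCost, 1e11, 1e16);
const gLmo = growth(lmoCost, 1e11, 1e16);
console.log("predicted growth 1e11 -> 1e16:",
            gLegendre.toFixed(0), "vs", gLmo.toFixed(0));

// Scaling a cost by a constant leaves its growth ratio unchanged, which is
// why these formulas cannot rank two algorithms by absolute speed.
const scaled = growth((n) => 100 * lmoCost(n), 1e11, 1e16);
console.log("same ratio after scaling by 100:", Math.abs(scaled - gLmo) < 1e-6 * gLmo);
```

Under these formulas the 3/4-power expression grows by roughly 2,660 times over that range and the 2/3-power one by roughly 2,380 times, so the second grows more slowly even though its raw value is the larger of the two throughout the range.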

To further this thought, there was a time when mathematicians looked for the "holy grail" of O(n) (or less) for sieving algorithms and achieved it, but none of those algorithms are of practical use as compared to currently used SoE implementations with O(n log log n) complexity because, although the number of operations is linear with range, those operations take hundreds of times longer than currently implemented SoE implementation culling operations so the time saved in number of operations will never cancel out the time wasted in these complex operations for practical sieving ranges...

Yes Many people in competitive programming do use Lehmer or Meissel-Lehmer. They beat magic legendre guys in memory but not in time.

First, understand that the Meissel-Lehmer algorithm doesn't have any practical use except to Professor Lehmer in calculating the number of primes to 1e10, and was necessary (or so he thought) in order to reduce the memory consumption to the very limited amount of RAM available on the computer he was using. Second, understand that what many people call Meissel-Lehmer isn't the true Meissel-Lehmer algorithm, as that requires sieving to the n^(3/4) limit and the use of a "P3" count correction, which most of these don't have; most of these algorithms are just straight Meissel implementations. Next, understand that no "pure" Legendre, Meissel, or Meissel-Lehmer algorithm will ever be very useful for counting larger ranges such as 1e11 and higher without the "LMO" treatment, as the "magic" algorithm is the "LMO" treatment applied to the Legendre algorithm, and true "LMO" is applied to the Meissel algorithm. Finally, the later algorithms by Deleglise-Rivat and Gourdon are just tweaks to LMO that change the balance between sieving and counting while analyzing and using techniques that simplify some of the counting...
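For reference, a "pure" Legendre count with no LMO-style treatment can be sketched in a few lines. This is a textbook-style illustration and not any of the implementations discussed in this thread; the names `primesUpTo` and `legendrePi` are made up for the example, and it is only practical for small limits.

```javascript
// Simple sieve returning all primes <= limit (fine for small limits only).
function primesUpTo(limit) {
  const composite = new Uint8Array(limit + 1);
  const primes = [];
  for (let p = 2; p <= limit; ++p) {
    if (composite[p]) continue;
    primes.push(p);
    for (let c = p * p; c <= limit; c += p) composite[c] = 1;
  }
  return primes;
}

// "Pure" Legendre: pi(n) = phi(n, a) + a - 1 with a = pi(floor(sqrt(n))),
// where phi(x, a) counts the integers in 1..x not divisible by any of the
// first a primes.
function legendrePi(n) {
  if (n < 2) return 0;
  const primes = primesUpTo(Math.floor(Math.sqrt(n)));
  function phi(x, a) {
    if (a === 0) return x;
    if (x < primes[a - 1]) return x >= 1 ? 1 : 0; // only the number 1 survives
    // each split costs one integer division; these add up quickly
    return phi(x, a - 1) - phi(Math.floor(x / primes[a - 1]), a - 1);
  }
  return phi(n, primes.length) + primes.length - 1;
}

console.log(legendrePi(1000000));
```

Every split of the recursion performs an integer division, and the number of splits grows quickly with range, which is the overhead the more refined algorithms are designed to avoid.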

As long as you have an efficient wheel-factorized SoE, you can always win these competitions with maybe LMO or Deleglise, but definitely Gourdon implementations for larger counting ranges such as 1e16 and up if you can fit the code into whatever the code size limits are...

As long as I can code Gourdon under 10KB I should be fine.

If you are talking source code size, I think Gourdon can likely be coded in about 1000 LoC and definitely less than 2000 LoC, which probably is less than 10 Kilobytes of source code ASCII characters; LMO can definitely be coded in that space, even with a better wheel-factorized SoE, which will be necessary to make it worth it...

Tell me something, if they are C/C++'ish can they match C/C++ time as well?

Yes, Nim can produce code that is at least as fast as the code produced from C++; I have coded the "magic" Legendre algorithm in Nim using the GCC back-end which runs at the same speed or maybe a little faster than the C++ version from which I translated it. V is a newer language still in beta and may not optimize quite so well...

for advanced algorithms like Gourdon as for advanced algorithms we have to squeeze the source code to as small as possible and new languages are very good at that.

Yes, that is one of the advantages of Nim: modern abstractions such as recursive closures and iterators/generators are already part of the language, and it avoids quite a bit of "boilerplate" by being a whitespace-significant language, a bit similar to Python...

Is the lack of an LLVM back-end the only reason for Nim/Haskell not matching C/C++ performance?

As stated, I have already matched C++ performance for the "magic" Legendre algorithm with Nim, and the only reason I haven't submitted it is that one would have to submit a PR to add a Nim Docker image to the test server.

For GHC Haskell, adding the LLVM back-end would help immensely for my submission, approximately doubling the performance, but it would still be perhaps 25 percent slower than the C++/Nim/GCC submissions: in this case even LLVM doesn't optimize as well as GCC, as its back-end doesn't automatically vectorize using SIMD/AVX2 instructions as Nim/C++/GCC does. GHC Haskell also has some overheads of garbage collection and more boxing/unboxing, although I don't think these last are the bottleneck in this case...

ishandutta2007 · 2022-12-12T04:16:21Z

ishandutta2007
Dec 12, 2022

ChatGPT is fun; it tried to code LMO for me.
@GordonBGood, how would you rate its efforts? Your Stack Overflow answer might have been part of its training data.
I guess for it the word "efficient" doesn't mean better time complexity but simplicity to code.

#include <iostream>
#include <cmath>

using namespace std;

// Compute the number of primes less than or equal to n using the LMO algorithm
int lmo(int n) {
    int pi_sqrtn = sqrt(n);
    int pi_n = n / log(n);
    int phi_n = n - pi_n;
    int phi_n_1 = phi_n + 1;
    int phi_n_2 = phi_n + 2;
    int pi_sqrtn_1 = pi_sqrtn + 1;
    int pi_n_1 = pi_n + 1;

    int sum1 = 0;
    for (int i = 1; i <= pi_sqrtn; i++) {
        sum1 += phi_n / i - phi_n_1 / (i + 1);
    }

    int sum2 = 0;
    for (int i = 1; i <= pi_sqrtn; i++) {
       sum2 += phi_n_2 / (2 * i + 1) - phi_n / (2 * i);
    }

    int sum3 = 0;
    for (int i = 1; i <= pi_sqrtn_1; i++) {
        sum3 += pi_n / i - pi_n_1 / (i + 1);
    }

    return pi_n - pi_sqrtn + sum1 - 2 * sum2 + sum3;
}

int main() {
    int n;
    cout << "Enter a positive integer n: ";
    cin >> n;
    cout << "There are " << lmo(n) << " primes less than or equal to n." << endl;
    return 0;
}

#include <iostream>
#include <cmath>

using namespace std;

// Calculate the number of primes under n using the LMO algorithm
long long lmo_prime_count(long long n)
{
    if (n <= 1) return 0;

    // Initialize variables
    long long sqrtn = sqrt(n);
    long long count = n / 2;
    long long phi = n;

    // Loop through all odd primes
    for (long long p = 3; p <= sqrtn; p += 2)
    {
        // Check if p is prime
        if (phi % p == 0)
        {
            // Update count and phi
            count -= count / p;
            phi = phi / p * (p - 1);

            // Loop through all powers of p
            while (phi % p == 0)
            {
                phi /= p;
            }
        }
    }

    // Return the final count
    return count;
}

int main()
{
    // Calculate and print the number of primes under 10^18
    cout << lmo_prime_count(1000000000000000000) << endl;

    return 0;
}

4 replies

GordonBGood Dec 12, 2022
Author

@ishandutta2007:

Hmm, the AI knows about the basic SoE but not how to implement it efficiently, not even with page-segmentation or odds-only culling. It knows about the Legendre and Meissel-Lehmer algorithms but has their computational complexity wrong, predicting far too few operations. It knows about LMO, but again has the wrong computational complexity, and its implementations are not LMO, which is characterized by requiring a sieve to the two-thirds power of the counting range. I am not impressed so far, as the AI model upon which it builds seems to have been an idiot...
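For reference, a minimal sketch of what an odds-only, page-segmented SoE count looks like. The names `count_primes_segmented` and `page_odds` are illustrative assumptions, not taken from the Stack Overflow answers mentioned in this thread, and the sketch leaves out wheel factorization, bit-packing, and the other optimizations discussed here; it also assumes n is small enough that (sqrt(n) + 1)^2 fits in 64 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts primes <= n by sieving one fixed-size page of odd numbers at a time.
// page_odds is the number of odd candidates per page (64K one-byte flags here).
std::uint64_t count_primes_segmented(std::uint64_t n, std::size_t page_odds = 1u << 16) {
    assert(page_odds > 0);
    if (n < 2) return 0;

    // Integer square root of n, corrected for floating-point rounding.
    std::uint64_t root = static_cast<std::uint64_t>(std::sqrt(static_cast<double>(n)));
    while (root * root > n) --root;
    while ((root + 1) * (root + 1) <= n) ++root;

    // Odds-only base sieve: index i stands for the odd number 2*i + 3.
    const std::size_t base_len =
        root >= 3 ? static_cast<std::size_t>((root - 3) / 2 + 1) : 0;
    std::vector<bool> base_composite(base_len, false);
    std::vector<std::uint64_t> base_primes;
    for (std::size_t i = 0; i < base_len; ++i) {
        if (base_composite[i]) continue;
        const std::uint64_t p = 2 * i + 3;
        base_primes.push_back(p);
        for (std::uint64_t j = (p * p - 3) / 2; j < base_len; j += p)
            base_composite[static_cast<std::size_t>(j)] = true;
    }

    std::uint64_t count = 1;  // the only even prime, 2
    std::vector<std::uint8_t> page(page_odds);
    for (std::uint64_t low = 3; low <= n; low += 2 * static_cast<std::uint64_t>(page_odds)) {
        // This page holds the odd numbers low, low + 2, ..., capped at n.
        const std::size_t len = static_cast<std::size_t>(
            std::min<std::uint64_t>(page_odds, (n - low) / 2 + 1));
        const std::uint64_t high = low + 2 * (len - 1);
        std::fill(page.begin(), page.begin() + len, 1);
        for (std::uint64_t p : base_primes) {
            if (p * p > high) break;  // larger primes cannot cull in this page
            // First odd multiple of p that is >= max(p*p, low).
            std::uint64_t start = p * p;
            if (start < low) {
                start = (low + p - 1) / p * p;
                if (start % 2 == 0) start += p;
            }
            for (std::uint64_t j = (start - low) / 2; j < len; j += p)
                page[static_cast<std::size_t>(j)] = 0;
        }
        count += static_cast<std::uint64_t>(std::count(page.begin(), page.begin() + len, 1));
    }
    return count;
}
```

Only the base primes up to sqrt(n) and one page buffer are kept in memory, so memory use stays roughly constant as the range grows; the page size is a tuning parameter chosen here to fit typical CPU caches.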

ishandutta2007 Dec 12, 2022

It's like a kid who has prepared for the exam at the very last moment: he attempts all questions with utmost confidence in the hope that the examiner might award him marks without going through the small details.

GordonBGood Dec 13, 2022
Author

@ishandutta2007, yes, something like that. Practically, it is pretty much useless, at least on these subjects. One would do better just to search StackOverflow, where one should find my answers on sieving and prime counting, and start coding by translating those...

I suppose it isn't surprising that the AI doesn't do well on these subjects, as some otherwise reasonably competent programmers also don't seem to have much of a clue when it comes to prime sieving and counting, and I suppose that the AI draws at least some of its source input from such programmers...

ishandutta2007 Dec 13, 2022

I tried ChatGPT with various exam papers; it isn't performing that well on highly analytical examinations. In relatively easy analytical exams like the SAT it scores around 60%. Research-specific AI, i.e. something like Meta's Galactica, might perform better on these kinds of things. Anyway, let's spare this thread any off-topic discussions.

ishandutta2007 · 2022-12-12T04:46:48Z

ishandutta2007
Dec 12, 2022

But it shies away from coding a page-segmented sieve.

5 replies

GordonBGood Dec 12, 2022
Author

@ishandutta2007, well, it got that right: it only knows general information and can spout that when its sources are correct (which, it seems, they sometimes are not). This isn't going to be able to write LMO or even an efficient SoE for you...

ishandutta2007 Mar 30, 2023

ChatGPT is getting smarter.
Can you tell the time complexity of the following function :

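#include <cmath>
#include <vector>
using namespace std;
using ll = long long;

// Integer square root; the original snippet assumes this helper exists.
static int isqrt(ll n) {
  ll r = (ll)sqrt((double)n);
  while (r * r > n) r--;
  while ((r + 1) * (r + 1) <= n) r++;
  return (int)r;
}

ll prime_pi(const ll N) {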
  if (N <= 1)
    return 0;
  if (N == 2)
    return 1;
  const int v = isqrt(N);
  int s = (v + 1) / 2;
  vector<int> smalls(s);
  for (int i = 1; i < s; i++)
    smalls[i] = i;
  vector<int> roughs(s);
  for (int i = 0; i < s; i++)
    roughs[i] = 2 * i + 1;
  vector<ll> larges(s);
  for (int i = 0; i < s; i++)
    larges[i] = (N / (2 * i + 1) - 1) / 2;
  vector<bool> skip(v + 1);
  const auto divide = [](ll n, ll d) -> int { return (double)n / d; };
  const auto half = [](int n) -> int { return (n - 1) >> 1; };
  int pc = 0;
  for (int p = 3; p <= v; p += 2)
    if (!skip[p]) {
      int q = p * p;
      if ((ll)q * q > N)
        break;
      skip[p] = true;
      for (int i = q; i <= v; i += 2 * p)
        skip[i] = true;
      int ns = 0;
      for (int k = 0; k < s; k++) {
        int i = roughs[k];
        if (skip[i])
          continue;
        ll d = (ll)i * p;
        larges[ns] = larges[k] -
                     (d <= v ? larges[smalls[d >> 1] - pc]
                             : smalls[half(divide(N, d))]) +
                     pc;
        roughs[ns++] = i;
      }
      s = ns;
      for (int i = half(v), j = ((v / p) - 1) | 1; j >= p; j -= 2) {
        int c = smalls[j >> 1] - pc;
        for (int e = (j * p) >> 1; i >= e; i--)
          smalls[i] -= c;
      }
      pc++;
    }
  larges[0] += (ll)(s + 2 * (pc - 1)) * (s - 1) / 2;
  for (int k = 1; k < s; k++)
    larges[0] -= larges[k];
  for (int l = 1; l < s; l++) {
    ll q = roughs[l];
    ll M = N / q;
    int e = smalls[half(M / q)] - pc;
    if (e < l + 1)
      break;
    ll t = 0;
    for (int k = l + 1; k <= e; k++)
      t += smalls[half(divide(M, roughs[k]))];
    larges[0] += t - (ll)(e - l) * (pc + l - 1);
  }
  return larges[0] + 1;
}

ishandutta2007 Mar 30, 2023

Response of you.com:

GordonBGood Apr 1, 2023
Author

Hi @ishandutta2007:

ChatGPT is getting smarter. Can you tell the time complexity of the following function :

But not that smart yet, as it thought it recognized some things that aren't there, as follows:

1. It thought that the algorithm is the Meissel-Lehmer algorithm, which it is not: the full Meissel-Lehmer prime counting algorithm would need to compute the prime sieve to O(n^(3/4)), and the Meissel algorithm to O(n^(2/3)), whereas it correctly identifies that the given code segment only calculates the sieve to O(n^(1/2)) (the square root).
2. It incorrectly identifies the sieve as a segmented sieve, which it is not.
3. It can't do math, as it estimates the computational complexity as O(n^(2/3)) because "n^(1/2) times n^(1/3) is n^(2/3)", which is incorrect, as multiplication sums the powers to give n^(5/6). Even if it could do math, the answer is wrong, as the complexity computation should be based on the number of primes up to the square root, which is O(n^(1/2)/log n), and on the number of operations doing partial sieving, which it seemingly has no clue about.

The correct answer should be O(n^(3/4)/((log n)^2)), but it is nowhere close to being able to determine this: although it identifies the loops and where the main work is carried out, it doesn't have a clue as to what the main loop is actually doing...
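To spell out the arithmetic in the list above: multiplying powers of n adds the exponents, and by the prime number theorem the count of primes up to the square root of n grows like sqrt(n) divided by the logarithm of sqrt(n). As a worked check (these are standard results, not something stated in the AI answer):

```latex
n^{1/2} \cdot n^{1/3} = n^{1/2 + 1/3} = n^{5/6} \neq n^{2/3},
\qquad
\pi\!\left(\sqrt{n}\right) \sim \frac{\sqrt{n}}{\ln \sqrt{n}}
  = \frac{2\sqrt{n}}{\ln n}
  = O\!\left(\frac{n^{1/2}}{\log n}\right).
```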

GordonBGood Apr 1, 2023
Author

Hi @ishandutta2007:

Response of you.com:

This answer is somewhat better than that of ChatGPT, as follows:
1. The answer identifies that this is based on Legendre's work (as in only requiring a prime sieve to the square root of the counting range).
2. The answer is correct that it is complicated to compute the time complexity.

However, the answer is incorrect that the time is dominated by the sieve calculation of the odd primes: it doesn't see that the sieve code is a minor part of the main loop. In that first loop, which uses partial sieving, the time is dominated by updating the "larges" array and then updating the count values in the "smalls" array, and there is an almost as significant loop after all sieving has been completed that computes the remaining counts. It gives no rational explanation of where its O(n^(2/3)/log n) computational complexity comes from, and I don't suppose that it is any coincidence that this is the computational complexity of LMO with all of the optimizations given in the paper.

The answer doesn't state that the code is fully based on Legendre, but with the LMO optimizations applied plus some others, to give O(n^(3/4)/((log n)^2)) computational complexity or something close to that; I'm not positive whether the log term should be squared as given here or not...

I'm not familiar with you.com to know whether that is an AI or human answer...
