cpu cryptonight_twenty_hash #1604
base: dev
Conversation
This seems to improve CPU performance a bit. I'm surprised that the load from RAM takes so much time; I think the update makes the prefetch option more effective (at least on the CPUs that I tested). I've seen some comments on the internet indicating that prefetch is only effective if you wait 100-200 CPU cycles or so afterwards. I'd like to know if there is some asynchronous way to load a register from RAM, because I don't think the computations take much time at all. Also, you could probably add some code to "auto-discover" the optimal number of hashes based on runtime measurements for the user's CPU.
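To illustrate the prefetch-distance idea, a minimal sketch (the names and the mixing step are placeholders, not this PR's actual code):

```cpp
#include <emmintrin.h>  // _mm_prefetch / _MM_HINT_T0
#include <cstdint>
#include <cstddef>

// Illustrative only: issue the prefetch for the *next* scratchpad entry,
// then do the current round's work, so the 100-200 cycle load latency is
// hidden behind computation instead of stalling the pipeline.
void mix_round(uint64_t* scratchpad, std::size_t cur, std::size_t next)
{
    _mm_prefetch(reinterpret_cast<const char*>(&scratchpad[next]), _MM_HINT_T0);
    scratchpad[cur] ^= 0x9e3779b97f4a7c15ull;  // stand-in for the real AES/XOR work
}
```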
Agreed on the possibilities of smarter CPU autodetection, based on short successive benchmarking - more of an AI approach than fingerprinting directly from cache size and core count, and mostly not looking at the manufacturer, model, or core type at all. I had previously considered extending the newer autodetection along those lines. I had also sort of wanted to make a utility function that would read old config files and then write out the latest version with the new settings for version upgrades, and retraining is similar to that as well.
Yeah, that sounds good. What I had observed is that 20 hashes at a time was faster than 3 or 5, but the optimal number might be 15 or something like that; it's probably different for each machine. I'm not sure exactly what the trade-offs are, but I think RAM access is the issue, and I'd prefer a method that is deterministic. Perhaps that's not possible; I'd also like to try the Intel "streaming" features. One interesting thing I had thought of is that it's very unlikely that you would "need to know" the data that you store in the next loop iteration (something like 1/128000), so if the "store" is causing a delay and there were an asynchronous way to do it, it might just work 99% of the time. I haven't done the math, and you would need to double-check the hash if it exceeds the pool difficulty. I don't know if "store" is actually the problem or if the feature is even available. I could be wrong.
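For reference, by the Intel "streaming" features I mean the non-temporal store intrinsics; a rough sketch of the idea (hypothetical usage, not something the patch does):

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_stream_si128, _mm_sfence

// Sketch of the "asynchronous store" idea: a non-temporal store pushes the
// 16-byte block toward RAM without allocating a cache line, so later loop
// iterations are not stalled behind it. dst must be 16-byte aligned.
void store_block(__m128i* dst, __m128i value)
{
    _mm_stream_si128(dst, value);
    // For the rare case where the data *is* re-read right away (the ~1/128000
    // chance discussed above), an _mm_sfence() before the read would be the
    // conservative way to make the store globally visible first.
}
```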
I assume the correct way to enable this patch is 20 with prefetch not disabled (i.e. enabled). Also tested with 20 but no_prefetch true, and that gave speed identical to single-threaded/normal. The same config without the patch at 1/true gave 278 H/s; with the patch it's 261 H/s, a loss of 17 H/s or 6.11%. An Atom D2550 with Linux and gcc-7 lost half its rate, but that's not surprising: it isn't fast anyway and has no AES. That was more testing out of curiosity, and it made things worse, which is as expected.
I think the config has to be 6. It's faster on the machines I've tried.
"cpu_threads_conf" :
[2018-06-01 05:25:43] : Mining coin: monero7
Totals (ALL): 74.1 73.7 0.0 H/s
"cpu_threads_conf" :
[2018-06-01 05:29:37] : Mining coin: monero7
Totals (ALL): 67.6 67.3 0.0 H/s
Intel(R) Core(TM) i7-3537U CPU @ 2.00GHz
I've tested it on a few other Linux machines; it wasn't as good there, but it's still better in the tests I've run.
That gave 60 H/s on the little i3 under Linux, with or without the second host thread. Waaay worse. Which CPUs had you tested with success?
Are you using low_power_mode == 6? It's faster on several servers that I've tried. I think using 20 in the config will fall back to low_power_mode = 1 (which hasn't changed). My expectation was the same as yours, that it would be slower, but it's faster on my old i7 and perhaps 10% faster on some of the other processors I've tried. The only thing I can tell you is that I'm using it and my total rate is higher than with any of the other options. Something like 3100 to 3500 H/s.
One other thing to note is that I'm only using CPUs 0 and 2, not 0, 1, 2, and 3. Perhaps you can check whether using only half of the CPUs has any effect. The next best setting is 4 CPUs at power level 5 on my machine. The logs are above.
Yeah, so reducing the number of hashes to 10 and kicking it back up to 4 CPUs put it in line with 20 hashes on 2 CPUs.
[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 0.
Totals (ALL): 75.1 73.5 0.0 H/s
Yes, the retest in the previous post was with == 6.
I think I see: use fewer host threads (cores) so that the cache and prefetch stack up longer queues (to hit that hundreds-of-cycles mark), instead of widening bandwidth and/or fully utilizing the cache size (cache / 2MB for monero7). Taller work stacks via two blocks (4MB cache). Testing that... Also, after checking, it seems this wants a total number of threads (host thread cores * power threads) that is divisible by 20, and then it will use the 20-way; without the patch, those same settings would use 4 * 5-way and not stack the prefetch up as far... I think?
I'm not an expert on the compiler optimizations. The compiler I'm using is gcc (Debian 6.3.0-18+deb9u1). If that doesn't work, you may want to try 10 hashes at a time on the full number of cores; 8 or 12 might be even better. What I was thinking is that prefetch isn't helping as much as it should, and the cause may be that there aren't enough CPU cycles after the prefetch command. If you were reading from L1 exclusively (32KB, or the 256KB reachable via prefetch) I think it would be faster, but there must be some trade-off with the limited number of registers. I tried 100 hashes at a time but it was slower. Perhaps the store command is also a problem; I was trying to think whether there is a way to offload the store, because it's very unlikely you would re-read those bits right away. I also re-ordered the macros to put the prefetch as the last command (in the twenty method only). If it's not faster for you, then don't worry too much about it. Also, 10 hashes with the full number of cores may be more convenient to configure.
I see around 65 per core with four host threads, 75 per core with two. I don't really know how to check for actual usage of the 20-way at runtime.
What I would try is 10 hashes at a time. You can change twenty_work_main() to use multiway_work_main<10u>(); instead of multiway_work_main<20u>(); and comment out STEP(a10) through STEP(a19). I'll make the change on the external branch now, or you can try it yourself.
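Roughly this shape - a self-contained stand-in, not the actual minethd code (and the real STEP(a10) through STEP(a19) lines would be commented out alongside it):

```cpp
#include <cstdio>

// Hypothetical stand-in for xmr-stak's multiway_work_main<N>() template:
// N controls how many hashes are interleaved per loop iteration.
template <unsigned N>
void multiway_work_main()
{
    std::printf("running %u-way hashing\n", N);
}

void twenty_work_main()
{
    // multiway_work_main<20u>();  // original 20-way version
    multiway_work_main<10u>();     // the suggested 10-at-a-time variant
}

int main() { twenty_work_main(); }
```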
OK, I committed the changes to my local fork; if you want to try it you can clone it. It gives similar numbers with 10 hashes on 4 cores to the 20 hashes with 2 cores on this machine. I'll clean it up if that works. You may also want to try the Linux machine.
Okay so what do you get with
I updated it to use 10 hashes instead of 20 (with the hope that you can use all 4 cores). Try a new build with the update and use all 4 cores.
"cpu_threads_conf" :
Totals (ALL): 76.3 74.7 0.0 H/s
I think the base case of power level 5 was 68 H/s or something like that.
Also, I just checked in a third update, this one to use the new code on power level 5 (as well as 6). Perhaps the macro changes are the cause of the improvement I am seeing; I had originally not wanted to touch the existing code.
"cpu_threads_conf" :
With the "latest" update to power level 5: Totals (ALL): 76.3 75.9 0.0 H/s
Before the update: Totals (ALL): 63.6 67.7 0.0 H/s
So perhaps the macro changes were the cause, not the increased number of hashes. Check out the latest build and give both power level 5 and 6 a try with all 4 of the cores.
OK, so I made a fourth update, to power levels 3, 4, 5, and 10, to use the branched code.
Starting 3x thread, affinity: 0. Original - Totals (ALL): 58.9 59.5 0.0 H/s
Starting 4x thread, affinity: 0. Original - Totals (ALL): 66.8 66.9 0.0 H/s
Starting 5x thread, affinity: 0. Original - Totals (ALL): 69.9 71.6 0.0 H/s
Starting 10x thread, affinity: 0. Updated - Totals (ALL): 79.8 78.0 0.0 H/s
Please update the build and give it a try. I'm seeing it faster for power levels 3, 4, 5, and 10 now. If that's not what you see, let me know.
*** note - I changed the config to expect 10 instead of 6: "cpu_threads_conf" :
Okay, it looks like all the CPUs I was testing are Ivy Bridge variants, and none of this makes those faster at all; it hurts them badly. I got 60/75 total versus the usual 260-280 across 3/4/5/10, leaving the affinity and thread counts the same as normal/best. This time I tested on Linux with an unusual Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz, which is Broadwell-based, and it did show gains though! So maybe this just hates the Ivy and Sandy memory controller design? But I thought your CPU was an Ivy... did you try my single-thread no-prefetch config on it from a couple of posts back? I'm curious if that roasts your other rates like it does on all of mine.
This patch requires prefetch to be on; without prefetch it won't do anything. Let me know what you'd like me to test. I'd recommend trying power level 5 with prefetch on first. I'll give the single-thread runs with and without prefetch a try.
I want you to test my normal Ivy/Sandy config, which I posted above, since you seem to have an Ivy Bridge core: no low power and no prefetch. Yes, then it's not using your code, but that is how I'm getting rates that are 4x faster than with prefetch and your patch. Directed prefetch seems to work well with Broadwell cores, but however the Ivy/Sandy ones implement their internal prefetch optimization (which can't be shut off), it doesn't like being directed, and I get 4x the performance by letting it do all the prefetch management (aka "no prefetch" in xmr-stak; again, the CPU still does its own regardless of that setting).
OK, got to test this on Win7. Not sure why/how (it should be out of cache) but it works. Normal with all false/true was 86 H/s, so it gained 20 H/s. Same gain with the same config in Linux on the i3-4160 3.6GHz 3MB. It runs 6 H/s slower overall in any mode, so even with the faster CPU, Linux is slower.
Also, these are Haswell cores with 8/8/8/12-way caches.
All right, well, I don't know if that is useful to you or not. If it's something that you want, what I would do is leave the existing methods exactly as they are and add new methods for the additional options; I would add 6, 7, 8, 9, and 20 as options as well. I think the only issue with this is the added compile time. I have no idea why it works - as you said, it should be out of cache. The only thing I can think of is that with additional CPU cycles between the prefetch and the load, it is more likely that the memory is read from L1, which should be a lot faster than L3. In addition to that, I would create a wrapper script that iterates through all the possible "combinations" to find the best config, by actually running for 1 or 2 minutes on each possible config. It could be done in C or even shell, and would produce a report that lists the config possibilities in order. Hopefully it would only take a couple of hours to run.
Loaded that Xeon D-1518 Broadwell up with 10/false but used every core (0-7), which means HT cores also, and it scores 150 H/s that way. Up from the original 92 H/s - huge.
Are you comparing against the trunk? I also changed 3/4/5 with the latest check-in.
No, against
And yes, I have been checking all four permutations (3/4/5/10, false) and have all the patches.
It went from 72.4 to 86.1 H/s on this i7-3537U CPU @ 2.00GHz. The script is checked in at xmr-stak-config; you can take it and rename it if you want it. Regarding "cache starvation": I think the L1 cache is only 32KB (256KB for L2). If prefetch is working, it should never hit L3; I think the trade-off is with the number of registers for the variables, not L3. I could be wrong, but I think it should still be faster. I was able to get over 400 H/s by commenting out the _mm_load_si128 and _mm_store_si128, so memory access is probably the issue, not the encryption (unless the compiler is doing something). What I am going to try next is all the combinations of the possible macros, switching at 9 hashes at a time.
power options: 1 2 3 4 5
number of cores: 2
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 0},
Hashrate: 72.4
power options: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 103 104 105 106 107 108 109 110 111 112 113 114 115 // N > 100 use the branched code optimized for prefetch with (N - 100) hashes
number of cores: 2
{ "low_power_mode" : 109, "no_prefetch" : false, "affine_to_cpu" : 0},
Hashrate: 86.1
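For clarity, the encoding above could be read like this (hypothetical helper, not the actual checked-in code):

```cpp
// Hypothetical decoder for the extended low_power_mode values: N > 100 means
// "use the branched, prefetch-optimized code with (N - 100) hashes per thread".
inline unsigned hashes_for_mode(unsigned low_power_mode, bool& use_branched)
{
    use_branched = (low_power_mode > 100);
    return use_branched ? low_power_mode - 100 : low_power_mode;
}
```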
OK, I'm moving on to something else for the time being. If you want to use the new 10 method, I would suggest keeping the unmodified 3/4/5 methods, as was done in the first version of the check-in. Good luck.
Thanks for all the work. I did not find the loop-test script on your fork anywhere, but I will probably write my own to use. I would have been unable to use the test script anyway, as my Internet doesn't always connect to the pool on the very first packet / hangs and requires a kick, with no way to detect that easily in a looper script. Also, benchmark mode exits by itself (no sleep/kill).
Yeah, it's no problem at all. I'll probably read the GPU code as well eventually; it's probably already really good, but it's always possible that there is something there. The script is at https://github.com/mbasaman/xmr-stak-config - if you want it, you can take it and rename it. You can update the code to pull the 15-second value; it's just scraping the log file. I think what it does now is wait 75 seconds, then, if the log entry doesn't get generated, re-read the log every 5 seconds for an additional 75 seconds. You can update it to wait longer for the network issues.
Could someone please give me a short summary of whether 20-hash is an improvement, with a short example? Please do not measure one core. Please use a good old config versus a config with this PR.
If you want to use the "10 hashes at a time" method OR the "optimized for prefetch" changes, I would suggest a new PR that retains the original implementations. https://github.com/mbasaman/xmr-stak/tree/more_options_dev is an example, but it would need to be trimmed to improve build time. The last tests I did are available at https://github.com/mbasaman/xmr-stak-config/blob/master/results.2.txt, which showed a 20% improvement versus https://github.com/mbasaman/xmr-stak-config/blob/master/results.1.txt.
What kind of CPU do you use for your tests? Less than 100 H/s looks like some very old, low-end CPU.
We were pretty clear everywhere above about which CPUs got what, and we tested quite a range. And yes, like Celerons and stuff that isn't supposed to work well anyway. But the main interesting thing is how stacking silly-high levels of work lets the Intel funky magic (SmartCache, predictive whatever, adjacent cache line prefetch, etc.) do a much cleaner job, which gets quite a boost in some cases. I think once a huge stack of tasks is queued it stops readjusting the SmartCache topology as much (since it can see a gigantic workload of the same stuff over and over), but when it remains a normal logical task with short/fast sections, it decides to share the cache out to cores differently, and changes it often - leading to unequal hashrates per core (cores should be very close to completely equal) or floating hashrates (the 'high' rate core floats among the others, but the overall total hashrate is stable). Since SmartCache is relatively snake oil with no docs, it makes sense to try semi-random ideas just to see how it reacts, which might tell us more about exactly how to force (or pseudo-force) cache-to-core allocations.
#1649 may be related to SmartCache being "weird" with too short a work stack, but it definitely shows the problem, and I guess as usual "some other miner" is perfect every time. My hope was that this long-stacking idea would fix that.
Is this PR going to be in an upcoming release? I'd love to get my performance back on these CPUs.
It probably needs a thorough cleanup, and/or it seems like a template could generate a variable selection of N-way methods. It would be better to utilize hardware ID information and have a DLL per family - split the CPU backend into so/dll like the other backends. The memory footprint would be smaller, loading at runtime whichever CPU backend has whatever heights work best on it. But collecting reliable benchmark data is somewhat of a problem, so a profiler tool would have to be created. A bench-cloud system could benefit the autoconfig code in the GPU backends as well; not having a benchmark-cloud type system makes it tough to get good optimization data for those just the same. There is that site with various paste dumps of old-format config files and whatnot, but it isn't machine readable nor always accurate, and most entries are from ancient backends or old algorithm versions (out of date).
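Something like this is what I mean for the runtime-loading part (POSIX sketch; the library and symbol names are invented for illustration):

```cpp
#include <dlfcn.h>   // POSIX dynamic loading (Windows would use LoadLibrary)
#include <cstdio>

// Sketch of the per-family backend idea: pick a shared object by detected
// CPU family and resolve its hash entry point at runtime.
typedef void (*hash_fn)(const void* in, void* out);

int main()
{
    const char* lib = "libxmrstak_cpu_broadwell.so";  // chosen from CPUID family
    void* h = dlopen(lib, RTLD_NOW);
    if (!h) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }
    hash_fn fn = reinterpret_cast<hash_fn>(dlsym(h, "cn_hash_10way"));
    if (!fn) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); dlclose(h); return 1; }
    // ... hand fn to the worker threads here ...
    dlclose(h);
    return 0;
}
```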
Yeah, so it does need to be cleaned up. There are probably ways to reduce compile time by moving the code to a static library (or something similar). What I was thinking is to add some kind of learning process to the code, if every CPU is different. For example, you mine for 9 minutes on the best known config, and then spend 1 minute mining on an untested config. After a few hours or days, the program will have built its own deployment-specific benchmarks, which could be persisted locally via the file system or whatever. After the benchmarks have been auto-generated (after a few days), it could still re-benchmark known configs, but probabilistically favor configs that are almost as good as the best config, on the off chance that there were some irregularities during the original benchmarking. That's assuming that you have lots of combinations of implementations via static libraries. The other option would be to just add a single method with 10 hashes per run using the modified macros that are optimized for prefetch, and leave the original code unmodified.
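The 9-minutes-best / 1-minute-untested scheme is basically an epsilon-greedy bandit; a minimal sketch under that assumption (ConfigStats and the candidate list are hypothetical, not xmr-stak structures):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Per-candidate record: best hashrate seen and how often it has been tried.
struct ConfigStats { double best_rate = 0.0; unsigned trials = 0; };

// Pick the next config to mine on: with probability epsilon (~0.1 for the
// 9:1 split above) explore a random candidate, otherwise exploit the best.
std::size_t pick_config(const std::vector<ConfigStats>& stats, double epsilon,
                        std::mt19937& rng)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<std::size_t> pick(0, stats.size() - 1);
        return pick(rng);
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < stats.size(); ++i)
        if (stats[i].best_rate > stats[best].best_rate) best = i;
    return best;
}
```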
And if you guys don't have time to put it together yourselves, just agree on a specification for what you want done and I'll implement it when I have some free time. I can also look at the GPU code, but I'd have to put the time in to read it.
I like @Spudz76's suggestion above of running a benchmark and uploading the results somehow. I agree, the current XMR benchmark site is OK as a reference, but far from definitive and not machine readable at all. I'm not familiar with the client-side portion at all, but could take a stab at putting something on the get/share-results side. What field data would be relevant to store?
It would be ideal if the CPU backend compiled its executable/kernel at runtime based on settings/detections, just like the GPU backends. But then the miner rig needs compilers - not a problem for Linux, but Windows would be painful. It would, however, solve the huge-exe or many-DLL problem and allow for one piece of code with optional sections and variable template expansions/unrolls, just like the OpenCL or CUDA code does, assembled at runtime. Maybe this is why OpenCL supports CPUs at all - it's sort of a portable compile engine. Too bad it's never as fast as direct-to-metal, but I do wonder how close it can get (one would hope Intel or AMD made their CPU OpenCL with the strengths and weaknesses of their CPUs in mind).
I have been somewhat maintaining this patch against current dev on my fork+branch dev-hax; however it only applies the 10-way (which was what I mostly needed). It is untested other than monero7, but the deca-work is all wired up for all algos. Also note donation is already patched out, along with some random type-warning fixes in the latest commit, the CUDA detection verbosity patch from my other PR in the next most recent commit, and the 10-way patch in the third (then normal upstream dev commits). This is the branch I compile my active miners from; feel free to check it out and compile from it too - you can see in the commits there are no awesome backdoors or anything evil added. Interested whether the lite/heavy/fast/bittube variants all work properly, if any of you mine those and have a CPU that likes 10-way... The 10-way helps hashrate on a
Also another branch on my fork,
I've noticed my current branches with the 10-way patch applied probably do not work correctly for anything but what I mine (monero7); there are some missing patches to the macros for newer/changed coins. Caveat until I get them fixed up - maybe converting the macros to true templates, too. I did modify and succeed in getting the CPU backend compiled into a shared library just like the other two. There is some mess with the other backends needing to use the 'core' CPU code to validate results, but I forked a simplified single-thread CPU hash validator ( for that. I am currently wrestling with CMake not building the OpenCL backend until after the main exe, so it doesn't get linked properly; CUDA still builds before the main exe, so I've got something messed up. But CPU mining via the externalized backend seems to work so far.
Also suspicious that the X-way cache associativity on the CPU may be precisely the X-way threading that works optimally. One of my CPUs has a 20-way 15MB L3, thus the hypothesis is that 20-way threading will work best on it - as long as the cache line size is 64 bytes, that is. I'm pretty much rewriting the autotuner to use more of the knowledge that hwloc provides, and finally merging the hwloc and non-hwloc code into one file so it's easier to work with. The non-hwloc autotuner also had more messages about what it found, how it thinks the layout should be, and why - useful info that is totally missing in the hwloc version.
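The cache geometry is easy to pull from hwloc; a minimal sketch assuming the hwloc 2.x API (where HWLOC_OBJ_L3CACHE exists as a distinct object type):

```cpp
#include <hwloc.h>
#include <cstdio>

// Query the first L3 cache's size, associativity (the "X-way" above),
// and line size from the hwloc topology. Error handling trimmed.
int main()
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, 0);
    if (l3 != nullptr)
        std::printf("L3: %llu bytes, %d-way, %u-byte lines\n",
                    (unsigned long long)l3->attr->cache.size,
                    l3->attr->cache.associativity,  // -1 if unknown
                    l3->attr->cache.linesize);
    hwloc_topology_destroy(topo);
    return 0;
}
```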
OK, I'm glad you picked it up. I still think it's theoretically possible to be faster, perhaps using a different method. Good luck.
Please make sure your PR is against dev branch. Merging PRs directly into master branch would interfere with our workflow.