cpu cryptonight_twenty_hash #1604

Open · wants to merge 5 commits into dev
Conversation

@mbasaman (Author)
Please make sure your PR is against the dev branch. Merging PRs directly into the master branch would interfere with our workflow.

@mbasaman (Author) commented May 29, 2018

This seems to improve CPU performance a bit. I'm surprised that the load from RAM takes so much time; I think the update makes the prefetch option more effective (at least on the CPUs that I tested).

I've seen some comments on the internet indicating that prefetch is only effective if you wait 100-200 CPU cycles or so. I'd like to know if there is some asynchronous way to load a register from RAM, because I don't think the computations take much time at all.

Also, you could probably add some code to "auto-discover" the optimal number of hashes based on runtime measurements for the user's CPU.
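To illustrate the batching idea, here is a minimal sketch (not this PR's actual macro code) of interleaving N independent hash states so that the prefetch issued for one state overlaps the computation of the others; do_round() is a hypothetical stand-in for the real AES/ALU work:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <xmmintrin.h> // _mm_prefetch

    // Hypothetical stand-in for one hash round: touch the line, then advance
    // to the next offset within a 2MB scratchpad.
    static std::size_t do_round(std::uint8_t* pad, std::size_t i)
    {
        pad[i] ^= 0xA5;
        return (i + 64) & ((std::size_t(1) << 21) - 1);
    }

    template <std::size_t N>
    void interleaved_rounds(std::array<std::uint8_t*, N>& pads,
                            std::array<std::size_t, N>& idx,
                            std::size_t iterations)
    {
        for (std::size_t i = 0; i < iterations; ++i)
        {
            // Issue all N prefetches first; each needs on the order of
            // 100-200 cycles before the cache line actually arrives.
            for (std::size_t h = 0; h < N; ++h)
                _mm_prefetch(reinterpret_cast<const char*>(pads[h] + idx[h]),
                             _MM_HINT_T0);

            // Do the ALU work for all N states; by the time state h touches
            // its scratchpad line again, the prefetch has (hopefully) landed.
            for (std::size_t h = 0; h < N; ++h)
                idx[h] = do_round(pads[h], idx[h]);
        }
    }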

@Spudz76 (Contributor) commented May 30, 2018

Agreed on the possibilities for smarter CPU autodetection, based on short successive benchmark runs, in more of an AI fashion than fingerprinting directly from cache size and core count, and mostly not looking at the manufacturer or model or core type at all.

I had previously considered extending the newer --benchmark option to run successive short rounds to rough-guess the best settings. Probably a --retrain option, which would start from what is currently in the config for the backend, "mine" for the best settings, and put those back in the config file for you (or write out a new version).

I've also sort of wanted to make a utility function that would read old config files and write out the latest version with the new settings for version upgrades; retrain is similar to that as well.
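A rough sketch of what that --retrain flow could look like; run_short_benchmark() and write_config() are hypothetical helpers that do not exist in xmr-stak today (the existing --benchmark option would be the starting point for the former):

    #include <initializer_list>

    struct ThreadCfg { int low_power_mode; bool no_prefetch; int affine_to_cpu; };

    // Hypothetical helpers: a short timed mining round, and a config writer.
    double run_short_benchmark(const ThreadCfg& cfg);
    void   write_config(const ThreadCfg& cfg);

    ThreadCfg retrain(const ThreadCfg& start)
    {
        ThreadCfg best = start;
        double best_rate = run_short_benchmark(start);

        for (int depth : {1, 3, 4, 5, 10, 20}) // candidate thread depths
        {
            ThreadCfg trial = start;
            trial.low_power_mode = depth;
            double rate = run_short_benchmark(trial);
            if (rate > best_rate)
            {
                best_rate = rate;
                best = trial;
            }
        }
        write_config(best); // put the winner back in the config file
        return best;
    }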

@mbasaman (Author) commented May 30, 2018

Yeah, that sounds good.

What I had observed is that 20 hashes at a time was faster than 3 or 5, but the optimal number might be 15 or something like that.

It's probably different for each machine. I'm not sure exactly what the trade-offs are, but I think RAM access is the issue, and I'd prefer a method that is deterministic. Perhaps that's not possible; I'd also like to try the Intel "streaming" features.

One interesting thing I thought of is that it's very unlikely that you would "need to know" the data that you store in the next loop iteration (something like 1/128000), so if the "store" is causing a delay and there were an asynchronous way to do it, it might just work 99% of the time. I haven't done the math, and you would need to double-check the hash if it exceeds the pool difficulty. I don't know if the "store" is the problem or if the feature is even available. I could be wrong.
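For what it's worth, the Intel "streaming" feature mentioned above is exposed as the non-temporal store intrinsic; a minimal sketch (whether it actually helps here is untested, and the destination must be 16-byte aligned):

    #include <emmintrin.h> // SSE2: _mm_stream_si128, _mm_store_si128

    // Non-temporal store: writes around the cache through write-combining
    // buffers instead of stalling to pull the destination line into cache.
    // The trade-off matches the comment above: if the next iteration does
    // read this address back (rare), the data has to come from RAM, so a
    // share that beats the pool difficulty would need re-verification.
    void store_round_result(__m128i* scratch_slot, __m128i value)
    {
        _mm_stream_si128(scratch_slot, value); // vs. _mm_store_si128(...)
    }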

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request May 31, 2018
@Spudz76 (Contributor) commented May 31, 2018

Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
15MB smartcache
Using Win7 with

    { "low_power_mode" : 20, "no_prefetch" : false, "affine_to_cpu" : 0 },
    { "low_power_mode" : 20, "no_prefetch" : false, "affine_to_cpu" : 1 },
    { "low_power_mode" : 20, "no_prefetch" : false, "affine_to_cpu" : 2 },
    { "low_power_mode" : 20, "no_prefetch" : false, "affine_to_cpu" : 3 },

I assume that's the correct way to enable this patch (20 with prefetch not disabled, aka enabled). Also tested with 20 but no_prefetch true, and that gave identical speed to single-threaded/normal.

The same config without the patch, with 1/true, gave 278 H/s; with the patch it's 261 H/s, a loss of 17 H/s or 6.11%.

An Atom D2550 with Linux and gcc-7 lost half its rate, but that's not surprising; it isn't fast anyway and has no AES. More testing out of curiosity, and it made things worse, as expected.


Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz
3MB smartcache
Same as the Xeon test above, but with Linux, clang-3.8, and two host threads; otherwise the same.
Normal was 86 H/s and with the patch it's 81 H/s, a loss of 5 H/s or 5.81%.
With 20/true it showed 85, but that is within margin of error, matching the Xeon result.
Yes, this is overcommitted on cache and should probably only run one host thread; however, I do get an additional ~16 H/s off the cache-starved thread without apparently hurting the other one.


Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
20MB smartcache
DUAL CPU (so divide the numbers below by two for a single-CPU result)
Running 12 host threads, Linux, cmake 3.8.1 (jessie backports)
Normal 594 H/s, with the patch 566 H/s, a loss of 28 H/s or 4.71%.
Within the error margin again with 20/true.

@mbasaman (Author) commented May 31, 2018

I think the config has to be 6. It's faster on the machines I've tried.

"cpu_threads_conf" :
[
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 2 },
],

[2018-06-01 05:25:43] : Mining coin: monero7
[2018-06-01 05:25:43] : Starting 6x thread, affinity: 0.
[2018-06-01 05:25:43] : hwloc: memory pinned
[2018-06-01 05:25:43] : Starting 6x thread, affinity: 2.
[2018-06-01 05:25:43] : hwloc: memory pinned

Totals (ALL): 74.1 73.7 0.0 H/s
Highest: 74.5 H/s


"cpu_threads_conf" :
[
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 1 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 2 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 3 },
],

[2018-06-01 05:29:37] : Mining coin: monero7
[2018-06-01 05:29:37] : Starting 5x thread, affinity: 0.
[2018-06-01 05:29:37] : hwloc: memory pinned
[2018-06-01 05:29:37] : Starting 5x thread, affinity: 1.
[2018-06-01 05:29:37] : hwloc: memory pinned
[2018-06-01 05:29:37] : Starting 5x thread, affinity: 2.
[2018-06-01 05:29:37] : hwloc: memory pinned
[2018-06-01 05:29:37] : Starting 5x thread, affinity: 3.
[2018-06-01 05:29:37] : hwloc: memory pinned

Totals (ALL): 67.6 67.3 0.0 H/s
Highest: 67.7 H/s

Intel(R) Core(TM) i7-3537U CPU @ 2.00GHz

I've tested it on a few other Linux machines; it wasn't as good, but it's still better in the tests I've run.

@Spudz76 (Contributor) commented May 31, 2018

That gave 60 H/s on the little i3 under Linux, with or without the second host thread. Waaay worse.
81 H/s on the 278 H/s Xeon E5, ewwww.

What CPUs had you tested with success?

@mbasaman (Author) commented May 31, 2018

Are you using low_power_mode == 6? It's faster on several servers that I've tried. I think using 20 in the config will fall back to low_power_mode = 1 (which hasn't changed).

My expectation was the same as yours, that it would be slower, but it's faster on my old i7 and perhaps 10% faster on some of the other processors I've tried. The only thing I can tell you is that I'm using it, and my total rate is higher than with any of the other options. Something like 3100 to 3500 H/s.

@mbasaman (Author) commented May 31, 2018

One other thing to note is that I'm only using CPUs 0 and 2, not 0, 1, 2, and 3.

Perhaps you can check whether only using half of the CPUs has any effect.

The next best setting is 4 CPUs at power level 5 on my machine. The logs are above.

@mbasaman (Author)

Yeah, so reducing the number of hashes to 10 and kicking it back up to 4 CPUs put it in line with 20 hashes on 2 CPUs.

[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 0.
[2018-06-01 06:03:39] : hwloc: memory pinned
[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 1.
[2018-06-01 06:03:39] : hwloc: memory pinned
[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 2.
[2018-06-01 06:03:39] : hwloc: memory pinned
[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 3.
[2018-06-01 06:03:39] : hwloc: memory pinned

Totals (ALL): 75.1 73.5 0.0 H/s
Highest: 79.0 H/s

@Spudz76 (Contributor) commented May 31, 2018

Yes, the retest in the previous post was with == 6.
I guess it's strange that it accepts 20 if there is no index for it; it must just default to 0/1 (same meaning), which explains getting the same result with no_prefetch == true as with 0/false/1.
The == 6 result was clearly doing something different.
What compiler were you using? I am very likely using weird ones (clang, gcc-7) compared to yours. Maybe the compiler is optimizing out your hand expansions or such?

@Spudz76 (Contributor) commented May 31, 2018

I think I see: use fewer host threads (cores) so that the cache and prefetch stack up longer queues (to hit that hundreds-of-cycles mark), instead of widening bandwidth and/or using the full cache size (cache / 2MB for monero7). Taller work stacks via two blocks (4MB cache).

Testing that...

After checking more, it seems this wants a total number of threads (host thread cores * power threads) that is divisible by 20; then it will use the 20-way. Without the patch, the same settings would use 4 * 5-way and not stack the prefetch up as far... I think?

@mbasaman (Author) commented Jun 1, 2018

I'm not an expert on compiler optimizations. The compiler I'm using is gcc (Debian 6.3.0-18+deb9u1).

If that doesn't work, you may want to try 10 hashes at a time on the full number of cores. 8 or 12 might be even better.

What I was thinking is that prefetch isn't helping as much as it should, and the cause may be that there aren't enough CPU cycles after the prefetch instruction.

If you were reading exclusively from L1 (32KB, or 256KB for L2) via prefetch, I think it would be faster, but there must be some trade-off with the limited number of registers. I tried 100 hashes at a time, but it was slower.

Perhaps the store instruction is also a problem. I was trying to think whether there was a way to offload the store, because it's very unlikely you would re-read those bits right away.

I also re-ordered the macros to put the prefetch last (in the twenty method only).

If it's not faster for you, then don't worry too much about it. Also, 10 hashes with the full number of cores may be more convenient to configure.

@Spudz76 (Contributor) commented Jun 1, 2018

I see around 65 per core with four host threads, 75 per core with two.
So there is a gain there, but 4@65 is better than 2@75, and I can't seem to get 4@75. BUT this may be turbo kicking in because I am using fewer cores.
Otherwise, various settings of low_power_mode, even up to 3200 (didn't know it would go that high), gave more or less similar results (260 H/s where normal was 268 H/s), and then 151 H/s with the two cores. It would be real sweet to get 302 H/s out of these, but again I think that is all due to the turbo from using 2 of 4 physical cores. If I drop to one, it should go slightly faster yet (80 H/s?).

I do not really know how to check for actual usage of the 20-way at runtime.

@mbasaman (Author) commented Jun 1, 2018

What I would try is 10 hashes at a time. You can change twenty_work_main() to use multiway_work_main<10u>() instead of multiway_work_main<20u>() and comment out STEP(a10) through STEP(a19), as sketched below.
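Roughly this (names as they appear in this PR's branch; a paraphrase of the described edit, not the exact diff):

    // Existing N-way worker in this PR's branch (declaration for context):
    template <unsigned N> void multiway_work_main();

    // Point the 20-way entry at a 10-way expansion instead:
    void twenty_work_main()
    {
        multiway_work_main<10u>(); // was: multiway_work_main<20u>();
    }

    // ...and in the unrolled hash loop, keep STEP(a0) through STEP(a9) and
    // comment out STEP(a10) through STEP(a19).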

I'll make the change on the external branch now or you can try it yourself.

@mbasaman (Author) commented Jun 1, 2018

OK, I committed the changes to my fork. If you want to try it, you can clone it.

On this machine, 10 hashes on 4 cores gives similar numbers to 20 hashes on 2 cores.

I'll clean it up if that works. You may also want to try the Linux machine.

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Jun 1, 2018
@Spudz76 (Contributor) commented Jun 1, 2018

Okay so what do you get with

"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
],

@mbasaman (Author) commented Jun 1, 2018

I updated it to use 10 hashes instead of 20 (with the hope that you can use all 4 cores).

Try a new build with the update and use all 4 cores:

"cpu_threads_conf" :
[
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 1 },
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 2 },
{ "low_power_mode" : 6, "no_prefetch" : false, "affine_to_cpu" : 3 },
],

Totals (ALL): 76.3 74.7 0.0 H/s
Highest: 77.6 H/s

I think the base case of power level 5 was 68 H/s or something like that.

@mbasaman (Author) commented Jun 1, 2018

Also, I just checked in a third update.

This one makes power level 5 (as well as 6) use the new code. Perhaps the macro changes are the cause of the improvement I am seeing.

I had originally not wanted to touch the existing code.

"cpu_threads_conf" :
[
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 1 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 2 },
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 3 },
],

With the "latest" update to power level 5:

Totals (ALL): 76.3 75.9 0.0 H/s
Highest: 76.5 H/s

Before the update:

Totals (ALL): 63.6 67.7 0.0 H/s
Highest: 69.0 H/s

So perhaps the macro changes were the cause, not the increased number of hashes. Check out the latest build and give both power levels 5 and 6 a try with all 4 cores.

@mbasaman (Author) commented Jun 1, 2018

Ok, so I made a fourth update to power levels 3, 4, 5, and 10 to use the branched code.

Starting 3x thread, affinity: 0.
Starting 3x thread, affinity: 1.
Starting 3x thread, affinity: 2.
Starting 3x thread, affinity: 3.

Original - Totals (ALL): 58.9 59.5 0.0 H/s
Updated - Totals (ALL): 61.6 61.3 0.0 H/s

Starting 4x thread, affinity: 0.
Starting 4x thread, affinity: 1.
Starting 4x thread, affinity: 2.
Starting 4x thread, affinity: 3.

Original - Totals (ALL): 66.8 66.9 0.0 H/s
Updated - Totals (ALL): 72.6 72.4 0.0 H/s

Starting 5x thread, affinity: 0.
Starting 5x thread, affinity: 1.
Starting 5x thread, affinity: 2.
Starting 5x thread, affinity: 3.

Original - Totals (ALL): 69.9 71.6 0.0 H/s
Updated - Totals (ALL): 79.1 78.6 0.0 H/s

Starting 10x thread, affinity: 0.
Starting 10x thread, affinity: 1.
Starting 10x thread, affinity: 2.
Starting 10x thread, affinity: 3.

Updated - Totals (ALL): 79.8 78.0 0.0 H/s

Please update the build and give it a try. I'm seeing it faster for power levels 3, 4, 5, and 10 now. If that's not what you see, let me know.

*** note: I changed the config to expect 10 instead of 6

"cpu_threads_conf" :
[
{ "low_power_mode" : 10, "no_prefetch" : false, "affine_to_cpu" : 0 },
{ "low_power_mode" : 10, "no_prefetch" : false, "affine_to_cpu" : 1 },
{ "low_power_mode" : 10, "no_prefetch" : false, "affine_to_cpu" : 2 },
{ "low_power_mode" : 10, "no_prefetch" : false, "affine_to_cpu" : 3 },
],

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Jun 1, 2018
@Spudz76 (Contributor) commented Jun 1, 2018

Okay, it looks like all the CPUs I was testing are Ivy Bridge variants, and none of this makes those faster at all; it hurts them badly. Got 60/75 total versus the usual 260-280 across 3, 4, 5, and 10, leaving the affinity and thread counts the same as normal/best.

Tested this time on Linux with a weird Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz, which is Broadwell-based, and it did show gains!
Normal best with false/true was ~92 H/s, and with this patch and 10/false it gets 102 H/s, running three cores of four due to the 6MB cache (so only 3 x 2MB of block space).

So maybe this just hates the Ivy and Sandy memory controller design? But I thought your CPU is an Ivy... did you try my single-thread no-prefetch config on it from a couple of posts back? I'm curious whether that roasts your other rates like it does on all of mine.

@mbasaman (Author) commented Jun 1, 2018

This patch requires prefetch to be on; without prefetch it won't do anything. Let me know what you'd like me to test. I'd recommend trying power level 5 with prefetch on first.

I'll give the single threads, with and without prefetch, a try.

@Spudz76 (Contributor) commented Jun 1, 2018

I want you to test my normal Ivy/Sandy config, which I posted above, since you seem to have an Ivy Bridge core. No low power and no prefetch; yes, then it's not using your code, but that is how I'm getting rates that are 4x faster than with prefetch and your patch.

Directed prefetch seems to work well with Broadwell cores, but however the Ivy/Sandy internal prefetch optimization works (it can't be shut off), it doesn't like being directed, and I get 4x the performance by letting the CPU do all prefetch management (aka "no prefetch" from xmr-stak; the CPU still does its own regardless of that).

@Spudz76 (Contributor) commented Jun 2, 2018

OK, got to test this on Win7 with an Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz with 3MB cache, and found a strange way to a new best of 106 H/s total: use one thread with your code at 10, and another on the other physical core the way I usually have it set (false/true), and then both get around 53 H/s each.

Not sure why or how (it should be out of cache), but it works. Normal with all false/true was 86 H/s, so it gained 20 H/s.

Same gain with the same config on Linux on the i3-4160 3.6GHz 3MB. It runs 6 H/s slower overall in any mode, so even with the faster CPU, Linux is slower.

    { "low_power_mode" : 10, "no_prefetch" : false, "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },

Also, these are Haswell cores with 8/8/8/12-way caches.

@mbasaman (Author) commented Jun 2, 2018

All right, well, I don't know if that is useful to you or not.

If it's something that you want, what I would do is leave the existing methods exactly as they are and add new methods for the additional options. I would add 6, 7, 8, 9, and 20 as options as well. I think the only issue with this is the added compile time.

I have no idea why it works; as you said, it should be out of cache. The only thing I can think of is that with additional CPU cycles between the prefetch and the load, it is more likely that the memory is read from L1, which should be a lot faster than L3.

In addition to that, I would create a wrapper script that iterates through all the possible "combinations" to find the best config, by actually running for 1 or 2 minutes on each possible config. It could be done in C or even shell, and it could produce a report that lists the config possibilities in order. Hopefully it would only take a couple of hours to run.

@Spudz76 (Contributor) commented Jun 2, 2018

Loaded that Xeon D-1518 Broadwell up with 10/false but used every core (0-7), which includes the HT cores, and it scores 150 H/s that way. Up from the original 92 H/s; huge.

@mbasaman (Author) commented Jun 2, 2018

Are you comparing against trunk? I also changed 3/4/5 with the latest check-in.

@Spudz76 (Contributor) commented Jun 2, 2018

No, the comparison is against dev without your patches (also the normal build I run daily).
I have never run master, ever, since I started compiling in-situ on each Linux rig.

And yes, I have been checking all four permutations, 3/4/5/10 with false, and I have all the patches.

@mbasaman (Author) commented Jun 4, 2018

It went from 72.4 to 86.1 H/s on this i7-3537U CPU @ 2.00GHz.

The script is checked in at xmr-stak-config. You can take it and rename it if you want it.

Regarding "cache starvation": I think the L1 cache is only 32KB (256KB for L2). If prefetch is working, it should never hit L3. I think the trade-off is the limited number of registers for the variables, not L3. I could be wrong.

I think it should still be faster. I was able to get over 400 H/s by commenting out the _mm_load_si128 and _mm_store_si128 calls, so memory access is probably the issue, not the encryption (unless the compiler is doing something). What I am going to try next is all the combinations of possible macros, switching between 9 hashes at a time.


power options: 1 2 3 4 5
branch: master

number of cores: 2
threads_per_core: 2
test time: 75 seconds
power options: 5
prefetch options: 2
configurations: 65
estimated runtime: 1:21:15

{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 0},
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 1},
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 2},
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 3},

Hashrate: 72.4


power options: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 103 104 105 106 107 108 109 110 111 112 113 114 115
branch: more_options_dev

// N > 100 use the branched code optimized for prefetch with (N - 100) hashes

number of cores: 2
threads_per_core: 2
test time: 75 seconds
power options: 28
prefetch options: 1
configurations: 434
estimated runtime: 9:02:30

{ "low_power_mode" : 109, "no_prefetch" : false, "affine_to_cpu" : 0},
{ "low_power_mode" : 112, "no_prefetch" : false, "affine_to_cpu" : 1},
{ "low_power_mode" : 109, "no_prefetch" : false, "affine_to_cpu" : 2},
{ "low_power_mode" : 112, "no_prefetch" : false, "affine_to_cpu" : 3},

Hashrate: 86.1

Spudz76 added a commit to Spudz76/xmr-stak that referenced this pull request Jun 4, 2018
@mbasaman (Author) commented Jun 6, 2018

OK, I'm moving on to something else for the time being. If you want to use the new 10 method, I would suggest keeping the unmodified 3/4/5 methods, as was done in the first version of the check-in.

Good luck

@Spudz76 (Contributor) commented Jun 6, 2018

Thanks for all the work. I did not find the loop-test script on your fork anywhere, but I will probably write my own to use --benchmark 7 --benchwait 1 --benchwork 14, so it only spends ~15s per test and doesn't need to connect to a pool at all.

I would have been unable to use the test script anyway, as my Internet doesn't always connect to the pool on the very first packet (it hangs and requires a kick), and there is no easy way to detect that in a looper script. Also, benchmark mode exits by itself (no sleep/kill).

@mbasaman (Author) commented Jun 6, 2018

Yeah, it's no problem at all. I'll probably read the GPU code as well eventually; it's probably already really good, but it's always possible that there is something there.

The script is at https://github.com/mbasaman/xmr-stak-config

If you want the script, you can take it and rename it.

You can update the code to use the 15-second value. It's just scraping the log file.

What it's doing now is waiting 75 seconds; then, if the log entry doesn't get generated, it re-reads the log every 5 seconds for an additional 75 seconds. You can update it to wait longer for the network issues.
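The wait-then-poll logic described above, as a C++ sketch (parse_hashrate_from_log() is hypothetical; the real script just scrapes the xmr-stak log):

    #include <chrono>
    #include <optional>
    #include <string>
    #include <thread>

    // Hypothetical parser returning the hashrate once it appears in the log.
    std::optional<double> parse_hashrate_from_log(const std::string& log_path);

    std::optional<double> wait_for_hashrate(const std::string& log_path)
    {
        using namespace std::chrono_literals;
        std::this_thread::sleep_for(75s);        // initial mining window
        for (int retry = 0; retry < 15; ++retry) // then poll 15 x 5s = 75s more
        {
            if (auto rate = parse_hashrate_from_log(log_path))
                return rate;
            std::this_thread::sleep_for(5s);
        }
        return std::nullopt; // treat as a failed run
    }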

@psychocrypt (Collaborator)

Could someone please give me a short summary of whether 20-hash is an improvement, with a short example. Please do not measure a single core. Please compare a good old config against a config with this PR.

@mbasaman (Author)

If you want to use the "10 hashes at a time" method OR the "optimized for prefetch" changes, I would suggest a new PR that retains the original implementations.

https://github.com/mbasaman/xmr-stak/tree/more_options_dev is an example, but it would need to be trimmed to improve build time.

The last tests I did are available at:

https://github.com/mbasaman/xmr-stak-config/blob/master/results.2.txt

which showed a 20% improvement versus

https://github.com/mbasaman/xmr-stak-config/blob/master/results.1.txt

@psychocrypt (Collaborator) commented Jul 15, 2018 via email

@Spudz76 (Contributor) commented Jul 15, 2018

We were pretty clear everywhere above about which CPUs got what, and we tested quite a range.

And yes, like Celerons and stuff that aren't supposed to work well anyway. But the main interesting thing is how stacking silly-high levels of work lets the Intel funky magic (SmartCache, predictive whatever, adjacent cache line prefetch, etc.) do a much cleaner job, which gives quite a boost in some cases.

I think once a huge stack of tasks is queued, it stops readjusting the SmartCache topology as much (as it can see a gigantic workload of the same stuff over and over). But when it remains a normal logical task with short/fast sections, it decides to share out the cache to the cores differently, and changes it often, leading to unequal per-core hashrates (cores should be very close to completely equal) or floating hashrates (the 'high' rate floats among the cores, but the overall total hashrate is stable).

Since SmartCache is relatively snake oil with no docs, it makes sense to try semi-random ideas just to see how it reacts, which might tell us more about exactly how to force (or pseudo-force) cache-to-core allocations.

@Spudz76 (Contributor) commented Jul 15, 2018

#1649 may be related to SmartCache being "weird" with too short a work stack, but it definitely shows the problem, and I guess as usual "some other miner" is perfect every time.

My hope was that this long-stacking idea would fix that.

@baldpope

Is this PR going to be in an upcoming release? I'd love to get my performance back on these CPUs

@Spudz76 (Contributor) commented Jul 19, 2018

It probably needs a thorough cleanup, and/or it seems like a template could generate a variable selection of low_power_mode thread depths. When compiling with depths 2-150 it takes about an hour, building a kernel per thread depth, and the executable is gigantic, so including them all is not particularly great except for benchmarking to figure out which depths to include for your system(s).

It would be better to utilize hardware ID information and have a DLL per CPU family, splitting the CPU backend into a so/dll like the other backends. The memory footprint would be smaller, loading at runtime whichever CPU backend has whatever depths work best on that hardware.

But collecting reliable benchmark data is somewhat of a problem, so a profiler tool would have to be created (xmr-chek?) that uses most of the core backend code but only runs hardware ID and benchmarks (test all possible backends, step through thread depths for 5s each, then retest the top ones for 15s each, like a successive search) and provides a file to post/upload somewhere in a machine-importable format. That would provide enough information to know what's best for a particular CPU and/or family of CPUs, which could then be used to make a specific backend for that ID/fingerprint (model/stepping/revision/cache amount/etc.).

A bench-cloud system could benefit the autoconfig code in the GPU backends as well; not having a benchmark-cloud type system makes it just as tough to get good optimization data for those.

There is that site with various paste dumps of old-format config files and whatnot, but that isn't machine-readable nor always accurate, and most entries are from ancient backends or old algorithm versions (out of date).

@mbasaman (Author)

Yeah, so it does need to be cleaned up.

There are probably ways to reduce compile time by moving the code to a static library (or something similar).

What I was thinking is to add some kind of learning process to the code, since every CPU is different. For example, you mine for 9 minutes on the best known config and then spend 1 minute mining on an untested config. After a few hours or days, the program will have built its own deployment-specific benchmarks, which could be persisted locally via the file system or whatever.

After the benchmarks have been auto-generated (after a few days), it could still re-benchmark known configs, but probabilistically favor configs that are almost as good as the best one, on the off chance that there were some irregularities during the original benchmarking. A sketch of that schedule follows.
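A sketch of that exploit/explore schedule, assuming hashrates are tracked per config string; the 90/10 split and the 0.9 "almost as good" cutoff are illustrative numbers, not measurements:

    #include <map>
    #include <random>
    #include <string>

    std::string pick_next_config(const std::map<std::string, double>& rate_by_cfg,
                                 const std::string& best_cfg,
                                 std::mt19937& rng)
    {
        std::uniform_int_distribution<int> slot(0, 9);
        if (slot(rng) != 0)
            return best_cfg; // ~9 of 10 slots: mine on the best known config

        // ~1 of 10 slots: retest a config close to the best rate, in case its
        // original benchmark was irregular.
        const double best_rate = rate_by_cfg.at(best_cfg);
        for (const auto& [cfg, rate] : rate_by_cfg)
            if (cfg != best_cfg && rate > 0.9 * best_rate)
                return cfg;
        return best_cfg; // nothing near-best to retest
    }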

That's assuming you have lots of combinations of implementations via static libraries. The other option would be to just add a single method with 10 hashes per run, using the modified macros that are optimized for prefetch, and leave the original code unmodified.

@mbasaman (Author) commented Jul 19, 2018

And if you guys don't have time to put it together yourselves, just agree on a specification for what you want done and I'll implement it when I have some free time.

I can also look at the GPU code, but I'd have to put in the time to read it.

@baldpope commented Jul 19, 2018

I like @Spudz76's suggestion above of running a benchmark and uploading the results somehow. I agree, the current XMR benchmark site is OK as a reference, but far from definitive and not machine-readable at all.

I'm not familiar with the client-side portion at all, but I could take a stab at putting something together on the get/share-results side. What field data would be relevant to store?

@Spudz76 (Contributor) commented Jul 19, 2018

It would be ideal if the CPU backend compiled its executable/kernel at runtime based on settings/detections, just like the GPU backends. But then the miner rig needs compilers; that's not a problem for Linux, but Windows would be painful. However, it would solve the huge-exe or many-DLL problem and allow for one piece of code with optional sections and variable template expansions/unrolls, assembled at runtime just like the OpenCL or CUDA code.

Maybe this is why OpenCL supports CPUs at all; it's sort of a portable compile engine. Too bad it is never as fast as direct-to-metal, but I do wonder how close it can get (one would hope Intel or AMD designed their CPU OpenCL with the strengths and weaknesses of their CPUs in mind).

@Spudz76 (Contributor) commented Jul 19, 2018

I have been somewhat maintaining this patch against current dev on my fork, branch dev-hax; however, it only applies the 10-way (which is what I mostly needed).

It is untested for anything other than monero7, but the deca-work is all wired up for all algos. Also note that donation is already patched out, along with some random type-warning fixes, in the latest commit; the CUDA detection verbosity patch from my other PR is in the next most recent commit; and the 10-way patch is in the third (then normal upstream dev commits). This is the branch I compile my active miners from; feel free to check out and compile from it too. You can see in the commits that there are no awesome backdoors or anything evil added. Interested whether the lite/heavy/fast/bittube variants all work properly, if any of you mine those and have a CPU that likes 10-way...

The 10-way helps hashrate on an Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz by quite a bit compared to defaults. Also a slight boost using 10-way on the first core of an Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz with a single default thread on the second core (only 3MB cache; not sure how the second single thread even gets cache to use, but it works...).

Also, another branch on my fork, dev-superthread, is currently behind, but I will begin adapting the 20-way and such to it.

@Spudz76 (Contributor) commented Jul 25, 2018

I've noticed that my current branches with the 10-way patch applied probably do not work correctly for anything but what I mine (monero7); there are some missing patches to the macros for newer/changed coins.

Caveat until I get them fixed up. Maybe converting the macros to true templates too.

I did succeed in getting the CPU backend compiled into a shared library just like the other two. There is some mess with the other backends needing to use the 'core' CPU code to validate results, but I forked a simplified single-thread CPU hash validator (cpu::reschk instead of cpu::minethd, which is the actual backend) to leave in the main backend static lib for those to use. A bonus is that the unoptimized validator should never have any optimization-related miscalculation problems, and it won't contend with CPU mining for cache (the validator runs only when one of the backends reports a result, not continually in a loop, so speed is not an issue and it can stay out of the way).

I am currently wrestling with CMake not building the OpenCL backend until after the main exe, so it doesn't get linked properly. CUDA still builds before the main exe, so I've got something messed up.

But CPU mining via the externalized backend seems to work so far.

@Spudz76 (Contributor) commented Jul 25, 2018

Also suspicious that the X-way cache associativity of the CPU may be precisely the X-way threading that works optimally. One of my CPUs has a 20-way 15MB L3, so the hypothesis is that 20-way threading will work best on it. As long as the cache line size is 64 bytes, that is...

I'm pretty much rewriting the autotuner to use more of the knowledge that hwloc provides, and finally merging the hwloc and non-hwloc code into one file so it's easier to work with. The non-hwloc autotuner also had more messages about what it found, how it thinks the layout should be, and why; that's useful info but totally missing in the hwloc version.

@mbasaman (Author) commented Jul 30, 2018

OK, I'm glad you picked it up. I still think it's theoretically possible to go faster, perhaps using a different method. Good luck.
