cpu cryptonight_twenty_hash #1604
base: dev
Conversation
This seems to improve CPU performance a bit. I'm surprised that the load from RAM takes so much time; I think the update makes the prefetch option more effective (at least on the CPUs that I tested). I've seen some comments on the internet indicating that prefetch is only effective if you wait 100-200 CPU cycles or so afterwards. I'd like to know if there is some asynchronous way to load a register from RAM, because I don't think the computations take much time at all. Also, you could probably add some code to "auto-discover" the optimal number of hashes based on runtime measurements for the user's CPU.
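To illustrate the prefetch-distance idea, a minimal sketch (the names and the mixing step are placeholders, not this PR's actual code):

```cpp
#include <emmintrin.h>  // _mm_prefetch / _MM_HINT_T0
#include <cstdint>
#include <cstddef>

// Illustrative only: issue the prefetch for the *next* scratchpad entry,
// then do the current round's work, so the 100-200 cycle load latency is
// hidden behind computation instead of stalling the pipeline.
void mix_round(uint64_t* scratchpad, std::size_t cur, std::size_t next)
{
    _mm_prefetch(reinterpret_cast<const char*>(&scratchpad[next]), _MM_HINT_T0);
    scratchpad[cur] ^= 0x9e3779b97f4a7c15ull;  // stand-in for the real AES/XOR work
}
```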
Agreed on the possibilities of smarter CPU autodetection, based on short successive benchmarking - more of an AI approach than fingerprinting directly from cache size and core count, and mostly not looking at the manufacturer, model, or core type at all. I had previously considered extending the newer autodetection along those lines. I had also sort of wanted to make a utility function that would read old config files and then write out the latest version with the new settings for version upgrades, and retraining is similar to that as well.
Yeah, that sounds good. What I had observed is that 20 hashes at a time was faster than 3 or 5, but the optimal number might be 15 or something like that; it's probably different for each machine. I'm not sure exactly what the trade-offs are, but I think RAM access is the issue, and I'd prefer a method that is deterministic. Perhaps that's not possible; I'd also like to try the Intel "streaming" features. One interesting thing I had thought of is that it's very unlikely that you would "need to know" the data that you store in the next loop iteration (something like 1/128000), so if the "store" is causing a delay and there were an asynchronous way to do it, it might just work 99% of the time. I haven't done the math, and you would need to double-check the hash if it exceeds the pool difficulty. I don't know if "store" is actually the problem or if the feature is even available. I could be wrong.
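For reference, by the Intel "streaming" features I mean the non-temporal store intrinsics; a rough sketch of the idea (hypothetical usage, not something the patch does):

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_stream_si128, _mm_sfence

// Sketch of the "asynchronous store" idea: a non-temporal store pushes the
// 16-byte block toward RAM without allocating a cache line, so later loop
// iterations are not stalled behind it. dst must be 16-byte aligned.
void store_block(__m128i* dst, __m128i value)
{
    _mm_stream_si128(dst, value);
    // For the rare case where the data *is* re-read right away (the ~1/128000
    // chance discussed above), an _mm_sfence() before the read would be the
    // conservative way to make the store globally visible first.
}
```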
I assume the correct way to enable this patch is 20 with prefetch not disabled (i.e. enabled). Also tested with 20 but no_prefetch true, and that gave speed identical to single-threaded/normal. The same config without the patch at 1/true gave 278 H/s; with the patch it's 261 H/s, a loss of 17 H/s or 6.11%. An Atom D2550 with Linux and gcc-7 lost half its rate, but that's not surprising: it isn't fast anyway and has no AES. That was more testing out of curiosity, and it made things worse, which is as expected.
I think the config has to be 6. It's faster on the machines I've tried.
"cpu_threads_conf" :
[2018-06-01 05:25:43] : Mining coin: monero7
Totals (ALL): 74.1 73.7 0.0 H/s
"cpu_threads_conf" :
[2018-06-01 05:29:37] : Mining coin: monero7
Totals (ALL): 67.6 67.3 0.0 H/s
Intel(R) Core(TM) i7-3537U CPU @ 2.00GHz
I've tested it on a few other Linux machines; it wasn't as good there, but it's still better in the tests I've run.
That gave 60 H/s on the little i3 under Linux, with or without the second host thread. Waaay worse. Which CPUs had you tested with success?
Are you using low_power_mode == 6? It's faster on several servers that I've tried. I think using 20 in the config will fall back to low_power_mode = 1 (which hasn't changed). My expectation was the same as yours, that it would be slower, but it's faster on my old i7 and perhaps 10% faster on some of the other processors I've tried. The only thing I can tell you is that I'm using it and my total rate is higher than with any of the other options. Something like 3100 to 3500 H/s.
One other thing to note is that I'm only using CPUs 0 and 2, not 0, 1, 2, and 3. Perhaps you can check whether using only half of the CPUs has any effect. The next best setting is 4 CPUs at power level 5 on my machine. The logs are above.
Yeah, so reducing the number of hashes to 10 and kicking it back up to 4 CPUs put it in line with 20 hashes on 2 CPUs.
[2018-06-01 06:03:39] : Starting (10)x thread, affinity: 0.
Totals (ALL): 75.1 73.5 0.0 H/s
Yes, the retest in the previous post was with == 6.
I think I see: use fewer host threads (cores) so that the cache and prefetch stack up longer queues (to hit that hundreds-of-cycles mark), instead of widening bandwidth and/or fully utilizing the cache size (cache / 2MB for monero7). Taller work stacks via two blocks (4MB cache). Testing that... Also, after checking, it seems this wants a total number of threads (host thread cores * power threads) that is divisible by 20, and then it will use the 20-way; without the patch, those same settings would use 4 * 5-way and not stack the prefetch up as far... I think?
I'm not an expert on the compiler optimizations. The compiler I'm using is gcc (Debian 6.3.0-18+deb9u1). If that doesn't work, you may want to try 10 hashes at a time on the full number of cores; 8 or 12 might be even better. What I was thinking is that prefetch isn't helping as much as it should, and the cause may be that there aren't enough CPU cycles after the prefetch command. If you were reading from L1 exclusively (32KB, or the 256KB reachable via prefetch) I think it would be faster, but there must be some trade-off with the limited number of registers. I tried 100 hashes at a time but it was slower. Perhaps the store command is also a problem; I was trying to think whether there is a way to offload the store, because it's very unlikely you would re-read those bits right away. I also re-ordered the macros to put the prefetch as the last command (in the twenty method only). If it's not faster for you, then don't worry too much about it. Also, 10 hashes with the full number of cores may be more convenient to configure.
I see around 65 per core with four host threads, 75 per core with two. I don't really know how to check for actual usage of the 20-way at runtime.
What I would try is 10 hashes at a time. You can change twenty_work_main() to use multiway_work_main<10u>(); instead of multiway_work_main<20u>(); and comment out STEP(a10) through STEP(a19). I'll make the change on the external branch now, or you can try it yourself.
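Roughly this shape - a self-contained stand-in, not the actual minethd code (and the real STEP(a10) through STEP(a19) lines would be commented out alongside it):

```cpp
#include <cstdio>

// Hypothetical stand-in for xmr-stak's multiway_work_main<N>() template:
// N controls how many hashes are interleaved per loop iteration.
template <unsigned N>
void multiway_work_main()
{
    std::printf("running %u-way hashing\n", N);
}

void twenty_work_main()
{
    // multiway_work_main<20u>();  // original 20-way version
    multiway_work_main<10u>();     // the suggested 10-at-a-time variant
}

int main() { twenty_work_main(); }
```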
OK, I committed the changes to my local fork; if you want to try it you can clone it. It gives similar numbers with 10 hashes on 4 cores to the 20 hashes with 2 cores on this machine. I'll clean it up if that works. You may also want to try the Linux machine.
Okay so what do you get with
I updated it to use 10 hashes instead of 20 (with the hope that you can use all 4 cores). Try a new build with the update and use all 4 cores.
"cpu_threads_conf" :
Totals (ALL): 76.3 74.7 0.0 H/s
I think the base case of power level 5 was 68 H/s or something like that.
Also, I just checked in a third update, this one to use the new code on power level 5 (as well as 6). Perhaps the macro changes are the cause of the improvement I am seeing; I had originally not wanted to touch the existing code.
"cpu_threads_conf" :
With the "latest" update to power level 5: Totals (ALL): 76.3 75.9 0.0 H/s
Before the update: Totals (ALL): 63.6 67.7 0.0 H/s
So perhaps the macro changes were the cause, not the increased number of hashes. Check out the latest build and give both power level 5 and 6 a try with all 4 of the cores.
OK, so I made a fourth update, to power levels 3, 4, 5, and 10, to use the branched code.
Starting 3x thread, affinity: 0. Original - Totals (ALL): 58.9 59.5 0.0 H/s
Starting 4x thread, affinity: 0. Original - Totals (ALL): 66.8 66.9 0.0 H/s
Starting 5x thread, affinity: 0. Original - Totals (ALL): 69.9 71.6 0.0 H/s
Starting 10x thread, affinity: 0. Updated - Totals (ALL): 79.8 78.0 0.0 H/s
Please update the build and give it a try. I'm seeing it faster for power levels 3, 4, 5, and 10 now. If that's not what you see, let me know.
*** note - I changed the config to expect 10 instead of 6: "cpu_threads_conf" :
Okay, it looks like all the CPUs I was testing are Ivy Bridge variants, and none of this makes those faster at all; it hurts them badly. I got 60/75 total versus the usual 260-280 across 3/4/5/10, leaving the affinity and thread counts the same as normal/best. This time I tested on Linux with an unusual Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz, which is Broadwell-based, and it did show gains though! So maybe this just hates the Ivy and Sandy memory controller design? But I thought your CPU was an Ivy... did you try my single-thread no-prefetch config on it from a couple of posts back? I'm curious if that roasts your other rates like it does on all of mine.
This patch requires prefetch to be on; without prefetch it won't do anything. Let me know what you'd like me to test. I'd recommend trying power level 5 with prefetch on first. I'll give the single-thread runs with and without prefetch a try.
I want you to test my normal Ivy/Sandy config, which I posted above, since you seem to have an Ivy Bridge core: no low power and no prefetch. Yes, then it's not using your code, but that is how I'm getting rates that are 4x faster than with prefetch and your patch. Directed prefetch seems to work well with Broadwell cores, but however the Ivy/Sandy ones implement their internal prefetch optimization (which can't be shut off), it doesn't like being directed, and I get 4x the performance by letting it do all the prefetch management (aka "no prefetch" in xmr-stak; again, the CPU still does its own regardless of that setting).
OK, got to test this on Win7. Not sure why/how (it should be out of cache) but it works. Normal with all false/true was 86 H/s, so it gained 20 H/s. Same gain with the same config in Linux on the i3-4160 3.6GHz 3MB. It runs 6 H/s slower overall in any mode, so even with the faster CPU, Linux is slower.
Also, these are Haswell cores with 8/8/8/12-way caches.
All right, well, I don't know if that is useful to you or not. If it's something that you want, what I would do is leave the existing methods exactly as they are and add new methods for the additional options; I would add 6, 7, 8, 9, and 20 as options as well. I think the only issue with this is the added compile time. I have no idea why it works - as you said, it should be out of cache. The only thing I can think of is that with additional CPU cycles between the prefetch and the load, it is more likely that the memory is read from L1, which should be a lot faster than L3. In addition to that, I would create a wrapper script that iterates through all the possible "combinations" to find the best config, by actually running for 1 or 2 minutes on each possible config. It could be done in C or even shell, and would produce a report that lists the config possibilities in order. Hopefully it would only take a couple of hours to run.
Loaded that Xeon D-1518 Broadwell up with 10/false but used every core (0-7), which means HT cores also, and it scores 150 H/s that way. Up from the original 92 H/s - huge.
Are you comparing against the trunk? I also changed 3/4/5 with the latest check-in.
No, against
And yes, I have been checking all four permutations (3/4/5/10, false) and have all the patches.
It went from 72.4 to 86.1 H/s on this i7-3537U CPU @ 2.00GHz. The script is checked in at xmr-stak-config; you can take it and rename it if you want it. Regarding "cache starvation": I think the L1 cache is only 32KB (256KB for L2). If prefetch is working, it should never hit L3; I think the trade-off is with the number of registers for the variables, not L3. I could be wrong, but I think it should still be faster. I was able to get over 400 H/s by commenting out the _mm_load_si128 and _mm_store_si128, so memory access is probably the issue, not the encryption (unless the compiler is doing something). What I am going to try next is all the combinations of the possible macros, switching at 9 hashes at a time.
power options: 1 2 3 4 5
number of cores: 2
{ "low_power_mode" : 5, "no_prefetch" : false, "affine_to_cpu" : 0},
Hashrate: 72.4
power options: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 103 104 105 106 107 108 109 110 111 112 113 114 115 // N > 100 use the branched code optimized for prefetch with (N - 100) hashes
number of cores: 2
{ "low_power_mode" : 109, "no_prefetch" : false, "affine_to_cpu" : 0},
Hashrate: 86.1
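For clarity, the encoding above could be read like this (hypothetical helper, not the actual checked-in code):

```cpp
// Hypothetical decoder for the extended low_power_mode values: N > 100 means
// "use the branched, prefetch-optimized code with (N - 100) hashes per thread".
inline unsigned hashes_for_mode(unsigned low_power_mode, bool& use_branched)
{
    use_branched = (low_power_mode > 100);
    return use_branched ? low_power_mode - 100 : low_power_mode;
}
```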
OK, I'm moving on to something else for the time being. If you want to use the new 10 method, I would suggest keeping the unmodified 3/4/5 methods, as was done in the first version of the check-in. Good luck.
Thanks for all the work. I did not find the loop-test script on your fork anywhere, but I will probably write my own to use. I would have been unable to use the test script anyway, as my Internet doesn't always connect to the pool on the very first packet / hangs and requires a kick, with no way to detect that easily in a looper script. Also, benchmark mode exits by itself (no sleep/kill).
Yeah, it's no problem at all. I'll probably read the GPU code as well eventually; it's probably already really good, but it's always possible that there is something there. The script is at https://github.com/mbasaman/xmr-stak-config - if you want it, you can take it and rename it. You can update the code to pull the 15-second value; it's just scraping the log file. I think what it does now is wait 75 seconds, then, if the log entry doesn't get generated, re-read the log every 5 seconds for an additional 75 seconds. You can update it to wait longer for the network issues.
Could someone please give me a short summary of whether 20-hash is an improvement, with a short example? Please do not measure one core. Please use a good old config versus a config with this PR.
If you want to use the "10 hashes at a time" method OR the "optimized for prefetch" changes, I would suggest a new PR that retains the original implementations. https://github.com/mbasaman/xmr-stak/tree/more_options_dev is an example, but it would need to be trimmed to improve build time. The last tests I did are available at https://github.com/mbasaman/xmr-stak-config/blob/master/results.2.txt, which showed a 20% improvement versus https://github.com/mbasaman/xmr-stak-config/blob/master/results.1.txt.
What kind of CPU do you use for your tests? Less than 100 H/s looks like some very old, low-end CPU.
We were pretty clear everywhere above about which CPUs got what, and we tested quite a range. And yes, like Celerons and stuff that isn't supposed to work well anyway. But the main interesting thing is how stacking silly-high levels of work lets the Intel funky magic (SmartCache, predictive whatever, adjacent cache line prefetch, etc.) do a much cleaner job, which gets quite a boost in some cases. I think once a huge stack of tasks is queued it stops readjusting the SmartCache topology as much (since it can see a gigantic workload of the same stuff over and over), but when it remains a normal logical task with short/fast sections, it decides to share the cache out to cores differently, and changes it often - leading to unequal hashrates per core (cores should be very close to completely equal) or floating hashrates (the 'high' rate core floats among the others, but the overall total hashrate is stable). Since SmartCache is relatively snake oil with no docs, it makes sense to try semi-random ideas just to see how it reacts, which might tell us more about exactly how to force (or pseudo-force) cache-to-core allocations.
#1649 may be related to SmartCache being "weird" with too short a work stack, but it definitely shows the problem, and I guess as usual "some other miner" is perfect every time. My hope was that this long-stacking idea would fix that.
Is this PR going to be in an upcoming release? I'd love to get my performance back on these CPUs.
It probably needs a thorough cleanup, and/or it seems like a template could generate a variable selection of N-way methods. It would be better to utilize hardware ID information and have a DLL per family - split the CPU backend into so/dll like the other backends. The memory footprint would be smaller, loading at runtime whichever CPU backend has whatever heights work best on it. But collecting reliable benchmark data is somewhat of a problem, so a profiler tool would have to be created. A bench-cloud system could benefit the autoconfig code in the GPU backends as well; not having a benchmark-cloud type system makes it tough to get good optimization data for those just the same. There is that site with various paste dumps of old-format config files and whatnot, but it isn't machine readable nor always accurate, and most entries are from ancient backends or old algorithm versions (out of date).
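Something like this is what I mean for the runtime-loading part (POSIX sketch; the library and symbol names are invented for illustration):

```cpp
#include <dlfcn.h>   // POSIX dynamic loading (Windows would use LoadLibrary)
#include <cstdio>

// Sketch of the per-family backend idea: pick a shared object by detected
// CPU family and resolve its hash entry point at runtime.
typedef void (*hash_fn)(const void* in, void* out);

int main()
{
    const char* lib = "libxmrstak_cpu_broadwell.so";  // chosen from CPUID family
    void* h = dlopen(lib, RTLD_NOW);
    if (!h) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }
    hash_fn fn = reinterpret_cast<hash_fn>(dlsym(h, "cn_hash_10way"));
    if (!fn) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); dlclose(h); return 1; }
    // ... hand fn to the worker threads here ...
    dlclose(h);
    return 0;
}
```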
Yeah, so it does need to be cleaned up. There are probably ways to reduce compile time by moving the code to a static library (or something similar). What I was thinking is to add some kind of learning process to the code, if every CPU is different. For example, you mine for 9 minutes on the best known config, and then spend 1 minute mining on an untested config. After a few hours or days, the program will have built its own deployment-specific benchmarks, which could be persisted locally via the file system or whatever. After the benchmarks have been auto-generated (after a few days), it could still re-benchmark known configs, but probabilistically favor configs that are almost as good as the best config, on the off chance that there were some irregularities during the original benchmarking. That's assuming that you have lots of combinations of implementations via static libraries. The other option would be to just add a single method with 10 hashes per run using the modified macros that are optimized for prefetch, and leave the original code unmodified.
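The 9-minutes-best / 1-minute-untested scheme is basically an epsilon-greedy bandit; a minimal sketch under that assumption (ConfigStats and the candidate list are hypothetical, not xmr-stak structures):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Per-candidate record: best hashrate seen and how often it has been tried.
struct ConfigStats { double best_rate = 0.0; unsigned trials = 0; };

// Pick the next config to mine on: with probability epsilon (~0.1 for the
// 9:1 split above) explore a random candidate, otherwise exploit the best.
std::size_t pick_config(const std::vector<ConfigStats>& stats, double epsilon,
                        std::mt19937& rng)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<std::size_t> pick(0, stats.size() - 1);
        return pick(rng);
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < stats.size(); ++i)
        if (stats[i].best_rate > stats[best].best_rate) best = i;
    return best;
}
```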
And if you guys don't have time to put it together yourselves, just agree on a specification for what you want done and I'll implement it when I have some free time. I can also look at the GPU code, but I'd have to put the time in to read it.
I like @Spudz76's suggestion above of running a benchmark and uploading the results somehow. I agree, the current XMR benchmark site is OK as a reference, but far from definitive and not machine readable at all. I'm not familiar with the client-side portion at all, but could take a stab at putting something on the get/share-results side. What field data would be relevant to store?
It would be ideal if the CPU backend compiled its executable/kernel at runtime based on settings/detections, just like the GPU backends. But then the miner rig needs compilers - not a problem for Linux, but Windows would be painful. It would, however, solve the huge-exe or many-DLL problem and allow for one piece of code with optional sections and variable template expansions/unrolls, just like the OpenCL or CUDA code does, assembled at runtime. Maybe this is why OpenCL supports CPUs at all - it's sort of a portable compile engine. Too bad it's never as fast as direct-to-metal, but I do wonder how close it can get (one would hope Intel or AMD made their CPU OpenCL with the strengths and weaknesses of their CPUs in mind).
I have been somewhat maintaining this patch against current dev on my fork+branch dev-hax; however it only applies the 10-way (which was what I mostly needed). It is untested other than monero7, but the deca-work is all wired up for all algos. Also note donation is already patched out, along with some random type-warning fixes in the latest commit, the CUDA detection verbosity patch from my other PR in the next most recent commit, and the 10-way patch in the third (then normal upstream dev commits). This is the branch I compile my active miners from; feel free to check it out and compile from it too - you can see in the commits there are no awesome backdoors or anything evil added. Interested whether the lite/heavy/fast/bittube variants all work properly, if any of you mine those and have a CPU that likes 10-way... The 10-way helps hashrate on a
Also another branch on my fork,
I've noticed my current branches with the 10-way patch applied probably do not work correctly for anything but what I mine (monero7); there are some missing patches to the macros for newer/changed coins. Caveat until I get them fixed up - maybe converting the macros to true templates, too. I did modify and succeed in getting the CPU backend compiled into a shared library just like the other two. There is some mess with the other backends needing to use the 'core' CPU code to validate results, but I forked a simplified single-thread CPU hash validator ( for that. I am currently wrestling with CMake not building the OpenCL backend until after the main exe, so it doesn't get linked properly; CUDA still builds before the main exe, so I've got something messed up. But CPU mining via the externalized backend seems to work so far.
Also suspicious that the X-way cache associativity on the CPU may be precisely the X-way threading that works optimally. One of my CPUs has a 20-way 15MB L3, thus the hypothesis is that 20-way threading will work best on it - as long as the cache line size is 64 bytes, that is. I'm pretty much rewriting the autotuner to use more of the knowledge that hwloc provides, and finally merging the hwloc and non-hwloc code into one file so it's easier to work with. The non-hwloc autotuner also had more messages about what it found, how it thinks the layout should be, and why - useful info that is totally missing in the hwloc version.
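The cache geometry is easy to pull from hwloc; a minimal sketch assuming the hwloc 2.x API (where HWLOC_OBJ_L3CACHE exists as a distinct object type):

```cpp
#include <hwloc.h>
#include <cstdio>

// Query the first L3 cache's size, associativity (the "X-way" above),
// and line size from the hwloc topology. Error handling trimmed.
int main()
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, 0);
    if (l3 != nullptr)
        std::printf("L3: %llu bytes, %d-way, %u-byte lines\n",
                    (unsigned long long)l3->attr->cache.size,
                    l3->attr->cache.associativity,  // -1 if unknown
                    l3->attr->cache.linesize);
    hwloc_topology_destroy(topo);
    return 0;
}
```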
OK, I'm glad you picked it up. I still think it's theoretically possible to be faster, perhaps using a different method. Good luck.
Please make sure your PR is against dev branch. Merging PRs directly into master branch would interfere with our workflow.