bind replication threads to specific cores #1305

Merged 1 commit into master from feat/sdr-core-handling on Oct 12, 2020

Conversation

porcuquine (Collaborator) commented Oct 6, 2020

In order to maximize replication (precommit phase 1) performance, bind each replication thread to a single core, distinct from the cores used by other replication threads. The consumer and producer threads for each replication task should be on cores which share a cache. If multiple such groups of cores exist, keep the threads for a single replication task together (on cores sharing a cache). Otherwise, group cores based on the number of threads required. We create groups even when they do not correspond to a shared cache in order to make use of the core-binding facility used in the preferred case.

If a core cannot be bound, a message is logged at debug level, but operation is not interrupted: the thread is simply scheduled without constraint, as though no attempt to bind a core had been made.

NOTE: This implementation does not currently support core binding on macOS.

  • Document in README.
  • Identify and remove minimal cause of perf regression (vs best seen in this work, not vs previous).
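
As a rough illustration of the mechanism, the sketch below binds the current thread to a chosen core via the hwloc crate and falls back to unconstrained scheduling if binding fails. It is not the code in this PR: the core-selection and cache-grouping logic described above is omitted, the function name is made up, and exact hwloc API names may differ between crate versions (hwloc 0.5.x and libc are assumed as dependencies).

```rust
// Hedged sketch only: bind the current thread to one physical core using
// the hwloc crate. The grouping of producer/consumer threads onto cores
// that share a cache, described above, is omitted here.
extern crate hwloc;
extern crate libc;

use hwloc::{ObjectType, Topology, CPUBIND_THREAD};

fn bind_current_thread_to_core(topo: &mut Topology, core_index: usize) {
    // Extract the cpuset of the requested physical core, if it exists.
    let cpuset = {
        let cores = topo
            .objects_with_type(&ObjectType::Core)
            .expect("failed to enumerate cores");
        cores.get(core_index).and_then(|core| core.cpuset())
    };

    match cpuset {
        Some(set) => {
            // Unix-specific: identify the current thread for binding.
            let tid = unsafe { libc::pthread_self() };
            // If binding fails, only log and continue; the thread is then
            // scheduled without constraint, as described above.
            if topo.set_cpubind_for_thread(tid, set, CPUBIND_THREAD).is_err() {
                eprintln!("could not bind core {}", core_index);
            }
        }
        None => eprintln!("no cpuset found for core {}", core_index),
    }
}

fn main() {
    let mut topo = Topology::new();
    bind_current_thread_to_core(&mut topo, 0);
    // ... replication work for this thread would run here ...
}
```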

PERF UPDATES

  • I benchmarked replication of a 32GiB sector with the final code at 2:03:05 (on 3970x) — so regression is conquered.
  • I benchmarked the currently-latest version without mlocked window pages (and not run as root) and saw a noticeably faster result: 2:02:23.
  • 2 sectors, with window pages locked: 2:06:27
  • 2 sectors, without window pages locked: 2:06:17

porcuquine force-pushed the feat/sdr-core-handling branch 5 times, most recently from f493237 to 4a2ac47 on October 6, 2020 07:43
.circleci/config.yml (outdated review thread, resolved)
vmx (Contributor) commented Oct 6, 2020

@porcuquine The CI is set up properly; hwloc is installed on Linux as well as macOS. The current CI failures seem to be actual failures.

porcuquine marked this pull request as ready for review on October 6, 2020 17:55
dignifiedquire previously approved these changes Oct 6, 2020
Dizans commented Oct 7, 2020

Just curious: how much can this optimization accelerate PreCommitPhase1 on an AMD 3970X?

dignifiedquire previously approved these changes Oct 7, 2020
porcuquine (Collaborator, Author):

@Dizans I put the latest benchmark in the description. It is almost down to two hours on 3970x.

dignifiedquire previously approved these changes Oct 8, 2020
porcuquine merged commit be32766 into master on Oct 12, 2020
porcuquine deleted the feat/sdr-core-handling branch on October 12, 2020 22:36
xueyangl commented Apr 6, 2021

@porcuquine I am trying to run lotus-bench 1.5.2 on an Intel server platform with two Ice Lake CPUs, running 14 tasks in parallel. At the very beginning of replication your changes work well: 7 tasks are bound to cpu0 and 7 tasks are bound to cpu1. However, after running for some time (still in replication), I saw the binding change: some tasks (maybe in layer 6 or layer 7) originally bound to cpu0 were re-bound to cpu1. Do you have any idea about this symptom?

porcuquine (Collaborator, Author):

I don't have any insight into why that would be, but I'll be curious to hear as you or others learn more. I think @dignifiedquire may soon have hardware on which he could experiment.

xueyangl commented Apr 8, 2021

@porcuquine So glad to hear your feedback. I just want to know why you didn't consider HWLOC_CPUBIND_STRICT when binding CPUs, instead of using CPUBIND_THREAD. We are actually trying CPUBIND_STRICT on our setup to see if there is any improvement.

porcuquine (Collaborator, Author):

I don't remember thinking about it either way. The current implementation worked well on my test hardware, which is most of what I was considering at the time.

xueyangl commented Apr 14, 2021 via email

porcuquine (Collaborator, Author) commented Apr 15, 2021

Out of curiosity, how many cores share an L3 cache on this Ice Lake CPU? You will need to make sure you don't try to use more sealing processes than this number (set the appropriate environment variables), since shared cache is important to this optimization.

> Consumer thread binding to a CPU core at the very beginning: my understanding is that the consumer thread should be fixed on one CPU core during the whole process and should not migrate to another core.

That should be true.

> Producer threads: on each layer, a producer thread will do CPU core binding again, so for different layers a producer thread may be bound onto a different CPU core.

Although the binding is released between layers, the same one should be reacquired for the next layer. Maybe some other thread is getting scheduled there in between, and this reacquisition is failing. The code notes:

                // This could fail, but we will ignore the error if so.
                // It will be logged as a warning by `bind_core`.

Check your logs for evidence (or lack thereof) that this may be happening. Make sure to set log level appropriately.

> If my understanding is correct, how can we guarantee that the consumer thread and producer threads are in the same socket? Is it possible for the consumer thread to bind onto a CPU0 core and the related producer threads to bind onto CPU1 cores?

You might also want to log at DEBUG level and check the output from cores.rs. This might help you understand what is happening. If you compare that output against what you know about the topology, you might be able to confirm or rule out that cores are being detected and chosen as expected.
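
For what it's worth, the per-layer rebinding pattern described above has roughly the following shape. This is a simplified, hypothetical sketch rather than the actual cores.rs code: bind_core, Binding, and run_layers are stand-in names, and only the ignore-the-error behavior mirrors the comment quoted earlier.

```rust
// Hypothetical sketch of per-layer core (re)binding that ignores failures.
struct CoreIndex(usize);

// Stand-in binding guard; dropping it releases the binding.
struct Binding(CoreIndex);

fn bind_core(core: CoreIndex) -> Result<Binding, String> {
    // The real code would call into hwloc here and may fail.
    Ok(Binding(core))
}

fn replicate_layer(_layer: usize) {
    // ... hashing work for one layer ...
}

fn run_layers(num_layers: usize, core: usize) {
    for layer in 0..num_layers {
        // Try to reacquire the same core for each layer.
        // This could fail, but the error is ignored: a warning is logged
        // and the thread runs unbound for this layer.
        let _binding = match bind_core(CoreIndex(core)) {
            Ok(b) => Some(b),
            Err(e) => {
                eprintln!("warning: failed to bind core {}: {}", core, e);
                None
            }
        };
        replicate_layer(layer);
        // `_binding` drops here, releasing the core between layers.
    }
}

fn main() {
    run_layers(11, 0); // e.g. 11 layers, pinned to core 0
}
```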

xueyangl commented Apr 15, 2021 via email

porcuquine (Collaborator, Author) commented Apr 15, 2021

Right, sorry, I should have said L3 (fixed in my original message now).

If the L3 cache is shared by 32 cores, then it seems quite likely this optimization will not help much — or not nearly as much as is ideal. It might help some, so it's still worth trying to make sure you can get the producer threads stably pinned to a single core. But I don't think you will be able to get the very fast inter-core communication (via cache) which makes this particular consumer-producer pattern fast enough to let the producers keep up with the hashing of the consumer.

The idea is that we want the hashing to be the bottleneck, which means we need to be able to feed the cache fast enough. For this to work, the cache needs to be fast, and it needs to stay undisturbed by other workloads. The buffer shared between the cooperating cores (running the producer and consumer threads) needs to stay in that fast cache. This works out well when we can ensure no other work happens on the cores sharing an L3 cache. I am not sure it will work out well with your topology.
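
To make the intended division of labor concrete, here is a generic producer/consumer sketch using a bounded channel from the Rust standard library. This is not the SDR data path (which uses a shared buffer between threads pinned to cache-sharing cores); it only shows the shape of the arrangement, where a producer prefetches data so that the consumer's hashing remains the bottleneck.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Small bounded queue standing in for the cache-resident shared buffer.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);

    // Producer: prepares chunks ahead of the consumer.
    let producer = thread::spawn(move || {
        for i in 0..16u8 {
            let chunk = vec![i; 64]; // stand-in for one node's parent data
            if tx.send(chunk).is_err() {
                break;
            }
        }
    });

    // Consumer: should be the bottleneck (hashing); here a trivial fold.
    let mut acc = 0u64;
    for chunk in rx {
        acc = acc.wrapping_add(chunk.iter().map(|&b| b as u64).sum::<u64>());
    }
    producer.join().unwrap();
    println!("consumed all chunks, acc = {}", acc);
}
```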

xueyangl commented Apr 15, 2021 via email

porcuquine (Collaborator, Author):

I'm not sure hwloc can, but you might try cpuid.

xueyangl commented Apr 16, 2021 via email

xueyangl commented Apr 18, 2021 via email

porcuquine (Collaborator, Author):

This means that the main thread is waiting for data from a producer. If it doesn't happen 'too much', it's probably fine and is (I think) pretty normal. But if you are always waiting then that is a bottleneck. The obvious thing to try is to increase the number of producers, which you can set with FIL_PROOFS_MULTICORE_SDR_PRODUCERS. For this to be useful, though, the producers and the consumer need to share a cache (and no other threads should be disrupting that cache during the sealing process).

If you can't increase the number of producers, you need to make sure the producers are producing fast enough. They read from disk, so I guess making sure you are using an SSD would help. There are many possible reasons, some of which might not be fixable, and it might not matter… Hope that helps.
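
For reference, a setting like the one mentioned above can be read from the environment along these lines. This is illustrative only: the function name and the fallback value of 3 are placeholders, not the project's actual settings code.

```rust
use std::env;

// Illustrative only: read FIL_PROOFS_MULTICORE_SDR_PRODUCERS the way a
// consumer of the setting might, falling back to a placeholder default
// when the variable is unset or unparsable.
fn multicore_sdr_producers() -> usize {
    env::var("FIL_PROOFS_MULTICORE_SDR_PRODUCERS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(3)
}

fn main() {
    println!("using {} producer threads", multicore_sdr_producers());
}
```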
