SDR: cannot put two runs into the same hwloc group #1556

Closed
vmx opened this issue Jan 20, 2022 · 28 comments · Fixed by #1588
Labels
good first issue Good for newcomers

Comments

@vmx
Contributor

vmx commented Jan 20, 2022

Description

We are using hwloc for the multicore SDR. We want to make sure that the workers and the producer share the same L3 cache. As reported in filecoin-project/lotus#7981, it looks like we assign at most one multicore SDR run per group. If more runs are requested than there are groups, the additional runs get assigned arbitrarily.

Ideally, if all groups already have one multicore SDR run assigned, we should check whether there is still enough room to put another one into a group. Example: you have 24 cores and they form 4 groups, where each group has its own L3 cache, so each group has 6 cores available for SDR. If you run with only 2 producers (and one consumer), a run uses only 3 cores, so you could put another SDR run into the same group. The two runs will battle for the L3 cache, but that might be fine if the lookahead is small enough.
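To make the arithmetic concrete, here is a minimal sketch of that packing rule. It is not the actual cores.rs code; the function names are made up for illustration:

```rust
/// One multicore SDR run binds `producers + 1` threads: the producers plus one consumer.
fn cores_per_run(producers: usize) -> usize {
    producers + 1
}

/// How many SDR runs could share one L3 cache group of `group_size` cores.
fn runs_per_group(group_size: usize, producers: usize) -> usize {
    group_size / cores_per_run(producers)
}

fn main() {
    // The example above: 24 cores in 4 L3 groups => 6 cores per group.
    // With 2 producers (and 1 consumer) a run needs 3 cores, so 2 runs fit per group.
    assert_eq!(runs_per_group(24 / 4, 2), 2);
}
```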

Acceptance criteria

You can have several SDR runs assigned to the same group if there are enough cores. Ideally this comes with a test that checks the assignment that will be used (see the sketch below). Such a test would likely need to be run manually as it is highly hardware specific, but that's fine.
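A manually-run test could look roughly like the following. `core_units` is a hypothetical stand-in for the real allocator in cores.rs, included only so the sketch is self-contained; the real test would query the machine's topology via hwloc:

```rust
/// Hypothetical stand-in for the allocator: split `total_cores` evenly into
/// `groups` L3 cache groups and carve each group into units of `producers + 1` cores.
fn core_units(total_cores: usize, groups: usize, producers: usize) -> Vec<Vec<usize>> {
    let group_size = total_cores / groups;
    let unit_size = producers + 1;
    let mut units = Vec::new();
    for group in 0..groups {
        let base = group * group_size;
        for unit in 0..group_size / unit_size {
            units.push((0..unit_size).map(|i| base + unit * unit_size + i).collect());
        }
    }
    units
}

#[test]
#[ignore] // highly hardware specific against the real implementation; run manually
fn two_runs_can_share_a_group() {
    // 24 cores, 4 L3 groups, 2 producers (plus 1 consumer) per run.
    let units = core_units(24, 4, 2);
    // 6 cores per group and 3 cores per run => 2 runs per group, 8 units in total.
    assert_eq!(units.len(), 8);
    // The first two units both stay within the first group (cores 0..6).
    assert!(units[0].iter().chain(units[1].iter()).all(|&core| core < 6));
}
```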

Risks + pitfalls

Make sure that the current assignment of one SDR per group (until there are no more free groups left) is kept intact.

Where to begin

https://github.com/filecoin-project/rust-fil-proofs/blob/307e7af34732eed5d3a6885e4da33489dbcb1ccf/storage-proofs-porep/src/stacked/vanilla/cores.rs contains the relevant code.

@benjaminh83

I have my EPYC MILAN 7413 dual CPU test rig running Multicore SDR with FIL_PROOFS_MULTICORE_SDR_PRODUCERS=2

Each CPU has 4 CCXs, giving a total of 8 CCXs across the whole platform. Logs are set to DEBUG.

I see correct thread allocation of new PC1 jobs, one per CCX, for up to 8 PC1 jobs. For PC1 jobs 1-8, the logs generally look like the examples below:

2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 2
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 2
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 1
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 1
2022-02-03T09:34:14.220 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:34:14.220 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

and

2022-02-03T09:41:54.815 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 8
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 8
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:41:54.816 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 7
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 7
2022-02-03T09:41:54.816 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

But as mentioned in the linked issue, we get a problem/error when trying to allocate threads for PC1 job 9. This is what the logs show when that allocation is attempted and fails:

2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::proof > multi core replication
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > create labels
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 0 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 1 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 2 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 3 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 4 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 5 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 6 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 7 locked, could not checkout
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::memory_handling > initializing cache
2022-02-03T09:43:05.282 WARN storage_proofs_porep::stacked::vanilla::memory_handling > failed to lock map Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }, falling back
2022-02-03T09:43:05.282 WARN storage_proofs_porep::stacked::vanilla::memory_handling > failed to lock map Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }, falling back
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Layer 1
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Creating labels for layer 1
2022-02-03T09:43:05.283 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:43:05.283 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:43:05.283 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:43:05.283 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

I find the last couple of lines interesting. We see the messages about binding the core in producer threads 0/1, but no core is actually picked. In my specific CPU setup, I would have wanted it to assign the next free cores on CCX0 of CPU0 (it has 6 physical cores in each CCX, so it should be allowed to run 2 jobs with 2 producers each in each CCX). I would have expected it to run the master thread of the 9th PC1 job on core 3, since cores 0, 1 and 2 are running the 1st PC1 job, and then the logs should have stated:

storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 4
storage_proofs_porep::stacked::vanilla::cores > binding to 4

storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 5
storage_proofs_porep::stacked::vanilla::cores > binding to 5

But it didn't, and now the 9th PC1 job is just being balanced across free cores on both CPUs; sealing time goes from 3h to 9h for those jobs :(

@vmx
Contributor Author

vmx commented Feb 3, 2022

Yes, this is what I tried to describe in my bug report. Currently only one SDR run is assigned to each CCX; once every CCX has one SDR run assigned, random cores are used.

@benjaminh83

@vmx any idea if this will be addressed any time soon?
I think it makes really good sense to get it fixed, as it's more efficient to do PC1 on a dual-socket system, and right now we cannot really do that without going into virtualization and cheating lotus.
I can create a VM, bind it to only the second CPU, and tell the VM to present the OS with 15 CCXs of 2 cores each, and lotus happily does 15 PC1 multicore SDR jobs in parallel on this VM at normal speeds.
I just don't think it's right that we need to work like this to get lotus working because of a bug, right? And not being able to easily use dual-CPU servers is just a waste of power, which we are trying to prevent, right? :)

@vmx
Contributor Author

vmx commented Mar 17, 2022

@benjaminh83 there's no timeline for this yet, though it's something I'd surely like to look into.

vmx added a commit that referenced this issue Apr 22, 2022
For multicore SDR it is important that the producer and consumers share the
same (L3) cache. Hence we bind threads to specific cores. Prior to this
change there was one multicore SDR job per group, even if the group could
accommodate multiple such jobs. If more jobs were scheduled than groups
were available, those additional jobs wouldn't use specific cores, but
whatever the operating system decided.

With this change, additional jobs are now put into the groups in case there
is enough space to accommodate them.

Fixes #1556.
@f8-ptrk

f8-ptrk commented Apr 24, 2022

@vmx is there a way forward for this to get into a release soon?

just to give you an impression of the impact: when planning PC1 servers, this issue is the difference between a single server with two 32-core $3000 CPUs (if the scheduling works) and two single-CPU servers with a 64-core CPU at $7k each

@vmx
Contributor Author

vmx commented Apr 25, 2022

@f8-ptrk I don't know how quickly it will make it into a proper release. But you should be able to try it out today by building from source with the change from #1588 applied.

@f8-ptrk

f8-ptrk commented Apr 25, 2022

sweet! Quick question: if we have a 4-core CCX and run multicore SDR PC1s with 1 consumer + 1 producer each:

will it fit 2 PC1s (2x 1+1) in that group or 3 (3x 1+1)? One producer core could be shared between 3 PC1s (in theory).

@f8-ptrk

f8-ptrk commented Apr 25, 2022

@cryptowhizzard this is worth applying to the dual-CPU Milans you've got for PC1! It might solve the problems.

@vmx
Contributor Author

vmx commented Apr 25, 2022

It will fit 2 multicore SDRs into the group. The multicore SDRs run independently, they don't share producers.

@f8-ptrk

f8-ptrk commented Apr 25, 2022

It will fit 2 multicore SDRs into the group. The multicore SDRs run independently, they don't share producers.

they could share a core for their individual producers; that's what I meant (a producer hardly maxes out its core)

but acked: [core count per group] / (1 + producers) = max PC1s per group

@vmx
Contributor Author

vmx commented Apr 25, 2022

but acked: [core count per group] / (1 + producers) = max PC1s per group

Correct, each consumer/producer runs in its own thread, and each thread is bound to one specific core.
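Spelled out for the configurations mentioned in this thread (a small sketch; the helper name is made up, this is not the cores.rs API):

```rust
fn max_pc1_per_group(cores_per_group: usize, producers: usize) -> usize {
    cores_per_group / (producers + 1)
}

fn main() {
    assert_eq!(max_pc1_per_group(4, 1), 2); // 4-core CCX, 1 producer + 1 consumer
    assert_eq!(max_pc1_per_group(6, 2), 2); // 6-core CCX (e.g. EPYC 7413), 2 producers + 1 consumer
    assert_eq!(max_pc1_per_group(6, 1), 3); // 6-core CCX, 1 producer + 1 consumer
}
```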

@benjaminh83

@vmx Sounds like we have something for testing on ZEN3/MILAN. I would love to get those tests going on my dual 7413, where I have 6 cores in each CCX, and basically want to run 2 (or maybe even 3) jobs in each CCX. I hope this will work.

Any idea if this will work on a dual Intel Xeon 3rd gen? In theory, that will just look like two big CCXs. Would you expect this to work? I could test it on my dual 8380, but I wouldn't start the testing if you already know the fix does not cover this scenario.

@vmx
Contributor Author

vmx commented Apr 25, 2022

@benjaminh83 it would be great if someone would try it out!

Any idea if this will work on a dual Intel Xeon 3rd gen? In theory, that will just look like two big CCXs. Would you expect this to work?

Yes, that would also work. It would then split those two big CCXs into smaller units.

The test might make things clearer. I'd hope it can be understood even by people not fluent in Rust. It shows how cores are assigned: it returns a list of those "units", and the multicore SDR runs will use those units one after another.

Your case sounds similar to this test case, where there are 16 cores and two CCXs, with 2 producers (hence, due to the consumer, 3 threads running).

@vmx
Contributor Author

vmx commented Apr 25, 2022

I should also note for everyone trying this out: if you run it with RUST_LOG=debug, you should get a message like Core units: …, which contains a nested array showing the units it calculated. This way you can see whether it works as you'd expect.
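As an illustration only (the exact formatting of that log line may differ): for the 16-core, two-CCX case mentioned above with FIL_PROOFS_MULTICORE_SDR_PRODUCERS=2, i.e. 3 threads per run, one would expect two 3-core units per 8-core group, so a nested array along the lines of:

Core units: [[0, 1, 2], [3, 4, 5], [8, 9, 10], [11, 12, 13]]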

@cryptowhizzard

I would love to test this on the latest 5.18 Linux kernels, since they have better support for Milan/Zen CPUs and NUMA. Is it possible to make a special tag for this so we can compile it in?

@vmx
Contributor Author

vmx commented Apr 25, 2022

@cryptowhizzard what do you mean with "special tag"?

@cryptowhizzard

@vmx to have it in lotus, so we can compile lotus-worker with it.

@vmx
Contributor Author

vmx commented Apr 25, 2022

@cryptowhizzard oh, you mean having a released version of it?

@benjaminh83

@vmx @cryptowhizzard no problem! I will take it up in SPX with @jennijuju and see if we can get a "jenni-tagged" lotus with this package inside.

@hyunmoon

hyunmoon commented Jul 4, 2022

Is this currently stable enough to be included in the lotus release? @vmx

@hyunmoon

hyunmoon commented Jul 4, 2022

@benjaminh83 Could you request another jenni-tag since the code has been updated?

@vmx
Contributor Author

vmx commented Jul 4, 2022

Is this currently stable enough to be included in the lotus release? @vmx

Yes, the plan is to have it in the next Lotus release.

@hyunmoon

hyunmoon commented Jul 4, 2022

Great! I hope you meant v1.17, not v1.18.

@vmx
Contributor Author

vmx commented Jul 4, 2022

Great! I hope you meant v1.17, not v1.18.

v1.18.

@f8-ptrk

f8-ptrk commented Jul 4, 2022

that's most likely September :/

@donkabat

donkabat commented Jul 5, 2022

so... when do you schedule v1.18_rc1? :)

@OneilYang

OneilYang commented Jul 14, 2022

Does it need more settings for a dual-CPU 7302 setup?
CPU: 7302 * 2 (dual CPU), FIL_PROOFS_MULTICORE_SDR_PRODUCERS=1
I tested 2 PC1s on it and found this:
[screenshot]
It looks like something is not right for 3/5/7/9 but fine for 2/4/6/8; any suggestions?

vmx closed this as completed in #1588 on Aug 2, 2022
@OneilYang

Thanks @vmx, now it works very well on 7302*2, great work~~
