SDR: cannot put two runs into the same hwloc group #1556

Closed
vmx opened this issue Jan 20, 2022 · 28 comments · Fixed by #1588
Labels
good first issue Good for newcomers

Comments

@vmx
Contributor

vmx commented Jan 20, 2022

Description

We are using hwloc for the multicore SDR. We want to make sure that the workers and the producer share the same L3 cache. As reported in filecoin-project/lotus#7981, it looks like we assign at most one multicore SDR run per group. If more runs are requested than there are groups, the additional runs get assigned arbitrarily.

Ideally, if all groups already have one multicore SDR run assigned, we should check whether there is still enough room to put another one into a group. Example: you have 24 cores and they form 4 groups, where each group has its own L3 cache, so each group has 6 cores available for SDR. If you run with only 2 producers (and one consumer), a run uses only 3 cores, so you could put another SDR run into the same group. The two runs will battle for the L3 cache, but that might be fine if the lookahead is small enough.
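To make the arithmetic concrete, here is a minimal sketch of that packing rule. It is not the actual cores.rs code; the function names are made up for illustration:

```rust
/// One multicore SDR run binds `producers + 1` threads: the producers plus one consumer.
fn cores_per_run(producers: usize) -> usize {
    producers + 1
}

/// How many SDR runs could share one L3 cache group of `group_size` cores.
fn runs_per_group(group_size: usize, producers: usize) -> usize {
    group_size / cores_per_run(producers)
}

fn main() {
    // The example above: 24 cores in 4 L3 groups => 6 cores per group.
    // With 2 producers (and 1 consumer) a run needs 3 cores, so 2 runs fit per group.
    assert_eq!(runs_per_group(24 / 4, 2), 2);
}
```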

Acceptance criteria

You can have several SDR runs assigned to the same group if there are enough cores. Ideally this comes with a test that checks the assignment that will be used (see the sketch below). Such a test would likely need to be run manually as it is highly hardware specific, but that's fine.
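A manually-run test could look roughly like the following. `core_units` is a hypothetical stand-in for the real allocator in cores.rs, included only so the sketch is self-contained; the real test would query the machine's topology via hwloc:

```rust
/// Hypothetical stand-in for the allocator: split `total_cores` evenly into
/// `groups` L3 cache groups and carve each group into units of `producers + 1` cores.
fn core_units(total_cores: usize, groups: usize, producers: usize) -> Vec<Vec<usize>> {
    let group_size = total_cores / groups;
    let unit_size = producers + 1;
    let mut units = Vec::new();
    for group in 0..groups {
        let base = group * group_size;
        for unit in 0..group_size / unit_size {
            units.push((0..unit_size).map(|i| base + unit * unit_size + i).collect());
        }
    }
    units
}

#[test]
#[ignore] // highly hardware specific against the real implementation; run manually
fn two_runs_can_share_a_group() {
    // 24 cores, 4 L3 groups, 2 producers (plus 1 consumer) per run.
    let units = core_units(24, 4, 2);
    // 6 cores per group and 3 cores per run => 2 runs per group, 8 units in total.
    assert_eq!(units.len(), 8);
    // The first two units both stay within the first group (cores 0..6).
    assert!(units[0].iter().chain(units[1].iter()).all(|&core| core < 6));
}
```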

Risks + pitfalls

Make sure that the current assignment of one SDR per group (until there are no more free groups left) is kept intact.

Where to begin

https://github.com/filecoin-project/rust-fil-proofs/blob/307e7af34732eed5d3a6885e4da33489dbcb1ccf/storage-proofs-porep/src/stacked/vanilla/cores.rs contains the relevant code.

@benjaminh83

I have my EPYC MILAN 7413 dual CPU test rig running Multicore SDR with FIL_PROOFS_MULTICORE_SDR_PRODUCERS=2

Each CPU has 4 CCXs, giving a total of 8 CCXs across the whole platform. Logs are set to DEBUG.

I see correct thread allocation of new PC1 jobs, one per CCX, for up to 8 PC1 jobs. For PC1 jobs 1-8, the logs generally look like the examples below:

2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 2
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 2
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 1
2022-02-03T09:34:14.220 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 1
2022-02-03T09:34:14.220 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:34:14.220 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

and

2022-02-03T09:41:54.815 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 8
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 8
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:41:54.816 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 7
2022-02-03T09:41:54.816 DEBUG storage_proofs_porep::stacked::vanilla::cores > binding to 7
2022-02-03T09:41:54.816 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

But as mentioned in the linked issue, we get a problem/error when trying to allocate threads for PC1 job 9. This is what the logs show when that allocation is attempted and fails:

2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::proof > multi core replication
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > create labels
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 0 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 1 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 2 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 3 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 4 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 5 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 6 locked, could not checkout
2022-02-03T09:43:05.282 DEBUG storage_proofs_porep::stacked::vanilla::cores > core group 7 locked, could not checkout
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::memory_handling > initializing cache
2022-02-03T09:43:05.282 WARN storage_proofs_porep::stacked::vanilla::memory_handling > failed to lock map Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }, falling back
2022-02-03T09:43:05.282 WARN storage_proofs_porep::stacked::vanilla::memory_handling > failed to lock map Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }, falling back
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Layer 1
2022-02-03T09:43:05.282 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > Creating labels for layer 1
2022-02-03T09:43:05.283 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
2022-02-03T09:43:05.283 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner
2022-02-03T09:43:05.283 DEBUG storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
2022-02-03T09:43:05.283 INFO storage_proofs_porep::stacked::vanilla::create_label::multi > created label runner

I find the last couple of lines interesting. We see the messages about binding the core in producer threads 0/1, but no core is actually picked. In my specific CPU setup, I would have wanted it to assign the next free cores on CCX0 of CPU0 (it has 6 physical cores in each CCX, so it should be allowed to run 2 jobs with 2 producers each in each CCX). I would have expected it to run the master thread of the 9th PC1 job on core 3, since cores 0, 1 and 2 are running the 1st PC1 job, and then the logs should have stated:

storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 1
storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 4
storage_proofs_porep::stacked::vanilla::cores > binding to 4

storage_proofs_porep::stacked::vanilla::create_label::multi > binding core in producer thread 0
storage_proofs_porep::stacked::vanilla::cores > allowed cpuset: 5
storage_proofs_porep::stacked::vanilla::cores > binding to 5

But it didn't, and now the 9th PC1 job is just being balanced across free cores on both CPUs; sealing time goes from 3h to 9h for those jobs :(

@vmx
Contributor Author

vmx commented Feb 3, 2022

Yes, this is what I tried to describe in my bug report. Currently only one SDR run is assigned to each CCX; once every CCX has one SDR run assigned, random cores are used.

@benjaminh83

@vmx any idea if this will be addressed any time soon?
I think it makes really good sense to get it fixed, as it's more efficient to do PC1 on a dual-socket system, and right now we cannot really do that without going into virtualization and cheating lotus.
I can create a VM, bind it to only the second CPU, and tell the VM to present the OS with 15 CCXs of 2 cores each, and lotus happily does 15 PC1 multicore SDR jobs in parallel on this VM at normal speeds.
I just don't think it's right that we need to work like this to get lotus working because of a bug, right? And not being able to easily use dual-CPU servers is just a waste of power, which we are trying to prevent, right? :)

@vmx
Contributor Author

vmx commented Mar 17, 2022

@benjaminh83 there's no timeline for this yet, though it's something I'd surely like to look into.

vmx added a commit that referenced this issue Apr 22, 2022
For multicore SDR it is important that the producer and consumers share the
same (L3) cache. Hence we bind threads to specific cores. Prior to this
change there was one multicore SDR job per group, even if the group could
accommodate multiple such jobs. If more jobs were scheduled than groups
were available, those additional jobs wouldn't use specific cores, but
whatever the operating system decided.

With this change, additional jobs are now put into the groups in case there
is enough space to accommodate them.

Fixes #1556.
@f8-ptrk

f8-ptrk commented Apr 24, 2022

@vmx is there a way forward for this to get into a release soon?

just to give you an impression of the impact: when planning PC1 servers, this issue is the difference between a single server with two 32-core $3000 CPUs (if the scheduling works) and two single-CPU servers with a 64-core CPU at $7k each

@vmx
Contributor Author

vmx commented Apr 25, 2022

@f8-ptrk I don't know how quickly it will make it into a proper release. But you should be able to try it out today by building from source with the change from #1588 applied.

@f8-ptrk

f8-ptrk commented Apr 25, 2022

sweet! Quick question: if we have a 4-core CCX and run multicore SDR PC1s with 1 consumer + 1 producer each:

will it fit 2 PC1s (2x 1+1) in that group or 3 (3x 1+1)? One producer core could be shared between 3 PC1s (in theory).

@f8-ptrk

f8-ptrk commented Apr 25, 2022

@cryptowhizzard this is worth applying to the dual-CPU Milans you've got for PC1! It might solve the problems.

@vmx
Contributor Author

vmx commented Apr 25, 2022

It will fit 2 multicore SDRs into the group. The multicore SDRs run independently, they don't share producers.

@f8-ptrk

f8-ptrk commented Apr 25, 2022

It will fit 2 multicore SDRs into the group. The multicore SDRs run independently, they don't share producers.

they could share a core for their individual producers; that's what I meant (a producer hardly maxes out its core)

but acked: [core count per group] / (1 + producers) = max PC1s per group

@vmx
Contributor Author

vmx commented Apr 25, 2022

but acked: [core count per group] / (1 + producers) = max PC1s per group

Correct, each consumer/producer runs in its own thread, and each thread is bound to one specific core.
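Spelled out for the configurations mentioned in this thread (a small sketch; the helper name is made up, this is not the cores.rs API):

```rust
fn max_pc1_per_group(cores_per_group: usize, producers: usize) -> usize {
    cores_per_group / (producers + 1)
}

fn main() {
    assert_eq!(max_pc1_per_group(4, 1), 2); // 4-core CCX, 1 producer + 1 consumer
    assert_eq!(max_pc1_per_group(6, 2), 2); // 6-core CCX (e.g. EPYC 7413), 2 producers + 1 consumer
    assert_eq!(max_pc1_per_group(6, 1), 3); // 6-core CCX, 1 producer + 1 consumer
}
```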

@benjaminh83

@vmx Sounds like we have something for testing on ZEN3/MILAN. I would love to get those tests going on my dual 7413, where I have 6 cores in each CCX, and basically want to run 2 (or maybe even 3) jobs in each CCX. I hope this will work.

Any idea if this will work on a dual Intel Xeon 3rd gen? In theory, that will just look like two big CCXs. Would you expect this to work? I could test it on my dual 8380, but I wouldn't start the testing if you already know the fix does not cover this scenario.

@vmx
Contributor Author

vmx commented Apr 25, 2022

@benjaminh83 it would be great if someone would try it out!

Any idea if this will work on a dual Intel Xeon 3rd gen? In theory, that will just look like two big CCXs. Would you expect this to work?

Yes, that would also work. It would then split those two big CCXs into smaller units.

The test might make things clearer. I'd hope it can be understood even by people not fluent in Rust. It shows how cores are assigned: it returns a list of those "units", and the multicore SDR runs will use those units one after another.

Your case sounds similar to this test case, where there are 16 cores and two CCXs, with 2 producers (hence, due to the consumer, 3 threads running).

@vmx
Contributor Author

vmx commented Apr 25, 2022

I should also note for everyone trying this out: if you run it with RUST_LOG=debug, you should get a message like Core units: …, which contains a nested array showing the units it calculated. This way you can see whether it works as you'd expect.
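As an illustration only (the exact formatting of that log line may differ): for the 16-core, two-CCX case mentioned above with FIL_PROOFS_MULTICORE_SDR_PRODUCERS=2, i.e. 3 threads per run, one would expect two 3-core units per 8-core group, so a nested array along the lines of:

Core units: [[0, 1, 2], [3, 4, 5], [8, 9, 10], [11, 12, 13]]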

@cryptowhizzard

I would love to test this on the latest 5.18 Linux kernels, since they have better support for Milan/Zen CPUs and NUMA. Is it possible to make a special tag for this so we can compile it in?

@vmx
Contributor Author

vmx commented Apr 25, 2022

@cryptowhizzard what do you mean with "special tag"?

@cryptowhizzard

@vmx to have it in lotus, so we can compile lotus-worker with it.

@vmx
Contributor Author

vmx commented Apr 25, 2022

@cryptowhizzard oh, you mean having a released version of it?

@benjaminh83

@vmx @cryptowhizzard no problem! I will take it up in SPX with @jennijuju and see if we can get a "jenni-tagged" lotus with this package inside.

@hyunmoon

hyunmoon commented Jul 4, 2022

Is this currently stable enough to be included in the lotus release? @vmx

@hyunmoon

hyunmoon commented Jul 4, 2022

@benjaminh83 Could you request another jenni-tag since the code has been updated?

@vmx
Contributor Author

vmx commented Jul 4, 2022

Is this currently stable enough to be included in the lotus release? @vmx

Yes, the plan is to have it in the next Lotus release.

@hyunmoon

hyunmoon commented Jul 4, 2022

Great! I hope you meant v1.17, not v1.18.

@vmx
Contributor Author

vmx commented Jul 4, 2022

Great! I hope you meant v1.17, not v1.18.

v1.18.

@f8-ptrk

f8-ptrk commented Jul 4, 2022

that's most likely September :/

@donkabat

donkabat commented Jul 5, 2022

so... when do you schedule v1.18_rc1? :)

@OneilYang

OneilYang commented Jul 14, 2022

Does it need more settings for a dual-CPU 7302 setup?
CPU: 7302 * 2 (dual CPU), FIL_PROOFS_MULTICORE_SDR_PRODUCERS=1
I tested 2 PC1s on it and found this:
[screenshot]
It looks like something is not right for 3/5/7/9 but fine for 2/4/6/8; any suggestions?

vmx closed this as completed in #1588 on Aug 2, 2022
@OneilYang

Thanks @vmx, now it works very well on 7302*2, great work~~
