
feat: PoSt workers #7971

Merged
merged 51 commits into from
Mar 26, 2022

Conversation

magik6k
Contributor

@magik6k magik6k commented Jan 18, 2022

Related Issues

Based on #7600

Proposed Changes

PoSt workers:

  • Winning and Window PoSt
  • Multiple PoSt workers can be connected to a lotus-miner process
    • In multi-partition PoSt, proving should run on multiple workers in parallel, up to one worker per partition
  • PoSt workers can read challenges from remote storage
    • This means that neither lotus-miner nor PoSt workers need filesystem access to sector data
  • Handling of corrupted sectors should be much more robust
    • If we fail to read challenges for a sector, we'll skip it and retry, instead of trying to create a bad PoSt snark
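The skip-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `readChallenges`, `generateProofs`, and the types here are hypothetical stand-ins for the real challenge-reading and proving paths in `extern/sector-storage`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types for illustration only.
type SectorID int
type VanillaProof []byte

// readChallenges stands in for reading PoSt challenges from storage;
// it fails for sectors whose data is corrupted or unreachable.
func readChallenges(s SectorID, corrupted map[SectorID]bool) (VanillaProof, error) {
	if corrupted[s] {
		return nil, errors.New("failed to read challenges")
	}
	return VanillaProof(fmt.Sprintf("proof-%d", s)), nil
}

// generateProofs mirrors the described behaviour: sectors whose
// challenges cannot be read are skipped instead of being fed into
// snark generation, and returned so the caller can declare them faulty.
func generateProofs(sectors []SectorID, corrupted map[SectorID]bool) (proofs []VanillaProof, skipped []SectorID) {
	for _, s := range sectors {
		p, err := readChallenges(s, corrupted)
		if err != nil {
			skipped = append(skipped, s)
			continue
		}
		proofs = append(proofs, p)
	}
	return proofs, skipped
}

func main() {
	proofs, skipped := generateProofs([]SectorID{1, 2, 3}, map[SectorID]bool{2: true})
	fmt.Println(len(proofs), skipped)
}
```

The point is simply that a bad sector surfaces as a skipped entry rather than poisoning the snark computation.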

Additional Info

This PR introduces a number of new features:

  • Window and Winning PoSt Workers
    • PoSt workers are specialized instances of lotus-worker - one lotus-worker instance can only be one of:
      • WindowPoSt worker
      • WinningPoSt worker
      • Worker for other sealing tasks
    • When a [winning/window]PoSt worker connects to lotus-miner, the miner will delegate all [winning/window]PoSt tasks to those workers - no PoSt tasks will be executed locally on the miner instance
  • WindowPoSt partitions are computed in parallel
    • This should make it much easier to respond to WindowPoSt on time with multiple partitions per deadline
  • Responding to PoSt challenges from remote storage
    • PoSt workers don't need local access to sectors - a worker computing PoSt will ask other workers to read challenges from their storage when it doesn't have direct access to the sector data
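The parallel-partition idea above can be sketched as follows. This is illustrative only: `computePartition` and `computeWindowPoSt` are hypothetical stand-ins, and the real scheduling lives in the sector-storage manager.

```go
package main

import (
	"fmt"
	"sync"
)

// computePartition stands in for generating a single WindowPoSt
// partition proof (names are illustrative, not the lotus API).
func computePartition(idx int) string {
	return fmt.Sprintf("partition-proof-%d", idx)
}

// computeWindowPoSt fans partitions out to parallel goroutines, up to
// one per partition, then collects the results in partition order.
func computeWindowPoSt(partitions int) []string {
	out := make([]string, partitions)
	var wg sync.WaitGroup
	for i := 0; i < partitions; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = computePartition(i)
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(computeWindowPoSt(3))
}
```

With one worker per partition, the wall-clock time of a multi-partition deadline approaches that of a single partition, which is what makes it easier to respond on time.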

Usage

  • (Update lotus-miner and all workers - at the very least all workers with access to storage paths. PoSt workers can read challenges directly from other workers, and if outdated workers are asked to serve challenges, PoSt will fail.)
  • Have a lotus-miner running
  • Start lotus-worker run [--winningpost/--windowpost] (use the same environment variables as for any other worker)
  • See the worker registered in lotus-miner sealing workers
  • In miner info, see the PoSt worker, e.g. Workers: Seal(1) WdPoSt(0) WinPoSt(1)

Followups

  • Redundant PoSt - compute the same PoSt on multiple workers, submit first valid result
  • Ability to entirely disable PoSt on the main lotus-miner instance - right now this happens after PoSt workers get connected

Checklist

Before you mark the PR ready for review, please make sure that:

  • All commits have a clear commit message.
  • The PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, misc, perf, refactor, revert, style, test
    • area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet
  • This PR has tests for new functionality or change in behaviour
  • If new user-facing features are introduced, clear usage guidelines and / or documentation updates should be included in https://lotus.filecoin.io or Discussion Tutorials.
  • CI is green

@magik6k magik6k marked this pull request as ready for review January 20, 2022 14:03
@magik6k magik6k requested a review from a team as a code owner January 20, 2022 14:03
@codecov

codecov bot commented Jan 20, 2022

Codecov Report

Merging #7971 (7a009ab) into master (d437f19) will increase coverage by 0.31%.
The diff coverage is 70.53%.


@@            Coverage Diff             @@
##           master    #7971      +/-   ##
==========================================
+ Coverage   40.23%   40.55%   +0.31%     
==========================================
  Files         684      687       +3     
  Lines       74655    75230     +575     
==========================================
+ Hits        30040    30509     +469     
- Misses      39372    39420      +48     
- Partials     5243     5301      +58     
Impacted Files Coverage Δ
api/api_storage.go 0.00% <ø> (ø)
cmd/lotus-miner/init.go 0.00% <0.00%> (ø)
cmd/lotus-miner/proving.go 27.89% <0.00%> (+0.20%) ⬆️
cmd/lotus-worker/cli.go 0.00% <ø> (ø)
cmd/lotus-worker/info.go 12.69% <ø> (ø)
cmd/lotus-worker/main.go 0.00% <0.00%> (ø)
cmd/lotus-worker/resources.go 0.00% <ø> (ø)
cmd/lotus-worker/storage.go 0.00% <0.00%> (ø)
cmd/lotus-worker/tasks.go 27.27% <ø> (ø)
extern/sector-storage/selector_task.go 33.33% <ø> (ø)
... and 66 more

Contributor

@arajasek arajasek left a comment


Some notes

Review threads (resolved):
  • extern/sector-storage/manager_post.go (3)
  • extern/sector-storage/sched_post.go (4)
  • extern/sector-storage/worker_local.go (1)
@magik6k
Contributor Author

magik6k commented Mar 21, 2022

Just tested in lotus-pond, and it works really well:

  • Both PoSt types work
  • Disputer doesn't dispute my PoSts
  • After restarting lotus-miner, the worker auto-reconnects
  • When the worker is stopped, lotus-miner switches back to local PoSt automatically

Review threads (resolved):
  • cmd/lotus-miner/info.go (1)
  • cmd/lotus-seal-worker/main.go (2)
  • extern/sector-storage/faults.go (1)
@@ -161,25 +99,4 @@ func (m *Manager) CheckProvable(ctx context.Context, pp abi.RegisteredPoStProof,
return bad, nil
}

func addCachePathsForSectorSize(chk map[string]int64, cacheDir string, ssize abi.SectorSize) {
switch ssize {
Contributor


I don't see these stat checks anywhere anymore and no behavior change within GenerateSingleVanillaProof. Did they all become redundant with the check in GenerateSingleVanillaProof at some point?

Contributor Author


Yeah, if we can read challenges, even random ones, that probably means the sector is fine (when challenges are read they are also validated, at least that's what I've been told)

Contributor


Will it significantly increase the time spent on this check?

},
sealtasks.TTGenerateWindowPoSt: {
abi.RegisteredSealProof_StackedDrg64GiBV1: Resources{
MaxMemory: 120 << 30, // TODO: Confirm
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need confirmation here?

Contributor Author


Yep

Review threads (resolved):
  • extern/sector-storage/manager_post.go (2)
// list contains substitutes for skipped sectors - but we don't care about
// those for the purpose of the proof, so for things to work, we need to
// dedupe here.
sectorInfo = dedupeSectorInfo(sectorInfo)
Contributor


I don't think anything like this currently exists. Is this new logic fixing a bug / bugs in current lotus?
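For illustration, a helper like the quoted `dedupeSectorInfo` could plausibly work along these lines. This is a sketch with a stand-in `SectorInfo` type, keyed on sector number and preserving order; the actual implementation in the PR may differ.

```go
package main

import "fmt"

// SectorInfo is a stand-in for the proofs SectorInfo type.
type SectorInfo struct {
	SectorNumber uint64
}

// dedupeSectorInfo keeps the first occurrence of each sector number,
// preserving input order. Duplicates arise because skipped sectors
// are substituted with copies of still-good sectors.
func dedupeSectorInfo(in []SectorInfo) []SectorInfo {
	seen := make(map[uint64]bool, len(in))
	out := make([]SectorInfo, 0, len(in))
	for _, si := range in {
		if seen[si.SectorNumber] {
			continue
		}
		seen[si.SectorNumber] = true
		out = append(out, si)
	}
	return out
}

func main() {
	in := []SectorInfo{{1}, {2}, {2}, {3}, {1}}
	fmt.Println(len(dedupeSectorInfo(in)))
}
```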

sectorInfo = dedupeSectorInfo(sectorInfo)

// The partitions number of this batch
// ceil(sectorInfos / maxPartitionSize)
Contributor


This is surprising to me. If we had two partitions of size maxPartitionSize and half of each was skipped this code is telling me that we only need to take one partition proof rather than two partition proofs. There's no problem shifting sectors across partitions like this?
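The quoted comment, `ceil(sectorInfos / maxPartitionSize)`, is ordinary integer ceiling division. As a sketch (`partitionCount` is a hypothetical name, not the PR's function):

```go
package main

import "fmt"

// partitionCount computes ceil(numSectors / maxPartitionSize) using
// integer arithmetic only, matching the quoted comment.
func partitionCount(numSectors, maxPartitionSize int) int {
	return (numSectors + maxPartitionSize - 1) / maxPartitionSize
}

func main() {
	// e.g. 5 sectors with at most 2 per partition need 3 partitions
	fmt.Println(partitionCount(5, 2), partitionCount(4, 2))
}
```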

Review threads (resolved):
  • extern/sector-storage/manager_post.go (1)
  • extern/sector-storage/worker_local.go (2)
@ZenGround0
Contributor

@magik6k since we're using a different pathway through ffi to do PoSt when a worker is hooked up, it seems like lotus-bench should adapt to be able to exercise the new path. I'm not sure whether the typical lotus-bench use case is 1) exercising the operator's specific setup or 2) exercising general-purpose proving pathways on the miner.

If it's 1), we could hook up lotus-bench to the manager so it can talk to workers. If it's 2), I think we should add a lotus-bench check that does the challenge -> vanilla proof -> snark ( -> merge multi partition?)

This may be overkill because ffi is doing similar things inside for each. What do you think?

@@ -31,15 +31,15 @@ import (

const metaFile = "sectorstore.json"

Contributor


nit: add test of new method GenerateSingleVanillaProof

Review threads (resolved):
  • extern/sector-storage/stores/remote.go (1)
  • extern/sector-storage/worker_local.go (1)
Contributor

@ZenGround0 ZenGround0 left a comment


The question about lotus-bench remains, and I didn't review the new worker itest code too carefully, but I'm happy with this going in as is.

if storiface.WorkerID(curSes) != wid {
if curSes != ClosedWorkerID {
// worker restarted
log.Warnw("worker session changed (worker restarted?)", "initial", wid, "current", curSes)
Contributor


disable here? or is the delete on return enough?

Contributor Author


Yes, that delete is enough

@magik6k
Contributor Author

magik6k commented Mar 26, 2022

On lotus-bench + setting good resource numbers - I think this can be done after this lands in master; I'd love to stop making this PR bigger, especially given that it seems to work pretty well already

(opened #8373 with some in-progress work on top of this PR)

Contributor

@arajasek arajasek left a comment


👍 on the last five commits

@magik6k magik6k merged commit 7401fa2 into master Mar 26, 2022
@magik6k magik6k deleted the feat/post-worker branch March 26, 2022 00:31
@magik6k magik6k mentioned this pull request Mar 26, 2022
@lbj2004032

What would happen when the miner process exits?
I thought miners prefer a framework where WindowPoSt runs as a standalone process;
does WindowPoSt/WinningPoSt run separately from the miner?

@BBQFIL

BBQFIL commented Sep 25, 2022

@magik6k
Hello, I want to ask how the miner determines that a PoSt worker has been closed.
I ran into a problem today: when one of my PoSt workers lost its connection to the miner due to a network card failure, the miner still handed the partition to the disconnected PoSt worker. This is not what I expected; in my opinion, when a PoSt worker fails to handle work, the miner should assign the task to another healthy PoSt worker.

