Changing the default post-parallel-reads value #9074

Closed
5 of 15 tasks
Tracked by #10338
rjan90 opened this issue Jul 22, 2022 · 6 comments · Fixed by #10365
Labels
area/proving (Area: Proving), kind/enhancement (Kind: Enhancement)

Comments

@rjan90
Contributor

rjan90 commented Jul 22, 2022

Checklist

  • This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
  • I have a specific, actionable, and well motivated improvement to propose.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Improvement Suggestion

Background
The windowPoSt worker has a post-parallel-reads option that lets a storage provider set an upper bound on how many challenges are read from storage simultaneously during windowPoSt. That value currently defaults to 128.
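For context on what the knob bounds: post-parallel-reads caps how many challenge reads are in flight at once. A minimal Go sketch of that pattern (hypothetical names, not the actual Lotus code), using a buffered channel as a semaphore:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// readChallenges is a stand-in for the worker's challenge-reading loop:
// a buffered channel caps how many reads run concurrently.
func readChallenges(sectors []int, limit int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, s := range sectors {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` reads are in flight
		go func(sector int) {
			defer wg.Done()
			defer func() { <-sem }()
			time.Sleep(10 * time.Millisecond) // simulate a storage read
			fmt.Println("read challenge for sector", sector)
		}(s)
	}
	wg.Wait()
}

func main() {
	sectors := make([]int, 256)
	for i := range sectors {
		sectors[i] = i
	}
	readChallenges(sectors, 32) // e.g. the more conservative value discussed below
}
```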

Issue
We have been getting reports that, when you have a full partition, this setting can cause very short network timeouts, which in turn cause sectors to be skipped (marked as bad) because the challenges for those sectors can't be read before the timeout hits. See #8957, especially this comment, for the full context of this issue.

The issue with skipped sectors was mitigated either by tuning down the post-parallel-reads value or by disabling the windowPoSt PreChecks, which are mostly redundant on windowPoSt workers (see the sketch below).
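For reference, the two mitigations look roughly as follows. This is a sketch from memory of the Lotus docs: the flag name comes from this issue's title, and the config key is assumed from the [Proving] section, so verify both against your Lotus version.

```
# Cap parallel challenge reads on the windowPoSt worker:
lotus-worker run --windowpost --post-parallel-reads=32

# Or, in the miner's config.toml, skip the mostly redundant pre-checks
# (key name assumed; check your version's config docs):
[Proving]
  DisableWDPoStPreChecks = true
```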

Improvement suggestion
I suggest that we reduce the default post-parallel-reads value to something more conservative to mitigate these issues, and instead let SPs fine-tune the value upward if their architecture can handle it.

A big thanks to @benjaminh83 for the very detailed write-up of the issues and investigation of the fixes.

@rjan90 added the need/triage, kind/enhancement (Kind: Enhancement), and area/proving (Area: Proving) labels, and removed need/triage, on Jul 22, 2022
@donkabat

I checked a few values:
128 (default): 10% bad
50: 2% bad
30: 0 bad

@rjan90
Contributor Author

rjan90 commented Sep 14, 2022

Thanks for adding your input here @donkabat!

@benjaminh83 which value were you using for post-parallel-reads again?

@magik6k Not sure if we should wait for more datapoints around this, or make some changes based on the values/tests we already have?

@benjaminh83

benjaminh83 commented Sep 14, 2022

@rjan90 I was running it at 32: 0 bad.
This matches what @donkabat was experiencing. I cannot reproduce it now, since sector reads only failed while I had a degraded cluster storage attached that performed very poorly under stress; it was not reproducible when using storage workers.
Lastly, reducing the parallel reads to 32 did not have much impact on overall wdPoSt times, so it may simply be a safer setting for reducing the risk of saturating the network/storage and hitting the timeout.

@donkabat

About my case:
I moved from NFS to storage-only workers. wdPoSt times dropped from ~15-20 min to 6-10 min.
After setting post-parallel-reads to 30, wdPoSt takes ~10 min.

@donkabat

After yesterday's wdPoSt, sectors:
total: 25,755
faults: 12
post-parallel-reads = 30

example of error:

```
CheckProvable Sector FAULT: generating vanilla proof
{"sector": {"ID":{"Miner":1127678,"Number":28953},"ProofType":8},
 "err": "do request: Post \"http://192.168.88.16:1168/remote/vanilla/single\": context deadline exceeded",
 "errVerbose": "do request:
    github.com/filecoin-project/lotus/storage/paths.(*Remote).GenerateSingleVanillaProof
        /home/filecoin/networks/mainnet/build/lotus/storage/paths/remote.go:819
  - Post \"http://192.168.88.16:1168/remote/vanilla/single\": context deadline exceeded"}
```
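The "context deadline exceeded" part is the request-scoped timeout firing on the HTTP call that fetches the vanilla proof from the remote store. A minimal Go sketch of that failure mode (the 5-second deadline is illustrative only, not the actual Lotus value):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetchVanillaProof shows how a slow remote read surfaces as
// "context deadline exceeded": the POST is cancelled once its
// context deadline passes, and the sector is then counted as a fault.
func fetchVanillaProof(url string) error {
	// Illustrative deadline; the real value in Lotus may differ.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req) // fails if storage responds too slowly
	if err != nil {
		return fmt.Errorf("do request: %w", err) // e.g. context deadline exceeded
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := fetchVanillaProof("http://192.168.88.16:1168/remote/vanilla/single"); err != nil {
		fmt.Println("CheckProvable FAULT:", err)
	}
}
```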

@donkabat

post-parallel-reads = 25

total: 25,755
faults: 0
