Markets Instance Appears to Prevent sending of DeclareFaultsRecovered Message #6915

Closed
4 tasks done
TippyFlitsUK opened this issue Jul 28, 2021 · 21 comments
Labels
area/api area/messages area/proving Area: Proving kind/bug Kind: Bug P1 P1: Must be resolved
Milestone

Comments

@TippyFlitsUK
Contributor

TippyFlitsUK commented Jul 28, 2021

Checklist

  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • I am reporting a bug w.r.t one of the M1 tags. If not, choose another issue option here.
  • I am not reporting a bug around deal making. If yes, create a M1 Bug Report For Deal Making.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.

Lotus component

lotus miner - proving(WindowPoSt)

Lotus Tag and Version

Daemon:  1.11.1-dev+mainnet+git.1757ffcb2+api1.2.0
Local: lotus-miner version 1.11.1-dev+mainnet+git.1757ffcb2

Describe the Bug

My miner failed window post yesterday whilst testing the M1.1 release, due to previously identified excess server load which has since been resolved.

On the same submission window today (testing M1.2 release) the window post processed without issue but the miner did not send a DeclareFaultsRecovered message prior to the window post message.

On stopping the markets instance the miner immediately (within milliseconds) sent the missing DeclareFaultsRecovered message.

It appears that the running markets instance prevented the DeclareFaultsRecovered message being sent.

Logging Information

https://gist.github.com/TippyFlitsUK/e542b23aefab4930229ad99425b52392

Repo Steps

  1. Attempt to submit a DeclareFaultsRecovered message with the markets instance running.
@jennijuju jennijuju added this to the 1.11.1 milestone Jul 28, 2021
@jennijuju jennijuju added the P1 P1: Must be resolved label Jul 28, 2021
@jennijuju
Member

@TippyFlitsUK can you get us lotus-miner pprof goroutines > goroutines on both the miner and market processes?

@TippyFlitsUK
Contributor Author

@jennijuju
Copy link
Member

@TippyFlitsUK Can we get some miner journals, too?

@raulk
Member

raulk commented Jul 28, 2021

@TippyFlitsUK Thanks for the report! The markets process does not interfere at all with the miner process, that's the whole goal of the MRA separation.

What could be happening is that the increased IO pressure you saw in m1.2 had side effects on your miner (especially if you're running both on the same machine). This IO pressure is gone in m1.3.

I've extracted the logs from the WindowPoSt scheduler, which is in charge of doing the following:

  1. declaring recoveries for sectors being proved in the next deadline, that were in a faulted state but are now healthy.
  2. declaring faults for sectors being proved in the next deadline, that were healthy and are now in a faulted state.
  3. submitting WindowPoSt for this deadline.

1 & 2 run in that order, but concurrently with 3.

I'm analyzing these logs to see if I find something fishy.

⟩ grep storageminer log.txt
2021-07-28T12:33:38.875+0100	INFO	storageminer	storage/miner.go:243	starting up miner f08403, worker addr f3vnsby4atsqchtb4unpc5g4cbqj27saae3vncjyjfaspshoqz3s66huseoi27t2vihxfl37iuevdjbdxgcvqq
2021-07-28T12:33:38.884+0100	INFO	storageminer	storage/miner.go:292	Computing WinningPoSt ;[{SealProof:3 SectorNumber:75 SealedCID:bagboea4b5abca5kegtb3z2grfpsta2nrebefn7mrbngdy3qmyib7fwd7q5g7b2qw}]; [150 243 203 139 182 228 219 69 17 231 177 191 240 167 166 0 190 48 123 186 222 134 75 135 236 75 28 204 253 88 1 3]
2021-07-28T12:33:42.016+0100	INFO	storageminer	storage/miner.go:299	GenerateWinningPoSt took 3.132021645s
2021-07-28T14:40:24.301+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 2, "good": 2}
2021-07-28T14:40:24.303+0100	INFO	storageminer	storage/wdpost_run.go:626	running window post	{"chain-random": "gx7VwkbXfTuXYBImIYwml1W4BKLslUZNDpTCrySV4oU=", "deadline": {"CurrentEpoch":972440,"PeriodStart":972450,"Index":0,"Open":972450,"Close":972510,"Challenge":972430,"FaultCutoff":972380,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972440", "skipped": 0}
2021-07-28T14:49:50.883+0100	INFO	storageminer	storage/wdpost_run.go:642	computing window post	{"batch": 0, "elapsed": 566.580420489}
2021-07-28T14:50:06.399+0100	INFO	storageminer	storage/wdpost_run.go:836	Submitted window post: bafy2bzacedan5cvexkqbap6bdugdzn3oh4jh2hpq6hgtvxghehiw26x7kjb7s
2021-07-28T15:10:00.467+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 0, "good": 0}
2021-07-28T15:20:15.315+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 1293, "good": 1293}
2021-07-28T15:20:16.005+0100	WARN	storageminer	storage/wdpost_run.go:332	declare faults recovered Message CID	{"cid": "bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}
2021-07-28T15:40:11.567+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 56, "good": 56}
2021-07-28T15:40:11.576+0100	INFO	storageminer	storage/wdpost_run.go:626	running window post	{"chain-random": "/YycN2M1NLqSfHWQAw0Htr88veitbbnXkxR3DMpXHIk=", "deadline": {"CurrentEpoch":972560,"PeriodStart":972450,"Index":2,"Open":972570,"Close":972630,"Challenge":972550,"FaultCutoff":972500,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972560", "skipped": 0}
2021-07-28T15:45:54.299+0100	INFO	storageminer	storage/wdpost_run.go:642	computing window post	{"batch": 0, "elapsed": 342.72311446}
2021-07-28T15:47:00.635+0100	INFO	storageminer	storage/wdpost_run.go:836	Submitted window post: bafy2bzacebwl7nzif27swkfjyyfk2lyf3zgcuqmvd4erbnvhjpm2ikdzck5ri

@TippyFlitsUK
Contributor Author

Sure @jennijuju.
miner-journal.txt

@TippyFlitsUK
Contributor Author

Many thanks for the update @raulk.

It was strange to me that the window post completed without any problem while DeclareFaultsRecovered did seem to be affected.

Could IO pressure have affected one but not the other? I am not too familiar with the DeclareFaultsRecovered mechanism.

@raulk
Member

raulk commented Jul 28, 2021

These are my discoveries.

  • Miner was started at 2021-07-28T12:33:38.875+0100, at which point we warmed up the WindowPoSt worker but didn't actually generate any proofs.

  • We started processing the first deadline at 2021-07-28T14:40:24.301+0100.

    • No previous faults (therefore no attempt to recover).

    • Only 2 sectors to prove.

    • Generating WindowPoSt took 566 seconds! (is this latency normal in your environments? if not, this is a direct effect of the IO pressure)

        2021-07-28T14:40:24.303+0100    INFO    storageminer    storage/wdpost_run.go:626   running window post {"chain-random": "gx7VwkbXfTuXYBImIYwml1W4BKLslUZNDpTCrySV4oU=", "deadline": {"CurrentEpoch":972440,"PeriodStart":972450,"Index":0,"Open":972450,"Close":972510,"Challenge":972430,"FaultCutoff":972380,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972440", "skipped": 0}
        2021-07-28T14:49:50.883+0100    INFO    storageminer    storage/wdpost_run.go:642   computing window post   {"batch": 0, "elapsed": 566.580420489}
      
    • Submitted WindowPoSt at 2021-07-28T14:50:06.399+0100.

  • Next deadline was up at 2021-07-28T15:10:00.467+0100.

    • There were sectors to recover (for the next deadline), and it took 10 minutes to check them (is this latency normal in your environments? if not, this is a direct effect of the IO pressure):

        2021-07-28T15:10:00.467+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 0, "good": 0}
        2021-07-28T15:20:15.315+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 1293, "good": 1293}
      
    • Once we checked them, we submitted the DeclareFaultsRecovered message on chain promptly:

    • 2021-07-28T15:20:16.005+0100 WARN storageminer storage/wdpost_run.go:332 declare faults recovered Message CID {"cid": "bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}

    • We then moved on to submitting WindowPoSt for 56 sectors, took 342 seconds this time:

        2021-07-28T15:40:11.567+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 56, "good": 56}
        2021-07-28T15:40:11.576+0100    INFO    storageminer    storage/wdpost_run.go:626   running window post {"chain-random": "/YycN2M1NLqSfHWQAw0Htr88veitbbnXkxR3DMpXHIk=", "deadline": {"CurrentEpoch":972560,"PeriodStart":972450,"Index":2,"Open":972570,"Close":972630,"Challenge":972550,"FaultCutoff":972500,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972560", "skipped": 0}
        2021-07-28T15:45:54.299+0100    INFO    storageminer    storage/wdpost_run.go:642   computing window post   {"batch": 0, "elapsed": 342.72311446}
        2021-07-28T15:47:00.635+0100    INFO    storageminer    storage/wdpost_run.go:836   Submitted window post: bafy2bzacebwl7nzif27swkfjyyfk2lyf3zgcuqmvd4erbnvhjpm2ikdzck5ri
      

@raulk
Member

raulk commented Jul 28, 2021

I'm now looking at the journal.

@raulk
Member

raulk commented Jul 28, 2021

A few questions for you @TippyFlitsUK:

  1. Are you running the markets node on a separate machine from your main miner? (If they are running on the same machine, the IO volume in m1.2 explains this problem).
  2. At what time exactly did you stop the markets node?

@TippyFlitsUK
Contributor Author

Many thanks @raulk that is some great info!!

The first deadline (or deadline 0) was actually the faulted one and the DeclareFaultsRecovered was generated too late.

I believe the WindowPoSt that processed for 2 sectors is actually referring to 2 new sectors sealed since last submission on that same faulted deadline.

So the order is more like:

  1. 2021-07-28T14:40:24.301+0100 - Begin processing deadline 0 - 1293 faults - 2 newly sealed sectors.
  2. 2021-07-28T15:10:00.467+0100 - Deadline 1 is empty
  3. 2021-07-28T15:20:15.315+0100 - Submit late DeclareFaultsRecovered for 1293 faults on deadline 0.
  4. 2021-07-28T15:40:11.576+0100 - Begin processing on deadline 2 - 56 sectors

I hope that makes sense buddy!! That latency is normal for WindowPoSts and varies with the number of sectors submitted for each deadline.

@TippyFlitsUK
Contributor Author

TippyFlitsUK commented Jul 28, 2021

@raulk

  1. The miner and markets instances are indeed running on the same machine.
  2. I stopped my markets instance at about 15:15 BST.

It sounds like the market instance may have been actively processing the migration during the DeclareFaultsRecovered deadline but had finished the migration when it came to processing the WindowPoSt itself. Would this make sense in terms of IO pressure?

@TippyFlitsUK
Contributor Author

Actually, with regard to the time I stopped the markets instance... The DeclareFaultsRecovered message was processed the second I stopped the markets instance, so from the logs above it would have been at 15:20.

@raulk
Member

raulk commented Jul 28, 2021

Thanks for clearing up the timeline. It's clear you're more familiar with your deadlines than I am ;-) Unfortunately it's hard to get the full picture from the logs because they're missing info, but the journal is more useful.

In the journal, I see that at 2021-07-28T15:24:00.373614481+01:00, the recoveries for deadline index 1 were marked as finished processing (which means submitted to the chain AND with 5 confirmations):

{"System":"wdpost","Event":"recoveries_processed","Timestamp":"2021-07-28T15:24:00.373614481+01:00","Data":{"Deadline":{"CurrentEpoch":972527,"PeriodStart":972450,"Index":1,"Open":972510,"Close":972570,"Challenge":972490,"FaultCutoff":972440,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70},"Height":972527,"TipSet":[{"/":"bafy2bzacebpyghdyokcq2clgchh3ewzbknynaok4wksbkt4mgvuuitrtsgpbo"},{"/":"bafy2bzacebot7zwf3v7pjlrg2lminpfdbl2ler3aiwq4vub3iejg2f52lduok"},{"/":"bafy2bzaceb6zxcoyglipiurpztcllrw6kdkdunqloiaec3hfxiva3sjhztil2"},{"/":"bafy2bzacedjsgqx2iaumc4nukdhwjglfcwgwyzij4wnkoqjfmjvbbgefpe5dm"},{"/":"bafy2bzaceaszowtiumgafc2ac7stl3timoffsvq5jlblzmp7amfrv3bfwhkva"}],"Declarations":[{"Deadline":0,"Partition":0,"Sectors":[53,1,24,1,2,1,7,1,15,1,2,2,2,2,2,1,1,2,2,6,3,4,1,2,2,5,1,3,2,5,2,1,2,1,3,1,6,1,6,1,2,2,7,1,6,3,1,2,2,3,4,1,1,1,5,1,1,3,2,1,2,1,1,1,2,6,3,1,6,1,3,2,13,3,2,2,1,2,2,1,2,1,2,2,19,2,6,1,7,1,13,1,3,1,2,1,1,1,1,3,4,1,1,1,1,1,2,1,4,1,1,2,3,1,4,2,1,1,5,1,3,1,5,1,1,2,2,1,3,2,3,1,2,1,2,1,1,1,5,1,2,1,4,2,2,1,2,1,2,1,14,18,10,2,1,1,3,28,5,1,3,2,1,1,5,1,2,3,1,1,1,1,4,1,5,1,3,1,1,2,9,1,2,1,6,1,1,1,1,3,8,1,1,1,8,1,1,3,5,1,1,2,4,1,1,3,5,2,3,1,7,2,2,1,3,2,1,1,6,1,4,1,1,1,1,1,7,1,3,5,2,2,4,1,3,5,9,2,1,2,12,7,1,11,1,5,1,2,1,5,1,6,1,2,1,2,1,9,1,19,1,1,2,9,1,12,4,1,1,1,3,23,1,24,1,21,1,3,1,8,1,6,1,20,1,1,1,10,1,1,1,8,1,1,30,1,4,1,1,2,7,16,4,1,1,2,1,1,12,1,1,1,1,1,2,2,1,1,2,1,21,1,4,1,5,1,4,1,1,3,2,1,2,1,1,1,7,1,1,1,1,1,10,1,1,2,2,1,3,1,3,1,2,1,4,2,10,1,12,1,3,2,9,1,19,1,2,1,4,9,7,1,7,1,5,1,2,3,8,8,1,5,3,37,1,22,8,2,10,22,4,3,3,11,1,7,1,4,7,26,1,1,1,18,1,35,4,14,1,19,1,1,2,4,2,1,5,1,6,1,1,1,7,1,12,1,13,1,5,1,3,15,2,1,15,1,1,1,5,1,3,2,7,1,1,1,6,1,2,1,2,1,1,1,1,1,5,1,5,1,1,2,2,1,3,2,2,1,1,8,1,20,2,16,2,1,5,2,1,3,1,2,1,80,1,28,1,26,1,20,1,28,1,11,1,2,1,25,1,7,1,126,2,4,1,2,6,5,1,9,1,17,1,16,2,15,1,5]}],"MessageCID":{"/":"bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}}}

Was this too late for the deadline?
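As an aside for anyone reading the journal entry above: the `Sectors` array looks like a run-length encoding, alternating runs of absent and present sector numbers, which is how Filecoin bitfields are typically marshalled to JSON — treat that interpretation as an assumption here. Under that assumption, counting the declared sectors is just summing every second run:

```go
package main

import "fmt"

// countSetRuns interprets runs as an alternating run-length encoding,
// starting with a run of absent sectors: runs[0] absent, runs[1] present,
// runs[2] absent, and so on. This matches the usual JSON marshalling of
// Filecoin bitfields, but is an assumption, not a guarantee.
func countSetRuns(runs []uint64) uint64 {
	var total uint64
	for i := 1; i < len(runs); i += 2 {
		total += runs[i]
	}
	return total
}

func main() {
	// Hypothetical small example: 53 absent, 1 present, 24 absent, 2 present.
	fmt.Println(countSetRuns([]uint64{53, 1, 24, 2})) // prints 3
}
```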

@TippyFlitsUK
Contributor Author

That looks right @raulk.

The WindowPoSt for the faulted deadline is this line from the logs:

2021-07-28T14:50:06.399+0100 INFO storageminer storage/wdpost_run.go:836 Submitted window post: bafy2bzacedan5cvexkqbap6bdugdzn3oh4jh2hpq6hgtvxghehiw26x7kjb7s

I waited until the close of the deadline window and then closed my markets instance which is when the DeclareFaultsRecovered message was sent.

@raulk
Member

raulk commented Jul 28, 2021

Do you run some system/infrastructure monitoring tool like netdata that would give us historical info on IO utilisation around that time?

@TippyFlitsUK
Contributor Author

I'm afraid not. I've been looking at Netdata for a while... time to get it installed I think ;-)

@raulk
Member

raulk commented Jul 28, 2021

I can vouch for it! It has helped tremendously in the past to debug all kinds of situations ;-)

@TippyFlitsUK
Contributor Author

I really appreciate all your time and help buddy!! I'll take the necessary steps to make sure that the WindowPoSt processes correctly tomorrow.

@TippyFlitsUK
Contributor Author

Caused by IO pressure sustained during the migration process of the M1.2 release. The issue will be addressed and resolved in the upcoming M1.3.

@raulk
Member

raulk commented Jul 28, 2021

Just some extra thoughts on what happened here, providing further nuance to this discussion and diagnosis.

The storage subsystem uses the concept of locks to coordinate access to sectors. Due to the lack of throttling in the DAG store, we ended up acquiring the unsealed copy lock for all sectors.

Furthermore, the FaultTracker#CheckProvable() method used by the window PoSt fault recovery logic also ends up acquiring locks. In this case, it wanted to lock the 1000+ faulted sectors, though only their sealed or cached files. In theory the two sets of read locks shouldn't interact, but perhaps they do.

Finally, FaultTracker#CheckProvable() ensures that sectors are not only accessible but actually provable by running a "vanilla proof", which I suspect places further IO demand on top of the DAG store's pre-existing IO pressure.

I'm confident that m1.3 will solve this problem, but I'd like to continue working together with you @TippyFlitsUK to verify that things are nominal, ideally after you've passed your critical proving windows.

@TippyFlitsUK
Contributor Author

Many thanks @raulk

Excellent information. Definitely sounds like the locks you describe could have had a part to play.

I'm happy to say that the faulted sector has just passed WindowPoSt so panic over.

Sincerest thanks again for all your help!!

Development

No branches or pull requests

3 participants