Markets Instance Appears to Prevent sending of DeclareFaultsRecovered Message #6915

Closed
4 tasks done
TippyFlitsUK opened this issue Jul 28, 2021 · 21 comments
Labels
area/api area/messages area/proving Area: Proving kind/bug Kind: Bug P1 P1: Must be resolved
Milestone

Comments

@TippyFlitsUK
Contributor

TippyFlitsUK commented Jul 28, 2021

Checklist

  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • I am reporting a bug w.r.t one of the M1 tags. If not, choose another issue option here.
  • I am not reporting a bug around deal making. If yes, create a M1 Bug Report For Deal Making.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.

Lotus component

lotus miner - proving(WindowPoSt)

Lotus Tag and Version

Daemon:  1.11.1-dev+mainnet+git.1757ffcb2+api1.2.0
Local: lotus-miner version 1.11.1-dev+mainnet+git.1757ffcb2

Describe the Bug

My miner failed window post yesterday whilst testing the M1.1 release, due to previously identified excess server load which has since been resolved.

On the same submission window today (testing M1.2 release) the window post processed without issue but the miner did not send a DeclareFaultsRecovered message prior to the window post message.

On stopping the markets instance the miner immediately (within milliseconds) sent the missing DeclareFaultsRecovered message.

It appears that the running markets instance prevented the DeclareFaultsRecovered message being sent.

Logging Information

https://gist.github.com/TippyFlitsUK/e542b23aefab4930229ad99425b52392

Repo Steps

  1. Attempt to submit a DeclareFaultsRecovered message with the markets instance running.
@jennijuju jennijuju added this to the 1.11.1 milestone Jul 28, 2021
@jennijuju jennijuju added the P1 P1: Must be resolved label Jul 28, 2021
@jennijuju
Member

@TippyFlitsUK can you get us lotus-miner pprof goroutines > goroutines on both the miner and market processes?

@TippyFlitsUK
Contributor Author

@jennijuju
Copy link
Member

@TippyFlitsUK Can we get some miner journals, too?

@raulk
Member

raulk commented Jul 28, 2021

@TippyFlitsUK Thanks for the report! The markets process does not interfere at all with the miner process, that's the whole goal of the MRA separation.

What could be happening is that the increased IO pressure you saw in m1.2 had side effects on your miner (especially if you're running both on the same machine). This IO pressure is gone in m1.3.

I've extracted the logs from the WindowPoSt scheduler, which is in charge of doing the following:

  1. declaring recoveries for sectors being proved in the next deadline, that were in a faulted state but are now healthy.
  2. declaring faults for sectors being proved in the next deadline, that were healthy and are now in a faulted state.
  3. submitting WindowPoSt for this deadline.

1 & 2 run in that order, but concurrently with 3.

I'm analyzing these logs to see if I find something fishy.

⟩ grep storageminer log.txt
2021-07-28T12:33:38.875+0100	INFO	storageminer	storage/miner.go:243	starting up miner f08403, worker addr f3vnsby4atsqchtb4unpc5g4cbqj27saae3vncjyjfaspshoqz3s66huseoi27t2vihxfl37iuevdjbdxgcvqq
2021-07-28T12:33:38.884+0100	INFO	storageminer	storage/miner.go:292	Computing WinningPoSt ;[{SealProof:3 SectorNumber:75 SealedCID:bagboea4b5abca5kegtb3z2grfpsta2nrebefn7mrbngdy3qmyib7fwd7q5g7b2qw}]; [150 243 203 139 182 228 219 69 17 231 177 191 240 167 166 0 190 48 123 186 222 134 75 135 236 75 28 204 253 88 1 3]
2021-07-28T12:33:42.016+0100	INFO	storageminer	storage/miner.go:299	GenerateWinningPoSt took 3.132021645s
2021-07-28T14:40:24.301+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 2, "good": 2}
2021-07-28T14:40:24.303+0100	INFO	storageminer	storage/wdpost_run.go:626	running window post	{"chain-random": "gx7VwkbXfTuXYBImIYwml1W4BKLslUZNDpTCrySV4oU=", "deadline": {"CurrentEpoch":972440,"PeriodStart":972450,"Index":0,"Open":972450,"Close":972510,"Challenge":972430,"FaultCutoff":972380,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972440", "skipped": 0}
2021-07-28T14:49:50.883+0100	INFO	storageminer	storage/wdpost_run.go:642	computing window post	{"batch": 0, "elapsed": 566.580420489}
2021-07-28T14:50:06.399+0100	INFO	storageminer	storage/wdpost_run.go:836	Submitted window post: bafy2bzacedan5cvexkqbap6bdugdzn3oh4jh2hpq6hgtvxghehiw26x7kjb7s
2021-07-28T15:10:00.467+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 0, "good": 0}
2021-07-28T15:20:15.315+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 1293, "good": 1293}
2021-07-28T15:20:16.005+0100	WARN	storageminer	storage/wdpost_run.go:332	declare faults recovered Message CID	{"cid": "bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}
2021-07-28T15:40:11.567+0100	WARN	storageminer	storage/wdpost_run.go:227	Checked sectors	{"checked": 56, "good": 56}
2021-07-28T15:40:11.576+0100	INFO	storageminer	storage/wdpost_run.go:626	running window post	{"chain-random": "/YycN2M1NLqSfHWQAw0Htr88veitbbnXkxR3DMpXHIk=", "deadline": {"CurrentEpoch":972560,"PeriodStart":972450,"Index":2,"Open":972570,"Close":972630,"Challenge":972550,"FaultCutoff":972500,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972560", "skipped": 0}
2021-07-28T15:45:54.299+0100	INFO	storageminer	storage/wdpost_run.go:642	computing window post	{"batch": 0, "elapsed": 342.72311446}
2021-07-28T15:47:00.635+0100	INFO	storageminer	storage/wdpost_run.go:836	Submitted window post: bafy2bzacebwl7nzif27swkfjyyfk2lyf3zgcuqmvd4erbnvhjpm2ikdzck5ri

@TippyFlitsUK
Contributor Author

Sure @jennijuju.
miner-journal.txt

@TippyFlitsUK
Contributor Author

Many thanks for the update @raulk.

It was strange to me that the window post completed without any problem while DeclareFaultsRecovered did seem to be affected.

Could IO pressure have affected one but not the other? I am not too familiar with the DeclareFaultsRecovered mechanism.

@raulk
Member

raulk commented Jul 28, 2021

These are my discoveries.

  • Miner was started at 2021-07-28T12:33:38.875+0100, at which point we warmed up the WindowPoSt worker but didn't actually generate any proofs.

  • We started processing the first deadline at 2021-07-28T14:40:24.301+0100.

    • No previous faults (therefore no attempt to recover).

    • Only 2 sectors to prove.

    • Generating WindowPoSt took 566 seconds! (is this latency normal in your environments? if not, this is a direct effect of the IO pressure)

        2021-07-28T14:40:24.303+0100    INFO    storageminer    storage/wdpost_run.go:626   running window post {"chain-random": "gx7VwkbXfTuXYBImIYwml1W4BKLslUZNDpTCrySV4oU=", "deadline": {"CurrentEpoch":972440,"PeriodStart":972450,"Index":0,"Open":972450,"Close":972510,"Challenge":972430,"FaultCutoff":972380,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972440", "skipped": 0}
        2021-07-28T14:49:50.883+0100    INFO    storageminer    storage/wdpost_run.go:642   computing window post   {"batch": 0, "elapsed": 566.580420489}
      
    • Submitted WindowPoSt at 2021-07-28T14:50:06.399+0100.

  • Next deadline was up at 2021-07-28T15:10:00.467+0100.

    • There were sectors to recover (for the next deadline), and it took 10 minutes to check them (is this latency normal in your environments? if not, this is a direct effect of the IO pressure):

        2021-07-28T15:10:00.467+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 0, "good": 0}
        2021-07-28T15:20:15.315+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 1293, "good": 1293}
      
    • Once we checked them, we submitted the DeclareFaultsRecovered message on chain promptly:

    • 2021-07-28T15:20:16.005+0100 WARN storageminer storage/wdpost_run.go:332 declare faults recovered Message CID {"cid": "bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}

    • We then moved on to submitting WindowPoSt for 56 sectors, took 342 seconds this time:

        2021-07-28T15:40:11.567+0100    WARN    storageminer    storage/wdpost_run.go:227   Checked sectors {"checked": 56, "good": 56}
        2021-07-28T15:40:11.576+0100    INFO    storageminer    storage/wdpost_run.go:626   running window post {"chain-random": "/YycN2M1NLqSfHWQAw0Htr88veitbbnXkxR3DMpXHIk=", "deadline": {"CurrentEpoch":972560,"PeriodStart":972450,"Index":2,"Open":972570,"Close":972630,"Challenge":972550,"FaultCutoff":972500,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "972560", "skipped": 0}
        2021-07-28T15:45:54.299+0100    INFO    storageminer    storage/wdpost_run.go:642   computing window post   {"batch": 0, "elapsed": 342.72311446}
        2021-07-28T15:47:00.635+0100    INFO    storageminer    storage/wdpost_run.go:836   Submitted window post: bafy2bzacebwl7nzif27swkfjyyfk2lyf3zgcuqmvd4erbnvhjpm2ikdzck5ri
      

@raulk
Member

raulk commented Jul 28, 2021

I'm now looking at the journal.

@raulk
Member

raulk commented Jul 28, 2021

A few questions for you @TippyFlitsUK:

  1. Are you running the markets node on a separate machine from your main miner? (If they are running on the same machine, the IO volume in m1.2 explains this problem).
  2. At what time exactly did you stop the markets node?

@TippyFlitsUK
Contributor Author

Many thanks @raulk that is some great info!!

The first deadline (or deadline 0) was actually the faulted one and the DeclareFaultsRecovered was generated too late.

I believe the WindowPoSt that processed for 2 sectors is actually referring to 2 new sectors sealed since last submission on that same faulted deadline.

So the order is more like:

  1. 2021-07-28T14:40:24.301+0100 - Begin processing deadline 0 - 1293 faults - 2 newly sealed sectors.
  2. 2021-07-28T15:10:00.467+0100 - Deadline 1 is empty
  3. 2021-07-28T15:20:15.315+0100 - Submit late DeclareFaultsRecovered for 1293 faults on deadline 0.
  4. 2021-07-28T15:40:11.576+0100 - Begin processing on deadline 2 - 56 sectors

I hope that makes sense buddy!! That latency is normal for WindowPoSts and varies with the number of sectors submitted for each deadline.

@TippyFlitsUK
Contributor Author

TippyFlitsUK commented Jul 28, 2021

@raulk

  1. The miner and markets instances are indeed running on the same machine.
  2. I stopped my markets instance at about 15:15 BST.

It sounds like the market instance may have been actively processing the migration during the DeclareFaultsRecovered deadline but had finished the migration when it came to processing the WindowPoSt itself. Would this make sense in terms of IO pressure?

@TippyFlitsUK
Contributor Author

Actually, with regard to the time I stopped the markets instance... The DeclareFaultsRecovered message was processed the second I stopped the markets instance, so from the logs above it would have been at 15:20.

@raulk
Member

raulk commented Jul 28, 2021

Thanks for clearing up the timeline. It's clear you're more familiar with your deadlines than I am ;-) Unfortunately it's hard to get the full picture from the logs because they're missing info, but the journal is more useful.

In the journal, I see that at 2021-07-28T15:24:00.373614481+01:00, the recoveries for deadline index 1 were marked as finished processing (which means submitted to the chain AND with 5 confirmations):

{"System":"wdpost","Event":"recoveries_processed","Timestamp":"2021-07-28T15:24:00.373614481+01:00","Data":{"Deadline":{"CurrentEpoch":972527,"PeriodStart":972450,"Index":1,"Open":972510,"Close":972570,"Challenge":972490,"FaultCutoff":972440,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70},"Height":972527,"TipSet":[{"/":"bafy2bzacebpyghdyokcq2clgchh3ewzbknynaok4wksbkt4mgvuuitrtsgpbo"},{"/":"bafy2bzacebot7zwf3v7pjlrg2lminpfdbl2ler3aiwq4vub3iejg2f52lduok"},{"/":"bafy2bzaceb6zxcoyglipiurpztcllrw6kdkdunqloiaec3hfxiva3sjhztil2"},{"/":"bafy2bzacedjsgqx2iaumc4nukdhwjglfcwgwyzij4wnkoqjfmjvbbgefpe5dm"},{"/":"bafy2bzaceaszowtiumgafc2ac7stl3timoffsvq5jlblzmp7amfrv3bfwhkva"}],"Declarations":[{"Deadline":0,"Partition":0,"Sectors":[53,1,24,1,2,1,7,1,15,1,2,2,2,2,2,1,1,2,2,6,3,4,1,2,2,5,1,3,2,5,2,1,2,1,3,1,6,1,6,1,2,2,7,1,6,3,1,2,2,3,4,1,1,1,5,1,1,3,2,1,2,1,1,1,2,6,3,1,6,1,3,2,13,3,2,2,1,2,2,1,2,1,2,2,19,2,6,1,7,1,13,1,3,1,2,1,1,1,1,3,4,1,1,1,1,1,2,1,4,1,1,2,3,1,4,2,1,1,5,1,3,1,5,1,1,2,2,1,3,2,3,1,2,1,2,1,1,1,5,1,2,1,4,2,2,1,2,1,2,1,14,18,10,2,1,1,3,28,5,1,3,2,1,1,5,1,2,3,1,1,1,1,4,1,5,1,3,1,1,2,9,1,2,1,6,1,1,1,1,3,8,1,1,1,8,1,1,3,5,1,1,2,4,1,1,3,5,2,3,1,7,2,2,1,3,2,1,1,6,1,4,1,1,1,1,1,7,1,3,5,2,2,4,1,3,5,9,2,1,2,12,7,1,11,1,5,1,2,1,5,1,6,1,2,1,2,1,9,1,19,1,1,2,9,1,12,4,1,1,1,3,23,1,24,1,21,1,3,1,8,1,6,1,20,1,1,1,10,1,1,1,8,1,1,30,1,4,1,1,2,7,16,4,1,1,2,1,1,12,1,1,1,1,1,2,2,1,1,2,1,21,1,4,1,5,1,4,1,1,3,2,1,2,1,1,1,7,1,1,1,1,1,10,1,1,2,2,1,3,1,3,1,2,1,4,2,10,1,12,1,3,2,9,1,19,1,2,1,4,9,7,1,7,1,5,1,2,3,8,8,1,5,3,37,1,22,8,2,10,22,4,3,3,11,1,7,1,4,7,26,1,1,1,18,1,35,4,14,1,19,1,1,2,4,2,1,5,1,6,1,1,1,7,1,12,1,13,1,5,1,3,15,2,1,15,1,1,1,5,1,3,2,7,1,1,1,6,1,2,1,2,1,1,1,1,1,5,1,5,1,1,2,2,1,3,2,2,1,1,8,1,20,2,16,2,1,5,2,1,3,1,2,1,80,1,28,1,26,1,20,1,28,1,11,1,2,1,25,1,7,1,126,2,4,1,2,6,5,1,9,1,17,1,16,2,15,1,5]}],"MessageCID":{"/":"bafy2bzacedzqelugvlcpiqvybjgozapn5eztvqfo2lecsa5fxdeixb2lmufu4"}}}

Was this too late for the deadline?
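As an aside for anyone reading the journal entry above: the `Sectors` array looks like a run-length encoding, alternating runs of absent and present sector numbers, which is how Filecoin bitfields are typically marshalled to JSON — treat that interpretation as an assumption here. Under that assumption, counting the declared sectors is just summing every second run:

```go
package main

import "fmt"

// countSetRuns interprets runs as an alternating run-length encoding,
// starting with a run of absent sectors: runs[0] absent, runs[1] present,
// runs[2] absent, and so on. This matches the usual JSON marshalling of
// Filecoin bitfields, but is an assumption, not a guarantee.
func countSetRuns(runs []uint64) uint64 {
	var total uint64
	for i := 1; i < len(runs); i += 2 {
		total += runs[i]
	}
	return total
}

func main() {
	// Hypothetical small example: 53 absent, 1 present, 24 absent, 2 present.
	fmt.Println(countSetRuns([]uint64{53, 1, 24, 2})) // prints 3
}
```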

@TippyFlitsUK
Contributor Author

That looks right @raulk.

The WindowPoSt for the faulted deadline is this line from the logs:

2021-07-28T14:50:06.399+0100 INFO storageminer storage/wdpost_run.go:836 Submitted window post: bafy2bzacedan5cvexkqbap6bdugdzn3oh4jh2hpq6hgtvxghehiw26x7kjb7s

I waited until the close of the deadline window and then closed my markets instance which is when the DeclareFaultsRecovered message was sent.

@raulk
Member

raulk commented Jul 28, 2021

Do you run some system/infrastructure monitoring tool like netdata that would give us historical info on IO utilisation around that time?

@TippyFlitsUK
Contributor Author

I'm afraid not. I've been looking at Netdata for a while... time to get it installed I think ;-)

@raulk
Member

raulk commented Jul 28, 2021

I can vouch for it! It has helped tremendously in the past to debug all kinds of situations ;-)

@TippyFlitsUK
Contributor Author

I really appreciate all your time and help buddy!! I'll take the necessary steps to make sure that the WindowPoSt processes correctly tomorrow.

@TippyFlitsUK
Contributor Author

Caused by IO pressure sustained during the migration process of the M1.2 release. The issue will be addressed and resolved in the upcoming M1.3.

@raulk
Member

raulk commented Jul 28, 2021

Just some extra thoughts on what happened here, providing further nuance to this discussion and diagnosis.

The storage subsystem uses the concept of locks to coordinate access to sectors. Due to the lack of throttling in the DAG store, we ended up acquiring the unsealed copy lock for all sectors.

Furthermore, the FaultTracker#CheckProvable() method used by the window PoSt fault recovery logic also ends up acquiring locks. In this case, it wanted to lock the 1000+ faulted sectors, though only their sealed or cached files. In theory the two sets of read locks shouldn't interact, but perhaps they do.

Finally, FaultTracker#CheckProvable() ensures that sectors are not only accessible but actually provable by running a "vanilla proof", which I suspect places further IO demand on top of the DAG store's pre-existing IO pressure.

I'm confident that m1.3 will solve this problem, but I'd like to continue working together with you @TippyFlitsUK to verify that things are nominal, ideally after you've passed your critical proving windows.

@TippyFlitsUK
Contributor Author

Many thanks @raulk

Excellent information. Definitely sounds like the locks you describe could have had a part to play.

I'm happy to say that the faulted sector has just passed WindowPoSt so panic over.

Sincerest thanks again for all your help!!

Development

No branches or pull requests

3 participants