Markets Instance Appears to Prevent sending of DeclareFaultsRecovered Message #6915
@TippyFlitsUK can you get us the miner logs?

No problem @jennijuju.

@TippyFlitsUK Can we get some miner journals, too?
@TippyFlitsUK Thanks for the report! The markets process does not interfere at all with the miner process; that's the whole goal of the MRA separation. What could be happening is that the increased IO pressure you saw in m1.2 had a side effect that affected your miner (especially if you're running both on the same machine). This IO pressure is gone in m1.3. I've extracted the logs from the WindowPoSt scheduler, which is in charge of doing the following:

1. Declaring recoveries (the `DeclareFaultsRecovered` message).
2. Declaring faults.
3. Generating and submitting the WindowPoSt proofs.

1 & 2 run in that order, but concurrently to 3. I'm analyzing these logs to see if I find something fishy.
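(For illustration, a minimal Go sketch of the ordering just described; this is not the actual lotus scheduler code, and the function names are placeholders:)

```go
// Sketch of "1 & 2 run in that order, but concurrently to 3".
// Not the real lotus WindowPoSt scheduler; names are hypothetical.
package wdpostsketch

import "sync"

func runDeadline(declareRecoveries, declareFaults, generateAndSubmitPoSt func() error) {
	var wg sync.WaitGroup
	wg.Add(2)

	// Steps 1 and 2 share a goroutine, so a stalled recovery declaration
	// (e.g. under IO pressure) also delays the fault declaration...
	go func() {
		defer wg.Done()
		if err := declareRecoveries(); err != nil { // step 1
			return // DeclareFaultsRecovered never lands
		}
		_ = declareFaults() // step 2
	}()

	// ...but the proof runs on its own goroutine, which is why a
	// WindowPoSt can still succeed while the declarations are blocked.
	go func() {
		defer wg.Done()
		_ = generateAndSubmitPoSt() // step 3
	}()

	wg.Wait()
}
```

That split is consistent with the symptom reported here: the proof was submitted while the `DeclareFaultsRecovered` message never made it out.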
Sure @jennijuju.
Many thanks for the update @raulk. It was strange to me that the window post completed without any problem but the `DeclareFaultsRecovered` message was never sent. Could IO pressure have affected one but not the other? I am not too familiar with the WindowPoSt scheduler.
These are my discoveries.
I'm now looking at the journal.
A few questions for you @TippyFlitsUK:
Many thanks @raulk, that is some great info!! The first deadline (or deadline 0) was actually the faulted one and the DeclareFaultsRecovered was generated too late. I believe the WindowPoSt that processed for 2 sectors is actually referring to 2 new sectors sealed since the last submission on that same faulted deadline. So the order is more like:

I hope that makes sense buddy!! That latency is normal for WindowPoSts and varies with the number of sectors submitted for each deadline.
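(For context on "generated too late": on mainnet, each 24-hour proving period of 2880 epochs is split into 48 deadlines of 60 epochs, and fault/recovery declarations for a deadline close 70 epochs, i.e. 35 minutes, before it opens. A rough sketch of the cutoff check follows; the constants mirror the published network parameters, while the helper itself is hypothetical:)

```go
// Illustrative only: mainnet WindowPoSt timing (30-second epochs).
package deadlines

const (
	EpochsPerDeadline      = 60 // each deadline's challenge window: 30 minutes
	DeadlinesPerPeriod     = 48 // 48 deadlines per 2880-epoch (24h) proving period
	FaultDeclarationCutoff = 70 // declarations close 70 epochs before the deadline opens
)

// recoveryInTime reports whether a DeclareFaultsRecovered message landing at
// epoch `at` still counts for the deadline that opens at epoch `open`.
func recoveryInTime(at, open int64) bool {
	return at < open-FaultDeclarationCutoff
}
```

A recovery that lands inside that 70-epoch window misses the cutoff, and the sector stays faulty until the same deadline comes around in the next proving period.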
It sounds like the market instance may have been actively processing the migration during the DeclareFaultsRecovered deadline but had finished the migration when it came to processing the WindowPoSt itself. Would this make sense in terms of IO pressure?
Actually, with regards to the time I stopped the markets instance...
Thanks for clearing up the timeline. It's clear you're more familiar with your deadlines than I am ;-) Unfortunately it's hard to get the full picture from the logs, because they're missing info, but the journal is more useful. In the journal, I see that at 2021-07-28T15:24:00.373614481+01:00, the recoveries for deadline index 1 were marked as finished processing (which means submitted to the chain AND with 5 confirmations):

Was this too late for the deadline?
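(Aside: "finished processing" maps, roughly, onto waiting for the message with a confidence of 5 tipsets. A sketch against lotus's v0 full-node API; the wrapper function is hypothetical, but `StateWaitMsg` is the real call:)

```go
package recoverywait

import (
	"context"
	"fmt"

	"github.com/filecoin-project/lotus/api/v0api"
	"github.com/ipfs/go-cid"
)

// waitRecoveriesLanded blocks until the given DeclareFaultsRecovered message
// has 5 confirmations on chain, the condition the journal records as
// "finished processing".
func waitRecoveriesLanded(ctx context.Context, node v0api.FullNode, msg cid.Cid) error {
	lookup, err := node.StateWaitMsg(ctx, msg, 5) // confidence: 5 tipsets
	if err != nil {
		return err
	}
	if lookup.Receipt.ExitCode != 0 {
		return fmt.Errorf("recovery declaration failed with exit code %d", lookup.Receipt.ExitCode)
	}
	return nil
}
```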
That looks right @raulk. The WindowPoSt for the faulted deadline is this line from the logs:

I waited until the close of the deadline window and then closed my `markets` instance.
Do you run some system/infrastructure monitoring tool like netdata that would give us historical info on IO utilisation around that time?

I'm afraid not. I've been looking at Netdata for a while... time to get it installed I think ;-)

I can vouch for it! It has helped tremendously in the past to debug all kinds of situations ;-)
I really appreciate all your time and help buddy!! I'll take the necessary steps to make sure that the ...
Caused by IO pressure sustained in the migration process of the DAG store.
Just some extra thoughts on what happened here, providing further nuance to this discussion and diagnosis.

The storage subsystem uses the concept of locks to coordinate access to sectors. Due to the lack of throttling in the DAG store, we ended up acquiring the unsealed copy lock for all sectors.

Furthermore, the ...

Finally, the ...

I have confidence that m1.3 will solve this problem, but would like to continue working together with you @TippyFlitsUK, to verify that things are nominal, ideally after you've passed your critical proving windows.
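(To make the throttling point concrete, here is a sketch of a counting semaphore of the kind described: it bounds how many sector locks a bulk operation such as the migration can hold at once. The names are mine, not the dagstore's actual API:)

```go
package throttle

import "context"

// Throttle is a counting semaphore: at most cap(t) operations run at once.
type Throttle chan struct{}

func New(limit int) Throttle { return make(Throttle, limit) }

// Do runs fn once a slot is free. Wrapping each unsealed-copy fetch in Do
// keeps a bulk migration from grabbing locks for all sectors at once,
// leaving headroom for the WindowPoSt scheduler's own lock acquisitions.
func (t Throttle) Do(ctx context.Context, fn func() error) error {
	select {
	case t <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-t }() // release the slot
	return fn()
}
```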
Many thanks @raulk, excellent information. Definitely sounds like the locks you describe could have had a part to play. I'm happy to say that the faulted sector has just passed its WindowPoSt. Sincerest thanks again for all your help!!
Checklist
Lotus component
lotus miner - proving (WindowPoSt)
Lotus Tag and Version
Describe the Bug

My miner failed window post yesterday whilst testing the `M1.1` release, due to already identified excess server load which has now been resolved. On the same submission window today (testing the `M1.2` release) the window post processed without issue, but the miner did not send a `DeclareFaultsRecovered` message prior to the window post message. On stopping the markets instance, the miner immediately (within milliseconds) sent the missing `DeclareFaultsRecovered` message. It appears that the running markets instance prevented the `DeclareFaultsRecovered` message being sent.

Logging Information
Repo Steps

Attempt to send a `DeclareFaultsRecovered` message with the `markets` instance running.