[BUG] Mainnet nodes with incoming connections unexpected shutdown with Failure in Data.Map.balanceR (error, called at src/Data/Map/Internal.hs:4157:30 in containers-0.6.5.1-EiES0HFUZ8PBGNrpVjoYRF:Data.Map.Internal) #4826
Comments
Many SPOs are reporting the same on the SPO workgroup (node versions 1.35.3 and 1.35.4). Only nodes that were not publicly reachable (and thus not receiving all transactions) survived.
7-minute gap in my logs.
Network synchronization and block height alignment from pooltool.io (https://pooltool.io/networkhealth):
We saw the same. Last block to come in was 8300568. As it came in, the nodes crashed.
All our active nodes (a mix of cloud and baremetal, running both versions 1.35.3 and 1.35.4) went down. Only one node that had no incoming connections did not have the error. Looking up the error message, we found it comes from here, so it is not a cardano-node exception.
Same error too - all nodes restarted...
Can confirm the same, all nodes restarted at the same time as reported here.
Same here |
Ditto here |
On initial inspection this seems to be the offending transaction. The block it was in, 95050, fails to register on almost all nodes; they all seem to die as soon as they see the block. The collaterals are consumed, indicating that the TX failed to execute. I theorize that the redeemer triggered some sort of runaway process, causing all nodes that tried to process it to shut down.
Just confirming, saw the same thing as you all.
Same as above. All nodes have been restarted.
Yup, as most reported. This is from my FreeBSD nodes. The error came in:
Then eventually the node got shut down:
@leo42 - what about the cardano-scan output makes you think that the collateral was consumed? Collateral is only consumed if a transaction is successfully processed, if it contains at least one failing Plutus script, and if the transaction is marked as expected to have a failed script. If you happen to have the CBOR for this transaction, that would speed up my investigation.
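For readers skimming the thread, that rule can be restated as a simple predicate. A minimal sketch, assuming nothing about the actual cardano-ledger types (the TxOutcome record and its field names below are purely illustrative):

```haskell
-- Illustrative only: a restatement of the rule in the comment above,
-- not the cardano-ledger API.
data TxOutcome = TxOutcome
  { includedOnChain :: Bool  -- the transaction was successfully processed into a block
  , markedInvalid   :: Bool  -- it was marked as expected to fail (its isValid flag is False)
  , scriptFailed    :: Bool  -- at least one Plutus script in it actually failed
  }

-- Collateral is collected only when all three conditions hold.
collateralConsumed :: TxOutcome -> Bool
collateralConsumed tx = includedOnChain tx && markedInvalid tx && scriptFailed tx
```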
@JaredCorduan I unfortunately do not have the CBOR. I scanned all the TXs in the block that seems to have caused the stall. According to Cardano-scan, 3 collateral inputs were consumed and one output was created. It registered the total collateral as 1.25 ADA, which seems to be the amount missing from the output. Since ADA was subtracted from the collateral output, I assumed that indicates a failed script execution.
It wasn't clear to me that this wasn't just the collateral inputs and collateral return listed in the tx.
All nodes crashed per the above here as well.
My Droperator relay, which didn't process the transaction, didn't crash, but my main relay and BP did crash.
Same here:
@JaredCorduan Here is the whole block 8300569 CBOR which I think contains the transaction discussed above:
Can also confirm all 8 mainnet nodes I have under management restarted. Mix of cloud and baremetal, multiple geographical locations. v1.35.4
The collateral is not consumed for this tx; in the case of a failed tx, cardanoscan will display a FAILED tag. cc @JaredCorduan
Do we have reason to believe that it was the block above, beyond it just being the last block received before the crash? It seems more likely to me that if there was a block causing the crash, it would be the one received immediately after that persisted block, and the crashing would prevent it from being adopted. Once nodes restarted, they would likely select a different tip and continue producing blocks from there. In particular, SMAUG's original error message seems to indicate that the headerHash of the tip at the time was
Also, if it was a specific block in the chain, I'd imagine it would continue to cause issues after the restart. The fact that the network recovered implies that the block that caused the issue is no longer part of the chain.
I agree completely @Quantumplation, we came to the same conclusion. What's weird is how such a block or transaction would propagate and cause so many failures all within the same second.
Upstream |
You used |
I'm not on the Cardano team and not complaining. :) Just interested in improving Haskell libs and trying to ask useful questions.
@simonmichael Glad to hear. It's most certainly possible that this is caused by a |
Is the new node version 1.35.5 meant to fix this bug? |
Yes, go update your node to 1.35.5. |
Could anyone please share which GitHub pull request fixed this issue?
Please wait for the blog post that will be released in ~6-8 weeks with details |
Yes, as you (?) pointed out in a now-deleted comment, the fix is in this PR. The code was for a function called

```haskell
case compare kx ky of
  LT -> balanceL ky y (go f kx x l) r
  GT -> balanceR ky y l (go f kx x r)
  EQ -> if new == zeroC then link2 l r else Bin sy kx new l r
```

The trouble is in the
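To make the danger of that EQ branch concrete, here is a minimal sketch of the pattern written against Data.Map.Internal. This is my own reconstruction for illustration, not the actual ledger code: an "add, and delete when the total hits zero" update. balanceL and balanceR assume the recursively rebuilt subtree changed size the way an insert would (it never shrinks), so a branch that can delete an entry violates that precondition, which is exactly the kind of structural breakage that surfaces as the Failure in Data.Map.balanceR in the issue title.

```haskell
-- A sketch of the risky pattern (illustrative, not the cardano-ledger source).
module BuggyAdjust (addCoin) where

import Data.Map.Internal (Map (..), balanceL, balanceR, link2, singleton)

type Coin = Integer

-- Add `x` to the coin stored at key `kx`, deleting the entry if the sum is zero.
addCoin :: Ord k => k -> Coin -> Map k Coin -> Map k Coin
addCoin kx x = go
  where
    go Tip = singleton kx x
    go (Bin sy ky y l r) =
      case compare kx ky of
        LT -> balanceL ky y (go l) r   -- assumes the left subtree did not shrink
        GT -> balanceR ky y l (go r)   -- assumes the right subtree did not shrink
        EQ ->
          let new = x + y
          in if new == 0
               then link2 l r          -- deletion: this subtree *shrinks*,
                                       -- breaking the caller's balanceL/balanceR assumption
               else Bin sy kx new l r
```

A shape-safe alternative for an update that may insert, modify, or delete is plain Data.Map.alter, which never exposes the internal balancing primitives to the caller.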
The cause of the issue was known for some days and has been fixed, though anyone with a GitHub account and a wrong assumption can raise an invalid issue in any repo. But reacting like you did (assuming that the mentioned Cardano folks blame you) is, imo, not really distinguishing you from those who hold strong opinions without validation. It's just my 2 cents. In the future, do some fact checking first before you make any firm statements.
@simonmichael This was due to the Cardano devs using a function in
Until a larger portion of the network is using the fixed version (currently it's at 21% according to Pooltool), we would like to keep public knowledge of this issue as vague as possible. Once the network is safe, we will be as transparent as possible about what happened.
Why? Attempting to conceal problems is a double-edged sword, and it will damage the image of the community and the platform. If it is a potential "attack vector", you keep developers "up at night" to fix it, but obfuscating facts will always come back to bite you. I do not want this to become a controversial topic, nor to fill people's emails with garbage. Just raising a point on transparency and principles. Peace out.
I don't think this is an attempt to "conceal" problems; it's IOG exercising their responsible disclosure policy. The plan is to release a detailed post-mortem, but doing so before a significant number of SPOs have upgraded is just adding unnecessary risk that someone malicious will be able to reproduce it. IOG actually went above and beyond to include members of the community in this effort for the first time, for which they should be applauded. Waiting a few weeks to release a post-mortem is standard security practice, and is going to make very little difference for any benign parties who could benefit from a deep understanding of the attack, but makes a huge difference for any potential malicious actors out there.
Agreed with Pi. Just wanted to add my own thought (though it has been covered already). The process IOG is following is standard practice in security: it is a responsible disclosure. If you reveal too much too soon, you make it easier for bad actors to gain information they can use to their advantage before the resolution is widely implemented. Transparency does not mean haphazardly and prematurely releasing information that would benefit bad actors before the risk has been mitigated to a manageable/reasonable degree.
Additionally, if curious, here are the general stages of an incident response (from my experience in ransomware forensic cases). They vary, but in general these are the steps.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days. |
This issue has been fixed but will remain open until the problem, its cause, and its solution have been documented and made public.
I've opened nick8325/quickcheck#354 to help address the root cause. |
My sincere thanks goes to everyone who helped resolve this issue. @lehins and @TimSheard were true heroes in resolving it fast. @Quantumplation and @AndrewWestberg were also heroes, it's always a real joy to work with y'all. The issue was resolved by IntersectMBO/cardano-ledger#3343. @kevinhammond published a brief incident report: https://input-output-hk.github.io/cardano-updates/2023-04-17-ledger @Quantumplation published a well written and expository postmortem: https://www.314pool.com/post/cardano-post-mortem-1 |
External
Area
cardano-node exception leading to shutdown
Summary
All my public nodes with incoming connections (active block producer and relays) shut down unexpectedly on 2023 January 22 at 00:09:01 UTC with the following fatal exception:
(time in UTC+1 above)
Some other operators confirmed this also happened to some of their nodes.
Steps to reproduce
No idea yet.
Expected behavior
No exception is expected in cardano-node.
System info (please complete the following information):
Screenshots and attachments
Previous logs (UTC+1):
Additional context
All my nodes without incoming connections (just outgoing ones) were not affected.