-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seg fault / panic syncing mainnet on v1.7.4 #2134
Comments
Possibly related to #1982 |
Maybe we should not silently discard the error in celo-blockchain/miner/block.go Line 386 in 85c1757
|
This involves this function: celo-blockchain/core/types/transaction.go Lines 604 to 628 in de3f84b
|
From reading the stack trace, I see only one way to cause the error and that is
However, I don't see this being the case for any of the fee currencies around that time: for curr in 0x765DE816845861e75A25fCA122bb6898B8B1282a 0xD8763CBa276a3738E6DE85b4b3bF5FDed6D6cA73 0xe8537a3d056DA446677B9E9d6c5dB704EaAb4787
cast call -r https://celo-mainnet.infura.io/v3/$INFURA_API_KEY 0xefB849352
39dAcdecF7c5bA76d8dE40b077B7b33 'function medianRate(address token) external view returns (uint256, uint256)' $curr -b 18831000
end
713715994804191731790000
1000000000000000000000000
653368356238021788240000
1000000000000000000000000
3553065332252728052300000
1000000000000000000000000 Have I missed anything? |
I was unable to reproduce the panic with v1.7.4, but I just ran it locally and didn't use the docker container as described in the issue. We could easily handle the case and avoid a panic (e.g. by not processing a tx when we can't convert the miner fee to CELO). But since it does not happen for all validators, we would get a consensus failure instead of a panic, which would be even harder to debug. Alternatively, we could keep panicking and add some additional debugging output to show the specific tx that fails when it happens the next time. |
This might help us debug #2134 if it happens again.
This might help us debug #2134 if it happens again.
This might help us debug #2134 if it happens again.
#2134 leads us to believe that there are failures happening in production. This change * Handles error cases by skipping processing of respective txs instead of segfaulting * Logs tx information in these cases to better understand why this is happening Skipping transactions where conversion is not possible is analogous to handling other cases of bad transactions. But since the linked issue only happens for some nodes, I am not sure if this is the right thing to do or if we should intentionally panic in that case to avoid harder to debug consensus failures due to having different nodes process a different set of transactions.
#2134 leads us to believe that there are failures happening in production. This change * Handles error cases by skipping processing of respective txs instead of segfaulting * Logs tx information in these cases to better understand why this is happening Skipping transactions where conversion is not possible is analogous to handling other cases of bad transactions. But since the linked issue only happens for some nodes, I am not sure if this is the right thing to do or if we should intentionally panic in that case to avoid harder to debug consensus failures due to having different nodes process a different set of transactions.
#2134 leads us to believe that there are failures happening in production. This change * Handles error cases by skipping processing of respective txs instead of segfaulting * Logs tx information in these cases to better understand why this is happening Skipping transactions where conversion is not possible is analogous to handling other cases of bad transactions. But since the linked issue only happens for some nodes, I am not sure if this is the right thing to do or if we should intentionally panic in that case to avoid harder to debug consensus failures due to having different nodes process a different set of transactions.
* Define ToCeloFn type The declaration is repeated multiple times and I am about to change it, so it is nice to pull it into a typedef. * Return errors from toCeloFn #2134 leads us to believe that there are failures happening in production. This change * Handles error cases by skipping processing of respective txs instead of segfaulting * Logs tx information in these cases to better understand why this is happening Skipping transactions where conversion is not possible is analogous to handling other cases of bad transactions. But since the linked issue only happens for some nodes, I am not sure if this is the right thing to do or if we should intentionally panic in that case to avoid harder to debug consensus failures due to having different nodes process a different set of transactions.
Closing stale, and I believe this was resolved, although I don't think it was specifically reproducable on ARM. |
Expected Behavior
Sync from genesis without panic
Actual Behavior
Panic and geth terminated. Resumes syncing normally after restart.
Steps to reproduce the behavior
Build celo-blockchain from source tag
v1.7.4
Go:
go1.19.2.linux-arm64, dependency in custom Dockfile
Docker:
24.0.2
Running on:
Ubuntu 22.04.2 LTS ARM64
Dockerfile:
Backtrace
Chain/Network:
Mainnet
The text was updated successfully, but these errors were encountered: