Seg fault / panic syncing mainnet on v1.7.4 #2134

Closed
aaronmboyd opened this issue Jun 20, 2023 · 6 comments
Labels
type:bug Something isn't working

Comments

@aaronmboyd

aaronmboyd commented Jun 20, 2023

Expected Behavior

Sync from genesis without panic

Actual Behavior

geth panics and terminates. Syncing resumes normally after a restart.

Steps to reproduce the behavior

Build celo-blockchain from source tag v1.7.4
Go: go1.19.2.linux-arm64, dependency in custom Dockerfile
Docker: 24.0.2
Running on: Ubuntu 22.04.2 LTS ARM64

Dockerfile:

FROM arm64v8/ubuntu:jammy as builder

RUN apt update && apt install gcc make musl-dev wget -y

RUN wget https://go.dev/dl/go1.19.2.linux-arm64.tar.gz
RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.19.2.linux-arm64.tar.gz
ENV PATH $PATH:/usr/local/go/bin

ADD . /go-ethereum
WORKDIR /go-ethereum
RUN make geth-musl

FROM arm64v8/ubuntu:jammy
ARG COMMIT_SHA

COPY --from=builder /go-ethereum/build/bin/geth /usr/local/bin/
RUN echo $COMMIT_SHA > /version.txt
ADD scripts/run_geth_in_docker.sh /

EXPOSE 8545 8546 30303 30303/udp
ENTRYPOINT ["sh", "/run_geth_in_docker.sh"]

# Add some metadata labels to help programmatic image consumption
ARG COMMIT=$COMMIT_SHA
ARG VERSION=""
ARG BUILDNUM=""

LABEL commit="$COMMIT" version="$VERSION" buildnum="$BUILDNUM"

Backtrace

INFO [06-19|13:17:04.423] Imported new chain segment               blocks=190  txs=1079  mgas=243.352 elapsed=8.336s    mgasps=29.192  number=18,830,021 hash=234037..f78e82 age=2mo2d1h  dirty=74.86MiB
--
INFO [06-19|13:17:07.629] Sending val enode share msg to proxy     func=sendValEnodeShareMsgs       proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19|13:17:07.630] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19|13:17:12.431] Imported new chain segment               blocks=169  txs=889   mgas=206.133 elapsed=8.008s    mgasps=25.740  number=18,830,190 hash=3cd7bb..1c429a age=2mo2d1h  dirty=75.76MiB
INFO [06-19|13:17:20.441] Imported new chain segment               blocks=221  txs=1171  mgas=267.872 elapsed=8.010s    mgasps=33.439  number=18,830,411 hash=570d7c..095d89 age=2mo2d59m dirty=76.94MiB
INFO [06-19|13:17:28.969] Imported new chain segment               blocks=199  txs=820   mgas=223.748 elapsed=8.527s    mgasps=26.240  number=18,830,610 hash=63997b..6b01d5 age=2mo2d43m dirty=76.46MiB
INFO [06-19|13:17:36.971] Imported new chain segment               blocks=145  txs=573   mgas=211.915 elapsed=8.002s    mgasps=26.481  number=18,830,755 hash=e204e2..de3f67 age=2mo2d31m dirty=77.38MiB
INFO [06-19|13:17:37.634] Sending val enode share msg to proxy     func=sendValEnodeShareMsgs       proxy peer="Peer 16331f68e399c8af 10.192.11.7:30503" valAddresses length=119
INFO [06-19|13:17:37.634] Skipping sending ValEnodesShareMsg b/c not validating func=sendValEnodesShareMsg
INFO [06-19|13:17:42.725] Imported new chain segment               blocks=245  txs=1128  mgas=244.124 elapsed=5.753s    mgasps=42.428  number=18,831,000 hash=6af2e0..cfc589 age=2mo2d11m dirty=79.62MiB
INFO [06-19|13:17:42.726] Downloader queue stats                   receiptTasks=0 blockTasks=33151 itemSize=1.83KiB  throttle=8192
INFO [06-19|13:17:43.728] Unindexing transactions                  blocks=1727 txs=39667 total=2048 elapsed=1.002s
INFO [06-19|13:17:44.066] Unindexed transactions                   blocks=2048 txs=47208 tail=16,481,001 elapsed=1.340s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc44908]
goroutine 574387 [running]:
github.com/celo-org/celo-blockchain/contracts/currency.(*Currency).ToCELO(...)
	github.com/celo-org/celo-blockchain/contracts/currency/currency.go:62
github.com/celo-org/celo-blockchain/miner.createConversionFunctions.func2(0x3c831c0a?, 0x4011eba400?)
	github.com/celo-org/celo-blockchain/miner/block.go:380 +0x28
github.com/celo-org/celo-blockchain/core/types.NewTxWithMinerFee(0x401328e120, 0x4091e8fbd0, 0x4091e8fbe0)
	github.com/celo-org/celo-blockchain/core/types/transaction.go:555 +0x88
github.com/celo-org/celo-blockchain/core/types.NewTransactionsByPriceAndNonce({0x1914680?, 0x405c819bb0}, 0x409b8fe1e0, 0x4091e8fbd0, 0x4091e8fbe0)
	github.com/celo-org/celo-blockchain/core/types/transaction.go:609 +0x168
github.com/celo-org/celo-blockchain/miner.(*worker).constructPendingStateBlock(0x40029794a0, {0x1911e58, 0x403d4b1d40}, 0x40034d0a80)
	github.com/celo-org/celo-blockchain/miner/worker.go:366 +0x408
github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.2()
	github.com/celo-org/celo-blockchain/miner/worker.go:415 +0x38
created by github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1
	github.com/celo-org/celo-blockchain/miner/worker.go:414 +0x1f4

Chain/Network: Mainnet

aaronmboyd added the type:bug and triage labels Jun 20, 2023
@aaronmboyd
Author

Possibly related to #1982

carterqw2 removed the triage label Jun 20, 2023
@karlb
Contributor

karlb commented Jun 20, 2023

Maybe we should not silently discard the error in

curr, _ := currencyManager.GetCurrency(feeCurrency)

@palango
Contributor

palango commented Jun 26, 2023

This involves this function:

func NewTransactionsByPriceAndNonce(signer Signer, txs map[common.Address]Transactions, baseFeeFn func(feeCurrency *common.Address) *big.Int, toCELO func(amount *big.Int, feeCurrency *common.Address) *big.Int) *TransactionsByPriceAndNonce {
	// Initialize a price and received time based heap with the head transactions
	heads := make(TxByPriceAndTime, 0, len(txs))
	for from, accTxs := range txs {
		acc, _ := Sender(signer, accTxs[0])
		wrapped, err := NewTxWithMinerFee(accTxs[0], baseFeeFn, toCELO)
		// Remove transaction if sender doesn't match from, or if wrapping fails.
		if acc != from || err != nil {
			delete(txs, from)
			continue
		}
		heads = append(heads, wrapped)
		txs[from] = accTxs[1:]
	}
	heap.Init(&heads)
	// Assemble and return the transaction set
	return &TransactionsByPriceAndNonce{
		txs:       txs,
		heads:     heads,
		signer:    signer,
		baseFeeFn: baseFeeFn,
		toCELO:    toCELO,
	}
}

@karlb
Contributor

karlb commented Jun 28, 2023

From reading the stack trace, I see only one way to cause the error and that is

  • Have a curr of nil in toCeloFn
  • because GetExchangeRate failed to deliver an exchange rate for the given currency
  • because NewExchangeRate gets called with either numerator or denominator being zero

However, I don't see this being the case for any of the fee currencies around that time:

for curr in 0x765DE816845861e75A25fCA122bb6898B8B1282a 0xD8763CBa276a3738E6DE85b4b3bF5FDed6D6cA73 0xe8537a3d056DA446677B9E9d6c5dB704EaAb4787
    cast call -r https://celo-mainnet.infura.io/v3/$INFURA_API_KEY 0xefB84935239dAcdecF7c5bA76d8dE40b077B7b33 'function medianRate(address token) external view returns (uint256, uint256)' $curr -b 18831000
end
713715994804191731790000
1000000000000000000000000
653368356238021788240000
1000000000000000000000000
3553065332252728052300000
1000000000000000000000000

Have I missed anything?
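For reference, the check on the values above can be written down as a small sketch. It assumes the first number of each pair is the numerator and the second the denominator, and that the pairs appear in the same order as the addresses in the loop; `ratePair` and `checkRate` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/big"
)

// ratePair holds one medianRate result as decimal strings.
type ratePair struct{ token, first, second string }

// checkRate parses a pair and returns an error if either value
// cannot be parsed or is zero; otherwise it returns first/second.
func checkRate(p ratePair) (*big.Rat, error) {
	num, ok1 := new(big.Int).SetString(p.first, 10)
	den, ok2 := new(big.Int).SetString(p.second, 10)
	if !ok1 || !ok2 {
		return nil, fmt.Errorf("%s: unparsable rate", p.token)
	}
	if num.Sign() == 0 || den.Sign() == 0 {
		return nil, fmt.Errorf("%s: zero numerator or denominator", p.token)
	}
	return new(big.Rat).SetFrac(num, den), nil
}

// observed lists the values printed by the cast calls at block 18831000.
var observed = []ratePair{
	{"0x765DE816845861e75A25fCA122bb6898B8B1282a", "713715994804191731790000", "1000000000000000000000000"},
	{"0xD8763CBa276a3738E6DE85b4b3bF5FDed6D6cA73", "653368356238021788240000", "1000000000000000000000000"},
	{"0xe8537a3d056DA446677B9E9d6c5dB704EaAb4787", "3553065332252728052300000", "1000000000000000000000000"},
}

func main() {
	for _, p := range observed {
		r, err := checkRate(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s rate=%s\n", p.token, r.FloatString(4))
	}
}
```

All three pairs pass the non-zero check, which matches the observation that none of them should make `NewExchangeRate` fail.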

This was referenced Jul 5, 2023
@karlb
Contributor

karlb commented Jul 7, 2023

I was unable to reproduce the panic with v1.7.4, but I just ran it locally and didn't use the docker container as described in the issue.

We could easily handle the case and avoid a panic (e.g. by not processing a tx when we can't convert the miner fee to CELO). But since it does not happen for all validators, we would get a consensus failure instead of a panic, which would be even harder to debug.

Alternatively, we could keep panicking and add some additional debugging output to show the specific tx that fails when it happens the next time.

karlb added a commit that referenced this issue Jul 11, 2023
This might help us debug
#2134 if it happens
again.
karlb added a commit that referenced this issue Jul 12, 2023
This might help us debug
#2134 if it happens
again.
karlb added a commit that referenced this issue Jul 12, 2023
#2134 leads us to
believe that there are failures happening in production. This change
* Handles error cases by skipping processing of respective txs instead
  of segfaulting
* Logs tx information in these cases to better understand why this is
  happening

Skipping transactions where conversion is not possible is analogous to
handling other cases of bad transactions. But since the linked issue
only happens for some nodes, I am not sure if this is the right thing to
do or if we should intentionally panic in that case to avoid harder to
debug consensus failures due to having different nodes process a
different set of transactions.
karlb added a commit that referenced this issue Oct 6, 2023
#2134 leads us to
believe that there are failures happening in production. This change
* Handles error cases by skipping processing of respective txs instead
  of segfaulting
* Logs tx information in these cases to better understand why this is
  happening

Skipping transactions where conversion is not possible is analogous to
handling other cases of bad transactions. But since the linked issue
only happens for some nodes, I am not sure if this is the right thing to
do or if we should intentionally panic in that case to avoid harder to
debug consensus failures due to having different nodes process a
different set of transactions.
karlb added a commit that referenced this issue Oct 9, 2023
* Define ToCeloFn type

The declaration is repeated multiple times and I am about to change it,
so it is nice to pull it into a typedef.

* Return errors from toCeloFn

#2134 leads us to
believe that there are failures happening in production. This change
* Handles error cases by skipping processing of respective txs instead
  of segfaulting
* Logs tx information in these cases to better understand why this is
  happening

Skipping transactions where conversion is not possible is analogous to
handling other cases of bad transactions. But since the linked issue
only happens for some nodes, I am not sure if this is the right thing to
do or if we should intentionally panic in that case to avoid harder to
debug consensus failures due to having different nodes process a
different set of transactions.
@aaronmboyd
Author

Closing as stale. I believe this was resolved, although I don't think it was specifically reproducible on ARM.

4 participants