Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVM-800 nonce too low from eth_getTransactionCount #1853

Merged

Conversation

igorcrevar
Copy link
Contributor

@igorcrevar igorcrevar commented Aug 28, 2023

Description

BlockNumber structure has these supported strings as special kind of a block numbers:

  1. "pending" - this returns information from the block that is currently being mined, which might not yet be part of the blockchain.
  2. "latest" - refers to the latest block in state
  3. "earliest" - refers to the genesis block

We should allow pending blocks for the eth_getTransactionCount endpoint only. All the other parts of the current system should treat pending as the latest.

I believe the issue is that Polygon Edge checks the nonce in two different places. In [txpool.validateTx](https://github.com/0xPolygon/polygon-edge/blob/develop/txpool/txpool.go#L672-L676) (using p.store.GetNonce(stateRoot, tx.From) and [txpool.addTx](https://github.com/0xPolygon/polygon-edge/blob/develop/txpool/txpool.go#L824C11-L824C25) (using account.getNonce()). Not sure why it’s tracked in two different places.

When we do check in validate, we assume that txpool does not have any transaction for that specific address. So we are checking if tx nonce is lower than expected nonce from state.
Later on, we load(or create) object where we store all the txpool information for specific address (including nonce). Then we check additionally (if there was already some tx for that address in a pool) correctness of a nonce.
There is no bug here. I just added metrics counter which was missing.

Problem was only because getTransactionCount always returned nonce from the state (even for pending). Nonce from the state and nonce from the txpool account can differ in situations when there is already more than one tx for same address in a pool.
This fix will also fix https://polygon.atlassian.net/browse/EVM-801.

Important note!: two or more consecutive calls to eth_getTransactionCount(address, "pending") can result in same value and that’s is just because nonce updating in txpool is not immediate. So, user of this endpoint must handle this case.

  1. The RPC and Validator agree on nonce
  2. The RPC fails to accept a transaction with nonce too low
  3. The Validator will accept that same transaction
  4. After that transaction has been mined, the validator and RPC still agree on nonce
  5. The RPC will still fail and the validator will succeed in mining

Question @jp have you tried configuring --max-enqueued or --max-slots on the rpc? When I ran your scripts, I saw a lot of maximum number of enqueued transactions reached errors and afterward the RPC would reject txs with the correct nonce. I changed the setup-rpc.sh script to look like this./polygon-edge server --data-dir ./data-rpc/data-1 --chain ./data/genesis.json --grpc-address :50000 --libp2p :30305 --jsonrpc :50002 --price-limit 0 --log-level DEBUG --max-enqueued 100000 --max-slots 10000 > rpc-005.log 2>&1 &The submit_multiple.sh script will still print out some errors but no more errors about max enqueued tx. After cast gives up trying to send the 500 tx, the rpc will work as expected and accept transactions with the correct nonce. So a few thoughts

I agree this is a bug of some kind in the node. The key features might be
It seems to only show up in the RPC / non-validating nodes
It occurs after exceeding the configured queue size
The nonce for that particular account will not be accepted by the node anymore

Mitigations
Requires restarting the node to clear the state

Configure a large queue size
Even with a large queue, I assume the bug could still occur, but it becomes less likely

I didn’t try increasing the account or pool limits. I agree it might help but, as you point out, it probably doesn’t resolve the issue, it just hides it a bit more. I’m also concerned about the memory requirements associated with increasing those limits.I believe the issue is that Polygon Edge checks the nonce in two different places. In txpool.validateTx (using p.store.GetNonce(stateRoot, tx.From) and txpool.addTx (using account.getNonce()). Not sure why it’s tracked in two different places.If you enable prometheus in the RPC node you’ll notice that nonce_too_low_tx metric is not reported, which means that our issue comes from the addTx check (which probably should update the prometheus gauge, but that’s another issue). The addTx check uses the nonce value it finds in account.getNonce(). Calls to account.setNonce happen in account.reset, account.promote and txpool.Drop.My bet is the reproduction steps I shared are causing the issue in txpool.Drop, but all three should be properly reviewed.

Changes include

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)

Breaking changes

It should not be

Checklist

  • I have assigned this PR to myself
  • I have added at least 1 reviewer
  • I have added the relevant labels
  • I have updated the official documentation
  • I have added sufficient documentation in code

Testing

  • I have tested this code with the official test suite
  • I have tested this code manually

Manual tests

  • Start cluster with some premine address
  • From some script send tx to one the nodes this way:
    1. first send txs with nonces: 0, 10, 20, 30, 40
    2. after that send 45 txs with same account but each time set nonce to the value retrieved by eth_getTransactionCount(address, 'pending')
    3. note that two consecutive eth_getTransactionCount calls can return same value, so in that case just wait some time and continue (skip this iteration) with a loop. this is ok, because updating nonce in txpool is not immediate operation, it took some (small) amount of time
    4. after all transactions are sent, we can wait for their receipts
    5. note that eth_getTransactionCount(address, 'latest') will in most case return same value that is smaller than value from eth_getTransactionCount(address, 'pending'). This is totally normal and as soon as all pending txs from txpool are included in some blocks those values will be the same.

I do not think this e2e is necessary but i can write it if needed

@igorcrevar igorcrevar added the bug fix Functionality that fixes a bug label Aug 28, 2023
@igorcrevar igorcrevar requested a review from a team August 28, 2023 11:30
@igorcrevar igorcrevar self-assigned this Aug 28, 2023
@rachit77
Copy link
Contributor

LGTM

@igorcrevar igorcrevar marked this pull request as ready for review August 28, 2023 15:01
@igorcrevar igorcrevar requested a review from a team August 28, 2023 21:50
jsonrpc/eth_endpoint.go Show resolved Hide resolved
jsonrpc/query.go Outdated Show resolved Hide resolved
jsonrpc/codec.go Show resolved Hide resolved
jsonrpc/codec.go Show resolved Hide resolved
jsonrpc/query_test.go Outdated Show resolved Hide resolved
@igorcrevar igorcrevar merged commit 9b5bbe4 into develop Aug 31, 2023
6 checks passed
@igorcrevar igorcrevar deleted the feature/EVM-800_nonce_too_low_from_eth_getTransactionCount branch August 31, 2023 13:27
@github-actions github-actions bot locked and limited conversation to collaborators Aug 31, 2023
@vcastellm
Copy link
Contributor

Late LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug fix Functionality that fixes a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants