Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new acceptance test to soak test BFT chains #7023

Merged
merged 29 commits into from
Jun 11, 2024

Conversation

matthew1001
Copy link
Contributor

@matthew1001 matthew1001 commented Apr 30, 2024

PR description

This PR adds a new long-running acceptance test that is designed as a soak test for BFT chains.

It is disabled by default (i.e. acceptanceTestCliqueBft does not run this test) but can be run as an individual acceptance test with ./gradlew acceptanceTestBftSoak.

The current acceptance test tasks for gradle have not been updated to run the new test to ensure that PRs do not require an hour-long test before being mergeable. It is intended to be run manually by anyone who wants to easily run a soak test of the latest code on a QBFT or IBFT chain. Future discussions with the maintainers might decide to add the test somewhere else in the release pipeline.

By default the test runs for 60 minutes, but can be run for much longer by configuring the acctests.soakTimeMins system property in build.gradle. It has been tested for runs of up to 6 hours so far.

Here's a summary of what the test does:

  1. Creates a 4-validator QBFT chain with berlin fork
  2. Starts the nodes
  3. Deploys a smart contract to create some initial state
  4. Attempts to deploy a shanghai contract and checks that it fails
  5. Lets the chain continue mining empty blocks for a proportion of the total run time (e.g. for a 60 minute run it will do this step for 10 minutes)
    • Every minute it checks that the chain height is increasing, with some tolerance allowed
  6. Stops a node and checks the chain continues to mine blocks
    • Again every minute it checks that the chain height is increased, but with more tolerance for the slower mining rate
  7. Stops another node and checks the the chain now stalls
  8. Starts both of the stopped nodes. Waits for the chain to continue mining blocks again
  9. Stops all of the nodes one-by-one, upgrading them from berlin to london fork
  10. Checks that the chain continues mining new blocks once the nodes have restarted
  11. Does another check that the state of the smart contract deployed at step 3 is correct (periodic updates are made to the state of the contract in between the other steps so the last check is to make sure the state is correct from the latest update)
  12. Upgrades the chain from london to shanghai
  13. Attempts to deploy the shanghai contract that failed at step 4 and checks that it was successful

If the test is configured to run for twice as long (e.g. 2 hours) each of the main steps runs for twice as long. The test cannot be configured to run for < 1 hour as the timings for the nodes resyncing & QBFT rounds being agreed upon after the chain has stalled have minimum reliable values. There's not theoretical upper limit to how long the test can run for.

Future additions I'd like to add to the test:

  • Upgrade the chain from london to shanghai and check that the shanghai contract is deployed successfully Added under the latest commit

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
matthew1001 and others added 3 commits April 30, 2024 18:32
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
…n delays

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
@matthew1001 matthew1001 marked this pull request as ready for review May 1, 2024 07:31
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
matthew1001 and others added 5 commits May 1, 2024 16:20
….sol files

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Copy link
Contributor

@macfarla macfarla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall I like the idea of extra tests that can run on demand. Just a bit nervous about deterministically allocating the ports

Math.abs(name.hashCode() % 60000)
+ 1024
+ 500) // Generate a consistent port for p2p based on node name (+ 500 to avoid
// clashing with RPC port or other nodes with a similar name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing the ports makes me nervous - we've had flaky tests for this reason in the past. Realise it's unlikely but will it be obvious if there is a port conflict at this point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createQbftNode is used by other tests so would not want to introduce any flakiness here even if it's unlikely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree - it's a little difficult to get the choice right re. hard-coding or generating each time. The issue I was hitting was that stopping and starting a node caused it to be restarted with a new p2p node, so either nodes couldn't connect back to the bootnode (if the bootnode had been stopped and restarted) or the other nodes couldn't contact it - I think because its enode had been cached based on the node ID still being the same. The best thing I could think of was to ensure that for a given node name, the port will always be the same. That way if a test creates N nodes of different names you can guarantee they won't clash.

All of the current acceptance tests are still passing with this change in place, so maybe it's OK to leave and see if anyone hits any other issue with test nodes?

An alternative would be to create another version of createQbftNode() that uses fixed ports and one that doesn't. But it feels a little over the top if we're not seeing any port clashes currently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthew1001 Would be a good idea to refactor method with signature public BesuNode createQbftNode(final String name, final int port). Then the existing method can pass 0 to use random port? while new tests can pass calculated ports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'd be happy to do that if we think fixing ports in the way I have done is going to be problematic. I'd personally vote for it being fairly stable (certainly the existing acceptance tests all still pass) but v happy to refactor in this way if the consensus is to do so.

…e the time taken to mine new blocks

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
// in between certain steps. There should be no upper-limit to how long the test is run for
assertThat(getTestDurationMins()).isGreaterThanOrEqualTo(MIN_TEST_TIME_MINS);

final BesuNode minerNode1 = nodeFactory.createNode(besu, "miner1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt this supposed to be createQbftNode??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually ends up being createQbftNode - the createNode call ^^ maps to this definition for the parameterised factory:

public static Stream<Arguments> getFactories() {
    return Stream.of(
        Arguments.of(
            "ibft2",
            new BftAcceptanceTestParameterization(
                BesuNodeFactory::createIbft2Node, BesuNodeFactory::createIbft2NodeWithValidators)),
        Arguments.of(
            "qbft",
            new BftAcceptanceTestParameterization(
                BesuNodeFactory::createQbftNode, BesuNodeFactory::createQbftNodeWithValidators)));  <<<---
  }

Copy link
Contributor

@siladu siladu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the useful PR description and code comments.

I won't approve yet to give the others a chance to re-review.

Also, for reference, just wanted to point out a mainnet/engine api-focussed test that is along similar lines (upgrading to shanghai). For example, this request includes a tx that uses push0:
https://github.com/hyperledger/besu/blob/main/acceptance-tests/tests/src/test/resources/jsonrpc/engine/shanghai/test-cases/14_shanghai_newPayloadV2_push0_tx.json

Obviously very different approach to this PR, but worth highlighting the conceptual duplication. I also think it's too early to attempt any de-duplication.

Longer term, I personally think we could move to something more like this PR for mainnet, although maybe using something like https://github.com/kurtosis-tech/ethereum-package

// solc SimpleStorageShanghai.sol --bin --abi --optimize --overwrite -o .
// then create web3j wrappers with:
// web3j generate solidity -b ./SimpleStorageShanghai.bin -a ./SimpleStorageShanghai.abi -o ../../../../../ -p org.hyperledger.besu.tests.web3j.generated
contract SimpleStorageShanghai {
Copy link
Contributor

@siladu siladu May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but might be interesting to include a shanghai-specific opcode in this contract to be sure that Shanghai EVM code is triggering. Sorry my solidity experience is next to zero, so can't suggest anything, though I imagine some shanghai contract tests exist somewhere in the community and we could steal one!

Copy link
Contributor Author

@matthew1001 matthew1001 May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did check this to make sure SimpleStorage uses PUSH0 if compiled for the shanghai EVM by compiling SimpleStorage with different EVM settings in Remix and checking the op-codes. It does use PUSH0 when compiled for shanghai. Also the test currently checks that deploying SimpleStorageShanghai to the chain when it's still on the berlin fork fails, so I think that confirms it uses PUSH0.

Comment on lines +53 to +54
@ParameterizedTest(name = "{index}: {0}")
@MethodSource("factoryFunctions")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in testing IBFT2 every time as well? Wondering if we could make this QBFT-specific since it's recommended over IBFT2. Or maybe separate them...guess it depends on how these are intended to be run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with just re-using the combinations that all the current BFT acceptance tests use because I think we might still have users on IBFT2 with existing chains who still want to move up to shanghai. I'd personally suggest that we remove IBFT2 from all or none (unless we see issues with the run time for this particular test)

acceptance-tests/tests/shanghai/build.gradle Outdated Show resolved Hide resolved
Comment on lines 20 to 23
publishing {
publications {
mavenJava(MavenPublication) { artifactId 'acceptance-tests-tests' }
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this result in publishing these artifacts to artifactory? We currently publish libs to hyperledger jfrog as part of the release using artifactoryPublish task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I've got to admit added this to avoid a gradle build failure and wasn't quite sure of the side-affects (other than it fixed the build! :) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - think the latest commit sorts this out @siladu. I've removed the publish specs and just disabled jar building for the new shanghai project.

Co-authored-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Matt Whitehead <matthew1001@hotmail.com>
@matthew1001
Copy link
Contributor Author

matthew1001 commented May 3, 2024

Also, for reference, just wanted to point out a mainnet/engine api-focussed test that is along similar lines (upgrading to shanghai). For example, this request includes a tx that uses push0: https://github.com/hyperledger/besu/blob/main/acceptance-tests/tests/src/test/resources/jsonrpc/engine/shanghai/test-cases/14_shanghai_newPayloadV2_push0_tx.json

Yeah there are certainly a couple of different approaches used in the tests currently. The BFT acceptance tests use the web3j gradle extension to compile source .sol files at build time with specific EVM/compiler versions. That feels a little better than having to hard-code pre-compiled contracts into test files, although really I think both are perfectly valid. It does feel like it will be easy to extend to new EVM versions in the future with "simple" gradle changes/additions. The thing I feel most strongly about is having the option to invoke arbitrary numbers of transactions during a test (i.e. not use pre-created transaction payloads with hard-coded nonces in).

matthew1001 and others added 3 commits May 3, 2024 16:12
Signed-off-by: Matt Whitehead <matthew.whitehead@kaleido.io>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
matthew1001 and others added 3 commits May 9, 2024 09:57
… project

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
@macfarla
Copy link
Contributor

hit a failure in starting the cluster in a BFT AT https://github.com/hyperledger/besu/actions/runs/9104377722/job/25028115138

BftMiningAcceptanceTest > shouldStillMineWhenANonProposerNodeFailsAndHasSufficientValidators(String, BftAcceptanceTestParameterization) > 1: ibft2 FAILED

Nothing in the stack trace to tell me whether it's a port conflict.

But - line 450 is waiting for the ports file so that makes me think it is
at app//org.hyperledger.besu.tests.acceptance.dsl.node.BesuNodeRunner.waitForFile(BesuNodeRunner.java:51)
at app//org.hyperledger.besu.tests.acceptance.dsl.node.ProcessBesuNodeRunner.startNode(ProcessBesuNodeRunner.java:450)

That's the only thing stopping me from approving this, that I don't want to add any more flakiness into the ATs.

if you refactor that method @matthew1001 so that port 0 can be used for existing tests, and the deterministic port can be used for the soak test, I'll approve

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
@macfarla
Copy link
Contributor

the "compile" error is this

* What went wrong:
Execution failed for task ':checkMavenCoordinateCollisions'.
> Duplicate maven coordinates detected, .:besu:24.5-develop-be7c665 is used by both :acceptance-tests:tests and :acceptance-tests:tests:shanghai.

Signed-off-by: Matthew Whitehead <matthew1001@gmail.com>
@matthew1001
Copy link
Contributor Author

Yeah was just looking at that issue @macfarla It turned out that the approach I used of setting jar { enabled = false } prevented the shanghai contract from being available on the classpath at test runtime. So I re-instated the jar build, but that has re-introduced the earlier error I was seeing where the acceptance-tests:tests and acceptance-tests:tests:shanghai generate the same maven artifact coordinate.

I don't believe we actual publish acceptance test artifacts to maven, and the coordinate it says is clashing is .:besu:24.5-develop-be7c665, which looks like an incomplete artifact coordinate. The other ones we publish are things like org.hyperledger.besu.internal:rlp:24.5-develop-be7c665.

I've just pushed an update which modifies the coordinate-check task. I'm not a gradle expert so I think I'm probably missing a better/simpler fix.

@matthew1001
Copy link
Contributor Author

Compilation is looking OK again now @macfarla

@macfarla
Copy link
Contributor

Yeah was just looking at that issue @macfarla It turned out that the approach I used of setting jar { enabled = false } prevented the shanghai contract from being available on the classpath at test runtime. So I re-instated the jar build, but that has re-introduced the earlier error I was seeing where the acceptance-tests:tests and acceptance-tests:tests:shanghai generate the same maven artifact coordinate.

I don't believe we actual publish acceptance test artifacts to maven, and the coordinate it says is clashing is .:besu:24.5-develop-be7c665, which looks like an incomplete artifact coordinate. The other ones we publish are things like org.hyperledger.besu.internal:rlp:24.5-develop-be7c665.

I've just pushed an update which modifies the coordinate-check task. I'm not a gradle expert so I think I'm probably missing a better/simpler fix.

@usmansaleem can you review the gradle changes?

@matthew1001 matthew1001 merged commit b1ac5ac into hyperledger:main Jun 11, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants