
Show that locust can saturate the gas limits and describe chain behavior #9118

Closed
Tracked by #8999
jakmeier opened this issue May 26, 2023 · 6 comments

@jakmeier
Contributor

jakmeier commented May 26, 2023

Running locust-based load tests as described in #8999, we want to observe a test where at least one shard (ideally all shards) has full chunks for an extended period of time.

This will prove that gas becomes the bottleneck before any bottleneck in the test setup prevents more traffic.
And it will show what we should expect in a congestion case today.
This is a prerequisite for #8920.

@jakmeier jakmeier added the A-congestion Work aimed at ensuring good system performance under congestion label May 26, 2023
@jakmeier jakmeier self-assigned this May 26, 2023
@jakmeier
Contributor Author

I'm still unable to get blocks with >1000 Tgas :(

So far I have resolved two problems that prevented me from hitting the gas limit:

  • Load generator CPU bottleneck: Swarming locust across enough worker processes (I ended up using 32), where each has its own funding account to avoid nonce collisions. (resolved by feat: swarmable FT loadtest #9111)
    • This problem shows up as a warning by locust that CPU usage is > 90%
  • RPC node bottleneck: If all requests go through the same RPC node, this node becomes a bottleneck in accepting more TXs. (resolved by using different -H args for different workers; see the launcher sketch after this list)
    • This problem shows up as TIMEOUT_ERROR on the requests to RPC nodes, reported as 'No result returned' in locust statistics
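
For reference, a minimal sketch of how such a swarmed run can be launched with the standard locust master/worker CLI. The locustfile path and RPC endpoints are placeholders, and all nearcore-specific options (funding account, FT contract setup) are omitted; it only illustrates spreading workers over several RPC nodes via different -H targets, as described above:

```python
# Hypothetical launcher: one locust master plus N workers, with the workers
# spread round-robin over several RPC nodes via different -H targets.
import subprocess

RPC_NODES = ["http://127.0.0.1:3030", "http://127.0.0.1:3031"]  # placeholder RPC endpoints
N_WORKERS = 32
LOCUSTFILE = "ft.py"  # placeholder path to the FT locustfile

procs = [subprocess.Popen([
    "locust", "-f", LOCUSTFILE, "--master", "--headless",
    "-u", "6000", "-r", "100", "--expect-workers", str(N_WORKERS),
])]
for i in range(N_WORKERS):
    # Each worker gets its own target host, so no single RPC node has to
    # accept every transaction.
    procs.append(subprocess.Popen([
        "locust", "-f", LOCUSTFILE, "--worker",
        "-H", RPC_NODES[i % len(RPC_NODES)],
    ]))
for p in procs:
    p.wait()
```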

But even with that, I am not able to saturate even a single shard the way I was hoping to.

I ran a 4-shard, 4-node localnet and locust with 6000 users spread across 32 workers, each with 4 separate FT contracts. The 32 workers send their requests to 2 different RPC nodes. This setup peaked around 900 TPS, at only about 75% of gas capacity on each shard (evenly distributed).

Note: 900 TPS peak throughput corresponds to 900 * ~5 Tgas = 4500 Tgas per second. With a block time of 1.3s, that means 4500 Tgas / 1.3s = 3461 Tgas / block, which is about 86% of the 4000 Tgas capacity.
The expected throughput at 100% gas capacity would be around 1050 TPS.
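
Taking the note's ~86% utilization figure at face value, the ~1050 TPS estimate follows from scaling the observed throughput by the observed gas utilization:

```python
# Rough sanity check of the estimate above (numbers copied from the note).
observed_tps = 900
gas_utilization = 3461 / 4000            # ~86% of the 4000 Tgas per block
full_chunk_tps = observed_tps / gas_utilization
print(round(full_chunk_tps))             # ~1040, i.e. roughly the 1050 TPS quoted above
```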

[screenshots: locust charts of request throughput and response times]

The response time going up significantly starting at around 3800 users suggests that we are hitting a bottleneck there. But that point corresponds to only about 750 TPS, far below the 1050 TPS I want to see. So I need to figure out what the current bottleneck is. Trying more than 2 RPC nodes next.

cc @akhi3030 maybe you have some ideas regarding the bottleneck, or see flaws in my reasoning?

@jakmeier
Contributor Author

I've repeated the experiment with more RPC nodes - same results.

Then I ran with just a single shard (thanks @akhi3030 for the idea!). This time I was hitting a limit at around 900 users, again with chunks never filling up; they are stuck at around 750 Tgas.

But after that, I figured out one big factor: compute costs! FT calls do a decent number of storage requests, which means they are charged a higher compute cost than gas cost. Removing the compute cost parameters gives me almost full chunks, but sadly still not quite.

With ~4200 users I'm getting close to ~900 TPS while the median response time stays mostly stable at 2.5s.
Going up all the way to 7000 users, I see short spikes of up to 1000 TPS and chunks filled up to 910 Tgas. The median response time goes up to ~5.5s, so things must be queuing up somewhere. But it's still not quite the gas limit we are hitting.
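
To make the compute cost effect concrete, here is a toy calculation. The per-transaction gas and compute numbers below are made up for illustration, not measured from this test; the point is only that a chunk stops accepting work once either the gas or the compute usage reaches the limit:

```python
# Toy model: a chunk has a 1000 Tgas limit and, since compute costs were
# introduced, also a compute limit of the same size. Storage-heavy receipts
# are charged more compute than gas, so the chunk fills on compute first and
# looks only partially full when measured in gas.
CHUNK_LIMIT = 1000  # Tgas, and equally Tcompute

def gas_in_full_chunk(gas_per_tx, compute_per_tx):
    txs = int(CHUNK_LIMIT // max(gas_per_tx, compute_per_tx))
    return txs * gas_per_tx

# Made-up example: a receipt burning 2.5 Tgas but charged 3.3 Tcompute.
print(gas_in_full_chunk(2.5, 3.3))  # -> 757.5 Tgas in a chunk that is actually full
```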

Next week I'll integrate it with Prometheus and Grafana to get more data about what the nodes are doing.
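
In the meantime, a quick way to peek at what the node already exports, assuming it serves Prometheus metrics on its HTTP port (the default 3030 here), is to dump the gas-related lines without a full Prometheus/Grafana setup:

```python
# List gas-related metrics from a local node (sketch; the port and the
# availability of a /metrics endpoint are assumptions about the setup).
import urllib.request

text = urllib.request.urlopen("http://127.0.0.1:3030/metrics").read().decode()
for line in text.splitlines():
    if "gas" in line and not line.startswith("#"):
        print(line)
```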

@bowenwang1996
Collaborator

@jakmeier you mentioned that you used 2 rpc nodes for the test and I wonder whether that is enough. Would it help if there are more rpc nodes to distribute the rpc request load?

@jakmeier
Contributor Author

> @jakmeier you mentioned that you used 2 rpc nodes for the test and I wonder whether that is enough. Would it help if there are more rpc nodes to distribute the rpc request load?

Yes, that was one experiment.

> I've repeated the experiment with more RPC nodes - same results.

This was with 4 RPC nodes, and I even tested with a single shard. I think that should be enough to rule out this bottleneck in this particular setup. But for the final benchmark, it would be good to have at least as many RPC nodes as shards.

@jakmeier
Contributor Author

jakmeier commented Jun 6, 2023

While I'm working on running this on top of testnet state, @Akashin was already able to saturate chunks with gas last week: #8920 (comment)

But that's with larger receipts. We still want to show it with many small receipts, too.

@jakmeier
Contributor Author

Filling chunks to the limit using locust load has been demonstrated multiple times by now. With the new metrics, it is also easy to observe. Hence I am going to mark this issue as completed.

A few comments on how to check for "full" chunks:

  • When running on a testnet / mainnet fork, there is going to be no traffic on shard 1 (Aurora's shard)
  • When running with the SocialDB workload, one shard (shard 3 with the testnet/mainnet sharding layout) will be the bottleneck.
  • The compute cost limit is hit before the gas limit.
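
For a quick gas-level check, the chunk headers returned by the JSON-RPC block endpoint already contain gas_used and gas_limit, so something like the sketch below works without extra tooling. The RPC address is a placeholder; note this only covers gas, since compute usage is not part of the chunk header and has to come from the node metrics instead:

```python
# Print how full (in gas) each shard's chunk is in the latest final block.
import json
import urllib.request

RPC = "http://127.0.0.1:3030"  # placeholder: any RPC node of the network under test

req = urllib.request.Request(
    RPC,
    data=json.dumps({
        "jsonrpc": "2.0", "id": "dontcare",
        "method": "block", "params": {"finality": "final"},
    }).encode(),
    headers={"Content-Type": "application/json"},
)
block = json.loads(urllib.request.urlopen(req).read())["result"]
for chunk in block["chunks"]:
    used, limit = int(chunk["gas_used"]), int(chunk["gas_limit"])
    print(f"shard {chunk['shard_id']}: {used / 1e12:.0f}/{limit / 1e12:.0f} Tgas "
          f"({100 * used / limit:.0f}% full)")
```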

As an example, below are the compute cost heatmaps for all 4 shards, where only shard 3 is at capacity.

[screenshots: compute cost heatmaps, one per shard]
