
Show that locust can saturate the gas limits and describe chain behavior #9118

Closed
Tracked by #8999
jakmeier opened this issue May 26, 2023 · 6 comments

@jakmeier
Contributor

jakmeier commented May 26, 2023

Running locust-based load tests as described in #8999, we want to observe a test where at least one shard (ideally all shards) has full chunks for an extended period of time.

This will prove that gas becomes the bottleneck before any bottleneck in the test setup prevents more traffic.
And it will show what we should expect in a congestion case today.
This is a prerequisite for #8920.

@jakmeier jakmeier added the A-congestion Work aimed at ensuring good system performance under congestion label May 26, 2023
@jakmeier jakmeier self-assigned this May 26, 2023
@jakmeier
Contributor Author

I'm still unable to get blocks with >1000 Tgas :(

So far I have resolved two problems that prevented me from hitting the gas limit:

  • Load generator CPU bottleneck: Swarming locust across enough worker processes (I ended up using 32), where each has its own funding account to avoid nonce collisions. (resolved by feat: swarmable FT loadtest #9111)
    • This problem shows up as a warning by locust that CPU usage is > 90%
  • RPC node bottleneck: If all requests go through the same RPC node, this node becomes a bottleneck in accepting more TXs. (resolved by using different -H args for different workers; see the launcher sketch after this list)
    • This problem shows up as TIMEOUT_ERROR on the requests to RPC nodes, reported as 'No result returned' in locust statistics
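
For reference, a minimal sketch of how such a swarmed run can be launched with the standard locust master/worker CLI. The locustfile path and RPC endpoints are placeholders, and all nearcore-specific options (funding account, FT contract setup) are omitted; it only illustrates spreading workers over several RPC nodes via different -H targets, as described above:

```python
# Hypothetical launcher: one locust master plus N workers, with the workers
# spread round-robin over several RPC nodes via different -H targets.
import subprocess

RPC_NODES = ["http://127.0.0.1:3030", "http://127.0.0.1:3031"]  # placeholder RPC endpoints
N_WORKERS = 32
LOCUSTFILE = "ft.py"  # placeholder path to the FT locustfile

procs = [subprocess.Popen([
    "locust", "-f", LOCUSTFILE, "--master", "--headless",
    "-u", "6000", "-r", "100", "--expect-workers", str(N_WORKERS),
])]
for i in range(N_WORKERS):
    # Each worker gets its own target host, so no single RPC node has to
    # accept every transaction.
    procs.append(subprocess.Popen([
        "locust", "-f", LOCUSTFILE, "--worker",
        "-H", RPC_NODES[i % len(RPC_NODES)],
    ]))
for p in procs:
    p.wait()
```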

But even with that, I am not able to saturate even a single shard the way I was hoping to.

I ran a 4-shard, 4-node localnet and locust with 6000 users spread across 32 workers, each with 4 separate FT contracts. The 32 workers send their requests to 2 different RPC nodes. This setup peaked around 900 TPS, at only about 75% of gas capacity on each shard (evenly distributed).

Note: 900 TPS peak throughput corresponds to 900 * ~5 Tgas = 4500 Tgas per second. With a block time of 1.3s, that means 4500 Tgas / 1.3s = 3461 Tgas / block, which is about 86% of the 4000 Tgas capacity.
The expected throughput at 100% gas capacity would be around 1050 TPS.
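
Taking the note's ~86% utilization figure at face value, the ~1050 TPS estimate follows from scaling the observed throughput by the observed gas utilization:

```python
# Rough sanity check of the estimate above (numbers copied from the note).
observed_tps = 900
gas_utilization = 3461 / 4000            # ~86% of the 4000 Tgas per block
full_chunk_tps = observed_tps / gas_utilization
print(round(full_chunk_tps))             # ~1040, i.e. roughly the 1050 TPS quoted above
```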

[screenshots: locust charts of request throughput and response times]

The response time going up significantly starting at around 3800 users suggests that we are hitting a bottleneck there. But that point corresponds to only about 750 TPS, far below the 1050 TPS I want to see. So I need to figure out what the current bottleneck is. Trying more than 2 RPC nodes next.

cc @akhi3030 maybe you have some ideas regarding the bottleneck, or see flaws in my reasoning?

@jakmeier
Contributor Author

I've repeated the experiment with more RPC nodes - same results.

Then I ran with just a single shard (thanks @akhi3030 for the idea!). This time I was hitting a limit at around 900 users, again with chunks never filling up; they are stuck at around 750 Tgas.

But after that, I figured out one big factor: compute costs! FT calls do a decent number of storage requests, which means they are charged a higher compute cost than gas cost. Removing the compute cost parameters gives me almost full chunks, but sadly still not quite.

With ~4200 users I'm getting close to ~900 TPS while the median response time stays mostly stable at 2.5s.
Going up all the way to 7000 users, I see short spikes of up to 1000 TPS and chunks filled up to 910 Tgas. The median response time goes up to ~5.5s, so things must be queuing up somewhere. But it's still not quite the gas limit we are hitting.
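
To make the compute cost effect concrete, here is a toy calculation. The per-transaction gas and compute numbers below are made up for illustration, not measured from this test; the point is only that a chunk stops accepting work once either the gas or the compute usage reaches the limit:

```python
# Toy model: a chunk has a 1000 Tgas limit and, since compute costs were
# introduced, also a compute limit of the same size. Storage-heavy receipts
# are charged more compute than gas, so the chunk fills on compute first and
# looks only partially full when measured in gas.
CHUNK_LIMIT = 1000  # Tgas, and equally Tcompute

def gas_in_full_chunk(gas_per_tx, compute_per_tx):
    txs = int(CHUNK_LIMIT // max(gas_per_tx, compute_per_tx))
    return txs * gas_per_tx

# Made-up example: a receipt burning 2.5 Tgas but charged 3.3 Tcompute.
print(gas_in_full_chunk(2.5, 3.3))  # -> 757.5 Tgas in a chunk that is actually full
```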

Next week I'll integrate it with Prometheus and Grafana to get more data about what the nodes are doing.
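
In the meantime, a quick way to peek at what the node already exports, assuming it serves Prometheus metrics on its HTTP port (the default 3030 here), is to dump the gas-related lines without a full Prometheus/Grafana setup:

```python
# List gas-related metrics from a local node (sketch; the port and the
# availability of a /metrics endpoint are assumptions about the setup).
import urllib.request

text = urllib.request.urlopen("http://127.0.0.1:3030/metrics").read().decode()
for line in text.splitlines():
    if "gas" in line and not line.startswith("#"):
        print(line)
```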

@bowenwang1996
Collaborator

@jakmeier you mentioned that you used 2 rpc nodes for the test and I wonder whether that is enough. Would it help if there are more rpc nodes to distribute the rpc request load?

@jakmeier
Contributor Author

> @jakmeier you mentioned that you used 2 rpc nodes for the test and I wonder whether that is enough. Would it help if there are more rpc nodes to distribute the rpc request load?

Yes, that was one experiment.

> I've repeated the experiment with more RPC nodes - same results.

This was with 4 RPC nodes, and I even tested with a single shard. I think that should be enough to rule out this bottleneck in this particular setup. But for the final benchmark, it would be good to have at least as many RPC nodes as shards.

@jakmeier
Contributor Author

jakmeier commented Jun 6, 2023

While I'm working on running this on top of testnet state, @Akashin was already able to saturate chunks with gas last week: #8920 (comment)

But that's with larger receipts. We still want to show it with many small receipts, too.

@jakmeier
Contributor Author

Filling chunks to the limit using locust load has been demonstrated multiple times by now. With the new metrics, it is also easy to observe. Hence I am going to mark this issue as completed.

A few comments on how to check for "full" chunks:

  • When running on a testnet / mainnet fork, there is going to be no traffic on shard 1 (Aurora's shard)
  • When running with the SocialDB workload, one shard (shard 3 with the testnet/mainnet sharding layout) will be the bottleneck.
  • The compute cost limit is hit before the gas limit.
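
For a quick gas-level check, the chunk headers returned by the JSON-RPC block endpoint already contain gas_used and gas_limit, so something like the sketch below works without extra tooling. The RPC address is a placeholder; note this only covers gas, since compute usage is not part of the chunk header and has to come from the node metrics instead:

```python
# Print how full (in gas) each shard's chunk is in the latest final block.
import json
import urllib.request

RPC = "http://127.0.0.1:3030"  # placeholder: any RPC node of the network under test

req = urllib.request.Request(
    RPC,
    data=json.dumps({
        "jsonrpc": "2.0", "id": "dontcare",
        "method": "block", "params": {"finality": "final"},
    }).encode(),
    headers={"Content-Type": "application/json"},
)
block = json.loads(urllib.request.urlopen(req).read())["result"]
for chunk in block["chunks"]:
    used, limit = int(chunk["gas_used"]), int(chunk["gas_limit"])
    print(f"shard {chunk['shard_id']}: {used / 1e12:.0f}/{limit / 1e12:.0f} Tgas "
          f"({100 * used / limit:.0f}% full)")
```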

As an example, below are the compute cost heatmaps for all 4 shards, where only shard 3 is at capacity.

[screenshots: compute cost heatmaps, one per shard]
