Add new RPC stress testing tool (lotus-bench rpc) with rich reporting #10761

fridrik01 · 2023-04-26T10:05:01Z

Fixes: #10752
Fixes: https://github.com/filecoin-project/fvm-pm/issues/494.

Context

We need a more elaborate tool to stress test our RPC methods in order to address and fix the reported performance issues (for example #10670, #10539, #10540, #10541, #10663).

This PR implements such a tool (lotus-bench rpc) with the following features:

Can query each method both sequentially and concurrently
Supports rate limiting
Can query multiple different endpoints at once (supporting different concurrency level and rate limiting for each method)
Gives a nice reporting summary of the stress testing of each method (including latency distribution, histogram, errors (http and json error codes) and more)
Supports a --watch option that prints intermediate progress, which is useful for long-running benchmarks
Easy to use

NOTE: Right now everything is within a single source file (rpc.go) but can be easily refactored and split into multiple files and moved into its own package.

NOTE: To support any type of PARAMS we need to be able to pass a comma (,) on the command line. This requires upgrading urfave, which added support for that via the DisableSliceFlagSeparator flag. However, upgrading urfave brings regressions in how it generates --help output, and it no longer displays categories in subcommands. I raised this issue in urfave and will update the urfave dependency once it is fixed, then explicitly set DisableSliceFlagSeparator so we can support any type of PARAMS.

Test plan

Build:

make lotus-bench

Stress test eth_chainId using default options:

lotus-bench rpc --method='eth_chainId'
[eth_chainId]:
- Options:
  - concurrency: 10
  - params: []
  - qps: 0
- Total Requests: 3920235
- Total Duration: 59992ms
- Requests/sec: 65345.869940
- Avg latency: 0ms
- Median latency: 0ms
- Latency distribution:
    10.00% in 0ms
    50.00% in 0ms
    90.00% in 0ms
    95.00% in 0ms
    99.00% in 1ms
    99.90% in 1ms
- Histogram:
     0-1ms|  3918201|################################################################################################### (99.95%)
     1-2ms|     1044| (0.03%)
     2-3ms|      421| (0.01%)
     3-4ms|      265| (0.01%)
     4-5ms|      132| (0.00%)
     5-6ms|       84| (0.00%)
     6-7ms|       33| (0.00%)
     7-8ms|       29| (0.00%)
     8-9ms|       12| (0.00%)
    9-14ms|       14| (0.00%)
- Status codes:
    [200]: 3920235
- Errors (top 10):
    [nil]: 3920235

Now let's try stress testing the eth_getTransactionCount rpc method for 120 seconds using the specified rpc method params:

lotus-bench rpc --duration=120s --method='eth_getTransactionCount:::["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]' 
[eth_getTransactionCount]:
- Options:
  - concurrency: 10
  - params: ["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]
  - qps: 0
- Total Requests: 3294912
- Total Duration: 119992ms
- Requests/sec: 27459.420012
- Avg latency: 0ms
- Median latency: 0ms
- Latency distribution:
    10.00% in 0ms
    50.00% in 0ms
    90.00% in 0ms
    95.00% in 0ms
    99.00% in 0ms
    99.90% in 9ms
- Histogram:
       0-16ms|  3294108|################################################################################################### (99.98%)
      16-32ms|      467| (0.01%)
      32-48ms|       90| (0.00%)
      48-64ms|       64| (0.00%)
      64-80ms|       43| (0.00%)
      80-96ms|       40| (0.00%)
     96-112ms|       49| (0.00%)
    112-128ms|       16| (0.00%)
    128-144ms|       10| (0.00%)
    144-165ms|       25| (0.00%)
- Status codes:
    [200]: 3294912
- Errors (top 10):
    [nil]: 3294912

Now let's try stress testing both eth_chainId and eth_getTransactionCount at the same time.

eth_chainId will be stress tested using 5 concurrent workers limited to 1000 queries per second, and
eth_getTransactionCount will be stress tested using 10 concurrent workers limited to 2000 queries per second:

lotus-bench rpc --duration=10s --method='eth_chainId:5:1000'  --method='eth_getTransactionCount:10:2000:["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]' 
[eth_chainId]:
- Options:
  - concurrency: 5
  - params: []
  - qps: 1000
- Total Requests: 9447
- Total Duration: 10000ms
- Requests/sec: 944.689930
- Avg latency: 0ms
- Median latency: 0ms
- Latency distribution:
    10.00% in 0ms
    50.00% in 0ms
    90.00% in 0ms
    95.00% in 0ms
    99.00% in 0ms
    99.90% in 2ms
- Histogram:
      0-2ms|  9438|################################################################################################### (99.90%)
      2-4ms|     3| (0.03%)
      4-6ms|     0| (0.00%)
      6-8ms|     1| (0.01%)
     8-10ms|     0| (0.00%)
    10-12ms|     0| (0.00%)
    12-14ms|     0| (0.00%)
    14-16ms|     2| (0.02%)
    16-18ms|     1| (0.01%)
    18-20ms|     2| (0.02%)
- Status codes:
    [200]: 9447
- Errors (top 10):
    [nil]: 9447

[eth_getTransactionCount]:
- Options:
  - concurrency: 10
  - params: ["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]
  - qps: 2000
- Total Requests: 11415
- Total Duration: 10000ms
- Requests/sec: 1141.477942
- Avg latency: 0ms
- Median latency: 0ms
- Latency distribution:
    10.00% in 0ms
    50.00% in 0ms
    90.00% in 2ms
    95.00% in 6ms
    99.00% in 14ms
    99.90% in 50ms
- Histogram:
      0-5ms|  10722|############################################################################################# (93.93%)
     5-10ms|    424|### (3.71%)
    10-15ms|    187|# (1.64%)
    15-20ms|     44| (0.39%)
    20-25ms|     14| (0.12%)
    25-30ms|      4| (0.04%)
    30-35ms|      0| (0.00%)
    35-40ms|      0| (0.00%)
    40-45ms|      0| (0.00%)
    45-55ms|     20| (0.18%)
- Status codes:
    [200]: 11415
- Errors (top 10):
    [nil]: 11415

Test that errors are reported correctly for both HTTP and JSON errors. In this example the params given to eth_estimateGas are invalid, so a JSON response is returned with an error message. After running this for 2 seconds I killed lotus, and the tool then correctly reported HTTP errors for the remaining requests:

lotus-bench rpc --method='eth_estimateGas:1:1:[{"to": "0x7B90337f65fAA2B2B8ed583ba1Ba6EB0C9D7eA44"}]' --duration=5s
- Options:
  - concurrency: 1
  - params: [{"to": "0x7B90337f65fAA2B2B8ed583ba1Ba6EB0C9D7eA44"}]
  - qps: 1
- Total Requests: 5
- Total Duration: 5000ms
- Requests/sec: 0.999891
- Avg latency: 560ms
- Median latency: 0ms
- Latency distribution:
    10.00% in 0ms
    50.00% in 0ms
    90.00% in 1633ms
    95.00% in 1633ms
    99.00% in 1633ms
    99.90% in 1633ms
- Histogram:
        0-163ms|  3|############################################################ (60.00%)
      163-326ms|  0| (0.00%)
      326-489ms|  0| (0.00%)
      489-652ms|  0| (0.00%)
      652-815ms|  0| (0.00%)
      815-978ms|  0| (0.00%)
     978-1141ms|  0| (0.00%)
    1141-1304ms|  1|#################### (20.00%)
    1304-1467ms|  0| (0.00%)
    1467-1633ms|  1|#################### (20.00%)
- Status codes:
    [200]: 2
- Errors (top 10):
    [HTTP error: Post "http://127.0.0.1:1234/rpc/v1": dial tcp 127.0.0.1:1234: connect: connection refused]: 3
    [JSON error: code:1, message:failed to estimate gas: message execution failed: exit 33, revert reason: none, vm error: message failed with backtrace:00: f02064481 (method 3844450837) -- contract reverted (33) (RetCode=33)]: 2

snissn · 2023-05-02T18:26:44Z

This looks good to me, the code works, and is very helpful for debugging. @arajasek are there any additional steps or checks we should take before this is approved?

snissn · 2023-05-02T18:27:26Z

@fridrik01 there is a failing test -- https://app.circleci.com/pipelines/github/filecoin-project/lotus/28376/workflows/553452e4-b449-4ff4-b750-4eb4cca0b41a/jobs/950691

fridrik01 · 2023-05-03T11:12:33Z

@fridrik01 there is a failing test -- https://app.circleci.com/pipelines/github/filecoin-project/lotus/28376/workflows/553452e4-b449-4ff4-b750-4eb4cca0b41a/jobs/950691

Ok, the upgrade of urfave/cli/v2 changed the --help output so I needed to run make docsgen-cli to update them. The changes I can see in the output are the following:

Categories for subcommands for some reason are not shown anymore. I only see this used for lotus client --help though so it may not be a big issue
Options are now sorted by the order they are added in the code instead of by name (which should actually be better IMO).

magik6k

Code looks good, just a few non-blocking nitpicks.

Not sure why we need to update urfave/cli here, it does seem to break groups in helptext - that should be either fixed or we should drop the update from this PR.

documentation/en/cli-lotus.md

go.mod

cmd/lotus-bench/rpc.go

fridrik01 · 2023-05-27T10:52:10Z

I have removed the urfave upgrade from this PR. It does mean that we don't support benchmarking all RPC methods, but we can at least land this and wait for the issue to be fixed upstream in urfave (issue here and fix here).

cc: @magik6k @snissn

This benchmark is designed to stress test the rpc methods of a lotus node so that we can simulate real world usage and measure the performance of rpc methods on the node.

This benchmark has the following features:
* Can query each method both sequentially and concurrently
* Supports rate limiting
* Can query multiple different endpoints at once (supporting different concurrency level and rate limiting for each method)
* Gives a nice reporting summary of the stress testing of each method (including latency distribution, histogram and more)
* Easy to use

To use this benchmark you must specify the rpc methods you want to test using the --method options, the format of it is:

--method=NAME[:CONCURRENCY][:QPS][:PARAMS] where only NAME is required.

Here are some real examples:
lotus-bench rpc --method='eth_chainId' // run eth_chainId with default concurrency and qps
lotus-bench rpc --method='eth_chainId:3' // override concurrency to 3
lotus-bench rpc --method='eth_chainId::100' // override to 100 qps while using default concurrency
lotus-bench rpc --method='eth_chainId:3:100' // run using 3 workers but limit to 100 qps
lotus-bench rpc --method='eth_getTransactionCount:::["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]' // run using optional params while using default concurrency and qps
lotus-bench rpc --method='eth_chainId' --method='eth_getTransactionCount:10:0:["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"]' // run multiple methods at once

Fixes: #10752
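
The --method format from the commit message, NAME[:CONCURRENCY][:QPS][:PARAMS], could be parsed roughly as follows. This is a sketch, not the parser in rpc.go: the methodSpec type and parseMethod function are hypothetical, and the defaults (concurrency 10, qps 0, params []) are taken from the "Options" lines in the reports above. Splitting into at most four parts keeps any colons inside PARAMS intact, which matters for JSON objects such as {"to": "0x..."}.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// methodSpec is a hypothetical container for one parsed --method value.
type methodSpec struct {
	Name        string
	Concurrency int
	QPS         int
	Params      string
}

// parseMethod parses NAME[:CONCURRENCY][:QPS][:PARAMS]. Empty optional
// fields keep their defaults, so "eth_chainId::100" only overrides QPS.
func parseMethod(spec string) (methodSpec, error) {
	m := methodSpec{Concurrency: 10, QPS: 0, Params: "[]"}

	// At most 4 parts, so colons inside PARAMS are not split further.
	parts := strings.SplitN(spec, ":", 4)
	if parts[0] == "" {
		return m, errors.New("method name is required")
	}
	m.Name = parts[0]

	if len(parts) > 1 && parts[1] != "" {
		c, err := strconv.Atoi(parts[1])
		if err != nil {
			return m, fmt.Errorf("invalid concurrency %q: %w", parts[1], err)
		}
		m.Concurrency = c
	}
	if len(parts) > 2 && parts[2] != "" {
		q, err := strconv.Atoi(parts[2])
		if err != nil {
			return m, fmt.Errorf("invalid qps %q: %w", parts[2], err)
		}
		m.QPS = q
	}
	if len(parts) > 3 && parts[3] != "" {
		m.Params = parts[3]
	}
	return m, nil
}

func main() {
	for _, s := range []string{
		"eth_chainId",
		"eth_chainId::100",
		`eth_estimateGas:1:1:[{"to": "0x7B90337f65fAA2B2B8ed583ba1Ba6EB0C9D7eA44"}]`,
	} {
		m, err := parseMethod(s)
		fmt.Printf("%+v err=%v\n", m, err)
	}
}
```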

fridrik01 requested review from raulk and maciejwitowski April 26, 2023 10:05

fridrik01 force-pushed the 10752-bench-rpc branch 3 times, most recently from fe614fc to 9cb1ef2 Compare April 26, 2023 12:07

fridrik01 requested a review from snissn April 26, 2023 12:07

fridrik01 marked this pull request as ready for review April 26, 2023 12:25

fridrik01 requested a review from a team as a code owner April 26, 2023 12:25

fridrik01 force-pushed the 10752-bench-rpc branch 7 times, most recently from 1161d5e to 77b04d1 Compare April 26, 2023 19:46

fridrik01 force-pushed the 10752-bench-rpc branch from 3dfcbd1 to 77b04d1 Compare May 3, 2023 10:02

magik6k reviewed May 9, 2023

fridrik01 force-pushed the 10752-bench-rpc branch from b2592d4 to 120531b Compare May 10, 2023 21:36

fridrik01 force-pushed the 10752-bench-rpc branch from 120531b to fc309a5 Compare May 27, 2023 10:30

fridrik01 requested a review from a team May 27, 2023 10:52

fridrik01 added 6 commits May 27, 2023 10:55

small fixes

d0e9502

Also report on json errors (not only http errors)

dcc72a4

Add --watch option to see progress while benchmark is running

4b0ca30

Address review comments

b563a36

Cleanup after removing urface upgrade

e1b69f8

fridrik01 force-pushed the 10752-bench-rpc branch from fc309a5 to e1b69f8 Compare May 27, 2023 10:55

magik6k approved these changes May 30, 2023

magik6k merged commit 1205024 into master May 30, 2023

magik6k deleted the 10752-bench-rpc branch May 30, 2023 17:37

fridrik01 mentioned this pull request Jun 3, 2023

Upgrade urfave dependency which now supports DisableSliceFlagSeparato… #10950

Merged

snissn mentioned this pull request Jul 27, 2023

Add RPC Stress Testing Tool (lotus-bench rpc) to lotus-ec2-tools Testing and Server Deployment Framework #11102

Open

9 tasks

fridrik01 mentioned this pull request Sep 14, 2023

feat: Add lotus-bench cli option to stress test any binary #11270

Merged

8 tasks
