
Improve Locust performance with high TPS #11352

Closed · Tracked by #11348
aborg-dev opened this issue May 20, 2024 · 11 comments

@aborg-dev (Contributor) commented May 20, 2024

Our experiments show that a single Locust instance struggles to sustain more than 500 TPS. This is unexpected, as the Locust documentation claims a single thread can handle 4000 TPS: https://docs.locust.io/en/2.27.0/increase-performance.html

We need to investigate and fix this discrepancy.

@aborg-dev (Contributor, Author):

This should be the fix: aabf18b
Let's see if we can productionize it.

@MCJOHN974 (Contributor):

A small note: this fix uses FastHttpUser instead of HttpUser, which according to the Locust docs can reach not 4000 but 16000 TPS. In my experience with our setup, switching from HttpUser to FastHttpUser gives around 10-20% extra TPS, not 300%.
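For readers following along, a minimal sketch of what the switch looks like (class and endpoint names here are illustrative, not our actual loadtest code):

```python
from locust import task, between
from locust.contrib.fasthttp import FastHttpUser


class NodeUser(FastHttpUser):  # previously: class NodeUser(HttpUser)
    wait_time = between(0.1, 1)

    @task
    def status(self):
        # FastHttpUser's client mirrors HttpUser's API for simple calls,
        # so the task body is unchanged by the switch.
        self.client.get("/status")
```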

@MCJOHN974 (Contributor):

This difference in ratios probably shows that in our case the bottleneck for Locust is the computation to generate and sign a transaction, not sending it to the node. (I haven't yet read up on what the actual difference between the fast and regular HttpUser in Locust is.)

@MCJOHN974 (Contributor) commented May 20, 2024:

TL;DR: I'm not surprised that generating NEAR transactions is 10 times slower than the "best case scenario".

That sounds reasonable to me: the 4k TPS figure is for a "best case scenario", so their benchmark probably just sends thousands of identical requests like GET / HTTP/1.1 Host: example.com continuously.

In that case their request generation takes ~0 CPU (while for us CPU is the bottleneck for Locust), leaving all CPU for Locust internals. They probably also ran the server on the same Mac and made it as simple as possible, e.g. just sending back HTTP/1.1 200 OK Content-Length: 0.

But maybe I'm wrong.

@aborg-dev (Contributor, Author):

All fair observations - I want to confirm this by running Locust under a profiler.

Here is a sample profile of the send_tx function:

```
Timer unit: 1e-09 s

Total time: 2612.09 s
File: /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py
Function: send_tx at line 268

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   268                                               @line_profiler.profile
   269                                               def send_tx(self, tx: Transaction, locust_name: str) -> dict:
   270                                                   """
   271                                                   Send a transaction and return the result, no retry attempted.
   272                                                   """
   273      3115        4e+11    1e+08     14.2          block_hash = self.final_block_hash()
   274      2937 5597757423.0    2e+06      0.2          signed_tx = tx.sign_and_serialize(block_hash)
   275                                           
   276      2937   10644362.0   3624.2      0.0          meta = self.new_locust_metadata(locust_name)
   277      2937    2838567.0    966.5      0.0          start_perf_counter = time.perf_counter()
   278                                           
   279      2937     717732.0    244.4      0.0          try:
   280      2937     467654.0    159.2      0.0              try:
   281                                                           # To get proper errors on invalid transaction, we need to use sync api first
   282      4222        2e+12    5e+08     85.5                  result = self.post_json(
   283      2937     698936.0    238.0      0.0                      "broadcast_tx_commit",
   284      2937   19713730.0   6712.2      0.0                      [base64.b64encode(signed_tx).decode('utf8')])
   285      1462  162273411.0 110994.1      0.0                  evaluate_rpc_result(result.json())
   286         1        747.0    747.0      0.0              except TxUnknownError as err:
   287                                                           # This means we time out in one way or another.
   288                                                           # In that case, the stateless transaction validation was
   289                                                           # successful, we can now use async API without missing errors.
   290                                                           submit_raw_response = self.post_json(
   291                                                               "broadcast_tx_async",
   292                                                               [base64.b64encode(signed_tx).decode('utf8')])
   293                                                           meta["response_length"] = len(submit_raw_response.text)
   294                                                           submit_response = submit_raw_response.json()
   295                                                           # extract transaction ID from response, it should be "{ "result": "id...." }"
   296                                                           if not "result" in submit_response:
   297                                                               meta["exception"] = RpcError(message="Didn't get a TX ID",
   298                                                                                            details=submit_response)
   299                                                               meta["response"] = submit_response.content
   300                                                           else:
   301                                                               tx.transaction_id = submit_response["result"]
   302                                                               # using retrying lib here to poll until a response is ready
   303                                                               self.poll_tx_result(meta, tx)
   304         1        328.0    328.0      0.0          except NearError as err:
   305                                                       logging.warn(f"marking an error {err.message}, {err.details}")
   306                                                       meta["exception"] = err
   307                                           
   308      4386    4688350.0   1068.9      0.0          meta["response_time"] = (time.perf_counter() -
   309      2924    1080117.0    369.4      0.0                                   start_perf_counter) * 1000
   310                                           
   311                                                   # Track request + response in Locust
   312      1462   77682295.0  53134.3      0.0          self.request_event.fire(**meta)
   313      1462     583757.0    399.3      0.0          return meta

2612.09 seconds - /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py:268 - send_tx
```
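(For reproducibility: this per-line profile comes from line_profiler. A sketch of the setup, assuming line_profiler >= 4.1, where the `line_profiler.profile` decorator seen above is activated via an environment variable:)

```python
# Sketch: decorate the function of interest and run the process with
# LINE_PROFILE=1, e.g.
#   LINE_PROFILE=1 locust -H 127.0.0.1:3030 -f locustfiles/ft.py ...
# Per-line timings are then written out when the process exits.
import line_profiler


@line_profiler.profile
def send_tx(tx, locust_name):
    ...  # body as in base.py above
```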

99% of the wall-clock time is split between two lines:

  1. block_hash = self.final_block_hash()
  2. result = self.post_json("broadcast_tx_commit", ...)

both of which are RPC requests to the node.

Transaction signing takes 0.2% of wall-clock time, which is non-zero but not that significant.

I think the next step is to understand the threading model that Locust is using: if these transaction submissions are executed on a single thread in a blocking fashion, that will surely limit the throughput. Instead, we should be using something like asyncio here.
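For what it's worth, Locust runs each simulated user in a gevent greenlet rather than an OS thread, so a blocking call parks only that one user; still, each user's throughput is capped at 1/latency of its blocking calls. A sketch of firing a request without waiting on it (hypothetical names, and note this drops per-request result tracking):

```python
import gevent
from locust import task
from locust.contrib.fasthttp import FastHttpUser


class FireAndForgetUser(FastHttpUser):

    @task
    def submit(self):
        # Hypothetical: run the slow RPC in its own greenlet so this user
        # can immediately move on to building the next transaction.
        gevent.spawn(self.client.post, "/", json={"method": "send_tx"})
```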

@aborg-dev (Contributor, Author):

To analyse this further: signing takes 2 ms of wall-clock time, so we can do around 500 signatures per second in a Python process, which would be a bottleneck in a single-worker setup. For 16 workers that should not be a problem, though.

It would be nice to improve the performance of signing, but that is probably not the root cause of the limited throughput we see with multiple shards.
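A quick way to sanity-check that number is a micro-benchmark of raw signing throughput; a sketch using PyNaCl (the replacement library adopted later in this thread; the payload size is a rough assumption):

```python
import time

from nacl.signing import SigningKey

key = SigningKey.generate()
payload = b"x" * 200  # assumption: rough size of a serialized transaction

n = 10_000
start = time.perf_counter()
for _ in range(n):
    key.sign(payload)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} signatures/sec")
```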

@aborg-dev (Contributor, Author):

Here is another profile from a longer run with more users:

```
Timer unit: 1e-09 s

Total time: 6.04117e+06 s
File: /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py
Function: send_tx at line 268

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   268                                               @line_profiler.profile
   269                                               def send_tx(self, tx: Transaction, locust_name: str) -> dict:
   270                                                   """
   271                                                   Send a transaction and return the result, no retry attempted.
   272                                                   """
   273   2062925        5e+15    2e+09     80.7          block_hash = self.final_block_hash()
   274   1889490        2e+12    1e+06      0.0          signed_tx = tx.sign_and_serialize(block_hash)
   275                                           
   276   1889490 4870509915.0   2577.7      0.0          meta = self.new_locust_metadata(locust_name)
   277   1889490 1717009524.0    908.7      0.0          start_perf_counter = time.perf_counter()
   278                                           
   279   1889490  341764477.0    180.9      0.0          try:
   280   1889490  230492041.0    122.0      0.0              try:
   281                                                           # To get proper errors on invalid transaction, we need to use sync api first
   282   2528557        1e+15    5e+08     19.3                  result = self.post_json(
   283   1889490  337802006.0    178.8      0.0                      "broadcast_tx_commit",
   284   1889490 8958336492.0   4741.1      0.0                      [base64.b64encode(signed_tx).decode('utf8')])
   285    812412        8e+10 104359.0      0.0                  evaluate_rpc_result(result.json())
   286      2262    1064247.0    470.5      0.0              except TxUnknownError as err:
   287                                                           # This means we time out in one way or another.
   288                                                           # In that case, the stateless transaction validation was
   289                                                           # successful, we can now use async API without missing errors.
   290                                                           submit_raw_response = self.post_json(
   291                                                               "broadcast_tx_async",
   292                                                               [base64.b64encode(signed_tx).decode('utf8')])
   293                                                           meta["response_length"] = len(submit_raw_response.text)
   294                                                           submit_response = submit_raw_response.json()
   295                                                           # extract transaction ID from response, it should be "{ "result": "id...." }"
   296                                                           if not "result" in submit_response:
   297                                                               meta["exception"] = RpcError(message="Didn't get a TX ID",
   298                                                                                            details=submit_response)
   299                                                               meta["response"] = submit_response.content
   300                                                           else:
   301                                                               tx.transaction_id = submit_response["result"]
   302                                                               # using retrying lib here to poll until a response is ready
   303                                                               self.poll_tx_result(meta, tx)
   304      2262     490022.0    216.6      0.0          except NearError as err:
   305      2188  144508862.0  66046.1      0.0              logging.warn(f"marking an error {err.message}, {err.details}")
   306      2188     804307.0    367.6      0.0              meta["exception"] = err
   307                                           
   308   2437236 2564044058.0   1052.0      0.0          meta["response_time"] = (time.perf_counter() -
   309   1624824  519629076.0    319.8      0.0                                   start_perf_counter) * 1000
   310                                           
   311                                                   # Track request + response in Locust
   312    812412        4e+10  49686.0      0.0          self.request_event.fire(**meta)
   313    812412  303569684.0    373.7      0.0          return meta

6041174.94 seconds - /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py:268 - send_tx
```

Here final_block_hash takes a massive 80% of the time and is the first candidate to optimize.

Signing takes 1 ms on average, similar to the previous run.

The remaining 20% is in broadcast_tx_commit, which takes 500 ms on average.

I'll go ahead and try a few local optimizations to improve the single-thread performance.
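One plausible local optimization for final_block_hash (a sketch, not the actual fix): since a NEAR transaction only needs a recent block hash for signing, the hash can be cached briefly instead of fetched via RPC per transaction. `fetch` here is a hypothetical callable wrapping the node RPC:

```python
import time


class CachedBlockHash:

    def __init__(self, fetch, ttl=1.0):
        # `fetch` is a hypothetical zero-argument callable performing the
        # actual RPC; `ttl` trades staleness for fewer round trips.
        self._fetch = fetch
        self._ttl = ttl
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```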

@aborg-dev (Contributor, Author):

For reference, here is the result of running the sampling profiler py-spy in the same setup:

[attached py-spy profile image]

@MCJOHN974 (Contributor):

Andrei, thanks for your profiling! Can you please repeat it on this commit to check whether (and by how much) my changes reduce the time spent in self.post_json?

@aborg-dev (Contributor, Author):

> Andrei, thanks for your profiling! Can you please repeat it on this commit to check whether (and by how much) my changes reduce the time spent in self.post_json?

Yes, I plan to do this as part of the refactoring started in #11364. Concretely, I plan to add an API that lets the Locust user submit a transaction directly without waiting for its completion.
I'm still thinking about how exactly the Locust user should behave to avoid overloading the system with these async transactions, as we lose the backpressure mechanism that comes from waiting for the transaction.
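One way to keep some backpressure with async submissions would be a bounded in-flight counter; a sketch with gevent's BoundedSemaphore (names and the cap are hypothetical):

```python
import gevent
from gevent.lock import BoundedSemaphore

MAX_IN_FLIGHT = 100  # assumption: tune to the node's capacity
in_flight = BoundedSemaphore(MAX_IN_FLIGHT)


def submit_async(post_json, signed_tx_b64):
    # Hypothetical helper: blocks the calling user greenlet once
    # MAX_IN_FLIGHT transactions are awaiting acknowledgement.
    in_flight.acquire()

    def _send():
        try:
            post_json("send_tx", [signed_tx_b64])
        finally:
            in_flight.release()

    gevent.spawn(_send)
```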

github-merge-queue bot pushed a commit that referenced this issue May 28, 2024
We migrate away from `broadcast_tx_commit` and `broadcast_tx_async`
which are now deprecated in favor of `send_tx`:
https://docs.near.org/api/rpc/transactions

I believe this PR should preserve the behavior that we had before; I plan to make any behavior changes in future PRs.

Tested by running:
```sh
locust -H 127.0.0.1:3030 \
  -f locustfiles/ft.py \
  --funding-key=$KEY --headless -u 2000 -r 50 -t 120s
```

This is a part of work on improving the performance of Locust load test
runner: #11352
github-merge-queue bot pushed a commit that referenced this issue May 28, 2024
We use it extensively in Locust benchmarks and it shows as a hotspot. A
single signature takes around 1-2ms on my machine. PyNacl is a
recommended replacement for python-ed25519 that is supposed to be 10-20
times faster for signatures.

This is a part of #11352

I've followed this guide:
https://github.com/warner/python-ed25519?tab=readme-ov-file#migrating-to-pynacl
@aborg-dev (Contributor, Author):

We can now generate 1000 TPS with a single Locust process at 0.8 CPU core usage, which should be good enough for our benchmarks.
