
Sudden drop in TPS after total 14k transactions #17447

Closed
drandreaskrueger opened this issue Aug 20, 2018 · 30 comments

@drandreaskrueger

Hello geth team, perhaps you have ideas where this sudden drop in TPS in quorum might come from:
Consensys/quorum#479 (comment)

System information

Geth version: quorum Geth/v1.7.2-stable-3f1817ea/linux-amd64/go1.10.1
OS & Version: Linux

Expected behaviour

constant TPS

Actual behaviour

For the first 14k transactions, we see averages of ~465 TPS, but then it suddenly goes down considerably, to an average of ~270 TPS.

Steps to reproduce the behaviour

See https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md

@karalabe
Member

We've made 14 releases to go-ethereum since then, and I've no idea what modifications quorum made to our code. I'd gladly debug Ethereum-related issues, but I can't help debug a codebase that's a derivative of ours.

@drandreaskrueger
Author

Thx.

Not asking for debugging, but asking for ideas about what else to try out.

Not a single idea crossed your mind when you saw that?

@karalabe
Member

karalabe commented Aug 20, 2018 via email

@drandreaskrueger
Author

Would be nice to see how it behaves on top of the latest go-ethereum code.

Yes.

For that I am looking for something like the quorum 7nodes example or the blk.io crux-quorum-4nodes-docker-compose.

Perhaps this works: https://medium.com/@javahippie/building-a-local-ethereum-network-with-docker-and-geth-5b9326b85f37 - any better suggestions?

@drandreaskrueger
Author

@karalabe please reopen

I am seeing the exact same problem in geth v1.8.13-stable-225171a4

will document it now

@karalabe karalabe reopened this Aug 20, 2018
@drandreaskrueger
Author

drandreaskrueger commented Aug 20, 2018

@drandreaskrueger
Author

oh, and I asked - yes, Quorum is going to catch up soon.

@ghost

ghost commented Aug 23, 2018

@drandreaskrueger It wouldn't drop if you send transactions via a mining node (aka a signer).

Here are the live testnet stats https://test-stats.eos-classic.io from the eosclassic POA Testnet https://github.com/eosclassic/eosclassic/commit/7d76e6c86f5b0385755e492240eea4b3fba8d60d

with a 1s block time and a 100M gas limit.

Explorer: https://test-explorer.eos-classic.io/home

Current throughput is around 200 TPS.

@drandreaskrueger
Author

drandreaskrueger commented Aug 23, 2018

if you send transactions via mining node ( aka signer )

I am.

In geth submitting them to :8545 in this network https://github.com/javahippie/geth-dev
In quorum submitting them to :22001 in this network https://github.com/drandreaskrueger/crux/commits/master

See https://gitlab.com/electronDLT/chainhammer/blob/master/geth.md and https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md#crux-docker-4nodes
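(For context, the submission path is essentially this - a simplified web3.py sketch, not the actual chainhammer code; the account handling and the contract call are omitted/illustrative:)

```python
# Simplified, illustrative sketch of the submission path (not the real chainhammer code):
# connect to the node's RPC port and fire many transactions at it in a tight loop.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # :22001 for the quorum node
sender = w3.eth.accounts[0]                             # an unlocked account on that node

for i in range(1000):
    w3.eth.sendTransaction({
        "from": sender,
        "to": sender,   # placeholder; the real runs call a contract function instead
        "value": 0,
        "gas": 21000,
    })
```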

So ... the question is still open: Why the drop?

@drandreaskrueger
Author

drandreaskrueger commented Aug 23, 2018

eosclassic POA Testnet

Interesting.

Does it make sense to also benchmark that one?
It is a public testnet, right? How can I get testnet coins? How do I add my node?

Is it a unique PoA algorithm, or one of those which I have already benchmarked?

Sorry for the perhaps naive questions; I will hopefully be able to look deeper when I am back in the office on Monday.

eosclassic POA Testnet

Is there perhaps also a dockerized/vagrant-virtualbox version of such a network with a private chain - like the 7nodes example of quorum, or the above-mentioned crux 4nodes (both of which run complete networks with all nodes on my one machine)? If yes, then I would rather benchmark that. (My tobalaba.md benchmarking has shown that public PoA chains can lead to very unstable results.)

@ghost

ghost commented Aug 24, 2018

@drandreaskrueger It is a public testnet with the clique engine (and some modifications for 0-fee transactions).

There is a testnet faucet running, so you can download the node here https://github.com/eosclassic/eosclassic/releases/tag/v2.1.1 and earn some testnet coins here https://faucet.eos-classic.io/

Right now there is no dockerized/vagrant-virtualbox version of our testnet, but we are planning to improve that with AWS / Azure template support soon.

Our purpose in benchmarking is to improve the existing consensus algorithm for public use and to implement a modified version of DPOS in an Ethereum environment; integrating DPOS will be done after its specification.

So feel free to use our environment. I will list our testnet/development resources here.

EOS Classic POA Testnet Resources

Explorer: https://test-explorer.eos-classic.io/

Faucet: https://faucet.eos-classic.io/

Wallet: https://wallet.eos-classic.io/

RPC-Node: https://test-node.eos-classic.io/

Network stats: https://test-stats.eos-classic.io

@drandreaskrueger
Author

drandreaskrueger commented Aug 28, 2018

EOS - see

but please keep this thread to answering my geth question!

@drandreaskrueger
Author

drandreaskrueger commented Aug 28, 2018

.

@drandreaskrueger
Author

the question is still open: Why the drop?

I might have time for it next week; please share any ideas, thanks.

@drandreaskrueger
Author

Any new ideas about that?

You can now super-easily reproduce my results, in less than 10 minutes, with my Amazon AMI image:

https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami

@ijxy

ijxy commented Sep 21, 2018

@drandreaskrueger perhaps a silly question, but one that hasn't been asked: have you eliminated the possibility that this is an artefact of using a docker/VM-based network? Not directly related, but I have in the past had problems when developing on Fabric due to VM memory limitations after a buildup of transactions. It probably isn't the problem, but it's something to check, which is what you originally asked for.

@drandreaskrueger
Author

drandreaskrueger commented Sep 21, 2018

@jdparkinson93 no, I haven't. Not a silly question. It is good to eliminate all possible factors.

Please send the exact commandline/s for creating a self-sufficient node/network without docker.

Then the benchmarking can be run on that, to see whether this replicates or not.

Thanks.

@drandreaskrueger
Author

drandreaskrueger commented Sep 21, 2018

And do reproduce my results, in less than 10 minutes, with my Amazon AMI image:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami

or, in the same file, scroll up ... for instructions on how to do it on any Linux system.

Then you can do experiments yourself.

@ijxy

ijxy commented Sep 25, 2018

> @jdparkinson93 no, I haven't. Not a silly question. It is good to eliminate all possible factors.
>
> Please send the exact commandline/s for creating a self-sufficient node/network without docker.
>
> Then the benchmarking can be run on that, to see whether this replicates or not.
>
> Thanks.

@drandreaskrueger

I attempted to do this in the laziest possible way by leveraging ganache, but submitting that many transactions at once caused it problems. However, by drip-feeding them in batches of 100, I was able to reach and exceed 14k txs.
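(The drip-feeding was essentially a loop of this shape - a simplified, illustrative sketch, not my exact test code; the real transactions called a non-trivial contract rather than doing a plain transfer:)

```python
# Simplified, illustrative sketch of the batch-timing loop (not the exact test code).
# Assumes a local ganache node on :8545 with a prefunded, unlocked default account.
import time
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
sender = w3.eth.accounts[0]

total, batch_size = 20000, 100
for batch in range(total // batch_size):
    start = time.time()
    for _ in range(batch_size):
        # placeholder transaction; the real test invoked a contract function here
        w3.eth.sendTransaction({"from": sender, "to": sender, "value": 0})
    print("batch %d took %.2f s" % (batch, time.time() - start))
```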

Time (as reported by the Unix time command) to run the first few batches of 100 txs:

15.02 real         3.55 user         0.28 sys
15.19 real         3.61 user         0.28 sys
15.51 real         3.69 user         0.29 sys
15.65 real         3.57 user         0.27 sys
15.96 real         3.65 user         0.28 sys
16.32 real         3.76 user         0.33 sys
16.29 real         3.53 user         0.27 sys
16.62 real         3.56 user         0.27 sys
16.15 real         3.70 user         0.28 sys
15.99 real         3.53 user         0.27 sys
16.22 real         3.51 user         0.27 sys

After 14000 txs:

21.64 real         3.63 user         0.29 sys
21.19 real         3.62 user         0.28 sys
20.97 real         3.65 user         0.27 sys
20.70 real         3.62 user         0.29 sys
23.41 real         3.67 user         0.29 sys
22.51 real         3.81 user         0.30 sys
21.03 real         3.64 user         0.27 sys
21.76 real         3.77 user         0.30 sys
21.38 real         3.76 user         0.28 sys
21.69 real         3.75 user         0.30 sys       
21.67 real         3.65 user         0.28 sys
20.89 real         3.75 user         0.30 sys
21.12 real         3.68 user         0.29 sys

So, as we can see, the time does increase somewhat... but I'd argue that's expected in this case, because the transactions I sent in were non-trivial and the execution time per transaction is expected to grow as there are more transactions (since the use case involves matching related things, which in turn involves some iteration).

Apologies that I didn't use simple Ether/ERC20 transfers for the test, as in hindsight that would have been a better comparison. Nevertheless, I'm not convinced I am seeing a massive (unexpected) slowdown of the magnitude your graphs suggest.

@drandreaskrueger
Author

Thanks a lot, @jdparkinson93 - great that you try to reproduce that dropoff with different means.

Please also do reproduce my results, in less than 10 minutes, with my Amazon AMI image:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#readymade-amazon-ami

Or with a tiny bit more time ... on any linux system
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md

Thanks.

@drandreaskrueger
Author

Today I did a quorum run on a really fast AWS machine:
https://gitlab.com/electronDLT/chainhammer/blob/master/reproduce.md#quorum-crux-ibft

And got these results:
https://gitlab.com/electronDLT/chainhammer/blob/master/quorum-IBFT.md#amazon-aws-results

with a mild drop of TPS in the last third.

@ijxy

ijxy commented Sep 27, 2018

@drandreaskrueger There isn't really any drop in your latest test - if anything, it appeared to incrementally speed up and then reach a ceiling. What are your thoughts regarding these new results?

@drandreaskrueger
Author

After 14000 txs ... So as we can see, the time does increase somewhat

Yes, interesting. Thanks a lot, @jdparkinson93 !

You see a (21-15) / 15 = 40% increase in "real time".
I see a (350 - 220) / 350 ~40% drop in TPS.

Could be the same signal, no?

Could you plot your first column over the whole range of 20k transactions?
I would like to see whether you also see a rather sudden drop, like in most of my earlier experiments.

didn't use simple Ether/ERC20 transfers for the test

I am always overwriting one uint in a smart contract. Easiest to understand here; but then later optimized here.
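(Conceptually it is just something like this - a minimal sketch, not the actual chainhammer code; the contract address, ABI, and the set() function below are placeholders standing in for the real deployment:)

```python
# Minimal, conceptual sketch only -- not the actual chainhammer code.
# Each benchmark transaction simply overwrites one uint in contract storage.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))

CONTRACT_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder address
CONTRACT_ABI = [{"name": "set", "type": "function",              # placeholder ABI
                 "inputs": [{"name": "x", "type": "uint256"}], "outputs": []}]

contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)

for i in range(20000):
    contract.functions.set(i).transact({"from": w3.eth.accounts[0]})
```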

@drandreaskrueger
Author

drandreaskrueger commented Oct 21, 2018

I have new information for you guys.

I ran the same thing again, BUT this time with 50k transactions (not only 20k).

Have a look at this:

geth-clique-50kTx_t2xlarge_tps-bt-bs-gas_blks12-98.png
= https://github.com/drandreaskrueger/chainhammer/blob/eb97301c6f054cbe6e6278c91a138d10fa04a779/reader/img/geth-clique-50kTx_t2xlarge_tps-bt-bs-gas_blks12-98.png

Isn't that odd?

I would not have expected that.

TPS is dropping considerably after ~14k transactions - but then it actually recovers. Rinse repeat.

More info, the log of the whole experiment, etc.:

https://github.com/drandreaskrueger/chainhammer/blob/master/geth.md#run-2

EDIT: sorry guys, after massive reorg, some paths had changed. Repaired that now, image is visible again.

@drandreaskrueger
Author

sorry guys, after massive reorg, some paths had changed. Repaired that now, image is visible again.

@fjl
Contributor

fjl commented Feb 19, 2019

Thank you for putting so much work into debugging this issue.

@drandreaskrueger
Author

Thank you for putting so much work into debugging this issue.

Happy to help. I don't speak Go yet, and I am not debugging it myself ... but I can provide you with a tool that makes debugging this problem ... much easier.

chainhammer - I have just published a brand new version v55:
https://github.com/drandreaskrueger/chainhammer/#quickstart

Instead of installing everything on your main work computer, it is better to use (a VirtualBox Debian/Ubuntu installation or) my Amazon AMI to spin up a t2.medium machine; see docs/cloud.md#readymade-amazon-ami.

Then all you need is this line:

CH_TXS=50000 CH_THREADING="threaded2 20" ./run.sh "YourNaming-Geth" geth-clique

and afterwards check results/runs/ to find an autogenerated results page, with time series diagrams.

Hope that helps! Keep me posted please.

@MariusVanDerWijden
Member

@drandreaskrueger We recently made some changes in the txpool so that local transactions are not stalled for that long, and some improvements to the memory management. Previously we would need to reheap the txpool whenever a new tx was sent, and this created a lot of pressure on the garbage collector. I don't know how to use AWS and have no account there. Could you maybe rerun your chainhammer once again with the newest release? Thank you very much for all your work!
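(To illustrate the general effect - a toy sketch only, not the geth txpool code - re-heaping the whole pool on every insert does far more work than re-heaping once per batch:)

```python
# Toy illustration only -- NOT the geth txpool implementation.
# Contrast re-heaping the whole pool on every new tx with one reheap per batch.
import heapq
import random
import time

def per_tx_reheap(base, txs):
    pool = list(base)
    for tx in txs:
        pool.append(tx)
        heapq.heapify(pool)      # O(n) work (and allocation churn) per transaction
    return pool

def batched_reheap(base, txs):
    pool = list(base)
    pool.extend(txs)
    heapq.heapify(pool)          # a single O(n) pass for the whole batch
    return pool

base = [random.random() for _ in range(20000)]
txs = [random.random() for _ in range(2000)]

t0 = time.time(); per_tx_reheap(base, txs); t1 = time.time()
batched_reheap(base, txs); t2 = time.time()
print("per-tx reheap: %.2fs, batched reheap: %.2fs" % (t1 - t0, t2 - t1))
```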

@drandreaskrueger
Author

drandreaskrueger commented Sep 24, 2020

Thank you very much for your message, @MariusVanDerWijden. Happy to be of use.

Right now it's not my (paid) focus, honestly - but ... I intended to rerun 'everything chainhammer' again with the then-newest versions of all available Ethereum clients at some point in the future anyway. I could also imagine then publishing all the updated results in some paper, so ... this is good to know. Great. Thanks for getting back to me. Much appreciated.

Cool that you made some changes there, that they perhaps address these symptoms, and that you improved the client. Great.

Oh, and ... AWS is really not important here. If you want to, try it yourself now - I recommend first watching the (boring) 15-minute video; these step-by-step instructions have also been really helpful for other users.
It shouldn't be difficult for you to get new geth results using chainhammer. It could also be a useful tool for you to identify and fix further bottlenecks, especially because it is so simple.

@holiman
Contributor

holiman commented Feb 21, 2023

This ticket/investigation is getting quite old, and I suspect it may no longer be relevant with the current code. I'm going to close this for now.

@holiman holiman closed this as completed Feb 21, 2023