Reduce txArriveTimeout to 100ms #627
Conversation
Outstanding research! Thank you! Apart from this question, I believe that it's quite safe to accept it. From our side it'd be great to have txpool metrics as well, to check how useful the change is.
Codecov Report — Base: 56.74% // Head: 56.85% // Increases project coverage by +0.11%.
Additional details and impacted files
@@ Coverage Diff @@
## develop #627 +/- ##
===========================================
+ Coverage 56.74% 56.85% +0.11%
===========================================
Files 607 609 +2
Lines 70294 70915 +621
===========================================
+ Hits 39888 40320 +432
- Misses 26983 27152 +169
- Partials 3423 3443 +20
The fetcher is already pretty extensively instrumented; it should have enough metrics to capture the effects of this change. I do have numbers from running with the change. They don't look that different (as I would expect), since I don't capture packets that are returned when we request a refetch, only hash or direct transaction broadcasts. You can see the tail latency at the upper end does come down. This is mainly because we broadcast transactions out sooner, and thus other nodes don't send us these late broadcasts, so it doesn't accurately capture the latency drop.
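For anyone who wants to watch those counters while testing, here is a minimal sketch of one way to pull them. It assumes the node is started with metrics enabled (e.g. the --metrics / --metrics.addr / --metrics.port flags inherited from upstream geth) and exposes the Prometheus endpoint at /debug/metrics/prometheus; the flag names, the port, and the exact metric keys may differ between bor versions.

```go
// metricscheck.go — quick filter of the transaction-fetcher metrics.
// Assumes a locally running node with metrics enabled and the HTTP metrics
// server listening on 127.0.0.1:6060 (upstream geth defaults); adjust the
// URL for your setup.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:6060/debug/metrics/prometheus")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the transaction-fetcher series (announces, broadcasts,
		// requests, replies), skipping Prometheus comment lines.
		if strings.Contains(line, "fetcher_transaction") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```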
Hi @builder90210, thank you for writing such a detailed post with data and reasoning. I think it's a good idea to tune these parameters to a certain degree. However, I have a concern regarding the claim that "the burden from the p2p layer should not appreciably change" when we expect to rebroadcast a larger percentage of transactions. Without supporting data, it's difficult to know whether the traffic load of serving transaction fetches would increase significantly if all nodes in the network changed this parameter at the same time. Would you be able to run an experiment in a cluster where all nodes have 100ms configured and collect some data to support this claim? Thank you!
@cffls Here's the analysis that I based the initial claim on. I might have been a little unclear in the original post: I didn't mean to say that with the change we expect to rebroadcast much more. Within the system that already exists, the broadcasting loop outside of the TxFetcher dwarfs any increase we'd see from the fetcher changes. Let me know if you're comfortable with this. If not, what type of data would you like to see? I may already have some, as I have a number of nodes on the network that are statically peered and have run with the changes here.

For each transaction a node sees, it will broadcast that transaction (either hash or full) to every peer that has not already sent it that transaction. This means that for every transaction, for each of our peers, there must be a message sent in one direction: either us notifying the peer we have the transaction, or them notifying us that they have it. We could expect about 50% of these to be sends and 50% to be receives on average, but the actual split will vary node-by-node depending on how centrally the node is located in the network. This is a conservative model, as it doesn't account for latency, which means for some transactions we'd expect both nodes to send a message to each other.

With the TxFetcher change, we expect to do a fetch for around 13% of transactions. However, the TxFetcher will only fetch from a single peer concurrently per transaction. With the timeouts set at current levels, the vast majority of the time that means we'll only ping back a single peer, instead of every peer as with broadcasting. This behavior is covered in the fetcher unit tests, notably here: https://github.com/maticnetwork/bor/blob/develop/eth/fetcher/tx_fetcher_test.go#L641.

We can model the per-transaction network burden for each request type as (fraction of transactions that trigger it) × (peers contacted per transaction). For the standard rebroadcast, this calculates out to roughly 100% × (peer count), i.e. one message per peer per transaction. Whereas for the TxFetcher, this calculates out to roughly 13% × 1 peer, i.e. about 0.13 messages per transaction, so the load in the fetcher should be tiny compared to the existing load from broadcasting. (A worked example with concrete numbers is sketched below.)

The above analysis does ignore the size of the messages (fetches from the fetcher are larger than announcements, since they carry full transactions rather than hashes). This shouldn't affect the conclusion much, as the rebroadcast loop already sends out full transactions to roughly 6% of peers, and the burdens are orders of magnitude apart. Even if the fetcher were activating for every transaction, the additional load would still only be a fraction of the existing broadcast protocol, due to the single concurrent request per transaction in the TxFetcher.

You could repeat this analysis for send traffic and receive traffic independently. The numbers should work out the same on average, since requests are symmetric in a throughput analysis (a request into my node is a request sent by another node). I believe this analysis should hold for effects on the network as a whole, as I don't see a reason why we'd have compounding effects as more nodes adopt the change.
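Here is the worked example of the model above. The 250-peer count comes from the node described in the issue and the 13% fetch rate from the percentile data; both are illustrative inputs for this sketch, not protocol constants.

```go
package main

import "fmt"

func main() {
	// Illustrative inputs taken from the discussion: a node with 250 peers,
	// and roughly 13% of transactions arriving late enough to trigger a
	// fetch at a 100ms txArriveTimeout.
	const (
		peers         = 250.0
		fetchFraction = 0.13 // share of transactions that trigger a fetch
		fetchPeers    = 1.0  // the fetcher requests from one peer at a time
	)

	// Broadcast model: for every transaction, each peer link carries one
	// message in one direction (hash announcement or full broadcast).
	broadcastMsgsPerTx := peers

	// Fetcher model: only the fetched fraction generates an extra request,
	// and only to a single peer per transaction.
	fetchMsgsPerTx := fetchFraction * fetchPeers

	fmt.Printf("broadcast messages per tx: ~%.0f\n", broadcastMsgsPerTx)
	fmt.Printf("fetcher messages per tx:   ~%.2f\n", fetchMsgsPerTx)
	fmt.Printf("ratio: ~%.0fx\n", broadcastMsgsPerTx/fetchMsgsPerTx)
}
```

With these inputs the broadcast path carries on the order of a few hundred messages per transaction while the fetcher adds well under one, which is the "orders of magnitude apart" claim above.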
Thank you for your clear and thorough explanation, @builder90210. Your response effectively addressed my concern about the compounding effect. Really appreciate the time and effort you took to help me understand this concept. |
It's great to be able to contribute :). I have a lot of experience in this area of the code, so happy to help out in the future if you need anything. |
thank you @builder90210 ! I really appreciate your help. |
Description
I'm suggesting that the Polygon team consider lowering the TxFetcher pingback threshold from 500ms to 100ms. This change would have a number of benefits for the network, including faster transaction propagation, improved communication between validators and sentry nodes, and reduced incentives for spamming transactions on the network. Lowering the threshold to 100ms would not significantly increase network load, and it is already in use by many power users who care about receiving transactions quickly. This change would improve the overall performance and reliability of the Polygon network.
In propagating transactions through the p2p network, a node will send the full transaction directly to a small number of its peers, but only the transaction hash to the majority. The TxFetcher is responsible for handling the hash-only announcements on the receiving node. When a hash announcement is received, the TxFetcher will wait some period of time (currently 500ms), and if it does not receive a direct broadcast in that time, it will ping an announcing node to get the full transaction.
This 500ms threshold is a legacy setting from the original geth fork. It was quite appropriate for a network with longer block times and the ability to run on very low-end hardware; however, for the faster block time and higher infrastructure requirements of Polygon, I believe a faster pingback would have a number of benefits.
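For reference, the change itself is essentially a one-line constant adjustment. Below is a sketch of its shape, assuming the constant still lives in eth/fetcher/tx_fetcher.go as in upstream geth; the surrounding code in the bor tree may differ.

```go
// Sketch of the proposed change in eth/fetcher/tx_fetcher.go (not the exact
// diff from this PR; surrounding code may differ between bor versions).
package fetcher

import "time"

const (
	// txArriveTimeout is how long the fetcher waits after a hash announcement
	// for the full transaction to arrive via a direct broadcast before it
	// requests the transaction from one of the announcing peers.
	txArriveTimeout = 100 * time.Millisecond // previously 500 * time.Millisecond
)
```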
This is one of the first and most common modifications made by power users who care about receiving transactions faster. I myself have run this setting much lower than 100ms, and noticed no performance issues on my nodes whatsoever. I know several others who have done the same.
Lowering this threshold will have a number of benefits for the network:
- Faster transaction propagation, cutting off the tail latency described below.
- Improved communication between validators and sentry nodes.
- Reduced incentives for spamming transactions on the network.
I log some portion of the network traffic packets that are received by my node, and present some data below on the distribution of latency in these network packets. The 100ms figure is based on data I gathered from these logs.
Here is a histogram of the latency in milliseconds between first hash announcement vs. the first direct transaction received by my node. A negative number indicates the first direct announcement was received earlier than the first hash announcement. This is run on a 16 core machine, with 250 peers in a Hetzner datacenter, so should be fairly typical for most nodes.
Indeed, when we look at percentile latency (below), we see that the current 500ms sits at roughly the 98th percentile, meaning the TxFetcher is used on less than 2% of transactions. 100ms is around the 87th percentile, so it would get activated on around 13% of transactions and cut off the tail latency in between. This may seem like it should increase load. However, when one considers that we already rebroadcast out a far larger share of the transactions we receive, we can see that the burden on the p2p layer should not appreciably change. The p2p burden is only a concern for full / sentry nodes, as validators have a very small number of peers due to sentry setups.
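As a rough sketch of how figures like these can be computed from packet logs (the record structure and field names here are hypothetical, not the actual capture pipeline):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// arrival records when a transaction hash was first announced to the node
// and when the full transaction was first received directly. Hypothetical
// structure for illustration only.
type arrival struct {
	firstAnnounce time.Time
	firstDirect   time.Time
}

// latencyPercentile returns the p-th percentile (0-100) of the
// announce-to-direct deltas, in milliseconds. Negative values mean the full
// transaction arrived before any hash announcement.
func latencyPercentile(arrivals []arrival, p float64) float64 {
	if len(arrivals) == 0 {
		return 0
	}
	deltas := make([]float64, 0, len(arrivals))
	for _, a := range arrivals {
		deltas = append(deltas, float64(a.firstDirect.Sub(a.firstAnnounce).Milliseconds()))
	}
	sort.Float64s(deltas)
	idx := int(p / 100 * float64(len(deltas)-1))
	return deltas[idx]
}

func main() {
	// With real data, compare e.g. the 87th and 98th percentiles against
	// the 100ms and 500ms thresholds discussed above.
	sample := []arrival{ /* populated from packet logs */ }
	if len(sample) > 0 {
		fmt.Println("p87 (ms):", latencyPercentile(sample, 87))
		fmt.Println("p98 (ms):", latencyPercentile(sample, 98))
	}
}
```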
My node has a pretty good set of static peers; I would expect this advantage to be greater for nodes that don't prune their peerset.
Changes
Checklist
Cross repository changes
Testing