Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make txArriveTimeout Configurable w/ CLI Flag #734

Merged
merged 14 commits into from
May 2, 2023

Conversation

thogard785
Copy link
Contributor

@thogard785 thogard785 commented Feb 9, 2023

Description

Following up on:

  1. PR 627 - txArriveTimeout change
  2. PR 707 - txArriveTimeout change reverted
  • Renamed txArriveTimeout to txArrivalWait to more accurately describe the purpose of the argument - this is a variable that controls how long to wait before requesting a transaction, there's no action/request/response that has been 'timed out'.
  • Added a CLI flag txarrivalwait to allow node operators to set their own maximum wait period before explicitly requesting an announced transaction.
  • Set the default txarrivalwait period to 500ms (EDIT: updated from 100ms per comment from jekamas). While the motives of the author of PR 627 are unclear, his analysis and data collection are very thorough. Extremely thorough for somebody's first PR...

I suspect that the optimal txarrivalwait will vary per node - some nodes may prioritize a decreased load or may be in higher-latency locations while others may want a very rapid response. By allowing the setting to be configurable, we give node operators the chance to choose a setting that best fits their need.

Note that this also will allow validators the ability to customize their interactions with the FastLane MEV system, as referenced in PR 707.

Changes

  • New feature (non-breaking change that adds functionality)

Nodes audience

This

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply

Testing

  • I have updated the tx_fetcher unit test
  • tested on live w/ various flag values
  • tested on live w/o flag (default value)

Additional comments

  • I do not have high conviction either for or against the default value of 100ms and am open to changing it. The rationale for the selection of 100ms as default is solely due to PR 627 being accepted. EDIT: changed to 500ms per comments from JekaMas.

  • EDIT: Added a minimum txArrivalWait value Added a check that sets the f.txArrivalWait value to the txGatherSlack value if f.txArrivalWait < txGatherSlack.

The rationale is due to the context of the usage of txGatherSlack by the fetcher:
if time.Duration(f.clock.Now()-instance)+txGatherSlack > f.txArrivalWait {...
(found on lines 451 and 488 )

if f.txArrivalWait is less than txGatherSlack, it is impossible for the above conditional to return false, rendering the value of f.txArrivalWait meaningless and potentially leading to node operators misunderstanding the impact of the txArrivalWait value they set.

(The exception to the above is if the node operator has modified the hardcoded gatherSlack variable (default: 100ms) to be less than the txGatherSlack variable (default: 20ms). Comparing f.txArrivalWait against these two values is txArrivalWait's only hot path usage in bor.

d56370b

@thogard785 thogard785 marked this pull request as draft February 9, 2023 02:15
@thogard785 thogard785 marked this pull request as ready for review February 9, 2023 02:18
@JekaMas JekaMas added the MEV label Feb 9, 2023
docs/cli/server.md Outdated Show resolved Hide resolved
p2p/server.go Outdated Show resolved Hide resolved
Copy link
Contributor

@JekaMas JekaMas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with few comments.

… to indicate the new default value. updated tx_fetcher_test to test using the new default value
…flags to duration type. Tested on live both w/o flag set (default) and w/ flag set
docs/cli/server.md Outdated Show resolved Hide resolved
@thogard785
Copy link
Contributor Author

thogard785 commented Feb 9, 2023

Added a check that sets the f.txArrivalWait value to the txGatherSlack value if f.txArrivalWait < txGatherSlack.

The rationale is due to the context of the usage of txGatherSlack by the fetcher:
if time.Duration(f.clock.Now()-instance)+txGatherSlack > f.txArrivalWait {...
(found on lines 451 and 488 )

if f.txArrivalWait is less than txGatherSlack, it is impossible for the above conditional to return false, rendering the value of f.txArrivalWait meaningless and potentially leading to node operators misunderstanding the impact of the txArrivalWait value they set.

(The exception to the above is if the node operator has modified the hardcoded gatherSlack variable (default: 100ms) to be less than the txGatherSlack variable (default: 20ms). Comparing txArrivalWait against these two values is txArrivalWait's only hot path usage in bor.

d56370b

@thogard785 thogard785 requested review from JekaMas and removed request for temaniarpit27 and 0xsharma February 11, 2023 00:18
@thogard785
Copy link
Contributor Author

@JekaMas Wanted to make sure you saw the change to add in a min value for the f.txArrivalWait - I think that commit went through right as you hit approve and so I wanted to make sure nothing slipped through the cracks.

@0xsharma @temaniarpit27

cmd/utils/flags.go Outdated Show resolved Hide resolved
docs/cli/server.md Outdated Show resolved Hide resolved
@thogard785
Copy link
Contributor Author

@JekaMas Hey - I believe I erroneously removed @temaniarpit27 and @0xsharma from the "review" position - could you re-add them?

@pratikspatil024 pratikspatil024 requested a review from a team February 15, 2023 03:55
Copy link
Member

@pratikspatil024 pratikspatil024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, update all the TOML configuration files for consistency.

internal/cli/server/config.go Outdated Show resolved Hide resolved
…ArrivalWaitRaw string for toml and hcl value inputs
@JekaMas
Copy link
Contributor

JekaMas commented Feb 20, 2023

How about a conflict of interest disclosure from Fastlane labs before you go accusing others of having ulterior motives?

Who are the investors in Fastlane, and are any of them related to Polygon?

Thank you for your interest @ajb ! Polygon dev team welcomes all opinions and proposals.
If you believe that the network will benefit from setting the timeout default value to some particular value, we are more than just interested in that kind of research.
By the time we decided to make a few constants as a params to simplify mev related research and put on hold the rest of the tasks related to mev. We are creating a mev study core dev group with focus on having as less spam as possible. If you'd like to participate and can help with this research I'd ve happy to invite you to the group.

@JekaMas
Copy link
Contributor

JekaMas commented Feb 20, 2023

@pratikspatil024 @temaniarpit27 @0xsharma I am ok with that change, it does nothing with the network, although will simply the research. Let's merge it, if you are ok too.

@ajb
Copy link

ajb commented Feb 20, 2023

How about a conflict of interest disclosure from Fastlane labs before you go accusing others of having ulterior motives?
Who are the investors in Fastlane, and are any of them related to Polygon?

Thank you for your interest @ajb ! Polygon dev team welcomes all opinions and proposals. If you believe that the network will benefit from setting the timeout default value to some particular value, we are more than just interested in that kind of research. By the time we decided to make a few constants as a params to simplify mev related research and put on hold the rest of the tasks related to mev. We are creating a mev study core dev group with focus on having as less spam as possible. If you'd like to participate and can help with this research I'd ve happy to invite you to the group.

Sure, I will participate. Please let me know.

@JekaMas
Copy link
Contributor

JekaMas commented Feb 20, 2023

Sure, I will participate. Please let me know.

@ajb great! By far could you give as the way to reproduce your experiments and data? Like what steps, setup, etc.

@ajb
Copy link

ajb commented Feb 20, 2023

@ajb great! By far could you give as the way to reproduce your experiments and data? Like what steps, setup, etc.

What do you mean? PR #627 is not mine.

I would be happy to do some experiments regarding the PR I authored (#292) but I'm unsure how to really advocate for that change, since with these types of changes, you would need to make the change on the network first, then wait for bots to change their behavior accordingly.

@asbansal
Copy link

A slightly different take - Ethereum had spam issues too although not as much as the low fee chains. Introduction of mev-geth eliminated most of that. mev-bor as an alternative to bor already exists and it can be updated to feature full block templates like mev-boost. Allowing client diversity may be a simpler solution to the spam issue.

@JekaMas we would be interested in joining the group as well!

@github-actions
Copy link

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@temaniarpit27
Copy link
Contributor

@thogard785 we would like to go ahead with your PR. Can you please pull in the latest changes and resolve conflicts?

@CitizenSearcher
Copy link

CitizenSearcher commented Apr 4, 2023

I had a concern regarding this PR that I think could be solved with a minor addition.

There's a danger of misconfiguration of this parameter by a validator. Validator nodes rely on the TxFetcher in order to receive all transactions, since the small number of peers means most transaction will not be directly propagated. If a validator were to inadvertently set this value too high, it could result in a validator being unable to process most, or all of the transactions during their sprint. In the case of a normal validator, a small minority of transaction might still make it through with direct propagation, and with PFL validators only those transactions that win an PFL auction might make it through. This would result in a pretty bad degradation in user experience, and would be difficult to debug by either the validator or polygon team. It's hard to predict what kind of effects on the network there long periods of no transaction processing would have on the network, so it's certainly possible this would have network level effects on other participants.

Could I suggest putting a cap on this value enforced inside bor? I think the old value of .5 seconds would make a lot of sense. I can understand why a node operator might want to lower the value, but it seems unlikely that a they would want to place set this value higher than the old default. If there's a concern over resource load, I'd suggest other settings to be considered, as this parameter would have almost no effect on resources given how little the TxFetcher is used by nodes with any reasonable number of peers.

In the PFL whitepaper, the number one design objective is that the system "Incurs no transaction delay beyond the limit of what is already possible.". The suggested value would certainly help to achieve this. There might be some perverse incentive for PFL / validators to want to delay transactions with this parameter in order to extract more MEV; this kind of thing being supported opens up a lot of other questions / issues that I would need to be clearfully considered.

@thogard785
Copy link
Contributor Author

thogard785 commented Apr 5, 2023

@CitizenSearcher that's good feedback, and you're right about our objective of not delaying regular users past the "default" of 500ms. I'm 100% ok with a 500ms ceiling.

@temaniarpit27 thoughts on adding a cap?

@temaniarpit27
Copy link
Contributor

@CitizenSearcher thanks a lot for the feedback
@thogard785 I think we should add a upper cap on this with 500ms as the val

…longer durations than maxTxArrivalWait will default to the maxTxArrivalWait duration
@thogard785
Copy link
Contributor Author

@CitizenSearcher thanks a lot for the feedback
@thogard785 I think we should add a upper cap on this with 500ms as the val
added:
a5b49c3

@thogard785
Copy link
Contributor Author

@thogard785 we would like to go ahead with your PR. Can you please pull in the latest changes and resolve conflicts?

@temaniarpit27 I updated the branch to resolve conflicts:
thogard785@1e926cf

@temaniarpit27
Copy link
Contributor

@thogard785 can you please fix the failing CI?

@thogard785
Copy link
Contributor Author

@thogard785 can you please fix the failing CI?

Hey @temaniarpit27 - sorry for the delay on this. I was sidelined for health reasons, but I'm back to 100% now.

These changes should fix the cuddling issues the linter was having w/ the codebase.

7d765e0
and
63f79e3

The core/tx_list.go change seemed to be totally unrelated to what i was working on. It looks like that cuddle issue was fixed separately in the develop branch.. so i overwrote my own fix w/ the one from develop branch when i updated this branch w/ the changes to develop.

e62cf51

@thogard785
Copy link
Contributor Author

Looks like merging w/ dev again added some new checks to fail - will fix on Monday.

@temaniarpit27
Copy link
Contributor

Looks like merging w/ dev again added some new checks to fail - will fix on Monday.

Ya will need to fix testcases as well

@codecov
Copy link

codecov bot commented May 1, 2023

Codecov Report

Patch coverage: 85.71% and project coverage change: -0.04 ⚠️

Comparison is base (3607151) 56.64% compared to head (62d8cc0) 56.60%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #734      +/-   ##
===========================================
- Coverage    56.64%   56.60%   -0.04%     
===========================================
  Files          611      611              
  Lines        72238    72447     +209     
===========================================
+ Hits         40919    41012      +93     
- Misses       27823    27926     +103     
- Partials      3496     3509      +13     
Impacted Files Coverage Δ
eth/backend.go 0.00% <0.00%> (ø)
internal/cli/dumpconfig.go 0.00% <0.00%> (ø)
node/defaults.go 24.13% <ø> (ø)
p2p/server.go 65.48% <ø> (ø)
eth/fetcher/tx_fetcher.go 88.35% <83.33%> (-0.96%) ⬇️
eth/handler.go 54.31% <100.00%> (ø)
internal/cli/server/config.go 69.44% <100.00%> (+0.26%) ⬆️
internal/cli/server/flags.go 100.00% <100.00%> (ø)

... and 31 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@thogard785
Copy link
Contributor Author

62d8cc0

updated the fuzzer.

@temaniarpit27 codecov/project is still failing but all the other tests are passing. I'm unsure how to resolve the codecov/project fail - any advice?

@temaniarpit27 temaniarpit27 merged commit 5fba098 into maticnetwork:develop May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants