Make txArriveTimeout Configurable w/ CLI Flag #734

thogard785 · 2023-02-09T01:00:42Z

Description

Following up on:

Renamed txArriveTimeout to txArrivalWait to more accurately describe the purpose of the argument - this is a variable that controls how long to wait before requesting a transaction, there's no action/request/response that has been 'timed out'.
Added a CLI flag txarrivalwait to allow node operators to set their own maximum wait period before explicitly requesting an announced transaction.
Set the default txarrivalwait period to 500ms (EDIT: updated from 100ms per comment from jekamas). While the motives of the author of PR 627 are unclear, his analysis and data collection are very thorough. Extremely thorough for somebody's first PR...

I suspect that the optimal txarrivalwait will vary per node - some nodes may prioritize a decreased load or may be in higher-latency locations while others may want a very rapid response. By allowing the setting to be configurable, we give node operators the chance to choose a setting that best fits their need.

Note that this also will allow validators the ability to customize their interactions with the FastLane MEV system, as referenced in PR 707.

Changes

New feature (non-breaking change that adds functionality)

Nodes audience

This

Checklist

I have added at least 2 reviewer or the whole pos-v1 team
I have added sufficient documentation in code
I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply

Testing

I have updated the tx_fetcher unit test
tested on live w/ various flag values
tested on live w/o flag (default value)

Additional comments

I do not have high conviction either for or against the default value of 100ms and am open to changing it. The rationale for the selection of 100ms as default is solely due to PR 627 being accepted. EDIT: changed to 500ms per comments from JekaMas.
EDIT: Added a minimum txArrivalWait value Added a check that sets the f.txArrivalWait value to the txGatherSlack value if f.txArrivalWait < txGatherSlack.

The rationale is due to the context of the usage of txGatherSlack by the fetcher:
if time.Duration(f.clock.Now()-instance)+txGatherSlack > f.txArrivalWait {...
(found on lines 451 and 488 )

if f.txArrivalWait is less than txGatherSlack, it is impossible for the above conditional to return false, rendering the value of f.txArrivalWait meaningless and potentially leading to node operators misunderstanding the impact of the txArrivalWait value they set.

(The exception to the above is if the node operator has modified the hardcoded gatherSlack variable (default: 100ms) to be less than the txGatherSlack variable (default: 20ms). Comparing f.txArrivalWait against these two values is txArrivalWait's only hot path usage in bor.

d56370b

…c to handler to fetcher. TODO verify default, expose var to api

…default) settings

docs/cli/server.md

internal/cli/server/config.go

p2p/server.go

JekaMas

LGTM with few comments.

… to indicate the new default value. updated tx_fetcher_test to test using the new default value

…flags to duration type. Tested on live both w/o flag set (default) and w/ flag set

docs/cli/server.md

internal/cli/server/config.go

…routine process

… if txArrivalWait < txGatherSlack

thogard785 · 2023-02-09T11:32:19Z

Added a check that sets the f.txArrivalWait value to the txGatherSlack value if f.txArrivalWait < txGatherSlack.

The rationale is due to the context of the usage of txGatherSlack by the fetcher:
if time.Duration(f.clock.Now()-instance)+txGatherSlack > f.txArrivalWait {...
(found on lines 451 and 488 )

if f.txArrivalWait is less than txGatherSlack, it is impossible for the above conditional to return false, rendering the value of f.txArrivalWait meaningless and potentially leading to node operators misunderstanding the impact of the txArrivalWait value they set.

(The exception to the above is if the node operator has modified the hardcoded gatherSlack variable (default: 100ms) to be less than the txGatherSlack variable (default: 20ms). Comparing txArrivalWait against these two values is txArrivalWait's only hot path usage in bor.

d56370b

thogard785 · 2023-02-11T00:22:35Z

@JekaMas Wanted to make sure you saw the change to add in a min value for the f.txArrivalWait - I think that commit went through right as you hit approve and so I wanted to make sure nothing slipped through the cracks.

@0xsharma @temaniarpit27

cmd/utils/flags.go

docs/cli/server.md

thogard785 · 2023-02-15T03:06:24Z

@JekaMas Hey - I believe I erroneously removed @temaniarpit27 and @0xsharma from the "review" position - could you re-add them?

pratikspatil024

Also, update all the TOML configuration files for consistency.

internal/cli/server/config.go

…ArrivalWaitRaw string for toml and hcl value inputs

JekaMas · 2023-02-20T03:01:19Z

How about a conflict of interest disclosure from Fastlane labs before you go accusing others of having ulterior motives?

Who are the investors in Fastlane, and are any of them related to Polygon?

Thank you for your interest @ajb ! Polygon dev team welcomes all opinions and proposals.
If you believe that the network will benefit from setting the timeout default value to some particular value, we are more than just interested in that kind of research.
By the time we decided to make a few constants as a params to simplify mev related research and put on hold the rest of the tasks related to mev. We are creating a mev study core dev group with focus on having as less spam as possible. If you'd like to participate and can help with this research I'd ve happy to invite you to the group.

JekaMas · 2023-02-20T03:24:32Z

@pratikspatil024 @temaniarpit27 @0xsharma I am ok with that change, it does nothing with the network, although will simply the research. Let's merge it, if you are ok too.

ajb · 2023-02-20T04:14:34Z

How about a conflict of interest disclosure from Fastlane labs before you go accusing others of having ulterior motives?
Who are the investors in Fastlane, and are any of them related to Polygon?

Thank you for your interest @ajb ! Polygon dev team welcomes all opinions and proposals. If you believe that the network will benefit from setting the timeout default value to some particular value, we are more than just interested in that kind of research. By the time we decided to make a few constants as a params to simplify mev related research and put on hold the rest of the tasks related to mev. We are creating a mev study core dev group with focus on having as less spam as possible. If you'd like to participate and can help with this research I'd ve happy to invite you to the group.

Sure, I will participate. Please let me know.

JekaMas · 2023-02-20T04:54:46Z

Sure, I will participate. Please let me know.

@ajb great! By far could you give as the way to reproduce your experiments and data? Like what steps, setup, etc.

ajb · 2023-02-20T15:52:53Z

@ajb great! By far could you give as the way to reproduce your experiments and data? Like what steps, setup, etc.

What do you mean? PR #627 is not mine.

I would be happy to do some experiments regarding the PR I authored (#292) but I'm unsure how to really advocate for that change, since with these types of changes, you would need to make the change on the network first, then wait for bots to change their behavior accordingly.

asbansal · 2023-02-28T14:43:19Z

A slightly different take - Ethereum had spam issues too although not as much as the low fee chains. Introduction of mev-geth eliminated most of that. mev-bor as an alternative to bor already exists and it can be updated to feature full block templates like mev-boost. Allowing client diversity may be a simpler solution to the spam issue.

@JekaMas we would be interested in joining the group as well!

github-actions · 2023-03-22T00:15:01Z

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 14 days.

temaniarpit27 · 2023-04-03T11:06:09Z

@thogard785 we would like to go ahead with your PR. Can you please pull in the latest changes and resolve conflicts?

CitizenSearcher · 2023-04-04T09:21:26Z

I had a concern regarding this PR that I think could be solved with a minor addition.

There's a danger of misconfiguration of this parameter by a validator. Validator nodes rely on the TxFetcher in order to receive all transactions, since the small number of peers means most transaction will not be directly propagated. If a validator were to inadvertently set this value too high, it could result in a validator being unable to process most, or all of the transactions during their sprint. In the case of a normal validator, a small minority of transaction might still make it through with direct propagation, and with PFL validators only those transactions that win an PFL auction might make it through. This would result in a pretty bad degradation in user experience, and would be difficult to debug by either the validator or polygon team. It's hard to predict what kind of effects on the network there long periods of no transaction processing would have on the network, so it's certainly possible this would have network level effects on other participants.

Could I suggest putting a cap on this value enforced inside bor? I think the old value of .5 seconds would make a lot of sense. I can understand why a node operator might want to lower the value, but it seems unlikely that a they would want to place set this value higher than the old default. If there's a concern over resource load, I'd suggest other settings to be considered, as this parameter would have almost no effect on resources given how little the TxFetcher is used by nodes with any reasonable number of peers.

In the PFL whitepaper, the number one design objective is that the system "Incurs no transaction delay beyond the limit of what is already possible.". The suggested value would certainly help to achieve this. There might be some perverse incentive for PFL / validators to want to delay transactions with this parameter in order to extract more MEV; this kind of thing being supported opens up a lot of other questions / issues that I would need to be clearfully considered.

thogard785 · 2023-04-05T06:03:06Z

@CitizenSearcher that's good feedback, and you're right about our objective of not delaying regular users past the "default" of 500ms. I'm 100% ok with a 500ms ceiling.

@temaniarpit27 thoughts on adding a cap?

temaniarpit27 · 2023-04-05T06:04:20Z

@CitizenSearcher thanks a lot for the feedback
@thogard785 I think we should add a upper cap on this with 500ms as the val

…longer durations than maxTxArrivalWait will default to the maxTxArrivalWait duration

thogard785 · 2023-04-05T06:20:52Z

@CitizenSearcher thanks a lot for the feedback
@thogard785 I think we should add a upper cap on this with 500ms as the val
added:
a5b49c3

thogard785 · 2023-04-05T07:00:21Z

@thogard785 we would like to go ahead with your PR. Can you please pull in the latest changes and resolve conflicts?

@temaniarpit27 I updated the branch to resolve conflicts:
thogard785@1e926cf

temaniarpit27 · 2023-04-13T07:34:54Z

@thogard785 can you please fix the failing CI?

thogard785 · 2023-04-28T23:52:32Z

@thogard785 can you please fix the failing CI?

Hey @temaniarpit27 - sorry for the delay on this. I was sidelined for health reasons, but I'm back to 100% now.

These changes should fix the cuddling issues the linter was having w/ the codebase.

7d765e0
and
63f79e3

The core/tx_list.go change seemed to be totally unrelated to what i was working on. It looks like that cuddle issue was fixed separately in the develop branch.. so i overwrote my own fix w/ the one from develop branch when i updated this branch w/ the changes to develop.

e62cf51

thogard785 · 2023-04-30T16:11:24Z

Looks like merging w/ dev again added some new checks to fail - will fix on Monday.

temaniarpit27 · 2023-05-01T11:44:01Z

Looks like merging w/ dev again added some new checks to fail - will fix on Monday.

Ya will need to fix testcases as well

codecov · 2023-05-01T18:58:38Z

Codecov Report

Patch coverage: 85.71% and project coverage change: -0.04 ⚠️

Comparison is base (3607151) 56.64% compared to head (62d8cc0) 56.60%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #734      +/-   ##
===========================================
- Coverage    56.64%   56.60%   -0.04%     
===========================================
  Files          611      611              
  Lines        72238    72447     +209     
===========================================
+ Hits         40919    41012      +93     
- Misses       27823    27926     +103     
- Partials      3496     3509      +13

Impacted Files	Coverage Δ
eth/backend.go	`0.00% <0.00%> (ø)`
internal/cli/dumpconfig.go	`0.00% <0.00%> (ø)`
node/defaults.go	`24.13% <ø> (ø)`
p2p/server.go	`65.48% <ø> (ø)`
eth/fetcher/tx_fetcher.go	`88.35% <83.33%> (-0.96%)`	⬇️
eth/handler.go	`54.31% <100.00%> (ø)`
internal/cli/server/config.go	`69.44% <100.00%> (+0.26%)`	⬆️
internal/cli/server/flags.go	`100.00% <100.00%> (ø)`

... and 31 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

thogard785 · 2023-05-01T21:55:00Z

62d8cc0

updated the fuzzer.

@temaniarpit27 codecov/project is still failing but all the other tests are passing. I'm unsure how to resolve the codecov/project fail - any advice?

thogard785 added 2 commits February 4, 2023 18:15

added flags, carried flag var through p2p config to backend start fun…

f6903b4

…c to handler to fetcher. TODO verify default, expose var to api

updated type conversion point to simplify - tested flag and no-flag (…

5496007

…default) settings

thogard785 marked this pull request as draft February 9, 2023 02:15

thogard785 marked this pull request as ready for review February 9, 2023 02:18

JekaMas requested a review from temaniarpit27 February 9, 2023 07:26

JekaMas added the MEV label Feb 9, 2023

JekaMas requested review from JekaMas and 0xsharma February 9, 2023 07:26