Introduce a way to change capacity of the network on the fly in FT benchmark #11460

Open
aborg-dev opened this issue Jun 3, 2024 · 6 comments

@aborg-dev
Contributor

aborg-dev commented Jun 3, 2024

At the moment, the only easy way to change the capacity of the network is to set a higher gas_limit at startup.

For the FT throughput experiment, we want to find the largest capacity at which the network can still operate in a stable way. One way to achieve this is to be able to gradually increase the gas_limit over the course of the experiment until we reach a critical point.

It is already possible to increase gas_limit a bit at every block height - we should try to leverage that for this experiment.
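To get a feel for how long such a gradual ramp takes, here is a minimal sketch. It assumes that the per-block change is bounded by `gas_limit / GAS_LIMIT_ADJUSTMENT_FACTOR`, which is an assumption not confirmed in this thread; `max_step` and `blocks_to_reach` are hypothetical helper names.

```rust
/// Largest change allowed in one block.
/// ASSUMPTION: the bound is gas_limit / adjustment_factor; a zero factor yields a zero step.
fn max_step(gas_limit: u64, adjustment_factor: u64) -> u64 {
    gas_limit.checked_div(adjustment_factor).unwrap_or(0)
}

/// Counts the blocks needed to ramp from `start` to at least `target`
/// when every block applies the largest allowed increase.
/// Returns None if the step rounds down to zero before the target is reached.
fn blocks_to_reach(start: u64, target: u64, adjustment_factor: u64) -> Option<u64> {
    let mut gas_limit = start;
    let mut blocks = 0u64;
    while gas_limit < target {
        let step = max_step(gas_limit, adjustment_factor);
        if step == 0 {
            return None;
        }
        gas_limit = gas_limit.saturating_add(step);
        blocks += 1;
    }
    Some(blocks)
}
```

Under these assumptions and a factor of 1000, each block raises the limit by about 0.1%, so doubling the limit takes on the order of 700 blocks. That is one reason a configurable factor may be useful for benchmark runs.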

@mooori
Contributor

mooori commented Jun 18, 2024

Current situation

It seems that gas_limit is read from the genesis config and then passed on without modification into the structs that determine the gas_limit, such as ChunkExtra, ShardChunkHeader, ApplyState, ... In line with that, GAS_LIMIT_ADJUSTMENT_FACTOR is currently not used to change the gas_limit.

Proposal for approach to on the fly changes

  • When applying a chunk that reaches the gas_limit in less than 1 second - margin, the chunk producer increases gas_limit by GAS_LIMIT_ADJUSTMENT_FACTOR.
    • When the chunk does not reach the gas_limit, predictions regarding capacity are more difficult to make, hence the gas_limit is not changed for now.
  • When applying a chunk takes more than 1 second + margin, the chunk producer reduces the gas_limit by GAS_LIMIT_ADJUSTMENT_FACTOR.
    • Apply times larger than 1 second should be avoided and hence gas_limit is decreased without checking further conditions.
  • The apply time of 1 second is chosen since that is the current target for mainnet. The purpose of margin is to prevent gas_limit from flip-flopping around an equilibrium.
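The rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: `ChunkApplyStats`, `decide` and `next_gas_limit` are hypothetical names, "reaching the gas_limit" is modelled as `gas_used >= gas_limit`, and the step size is assumed to be `gas_limit / GAS_LIMIT_ADJUSTMENT_FACTOR`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Adjustment {
    Increase,
    Decrease,
    Keep,
}

/// Hypothetical summary of one applied chunk.
struct ChunkApplyStats {
    apply_time: Duration,
    gas_used: u64,
}

/// Applies the proposed heuristic to a single chunk.
fn decide(stats: &ChunkApplyStats, gas_limit: u64, target: Duration, margin: Duration) -> Adjustment {
    if stats.apply_time > target + margin {
        // Too slow: decrease without checking further conditions.
        Adjustment::Decrease
    } else if stats.gas_used >= gas_limit && stats.apply_time < target.saturating_sub(margin) {
        // The chunk was full and still finished comfortably early.
        Adjustment::Increase
    } else {
        // Either the chunk was not full or the apply time is inside the margin band.
        Adjustment::Keep
    }
}

/// ASSUMPTION: one adjustment moves the limit by gas_limit / adjustment_factor.
fn next_gas_limit(gas_limit: u64, adjustment: Adjustment, adjustment_factor: u64) -> u64 {
    let step = gas_limit.checked_div(adjustment_factor).unwrap_or(0);
    match adjustment {
        Adjustment::Increase => gas_limit.saturating_add(step),
        Adjustment::Decrease => gas_limit.saturating_sub(step),
        Adjustment::Keep => gas_limit,
    }
}
```

The `Keep` band between `target - margin` and `target + margin` is what prevents the limit from flip-flopping around the equilibrium.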

Underlying assumption

When running multi-node benchmarks, all nodes have the same hardware and configuration and should therefore be able to handle the same load. Hence if one node increases the gas_limit because it can handle more load, other nodes should be able to keep up.

Discussion

This is a rough heuristic, but it has the advantage of being independent of specific hardware and traffic. Therefore I think it can be a starting point for on the fly adjustments of the gas_limit.

Next steps

  • Implement a proof of concept for gas_limit adjustments and check if it achieves reasonable gas_limits.
    • In a single node setup.
    • In a multi node setup.
  • First it can be on a separate branch to verify the approach. If it succeeds, this functionality must be separated from production code.
  • Allow configuring GAS_LIMIT_ADJUSTMENT_FACTOR to reach the equilibrium gas_limit in benchmark runs quickly. This might be possible since congestion in benchmark runs does not hurt real world users. Still, it should be verified that congestion does not pollute benchmark results.

Disclaimer

Adjusting the gas_limit on the fly touches on many concepts and places in the code base, so it might well be that I'm missing something here. I think it would be good to get started with something and build a better understanding of the topic while working on it. The heuristics for gas_limit adjustments can be refined later on in case this approach works.


@Akashin what do you think about this approach and the plan for the next steps?

@aborg-dev
Contributor Author

@pugachAG, @Longarithm - could you suggest the right place to change the gas limit that the chunk producer proposes for the next chunk?

@mooori
Contributor

mooori commented Jun 19, 2024

Notes from the offline discussion:

  • This feature is for benchmarks only, it's not intended to make it into production.
  • Due to periodic workloads (e.g. writing to disk), looking at a single block is not always sufficient to make reasonable gas_limit adjustments. More refined approaches could be:
    • Looking at the last n blocks.
    • Looking at apply chunk latency histograms.
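The "last n blocks" idea could look roughly like the sliding window below. `ApplyTimeWindow` is a hypothetical type introduced for illustration, and using the window mean as the signal is an assumption; a median or maximum would work with the same structure.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Keeps the apply times of the most recent `capacity` chunks.
struct ApplyTimeWindow {
    capacity: usize,
    samples: VecDeque<Duration>,
}

impl ApplyTimeWindow {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one sample");
        Self { capacity, samples: VecDeque::with_capacity(capacity) }
    }

    /// Adds a sample, evicting the oldest one once the window is full.
    fn record(&mut self, apply_time: Duration) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(apply_time);
    }

    /// Mean apply time over the window, or None if nothing was recorded yet.
    fn mean(&self) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let total: Duration = self.samples.iter().sum();
        Some(total / self.samples.len() as u32)
    }

    /// Slowest apply time in the window.
    fn max(&self) -> Option<Duration> {
        self.samples.iter().copied().max()
    }
}
```

With such a window, a single slow chunk caused by a periodic disk write raises the mean only moderately and drops out again after n blocks, so it does not trigger an adjustment on its own.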

@pugachAG
Contributor

@Akashin I suggest implementing that as part of the runtime if possible. Currently we set the same gas_limit in NewChunkResult. Instead, you can add gas_limit to ApplyChunkResult and use that value, which will then be picked up by the chunk producer when creating a new chunk.
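A rough sketch of that data flow follows. The structs here are heavily simplified stand-ins that only borrow the names mentioned above; they do not mirror the real nearcore definitions, and `carry_over`, `from_apply_result` and `apply_chunk` are hypothetical helpers.

```rust
/// Simplified stand-in for the runtime's chunk application output.
/// The `gas_limit` field is the hypothetical addition suggested above.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ApplyChunkResult {
    gas_limit: u64,
}

/// Simplified stand-in for what the chunk producer later reads.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NewChunkResult {
    gas_limit: u64,
}

/// Behaviour as described for today: the previous limit is carried over unchanged.
fn carry_over(prev_gas_limit: u64) -> NewChunkResult {
    NewChunkResult { gas_limit: prev_gas_limit }
}

/// Suggested behaviour: take the limit reported by the runtime.
fn from_apply_result(apply: &ApplyChunkResult) -> NewChunkResult {
    NewChunkResult { gas_limit: apply.gas_limit }
}

/// Hypothetical runtime-side step: apply the chunk, then report an adjusted limit.
/// The adjustment policy is injected so the sketch stays independent of any heuristic.
fn apply_chunk(current_gas_limit: u64, adjust: impl Fn(u64) -> u64) -> ApplyChunkResult {
    ApplyChunkResult { gas_limit: adjust(current_gas_limit) }
}
```

The point of the sketch is only that the adjusted value travels with the apply result, so the chunk producer needs no separate channel to learn about it.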

@mooori
Contributor

mooori commented Jul 18, 2024

cc #11808

@mooori
Contributor

mooori commented Aug 2, 2024

Status update 2024-08-02

  • A draft implementation is available here.
    • The approach based on Prometheus quantiles is too sluggish because the buckets accumulate metrics over the node's entire uptime, whereas the decision to increase or decrease the gas limit depends mostly on recent chunk apply times.
    • Looking only at the most recent chunk apply time and delayed receipts gas works well for me in local runs. I can start with a gas limit that is either too high or too low and the adjustment mechanism brings it to a reasonable stable state that maxes out node performance.
    • Parameters need fine tuning, but before getting into that I suggest merging the feature into master.
  • Reviewing these changes will probably be easier when the work is split into two PRs:
  • As an alternative, adjusting the gas limit in RuntimeAdapter might be more future-proof, e.g. when benchmarks are run in a lightweight environment. However, as I understand it, the gas limit is currently an external parameter for the runtime which is passed in here for each chunk. Doing the adjustment in RuntimeAdapter would require adding a mechanism for RuntimeAdapter to maintain or pass back the gas limit that it adjusted internally. That might be more intrusive on critical code paths compared to the current approach.
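The sluggishness of uptime-wide quantiles mentioned in the status update can be illustrated with a toy cumulative histogram. `CumulativeHistogram` is a hypothetical type written for this example; it is not the Prometheus client API, and the bucket bounds are arbitrary.

```rust
/// Toy histogram whose bucket counts are never reset, like uptime-wide metrics.
/// ASSUMPTION: `upper_bounds_ms` is sorted in ascending order.
struct CumulativeHistogram {
    upper_bounds_ms: Vec<u64>,
    /// One counter per bound plus a final overflow bucket.
    counts: Vec<u64>,
}

impl CumulativeHistogram {
    fn new(upper_bounds_ms: Vec<u64>) -> Self {
        let counts = vec![0; upper_bounds_ms.len() + 1];
        Self { upper_bounds_ms, counts }
    }

    /// Counts the sample in the first bucket whose bound is not exceeded.
    fn observe(&mut self, value_ms: u64) {
        let idx = self
            .upper_bounds_ms
            .iter()
            .position(|&bound| value_ms <= bound)
            .unwrap_or(self.upper_bounds_ms.len());
        self.counts[idx] += 1;
    }

    /// Upper bound of the bucket containing quantile `q` (u64::MAX for the overflow bucket).
    fn quantile_upper_bound(&self, q: f64) -> Option<u64> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let rank = ((q * total as f64).ceil() as u64).clamp(1, total);
        let mut seen = 0u64;
        for (i, &count) in self.counts.iter().enumerate() {
            seen += count;
            if seen >= rank {
                return Some(self.upper_bounds_ms.get(i).copied().unwrap_or(u64::MAX));
            }
        }
        None
    }
}
```

If a node applies 10,000 chunks in about 400 ms each and then 100 chunks in about 1500 ms each, the 95th percentile of this histogram still falls into the 500 ms bucket, even though every recent chunk is well above a 1 second target. A signal based on the most recent apply time reacts immediately.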
