
Consider increasing max_semi_space_size #2115

Closed
dapplion opened this issue Feb 28, 2021 · 11 comments
Labels
prio-low This is nice to have. scope-performance Performance issue and ideas to improve performance.

Comments

@dapplion
Contributor

dapplion commented Feb 28, 2021

According to GC metrics on master, the beacon node spends 10% of its time doing Scavenge GC runs.

Screenshot from 2021-02-28 20-45-51

If the metrics are correct, the article below suggests increasing max_semi_space_size via the --max_semi_space_size flag to reduce the frequency of these runs.

https://www.alibabacloud.com/blog/better-node-application-performance-through-gc-optimization_595119
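
For context, a quick way to see how large the new space (the pair of semi-spaces) currently is on a running node is Node's built-in v8 module. A minimal sketch, not part of Lodestar's code:

// Sketch: inspect the current new-space sizing of this Node process.
import {getHeapSpaceStatistics} from "node:v8";

const newSpace = getHeapSpaceStatistics().find((s) => s.space_name === "new_space");
if (newSpace) {
  console.log(`new_space size: ${(newSpace.space_size / 1024 / 1024).toFixed(1)} MB`);
  console.log(`new_space used: ${(newSpace.space_used_size / 1024 / 1024).toFixed(1)} MB`);
}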

@stale

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the bot:stale label Jun 2, 2021
@dapplion dapplion self-assigned this Jun 3, 2021
@stale stale bot removed the bot:stale label Jun 3, 2021
@dapplion dapplion added the scope-performance Performance issue and ideas to improve performance. label May 12, 2022
@dapplion dapplion removed their assignment May 12, 2022
@stale

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta-stale label Sep 21, 2022
@dapplion dapplion added the prio-low This is nice to have. label Sep 29, 2022
@wemeetagain
Member

cc @matthewkeil re #5829

@matthewkeil
Member

@wemeetagain I had this set on beta when we discovered the memory leak in #5851, so it was hard to see the full effect. It did cut the Scavenge time in line with the worker new-space adjustment before things started to get wonky as the heap grew from the leak. It will most definitely be a nice tune-up, though, and I will get this set correctly once the leak is addressed and we can see the fruits of the change!

@matthewkeil
Member

As a note, the docs are not super clear about what "semi" space is. After digging into the V8 codebase, it's related to the young generation and kNewLargeObjectSpaceToSemiSpaceRatio, which always equals 1:

size_t Heap::YoungGenerationSizeFromSemiSpaceSize(size_t semi_space_size) {
  return semi_space_size * (2 + kNewLargeObjectSpaceToSemiSpaceRatio);
}
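
So the young generation works out to roughly three times the semi-space size (two semi-spaces plus an equally sized new large object space). A small sketch of that arithmetic, mirroring the V8 function above (the 64 is an example value, not a recommendation):

// young generation = 2 semi-spaces + new large object space (ratio 1)
const kNewLargeObjectSpaceToSemiSpaceRatio = 1;

function youngGenerationSizeFromSemiSpaceSizeMb(semiSpaceSizeMb: number): number {
  return semiSpaceSizeMb * (2 + kNewLargeObjectSpaceToSemiSpaceRatio);
}

// e.g. --max-semi-space-size=64 implies a young generation of about 192 MB
console.log(youngGenerationSizeFromSemiSpaceSizeMb(64)); // 192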

@matthewkeil
Member

matthewkeil commented Aug 10, 2023

Memory leak #5851 is resolved. Deploying unstable to feat2 with --max-semi-space-size=64. The max Scavenge time witnessed currently was on mainnet in the unstable group; all other instances were lower, some substantially lower.
Unsure how the larger value will affect performance on the smaller instances, so I will let it run for a few days without the worker to see what turns up.

unstable-mainnet-hzax41
Screenshot 2023-08-10 at 1 23 13 PM

unstable-novc-ctvpss
Screenshot 2023-08-10 at 1 25 03 PM

unstable-sm1v-ctvpss
Screenshot 2023-08-10 at 1 25 16 PM

@matthewkeil
Member

matthewkeil commented Aug 13, 2023

Results

  • feat2 with --max-semi-space-size=64
  • feat1 with --max-semi-space-size=128
  • beta with --max-semi-space-size=256

All runs are with use_worker=false.

group-mainnet-hzax41

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat2-mainnet-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat1-mainnet-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691832600000&to=1691904600000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cbeta-mainnet-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

feat2

Screenshot 2023-08-12 at 10 05 09 PM

feat1

Screenshot 2023-08-12 at 10 53 57 PM

beta

Screenshot 2023-08-12 at 11 10 09 PM

group-lg1k-hzax41

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat2-lg1k-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat1-lg1k-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691832600000&to=1691904600000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cbeta-lg1k-hzax41&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

feat2

Screenshot 2023-08-12 at 11 22 24 PM

feat1

Screenshot 2023-08-12 at 11 22 34 PM

beta

Screenshot 2023-08-12 at 11 22 45 PM

feat2-md16-ctvpsm

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat2-md16-ctvpsm&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat1-md16-ctvpsm&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691832600000&to=1691904600000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cbeta-md16-ctvpsm&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

feat2

Screenshot 2023-08-12 at 11 23 46 PM

feat1

Screenshot 2023-08-12 at 11 23 53 PM

beta

Screenshot 2023-08-12 at 11 24 01 PM

feat2-sm1v-ctvpss

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat2-sm1v-ctvpss&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691814600000&to=1691901000000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cfeat1-sm1v-ctvpss&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

https://grafana-lodestar.chainsafe.io/d/lodestar_vm_host/lodestar-vm-host?from=1691832600000&to=1691904600000&var-DS_PROMETHEUS=default&var-rate_interval=1h&var-Filters=instance%7C%3D%7Cbeta-sm1v-ctvpss&orgId=1&var-beacon_job=$%7BVAR_BEACON_JOB%7D

feat2

Screenshot 2023-08-12 at 11 24 34 PM

feat1

Screenshot 2023-08-12 at 11 24 40 PM

beta

Screenshot 2023-08-12 at 11 24 46 PM

@matthewkeil
Member

Deployed 512mb to beta at the timestamp below. Will pull metrics in a day, after it stabilizes, to see how they compare to the 256mb that was also on beta.

Screenshot 2023-08-16 at 12 00 09 AM

@matthewkeil
Member

Moving the new space to 512mb was detrimental. On mainnet the Scavenge time dropped, but for some reason mark-and-sweep started to climb considerably. I am not sure of the phenomenon here, but it's not worth investigating.

Screenshot 2023-08-22 at 5 27 38 AM

The sweet spot for the new-space setting is a value similar to the net rate at which objects are created/collected, such that (as a net average) objects live only in the from-space and do not get moved to the to-space when collection occurs. GC tends to drop a bit further (as a % of CPU time) up to roughly a threshold of two times the rate of object creation. At that point, though, the space is so large that it starts to affect node performance, likely due to searching for objects during runtime.

The current creation rate on unstable-mainnet-hzax41, on a 30-day timeline with a 7d $rate_interval, is roughly 150mb. To keep the number on an even page interval, a setting of 148mb or 152mb is recommended.

Screenshot 2023-08-22 at 5 38 07 AM

Using a 6h $rate_interval to confirm makes those values seem appropriate for a first go. We can reassess in another couple of weeks after setting it to see how things proceed.

Screenshot 2023-08-22 at 5 42 33 AM

This value is set at the command line; however, it is possible to programmatically adjust it during startup from historical data. The same is true of the new-space adjustment for the network worker, and the assumptions made here apply to the worker as well, since most of the scavenged garbage is network related. This investigation started with #5829 and that fact was proven out there. As a note, once the heap size is set at startup it is not possible to change the value in either case.
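
For reference, a hedged sketch of how the startup-time knobs could be wired, assuming Node's standard CLI flag and worker_threads resourceLimits; the values and the ./networkWorker.js path are illustrative, not Lodestar's actual settings:

// Main thread: the flag must be passed before the heap is sized, e.g.
//   node --max-semi-space-size=64 ./beacon-node.js
// Worker thread: the closest knob is resourceLimits.maxYoungGenerationSizeMb,
// which caps the young generation (about 3x the semi-space size per the V8 formula above).
import {Worker} from "node:worker_threads";

const tunedYoungGenMb = 192; // illustrative: derived from a ~64 MB semi-space target

const networkWorker = new Worker("./networkWorker.js", {
  resourceLimits: {maxYoungGenerationSizeMb: tunedYoungGenMb},
});

networkWorker.on("online", () => {
  console.log("network worker started with tuned young-generation size");
});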

@matthewkeil
Member

Final note for reference:

  • A value that is too low results in excess Scavenge GC time.
  • A value that is marginally too high results in node performance degradation from increased variable lookup time.
  • A value that is very high results in mark-and-sweep collection (reason unknown; not researched further).

To set this value in the future: run the node without setting a maxYoungGeneration size, observe the net Scavenge collection rate, and set the flag to that same value (see the example in the images above and the sketch below). That value seems to be near the sweet spot. If this methodology is refined in the future, another note will be added below.
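
As a minimal sketch of observing Scavenge activity from inside the process (assuming Node >= 16, where GC details live on entry.detail; Lodestar's real numbers come from its Prometheus metrics, not this snippet):

// Track time spent in minor GC (Scavenge) to help pick a new-space size.
import {PerformanceObserver, constants} from "node:perf_hooks";

let scavengeMs = 0;

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const kind = (entry as unknown as {detail?: {kind?: number}}).detail?.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MINOR) {
      scavengeMs += entry.duration;
    }
  }
});
obs.observe({entryTypes: ["gc"]});

// Report the Scavenge share of wall time once a minute.
setInterval(() => {
  console.log(`scavenge: ${scavengeMs.toFixed(1)} ms over 60s (${((scavengeMs / 60_000) * 100).toFixed(2)}%)`);
  scavengeMs = 0;
}, 60_000).unref();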

@matthewkeil
Member

Closed via #5829
