
Pull gossip queues for better throughput #5195

Merged
wemeetagain merged 11 commits into unstable from dapplion/gossip-queue-pull on Mar 1, 2023

Conversation

dapplion (Contributor)

Motivation

Lodestar's gossip queues are too restrictive to handle large volumes of objects, such as when subscribed to all subnets.

Our existing queues throttle by limiting concurrency, irrespective of the downstream workers' capacity. Each queue has the authority to push jobs, and does so independently of the others.

Description

This PR updates the queue design so that:

  • Queues just hold data
  • NetworkProcessor has the authority to get data from the queues and execute it
  • NetworkProcessor throttles based on the downstream workers' availability to take on more work
  • Since there's now a central scheduling function, it can prioritize some topics over others, for example processing blocks before attestations (see the sketch at the end of this description)

Marking as draft until testing shows all good
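Below is a minimal TypeScript sketch of the pull-based design described above. The class and method names (GossipQueue, NetworkProcessor, executeWork) and the topic list are simplified illustrations of the idea, not the exact Lodestar implementation:

```ts
// Illustrative topic priority: earlier entries are drained first (blocks before attestations).
type GossipType = "beacon_block" | "beacon_aggregate_and_proof" | "beacon_attestation";
const topicPriority: GossipType[] = ["beacon_block", "beacon_aggregate_and_proof", "beacon_attestation"];

/** Queues only hold data; they never schedule work themselves. */
class GossipQueue<T> {
  private items: T[] = [];
  add(item: T): void {
    this.items.push(item);
  }
  next(): T | undefined {
    return this.items.shift();
  }
}

/** Central scheduler: pulls from the queues only while workers have spare capacity. */
class NetworkProcessor<T> {
  private activeJobs = 0;

  constructor(
    private readonly queues: Record<GossipType, GossipQueue<T>>,
    private readonly maxConcurrency: number,
    private readonly process: (type: GossipType, item: T) => Promise<void>
  ) {}

  /** Called whenever a message is queued or a job completes. */
  executeWork(): void {
    while (this.activeJobs < this.maxConcurrency) {
      const job = this.pullHighestPriority();
      if (job === null) return; // nothing queued
      this.activeJobs++;
      void this.process(job.type, job.item).finally(() => {
        this.activeJobs--;
        this.executeWork(); // downstream capacity freed, pull more work
      });
    }
  }

  private pullHighestPriority(): {type: GossipType; item: T} | null {
    for (const type of topicPriority) {
      const item = this.queues[type].next();
      if (item !== undefined) return {type, item};
    }
    return null;
  }
}
```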

github-actions bot (Contributor) commented Feb 22, 2023

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: 36bab22 Previous: ec78af1 Ratio
getPubkeys - index2pubkey - req 1000 vs - 250000 vc 708.98 us/op 561.77 us/op 1.26
getPubkeys - validatorsArr - req 1000 vs - 250000 vc 44.954 us/op 47.246 us/op 0.95
BLS verify - blst-native 1.2122 ms/op 1.2165 ms/op 1.00
BLS verifyMultipleSignatures 3 - blst-native 2.4684 ms/op 2.4828 ms/op 0.99
BLS verifyMultipleSignatures 8 - blst-native 5.2943 ms/op 5.3460 ms/op 0.99
BLS verifyMultipleSignatures 32 - blst-native 19.150 ms/op 19.246 ms/op 1.00
BLS aggregatePubkeys 32 - blst-native 25.538 us/op 25.855 us/op 0.99
BLS aggregatePubkeys 128 - blst-native 100.35 us/op 100.60 us/op 1.00
getAttestationsForBlock 54.912 ms/op 57.051 ms/op 0.96
isKnown best case - 1 super set check 274.00 ns/op 267.00 ns/op 1.03
isKnown normal case - 2 super set checks 270.00 ns/op 264.00 ns/op 1.02
isKnown worse case - 16 super set checks 273.00 ns/op 253.00 ns/op 1.08
CheckpointStateCache - add get delete 5.1450 us/op 5.8040 us/op 0.89
validate gossip signedAggregateAndProof - struct 2.7478 ms/op 2.7830 ms/op 0.99
validate gossip attestation - struct 1.3015 ms/op 1.3270 ms/op 0.98
pickEth1Vote - no votes 1.2328 ms/op 1.3372 ms/op 0.92
pickEth1Vote - max votes 8.1158 ms/op 9.9704 ms/op 0.81
pickEth1Vote - Eth1Data hashTreeRoot value x2048 8.4273 ms/op 9.0075 ms/op 0.94
pickEth1Vote - Eth1Data hashTreeRoot tree x2048 12.889 ms/op 14.644 ms/op 0.88
pickEth1Vote - Eth1Data fastSerialize value x2048 624.27 us/op 670.85 us/op 0.93
pickEth1Vote - Eth1Data fastSerialize tree x2048 4.5012 ms/op 7.2826 ms/op 0.62
bytes32 toHexString 487.00 ns/op 502.00 ns/op 0.97
bytes32 Buffer.toString(hex) 334.00 ns/op 350.00 ns/op 0.95
bytes32 Buffer.toString(hex) from Uint8Array 535.00 ns/op 581.00 ns/op 0.92
bytes32 Buffer.toString(hex) + 0x 330.00 ns/op 345.00 ns/op 0.96
Object access 1 prop 0.16800 ns/op 0.16700 ns/op 1.01
Map access 1 prop 0.16200 ns/op 0.16800 ns/op 0.96
Object get x1000 6.4600 ns/op 6.4880 ns/op 1.00
Map get x1000 0.56400 ns/op 0.64100 ns/op 0.88
Object set x1000 51.214 ns/op 54.609 ns/op 0.94
Map set x1000 43.240 ns/op 45.160 ns/op 0.96
Return object 10000 times 0.23530 ns/op 0.24940 ns/op 0.94
Throw Error 10000 times 4.0809 us/op 4.2335 us/op 0.96
fastMsgIdFn sha256 / 200 bytes 3.3670 us/op 3.4850 us/op 0.97
fastMsgIdFn h32 xxhash / 200 bytes 278.00 ns/op 281.00 ns/op 0.99
fastMsgIdFn h64 xxhash / 200 bytes 382.00 ns/op 386.00 ns/op 0.99
fastMsgIdFn sha256 / 1000 bytes 11.299 us/op 11.728 us/op 0.96
fastMsgIdFn h32 xxhash / 1000 bytes 401.00 ns/op 420.00 ns/op 0.95
fastMsgIdFn h64 xxhash / 1000 bytes 450.00 ns/op 477.00 ns/op 0.94
fastMsgIdFn sha256 / 10000 bytes 101.78 us/op 104.18 us/op 0.98
fastMsgIdFn h32 xxhash / 10000 bytes 1.8700 us/op 1.9900 us/op 0.94
fastMsgIdFn h64 xxhash / 10000 bytes 1.3080 us/op 1.4080 us/op 0.93
enrSubnets - fastDeserialize 64 bits 1.2670 us/op 1.3230 us/op 0.96
enrSubnets - ssz BitVector 64 bits 486.00 ns/op 491.00 ns/op 0.99
enrSubnets - fastDeserialize 4 bits 175.00 ns/op 175.00 ns/op 1.00
enrSubnets - ssz BitVector 4 bits 471.00 ns/op 516.00 ns/op 0.91
prioritizePeers score -10:0 att 32-0.1 sync 2-0 91.768 us/op 105.90 us/op 0.87
prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 119.39 us/op 133.26 us/op 0.90
prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 163.73 us/op 194.72 us/op 0.84
prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 293.41 us/op 360.89 us/op 0.81
prioritizePeers score 0:0 att 64-1 sync 4-1 350.71 us/op 427.19 us/op 0.82
array of 16000 items push then shift 1.6015 us/op 1.6700 us/op 0.96
LinkedList of 16000 items push then shift 8.6710 ns/op 9.0440 ns/op 0.96
array of 16000 items push then pop 73.828 ns/op 113.36 ns/op 0.65
LinkedList of 16000 items push then pop 8.4770 ns/op 8.9670 ns/op 0.95
array of 24000 items push then shift 2.3320 us/op 2.3933 us/op 0.97
LinkedList of 24000 items push then shift 8.7870 ns/op 9.0640 ns/op 0.97
array of 24000 items push then pop 74.469 ns/op 85.442 ns/op 0.87
LinkedList of 24000 items push then pop 8.5240 ns/op 8.9200 ns/op 0.96
intersect bitArray bitLen 8 13.287 ns/op 13.523 ns/op 0.98
intersect array and set length 8 78.027 ns/op 85.537 ns/op 0.91
intersect bitArray bitLen 128 43.972 ns/op 44.771 ns/op 0.98
intersect array and set length 128 1.0786 us/op 1.1676 us/op 0.92
Buffer.concat 32 items 2.8860 us/op 3.0110 us/op 0.96
Uint8Array.set 32 items 2.5730 us/op 2.6580 us/op 0.97
pass gossip attestations to forkchoice per slot 3.3672 ms/op 3.4182 ms/op 0.99
computeDeltas 3.0582 ms/op 2.9936 ms/op 1.02
computeProposerBoostScoreFromBalances 1.7896 ms/op 1.7953 ms/op 1.00
altair processAttestation - 250000 vs - 7PWei normalcase 2.1366 ms/op 2.4709 ms/op 0.86
altair processAttestation - 250000 vs - 7PWei worstcase 3.4676 ms/op 3.8026 ms/op 0.91
altair processAttestation - setStatus - 1/6 committees join 140.28 us/op 143.13 us/op 0.98
altair processAttestation - setStatus - 1/3 committees join 277.55 us/op 277.61 us/op 1.00
altair processAttestation - setStatus - 1/2 committees join 363.44 us/op 374.68 us/op 0.97
altair processAttestation - setStatus - 2/3 committees join 464.74 us/op 471.64 us/op 0.99
altair processAttestation - setStatus - 4/5 committees join 651.93 us/op 658.81 us/op 0.99
altair processAttestation - setStatus - 100% committees join 765.36 us/op 764.08 us/op 1.00
altair processBlock - 250000 vs - 7PWei normalcase 15.677 ms/op 17.536 ms/op 0.89
altair processBlock - 250000 vs - 7PWei normalcase hashState 23.198 ms/op 27.269 ms/op 0.85
altair processBlock - 250000 vs - 7PWei worstcase 45.169 ms/op 52.513 ms/op 0.86
altair processBlock - 250000 vs - 7PWei worstcase hashState 65.601 ms/op 73.430 ms/op 0.89
phase0 processBlock - 250000 vs - 7PWei normalcase 1.9368 ms/op 2.1754 ms/op 0.89
phase0 processBlock - 250000 vs - 7PWei worstcase 29.290 ms/op 29.569 ms/op 0.99
altair processEth1Data - 250000 vs - 7PWei normalcase 469.97 us/op 495.61 us/op 0.95
vc - 250000 eb 1 eth1 1 we 0 wn 0 - smpl 15 8.0620 us/op 9.0240 us/op 0.89
vc - 250000 eb 0.95 eth1 0.1 we 0.05 wn 0 - smpl 219 22.588 us/op 28.640 us/op 0.79
vc - 250000 eb 0.95 eth1 0.3 we 0.05 wn 0 - smpl 42 10.773 us/op 11.659 us/op 0.92
vc - 250000 eb 0.95 eth1 0.7 we 0.05 wn 0 - smpl 18 8.2580 us/op 8.6220 us/op 0.96
vc - 250000 eb 0.1 eth1 0.1 we 0 wn 0 - smpl 1020 91.765 us/op 111.52 us/op 0.82
vc - 250000 eb 0.03 eth1 0.03 we 0 wn 0 - smpl 11777 654.15 us/op 653.44 us/op 1.00
vc - 250000 eb 0.01 eth1 0.01 we 0 wn 0 - smpl 16384 907.83 us/op 918.87 us/op 0.99
vc - 250000 eb 0 eth1 0 we 0 wn 0 - smpl 16384 846.36 us/op 887.03 us/op 0.95
vc - 250000 eb 0 eth1 0 we 0 wn 0 nocache - smpl 16384 2.2435 ms/op 2.4370 ms/op 0.92
vc - 250000 eb 0 eth1 1 we 0 wn 0 - smpl 16384 1.4518 ms/op 1.5014 ms/op 0.97
vc - 250000 eb 0 eth1 1 we 0 wn 0 nocache - smpl 16384 3.6569 ms/op 3.9518 ms/op 0.93
Tree 40 250000 create 299.30 ms/op 342.41 ms/op 0.87
Tree 40 250000 get(125000) 178.95 ns/op 197.64 ns/op 0.91
Tree 40 250000 set(125000) 1.0280 us/op 1.0334 us/op 0.99
Tree 40 250000 toArray() 20.154 ms/op 21.456 ms/op 0.94
Tree 40 250000 iterate all - toArray() + loop 20.067 ms/op 22.216 ms/op 0.90
Tree 40 250000 iterate all - get(i) 72.250 ms/op 73.074 ms/op 0.99
MutableVector 250000 create 10.802 ms/op 10.786 ms/op 1.00
MutableVector 250000 get(125000) 6.4970 ns/op 8.3130 ns/op 0.78
MutableVector 250000 set(125000) 264.49 ns/op 273.52 ns/op 0.97
MutableVector 250000 toArray() 2.8241 ms/op 3.2361 ms/op 0.87
MutableVector 250000 iterate all - toArray() + loop 2.9546 ms/op 3.2199 ms/op 0.92
MutableVector 250000 iterate all - get(i) 1.4920 ms/op 1.5485 ms/op 0.96
Array 250000 create 2.5915 ms/op 2.6946 ms/op 0.96
Array 250000 clone - spread 1.2281 ms/op 1.1564 ms/op 1.06
Array 250000 get(125000) 0.59400 ns/op 0.59800 ns/op 0.99
Array 250000 set(125000) 0.66700 ns/op 0.68800 ns/op 0.97
Array 250000 iterate all - loop 80.705 us/op 103.56 us/op 0.78
effectiveBalanceIncrements clone Uint8Array 300000 29.594 us/op 34.899 us/op 0.85
effectiveBalanceIncrements clone MutableVector 300000 387.00 ns/op 368.00 ns/op 1.05
effectiveBalanceIncrements rw all Uint8Array 300000 164.75 us/op 170.64 us/op 0.97
effectiveBalanceIncrements rw all MutableVector 300000 81.820 ms/op 84.774 ms/op 0.97
phase0 afterProcessEpoch - 250000 vs - 7PWei 114.10 ms/op 117.61 ms/op 0.97
phase0 beforeProcessEpoch - 250000 vs - 7PWei 34.862 ms/op 41.593 ms/op 0.84
altair processEpoch - mainnet_e81889 306.68 ms/op 309.16 ms/op 0.99
mainnet_e81889 - altair beforeProcessEpoch 60.693 ms/op 51.774 ms/op 1.17
mainnet_e81889 - altair processJustificationAndFinalization 17.946 us/op 30.888 us/op 0.58
mainnet_e81889 - altair processInactivityUpdates 5.8591 ms/op 6.5172 ms/op 0.90
mainnet_e81889 - altair processRewardsAndPenalties 68.364 ms/op 71.161 ms/op 0.96
mainnet_e81889 - altair processRegistryUpdates 2.5440 us/op 2.4440 us/op 1.04
mainnet_e81889 - altair processSlashings 534.00 ns/op 542.00 ns/op 0.99
mainnet_e81889 - altair processEth1DataReset 605.00 ns/op 540.00 ns/op 1.12
mainnet_e81889 - altair processEffectiveBalanceUpdates 1.2109 ms/op 1.2670 ms/op 0.96
mainnet_e81889 - altair processSlashingsReset 4.0010 us/op 4.1020 us/op 0.98
mainnet_e81889 - altair processRandaoMixesReset 7.2970 us/op 4.6280 us/op 1.58
mainnet_e81889 - altair processHistoricalRootsUpdate 687.00 ns/op 731.00 ns/op 0.94
mainnet_e81889 - altair processParticipationFlagUpdates 2.2780 us/op 2.8830 us/op 0.79
mainnet_e81889 - altair processSyncCommitteeUpdates 537.00 ns/op 486.00 ns/op 1.10
mainnet_e81889 - altair afterProcessEpoch 125.07 ms/op 128.57 ms/op 0.97
phase0 processEpoch - mainnet_e58758 404.83 ms/op 369.13 ms/op 1.10
mainnet_e58758 - phase0 beforeProcessEpoch 154.82 ms/op 129.18 ms/op 1.20
mainnet_e58758 - phase0 processJustificationAndFinalization 19.875 us/op 16.714 us/op 1.19
mainnet_e58758 - phase0 processRewardsAndPenalties 65.016 ms/op 62.154 ms/op 1.05
mainnet_e58758 - phase0 processRegistryUpdates 7.7210 us/op 7.4570 us/op 1.04
mainnet_e58758 - phase0 processSlashings 544.00 ns/op 438.00 ns/op 1.24
mainnet_e58758 - phase0 processEth1DataReset 560.00 ns/op 461.00 ns/op 1.21
mainnet_e58758 - phase0 processEffectiveBalanceUpdates 1.1882 ms/op 1.0003 ms/op 1.19
mainnet_e58758 - phase0 processSlashingsReset 5.7000 us/op 3.7600 us/op 1.52
mainnet_e58758 - phase0 processRandaoMixesReset 6.4740 us/op 3.9490 us/op 1.64
mainnet_e58758 - phase0 processHistoricalRootsUpdate 703.00 ns/op 543.00 ns/op 1.29
mainnet_e58758 - phase0 processParticipationRecordUpdates 4.7550 us/op 4.2900 us/op 1.11
mainnet_e58758 - phase0 afterProcessEpoch 101.59 ms/op 99.409 ms/op 1.02
phase0 processEffectiveBalanceUpdates - 250000 normalcase 1.2513 ms/op 1.2587 ms/op 0.99
phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 1.5092 ms/op 1.5505 ms/op 0.97
altair processInactivityUpdates - 250000 normalcase 33.615 ms/op 24.038 ms/op 1.40
altair processInactivityUpdates - 250000 worstcase 31.669 ms/op 28.974 ms/op 1.09
phase0 processRegistryUpdates - 250000 normalcase 11.660 us/op 7.2410 us/op 1.61
phase0 processRegistryUpdates - 250000 badcase_full_deposits 251.90 us/op 268.30 us/op 0.94
phase0 processRegistryUpdates - 250000 worstcase 0.5 137.39 ms/op 124.30 ms/op 1.11
altair processRewardsAndPenalties - 250000 normalcase 66.941 ms/op 70.938 ms/op 0.94
altair processRewardsAndPenalties - 250000 worstcase 74.310 ms/op 72.080 ms/op 1.03
phase0 getAttestationDeltas - 250000 normalcase 6.9965 ms/op 6.6074 ms/op 1.06
phase0 getAttestationDeltas - 250000 worstcase 7.4012 ms/op 6.5565 ms/op 1.13
phase0 processSlashings - 250000 worstcase 3.4188 ms/op 3.5393 ms/op 0.97
altair processSyncCommitteeUpdates - 250000 184.19 ms/op 177.19 ms/op 1.04
BeaconState.hashTreeRoot - No change 343.00 ns/op 350.00 ns/op 0.98
BeaconState.hashTreeRoot - 1 full validator 54.178 us/op 53.448 us/op 1.01
BeaconState.hashTreeRoot - 32 full validator 506.87 us/op 475.64 us/op 1.07
BeaconState.hashTreeRoot - 512 full validator 5.3527 ms/op 5.5657 ms/op 0.96
BeaconState.hashTreeRoot - 1 validator.effectiveBalance 62.121 us/op 62.201 us/op 1.00
BeaconState.hashTreeRoot - 32 validator.effectiveBalance 996.49 us/op 887.62 us/op 1.12
BeaconState.hashTreeRoot - 512 validator.effectiveBalance 10.793 ms/op 10.954 ms/op 0.99
BeaconState.hashTreeRoot - 1 balances 47.590 us/op 46.914 us/op 1.01
BeaconState.hashTreeRoot - 32 balances 455.35 us/op 460.53 us/op 0.99
BeaconState.hashTreeRoot - 512 balances 4.2209 ms/op 4.5365 ms/op 0.93
BeaconState.hashTreeRoot - 250000 balances 74.845 ms/op 71.983 ms/op 1.04
aggregationBits - 2048 els - zipIndexesInBitList 14.885 us/op 15.497 us/op 0.96
regular array get 100000 times 41.784 us/op 32.714 us/op 1.28
wrappedArray get 100000 times 32.332 us/op 43.072 us/op 0.75
arrayWithProxy get 100000 times 15.424 ms/op 15.425 ms/op 1.00
ssz.Root.equals 551.00 ns/op 570.00 ns/op 0.97
byteArrayEquals 548.00 ns/op 529.00 ns/op 1.04
shuffle list - 16384 els 6.9417 ms/op 6.7689 ms/op 1.03
shuffle list - 250000 els 99.976 ms/op 99.875 ms/op 1.00
processSlot - 1 slots 8.7430 us/op 8.4910 us/op 1.03
processSlot - 32 slots 1.3430 ms/op 1.3940 ms/op 0.96
getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei 204.52 us/op 189.68 us/op 1.08
getCommitteeAssignments - req 1 vs - 250000 vc 2.9087 ms/op 2.9080 ms/op 1.00
getCommitteeAssignments - req 100 vs - 250000 vc 4.1745 ms/op 4.1601 ms/op 1.00
getCommitteeAssignments - req 1000 vs - 250000 vc 4.4877 ms/op 4.4887 ms/op 1.00
RootCache.getBlockRootAtSlot - 250000 vs - 7PWei 4.7400 ns/op 4.3300 ns/op 1.09
state getBlockRootAtSlot - 250000 vs - 7PWei 980.43 ns/op 549.24 ns/op 1.79
computeProposers - vc 250000 10.403 ms/op 10.139 ms/op 1.03
computeEpochShuffling - vc 250000 101.31 ms/op 107.69 ms/op 0.94
getNextSyncCommittee - vc 250000 169.54 ms/op 169.63 ms/op 1.00

by benchmarkbot/action

dapplion (Contributor, Author) commented Feb 23, 2023

Deployed to feat3; results show good queue efficiency, but mesh health is bad.

Queues are never full, the message drop rate is 0, and job wait time is almost 0.

Screenshot from 2023-02-23 12-15-14

But mesh peer count is now at the minimum, and we don't have peers on most subnets.

Screenshot from 2023-02-23 12-16-36

We receive far fewer messages on the beacon_attestation topic.

Screenshot from 2023-02-23 12-17-51

Peer quality is quite different; we mostly have nodes with no validators connected.

Screenshot from 2023-02-23 12-20-10

EDIT: From Tuyen

feat3-lg1k-hzax41 has the same issue as beta-lg1k-hzax41: this node contains only peers with few long-lived subnets, and the peer count is consistently >= 56, so there is no room for new peers. Peers with few long-lived subnets cause low mesh peer counts; see #5198.

  • feat3-novc-ctvpss: Queue times for aggregates reduced from 1s to 50ms. Other metrics good.
  • feat3-sm1v-ctvpss: Queue times for aggregates reduced from 1s to 50ms. Other metrics good.
  • feat3-md16-ctvpsm: Queue times for aggregates + attestations reduced from 1s to 50ms. Other metrics good.
  • feat3-md64-hzax41: Affected by the issue of peers with low subnet counts
  • feat3-lg1k-hzax41: Affected by the issue of peers with low subnet counts

twoeths (Contributor) commented Feb 23, 2023

Peers are not pruned because:

  1. most/all peers have duties
  2. most/all peers subscribe to >= 1 subnet
  3. most/all peers have a good score
  4. peers are not too concentrated on any one subnet

This is the same issue as #5198.

The reason it happens more on this branch may be that the gossipsub score is so good (point 3 above; I compared gossipsub scores in feat3 and the other groups).

@dapplion dapplion force-pushed the dapplion/gossip-queue-pull branch from 9a5b31b to 99cb497 Compare February 28, 2023 06:10
@dapplion dapplion marked this pull request as ready for review March 1, 2023 02:21
@dapplion dapplion requested a review from a team as a code owner March 1, 2023 02:21
@@ -11,7 +11,7 @@
"bugs": {
"url": "https://github.com/ChainSafe/lodestar/issues"
},
"version": "1.4.3",
"version": "1.5.0",
Contributor:

I wonder why it shows a diff here while unstable already has the version as 1.5.0; it shouldn't be an issue anyway.

Contributor (Author):

Rebased on unstable and fixed the bad diff

twoeths (Contributor) commented Mar 1, 2023

This is awesome work; I can see good metrics like "Job Wait Time" and "Dropped jobs %". However, I also notice it puts some pressure on the I/O lag issue in #4002:

  • A slight increase in Event Loop Lag
  • A slight increase in Lodestar CPU time (160% -> 170%)
  • Especially submitPoolAttestations on the validator side (500ms vs 100ms on the stable 1k-validator node)

Screen Shot 2023-03-01 at 10 25 47

twoeths (Contributor) commented Mar 1, 2023

@dapplion is it possible to introduce some flags to mitigate the issue?

@dapplion dapplion force-pushed the dapplion/gossip-queue-pull branch from 0e803d0 to d04463a Compare March 1, 2023 04:03
dapplion (Contributor, Author) commented Mar 1, 2023

@dapplion is it possible to introduce some flags to mitigate the issue?

If anything, this PR allows nodes to process all the network objects they are required to process per the spec. "The issue" is actually correct behavior; current stable drops too many messages. To mitigate "the issue" we would have to force the node to process fewer messages than it can, essentially re-introducing a forced throughput limit. I'm softly against introducing such flags, and would rather spend our energy understanding what's taking CPU time.

If we were to add flags, we could limit the max concurrency per topic; a rough sketch of that idea follows.
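For illustration, here is how such a per-topic cap could gate the central scheduler. The option name maxGossipTopicConcurrency matches the one added in this PR, but the helper function and its surroundings are assumptions, not the actual implementation:

```ts
// Hypothetical helper: only pull more work from a topic whose in-flight job
// count is still below the configured cap (undefined means "no cap").
function canPullFromTopic(
  jobsInFlightByTopic: Map<string, number>,
  topic: string,
  maxGossipTopicConcurrency?: number
): boolean {
  if (maxGossipTopicConcurrency === undefined) return true;
  return (jobsInFlightByTopic.get(topic) ?? 0) < maxGossipTopicConcurrency;
}
```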

@philknows philknows added this to the v1.6.0 milestone Mar 1, 2023
twoeths (Contributor) commented Mar 1, 2023

0301_gossip-queue-pull_1k.cpuprofile.zip

I'm attaching the profile; there is nothing special except that handleReceivedMessages() (and then executeWork) was called many times inside runMicroTasks.

In the current implementation we yield to the macrotask queue every 50ms via yieldEveryMs inside JobItemQueue. Do we have an equivalent mechanism in this implementation, @dapplion?

dapplion (Contributor, Author) commented Mar 1, 2023

In the current implementation we yield to the macrotask queue every 50ms via yieldEveryMs inside JobItemQueue. Do we have an equivalent mechanism in this implementation, @dapplion?

That should not be necessary in theory; I added metrics, and each gossip packet tends to trigger work on its own. Looking at the profile, I don't see issues related to this code specifically. Since everything runs synchronously now, it makes sense that handleReceivedMessages and executeWork show up on the call stack, but their self time is small.
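For context, a simplified sketch of the time-based yield twoeths describes above. The real JobItemQueue differs; yieldEveryMs is the constant referenced in the discussion, while the loop shape and function names here are illustrative only:

```ts
const yieldEveryMs = 50;

// Hand control back to the macrotask queue so timers and I/O callbacks can run.
const yieldToMacroQueue = (): Promise<void> => new Promise((resolve) => setTimeout(resolve, 0));

async function drainJobs<T>(jobs: T[], runJob: (job: T) => Promise<void>): Promise<void> {
  let lastYield = Date.now();
  for (const job of jobs) {
    await runJob(job);
    if (Date.now() - lastYield > yieldEveryMs) {
      await yieldToMacroQueue();
      lastYield = Date.now();
    }
  }
}
```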

gossipHandlers?: GossipHandlers;
};

export class NetworkWorker {
Member:

Might be worth adding a tsdoc comment here describing the purpose, or mentioning that this isn't related to thread workers.

Contributor (Author):

Yeah, the naming sucks; I think this specific class should be merged into the NetworkProcessor.

maxGossipTopicConcurrency?: number;
};

const executeGossipWorkOrderObj: Record<GossipType, true> = {
Member:

Might be worth a comment here mentioning that the key order defines the order in which gossip topics will be processed.
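For illustration, the insertion order of the record's keys can double as the processing priority. The enum values below are assumed for the sketch, and deriving executeGossipWorkOrder this way is one possible approach rather than the confirmed implementation:

```ts
enum GossipType {
  beacon_block = "beacon_block",
  beacon_aggregate_and_proof = "beacon_aggregate_and_proof",
  beacon_attestation = "beacon_attestation",
}

// Key order doubles as processing priority: blocks first, attestations last.
const executeGossipWorkOrderObj: Record<GossipType, true> = {
  [GossipType.beacon_block]: true,
  [GossipType.beacon_aggregate_and_proof]: true,
  [GossipType.beacon_attestation]: true,
};

// Object.keys preserves string-key insertion order, yielding the priority list.
const executeGossipWorkOrder = Object.keys(executeGossipWorkOrderObj) as GossipType[];
```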

};

export type NetworkProcessorOpts = GossipHandlerOpts & {
maxGossipTopicConcurrency?: number;
Member:

nice to have a tsdoc here

wemeetagain (Member) left a comment

Looks really good. Can you follow up with a PR to add all the new metrics to the dashboard? It seems we'll have to tune things based on what we see there.

@wemeetagain wemeetagain merged commit f03f911 into unstable Mar 1, 2023
@wemeetagain wemeetagain deleted the dapplion/gossip-queue-pull branch March 1, 2023 15:52
nazarhussain added commits that referenced this pull request Mar 14, 2023
wemeetagain (Member):
🎉 This PR is included in v1.6.0 🎉

twoeths added commits that referenced this pull request Mar 27, 2023
wemeetagain (Member):
🎉 This PR is included in v1.7.0 🎉

twoeths added a commit that referenced this pull request Apr 7, 2023
wemeetagain pushed a commit that referenced this pull request Apr 7, 2023
wemeetagain (Member):
🎉 This PR is included in v1.8.0 🎉
