
Monitor performance of MpoolSelect #11233

Closed
Tracked by #11251
arajasek opened this issue Sep 1, 2023 · 5 comments

arajasek commented Sep 1, 2023

The MpoolSelect method draws messages from the Lotus daemon's message pool, optimizing for maximum gas reward for the miner that might include those messages. Although it's rarely called by users other than the lotus-miner process, it is a very important method for the network as a whole.

The performance of this method can be somewhat variable, depending on (among other things):

  • the number of messages in the mempool
  • the nature of the message "chains"
  • possibly the ticket quality of the prospective miner

We've historically been somewhat blind to the performance of this method, since it's pretty much only called by miners. This makes it hard to investigate reports that this method might be slow, or find ways to optimize it.

In order to do so, we should create a simple tool (perhaps in lotus-shed) that calls MpoolSelect every 30 seconds, and records how long it takes. It should also log a few other stats, such as:

  • the size of the mempool at the time
  • the number of "selected" messages.

We will then want to profile the method itself (if it does appear to take dangerously long in certain scenarios), and look for potential optimizations.
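The proposed tool could be sketched as a small polling loop along these lines. This is a hypothetical sketch: mpoolSelect here is a stub standing in for the real Lotus API call (the actual tool would go through the node's JSON-RPC API), and the stats simply mirror what the issue asks the tool to record:

```go
package main

import (
	"fmt"
	"time"
)

// selectResult holds the stats the tool should record alongside the duration.
type selectResult struct {
	mpoolSize int // size of the mempool at the time
	selected  int // number of "selected" messages
}

// mpoolSelect is a stub standing in for the real Lotus API call;
// a real tool would query the daemon over JSON-RPC instead.
func mpoolSelect() selectResult {
	return selectResult{mpoolSize: 146, selected: 58}
}

// pollOnce times a single selection and formats the stats as a log line.
func pollOnce() string {
	start := time.Now()
	res := mpoolSelect()
	took := time.Since(start)
	return fmt.Sprintf("Message selection took %s\nSize of the mempool: %d\nNumber of selected messages: %d",
		took, res.mpoolSize, res.selected)
}

func main() {
	// The real tool would tick every 30 seconds; shortened here for illustration.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 2; i++ {
		<-ticker.C
		fmt.Println(pollOnce())
	}
}
```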


rjan90 commented Sep 4, 2023

I expanded the lotus-shed mpool miner-select-messages command a bit here, and ran it over the weekend, calling it every 30 seconds. Some notes from those logs:

  • MpoolSelect usually takes less than 100ms.
    • Occasionally, MpoolSelect takes significantly longer (> 2 seconds). This happened approximately 80 times in a day.
    • On 3–4 occasions it spiked above 10 seconds, up towards 50 seconds. Not sure if this is an anomaly in my environment.

The daemon-logs indicate that creating the message chains is what is taking up the most time when this happens:

2023-09-04T09:11:59.454+0200	INFO	messagepool	messagepool/selection.go:712	get pending messages done	{"took": 0.008006982}
2023-09-04T09:12:02.478+0200	INFO	messagepool	messagepool/selection.go:500	create message chains done	{"took": 3.024239818}
2023-09-04T09:12:02.478+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 3.043547436}
2023-09-04T09:12:35.175+0200	INFO	messagepool	messagepool/selection.go:500	create message chains done	{"took": 2.490213777}
2023-09-04T09:12:35.176+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 2.497071303}

There was no period with a large rise in the mpool over the weekend, so I have not been able to check whether that puts additional strain on MpoolSelect. Some additional extended logging from a time when MpoolSelect was taking a long time:

Message selection took 3.120149628s
Size of the mempool: 146
Number of selected messages: 58
0: ...etwp5yoq -> ...hubgx3qa, method 3844450837, gasFeecap 200398, gasPremium 200271, gasLimit 156920681, val 0 FIL
1: ...32ot3wyq -> ...01771508, method 5, gasFeecap 404676, gasPremium 200230, gasLimit 25347526, val 0 FIL
2: ...43je5tlq -> ...01879880, method 5, gasFeecap 403712, gasPremium 199748, gasLimit 24031477, val 0 FIL
3: ...333ktu7q -> ...r4gdwhrq, method 0, gasFeecap 149959, gasPremium 149833, gasLimit 7203546, val 5.179 FIL
4: ...rcup55ia -> ...02131855, method 7, gasFeecap 102336, gasPremium 101282, gasLimit 90881275, val 4.12648756862114328 FIL
5: ...wqkzsrnq -> ...02368321, method 7, gasFeecap 101914, gasPremium 100860, gasLimit 71647396, val 0 FIL
6: ...zj7dhpfa -> f042558, method 5, gasFeecap 101879, gasPremium 100825, gasLimit 62397726, val 0 FIL
7: ...5msf3ifa -> ...01930832, method 6, gasFeecap 1000000000, gasPremium 100752, gasLimit 25992698, val 0.048440428639671218 FIL
8: ...jwj5xm2q -> ...02002688, method 7, gasFeecap 5000000000, gasPremium 100705, gasLimit 85740167, val 1.994402441214354576 FIL
9: ...cgv4gkma -> ...02131855, method 6, gasFeecap 101691, gasPremium 100637, gasLimit 62259672, val 0.096880857279342438 FIL
10: ...6q3pfruq -> f0123931, method 6, gasFeecap 101674, gasPremium 100620, gasLimit 68025651, val 0 FIL
11: ...xza7xvha -> ...02220982, method 7, gasFeecap 101570, gasPremium 100516, gasLimit 73165271, val 0 FIL
12: ...4utuamna -> ...01864657, method 7, gasFeecap 101521, gasPremium 100467, gasLimit 89751577, val 0 FIL
13: ...lhuqxwzq -> ...02244924, method 7, gasFeecap 100921, gasPremium 99867, gasLimit 75424993, val 0 FIL
14: ...lhuqxwzq -> ...02244924, method 7, gasFeecap 102007, gasPremium 100953, gasLimit 85141758, val 0 FIL
15: ...iz6pyu7a -> ...01792960, method 5, gasFeecap 101457, gasPremium 100403, gasLimit 34695848, val 0 FIL
16: ...pltx3viq -> ...02243688, method 7, gasFeecap 100966, gasPremium 99912, gasLimit 100087551, val 0.167484374760680697 FIL
17: ...pltx3viq -> ...02243688, method 7, gasFeecap 101400, gasPremium 100346, gasLimit 109442637, val 0.167483318561486767 FIL
18: ...pltx3viq -> ...02243688, method 7, gasFeecap 101752, gasPremium 100698, gasLimit 118494166, val 0.167482865716475839 FIL
19: ...jtydtyaa -> ...01946104, method 6, gasFeecap 101287, gasPremium 100233, gasLimit 68248637, val 0 FIL
20: ...jtydtyaa -> ...01946104, method 6, gasFeecap 101456, gasPremium 100402, gasLimit 69039182, val 0 FIL
21: ...an3sk4bq -> ...02100383, method 7, gasFeecap 101357, gasPremium 100303, gasLimit 115235587, val 0 FIL
22: ...66wicpoq -> ...01923312, method 8, gasFeecap 101341, gasPremium 100287, gasLimit 1112405952, val 0 FIL
23: ...5hh36ssq -> ...02129988, method 6, gasFeecap 101340, gasPremium 100286, gasLimit 65722276, val 0.096880857279342438 FIL
24: ...agobn3ka -> ...02232088, method 6, gasFeecap 101330, gasPremium 100276, gasLimit 69947252, val 0 FIL
25: ...o4wvyyya -> ...01690631, method 6, gasFeecap 101313, gasPremium 100259, gasLimit 49303318, val 0.048440428639671218 FIL
26: ...r523sd5q -> ...01392893, method 5, gasFeecap 101303, gasPremium 100249, gasLimit 50202805, val 0 FIL
27: ...jwj5xm2q -> ...02002688, method 7, gasFeecap 5000000000, gasPremium 100203, gasLimit 95751733, val 1.994402441214354576 FIL
28: ...pltx3viq -> ...02243688, method 7, gasFeecap 101256, gasPremium 100202, gasLimit 131386213, val 0.167483734516277576 FIL
29: ...ccgeet2a -> ...01918123, method 5, gasFeecap 101244, gasPremium 100190, gasLimit 27261888, val 0 FIL
30: ...aboc7kia -> f081990, method 7, gasFeecap 101236, gasPremium 100182, gasLimit 76958696, val 2.060773987800137493 FIL
31: ...kz4lp25a -> ...02320312, method 7, gasFeecap 101216, gasPremium 100162, gasLimit 76424398, val 2.054144097607318953 FIL
32: ...b5d2jlaq -> ...01042229, method 5, gasFeecap 101189, gasPremium 100135, gasLimit 29423485, val 0 FIL
33: ...ubrmjngq -> ...02146033, method 6, gasFeecap 101132, gasPremium 100078, gasLimit 60629287, val 0.048440428639671218 FIL
34: ...4utuamna -> ...01864657, method 7, gasFeecap 101128, gasPremium 100074, gasLimit 102445287, val 0 FIL
35: ...zhjkoz6q -> ...02221110, method 7, gasFeecap 101098, gasPremium 100044, gasLimit 77601521, val 0 FIL
36: ...is32oilq -> ...01052311, method 5, gasFeecap 101088, gasPremium 100034, gasLimit 33304848, val 0 FIL
37: ...akfdhbbq -> ...01916712, method 7, gasFeecap 699417727, gasPremium 100000, gasLimit 100083251, val 0.334967393219880372 FIL
38: ...akfdhbbq -> ...01916712, method 7, gasFeecap 629384500, gasPremium 100000, gasLimit 111219771, val 0.33496628215784652 FIL
39: ...akfdhbbq -> ...01916712, method 7, gasFeecap 584444479, gasPremium 100000, gasLimit 119771856, val 0.334967258608749519 FIL
40: ...akfdhbbq -> ...01916712, method 7, gasFeecap 544044248, gasPremium 100000, gasLimit 128666005, val 0.334967469032555155 FIL
41: ...66wicpoq -> ...01923312, method 6, gasFeecap 510576880, gasPremium 100000, gasLimit 137099823, val 0.096880857279342438 FIL
42: ...kvhqb5aq -> ...01916712, method 6, gasFeecap 1000000000, gasPremium 100000, gasLimit 58366393, val 0.096880857279342438 FIL
43: ...axfztafa -> ...02221111, method 6, gasFeecap 101011, gasPremium 99957, gasLimit 59714987, val 0 FIL
44: ...an3sk4bq -> ...02100383, method 6, gasFeecap 100974, gasPremium 99920, gasLimit 75149367, val 0 FIL
45: ...o4wef4wa -> f020522, method 27, gasFeecap 5000000000, gasPremium 99887, gasLimit 382742246, val 1.75605074815646274 FIL
46: ...63wmoxtq -> ...01937995, method 5, gasFeecap 100925, gasPremium 99871, gasLimit 27862196, val 0 FIL
47: ...mbub5xyq -> ...01164500, method 5, gasFeecap 100882, gasPremium 99828, gasLimit 26862781, val 0 FIL
48: ...2krij7ya -> ...02123612, method 5, gasFeecap 100880, gasPremium 99826, gasLimit 26670960, val 0 FIL
49: ...xfmme2ha -> f05, method 4, gasFeecap 101081, gasPremium 99996, gasLimit 1389799456, val 0 FIL
50: ...xfmme2ha -> f05, method 4, gasFeecap 100736, gasPremium 99651, gasLimit 1413394368, val 0 FIL
51: ...xfmme2ha -> f05, method 4, gasFeecap 100861, gasPremium 99776, gasLimit 1396430872, val 0 FIL
52: ...thhme5gq -> ...02095132, method 7, gasFeecap 100789, gasPremium 99735, gasLimit 72711876, val 0 FIL
53: ...k3nogthq -> ...02232007, method 6, gasFeecap 100693, gasPremium 99639, gasLimit 72806061, val 0 FIL
54: ...fwnheqkq -> ...01122339, method 6, gasFeecap 100687, gasPremium 99633, gasLimit 87453331, val 0.096880857279342438 FIL
55: ...pltx3viq -> ...02243688, method 6, gasFeecap 100419, gasPremium 99365, gasLimit 89329876, val 0.048440428639671218 FIL
56: ...r7bysmqa -> ...02246008, method 7, gasFeecap 99940, gasPremium 98886, gasLimit 105869866, val 4.082373699450743686 FIL
57: ...wx56zjbq -> ...01315130, method 27, gasFeecap 99561, gasPremium 98507, gasLimit 423538280, val 0 FIL
selected messages:  58
total gas limit of selected messages: 9983487603 / 10000000000 (99.83%)

And a second one:

Message selection took 2.532508589s
Size of the mempool: 111
Number of selected messages: 16
0: ...ufbajana -> ...02519843, method 7, gasFeecap 102075, gasPremium 101021, gasLimit 91130281, val 2.040006495176975313 FIL
1: ...xkxe6u7a -> ...02194766, method 5, gasFeecap 101952, gasPremium 100898, gasLimit 24305667, val 0 FIL
2: ...ibycstqa -> f094374, method 5, gasFeecap 101890, gasPremium 100836, gasLimit 27657436, val 0 FIL
3: ...n4chxmva -> ...01854600, method 5, gasFeecap 101700, gasPremium 100646, gasLimit 31042050, val 0 FIL
4: ...5hh36ssq -> ...02129988, method 7, gasFeecap 101657, gasPremium 100603, gasLimit 73478892, val 4.093815386397165778 FIL
5: ...5tuognra -> f0143110, method 5, gasFeecap 101613, gasPremium 100559, gasLimit 28640148, val 0 FIL
6: ...qx62fqwq -> f022352, method 28, gasFeecap 101600, gasPremium 100546, gasLimit 42551462, val 0.052959678224181834 FIL
7: ...rmlxxfha -> ...01270096, method 7, gasFeecap 101560, gasPremium 100506, gasLimit 72970462, val 0 FIL
8: ...fkgkcmga -> ...01789666, method 5, gasFeecap 101536, gasPremium 100482, gasLimit 49208912, val 0 FIL
9: ...nv2unb2a -> ...01357002, method 8, gasFeecap 101591, gasPremium 100463, gasLimit 8681372832, val 0 FIL
10: ...ppf6jh4a -> f053446, method 5, gasFeecap 101515, gasPremium 100461, gasLimit 28666653, val 0 FIL
11: ...cwt2yd6a -> ...01084913, method 6, gasFeecap 101507, gasPremium 100453, gasLimit 169688326, val 0 FIL
12: ...74cq24aq -> ...01224705, method 5, gasFeecap 101435, gasPremium 100381, gasLimit 28524601, val 0 FIL
13: ...qaqrohsa -> f05, method 4, gasFeecap 101427, gasPremium 100373, gasLimit 488394123, val 0 FIL
14: ...zxmdv7zq -> ...02290991, method 6, gasFeecap 101407, gasPremium 100353, gasLimit 74409952, val 0 FIL
15: ...mw4kiibq -> ...02239698, method 7, gasFeecap 101375, gasPremium 100321, gasLimit 71899030, val 0 FIL
selected messages:  16
total gas limit of selected messages: 9983940827 / 10000000000 (99.84%)


rjan90 commented Sep 4, 2023

Upon further time-logging of createMessageChains, it seems that GetActorAfter sometimes takes a suspiciously long time:

2023-09-04T13:14:12.940+0200	DEBUG	messagepool	messagepool/selection.go:780	Time taken to collect all messages: 1.273µs
2023-09-04T13:14:12.940+0200	DEBUG	messagepool	messagepool/selection.go:787	Time taken to sort by nonce: 1.626µs
2023-09-04T13:14:15.329+0200	DEBUG	messagepool	messagepool/selection.go:802	Time taken to load actor state: 2.388853921s
--------
2023-09-04T13:14:15.356+0200	DEBUG	messagepool	messagepool/selection.go:950	Time taken to link dependent chains: 89ns
2023-09-04T13:14:15.356+0200	INFO	messagepool	messagepool/selection.go:500	create message chains done	{"took": 2.415713845}
2023-09-04T13:14:15.356+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 2.427738929} 


rjan90 commented Sep 5, 2023


  • The chain just had a period with a smaller but fast rise in the number of messages in the mpool (in the 1,000–1,500 message range). What is obvious from the logs is that the increase in mpool size also increases the average MpoolSelect time. In this case, 1,000–1,500 messages in the pool moved the average MpoolSelect towards 300–500ms, while spikes still happen on occasion.
2023-09-05T11:43:59.526+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.391530645}
2023-09-05T11:44:30.144+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.401559869}
2023-09-05T11:45:00.792+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.426182197}
2023-09-05T11:45:34.677+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 3.674303277}
2023-09-05T11:46:05.753+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.386374766}
2023-09-05T11:46:36.319+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.37554891}
2023-09-05T11:47:07.590+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.42030188}
2023-09-05T11:47:38.501+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.361041131}
2023-09-05T11:48:09.025+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.338459937}
2023-09-05T11:48:39.681+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.457577846}
2023-09-05T11:49:10.232+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.363475516}
2023-09-05T11:49:40.797+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 0.378885376}


rjan90 commented Sep 5, 2023

During yesterday's standup, we discussed that it would be good to run the node with the refinements to the locking mechanism in #10865, together with the additional metrics, to see if the PR helped with the spikes in MpoolSelect.

  • With the refinements to the locking mechanisms in the mpool, large spikes in MpoolSelect time were still observed, similar to when running without the refinements.

When running the node with additional time logging around GetActorAfter (0cf99f0), one can observe that when the MpoolSelect timing spikes, it is because the TipSetState call in the GetActorAfter function:

func (mpp *mpoolProvider) GetActorAfter(addr address.Address, ts *types.TipSet) (*types.Actor, error) {
	if mpp.IsLite() {
		return mpp.getActorLite(addr, ts)
	}
	stcid, _, err := mpp.sm.TipSetState(context.TODO(), ts)
	if err != nil {
		return nil, xerrors.Errorf("computing tipset state for GetActor: %w", err)
	}
	st, err := mpp.sm.StateTree(stcid)
	if err != nil {
		return nil, xerrors.Errorf("failed to load state tree: %w", err)
	}
	return st.GetActor(addr)
}
is waiting for ApplyBlocks to finish.

Example:

2023-09-05T12:16:31.775+0200	INFO	consensus-common	consensus/compute_state.go:305	ApplyBlocks stats	{"early": 0.000003686, "earlyCronGas": 0, "vmMsg": 0.567750166, "msgGas": 7640400229, "vmCron": 1.133696072, "cronGas": 40755536714, "vmFlush": 0.182164105, "epoch": "3186753", "tsk": "{bafy2bzacecrv6bqqo2fyd2n76rceqtm67ialwttuai5x4mopjrmxelfriwczi}"}
2023-09-05T12:16:31.775+0200	DEBUG	messagepool	messagepool/provider.go:119	TipSetState took 1.884736422s
2023-09-05T12:16:31.775+0200	DEBUG	messagepool	messagepool/provider.go:126	StateTree took 97.39µs
2023-09-05T12:16:31.776+0200	DEBUG	messagepool	messagepool/provider.go:130	GetActor took 796.957µs
2023-09-05T12:16:31.776+0200	DEBUG	messagepool	messagepool/provider.go:119	TipSetState took 9.508µs
2023-09-05T12:16:31.833+0200	INFO	messagepool	messagepool/selection.go:500	create message chains done	{"took": 1.942789728}
2023-09-05T12:16:31.834+0200	INFO	messagepool	messagepool/selection.go:480	message selection done	{"took": 1.9458595}
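The per-step DEBUG timings shown above could be produced by a small wrapper of this shape. This is an illustrative sketch, not the actual instrumentation added in 0cf99f0; the sleep merely models TipSetState blocking while ApplyBlocks computes the tipset state:

```go
package main

import (
	"fmt"
	"time"
)

// timed runs a step and logs how long it took, in the style of the
// DEBUG lines above ("TipSetState took …", "StateTree took …").
func timed(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	fmt.Printf("%s took %s\n", name, time.Since(start))
	return err
}

func main() {
	// Stand-ins for the three steps of GetActorAfter.
	_ = timed("TipSetState", func() error {
		time.Sleep(10 * time.Millisecond) // models waiting on ApplyBlocks
		return nil
	})
	_ = timed("StateTree", func() error { return nil })
	_ = timed("GetActor", func() error { return nil })
}
```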


rjan90 commented Sep 19, 2023

Closing this issue, now that the initial monitoring phase has been completed:

  • Additional metrics in lotus-shed mpool miner-select-messages have been merged.
  • Found that mpool select on occasion takes a lot longer than expected.

Based on these findings, a follow-up issue has been added here: #11251, and work on optimizations has started.

@rjan90 rjan90 closed this as completed Sep 19, 2023