
Blocking messages in mpool might cause lagging synchronization #10518

Open
6 of 11 tasks
mostcute opened this issue Mar 20, 2023 · 4 comments
Labels
kind/bug Kind: Bug

Comments

@mostcute

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.20.4+mainnet+git.977a53d91.dirty+api1.5.0
Local: lotus version 1.20.4+mainnet+git.977a53d91.dirty

I have made only small code changes on my side.

Repro Steps

  1. Run '...'
  2. Do '...'
  3. See error '...'
    ...

Describe the Bug

Since the last network upgrade, I have run into some sync lagging issues.
I am using an older version of the datastore, and it takes up about 7 TB of space.
The daemon node suddenly could not keep up with the chain. I thought this was a problem of the database being too large, so I was ready to resync from a snapshot import.

At the same time I took another look at the splitstore feature, which I used to think was immature. After all this time it should be ready to use, and since it maintains a smaller hotstore, I expected two advantages:

  1. Smaller datasets can provide better read and write performance, perhaps helping with chain synchronization
  2. Regular pruning keeps storage usage lower and does not require frequent maintenance

So I turned on the splitstore feature with the following configuration

[Chainstore]
  # type: bool
  # env var: LOTUS_CHAINSTORE_ENABLESPLITSTORE
  EnableSplitstore = true

[Chainstore.Splitstore]
  ColdStoreType = "discard"
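
As a quick sanity check (just my assumption about the default repo layout under ~/.lotus; adjust if LOTUS_PATH points elsewhere), the hotstore should show up as a separate datastore directory that grows while the old monolithic chain store stops growing:

# rough comparison of the old chain store vs the new splitstore hotstore
du -sh ~/.lotus/datastore/chain ~/.lotus/datastore/splitstore 2>/dev/null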

After restarting, sync seemed to get better, but the node still lags behind from time to time.

I searched for a lot of solutions and ended up adding the following settings:

sysctl -w net.core.rmem_max=2500000
export LOTUS_CHAIN_BADGERSTORE_DISABLE_FSYNC=1
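
To make these survive a reboot (a sketch on my side, assuming a systemd-style setup; exact paths and how the daemon's environment is set will differ per machine):

# persist the UDP receive buffer bump (the file name is just an example)
echo 'net.core.rmem_max=2500000' | sudo tee /etc/sysctl.d/99-lotus.conf
sudo sysctl --system
# the fsync toggle has to be in the environment of whatever starts the lotus daemon,
# e.g. exported in that shell or set in the service unit
export LOTUS_CHAIN_BADGERSTORE_DISABLE_FSYNC=1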

At the same time, I also found that when some messages in the mpool are blocked, the synchronization lag problem is exacerbated. I suspect that the large number of messages stuck in the mpool due to the rising base fee also affects chain synchronization.

Therefore, I deliberately dedicated a single node to the growing miner, so that I could confirm that messages in the mpool also affect synchronization. As a result of my testing, nodes with local messages (about 50-500 msgs) stuck in the mempool experience more frequent and more severe sync lag issues.
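
The rough way I watch that local backlog (a sketch; the mpool subcommands exist in the lotus CLI, but the exact flags and output may differ between versions, so treat this as illustrative):

# per-address summary of pending messages from the local wallet
lotus mpool stat --local
# rough count of local pending messages (counts the Nonce field in the JSON output)
lotus mpool pending --local | grep -c '"Nonce"'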

At present, the chain synchronization lagging has been greatly reduced, until I hit the hotstore GC: it almost killed my miner because the daemon fell 30+ minutes behind during the GC.

[screenshot attached: image (1)]

@ZenGround0 provided an experimental patch (#10392); I will try it if the hotstore GC causes sync problems again.
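
Until then, one knob I am looking at (the option exists under [Chainstore.Splitstore] in the lotus config; whether it helps with this particular lag is only my assumption, and the value below is just the documented default):

[Chainstore.Splitstore]
  ColdStoreType = "discard"
  # how many compactions pass between full (moving) hotstore GCs;
  # a larger value should mean fewer of the long GC pauses
  HotStoreFullGCFrequency = 20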

@ZenGround0 also wants me to share my problem, so that someone who knows that part of the system better can try to discover the relationship between synchronization performance and mpool message blocking.

Logging Information

Tell me if you need more details.
@mostcute
Author

Let me add some details

I now have two nodes; the bigger one has a /chain dir of about 7 TB.
I started splitstore on it without reimporting from a snapshot. After one day, I realized that doing so might cause some performance issues, since I could still see the data under /chain being accessed with lsof -p.

But I still did not want to reimport, because it makes the service unavailable for hours.

I just restarted it after renaming/moving the chain dir to a chainold dir.

The other node is imported from a brand new snapshot.

Let me name the nodes nodeA and nodeB.
nodeA has 64 CPU cores and nodeB has 24.

I moved the miner (still growing power) to nodeB.

NodeB experiences more severe and more frequent synchronization lag problems (dozens of times per hour, falling behind by about 1-3 min, sometimes with 50-500 msgs in the mpool) even without hotstore GC.

With hotstore GC, nodeA is worse: it falls behind by up to 40 min. NodeB is significantly less affected, falling behind by about 2-6 min.

I saw the discussion from the team; I think the impact of "msgs in mpool" may come from the miner's calls to State.

By the way, I track sync lag the same way lotus-miner info does, and I push a message to my phone when the node falls behind; a rough sketch of that check is below.
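
Roughly like this (a sketch of the idea, not my exact script; the 90 s threshold and the notification step are placeholders):

# grab one block CID from the current head tipset and read its timestamp
HEAD_CID=$(lotus chain head | head -n1)
HEAD_TS=$(lotus chain getblock "$HEAD_CID" | grep -o '"Timestamp": *[0-9]*' | grep -o '[0-9]\+')
LAG=$(( $(date +%s) - HEAD_TS ))
echo "head is ${LAG}s behind wall clock"
if [ "$LAG" -gt 90 ]; then
    # placeholder: push a notification to my phone here
    echo "ALERT: daemon lagging by ${LAG}s"
fi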

@yangyueqiangyy

yangyueqiangyy commented Mar 22, 2023

I have the same problem.

@jennijuju
Member

@SBudo observed the same behaviour: when he pushed 223 commit messages to the node and the messages got stuck due to the base fee, the node's sync froze / slowed down.

@SBudo

SBudo commented Mar 23, 2023

Getting the logs, will attach them here shortly
daemon.log

The issue might not be related to this one, so I will open a separate one (they can be merged later if needed).
