
Blocking messages in mpool might cause lagging synchronization #10518

Open
6 of 11 tasks
mostcute opened this issue Mar 20, 2023 · 4 comments
Labels
kind/bug Kind: Bug

Comments

@mostcute

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.20.4+mainnet+git.977a53d91.dirty+api1.5.0
Local: lotus version 1.20.4+mainnet+git.977a53d91.dirty

I have made only small code changes on my side.

Repro Steps

  1. Run '...'
  2. Do '...'
  3. See error '...'
    ...

Describe the Bug

Since the last network upgrade, I have run into some sync lagging issues.
I am using an older version of the datastore, and it takes up about 7 TB of space.
The daemon node suddenly could not keep up with the chain. I thought this was a problem of the database being too large, so I was ready to resync from a snapshot import.

At the same time I took another look at the splitstore feature, which I used to think was immature. After all this time it should be ready to use, and since it maintains a smaller hotstore, I expected two advantages:

  1. Smaller datasets can provide better read and write performance, perhaps helping with chain synchronization
  2. Regular pruning keeps storage usage lower and does not require frequent maintenance

So I turned on the splitstore feature with the following configuration

[Chainstore]
  # type: bool
  # env var: LOTUS_CHAINSTORE_ENABLESPLITSTORE
  EnableSplitstore = true

[Chainstore.Splitstore]
  ColdStoreType = "discard"
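
As a quick sanity check (just my assumption about the default repo layout under ~/.lotus; adjust if LOTUS_PATH points elsewhere), the hotstore should show up as a separate datastore directory that grows while the old monolithic chain store stops growing:

# rough comparison of the old chain store vs the new splitstore hotstore
du -sh ~/.lotus/datastore/chain ~/.lotus/datastore/splitstore 2>/dev/null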

After restarting, sync seemed to get better, but the node still lags behind from time to time.

I searched for a lot of solutions and ended up adding the following settings:

sysctl -w net.core.rmem_max=2500000
export LOTUS_CHAIN_BADGERSTORE_DISABLE_FSYNC=1
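
To make these survive a reboot (a sketch on my side, assuming a systemd-style setup; exact paths and how the daemon's environment is set will differ per machine):

# persist the UDP receive buffer bump (the file name is just an example)
echo 'net.core.rmem_max=2500000' | sudo tee /etc/sysctl.d/99-lotus.conf
sudo sysctl --system
# the fsync toggle has to be in the environment of whatever starts the lotus daemon,
# e.g. exported in that shell or set in the service unit
export LOTUS_CHAIN_BADGERSTORE_DISABLE_FSYNC=1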

At the same time, I also found that when some messages in the mpool are blocked, the synchronization lag problem is exacerbated. I suspect that the large number of messages stuck in the mpool due to the rising base fee also affects chain synchronization.

Therefore, I deliberately dedicated a single node to the growing miner, so that I could confirm that messages in the mpool also affect synchronization. As a result of my testing, nodes with local messages (about 50-500 msgs) stuck in the mempool experience more frequent and more severe sync lag issues.
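
The rough way I watch that local backlog (a sketch; the mpool subcommands exist in the lotus CLI, but the exact flags and output may differ between versions, so treat this as illustrative):

# per-address summary of pending messages from the local wallet
lotus mpool stat --local
# rough count of local pending messages (counts the Nonce field in the JSON output)
lotus mpool pending --local | grep -c '"Nonce"'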

At present, the chain synchronization lagging has been greatly reduced, until I hit the hotstore GC: it almost killed my miner because the daemon fell 30+ minutes behind during the GC.

[screenshot attached: image (1)]

@ZenGround0 provided an experimental patch (#10392); I will try it if the hotstore GC causes sync problems again.
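
Until then, one knob I am looking at (the option exists under [Chainstore.Splitstore] in the lotus config; whether it helps with this particular lag is only my assumption, and the value below is just the documented default):

[Chainstore.Splitstore]
  ColdStoreType = "discard"
  # how many compactions pass between full (moving) hotstore GCs;
  # a larger value should mean fewer of the long GC pauses
  HotStoreFullGCFrequency = 20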

@ZenGround0 also wants me to share my problem, so that someone who knows that part of the system better can try to discover the relationship between synchronization performance and mpool message blocking.

Logging Information

Tell me if you need more details.
@mostcute
Author

Let me add some details

I now have two nodes; the bigger one has a /chain dir of about 7 TB.
I started splitstore on it without reimporting from a snapshot. After one day, I realized that doing so might cause some performance issues, since I could still see the data under /chain being accessed with lsof -p.

But I still did not want to reimport, because it makes the service unavailable for hours.

I just restarted it after renaming/moving the chain dir to a chainold dir.

The other node is imported from a brand new snapshot.

Let me name the nodes nodeA and nodeB.
nodeA has 64 CPU cores and nodeB has 24.

I moved the miner (still growing power) to nodeB.

NodeB experiences more severe and more frequent synchronization lag problems (dozens of times per hour, falling behind by about 1-3 min, sometimes with 50-500 msgs in the mpool) even without hotstore GC.

With hotstore GC, nodeA is worse: it falls behind by up to 40 min. NodeB is significantly less affected, falling behind by about 2-6 min.

I saw the discussion from the team; I think the impact of "msgs in mpool" may come from the miner's calls to State.

By the way, I track sync lag the same way lotus-miner info does, and I push a message to my phone when the node falls behind; a rough sketch of that check is below.
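
Roughly like this (a sketch of the idea, not my exact script; the 90 s threshold and the notification step are placeholders):

# grab one block CID from the current head tipset and read its timestamp
HEAD_CID=$(lotus chain head | head -n1)
HEAD_TS=$(lotus chain getblock "$HEAD_CID" | grep -o '"Timestamp": *[0-9]*' | grep -o '[0-9]\+')
LAG=$(( $(date +%s) - HEAD_TS ))
echo "head is ${LAG}s behind wall clock"
if [ "$LAG" -gt 90 ]; then
    # placeholder: push a notification to my phone here
    echo "ALERT: daemon lagging by ${LAG}s"
fi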

@yangyueqiangyy

yangyueqiangyy commented Mar 22, 2023

I have the same problem.

@jennijuju
Member

@SBudo observed the same behaviour: when he pushed 223 commit messages to the node and the messages got stuck due to the base fee, the node's sync froze / slowed down.

@SBudo

SBudo commented Mar 23, 2023

Getting the logs, will attach them here shortly
daemon.log

The issue might not be related to this one, so I will open a separate one (they can be merged later if needed).
