
Frequent small reorgs caused by heimdall <-> bor interaction #1118

Closed

CitizenSearcher opened this issue Dec 30, 2023 · 23 comments

CitizenSearcher commented Dec 30, 2023

System information

Bor client version: v1.1.0

Heimdall client version: v1.0.3

OS & Version: Linux

Environment: Polygon Mainnet

Type of node: Full

Additional Information: Client modified to add telemetry to block imports.

Overview of the problem

I noticed that my client was experiencing many small reorgs (1-2 blocks), so I decided to dig in and diagnose the problem. I run several nodes and see this problem on all of them. My setup runs a local heimdall client, communicating with bor via the HTTP client.

I narrowed down the issue to the communication between bor and heimdall, specifically in the FetchMilestoneID method.

The problem is as follows:

Concurrently, the GetVoteOnHash function is called by heimdall at an arbitrary interval. This function calls down and obtains a full lock on the downloader's whitelist, and the FetchMilestoneID function is then called on the loaded HeimdallClient.

isLocked := downloader.LockMutex(endBlockNr)

m.finality.Lock()

The issue is that when using either the gRPC or HTTP client, the FetchMilestoneID function goes into a retry loop and blocks until a request succeeds. This keeps the whitelist locked for the duration of the retries, stopping insertChain from obtaining its read lock and thereby stopping any blocks from being inserted.

func (bc *BlockChain) WriteBlockAndSetHead(ctx context.Context, block *types.Block, receipts []*types.Receipt, logs []*types.Log, state *state.StateDB, emitHeadEvent bool) (status WriteStatus, err error) {
	if !bc.chainmu.TryLock() {

_, err = w.chain.WriteBlockAndSetHead(ctx, block, receipts, logs, task.state, true)
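
To make the failure mode concrete, here is a minimal, runnable Go sketch of the pattern described above. All names (milestoneService, flakyClient, and so on) are simplified stand-ins of my choosing, not the actual bor source: a write lock is held across a blocking retry loop, so a reader on the import path stalls for the full retry delay.

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// flakyClient mimics the observed behaviour: the first request fails
// with a 500 and the second succeeds.
type flakyClient struct{ calls int }

func (c *flakyClient) FetchMilestoneID(id string) error {
	c.calls++
	if c.calls == 1 {
		return errors.New("error while fetching data from Heimdall: response code 500")
	}
	return nil
}

type milestoneService struct {
	finality sync.RWMutex
	client   *flakyClient
}

// GetVoteOnHash holds the write lock across a blocking retry loop,
// which is the pattern this issue describes.
func (s *milestoneService) GetVoteOnHash(id string) {
	s.finality.Lock()
	defer s.finality.Unlock()
	for s.client.FetchMilestoneID(id) != nil {
		time.Sleep(5 * time.Second) // retry delay; readers stay blocked
	}
}

// IsValidChain sits on the insertChain path and needs only a read lock,
// but it must wait for the writer above to release.
func (s *milestoneService) IsValidChain() bool {
	s.finality.RLock()
	defer s.finality.RUnlock()
	return true
}

func main() {
	s := &milestoneService{client: &flakyClient{}}
	go s.GetVoteOnHash("4fb6bfe2-eec4-4366-9f18-ef272ee4c338")
	time.Sleep(100 * time.Millisecond) // let the writer take the lock first
	start := time.Now()
	s.IsValidChain()
	fmt.Printf("block import stalled for %s\n", time.Since(start))
}

Run as-is, this prints a stall of roughly five seconds, matching the gap between ImportTime and MilestoneLockAcqTime in the telemetry logs below.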

Using the heimdall HTTP client, the first request will almost always fail, with the following error:
WARN [12-29|04:05:03.529] an error while trying fetching from Heimdall path="/milestone/ID/4fb6bfe2-eec4-4366-9f18-ef272ee4c338 - 0x4951435518fa374ac10951fee3aeffa1d04aded8" attempt=1 error="error while fetching data from Heimdall: response code 500
but the second will succeed. With the retry delay at 5 seconds, this causes a 5-second import delay on blocks, or a 5-second production delay for validators. When this occurs it can cause a client-side reorg, as it's enough time for a backup's block to have been propagated; a race then occurs between the backup's and primary validator's blocks (see logs below).
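
For contrast, here is a sketch of one possible mitigation, extending the types from the sketch above (illustrative only; this is not the fix the maintainers shipped, and FetchMilestoneIDCtx is a hypothetical context-aware variant of the fetch): give the network round trip a hard timeout and perform it before taking the lock, so importers are never gated on a heimdall call.

// Hypothetical mitigation sketch; requires "context" in the imports.
// FetchMilestoneIDCtx does not exist in bor and stands in for a
// context-aware fetch method.
func (s *milestoneService) getVoteOnHashBounded(ctx context.Context, id string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // hard upper bound
	defer cancel()

	// Do the round trip with no whitelist lock held; fail fast on timeout.
	if err := s.client.FetchMilestoneIDCtx(ctx, id); err != nil {
		return err
	}

	s.finality.Lock() // lock only for the in-memory whitelist update
	defer s.finality.Unlock()
	// ... record the vote / update whitelist state here ...
	return nil
}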

Reproduction Steps

Run bor and heimdall as separate processes communicating via HTTP. Watching the normal bor logs, you should notice several small reorgs (1-2 blocks) per hour.

Here are some telemetry logs I added for block imports. Here is an example of a client-side reorg:

{
  "level": "info",
  "timestamp": 1703870667,
  "caller": "cmd/logBlocks.go:29",
  "msg": "New head",
  "hash": "0x36694a3442b3bd7d6145de297ebafe53f03e60c6d6412f9690d332ca7cc31025",
  "blockNumber": 51706222,
  "author": "0xeEDBa2484aAF940f37cd3CD21a5D7C4A7DAfbfC0",
  "telemetry": {
    "InsertChainLen": "1",
    "ReceivedTime": "2023-12-29T12:24:27.09947154-05:00",
    "EnqueuedTime": "2023-12-29T12:24:27.100187185-05:00",
    "InjectRecvTime": "2023-12-29T12:24:27.100192861-05:00",
    "FirstQueuedTime": "2023-12-29T12:24:27.100195255-05:00",
    "LastQueuedTime": "0001-01-01T00:00:00Z",
    "FirstDequeuedTime": "2023-12-29T12:24:27.100197855-05:00",
    "LastDequeuedTime": "2023-12-29T12:24:27.100197978-05:00",
    "NumQueues": "1",
    "ImportTime": "2023-12-29T12:24:27.100315563-05:00",
    "VerifyHeaderTime": "2023-12-29T12:24:27.100321931-05:00",
    "InserterTime": "2023-12-29T12:24:27.10036528-05:00",
    "InsertChainTime": "2023-12-29T12:24:27.100366294-05:00",
    "InsertChainInnerTime": "2023-12-29T12:24:27.100367848-05:00",
    "ValidateReorgTime": "2023-12-29T12:24:27.100802394-05:00",
    "ServiceIsValidChainTime": "2023-12-29T12:24:27.100803294-05:00",
    "CheckpointServiceTime": "2023-12-29T12:24:27.100803724-05:00",
    "CheckpointLockAcqTime": "2023-12-29T12:24:27.100803972-05:00",
    "MilestoneServiceTime": "2023-12-29T12:24:27.100805386-05:00",
    "MilestoneLockAcqTime": "2023-12-29T12:24:31.869313386-05:00",
    "MilestoneIsAllowedTime": "2023-12-29T12:24:31.869314849-05:00",
    "MilestoneIsFutureTime": "2023-12-29T12:24:31.869314953-05:00",
    "ValidateStateTime": "2023-12-29T12:24:31.908511631-05:00",
    "LeftTrimTime": "2023-12-29T12:24:31.869316652-05:00",
    "ProcessBlockTime": "2023-12-29T12:24:31.869366778-05:00",
    "WriteBlockTime": "2023-12-29T12:24:31.912158406-05:00",
    "WriteBlockWithStateTime": "2023-12-29T12:24:31.912159553-05:00",
    "ReorgNeededTime": "2023-12-29T12:24:31.943953991-05:00",
    "ReorgTime": "0001-01-01T00:00:00Z",
    "WriteHeadBlockTime": "2023-12-29T12:24:31.943998547-05:00",
    "ChainFeedTime": "2023-12-29T12:24:31.944797285-05:00"
  }
}
{
  "level": "info",
  "timestamp": 1703870671,
  "caller": "cmd/logBlocks.go:29",
  "msg": "New head",
  "hash": "0xae2d28e511a229aaf1250ae0bbb8ffbe631b2ca0d4dcddd95671e2acd554c1f5",
  "blockNumber": 51706223,
  "author": "0xF0245F6251Bef9447A08766b9DA2B07b28aD80B0",
  "telemetry": {
    "InsertChainLen": "1",
    "ReceivedTime": "2023-12-29T12:24:31.711396918-05:00",
    "EnqueuedTime": "2023-12-29T12:24:31.711955247-05:00",
    "InjectRecvTime": "2023-12-29T12:24:31.711964272-05:00",
    "FirstQueuedTime": "2023-12-29T12:24:31.711967314-05:00",
    "LastQueuedTime": "2023-12-29T12:24:31.943700431-05:00",
    "FirstDequeuedTime": "2023-12-29T12:24:31.716341666-05:00",
    "LastDequeuedTime": "2023-12-29T12:24:31.944918357-05:00",
    "NumQueues": "1006",
    "ImportTime": "2023-12-29T12:24:31.945015569-05:00",
    "VerifyHeaderTime": "2023-12-29T12:24:31.94540054-05:00",
    "InserterTime": "2023-12-29T12:24:31.946034297-05:00",
    "InsertChainTime": "2023-12-29T12:24:31.946035543-05:00",
    "InsertChainInnerTime": "2023-12-29T12:24:31.946036707-05:00",
    "ValidateReorgTime": "2023-12-29T12:24:31.947188323-05:00",
    "ServiceIsValidChainTime": "2023-12-29T12:24:31.947188851-05:00",
    "CheckpointServiceTime": "2023-12-29T12:24:31.947189217-05:00",
    "CheckpointLockAcqTime": "2023-12-29T12:24:31.947189408-05:00",
    "MilestoneServiceTime": "2023-12-29T12:24:31.947190228-05:00",
    "MilestoneLockAcqTime": "2023-12-29T12:24:31.947190416-05:00",
    "MilestoneIsAllowedTime": "2023-12-29T12:24:31.947190541-05:00",
    "MilestoneIsFutureTime": "2023-12-29T12:24:31.947190703-05:00",
    "ValidateStateTime": "2023-12-29T12:24:31.974696108-05:00",
    "LeftTrimTime": "2023-12-29T12:24:31.947191161-05:00",
    "ProcessBlockTime": "2023-12-29T12:24:31.947210556-05:00",
    "WriteBlockTime": "2023-12-29T12:24:31.976563424-05:00",
    "WriteBlockWithStateTime": "2023-12-29T12:24:31.976564255-05:00",
    "ReorgNeededTime": "2023-12-29T12:24:32.007908041-05:00",
    "ReorgTime": "0001-01-01T00:00:00Z",
    "WriteHeadBlockTime": "2023-12-29T12:24:32.007943268-05:00",
    "ChainFeedTime": "2023-12-29T12:24:32.008521583-05:00"
  }
}
{
  "level": "info",
  "timestamp": 1703870669,
  "caller": "cmd/logBlocks.go:29",
  "msg": "New head",
  "hash": "0xd49f8c1252fbf9da3cb891c3d51b1cb797871df9fb59e1993dab33cdf410de9e",
  "blockNumber": 51706223,
  "author": "0xeEDBa2484aAF940f37cd3CD21a5D7C4A7DAfbfC0",
  "telemetry": {
    "InsertChainLen": "1",
    "ReceivedTime": "2023-12-29T12:24:29.092560851-05:00",
    "EnqueuedTime": "2023-12-29T12:24:29.093314855-05:00",
    "InjectRecvTime": "2023-12-29T12:24:29.093323524-05:00",
    "FirstQueuedTime": "2023-12-29T12:24:29.093327191-05:00",
    "LastQueuedTime": "2023-12-29T12:24:31.94411027-05:00",
    "FirstDequeuedTime": "2023-12-29T12:24:29.09333105-05:00",
    "LastDequeuedTime": "2023-12-29T12:24:31.945017915-05:00",
    "NumQueues": "5952",
    "ImportTime": "2023-12-29T12:24:31.94507506-05:00",
    "VerifyHeaderTime": "2023-12-29T12:24:31.945572545-05:00",
    "InserterTime": "2023-12-29T12:24:31.946408067-05:00",
    "InsertChainTime": "2023-12-29T12:24:31.946408758-05:00",
    "InsertChainInnerTime": "2023-12-29T12:24:32.008647383-05:00",
    "ValidateReorgTime": "2023-12-29T12:24:32.009664241-05:00",
    "ServiceIsValidChainTime": "2023-12-29T12:24:32.00966504-05:00",
    "CheckpointServiceTime": "2023-12-29T12:24:32.009665319-05:00",
    "CheckpointLockAcqTime": "2023-12-29T12:24:32.009665502-05:00",
    "MilestoneServiceTime": "2023-12-29T12:24:32.0096665-05:00",
    "MilestoneLockAcqTime": "2023-12-29T12:24:32.009666769-05:00",
    "MilestoneIsAllowedTime": "2023-12-29T12:24:32.009666936-05:00",
    "MilestoneIsFutureTime": "2023-12-29T12:24:32.009667081-05:00",
    "ValidateStateTime": "2023-12-29T12:24:32.039850807-05:00",
    "LeftTrimTime": "2023-12-29T12:24:32.00966756-05:00",
    "ProcessBlockTime": "2023-12-29T12:24:32.0096878-05:00",
    "WriteBlockTime": "2023-12-29T12:24:32.041935052-05:00",
    "WriteBlockWithStateTime": "2023-12-29T12:24:32.041936502-05:00",
    "ReorgNeededTime": "2023-12-29T12:24:32.050803623-05:00",
    "ReorgTime": "2023-12-29T12:24:32.050865604-05:00",
    "WriteHeadBlockTime": "2023-12-29T12:24:32.053850313-05:00",
    "ChainFeedTime": "2023-12-29T12:24:32.054247845-05:00"
  }
}

Here we see block 51706222 from the primary proposer arrive at 2023-12-29T12:24:27.099. It is blocked for 4 seconds, and obtains the milestone read lock at 2023-12-29T12:24:31.869.

In the meantime, two competing blocks for 51706223 arrived from the primary and backup validators at 2023-12-29T12:24:29.092 and 2023-12-29T12:24:31.711 respectively. Even though the primary's block correctly arrived first, the lock delayed the prior block's import so long that both were in the import cycle at once. A race condition then occurred between the primary's and backup's blocks. In this case the backup's block won, but it was reorged almost immediately once the primary's import finished.

Additional Information

The client-side problem is eliminated by using the HeimdallApp in-process client, since the FetchMilestoneID function returns immediately. But even after removing the client-side problem this way, I still observe a concerning number of 1-2 block reorgs, occurring where a primary proposer's block is around 4-6 seconds late. Given the timing and frequency, I'd expect that many of these are caused by this problem.
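
For intuition on why the in-process client avoids the stall, here is a hypothetical sketch (not the actual HeimdallApp API) with the same FetchMilestoneID shape as the flakyClient in the earlier sketch: the check becomes a local map read, so any lock held around it is held for microseconds rather than seconds.

// Hypothetical in-process client; illustrative, not the HeimdallApp API.
type inProcessClient struct {
	mu         sync.RWMutex
	milestones map[string]struct{} // IDs mirrored from local heimdall state
}

func (c *inProcessClient) FetchMilestoneID(id string) error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if _, ok := c.milestones[id]; !ok {
		return errors.New("milestone ID doesn't exist in Heimdall")
	}
	return nil // immediate: no HTTP round trip, no 5-second retry loop
}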

CitizenSearcher (Author):

On the nodes that I configured to use the HeimdallApp in-process client, I experienced bad merkle root issues on the bor side at block 51737759.

VAIBHAVJINDAL3012 (Contributor) commented Jan 2, 2024

If you are not running a validator node, then your chain won't be locked; GetVoteOnHash won't be called on your client.
And GetVoteOnHash is always called on the Tip-16th block, so if a primary block follows the chain at the Tip-16th block number, there would be no delay in its import.
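
To picture the depth-16 scheme described above, a hypothetical helper (not actual bor/heimdall source):

const voteDepth = 16 // votes always target the block 16 behind the tip

// voteTarget returns the block number heimdall would ask bor to vote on.
func voteTarget(tip uint64) (uint64, bool) {
	if tip < voteDepth {
		return 0, false // chain too short to vote yet
	}
	return tip - voteDepth, true
}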

CitizenSearcher (Author) commented Jan 3, 2024

Ok, got it. I've edited the original post to reflect this.

The problem still exists on normal full nodes, as demonstrated by the logs. Sentries would suffer the same issue, so if a validator is running only a single sentry, propagation of its block would be halted for the duration of the call. From a network perspective, the result would be the same as the validator not producing blocks for this period.

CitizenSearcher (Author):

Hey @VAIBHAVJINDAL3012, just wanted to follow up on this, given that the issue happening on sentry nodes would have the same effect as delayed block production on validators.

VAIBHAVJINDAL3012 (Contributor):

Hi @CitizenSearcher,
Validators and sentries behave quite differently when we talk about the lock. In our system, only a validator's heimdall node calls bor for the vote and locks it for a few seconds; a sentry's or full node's heimdall doesn't do that, so a sentry won't get locked by this.

And even on a validator, the lock is acquired on the block at depth 16, so it can't cause 1-2 block reorgs and won't cause any delay in block production.

1-2 block reorgs do occur sometimes in our bor network due to P2P latency.
I hope I was able to answer your query.

CitizenSearcher (Author):

@VAIBHAVJINDAL3012 Thanks for the explanation. As I posted in the logs, this function was getting called on my normal full node and was causing client-side reorgs. If it's not supposed to be called on a full node, perhaps I had something misconfigured?

kaber2 commented Jan 11, 2024

Just a data point: while I'm also seeing far more reorgs than could be considered acceptable (many per hour), I tried your patch on one of my nodes and it made no difference for me.

VAIBHAVJINDAL3012 (Contributor) commented Jan 11, 2024

> @VAIBHAVJINDAL3012 Thanks for the explanation. As I posted in the logs, this function was getting called on my normal full node and was causing client-side reorgs. If it's not supposed to be called on a full node, perhaps I had something misconfigured?

Your heimdall shouldn't call this function. Can you send me your heimdall logs related to the milestone module?

VAIBHAVJINDAL3012 (Contributor):

> Just a data point: while I'm also seeing far more reorgs than could be considered acceptable (many per hour), I tried your patch on one of my nodes and it made no difference for me.

What is the average length of the reorgs you are facing?

kaber2 commented Jan 11, 2024

> Just a data point: while I'm also seeing far more reorgs than could be considered acceptable (many per hour), I tried your patch on one of my nodes and it made no difference for me.

> What is the average length of the reorgs you are facing?

Looking at the past 24 hours on a random one of my nodes, the vast majority are of depth 1, maybe 10% are depth 2, and 3% are depth 3-8. The total number is around 250.
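
For anyone wanting to reproduce this kind of tally, here is a small sketch that counts reorg depths from bor logs. It assumes bor inherits geth's "Chain reorg detected ... drop=N" log line; verify the message and field name against your own logs before relying on it.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches geth-style reorg lines, e.g. "Chain reorg detected ... drop=2".
var dropRe = regexp.MustCompile(`Chain reorg detected.*drop=(\d+)`)

func main() {
	depths := map[int]int{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1<<20), 1<<20) // allow long log lines
	for sc.Scan() {
		if m := dropRe.FindStringSubmatch(sc.Text()); m != nil {
			d, _ := strconv.Atoi(m[1])
			depths[d]++
		}
	}
	for d, n := range depths {
		fmt.Printf("depth %d: %d reorgs\n", d, n)
	}
}

Feed it logs on stdin, e.g. journalctl -u bor --since "24 hours ago" | go run tally.go.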

VAIBHAVJINDAL3012 (Contributor):

Can you please check the number of connected peers? It seems to me that you are facing this issue due to P2P latency.

kaber2 commented Jan 14, 2024

The node in question is currently connected to 668 peers. It's running on a 64-core EPYC 7702P (limited to 24 cores) with 512GB of memory and a RAID0 of Intel NVMe drives.

CitizenSearcher (Author):

I have the same observation as kaber2, which motivated the original post and investigation. There are several small reorgs per hour; the vast majority are 1-block reorgs, with occasional 2- or 3-block ones. My nodes also have good hardware and are well connected. Most of the reorgs I see are network-wide, occurring on all my nodes, which generally receive blocks within about 200ms of each other. The ones caused client-side by the race condition are rarer and more arbitrary, not appearing on all nodes.

With regard to milestone-related logs, here's a sample of what I've been seeing recently:

Jan 15 19:23:30 node heimdalld[1791752]: ERROR[2024-01-15|19:23:30.946] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:23:24.016930929Z bufferTime=4m16s
Jan 15 19:23:37 node heimdalld[1791752]: ERROR[2024-01-15|19:23:37.453] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:23:30.63069891Z bufferTime=4m16s
Jan 15 19:23:58 node heimdalld[1791752]: ERROR[2024-01-15|19:23:58.681] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:23:51.612074119Z bufferTime=4m16s
Jan 15 19:24:05 node heimdalld[1791752]: ERROR[2024-01-15|19:24:05.001] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:23:58.311327311Z bufferTime=4m16s
Jan 15 19:24:11 node heimdalld[1791752]: ERROR[2024-01-15|19:24:11.188] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:04.615182401Z bufferTime=4m16s
Jan 15 19:24:11 node heimdalld[1791752]: ERROR[2024-01-15|19:24:11.188] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:04.615182401Z bufferTime=4m16s
Jan 15 19:24:11 node heimdalld[1791752]: ERROR[2024-01-15|19:24:11.188] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:04.615182401Z bufferTime=4m16s
Jan 15 19:24:18 node heimdalld[1791752]: ERROR[2024-01-15|19:24:18.803] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:10.819151655Z bufferTime=4m16s
Jan 15 19:24:18 node heimdalld[1791752]: ERROR[2024-01-15|19:24:18.804] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:10.819151655Z bufferTime=4m16s
Jan 15 19:24:31 node heimdalld[1791752]: ERROR[2024-01-15|19:24:31.623] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:24.732185051Z bufferTime=4m16s
Jan 15 19:24:38 node heimdalld[1791752]: ERROR[2024-01-15|19:24:38.254] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:31.238306931Z bufferTime=4m16s
Jan 15 19:24:57 node heimdalld[1791752]: ERROR[2024-01-15|19:24:57.437] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:50.82679849Z bufferTime=4m16s
Jan 15 19:25:03 node heimdalld[1791752]: ERROR[2024-01-15|19:25:03.753] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:24:57.073691819Z bufferTime=4m16s
Jan 15 19:25:10 node heimdalld[1791752]: ERROR[2024-01-15|19:25:10.157] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:03.320538475Z bufferTime=4m16s
Jan 15 19:25:10 node heimdalld[1791752]: ERROR[2024-01-15|19:25:10.158] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:03.320538475Z bufferTime=4m16s
Jan 15 19:25:10 node heimdalld[1791752]: ERROR[2024-01-15|19:25:10.158] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:03.320538475Z bufferTime=4m16s
Jan 15 19:25:17 node heimdalld[1791752]: ERROR[2024-01-15|19:25:17.456] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:09.714158271Z bufferTime=4m16s
Jan 15 19:25:17 node heimdalld[1791752]: ERROR[2024-01-15|19:25:17.457] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:09.714158271Z bufferTime=4m16s
Jan 15 19:25:29 node heimdalld[1791752]: ERROR[2024-01-15|19:25:29.916] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:23.426314183Z bufferTime=4m16s
Jan 15 19:25:36 node heimdalld[1791752]: ERROR[2024-01-15|19:25:36.398] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:21:24Z currenttime=2024-01-15T19:25:29.532229514Z bufferTime=4m16s
Jan 15 19:26:15 node heimdalld[1791752]: ERROR[2024-01-15|19:26:15.472] Error validating milestone                   module=Milestone startBlock=52363110 endBlock=52363240 hash=0x9afef3f68852bcbe213fd330d98a4573d2d62b14ed952e3586c5efa7a8c4cbff milestoneId="e8efe3ef-9868-4332-835a-6fe127610373 - 0xd98a4573d2d62b14ed952e3586c5efa7a8c4cbff" error="Error in fetching vote from matic chainerrPost \"http://localhost:8545\": context deadline exceeded"
Jan 15 19:26:15 node heimdalld[1791752]: ERROR[2024-01-15|19:26:15.472] Hash is not valid                            module=Milestone startBlock=52363110 endBlock=52363240 hash=0x9afef3f68852bcbe213fd330d98a4573d2d62b14ed952e3586c5efa7a8c4cbff milestoneId="e8efe3ef-9868-4332-835a-6fe127610373 - 0xd98a4573d2d62b14ed952e3586c5efa7a8c4cbff"
Jan 15 19:26:23 node heimdalld[1791752]: ERROR[2024-01-15|19:26:23.071] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:26:16Z currenttime=2024-01-15T19:26:16.311740996Z bufferTime=4m16s
Jan 15 19:26:23 node heimdalld[1791752]: ERROR[2024-01-15|19:26:23.072] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:26:16Z currenttime=2024-01-15T19:26:16.311740996Z bufferTime=4m16s
Jan 15 19:26:29 node heimdalld[1791752]: ERROR[2024-01-15|19:26:29.696] Invalid Milestone Timeout msg                module=Milestone lastMilestoneTime=2024-01-15T19:26:16Z currenttime=2024-01-15T19:26:22.68177899Z bufferTime=4m16s
Jan 15 19:26:41 node heimdalld[1791752]: ERROR[2024-01-15|19:26:41.945] Error validating milestone                   module=Milestone startBlock=52363241 endBlock=52363259 hash=0xe53abc7a4d6f9b247b4bb933c1291b6e9085778b73607939fb46f95f07377dfa milestoneId="bd8ade0d-b76f-4017-b29c-9cc96b5855f9 - 0xc1291b6e9085778b73607939fb46f95f07377dfa" error="Error in fetching vote from matic chainerrPost \"http://localhost:8545\": context deadline exceeded"
Jan 15 19:26:41 node heimdalld[1791752]: ERROR[2024-01-15|19:26:41.945] Hash is not valid                            module=Milestone startBlock=52363241 endBlock=52363259 hash=0xe53abc7a4d6f9b247b4bb933c1291b6e9085778b73607939fb46f95f07377dfa milestoneId="bd8ade0d-b76f-4017-b29c-9cc96b5855f9 - 0xc1291b6e9085778b73607939fb46f95f07377dfa"
Jan 15 19:27:07 node heimdalld[1791752]: ERROR[2024-01-15|19:27:07.897] Error validating milestone                   module=Milestone startBlock=52363260 endBlock=52363272 hash=0xc6d85b189785e13397d9d7fb8d39d4723bb1f41abde6f440f6a14b4032c3f5be milestoneId="05d27e58-503a-4b02-84cd-cf34dbad939b - 0x8d39d4723bb1f41abde6f440f6a14b4032c3f5be" error="Error in fetching vote from matic chainerrPost \"http://localhost:8545\": context deadline exceeded"
Jan 15 19:27:07 node heimdalld[1791752]: ERROR[2024-01-15|19:27:07.897] Hash is not valid                            module=Milestone startBlock=52363260 endBlock=52363272 hash=0xc6d85b189785e13397d9d7fb8d39d4723bb1f41abde6f440f6a14b4032c3f5be milestoneId="05d27e58-503a-4b02-84cd-cf34dbad939b - 0x8d39d4723bb1f41abde6f440f6a14b4032c3f5be"
Jan 15 19:27:22 node heimdalld[1791752]: ERROR[2024-01-15|19:27:22.870] Milestone not in continuity                  module=Milestone lastMilestoneEndBlock=52363272 receivedMsgStartBlock=52363260

Additionally, here are the bor logs related to milestones:

Jan 15 19:27:49 node bash[1792230]: INFO [01-15|19:27:49.744] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/e348e6ec-2ef5-4f5b-8b2e-bf50fb1f1d79 - 0x20725cbb35a9257d18807e203fd36c5cddfa25b5" attempt=1
Jan 15 19:27:59 node bash[1792230]: INFO [01-15|19:27:59.482] Got new milestone from heimdall          start=52,363,260 end=52,363,272 hash=0xc6d85b189785e13397d9d7fb8d39d4723bb1f41abde6f440f6a14b4032c3f5be
Jan 15 19:28:11 node bash[1792230]: INFO [01-15|19:28:11.482] Got new milestone from heimdall          start=52,363,273 end=52,363,293 hash=0xd637d2a06ac5a74add23e41320725cbb35a9257d18807e203fd36c5cddfa25b5
Jan 15 19:28:23 node bash[1792230]: INFO [01-15|19:28:23.482] Got new milestone from heimdall          start=52,363,273 end=52,363,293 hash=0xd637d2a06ac5a74add23e41320725cbb35a9257d18807e203fd36c5cddfa25b5
Jan 15 19:28:29 node bash[1792230]: WARN [01-15|19:28:29.379] an error while trying fetching from Heimdall path="/milestone/ID/d3b7de28-3c89-45cc-8244-05c45ed67d07 - 0x8d068bc593527fc21bf63a788e263d788d33f21d" attempt=1 error="error while fetching data from Heimdall: response code 500"
Jan 15 19:28:29 node bash[1792230]: INFO [01-15|19:28:29.379] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/d3b7de28-3c89-45cc-8244-05c45ed67d07 - 0x8d068bc593527fc21bf63a788e263d788d33f21d" attempt=1
Jan 15 19:28:33 node bash[1792230]: WARN [01-15|19:28:33.481] an error while trying fetching from Heimdall path=/milestone/lastNoAck               attempt=1 error="Get \"http://localhost:1317/milestone/lastNoAck\": context deadline exceeded"
Jan 15 19:28:33 node bash[1792230]: INFO [01-15|19:28:33.482] Retrying again in 5 seconds to fetch data from Heimdall path=/milestone/lastNoAck               attempt=1
Jan 15 19:28:33 node bash[1792230]: ERROR[01-15|19:28:33.482] Failed to fetch latest no-ack milestone  err="context deadline exceeded"
Jan 15 19:28:33 node bash[1792230]: WARN [01-15|19:28:33.482] unable to handle no-ack-milestone service err="failed to fetch latest milestone"
Jan 15 19:28:35 node bash[1792230]: INFO [01-15|19:28:35.482] Got new milestone from heimdall          start=52,363,273 end=52,363,293 hash=0xd637d2a06ac5a74add23e41320725cbb35a9257d18807e203fd36c5cddfa25b5
Jan 15 19:28:47 node bash[1792230]: INFO [01-15|19:28:47.482] Got new milestone from heimdall          start=52,363,294 end=52,363,312 hash=0xb4edee2970973c1cce3a70818d068bc593527fc21bf63a788e263d788d33f21d
Jan 15 19:28:59 node bash[1792230]: INFO [01-15|19:28:59.482] Got new milestone from heimdall          start=52,363,294 end=52,363,312 hash=0xb4edee2970973c1cce3a70818d068bc593527fc21bf63a788e263d788d33f21d
Jan 15 19:29:11 node bash[1792230]: INFO [01-15|19:29:11.482] Got new milestone from heimdall          start=52,363,294 end=52,363,312 hash=0xb4edee2970973c1cce3a70818d068bc593527fc21bf63a788e263d788d33f21d
Jan 15 19:29:14 node bash[1792230]: WARN [01-15|19:29:14.669] an error while trying fetching from Heimdall path="/milestone/ID/87b68a5a-525a-4751-8114-91ea81ef5758 - 0xa3080eb98579c53576569e9305e2f38cf155587e" attempt=1 error="error while fetching data from Heimdall: response code 500"
Jan 15 19:29:14 node bash[1792230]: INFO [01-15|19:29:14.669] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/87b68a5a-525a-4751-8114-91ea81ef5758 - 0xa3080eb98579c53576569e9305e2f38cf155587e" attempt=1
Jan 15 19:29:19 node bash[1792230]: INFO [01-15|19:29:19.670] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/87b68a5a-525a-4751-8114-91ea81ef5758 - 0xa3080eb98579c53576569e9305e2f38cf155587e" attempt=2
Jan 15 19:29:23 node bash[1792230]: INFO [01-15|19:29:23.482] Got new milestone from heimdall          start=52,363,294 end=52,363,312 hash=0xb4edee2970973c1cce3a70818d068bc593527fc21bf63a788e263d788d33f21d
Jan 15 19:29:35 node bash[1792230]: INFO [01-15|19:29:35.482] Got new milestone from heimdall          start=52,363,313 end=52,363,332 hash=0x3d567c92cfd4a3a324d3a7eba3080eb98579c53576569e9305e2f38cf155587e
Jan 15 19:29:40 node bash[1792230]: WARN [01-15|19:29:40.993] an error while trying fetching from Heimdall path="/milestone/ID/101a0390-77fa-477e-8893-6ff4088b9fdf - 0x28fb0bf92ba0d396fbb5dab9addfe0d8cdbb3943" attempt=1 error="error while fetching data from Heimdall: response code 500"
Jan 15 19:29:40 node bash[1792230]: INFO [01-15|19:29:40.993] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/101a0390-77fa-477e-8893-6ff4088b9fdf - 0x28fb0bf92ba0d396fbb5dab9addfe0d8cdbb3943" attempt=1
Jan 15 19:29:45 node bash[1792230]: WARN [01-15|19:29:45.481] an error while trying fetching from Heimdall path=/milestone/lastNoAck               attempt=1 error="Get \"http://localhost:1317/milestone/lastNoAck\": context deadline exceeded"
Jan 15 19:29:45 node bash[1792230]: INFO [01-15|19:29:45.481] Retrying again in 5 seconds to fetch data from Heimdall path=/milestone/lastNoAck               attempt=1
Jan 15 19:29:45 node bash[1792230]: ERROR[01-15|19:29:45.481] Failed to fetch latest no-ack milestone  err="context deadline exceeded"
Jan 15 19:29:45 node bash[1792230]: WARN [01-15|19:29:45.481] unable to handle no-ack-milestone service err="failed to fetch latest milestone"
Jan 15 19:29:47 node bash[1792230]: INFO [01-15|19:29:47.482] Got new milestone from heimdall          start=52,363,313 end=52,363,332 hash=0x3d567c92cfd4a3a324d3a7eba3080eb98579c53576569e9305e2f38cf155587e
Jan 15 19:29:59 node bash[1792230]: INFO [01-15|19:29:59.482] Got new milestone from heimdall          start=52,363,333 end=52,363,346 hash=0xb4a90873137585ed121c262e28fb0bf92ba0d396fbb5dab9addfe0d8cdbb3943
Jan 15 19:30:11 node bash[1792230]: INFO [01-15|19:30:11.482] Got new milestone from heimdall          start=52,363,333 end=52,363,346 hash=0xb4a90873137585ed121c262e28fb0bf92ba0d396fbb5dab9addfe0d8cdbb3943
Jan 15 19:30:13 node bash[1792230]: WARN [01-15|19:30:13.817] an error while trying fetching from Heimdall path="/milestone/ID/c993dc9b-a79f-4562-aeae-cf7ef1d95e7b - 0x82396480ecb6225eae74d405f615ebab133737ca" attempt=1 error="error while fetching data from Heimdall: response code 500"
Jan 15 19:30:13 node bash[1792230]: INFO [01-15|19:30:13.817] Retrying again in 5 seconds to fetch data from Heimdall path="/milestone/ID/c993dc9b-a79f-4562-aeae-cf7ef1d95e7b - 0x82396480ecb6225eae74d405f615ebab133737ca" attempt=1

CitizenSearcher (Author):

@VAIBHAVJINDAL3012

VAIBHAVJINDAL3012 (Contributor):

Hi @CitizenSearcher. You can ignore these heimdall errors for now; they don't have any impact on reorgs. What I am saying is that sometimes the block proposed by the primary doesn't reach you on time, and the secondary's block reaches you earlier in the meantime. That causes 1-2 block reorgs, which are acceptable since P2P latency causes them with high probability.

CitizenSearcher (Author):

As per the logs I posted, this issue most certainly was causing a reorg on my client. The primary's block arrived before the secondary's block, but the lock held up the import of both until after the secondary's had arrived.

The issue I'm pointing out is that this 5-second lock causes delays in block importing, and if it happens on validator sentries they'd be unable to propagate the block during this time. In your last post, you mentioned that this function shouldn't get called on regular full nodes, yet I'm noticing it gets called fairly often. Either my nodes are misconfigured, or there's some bug in heimdall.

Either way, I'm glad your PR to remove the milestone check on the bor side has been merged, as it should fix this. I hope it will make it into the next release.

VAIBHAVJINDAL3012 (Contributor):

There is no locking on sentry nodes, and even on validators the lock is taken on the block at tip-16, so it can't cause any issue with importing.

VAIBHAVJINDAL3012 (Contributor):

Can you please send me your Heimdall signer address which is on your machine?

CitizenSearcher (Author) commented Jan 23, 2024

Further examples of logs:

Jan 20 22:53:02 node bash[1792230]: WARN [01-20|22:53:02.977] Served bor_getVoteOnHash                 conn=127.0.0.1:33998    reqid=14625    duration=5.000935716s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:53:56 node bash[1792230]: WARN [01-20|22:53:56.240] Served bor_getVoteOnHash                 conn=127.0.0.1:42586    reqid=14627    duration=5.000325257s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:54:27 node bash[1792230]: WARN [01-20|22:54:27.811] Served bor_getVoteOnHash                 conn=127.0.0.1:39678    reqid=14629    duration=5.000856809s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:55:19 node bash[1792230]: WARN [01-20|22:55:19.692] Served bor_getVoteOnHash                 conn=127.0.0.1:52846    reqid=14631    duration=5.000073725s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:56:05 node bash[1792230]: WARN [01-20|22:56:05.155] Served bor_getVoteOnHash                 conn=127.0.0.1:41454    reqid=14633    duration=5.000256351s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:56:43 node bash[1792230]: WARN [01-20|22:56:43.823] Served bor_getVoteOnHash                 conn=127.0.0.1:48968    reqid=14635    duration=4.999867794s err="Milestone ID doesn't exist in Heimdall"
Jan 20 22:57:24 node bash[1792230]: WARN [01-20|22:57:24.511] Served bor_getVoteOnHash                 conn=127.0.0.1:45092    reqid=14637    duration=5.000941802s err="Milestone ID doesn't exist in Heimdall"

And the heimdall signer info:

    "validator_info": {
      "address": "770BDA9E243ECC7CD44DF13620643FA82357DC48",
      "pub_key": {
        "type": "tendermint/PubKeySecp256k1",
        "value": "BM7MrpUYl4CyaPqh8ChomlKBAt4ubTuOGowDQfCszUcYOghS2SbXSccnulAb2DzRhWzss0ihjv+VZdfsRWwCuD8="
      },
      "voting_power": "0"
    }

To deploy nodes, I'm just using the configs from launch/mainnet-v1/sentry/sentry/

CitizenSearcher (Author):

@VAIBHAVJINDAL3012

github-actions bot commented Feb 7, 2024

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

VAIBHAVJINDAL3012 (Contributor):

Hi @CitizenSearcher, I checked on my end.
Our new version removes this interaction.
Thanks a lot.

CitizenSearcher (Author):

> Hi @CitizenSearcher, I checked on my end. Our new version removes this interaction. Thanks a lot.

Yeah, I saw; closing the thread now. The lock being maintained throughout the retry loop is probably suboptimal, but I doubt it matters after your change.
