Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird checkpointing behaviour #422

Closed
jsoares opened this issue Apr 20, 2023 · 3 comments
Closed

Weird checkpointing behaviour #422

jsoares opened this issue Apr 20, 2023 · 3 comments
Labels

Comments

@jsoares
Copy link
Contributor

jsoares commented Apr 20, 2023

Context: followed guide to completion. Deployed subnet t01020 with bottom-up and top-down periods of 30 blocks (default). No deployment issues.

Immediately after deployment, we get (notice checkpoint send fails):

[2023-04-20T22:42:35Z INFO  ipc_agent::manager::checkpoint] Config changed, reloading checkpointing subsystem
[2023-04-20T22:42:35Z INFO  ipc_agent::manager::bottomup] Starting bottom-up checkpoint manager for (child, parent) subnet pair (/root/t01020, /root)
[2023-04-20T22:42:35Z INFO  ipc_agent::manager::topdown] Starting top-down checkpoint manager for (child, parent) subnet pair (/root/t01020, /root)
[2023-04-20T22:43:49Z INFO  ipc_agent::manager::lotus] joined subnet: /root/t01020
[2023-04-20T22:44:19Z INFO  ipc_agent::manager::lotus] joined subnet: /root/t01020
[2023-04-20T22:44:19Z INFO  ipc_agent::manager::lotus] joined subnet: /root/t01020
[2023-04-20T22:45:45Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq
[2023-04-20T22:45:45Z ERROR ipc_agent::manager::topdown] error submitting top-down checkpoint for epoch 28743 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T22:45:45Z WARN  ipc_agent::manager::topdown] error submitting top-down checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: gas estimation failed: CallWithGas failed: call raw get actor: resolution lookup failed (t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq): resolve address t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq: actor not found"})
[2023-04-20T22:45:55Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1xqaib6dlictungc6zwbmk2gyz4qc7ksydyxltdq
[2023-04-20T22:45:55Z INFO  ipc_agent::manager::bottomup] Submitting checkpoint bottom-up for account t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq and epoch 30 from child /root/t01020
[2023-04-20T22:45:55Z INFO  ipc_agent::manager::bottomup] checkpoint at epoch 30 contains 0 number of cross messages
[2023-04-20T22:45:55Z INFO  ipc_agent::manager::topdown] waiting for top-down checkpoint for epoch 28743 to be committed
[2023-04-20T22:45:55Z INFO  ipc_agent::manager::bottomup] waiting bottom-up for checkpoint for epoch 30 to be committed
[2023-04-20T22:45:57Z INFO  ipc_agent::manager::topdown] successfully published top-down checkpoint submission for epoch 28743
[2023-04-20T22:45:57Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1xqaib6dlictungc6zwbmk2gyz4qc7ksydyxltdq
[2023-04-20T22:45:57Z INFO  ipc_agent::manager::topdown] waiting for top-down checkpoint for epoch 28743 to be committed
[2023-04-20T22:45:59Z INFO  ipc_agent::manager::topdown] successfully published top-down checkpoint submission for epoch 28743
[2023-04-20T22:45:59Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1xqaib6dlictungc6zwbmk2gyz4qc7ksydyxltdq
[2023-04-20T22:45:59Z ERROR ipc_agent::manager::topdown] error submitting top-down checkpoint for epoch 28743 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T22:45:59Z WARN  ipc_agent::manager::topdown] error submitting top-down checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: message execution failed: exit 22, reason: message failed with backtrace:\n00: f064 (method 1290391472) -- already submitted (22)\n (RetCode=22)"})
[2023-04-20T22:46:04Z INFO  ipc_agent::manager::bottomup] successfully published bottom-up checkpoint submission for epoch 30
[2023-04-20T22:46:04Z INFO  ipc_agent::manager::bottomup] Submitting checkpoint bottom-up for account t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq and epoch 30 from child /root/t01020
[2023-04-20T22:46:04Z INFO  ipc_agent::manager::bottomup] checkpoint at epoch 30 contains 0 number of cross messages
[2023-04-20T22:46:04Z ERROR ipc_agent::manager::bottomup] error submitting bottom-up checkpoint for epoch 30 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T22:46:04Z WARN  ipc_agent::manager::bottomup] error submitting bottom-up checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: message execution failed: exit 20, reason: message failed with backtrace:\n00: f01020 (method 1487690799) -- already submitted (20)\n (RetCode=20)"})

Now it's 1000 block periods later. The subnet validator logs look pretty normal, but not anything else.

Agent:

[2023-04-20T23:12:01Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq
[2023-04-20T23:12:01Z ERROR ipc_agent::manager::topdown] error submitting top-down checkpoint for epoch 29073 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T23:12:01Z WARN  ipc_agent::manager::topdown] error submitting top-down checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: message execution failed: exit 22, reason: message failed with backtrace:\n00: f064 (method 1290391472) -- already submitted (22)\n (RetCode=22)"})
[2023-04-20T23:12:11Z INFO  ipc_agent::manager::topdown] Submitting topdown checkpoint for account t1xqaib6dlictungc6zwbmk2gyz4qc7ksydyxltdq
[2023-04-20T23:12:11Z ERROR ipc_agent::manager::topdown] error submitting top-down checkpoint for epoch 29073 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T23:12:11Z WARN  ipc_agent::manager::topdown] error submitting top-down checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: message execution failed: exit 22, reason: message failed with backtrace:\n00: f064 (method 1290391472) -- already submitted (22)\n (RetCode=22)"})
[2023-04-20T23:12:20Z INFO  ipc_agent::manager::bottomup] successfully published bottom-up checkpoint submission for epoch 210
[2023-04-20T23:12:20Z INFO  ipc_agent::manager::bottomup] Submitting checkpoint bottom-up for account t1yihlouxw3sxwbk2sqcnde3hekh6qwbhmlc2cjwy and epoch 210 from child /root/t01020
[2023-04-20T23:12:20Z INFO  ipc_agent::manager::bottomup] checkpoint at epoch 210 contains 0 number of cross messages
[2023-04-20T23:12:20Z ERROR ipc_agent::manager::bottomup] error submitting bottom-up checkpoint for epoch 210 in subnet: SubnetID { parent: "/root", actor: Address { payload: ID(1020) } }
[2023-04-20T23:12:20Z WARN  ipc_agent::manager::bottomup] error submitting bottom-up checkpoint, waiting to next iteration: Err(json_rpc error: {"code":1,"message":"GasEstimateMessageGas error: message execution failed: exit 20, reason: message failed with backtrace:\n00: f01020 (method 1487690799) -- already submitted (20)\n (RetCode=20)"})

(note that it's trying and failing to submit checkpoint for epoch 210)

Checkpoints are in fact getting in, but lagging way behind:

ubuntu@ip-172-26-28-130:~/ipc$ ./ipc-agent/bin/ipc-agent checkpoint list-bottomup --from-epoch 1 --to-epoch 28000 --subnet /root/t01020
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 30 - prev_check="bafy2bzacedkoa623kvi5gfis2yks7xxjl73vg7xwbojz4tpq63dd5jpfz757i", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 60 - prev_check="bafy2bzacebxuxxjwfmdwhpnwvqo3tvcijtnmilfgik2r25ptvqtzsknpzxwlc", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 90 - prev_check="bafy2bzacedl77x7rsvcnhrkiynyrqggkdcypjekoosvrmgooz5c32ey5mdyhg", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 120 - prev_check="bafy2bzaceae6g5akfzqcye3mrziihefqkaoskbkiyu7seotn4eeciibkofjas", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 150 - prev_check="bafy2bzaceaf4pyuqfluwj2xopcm774ug5gjrqw52ictsbqis4bkb7uts6mcdg", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]
[2023-04-20T23:13:53Z INFO  ipc_agent::cli::commands::checkpoint::list_checkpoints] epoch 180 - prev_check="bafy2bzaceals7ybcw5cisibz3bwxzzoa3tjauxkfmruhk5vjbx6ytjcmatgh6", cross_msgs={"cross_msgs":[],"fee":[]}, child_checks=[]

Because, at this time, this is where the subnet miner is:

2023-04-20T23:15:08.622Z        INFO    mir-consensus   mir/state_manager.go:536        Snapshot for epoch 1756 started {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.622Z        INFO    mir-consensus   mir/state_manager.go:544        Snapshot started: epoch - 1756, height - 5268   {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.622Z        INFO    mir-consensus   mir/state_manager.go:559        waiting for latest block (5268) before checkpoint to be synced to assemble the snapshot {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.622Z        INFO    mir-consensus   mir/state_manager.go:769        waitForHeight 5268 started      {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.622Z        INFO    mir-consensus   mir/state_manager.go:775        waitForHeight 5268 finished     {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.623Z        INFO    mir-consensus   mir/state_manager.go:572        Getting Cid for block height 5268 and cid bafy2bzacealbuwsudfl52oaxioppjw3loiffhrx2rmaekcakeonoyaj5fcada to include in snapshot        {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.624Z        INFO    mir-consensus   mir/state_manager.go:572        Getting Cid for block height 5267 and cid bafy2bzacecndmni6abthskuyxnsmlgduis2yxn77tlw2kzisubdkmeupn3cgo to include in snapshot        {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.624Z        INFO    mir-consensus   mir/state_manager.go:572        Getting Cid for block height 5266 and cid bafy2bzacebmbzclgjlsrocxhj5d55ilijgntdlhy2ux3vkgpzzoco32augdyw to include in snapshot        {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.624Z        INFO    mir-consensus   mir/state_manager.go:580        Snapshot finished: epoch - 1756, height - 5268  {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.624Z        INFO    mir-consensus   mir/state_manager.go:581        Snapshot for epoch 1756 finished        {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.632Z        INFO    mir-consensus   mir/state_manager.go:591        Checkpoint for epoch 1756 started       {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.632Z        INFO    mir-consensus   mir/state_manager.go:599        Mir generated new checkpoint for height: 5269   {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.632Z        INFO    mir-consensus   mir/state_manager.go:614        deliverCheckpoint for height 5269 started       {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.641Z        INFO    mir-consensus   mir/state_manager.go:667        deliverCheckpoint for height 5269 finished      {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}
2023-04-20T23:15:08.641Z        INFO    mir-consensus   mir/state_manager.go:608        Checkpoint for epoch 1756 finished      {"validator": "t1zn6zwtlx5tvqwplsuwtz6hjaewkokgq373i4bqq"}

And the root eudico shows:

2023-04-20T23:16:51.039Z        WARN    rpc     go-jsonrpc@v0.2.3/handler.go:406        error in RPC call to 'Filecoin.MpoolPushMessage': GasEstimateMessageGas error:
    github.com/filecoin-project/lotus/node/impl/full.(*MpoolAPI).MpoolPushMessage
        /home/ubuntu/ipc/lotus/node/impl/full/mpool.go:196
  - message execution failed: exit 20, reason: message failed with backtrace:
    00: f01020 (method 1487690799) -- already submitted (20)
     (RetCode=20):
    github.com/filecoin-project/lotus/node/impl/full.gasEstimateGasLimit
        /home/ubuntu/ipc/lotus/node/impl/full/gas.go:327
2023-04-20T23:17:05.873Z        WARN    sub     sub/incoming.go:102     received block with large delay from miner      {"block": "bafy2bzacebumm3wemlw5qc3p3j4fv7n36hpfb4dfphbzvshcisjqjckgtc7f4", "delay": 1682003360, "miner": "t00"}

That was bottom-up. Top-down seem to be getting through (and so do funds), even if a little delayed? At height 29315, a query shows:

ubuntu@ip-172-26-28-130:~/ipc$ ./ipc-agent/bin/ipc-agent checkpoint last-topdown --subnet /root/t01020
[2023-04-20T23:18:52Z INFO  ipc_agent::cli::commands::checkpoint::topdown_executed] Last top-down checkpoint executed in epoch: 29153
ubuntu@ip-172-26-28-130:~/ipc$ ./ipc-agent/bin/ipc-agent checkpoint last-topdown --subnet /root/t01020
[2023-04-20T23:19:38Z INFO  ipc_agent::cli::commands::checkpoint::topdown_executed] Last top-down checkpoint executed in epoch: 29163

But that isn't the weird part. That would be topdown checkpoints being committed every 10 seconds, despite the actor having been deployed with 30s period.

I feel like I tried things one too many times and might be making stupid mistakes, but I don't know what that would be in this case. it really did all seem to go well.

@adlrocha
Copy link
Contributor

adlrocha commented Apr 24, 2023

This is related with: consensus-shipyard/ipc-actors#97
And is the expected behavior after introducing this horrible patch. It will be fixed soon. Thank you for reporting it 🙏

@jsoares
Copy link
Contributor Author

jsoares commented Apr 24, 2023

So then I guess my dogefooding 🟥 was a 🟩 :)

@adlrocha
Copy link
Contributor

Fixed in #185, #182, #184

@jsoares jsoares transferred this issue from consensus-shipyard/ipc-libs Dec 19, 2023
@jsoares jsoares added the s:ipc label Dec 19, 2023
@jsoares jsoares closed this as not planned Won't fix, can't repro, duplicate, stale Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants