Track sealing processes across lotus-miner restarts #3618

Merged
merged 78 commits into from
Oct 28, 2020

Conversation

magik6k
Contributor

@magik6k magik6k commented Sep 7, 2020

Summary of changes here:

  • Worker calls are now async
    • When called, worker methods:
      • Generate and persist a CallID (basically a random UUID)
      • Start the work in a goroutine
      • Return the CallID to the manager
      • After the work is finished, call Return[...] on the Manager with the CallID and the result
  • Manager now tracks work and calls across restarts
    • When a method is called on the manager, it generates a WorkID (a tuple of task type (like PreCommit1) and params)
    • If this WorkID is not already running, we persist it in the work statestore (setting the work state to started)
    • We schedule the task as before; after the scheduler allocates us to a worker, we execute the worker call
    • We connect the CallID we got from the worker to the WorkID, and set the work state to running
    • After the worker returns, we translate the CallID to the WorkID, pipe the result to the correct place, and mark the Work as done and the Call as returned
  • Manager <-> Worker RPCs now use HTTP instead of websockets; not having to rely on a single TCP connection surviving for days is much more robust
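
The flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type and function names (`Manager`, `Worker`, `Run`, `Start`, `newWorkID`) are hypothetical, in-memory maps stand in for the persisted statestore, a direct method call stands in for the HTTP RPC, and deriving the WorkID by hashing JSON-encoded params is an assumption.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sync"
)

// CallID identifies one call on a worker (random, UUID-like).
type CallID string

// WorkID identifies a unit of work: task type plus a digest of its params.
type WorkID struct {
	Method string
	Params string
}

func newCallID() CallID {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return CallID(hex.EncodeToString(b))
}

// newWorkID is deterministic: same task type and params give the same ID.
func newWorkID(method string, params ...interface{}) (WorkID, error) {
	pb, err := json.Marshal(params)
	if err != nil {
		return WorkID{}, err
	}
	sum := sha256.Sum256(pb)
	return WorkID{Method: method, Params: hex.EncodeToString(sum[:])}, nil
}

type work struct {
	state  string // "started" -> "running" -> "done"
	result string
	done   chan struct{}
}

// Manager tracks work and calls; the maps stand in for a persistent store.
type Manager struct {
	mu    sync.Mutex
	works map[WorkID]*work
	calls map[CallID]WorkID
}

func NewManager() *Manager {
	return &Manager{works: map[WorkID]*work{}, calls: map[CallID]WorkID{}}
}

type Worker struct{ mgr *Manager }

// Start returns a CallID immediately and runs the task in a goroutine,
// reporting the result back to the manager when it finishes.
func (w *Worker) Start(task func() string) CallID {
	id := newCallID() // a real worker would persist this before starting
	go func() { w.mgr.Return(id, task()) }()
	return id
}

// Run starts the work unless the same WorkID is already tracked, then waits.
func (m *Manager) Run(w *Worker, method string, task func() string, params ...interface{}) (string, error) {
	wid, err := newWorkID(method, params...)
	if err != nil {
		return "", err
	}
	m.mu.Lock()
	wk, ok := m.works[wid]
	if !ok {
		wk = &work{state: "started", done: make(chan struct{})}
		m.works[wid] = wk
		cid := w.Start(task)
		m.calls[cid] = wid // link the CallID to the WorkID
		wk.state = "running"
	}
	m.mu.Unlock()
	<-wk.done
	return wk.result, nil
}

// Return translates the CallID back to its WorkID and delivers the result.
func (m *Manager) Return(cid CallID, result string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	wid, ok := m.calls[cid]
	if !ok {
		return // unknown call; ignored in this sketch
	}
	wk := m.works[wid]
	wk.result, wk.state = result, "done"
	delete(m.calls, cid) // the call has returned
	close(wk.done)
}

func main() {
	m := NewManager()
	w := &Worker{mgr: m}
	res, err := m.Run(w, "PreCommit1", func() string { return "sealed" }, 1)
	fmt.Println(res, err)
}
```

Because the WorkID is derived only from the task type and params, a manager that reloads its work state after a restart can recognize that a request matches work already in flight and wait for its result rather than starting it again.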

@magik6k magik6k force-pushed the feat/async-restartable-workers branch from 948d76d to a596ea4 Compare September 14, 2020 07:45
@magik6k magik6k force-pushed the feat/async-restartable-workers branch from a596ea4 to 1ebca8f Compare September 14, 2020 17:11
@magik6k magik6k force-pushed the feat/async-restartable-workers branch from 491e604 to 17680ff Compare September 16, 2020 22:36
@magik6k magik6k force-pushed the feat/async-restartable-workers branch from 4115dca to 6185e15 Compare September 22, 2020 22:29
Member

@whyrusleeping whyrusleeping left a comment

Alright, so I left a number of comments, and wasn't able to do a good job reviewing worker_local.go, but overall this LGTM.

Contributor

@arajasek arajasek left a comment

Okay, I'll call this "approved" after having spent a couple more hours on it, but that basically means "I think I understand how this is supposed to work". Ish.

@magik6k magik6k force-pushed the feat/async-restartable-workers branch from 844a308 to e0f7b19 Compare October 28, 2020 14:12
@magik6k magik6k force-pushed the feat/async-restartable-workers branch from e0f7b19 to 4100f6e Compare October 28, 2020 14:23
Contributor

@Kubuxu Kubuxu left a comment

SGWM, but if there is a subtle bug here, I've for sure missed it.

@magik6k magik6k linked an issue Oct 28, 2020 that may be closed by this pull request
@magik6k
Contributor Author

magik6k commented Oct 28, 2020

Closes #3453
Closes #3585
Closes #3597
Closes #3672
Closes #3744
Closes #3814
Closes #3815
Closes #3825
Closes #4262

@magik6k magik6k closed this Oct 28, 2020
@magik6k magik6k reopened this Oct 28, 2020
@magik6k magik6k merged commit 32ea060 into master Oct 28, 2020
@magik6k magik6k deleted the feat/async-restartable-workers branch October 28, 2020 20:55
@cryptowhizzard

Seems my issue is not resolved yet (on the latest master build).

2020-11-10T13:42:13.729 INFO filecoin_proofs::api > generate_piece_commitment:start
2020-11-10T13:42:14.093 INFO filecoin_proofs::api > generate_piece_commitment:finish
2020-11-10T13:42:14.098 INFO filecoin_proofs::api > generate_piece_commitment:start
2020-11-10T13:42:14.173 INFO filecoin_proofs::api > generate_piece_commitment:finish
2020-11-10T13:42:14.180 INFO filecoin_proofs::api > generate_piece_commitment:start
2020-11-10T13:42:17.729Z INFO miner miner/miner.go:382 Time delta between now and our mining base: 47s (nulls: 0)
2020-11-10T13:42:18.176Z INFO rpc go-jsonrpc@v0.1.2-0.20201008195726-68c6a2704e49/client.go:346 rpc output message buffer {"n": 4}
2020-11-10T13:42:18.367Z INFO miner miner/miner.go:382 Time delta between now and our mining base: 48s (nulls: 1)
2020-11-10T13:42:18.419Z INFO rpc go-jsonrpc@v0.1.2-0.20201008195726-68c6a2704e49/client.go:346 rpc output message buffer {"n": 5}
2020-11-10T13:43:13.667Z ERROR rpc go-jsonrpc@v0.1.2-0.20201008195726-68c6a2704e49/websocket.go:654 Connection timeout {"remote": "127.0.0.1:1234"}
2020-11-10T13:43:13.667Z ERROR storageminer storage/wdpost_run.go:751 handler: websocket connection closed
2020-11-10T13:43:13.667Z WARN storageminer storage/wdpost_sched.go:106 window post scheduler notifs channel closed
2020-11-10T13:43:13.667Z ERROR events events/events_called.go:382 event diff fn failed: handler: websocket connection closed
2020-11-10T13:43:13.667Z WARN events events/events.go:97 listenHeadChanges quit
2020-11-10T13:43:13.667Z WARN rpc go-jsonrpc@v0.1.2-0.20201008195726-68c6a2704e49/websocket.go:289 failed to send request {"method": "xrpc.cancel", "id": 2361, "error": "write tcp 127.0.0.1:38560->127.0.0.1:1234: use of closed network connection"}
2020-11-10T13:43:13.669Z ERROR storageminer storage/wdpost_sched.go:94 ChainNotify error: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z ERROR events events/events_called.go:478 getting parent tipset in checkNewCalls: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z WARN events events/events.go:155 headChange failed: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z ERROR events events/events_called.go:382 event diff fn failed: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z ERROR events events/events_called.go:478 getting parent tipset in checkNewCalls: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z WARN events events/events.go:155 headChange failed: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:13.669Z WARN events events/events.go:97 listenHeadChanges quit
2020-11-10T13:43:13.669Z WARN rpc go-jsonrpc@v0.1.2-0.20201008195726-68c6a2704e49/websocket.go:289 failed to send request {"method": "xrpc.cancel", "id": 2349, "error": "write tcp 127.0.0.1:38560->127.0.0.1:1234: use of closed network connection"}
2020-11-10T13:43:14.669Z INFO events events/events.go:104 restarting listenHeadChanges
2020-11-10T13:43:14.669Z ERROR events events/events.go:95 listen head changes errored: listenHeadChanges ChainNotify call failed: RPC client error: sendRequest failed: websocket routine exiting
2020-11-10T13:43:14.669Z INFO events events/events.go:104 restarting listenHeadChanges
2020-11-10T13:43:14.669Z ERROR events events/events.go:95 listen head changes errored: listenHeadChanges ChainNotify call failed: RPC client error: sendRequest failed: websocket routine exiting
