on restart miner shouldn't dial client #463

Merged: @dirkmc merged 3 commits into master from feat/miner-dont-dial-client on Dec 16, 2020

Conversation

@dirkmc (Contributor) commented Dec 2, 2020

Addresses filecoin-project/lotus#4991
Depends on filecoin-project/go-data-transfer#127

I think it may make more sense to let go-data-transfer take care of restarts because it already has retry logic, so

  • it keeps retry logic in one place (just in go-data-transfer)
  • it makes the go-fil-markets FSM simpler
  • it prevents retries in go-data-transfer and go-fil-markets from getting out of sync

To that end in this PR:

  • When the miner restarts, don't try to dial the client (let the client attempt to reconnect to the miner)
  • When the connection from the client to the miner goes down, rely on go-data-transfer to attempt to redial
  • If the connection fails, just error out the deal

This PR does not change the existing behaviour on the miner when the connection (but not the miner itself) goes down:

  • stay in StorageDealTransferring (don't error out the deal)
  • go-data-transfer attempts to retry the connection

This PR does not change the existing behaviour on the client when the client restarts:

  • attempt to reconnect to the miner

Current retry behaviour in go-data-transfer is:

  • 5 attempts, with an exponential backoff factor of 5x
  • starting at 1s
  • capped at a maximum of 5 min

I suggest:

  • leaving retry behaviour as it is now on the miner (i.e. try to reconnect for a few minutes, then give up and wait for the client to reconnect)
  • changing retry behaviour at the network level on the client to a maximum of 15 attempts with an exponential backoff factor of 5x:
    1s + 5s + 25s + 125s + 5m + (10 x 5m) ~= 1 hour (see the sketch after this list)
    Note: This will be configured in lotus (not in go-fil-markets)
  • changing go-data-transfer to restart push transfers
    Implemented in "Automatically restart push channel" (go-data-transfer#127)
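
As a sanity check on those numbers, here is a minimal Go sketch that sums a capped exponential backoff schedule for both the current and the proposed parameters (illustrative only -- this is not the actual go-data-transfer configuration API):

```go
package main

import (
	"fmt"
	"time"
)

// backoffTotal sums a capped exponential backoff schedule:
// delays of min, min*factor, min*factor^2, ... each clamped to max.
func backoffTotal(attempts int, min, max time.Duration, factor float64) time.Duration {
	total := time.Duration(0)
	delay := min
	for i := 0; i < attempts; i++ {
		total += delay
		delay = time.Duration(float64(delay) * factor)
		if delay > max {
			delay = max
		}
	}
	return total
}

func main() {
	// current: 5 attempts  => 1s+5s+25s+125s+5m           = 7m36s
	fmt.Println(backoffTotal(5, time.Second, 5*time.Minute, 5))
	// proposed: 15 attempts => above + 10 more at 5m each = 57m36s (~1 hour)
	fmt.Println(backoffTotal(15, time.Second, 5*time.Minute, 5))
}
```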

Notes:

  • While testing I discovered that on the provider side we validate incoming data-transfer requests before the provider has finished starting up. This means that when the provider is restarted, it may reject a voucher because it doesn't find the deal in its local store. So I modified the validation code to block until the provider has completed starting up (see the sketch below).
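
A minimal sketch of that gate (the actual change lives in shared/ready.go in this PR; the names here are illustrative, not the real API):

```go
import "context"

// ReadyGate blocks callers until startup has completed.
type ReadyGate struct {
	done chan struct{}
}

func NewReadyGate() *ReadyGate {
	return &ReadyGate{done: make(chan struct{})}
}

// Signal marks startup as complete, releasing all waiters.
func (g *ReadyGate) Signal() { close(g.done) }

// AwaitReady blocks until Signal is called or ctx is cancelled. The request
// validator would call this before looking up the deal in the local store.
func (g *ReadyGate) AwaitReady(ctx context.Context) error {
	select {
	case <-g.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```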

TODO:

  • update tests

@dirkmc changed the title from "on disconnect miner shouldn't dial client" to "on restart miner shouldn't dial client" on Dec 2, 2020
@dirkmc force-pushed the feat/miner-dont-dial-client branch from 75e7831 to dc6bb03 on December 2, 2020 15:13
@hannahhoward (Collaborator) left a comment:

I definitely have no problem removing the provider restart

Regarding the "move restart to data transfer" -- this will be more clear when you're in the guts of data transfer, but there is a qualitative difference between restarting in the libp2p protocol and the concept of "stalled".

The current retry logic you identify is for the data transfer libp2p protocol. However, data transfer is both its own libp2p protocol and go-graphsync. go-graphsync has its own retry logic (which, side note, needs some work). A "stall" occurs when go-graphsync experiences a network error -- which happens when it exhausts its own internal retry logic, which is, among other things, fairly short. At that point the graphsync request fails, and to "restart", data transfer actually has to internally make a new graphsync request (this is what happens when you call "Restart Data Transfer").

So we have:

  • retrying the data transfer libp2p protocol (internal to go-data-transfer)
  • retrying the go-graphsync libp2p protocol (internal to go-graphsync)
  • retrying the go-graphsync request (internal to go-data-transfer but currently ONLY when you externally call RestartDataTransfer)

I just want to identify that there are some layers of complexity here -- not saying they are necessary, but to make a clean solution you probably need to understand what they are :)

@@ -265,6 +265,7 @@ const (
ProviderEventRestart

// ProviderEventDataTransferRestartFailed means a data transfer that was restarted by the provider failed
// Deprecated: this event is no longer used
Collaborator commented:

Just want to flag a thing for the future: at some point we are going to want to re-arrange these events and clean up deprecated ones.

When this happens, two things will be needed:

  • a migration - see our whole migrations process using go-ds-migrations
  • a protocol upgrade - meaning that we will need to define a new version of the protocol, and translate event numbers back and forth for older protocol versions (loosely sketched below)
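
For instance, the event-number translation might look roughly like this (entirely hypothetical -- no such code exists yet, and the numbers are invented):

```go
// Hypothetical sketch: after deprecated events are deleted and the remaining
// ones renumbered, a newer node must translate event numbers when speaking
// the older protocol version to an older peer.
type ProviderEvent uint64

// each renumbered v2 event mapped back to its old v1 number (values invented)
var v2ToV1 = map[ProviderEvent]ProviderEvent{
	3: 4,
	4: 5,
}

func wireEvent(peerSpeaksV1 bool, evt ProviderEvent) ProviderEvent {
	if peerSpeaksV1 {
		if old, ok := v2ToV1[evt]; ok {
			return old
		}
	}
	return evt
}
```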

Collaborator commented:

NOTE: this is the markets libp2p protocol here FWIW

@dirkmc (Contributor, Author) commented Dec 3, 2020

Thanks for the detailed explanation @hannahhoward

Looking at the code for go-data-transfer it seems like it will fire the Disconnected event whenever:

  • there's a network error from the transport layer (graphsync)
  • there's a network error at the go-data-transfer layer

As you mentioned, in both go-data-transfer and go-graphsync there is retry logic at the network (libp2p) layer. So I suggest we simply adjust the parameters to add more network-level retry attempts in both protocols, on the client side. If after all those retries we still can't reach the provider, then I think it's reasonable to error out the deal. (The event subscription is sketched below.)

Edit: My mistake, there is retry logic in go-data-transfer and go-fil-markets at the network layer, but the retry logic in go-graphsync is at a higher layer
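
For reference, go-fil-markets observes these errors through the data-transfer event subscription, roughly like this (the Disconnected event code is my reading of go-data-transfer's event list, so treat it as an assumption):

```go
import (
	"fmt"

	datatransfer "github.com/filecoin-project/go-data-transfer"
)

// watchDisconnects logs whenever a channel reports a network error.
func watchDisconnects(dt datatransfer.Manager) datatransfer.Unsubscribe {
	return dt.SubscribeToEvents(func(ev datatransfer.Event, st datatransfer.ChannelState) {
		if ev.Code == datatransfer.Disconnected {
			// fires on a network error from the transport (graphsync)
			// or from the go-data-transfer layer itself
			fmt.Printf("channel %v disconnected: %s\n", st.ChannelID(), ev.Message)
		}
	})
}
```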

@dirkmc (Contributor, Author) commented Dec 4, 2020

I had misunderstood how the retry logic works in go-graphsync. In summary:

  • go-graphsync makes 10 attempts to send a message
  • each attempt waits up to 10 minutes for a successful connection
  • an attempt may fail immediately if the other side rejects the connection
  • if an attempt fails, go-graphsync closes the connection and waits 100ms before trying again

Therefore (see the sketch after this list):

  • if the other end rejects connection attempts, it's possible a send will fail within 10 x 100ms = 1s
  • if every connection attempt times out, it's possible a send will take up to 10 x 10 minutes = 1h 40m
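
In code, that behaviour looks roughly like this (an illustrative sketch, not go-graphsync's actual implementation):

```go
import (
	"context"
	"time"
)

const (
	maxAttempts    = 10               // attempts to send a message
	connectTimeout = 10 * time.Minute // per-attempt wait for a connection
	retryDelay     = 100 * time.Millisecond
)

// sendWithRetry mirrors the behaviour described above: each attempt waits up
// to connectTimeout, may fail fast if the peer rejects the connection, and on
// failure the connection is closed and we back off briefly before retrying.
func sendWithRetry(ctx context.Context, send func(context.Context) error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, connectTimeout)
		err = send(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		select {
		case <-time.After(retryDelay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```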

I think to keep things moving forward we should keep the original intention of this PR: to prevent the miner from failing the deal if the miner can't contact the client on restart. We can tackle the case of the client reconnecting to the miner in a separate PR.

@dirkmc (Contributor, Author) commented Dec 4, 2020

This PR is currently blocked by an issue with go-graphsync: ipfs/go-graphsync#124

Essentially graphsync is not propagating some network errors up the stack, which means that when the client attempts to make a data transfer, the transfer fails silently and the deal gets stuck in StorageDealStartDataTransfer or StorageDealTransferring.

In particular, we need to write a test that (sketched below):

  • starts a data transfer
  • shuts down the provider
  • restarts the provider
  • verifies that the data transfer completes
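
Sketched as a Go test (newTestHarness and its methods are hypothetical placeholders, not this repo's actual shared_testutil API):

```go
import (
	"context"
	"testing"

	"github.com/filecoin-project/go-fil-markets/storagemarket"
)

func TestTransferCompletesAfterProviderRestart(t *testing.T) {
	ctx := context.Background()
	h := newTestHarness(t) // hypothetical: wires up a client and a provider node

	// starts a data transfer
	deal := h.ProposeStorageDeal(ctx)
	h.WaitForDealState(deal, storagemarket.StorageDealTransferring)

	// shuts down the provider mid-transfer
	h.StopProvider()

	// restarts the provider
	h.StartProvider()

	// verifies that the data transfer completes
	// (currently this hangs, because graphsync swallows the network error:
	// ipfs/go-graphsync#124)
	h.WaitForTransferComplete(deal)
}
```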

@dirkmc (Contributor, Author) commented Dec 10, 2020

When the client wants to send data to a provider:

  1. go-data-transfer client opens a "push" channel to the provider
  2. when go-data-transfer provider receives the "push" message, it opens a go-graphsync query to the client for the file
  3. on the client go-graphsync responds to the query from the provider by sending the file data

If the miner goes down during step 3, the client currently will not automatically try to restart the channel.
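
For reference, step 1 on the client looks roughly like this (a sketch against go-data-transfer's Manager interface as I read it; treat the exact signature, the AllSelector helper, and the voucher wiring as assumptions):

```go
import (
	"context"

	datatransfer "github.com/filecoin-project/go-data-transfer"
	"github.com/filecoin-project/go-fil-markets/shared"
	"github.com/filecoin-project/go-fil-markets/storagemarket/impl/requestvalidation"
	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/peer"
)

// openPush performs step 1: the client opens a "push" channel to the
// provider, who then opens a graphsync query back to us (steps 2-3).
func openPush(ctx context.Context, dt datatransfer.Manager, provider peer.ID,
	proposal cid.Cid, payload cid.Cid) (datatransfer.ChannelID, error) {

	// the voucher ties the transfer back to the storage deal proposal
	voucher := &requestvalidation.StorageDataTransferVoucher{Proposal: proposal}

	// transfer the whole DAG rooted at the payload CID
	return dt.OpenPushDataChannel(ctx, provider, voucher, payload, shared.AllSelector())
}
```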

Proposal

After the client opens a "push" channel to the provider, monitor the data rate on the channel; if it falls below a minimum, treat the transfer as stalled and automatically restart it, backing off between attempts.

Default config params:

  • minimum data rate: 1MB / minute
  • restart after stall:
    • 10 attempts
    • min backoff: 1s
    • max backoff: 5m
    • backoff factor: 5

Edit: This proposal is implemented in filecoin-project/go-data-transfer#127
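
The proposed defaults, expressed as a Go config struct (the struct and field names here are hypothetical -- see go-data-transfer#127 for the real configuration):

```go
import "time"

// pushRestartConfig is a hypothetical shape for the proposal above.
type pushRestartConfig struct {
	MinBytesPerInterval uint64        // data rate below which the channel counts as stalled
	Interval            time.Duration // measurement window for the data rate
	MaxRestartAttempts  int
	MinBackoff          time.Duration
	MaxBackoff          time.Duration
	BackoffFactor       float64
}

var defaultPushRestart = pushRestartConfig{
	MinBytesPerInterval: 1 << 20,     // 1 MiB...
	Interval:            time.Minute, // ...per minute
	MaxRestartAttempts:  10,
	MinBackoff:          time.Second,
	MaxBackoff:          5 * time.Minute,
	BackoffFactor:       5,
}
```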

@codecov-io commented Dec 15, 2020

Codecov Report

Merging #463 (e9d31f4) into master (aa8533f) will decrease coverage by 0.65%.
The diff coverage is 43.55%.

@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   65.58%   64.93%   -0.64%     
==========================================
  Files          46       49       +3     
  Lines        3198     3259      +61     
==========================================
+ Hits         2097     2116      +19     
- Misses        875      916      +41     
- Partials      226      227       +1     
Impacted Files Coverage Δ
storagemarket/impl/clientstates/client_fsm.go 82.73% <0.00%> (ø)
storagemarket/impl/provider_environments.go 0.00% <0.00%> (-2.12%) ⬇️
storagemarket/impl/providerstates/provider_fsm.go 73.95% <0.00%> (+2.41%) ⬆️
...oragemarket/impl/providerstates/provider_states.go 85.12% <ø> (+0.93%) ⬆️
storagemarket/impl/requestvalidation/types.go 0.00% <ø> (ø)
storagemarket/impl/provider.go 27.21% <60.00%> (-0.20%) ⬇️
shared/ready.go 60.47% <62.86%> (ø)
storagemarket/impl/clientstates/client_states.go 89.93% <100.00%> (ø)
storagemarket/impl/requestvalidation/common.go 86.21% <100.00%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dirkmc marked this pull request as ready for review on December 15, 2020 15:18
@hannahhoward (Collaborator) left a comment:

LGTM, a couple of suggestions. Like the ready manager refactor -- shoulda added this a bit ago!

go.mod: review thread (outdated, resolved)
FromMany(storagemarket.StorageDealTransferring, storagemarket.StorageDealProviderTransferAwaitRestart).
ToJustRecord().
Action(func(deal *storagemarket.MinerDeal) error {
deal.Message = "data transfer appears to be stalled, awaiting reconnect from client"
Collaborator commented:

just want to flag that based on how I'm reading this, it looks like you want to delete the miner restart data transfer commands in Lotus when you merge this.

Collaborator commented:

I mean I guess they can still get called if it's manual -- at least the miner bears responsibility if they do. Also, they would probably never actually do that anyway.

@dirkmc (Contributor, Author) replied:

Yeah, that's a good point; we will need to think through whether we still need those commands.

@dirkmc merged commit b0c1f6d into master on Dec 16, 2020
@dirkmc deleted the feat/miner-dont-dial-client branch on December 16, 2020 14:25
@dirkmc mentioned this pull request on Dec 16, 2020