[Deal Making Issue] Storage AND Retrieval deals time out waiting for an Accept message #6343

Closed
aarshkshah1992 opened this issue May 27, 2021 · 6 comments · Fixed by #6369
Labels: area/markets (Area: Markets)

Comments

@aarshkshah1992

Basic Information

  • In go-fil-markets v1.3.0 and go-data-transfer v1.5.0, we've introduced a "channel monitor" that times out storage and retrieval deals if the other peer doesn't send an Accept message before a stipulated timeout. That happens here. (A minimal sketch of this timeout pattern follows this list.)
  • We've begun seeing storage deal errors where the client (Estuary) hits this timeout after proposing a storage deal to the miner.
  • We're also seeing similar errors in the dealbot retrieval deals.
  • We even saw the same error for a retrieval in a CI test failure, but unfortunately the relevant parts of the code aren't well logged and DEBUG logs weren't enabled in the test. We've tried reproducing the error by re-running the failing test multiple times on CI and locally, with no luck; it only occurs sporadically.
  • We tried moving the retrieval bot to Lotus v1.9.0, which doesn't have the Markets v1.3.0 dependency, but even then it runs into the same problems. This suggests the regression is on the miner side and was introduced by Markets v1.3.0, which contains some heavy refactoring in the data-transfer module related to retrieval restarts.
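For anyone not familiar with the channel monitor, here is the minimal sketch of the timeout pattern referenced above. This is NOT the actual go-data-transfer code; the package, type and function names (`Event`, `EventAccept`, `monitorAccept`) are made up purely for illustration:

```go
package channelmonitor

import (
	"fmt"
	"time"
)

// Event is a stand-in for the data-transfer events the channel monitor
// consumes from the event stream.
type Event int

const (
	EventAccept Event = iota // the remote peer accepted the transfer
	EventOther               // anything else (progress, blocks received, ...)
)

// monitorAccept fails the channel if no Accept event arrives before the
// configured timeout. Only an Accept stops the timer; other events don't.
func monitorAccept(events <-chan Event, acceptTimeout time.Duration) error {
	timer := time.NewTimer(acceptTimeout)
	defer timer.Stop()
	for {
		select {
		case ev := <-events:
			if ev == EventAccept {
				return nil // Accept seen in time; the deal proceeds
			}
			// the timer keeps running for every non-Accept event
		case <-timer.C:
			return fmt.Errorf("timed out waiting %s for Accept message from remote peer", acceptTimeout)
		}
	}
}
```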

Describe the problem

For Storage Deals

This describes an occurrence of the problem when @whyrusleeping ran a storage deal via Estuary (which depends on go-data-transfer v1.5.0) against @magik6k's miner (which depends on Markets v1.3.0 and go-data-transfer v1.5.0):

event  {"name": "ProviderEventOpen", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealValidating", "message": ""}
2021-05-22T02:40:40.982+0200  INFO  markets  loggers/loggers.go:20  storage provider event  {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealAcceptWait", "message": ""}
2021-05-22T02:40:40.994+0200  INFO  markets  loggers/loggers.go:20  storage provider event  {"name": "ProviderEventDataRequested", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealWaitingForData", "message": ""}
2021-05-22T02:40:41.573+0200  INFO  dt-impl  impl/events.go:298  received new channel request from 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw
2021-05-22T02:40:41.585+0200  INFO  markets  loggers/loggers.go:20  storage provider event  {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealTransferring", "message": ""}
2021-05-22T02:40:41.589+0200  INFO  dt_graphsync  graphsync/graphsync.go:189  Opening graphsync request to 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw for root QmRKPnQDpiBAv7NVroBKHrwbVDax9yyq4bPfC9DMKx24wK with 0 CIDs already received
2021-05-22T02:40:41.589+0200  INFO  dt-impl  impl/events.go:19  channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: opened
2021-05-22T02:41:06.045+0200  INFO  miner  miner/miner.go:462  Time delta between now and our mining base: 6s (nulls: 0)
2021-05-22T02:41:06.047+0200  INFO  gen  gen/gen.go:638  completed winAttemptVRF  {"beaconRound": 873767, "beaconDataB64": "hftXggo3D73RHhDfYNlmyy5nwSsNF-zzgAxoU0yqbFJyWXOJ2_t9A0XDQna6IJd3D9GgkHVzwJwMUUPEnEAjWkY8cE80LrWlIovh45PYT7Y-AZf2yG9GeqVZcsBshsR7", "electionRandB64": "NCNg7FelNzrXJYYN4XnIAdxMUQWVvDioBoui2tF1t0o", "vrfB64": "mMqdxEU4_ABOVHo2-Abkv3fjBGpeL9Ef4r1DBljT0v4HqDD6ucdRxXzrDaLr8B37F2Pd2AKrGqTpYwzCz0hyTWvgWkeEh-GhgnWt8kjPymb9DQ5L_7t1fnFKN5kLzQSX", "winCount": 0}
2021-05-22T02:41:06.048+0200  INFO  miner  miner/miner.go:436  completed mineOne  {"forRound": 777923, "baseEpoch": 777922, "lookbackEpochs": 900, "networkPowerAtLookback": "6371887650916007936", "minerPowerAtLookback": "46751934185472", "isEligible": true, "isWinner": false}
2021-05-22T02:41:11.575+0200  INFO  dt-impl  impl/events.go:137  channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: received cancel request, cleaning up channel
2021-05-22T02:41:11.575+0200  ERROR  dt_graphsync  graphsync/graphsync.go:731  failed to unregister persistence option data-transfer-12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: cannot unregister while requests are in progress
2021-05-22T02:41:11.576+0200  INFO  markets  loggers/loggers.go:20  storage provider event  {"name": "ProviderEventDataTransferCancelled", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealFailing", "message": "data transfer cancelled"}
2021-05-22T02:41:11.590+0200  WARN  providerstates  providerstates/provider_states.go:536  deal bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq failed: data transfer cancelled
2021-05-22T02:41:11.684+0200  INFO  markets  loggers/loggers.go:20  storage provider event  {"name": "ProviderEventFailed", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealError", "message": "data transfer cancelled"}
  • The Estuary logs simply show the timeout:
transfer failed: 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWKG9mQfyAAyCc3huFVMMzpKCdbgLhED6PYFhwM7q8Ab3W-1621617623978777810: timed out waiting 30s for Accept message from remote peer
  • We don't know if the Provider didn't send an Accept OR if the client failed to process it correctly after receiving it.
  • The next step here is for us to add better logging to this part of the code on both the client and the miner, turn on debug logging on both, and repeat deal making until we see this error again. There is a WIP PR for the logging at [WIP] Feat/debug accept message error go-data-transfer#210.

For Retrieval Deals

  • Please see @mgoelzer's first report of the issue at [BUG] Retrieval Error: error generated by data transfer: deal data transfer failed #6299 (comment).
  • The interesting thing here is that the client starts receiving blocks from the provider but NOT the "Accept" message. This means the provider's markets process was able to access an unsealed copy of the piece and send blocks over the Graphsync protocol on the channel opened by the miner, but the Accept response was either NOT sent by the miner in the initial Graphsync response or NOT processed correctly by the client.
ubuntu@dealbot-mainnet:~$ lotus client retrieve --miner f0215497 mAXCg5AIgcwUbnEkjzoZgSVAbcO3KXldc/i+2Q3IoGYojEnq1w4s /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 52.01 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 2.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 3.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 4.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 5.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 6.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
  • We thought increasing the timeout for receiving the Accept message would solve the issue, but we've seen instances where the Accept message simply does NOT arrive from the miner.
  • We don't have access to the miner logs here, as the Dealbot is running against miners in the wild. We've managed to reproduce it ONLY once against the Sofia miner, but the part of the code where we send/receive the Accept for retrieval isn't well logged.
  • I've managed to run successful retrievals against Piknik's miner using the same Lotus client version as the Dealbot, with the Piknik miner also running a recent tip of master. This problem, too, occurs only sporadically.
  • The code where the retrieval miner sends the Accept message as part of the Graphsync response to the client's Graphsync request is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/transport/graphsync/graphsync.go#L446.
  • The code where the retrieval client receives and processes the Accept message is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/impl/events.go#L185.
  • Please note that moving the Dealbot client to v1.9.0, which does NOT depend on Markets v1.3.0, does NOT solve the issue, suggesting that the miner is probably part of the problem and is running a Lotus version that depends on Markets v1.3.0.
  • The next step here is to deploy the PR that logs this code flow on miners we have access to and run the Dealbot against those miners, so that we have the debug logs for both the Dealbot and the miner the next time we see this problem. (A rough sketch of the kind of logging we want follows this list.)
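To make that last point concrete, this is roughly the kind of logging we want around the two code paths linked above. Purely illustrative: the package, logger name and function names are made up, and the real change lives in the WIP logging PR; only the file/line references come from the links above.

```go
package accepttrace

import (
	logging "github.com/ipfs/go-log/v2"
)

var log = logging.Logger("dt-accept-trace")

// logAcceptSent would be called on the provider side right where it attaches
// the Accept response to the outgoing Graphsync response (graphsync.go:446).
func logAcceptSent(chid string) {
	log.Debugf("channel %s: sending Accept as part of the graphsync response", chid)
}

// logAcceptReceived would be called on the client side where data-transfer
// processes an incoming Accept for a channel (events.go:185).
func logAcceptReceived(chid string) {
	log.Debugf("channel %s: received and processed Accept from remote peer", chid)
}
```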
@aarshkshah1992

cc @jacobheun @whyrusleeping @mgoelzer @dirkmc.

@aarshkshah1992

Note: we still need to confirm that this is indeed a Markets v1.3.0 / data-transfer v1.5.0 regression by seeing repeated success when running a Lotus v1.9.0 client against a Lotus v1.9.0 miner.

@jacobheun added this to the 🤝 Deal Success milestone May 27, 2021
@aarshkshah1992

aarshkshah1992 commented Jun 1, 2021

@dirkmc

I think I've identified a likely root cause for this bug:

For Storage deals:

  • The storage client starts the channel monitor, thereby starting the timer that waits for an "Accept" message, and then sends an open-push-channel request to the miner on the data-transfer protocol. The timer is only stopped if the monitor receives an "Accept" event from the data-transfer event stream for this channel.
  • The miner then opens a Graphsync request to the client and sends the "Accept" response as part of that Graphsync request, which is added to the Graphsync task queue on the storage client.
  • In a happy world, the client's Graphsync module would pop the Graphsync request sent by the miner from its task queue, synchronously pass the "Accept" message in that request on to the data-transfer client, and then start sending blocks to the miner. Note that the "Accept" message given to the data-transfer client is then published to the data-transfer event stream, from where it is later consumed asynchronously by the channel monitor, which stops the "Accept" timer. This async consumption of the data-transfer event stream for messages such as "Accept" can also cause issues such as Race between OnDataQueued and Accept event go-data-transfer#209. However, that is NOT relevant to this discussion.
  • Here comes the interesting part. The Graphsync responder on the storage client is configured to process only a fixed number of requests concurrently. So it's highly plausible that the Graphsync request sent by the miner, containing the "Accept" message, spends some time sitting in the queue.
  • Now, say the storage client's channel monitor sees a disconnect from the miner and sends a "Restart" request to the miner on the data-transfer protocol. This is when things go bad.
  • On getting the "Restart" message, the miner cancels the original Graphsync request and sends a new Graphsync request to the client. As part of the cancellation, it sends a Graphsync "Cancel" message for the original request to the Graphsync module on the storage client, which removes the original request from the list of tasks it is processing/will process. The data-transfer client will never see the "Accept" message that was part of that original request.
  • As part of the new, restarted Graphsync request, the miner ALSO sends a "Restart" response. However, a "Restart" response is NOT currently counted as an "Accept" by the client. The data transfer will now proceed between the client and the miner, but the hope of ever seeing an "Accept" is irremediably lost. We eventually hit the Accept timeout because that timer keeps ticking even across a restart. (See the sketch below.)
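Here is a simplified sketch of that failure mode, to show why the Accept timer is never satisfied once the restart happens. The types and event names are illustrative, not the real go-data-transfer ones:

```go
package restartbug

import "time"

// Event is a stand-in for the data-transfer channel events.
type Event int

const (
	EventAccept          Event = iota // the only event that stops the Accept timer
	EventRestartResponse              // what the miner sends on the restarted request
	EventDisconnect                   // what triggers the monitor to send a Restart
)

type monitor struct {
	acceptTimer *time.Timer
}

func (m *monitor) sendRestart() {
	// sends a "Restart" request to the remote peer; the miner reacts by
	// cancelling its original graphsync request and opening a new one
}

// onEvent captures the failure mode described above: only EventAccept stops
// the timer, so once the original graphsync request (which carried the Accept)
// is cancelled, the Restart response that replaces it leaves the timer running
// and the deal eventually fails with the Accept timeout.
func (m *monitor) onEvent(ev Event) {
	switch ev {
	case EventDisconnect:
		m.sendRestart()
	case EventAccept:
		m.acceptTimer.Stop() // never reached in the buggy scenario
	case EventRestartResponse:
		// bug: not counted as an Accept, so nothing stops the timer here
	}
}
```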

Here are the logs a miner sent me on #minerx, which show something like this happening on mainnet:

{"level":"info","ts":"2021-05-28T07:16:53.869Z","logger":"dt-impl","caller":"impl/events.go:19","msg":"channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755: opened"}
{"level":"info","ts":"2021-05-28T07:26:54.970Z","logger":"dt-impl","caller":"impl/events.go:242","msg":"channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755: received restart request"}
{"level":"warn","ts":"2021-05-28T07:26:55.366Z","logger":"dt_graphsync","caller":"graphsync/graphsync.go:160","msg":"channel id 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755: graphsync request context cancelled"}
{"level":"warn","ts":"2021-05-28T07:26:55.366Z","logger":"dt-impl","caller":"impl/events.go:178","msg":"channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755 has timed out: graphsync request context cancelled"}
{"level":"info","ts":"2021-05-28T07:26:55.392Z","logger":"dt-impl","caller":"impl/events.go:19","msg":"channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755: opened"}
{"level":"debug","ts":"2021-05-28T07:29:10.327Z","logger":"dt_graphsync","caller":"graphsync/graphsync.go:184","msg":"finished executing graphsync request for channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755"}
{"level":"info","ts":"2021-05-28T07:29:10.327Z","logger":"dt-impl","caller":"impl/events.go:206","msg":"channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWQKi3L61jRSYfwDMAA2xJVWxhXFUchAg15uk6fFQEK6S3-1622169858106665755: sending completion message to initiator"}

@aarshkshah1992

As an aside, we should also fix the following problems I saw in the code while debugging the above:

  • The channel monitor timers for "Accept" and "Restart response" are currently started before the corresponding messages have been sent on the libp2p stream to the remote peer. These timers should be started after the messages have been sent (sketched after this list).
  • We should look into when the disconnect event that leads to a channel monitor Restart is fired. Graphsync should fire the disconnect event only when it has lost connectedness to the remote peer entirely, NOT when a single connection drops.
  • We shouldn't time out Restarts and re-attempt them when we don't see a Restart response within a stipulated time frame. The "seeing a response" part is really hard to get right with the data-transfer protocol, as seen above, given that the response flow is roughly [remote peer sends a Graphsync request] -> [the local peer's Graphsync eventually hands it to data-transfer] -> [data-transfer publishes it to the event stream] -> [we read asynchronously from the event stream].
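For the first bullet, the fix is mostly an ordering change. Roughly something like this, with made-up names rather than the real channel monitor code:

```go
package ordering

import "time"

// Monitor is a toy stand-in for the channel monitor; the fields are made up.
type Monitor struct {
	acceptTimeout   time.Duration
	acceptTimer     *time.Timer
	sendOpenRequest func() error // writes the open-channel request to the libp2p stream
	onAcceptTimeout func()       // fails the deal if no Accept arrives in time
}

// OpenChannel writes the request to the stream first and only then arms the
// Accept timer, so time spent dialling or writing doesn't eat into the timeout.
func (m *Monitor) OpenChannel() error {
	if err := m.sendOpenRequest(); err != nil {
		return err
	}
	m.acceptTimer = time.AfterFunc(m.acceptTimeout, m.onAcceptTimeout)
	return nil
}
```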

@dirkmc

dirkmc commented Jun 1, 2021

Great work Aarsh 👍

The theory that graphsync requests for data are getting backed up on the client makes sense, and is behaviour that we've seen cause other issues in the past, so I have high confidence that is the root cause here.

@hannahhoward

hannahhoward commented Jun 1, 2021

@aarshkshah1992

Yes, this theory makes sense to me, I think, for Storage Deals.

Something to know about the simultaneous response limit:

  1. It's configurable, and I believe it's already set to 20 in Lotus on the client.
  2. The reason for the simultaneous response limit is that Graphsync was originally designed for an IPFS world -- where most requests are going to be accepted and served by default, are not authenticated, and we need to keep ourselves protected from a DoS attack. In a Filecoin world, the simultaneous response limit can and should go higher -- any non-authenticated requests are going to be rejected quickly, making DoS less of a concern.
  3. @whyrusleeping I see that in FilClient in Estuary it's left at the default (6) -- https://github.com/application-research/estuary/blob/master/filclient/filclient.go#L109 -- I would consider raising it. You can see the Lotus code that sets it here: https://github.com/filecoin-project/lotus/blob/master/node/modules/graphsync.go#L23. (A toy sketch of why a low limit causes requests to queue up follows this list.)
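To illustrate point 3, here is a toy version of a fixed-size response slot pool. It isn't Graphsync's actual implementation (the real limit is a Graphsync option, set in the Lotus file linked above); it just shows why an incoming request, and the Accept riding on it, can sit in a queue once the limit is saturated:

```go
package responder

// Request is a stand-in for an incoming graphsync request.
type Request struct{ ID int }

// Responder processes at most `limit` requests at a time, using a
// buffered-channel semaphore.
type Responder struct {
	slots chan struct{}
}

func NewResponder(limit int) *Responder {
	return &Responder{slots: make(chan struct{}, limit)}
}

// Handle blocks until a processing slot is free. With the limit saturated
// (e.g. at the default of 6), a newly arrived request (and any Accept message
// riding on it) simply waits here instead of being processed.
func (r *Responder) Handle(req Request, process func(Request)) {
	r.slots <- struct{}{} // acquire a slot; blocks once the limit is reached
	go func() {
		defer func() { <-r.slots }() // release the slot when processing is done
		process(req)
	}()
}
```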

General thoughts:

  • We could probably modify Graphsync to process the authentication logic immediately, even if it didn't start processing the response right away (because of the simultaneous response limit). That way you'd get the ACCEPT message. But I don't know if that creates a different problem if you don't start seeing the data immediately. (See the sketch after these bullets.)

  • Still not sure what's going on with retrievals. We're seeing 100% failure with dealbots on 1.50.
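Building on the first bullet, a rough sketch of what "process the authentication immediately, defer the data" could look like, again using a toy responder rather than Graphsync's real internals (all names here are made up):

```go
package earlyaccept

// Request is a stand-in for an incoming graphsync request.
type Request struct{ ID int }

// Responder keeps the fixed simultaneous-response limit for streaming data,
// but runs the cheap accept/reject check before a slot is acquired.
type Responder struct {
	slots      chan struct{}      // capacity == simultaneous response limit
	validate   func(Request) bool // authenticated accept/reject check
	sendAccept func(Request)      // sends the Accept / response header right away
	process    func(Request)      // streams the actual response data
}

// Handle accepts or rejects the request up front, so the client would see the
// Accept immediately even if the data only starts flowing once a slot frees up.
func (r *Responder) Handle(req Request) {
	if !r.validate(req) {
		return // rejected immediately: cheap, no slot needed
	}
	r.sendAccept(req)     // the client's Accept timer could stop here
	r.slots <- struct{}{} // the data transfer still waits for a free slot
	go func() {
		defer func() { <-r.slots }()
		r.process(req)
	}()
}
```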
