Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Deal Making Issue] Retrieval blocks forever in ResponderPaused state #5901

Closed
ghost opened this issue Mar 29, 2021 · 14 comments
Closed

[Deal Making Issue] Retrieval blocks forever in ResponderPaused state #5901

ghost opened this issue Mar 29, 2021 · 14 comments
Labels
area/markets Area: Markets team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs

Comments

@ghost
Copy link

ghost commented Mar 29, 2021

Basic Information
Here I describe a reproducible retrieval failure in which a previously stored CID (verified, fast retrieval, 32 GiB) gets "stuck" during retrieval in a ResponderPaused state.

The indefinite hang appears on the client, but I've also included logs from the miner to help debug.

Describe the problem

Here's the info needed to reproduce the problem:

  • Miner: f01240
  • CID: bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546
  • The CID was previously stored in a verified/datacap deal with the --fast-retrieval flag set.
  • CID size: 32 GiB

Here's the problem as I observe it. When I try to retrieve this CID from a full node on another machine, the retrieval hangs forever at this point:

$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 0 B, Paid 0 FIL, ClientEventDealAccepted (DealStatusAccepted)

Running lotus client list-transfers gives this output during the hang:

$ lotus client list-transfers
Sending Channels



Receiving Channels

ID  Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
4   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
$ lotus-miner retrieval-deals list
12D3KooWNvDJE1PVFxEQY2kmF5Y92YgLb3pBgpkBJy65DH1TaFNo 12584  ...cy7rxsui DealStatusCompleted  2       65536     
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 108   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 120   ...skcayvvc DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 130   ...skcayvvc DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 133   ...skcayvvc DealStatusFundsNeeded 2       2755818467  
12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6 1    ...2n4oe546 DealStatusCancelled  2       1102797   Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 114   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 115   ...duzwtsw6 DealStatusCancelled  2       1038144461  Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 116   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 122   ...wdrdo246 DealStatusErrored   2       0      reading piece from sealed sector: closing
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 126   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 138   ...wdrdo246 DealStatusUnsealed   2       0       
12D3KooWSrVTGNQP7UaXJYB8r6a2bfEcHzt2gnrHrrqwu5zYK1AD 1409  ...86jZhi7Q DealStatusCompleted  2       125032704   
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 107   ...duzwtsw6 DealStatusErrored   2       6216279044  deal data transfer stalled (peer hungup)

Version

Client: lotus version 1.5.3-rc2+mainnet+git.9afb5ff94

Miner: also 1.5.3, but built from master so the version string is wrong. The build has all the merged PRs in 1.5.3.

Setup
Miner hardware unknown.

To Reproduce
Repro steps are above. This probably should be reproducible from any full node client.

Deal status

$ lotus client list-transfers -v
Sending Channels



Receiving Channels

ID  Status           Receiving From                                        Root Cid                                                        Initiated?  Transferred  Voucher
4   ResponderPaused  12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f  bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546  Y           1.052MiB     {"PayloadCID":{"/":"bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546"},"ID":3,"Selector":{"Raw":"oWFSomFsoWRub25loGI6PqFhYaFhPqFhQKA="},"PieceCID":null,"PricePerByte":"2","PaymentInterval":1048576,"PaymentIntervalIncrease":1048576,"UnsealPrice":"0"}

Lotus daemon and miner logs

Initially, right after the lotus client retrieve command was issued, we saw this in the logs:

il>} todo:[]} {allocated:{memUsedMin:0 memUsedMax:0 gpuUsed:false cpuUse:0 cond:<nil>} todo:[]}]
2021-03-29T18:08:02.202+0200	DEBUG	advmgr	sector-storage/sched.go:450	SCHED Acceptable win: [[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []]
2021-03-29T18:08:06.467+0200	INFO	miner	miner/miner.go:383	Time delta between now and our mining base: 6s (nulls: 0)
2021-03-29T18:08:27.971+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventUnsealComplete", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusUnsealed", "message": ""}
2021-03-29T18:08:27.971+0200	INFO	dt-impl	impl/impl.go:370	resume channel 12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-4
2021-03-29T18:08:27.972+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventBlockSent", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusOngoing", "message": ""}
2021-03-29T18:08:27.976+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventBlockSent", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusOngoing", "message": ""}
2021-03-29T18:08:27.980+0200	INFO	markets	loggers/loggers.go:30	retrieval provider event	{"name": "ProviderEventPaymentRequested", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusFundsNeeded", "message": ""}
2021-03-29T18:08:31.138+0200	WARN	storagemarket_impl	impl/provider.go:517	failed to write deal status response: stream reset
2021-03-29T18:08:36.273+0200	INFO	miner	miner/miner.go:383	Time delta between now and our mining base: 6s (nulls: 0)
2021-03-29T18:09:01.598+0200	DEBUG	advmgr	sector-storage/sched.go:354	SCHED 35 queued; 40 open windows

Here is the full log spanning the entire time period in question.

lotus-miner.log.zip

Code modifications

No source code modifications.

@ghost ghost added area/markets Area: Markets hint/needs-triaging labels Mar 29, 2021
@ghost
Copy link
Author

ghost commented Mar 29, 2021

cc @dirkmc @pooja

@dirkmc
Copy link
Contributor

dirkmc commented Mar 30, 2021

Thanks for the detailed bug report @mgoelzer ❤️

This log line suggests that there was a connection error between client and provider:

2021-03-29T18:08:31.138+0200	WARN	storagemarket_impl	impl/provider.go:517	failed to write deal status response: stream reset

We are doing some work at the moment to improve connectivity issues, basically the client should try to reconnect to the provider when there's a connectivity problem.

This work has landed in release v1.2.3 of go-fil-markets for storage, and is in progress for retrieval.

@dirkmc dirkmc added the team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs label Mar 30, 2021
@dirkmc
Copy link
Contributor

dirkmc commented Mar 30, 2021

Getting stuck in ResponderPaused may also be a symptom of this underlying issue: filecoin-project/go-data-transfer#184

@dirkmc
Copy link
Contributor

dirkmc commented Mar 30, 2021

@mgoelzer points out this is reproducible so it's unlikely to be caused by intermittent connection issues

@dirkmc
Copy link
Contributor

dirkmc commented Mar 31, 2021

@mgoelzer One possibility is that your client is getting stuck trying to create a payment channel.

Could you check for stuck messages in your local mpool:

./lotus mpool pending --local

Could you also run the following to increase the logging on your client:

lotus log set-level --system dt-impl debug
lotus log set-level --system dt_graphsync debug
lotus log set-level --system markets debug
lotus log set-level --system data_transfer_network debug

Try both of these, depending on your version one of them should work:

lotus log set-level --system dt-pushchanmon debug
lotus log set-level --system dt-chanmon debug

@dirkmc
Copy link
Contributor

dirkmc commented Mar 31, 2021

@mgoelzer I was able to retrieve the deal successfully from my client. I'm running the staging/minerx branch so that may have a fix in it that you don't have on your client?

@ghost
Copy link
Author

ghost commented Mar 31, 2021

Ok, some new testing results onthis.

First tried building the tip of master (version tag lotus version 1.5.3+mainnet+git.358773e2b): same result as original issue description

Then tried building minerx/staging (also version tag lotus version 1.5.3+mainnet+git.358773e2b). But I got this error:

$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 2, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal

Doing list-transfers shows a bunch of stalled retrievals with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f:

$ lotus client list-transfers
Sending Channels



Receiving Channels

ID  Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
3   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
4   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
5   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
6   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}

Next step: blow away my .lotus directory and try again with the minerx branch...

Unless there is a way to kill a phantom transfer while Lotus thinks it's in progress?

@ghost
Copy link
Author

ghost commented Mar 31, 2021

existing deal must be cancelled before starting a new retrieval deal implies there is actually some way to cancel in-progress retrievals. But I can't find anything in the CLI help that looks like it would do this.

@dirkmc
Copy link
Contributor

dirkmc commented Apr 1, 2021

@mgoelzer does lotus client cancel-retrieval --deal-id=2 fix the problem?

@ghost
Copy link
Author

ghost commented Apr 2, 2021

@dirkmc Yes, your cancel-transfer command let me cancel all my transfers. I then repeated but hit the same problem.

But now I can't cancel the transfers anymore either! Before they had incremental integer ids like 1, 2, 3, 4, etc. Now it is a huge number and I get ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid when trying to cancel.

I'm using the tip of master (lotus version 1.7.0-dev+mainnet+git.cf4128fc7). The reason I am trying to cancel the transfer is to downgrade to minerx/staging, which I think is based on 1.5.3. I'll try the downgrade anyway...

mwg@threadripper:~$ lotus client list-transfers
Sending Channels



Receiving Channels

ID                   Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
1617328731745316884  ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
mwg@threadripper:~$ lotus client cancel-retrieval --deal-id=1617328731745316884
ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid

mwg@threadripper:~$ lotus client list-transfers
Sending Channels



Receiving Channels

ID                   Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
1617328731745316884  ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}

Do you think id values like 1617328731745316884 is a potential separate bug in 1.7.x? Or an intentional change?

@ghost
Copy link
Author

ghost commented Apr 2, 2021

Tested with the staging/minerx branch. Even weirder result now:

mwg@threadripper:~/lotus$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 6, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal

mwg@threadripper:~/lotus$ lotus client list-transfers
Sending Channels



Receiving Channels

I tried a lotus client cancel-retrieval --deal-id=6, but still the same result as above.

@ghost
Copy link
Author

ghost commented Apr 2, 2021

@dirkmc I think we should consider closing this issue. If you were able to successfully retrieve bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 from f01240, and there's no other bug report similar to this one, then the most likely explanation is that I'm in a corrupted state of some sort. That would probably be because I keep jumping back and forth between master and minerx/staging and 1.5.3, which isn't that likely in the wild.

@ghost
Copy link
Author

ghost commented Apr 2, 2021

I did open a bug for that bigint id thing in 1.7.0: #5938

@dirkmc
Copy link
Contributor

dirkmc commented Apr 2, 2021

@mgoelzer minerx/staging has a migration that isn't in v1.5.3, so switching between them is probably going to mess up the state. I am going to go ahead and close this ticket.

The transfer ID format changed in the last release - instead of using a number that is stored in state and increments, we're now using a number based on the current time. This is to help avoid problems when people remove all their state and try to make deals with the same provider. Details here: filecoin-project/go-data-transfer#169

In the next markets release there will be a similar change for the deal ID. This is not in any lotus branch yet, so it's safest not to wipe state at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/markets Area: Markets team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs
Projects
None yet
Development

No branches or pull requests

2 participants