Fetching cache layers times out, resets sector back to PC1 #3563

RobQuistNL · 2020-09-04T21:40:22Z

Describe the bug
Since I now have a remote PC2 worker, I've added SFP+ cards to all my machines and connected them. Ever since I did so, I have done 1 succesful PC2, and 3 failed ones which have reset the PC2 state back to PC1.

To Reproduce
Connect lotus-miner to a worker which does PC2

Expected behavior
The fetching completes normally.
On a failure, it retries, instead of deleting the cache layers and re-starting PC1.

Miner sector log

0.	2020-09-04 10:52:41 +0000 UTC:	[event;sealing.SectorStartCC]	{"User":{"ID":105,"SectorType":3,"Pieces":[{"Piece":{"Size":34359738368,"PieceCID":{"/":"baga6ea4seaqa....4ynr3d2y6x2mpq"}},"DealInfo":null}]}}
1.	2020-09-04 10:52:41 +0000 UTC:	[event;sealing.SectorPacked]	{"User":{"FillerPieces":null}}
2.	2020-09-04 11:11:44 +0000 UTC:	[event;sealing.SectorRestart]	{"User":{}}
3.	2020-09-04 16:18:43 +0000 UTC:	[event;sealing.SectorRestart]	{"User":{}}
4.	2020-09-04 20:53:38 +0000 UTC:	[event;sealing.SectorPreCommit1]	{"User":{"PreCommit1Out":"eY.......0=","TicketValue":"ui3it....mN/k=","TicketEpoch":30097}}
5.	2020-09-04 21:28:04 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 105} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-105] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-105 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-105: read tcp 192.168.2.6:41614->192.168.2.100:2345: read: connection timed out

6.	2020-09-04 21:29:04 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit2]	{"User":{}}
7.	2020-09-04 21:29:04 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: allocate local sector for fetching: couldn't find a suitable path for a sector
8.	2020-09-04 21:30:04 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit1]	{"User":{}}

Worker logs

2020-09-04T21:28:04.235Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 105} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-105] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-105 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-105: read tcp 192.168.2.6:41614->192.168.2.100:2345: read: connection timed out

2020-09-04T21:28:04.237Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-04T21:29:04.024Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402

Version 0.5.10

I'm probably going to have to reset it all back to an 1gbit connection, because slow transfers that succeed are better than fast transfers that throw away 4 hours of PC1 work :(

The text was updated successfully, but these errors were encountered:

Eliovp · 2020-09-04T21:54:43Z

I feel your pain Rob
Been in the same situation a few times as well.

PC1 completely finished, over to Precommit2, which fails due to the transfer issue.
The moment you reconnect that worker, poof, delete the entire sector, and starts PC1 all over again.

Not sure who or why wrote it like that but it's ridiculous.

RobQuistNL · 2020-09-04T22:03:52Z

And thats the 4th failure;

7.	2020-09-04 22:01:26 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 106} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-106] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-106 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer3/.lotusworker/cache/fetching/s-t02388-106: read tcp 192.168.2.8:53430->192.168.2.100:2345: read: connection timed out

RobQuistNL · 2020-09-05T00:06:59Z

Here you can clearly see something glitched up during transfer, yet the other ones were solid:

~/.lotusworker/cache/fetching/s-t02388-106$ ls -lah
total 188G
drwxr-xr-x 2 sealer3 sealer3 4.0K Sep  4 21:55 .
drwxr-xr-x 3 sealer3 sealer3 4.0K Sep  4 21:51 ..
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:51 sc-02-data-layer-1.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:52 sc-02-data-layer-10.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:53 sc-02-data-layer-11.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:54 sc-02-data-layer-2.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:55 sc-02-data-layer-3.dat
-rw-rw-r-- 1 sealer3 sealer3  28G Sep  4 22:01 sc-02-data-layer-4.dat

RobQuistNL · 2020-09-06T10:14:06Z

And again:

2020-09-06T09:14:14.557Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/cache/s-t02388-122 -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-122
2020-09-06T09:26:08.780Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 122} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-122] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-122 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-122: read tcp 192.168.2.6:47210->192.168.2.100:2345: read: connection timed out


2020-09-06T09:27:08.015Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-06T09:28:42.239Z	INFO	stores	stores/http_handler.go:120	SERVE DELETE /remote/sealed/s-t02388-122

RobQuistNL · 2020-09-08T08:23:28Z

This still happens in 0.6.0:

SectorID:	142
Status:		PreCommit1
CIDcommD:	<nil>
CIDcommR:	<nil>
Ticket:		94f9aecd74dc039e6ed2cfa75ab39e8c3ff9f5ac3a10bcf3ca287a89a4874014
TicketH:	40023
Seed:		
SeedH:		0
Precommit:	<nil>
Commit:		<nil>
Proof:		
Deals:		[91954 91952 91953 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Retries:	0
--------
Event Log:
0.	2020-09-08 00:00:04 +0000 UTC:	[event;sealing.SectorStart]	{"User":{"ID":142,"SectorType":3}}
1.	2020-09-08 00:00:15 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":134217728,"PieceCID":{"/":"baga6ea4seaql4fhg5ippl5d67q7j3alcz7anrmohjj4w5hwbh5wzahtq27334pi"}},"DealInfo":{"PublishCid":{"/":"bafy2bzaceaifsnwzghj6nu5b773v7yc5mumwnc43vkdlu5qmpfqzj7fcoodlo"},"DealID":91954,"DealSchedule":{"StartEpoch":45599,"EndEpoch":746809},"KeepUnsealed":true}}}}
2.	2020-09-08 00:00:15 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":131072,"PieceCID":{"/":"baga6ea4seaqfqrfrgeubeng46sklp7plmapmoh6iqm3o5wpnp5ovteexjiovohq"}},"DealInfo":{"PublishCid":{"/":"bafy2bzacebsbvh2yrej7nae6qt2djxwnuehwq2vbjgvzbjxejhvaskrft6bpy"},"DealID":91952,"DealSchedule":{"StartEpoch":45212,"EndEpoch":746809},"KeepUnsealed":true}}}}
3.	2020-09-08 00:00:16 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":131072,"PieceCID":{"/":"baga6ea4seaqi5o3z474mkgivaegbj7xiookcw6qu6r4fmxswy6yxyvkod4chkma"}},"DealInfo":{"PublishCid":{"/":"bafy2bzacedbej2vfunh4lroxgqr3x23uwumaj7cwob4r6qripqk55qli4cu5m"},"DealID":91953,"DealSchedule":{"StartEpoch":45412,"EndEpoch":746809},"KeepUnsealed":true}}}}
4.	2020-09-08 03:00:04 +0000 UTC:	[event;sealing.SectorStartPacking]	{"User":{}}
5.	2020-09-08 03:08:21 +0000 UTC:	[event;sealing.SectorPacked]	{"User":{"FillerPieces":[{"Size":262144,"PieceCID":{"/":"baga6ea4seaqhklmwsp5bm5jehfkhnyyxvgcyb4aji6x3piyfidlclkjjdtasuby"}},{"Size":524288,"PieceCID":{"/":"baga6ea4seaqhaixwb57pnlp2c4ixuutbtyym5kbmnadvvxy4mz3yn3cqn3xs2gi"}},{"Size":1048576,"PieceCID":{"/":"baga6ea4seaqntgehxfzvoouw4ejzgzcsg3axwh2moa2noi6hvgpxbg5u3jqrmky"}},{"Size":2097152,"PieceCID":{"/":"baga6ea4seaqnbnjq3oylj4s4luxsukg752aiwu2bfibjghyyysm7lisubbvrgjq"}},{"Size":4194304,"PieceCID":{"/":"baga6ea4seaqijqccdoqgqwqbx54vui2eazh6ijf5kku5eq3xwokp6tclivuoqei"}},{"Size":8388608,"PieceCID":{"/":"baga6ea4seaqgl4u6lwmnerwdrm4iz7ag3mpwwaqtapc2fciabpooqmvjypweeha"}},{"Size":16777216,"PieceCID":{"/":"baga6ea4seaqkejdvbaufquewln7dgszre6ymaqvr2bdnyvcaee3we7gypgoocoq"}},{"Size":33554432,"PieceCID":{"/":"baga6ea4seaqnv7nlnwutmrctyjwtg4tlt7x6gq56r6awjhwabgvnh6x7kbqxkca"}},{"Size":67108864,"PieceCID":{"/":"baga6ea4seaqnsqov4dldcsuzlqz77pkpxzurddlt2ts72lgtd4hxzbxl3ukoobq"}},{"Size":268435456,"PieceCID":{"/":"baga6ea4seaqk2bufhfu5g7ju74eobh2wsmfevum2rhppmdf75z7b2m4byhtryny"}},{"Size":536870912,"PieceCID":{"/":"baga6ea4seaqdsvqopmj2soyhujb72jza76t4wpq5fzifvm3ctz47iyytkewnubq"}},{"Size":1073741824,"PieceCID":{"/":"baga6ea4seaqmzq6acl23axubdiv37xipnaz3qqtvwr57ekoaauviescphqnfwpi"}},{"Size":2147483648,"PieceCID":{"/":"baga6ea4seaqh34u3nf3tdgpi6k2aw54rtucikcpo25uofrzjpmprinydj7b4mla"}},{"Size":4294967296,"PieceCID":{"/":"baga6ea4seaqgntqfunthkuwpixacxtcoqojjdg66vq254l7vmjyyjdu7pntvcby"}},{"Size":8589934592,"PieceCID":{"/":"baga6ea4seaqnqyicdbbfvnpjlmokmi45fgroiigxa2uw6nz6f6ojveoxlhizwai"}},{"Size":17179869184,"PieceCID":{"/":"baga6ea4seaqg2nsld34emra2ljfgrbrdcswmbjdpaftrpzjuipudt3w7qpbikpa"}}]}}
6.	2020-09-08 06:59:12 +0000 UTC:	[event;sealing.SectorPreCommit1]	{"User":{"PreCommit1Out":"eyJy(....)QBQ=","TicketEpoch":40023}}
7.	2020-09-08 07:08:41 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 142} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-142] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-142 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142: read tcp 192.168.2.6:47566->192.168.2.100:2345: read: connection timed out


8.	2020-09-08 07:09:41 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit2]	{"User":{}}
9.	2020-09-08 07:09:41 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]{"User":{}}
	seal pre commit(2) failed: allocate local sector for fetching: couldn't find a suitable path for a sector
10.	2020-09-08 07:10:41 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit1]	{"User":{}}

Worker log:

2020-09-08T06:59:12.975Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/sealed/s-t02388-142 -> /home/sealer2/.lotusworker/sealed/fetching/s-t02388-142
2020-09-08T06:59:54.635Z	DEBUG	stores	stores/util_unix.go:28	move sector data	{"from": "/home/sealer2/.lotusworker/sealed/fetching/s-t02388-142", "to": "/home/sealer2/.lotusworker/sealed/s-t02388-142"}
2020-09-08T06:59:54.637Z	INFO	stores	stores/remote.go:322	Delete http://192.168.2.100:2345/remote/sealed/s-t02388-142
2020-09-08T06:59:54.638Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/cache/s-t02388-142 -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142
2020-09-08T07:08:41.484Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 142} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-142] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-142 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142: read tcp 192.168.2.6:47566->192.168.2.100:2345: read: connection timed out


2020-09-08T07:09:41.013Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-08T07:11:13.935Z	INFO	stores	stores/http_handler.go:120	SERVE DELETE /remote/sealed/s-t02388-142
2020-09-08T07:11:13.936Z	INFO	stores	stores/local.go:514	remove /home/sealer2/.lotusworker/sealed/s-t02388-142

RobQuistNL · 2020-09-15T18:00:20Z

The retries don't really work in this case; see my comment here: #3720

shotcollin · 2021-09-19T18:33:26Z

I'm seeing this same thing, also after upgrading my network. Is the solution to this year-old bug really to downgrade the network? Transferring cache files between workers takes hours at 1Gbps.

RobQuistNL · 2021-09-23T15:35:54Z

transferring 400GB of data at 1Gbps (125MB/s) is supposed to take about an hour

shotcollin · 2021-09-23T18:16:38Z

Yes, I have seen cache transfer take around an hour when there's only one--but another 5-10 min each for the 32G sealed and unsealed sector files.

What I was probably thinking of is something I recently started seeing, which is that sometimes jobs end up being shuttled among workers. If multiple sectors start sealing around the same time, when one inevitably fails (I haven't yet been able to identify why), it's taken up by a different worker. And if more than one fails, there can be multiple simultaneous transfers from or to a single worker sharing the 1Gbps connection, making them each take 2+ hours.

Reiers · 2022-04-18T17:27:30Z

Hi @RobQuistNL

Thanks for the report. This issue is a bit outdated and with the new scheduler improvements in the upcoming release - I think this issue will be resolved completely.

Closing ticket for now - should it still be a issue, please let me know and will re-open the ticket and triage it again.

Thank you !

RobQuistNL mentioned this issue Sep 10, 2020

Increase the number of times precommit2 is attempted before moving back to precommit1 #3720

Merged

Reiers closed this as completed Apr 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fetching cache layers times out, resets sector back to PC1 #3563

Fetching cache layers times out, resets sector back to PC1 #3563

RobQuistNL commented Sep 4, 2020

Eliovp commented Sep 4, 2020

RobQuistNL commented Sep 4, 2020

RobQuistNL commented Sep 5, 2020

RobQuistNL commented Sep 6, 2020

RobQuistNL commented Sep 8, 2020 •

edited

Loading

RobQuistNL commented Sep 15, 2020

shotcollin commented Sep 19, 2021

RobQuistNL commented Sep 23, 2021

shotcollin commented Sep 23, 2021

Reiers commented Apr 18, 2022

Fetching cache layers times out, resets sector back to PC1 #3563

Fetching cache layers times out, resets sector back to PC1 #3563

Comments

RobQuistNL commented Sep 4, 2020

Eliovp commented Sep 4, 2020

RobQuistNL commented Sep 4, 2020

RobQuistNL commented Sep 5, 2020

RobQuistNL commented Sep 6, 2020

RobQuistNL commented Sep 8, 2020 • edited Loading

RobQuistNL commented Sep 15, 2020

shotcollin commented Sep 19, 2021

RobQuistNL commented Sep 23, 2021

shotcollin commented Sep 23, 2021

Reiers commented Apr 18, 2022

RobQuistNL commented Sep 8, 2020 •

edited

Loading