Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetching cache layers times out, resets sector back to PC1 #3563

Closed
RobQuistNL opened this issue Sep 4, 2020 · 10 comments
Closed

Fetching cache layers times out, resets sector back to PC1 #3563

RobQuistNL opened this issue Sep 4, 2020 · 10 comments

Comments

@RobQuistNL
Copy link
Contributor

Describe the bug
Since I now have a remote PC2 worker, I've added SFP+ cards to all my machines and connected them. Ever since I did so, I have done 1 succesful PC2, and 3 failed ones which have reset the PC2 state back to PC1.

To Reproduce
Connect lotus-miner to a worker which does PC2

Expected behavior
The fetching completes normally.
On a failure, it retries, instead of deleting the cache layers and re-starting PC1.

Miner sector log

0.	2020-09-04 10:52:41 +0000 UTC:	[event;sealing.SectorStartCC]	{"User":{"ID":105,"SectorType":3,"Pieces":[{"Piece":{"Size":34359738368,"PieceCID":{"/":"baga6ea4seaqa....4ynr3d2y6x2mpq"}},"DealInfo":null}]}}
1.	2020-09-04 10:52:41 +0000 UTC:	[event;sealing.SectorPacked]	{"User":{"FillerPieces":null}}
2.	2020-09-04 11:11:44 +0000 UTC:	[event;sealing.SectorRestart]	{"User":{}}
3.	2020-09-04 16:18:43 +0000 UTC:	[event;sealing.SectorRestart]	{"User":{}}
4.	2020-09-04 20:53:38 +0000 UTC:	[event;sealing.SectorPreCommit1]	{"User":{"PreCommit1Out":"eY.......0=","TicketValue":"ui3it....mN/k=","TicketEpoch":30097}}
5.	2020-09-04 21:28:04 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 105} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-105] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-105 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-105: read tcp 192.168.2.6:41614->192.168.2.100:2345: read: connection timed out

6.	2020-09-04 21:29:04 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit2]	{"User":{}}
7.	2020-09-04 21:29:04 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: allocate local sector for fetching: couldn't find a suitable path for a sector
8.	2020-09-04 21:30:04 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit1]	{"User":{}}

Worker logs

2020-09-04T21:28:04.235Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 105} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-105] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-105 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-105: read tcp 192.168.2.6:41614->192.168.2.100:2345: read: connection timed out

2020-09-04T21:28:04.237Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-04T21:29:04.024Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402

Version 0.5.10

I'm probably going to have to reset it all back to an 1gbit connection, because slow transfers that succeed are better than fast transfers that throw away 4 hours of PC1 work :(

@Eliovp
Copy link

Eliovp commented Sep 4, 2020

I feel your pain Rob
Been in the same situation a few times as well.

PC1 completely finished, over to Precommit2, which fails due to the transfer issue.
The moment you reconnect that worker, poof, delete the entire sector, and starts PC1 all over again.

Not sure who or why wrote it like that but it's ridiculous.

@RobQuistNL
Copy link
Contributor Author

And thats the 4th failure;

7.	2020-09-04 22:01:26 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]	{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 106} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-106] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-106 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer3/.lotusworker/cache/fetching/s-t02388-106: read tcp 192.168.2.8:53430->192.168.2.100:2345: read: connection timed out

@RobQuistNL
Copy link
Contributor Author

Here you can clearly see something glitched up during transfer, yet the other ones were solid:

~/.lotusworker/cache/fetching/s-t02388-106$ ls -lah
total 188G
drwxr-xr-x 2 sealer3 sealer3 4.0K Sep  4 21:55 .
drwxr-xr-x 3 sealer3 sealer3 4.0K Sep  4 21:51 ..
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:51 sc-02-data-layer-1.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:52 sc-02-data-layer-10.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:53 sc-02-data-layer-11.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:54 sc-02-data-layer-2.dat
-rw-rw-r-- 1 sealer3 sealer3  32G Sep  4 21:55 sc-02-data-layer-3.dat
-rw-rw-r-- 1 sealer3 sealer3  28G Sep  4 22:01 sc-02-data-layer-4.dat

@RobQuistNL
Copy link
Contributor Author

And again:

2020-09-06T09:14:14.557Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/cache/s-t02388-122 -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-122
2020-09-06T09:26:08.780Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 122} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-122] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-122 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-122: read tcp 192.168.2.6:47210->192.168.2.100:2345: read: connection timed out


2020-09-06T09:27:08.015Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-06T09:28:42.239Z	INFO	stores	stores/http_handler.go:120	SERVE DELETE /remote/sealed/s-t02388-122

@RobQuistNL
Copy link
Contributor Author

RobQuistNL commented Sep 8, 2020

This still happens in 0.6.0:

SectorID:	142
Status:		PreCommit1
CIDcommD:	<nil>
CIDcommR:	<nil>
Ticket:		94f9aecd74dc039e6ed2cfa75ab39e8c3ff9f5ac3a10bcf3ca287a89a4874014
TicketH:	40023
Seed:		
SeedH:		0
Precommit:	<nil>
Commit:		<nil>
Proof:		
Deals:		[91954 91952 91953 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Retries:	0
--------
Event Log:
0.	2020-09-08 00:00:04 +0000 UTC:	[event;sealing.SectorStart]	{"User":{"ID":142,"SectorType":3}}
1.	2020-09-08 00:00:15 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":134217728,"PieceCID":{"/":"baga6ea4seaql4fhg5ippl5d67q7j3alcz7anrmohjj4w5hwbh5wzahtq27334pi"}},"DealInfo":{"PublishCid":{"/":"bafy2bzaceaifsnwzghj6nu5b773v7yc5mumwnc43vkdlu5qmpfqzj7fcoodlo"},"DealID":91954,"DealSchedule":{"StartEpoch":45599,"EndEpoch":746809},"KeepUnsealed":true}}}}
2.	2020-09-08 00:00:15 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":131072,"PieceCID":{"/":"baga6ea4seaqfqrfrgeubeng46sklp7plmapmoh6iqm3o5wpnp5ovteexjiovohq"}},"DealInfo":{"PublishCid":{"/":"bafy2bzacebsbvh2yrej7nae6qt2djxwnuehwq2vbjgvzbjxejhvaskrft6bpy"},"DealID":91952,"DealSchedule":{"StartEpoch":45212,"EndEpoch":746809},"KeepUnsealed":true}}}}
3.	2020-09-08 00:00:16 +0000 UTC:	[event;sealing.SectorAddPiece]	{"User":{"NewPiece":{"Piece":{"Size":131072,"PieceCID":{"/":"baga6ea4seaqi5o3z474mkgivaegbj7xiookcw6qu6r4fmxswy6yxyvkod4chkma"}},"DealInfo":{"PublishCid":{"/":"bafy2bzacedbej2vfunh4lroxgqr3x23uwumaj7cwob4r6qripqk55qli4cu5m"},"DealID":91953,"DealSchedule":{"StartEpoch":45412,"EndEpoch":746809},"KeepUnsealed":true}}}}
4.	2020-09-08 03:00:04 +0000 UTC:	[event;sealing.SectorStartPacking]	{"User":{}}
5.	2020-09-08 03:08:21 +0000 UTC:	[event;sealing.SectorPacked]	{"User":{"FillerPieces":[{"Size":262144,"PieceCID":{"/":"baga6ea4seaqhklmwsp5bm5jehfkhnyyxvgcyb4aji6x3piyfidlclkjjdtasuby"}},{"Size":524288,"PieceCID":{"/":"baga6ea4seaqhaixwb57pnlp2c4ixuutbtyym5kbmnadvvxy4mz3yn3cqn3xs2gi"}},{"Size":1048576,"PieceCID":{"/":"baga6ea4seaqntgehxfzvoouw4ejzgzcsg3axwh2moa2noi6hvgpxbg5u3jqrmky"}},{"Size":2097152,"PieceCID":{"/":"baga6ea4seaqnbnjq3oylj4s4luxsukg752aiwu2bfibjghyyysm7lisubbvrgjq"}},{"Size":4194304,"PieceCID":{"/":"baga6ea4seaqijqccdoqgqwqbx54vui2eazh6ijf5kku5eq3xwokp6tclivuoqei"}},{"Size":8388608,"PieceCID":{"/":"baga6ea4seaqgl4u6lwmnerwdrm4iz7ag3mpwwaqtapc2fciabpooqmvjypweeha"}},{"Size":16777216,"PieceCID":{"/":"baga6ea4seaqkejdvbaufquewln7dgszre6ymaqvr2bdnyvcaee3we7gypgoocoq"}},{"Size":33554432,"PieceCID":{"/":"baga6ea4seaqnv7nlnwutmrctyjwtg4tlt7x6gq56r6awjhwabgvnh6x7kbqxkca"}},{"Size":67108864,"PieceCID":{"/":"baga6ea4seaqnsqov4dldcsuzlqz77pkpxzurddlt2ts72lgtd4hxzbxl3ukoobq"}},{"Size":268435456,"PieceCID":{"/":"baga6ea4seaqk2bufhfu5g7ju74eobh2wsmfevum2rhppmdf75z7b2m4byhtryny"}},{"Size":536870912,"PieceCID":{"/":"baga6ea4seaqdsvqopmj2soyhujb72jza76t4wpq5fzifvm3ctz47iyytkewnubq"}},{"Size":1073741824,"PieceCID":{"/":"baga6ea4seaqmzq6acl23axubdiv37xipnaz3qqtvwr57ekoaauviescphqnfwpi"}},{"Size":2147483648,"PieceCID":{"/":"baga6ea4seaqh34u3nf3tdgpi6k2aw54rtucikcpo25uofrzjpmprinydj7b4mla"}},{"Size":4294967296,"PieceCID":{"/":"baga6ea4seaqgntqfunthkuwpixacxtcoqojjdg66vq254l7vmjyyjdu7pntvcby"}},{"Size":8589934592,"PieceCID":{"/":"baga6ea4seaqnqyicdbbfvnpjlmokmi45fgroiigxa2uw6nz6f6ojveoxlhizwai"}},{"Size":17179869184,"PieceCID":{"/":"baga6ea4seaqg2nsld34emra2ljfgrbrdcswmbjdpaftrpzjuipudt3w7qpbikpa"}}]}}
6.	2020-09-08 06:59:12 +0000 UTC:	[event;sealing.SectorPreCommit1]	{"User":{"PreCommit1Out":"eyJy(....)QBQ=","TicketEpoch":40023}}
7.	2020-09-08 07:08:41 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]{"User":{}}
	seal pre commit(2) failed: failed to acquire sector {2388 142} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-142] 8 true false false}]): 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-142 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142: read tcp 192.168.2.6:47566->192.168.2.100:2345: read: connection timed out


8.	2020-09-08 07:09:41 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit2]	{"User":{}}
9.	2020-09-08 07:09:41 +0000 UTC:	[event;sealing.SectorSealPreCommit2Failed]{"User":{}}
	seal pre commit(2) failed: allocate local sector for fetching: couldn't find a suitable path for a sector
10.	2020-09-08 07:10:41 +0000 UTC:	[event;sealing.SectorRetrySealPreCommit1]	{"User":{}}

Worker log:

2020-09-08T06:59:12.975Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/sealed/s-t02388-142 -> /home/sealer2/.lotusworker/sealed/fetching/s-t02388-142
2020-09-08T06:59:54.635Z	DEBUG	stores	stores/util_unix.go:28	move sector data	{"from": "/home/sealer2/.lotusworker/sealed/fetching/s-t02388-142", "to": "/home/sealer2/.lotusworker/sealed/s-t02388-142"}
2020-09-08T06:59:54.637Z	INFO	stores	stores/remote.go:322	Delete http://192.168.2.100:2345/remote/sealed/s-t02388-142
2020-09-08T06:59:54.638Z	INFO	stores	stores/remote.go:221	Fetch http://192.168.2.100:2345/remote/cache/s-t02388-142 -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142
2020-09-08T07:08:41.484Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': failed to acquire sector {2388 142} from remote (tried [{2c2954ff-2daa-42b2-926a-3294c7f3f485 [http://192.168.2.100:2345/remote/cache/s-t02388-142] 8 true false false}]):
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).acquireFromRemote
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:217
  - 1 error occurred:
	* fetch error http://192.168.2.100:2345/remote/cache/s-t02388-142 (storage 2c2954ff-2daa-42b2-926a-3294c7f3f485) -> /home/sealer2/.lotusworker/cache/fetching/s-t02388-142: read tcp 192.168.2.6:47566->192.168.2.100:2345: read: connection timed out


2020-09-08T07:09:41.013Z	WARN	rpc	go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241	error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/sealer2/lotus/extern/sector-storage/stores/local.go:402
2020-09-08T07:11:13.935Z	INFO	stores	stores/http_handler.go:120	SERVE DELETE /remote/sealed/s-t02388-142
2020-09-08T07:11:13.936Z	INFO	stores	stores/local.go:514	remove /home/sealer2/.lotusworker/sealed/s-t02388-142

@RobQuistNL
Copy link
Contributor Author

The retries don't really work in this case; see my comment here: #3720

@shotcollin
Copy link
Contributor

I'm seeing this same thing, also after upgrading my network. Is the solution to this year-old bug really to downgrade the network? Transferring cache files between workers takes hours at 1Gbps.

@RobQuistNL
Copy link
Contributor Author

transferring 400GB of data at 1Gbps (125MB/s) is supposed to take about an hour

@shotcollin
Copy link
Contributor

Yes, I have seen cache transfer take around an hour when there's only one--but another 5-10 min each for the 32G sealed and unsealed sector files.

What I was probably thinking of is something I recently started seeing, which is that sometimes jobs end up being shuttled among workers. If multiple sectors start sealing around the same time, when one inevitably fails (I haven't yet been able to identify why), it's taken up by a different worker. And if more than one fails, there can be multiple simultaneous transfers from or to a single worker sharing the 1Gbps connection, making them each take 2+ hours.

@Reiers
Copy link

Reiers commented Apr 18, 2022

Hi @RobQuistNL

Thanks for the report. This issue is a bit outdated and with the new scheduler improvements in the upcoming release - I think this issue will be resolved completely.

Closing ticket for now - should it still be a issue, please let me know and will re-open the ticket and triage it again.

Thank you !

@Reiers Reiers closed this as completed Apr 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants