Websocket error when running Filet #22

Closed
davidgasquez opened this issue Jan 11, 2023 · 6 comments


@davidgasquez
Contributor

From time to time, filet jobs will get stuck in Google Cloud Batch. The lily daemon gets killed and sentinel-archiver hangs waiting for it to come back.

This is what the resource usage looks like:

[Image: resource usage of the job]

The log produced by lily reports "no route found for ::" and "websocket: close 1000 (normal)".

The issue might be related to the job running short of resources.

$ tail -f lily.log 
{"level":"info","ts":"2023-01-10T23:16:04.795Z","logger":"lily/index/processor","caller":"processor/state.go:362","msg":"processor ended","task":"miner_sector_deal","height":"2483772","reporter":"arch0109-2023-01-04","duration":84.470909876}
{"level":"debug","ts":"2023-01-10T23:16:04.797Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_event","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465847436}
{"level":"debug","ts":"2023-01-10T23:16:04.811Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_infos_v7","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465866927}
{"level":"info","ts":"2023-01-10T23:16:04.811Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_event","status":"OK","duration":84.465847436}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_infos_v7","status":"OK","duration":84.465866927}
{"level":"debug","ts":"2023-01-10T23:16:04.823Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_deal","reporter":"arch0109-2023-01-04","status":"OK","duration":84.468916001}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_deal","status":"OK","duration":84.468916001}
{"level":"debug","ts":"2023-01-10T23:16:07.606Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}
{"level":"debug","ts":"2023-01-10T23:16:11.564Z","logger":"rpc","caller":"go-jsonrpc@v0.1.8/websocket.go:624","msg":"websocket error","error":"websocket: close 1000 (normal)"}
{"level":"debug","ts":"2023-01-10T23:16:12.690Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}
@davidgasquez self-assigned this Jan 11, 2023
@davidgasquez
Contributor Author

davidgasquez commented Jan 14, 2023

I've been running jobs on a bigger machine (n2-highmem-48 with 384GB of RAM instead of 128GB).

Everything was working fine, but after 2 days some Lily daemons started to die. 😿


Same error as always: Sentinel Archiver says "failed to connect to lily api at /ip4/127.0.0.1/tcp/1234: context deadline exceeded".

That said, after manually 😅 checking lots of machines, I discovered this error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd4cbda]

goroutine 28318710 [running]:
github.com/filecoin-project/go-amt-ipld/v2.(*Node).forEachAt(0xc09bc7a510, {0x56150f0?, 0xc0001a2000?}, {0x7fe57c6d2910?, 0xc0d461e140?}, 0xc696ac6000?, 0x0, 0x0, 0xc5ef75dba8)
        /go/pkg/mod/github.com/filecoin-project/go-amt-ipld/v2@v2.1.1-0.20201006184820-924ee87a1349/amt.go:270 +0x19a
github.com/filecoin-project/go-amt-ipld/v2.(*Root).ForEach(...)
        /go/pkg/mod/github.com/filecoin-project/go-amt-ipld/v2@v2.1.1-0.20201006184820-924ee87a1349/amt.go:257
github.com/filecoin-project/specs-actors/v2/actors/util/adt.(*Array).ForEach(0xc696989848?, {0x55fbfa0?, 0xc696ac6000?}, 0xc5ef75dc08?)
        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.6/actors/util/adt/array.go:81 +0xc7
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*deadline2).ForEachPartition(0xc689bcd550, 0xd763f68d80)
        /build/lily/chain/actors/builtin/miner/v2.go:535 +0xc7
github.com/filecoin-project/lily/tasks/actorstate/miner.LoadSectorState.func1(0x4047220?, {0x5619700, 0xc689bcd550})
        /build/lily/tasks/actorstate/miner/sector_events.go:302 +0xd8
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*state2).ForEachDeadline.func1(0xc001566900?, 0xc40cd01860)
        /build/lily/chain/actors/builtin/miner/v2.go:345 +0xe2
github.com/filecoin-project/specs-actors/v2/actors/builtin/miner.(*Deadlines).ForEach(0xc08f360460?, {0x7fe64427ff48, 0xc0d461e140}, 0xc5ef75dd30)
        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.6/actors/builtin/miner/deadline_state.go:89 +0x7c
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*state2).ForEachDeadline(0xc08f360460, 0xd48934ede0)
        /build/lily/chain/actors/builtin/miner/v2.go:344 +0xd6
github.com/filecoin-project/lily/tasks/actorstate/miner.LoadSectorState({0x56150b8, 0xc6079230c0}, {0x5639c30, 0xc08f360460})
        /build/lily/tasks/actorstate/miner/sector_events.go:301 +0x222
github.com/filecoin-project/lily/tasks/actorstate/miner.DiffMinerSectorStates.func2()
        /build/lily/tasks/actorstate/miner/sector_events.go:383 +0x5f
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:72 +0xa5

@ribasushi

cc @rvagg: the above looks like the area you worked around some time back

@rvagg
Member

rvagg commented Jan 16, 2023

gee, we're digging back into history on this one

My best guess is that we're dealing with a nil AMT at that point; there's nothing obvious in AMT instantiation or even Array instantiation in specs-actors ADT. I wonder if there's a missing error check somewhere up the chain from instantiation of this stuff? Maybe we have blocks missing in a datastore: an AMT is instantiated that should have errored, the error hasn't propagated, and the ForEach is still called on a thing that's nil. 🤷

@davidgasquez
Contributor Author

Small update. Got the error again when running a walk with the vm_messages task.

@frrist
Member

frrist commented Feb 3, 2023

@davidgasquez are you sure the only task running was the vm_messages task? I wouldn't expect the vm_messages task to call the miner task method LoadSectorState during execution.

@davidgasquez
Contributor Author

@davidgasquez are you sure the only task running was the vm_messages task?

Only the vm_messages task was running. That said, there is no panic in this case, so it might have a different cause. You can see what the last lines of the log looked like when processing snapshot_576000_578882_1667002745.

[Image: last lines of the log]

Need to dig a bit more into this one though. As I mentioned, the final error is related to the websocket, but the underlying cause might be something else.

@davidgasquez removed their assignment Feb 4, 2023