Running out of disk space causes blockchain disk corruption #9292

Closed
8 of 18 tasks
ArseniiPetrovich opened this issue Sep 12, 2022 · 3 comments
Labels
area/chain Area: Chain need/author-input Hint: Needs Author Input

Comments

@ArseniiPetrovich
Contributor

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

lotus version 1.17.2-dev+calibnet+git.29fff4f

Describe the Bug

We recently ran out of disk space on one of our archival Lotus nodes on the calibration network (calibnet). The node was running 1.16.0, and after we restarted it, the daemon failed with the following error:

2022-09-12T16:56:48.469Z	WARN	modules	modules/chain.go:89	loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

I tried upgrading to 1.17 as suggested in #8916, but it didn't help. Is there any way to recover from this state?
Thank you!

Logging Information

2022-09-12T16:56:46.115Z	INFO	badger	v2@v2.2007.3/levels.go:183	All 0 tables opened in 0s
2022-09-12T16:56:46.116Z	INFO	badger	v2@v2.2007.3/value.go:1158	Replaying file id: 0 at offset: 0
2022-09-12T16:56:46.116Z	INFO	badger	v2@v2.2007.3/value.go:1178	Replay took: 3.572µs
2022-09-12T16:56:46.126Z	INFO	badger	v2@v2.2007.3/levels.go:183	All 0 tables opened in 0s
2022-09-12T16:56:46.128Z	INFO	badger	v2@v2.2007.3/value.go:1158	Replaying file id: 0 at offset: 0
2022-09-12T16:56:46.128Z	INFO	badger	v2@v2.2007.3/value.go:1178	Replay took: 3.369µs
ERROR: cannot dial address ws://0.0.0.0:1234/rpc/v0 for dial tcp 0.0.0.0:1234: connect: connection refused: dial tcp 0.0.0.0:1234: connect: connection refused
2022-09-12T16:56:48.022Z	INFO	badgerbs	v2@v2.2007.3/levels.go:183	All 144 tables opened in 1.88s
2022-09-12T16:56:48.239Z	INFO	badgerbs	v2@v2.2007.3/value.go:1158	Replaying file id: 186 at offset: 97039571
2022-09-12T16:56:48.464Z	INFO	badgerbs	v2@v2.2007.3/value.go:1178	Replay took: 225.549956ms
2022-09-12T16:56:48.469Z	WARN	modules	modules/chain.go:89	loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

goroutine 1 [running]:
github.com/filecoin-project/lotus/chain/types.(*TipSet).ParentState(...)
	/go/lotus/chain/types/tipset.go:223
github.com/filecoin-project/lotus/node/modules.NetworkName({0x7f47042b4800?, 0xc0008ed3b0?}, {0x4b4cca0?, 0xc000011380?}, 0x6?, {0x4b539f0, 0x6874aa0}, 0xc003425080?, {0xc000389180, 0xd, ...}, ...)
	/go/lotus/node/modules/chain.go:131 +0xd9
reflect.Value.call({0x3a9d6e0?, 0x4869160?, 0x2?}, {0x3db45ef, 0x4}, {0xc0008986e0, 0x7, 0x203000?})
	/usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0x4869160?, 0x6727a5?}, {0xc0008986e0, 0x7, 0x7})
	/usr/local/go/src/reflect/value.go:339 +0xbf
github.com/filecoin-project/lotus/node.as.func2({0xc0008986e0?, 0x3a052c0?, 0x10?})
	/go/lotus/node/options.go:140 +0xf0
reflect.Value.call({0x3a9d6e0?, 0xc000534930?, 0x6727a5?}, {0x3db45ef, 0x4}, {0xc0008c8370, 0x7, 0x30?})
	/usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0xc000534930?, 0x672b07?}, {0xc0008c8370, 0x7, 0x7})
	/usr/local/go/src/reflect/value.go:339 +0xbf
go.uber.org/dig.defaultInvoker({0x3a9d6e0?, 0xc000534930?, 0xc0004e8e70?}, {0xc0008c8370?, 0x7?, 0x4b6fd58?})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:355 +0x28
go.uber.org/dig.(*node).Call(0xc00083a140, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:806 +0x259
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x39180c0}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3a9d7e0}, {0xc0004e8cb0, 0x7, 0x7}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080db80, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3b2bf20}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3986300}, {0xc0004c6e40, 0x2, 0x2}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080cd20, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3c06c40}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x394b380}, {0xc00041e7a0, 0x1, 0x1}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*Container).Invoke(0xc003232af0, {0x394b380?, 0xc0012bf600}, {0x189153a?, 0x1?, 0x1?})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:503 +0x2b9
go.uber.org/fx.(*App).executeInvoke(0xc00346dad0, {{0x394b380, 0xc0012bf600}, {0xc003457a40, 0x7, 0x8}})
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:964 +0x39f
go.uber.org/fx.(*App).executeInvokes(...)
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:929
go.uber.org/fx.New({0xc000541458, 0x3, 0x1c?})
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:596 +0xa4b
github.com/filecoin-project/lotus/node.New({0x4b60f58, 0xc000c669f0}, {0xc0032325f0, 0x9, 0x9})
	/go/lotus/node/builder.go:361 +0x477
main.glob..func5(0xc000c68700)
	/go/lotus/cmd/lotus/daemon.go:317 +0x1609
github.com/urfave/cli/v2.(*App).RunAsSubcommand(0xc000583ba0, 0xc000c68200)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:495 +0xaff
github.com/urfave/cli/v2.(*Command).startApp(0x64ee5e0, 0xc000c68200)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:287 +0x77b
github.com/urfave/cli/v2.(*Command).Run(0xc0000cc140?, 0xc0000cc140?)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:95 +0xba
github.com/urfave/cli/v2.(*App).RunContext(0xc000583860, {0x4b60ee8?, 0xc000128000}, {0xc000126000, 0x2, 0x2})
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:341 +0xbc8
github.com/urfave/cli/v2.(*App).Run(...)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:247
github.com/filecoin-project/lotus/cli.RunApp(0x39efa00?)
	/go/lotus/cli/helper.go:35 +0x4e
main.main()
	/go/lotus/cmd/lotus/main.go:111 +0x90c

Repro Steps

  1. Run lotus
  2. Run out of disk space
  3. See error
@TippyFlitsUK
Contributor

Can you elaborate on why you see this as an issue, @ArseniiPetrovich? It is not a surprise to me that running out of disk space on the chain volume would result in chain corruption, and disk space needs to be monitored to avoid this. It can also be easily resolved by importing a new lightweight snapshot.
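Monitoring (or checking before a write, as suggested later in this thread) comes down to asking the filesystem how many bytes are still available. The Go sketch below does this with syscall.Statfs, so it assumes a Unix-like OS. It reads LOTUS_PATH (the environment variable Lotus uses for its repo location) and falls back to the current directory; the 50 GiB threshold and the exit codes are arbitrary illustrative choices, and the helpers freeBytes and hasHeadroom are hypothetical, not part of Lotus.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// freeBytes returns the bytes available to unprivileged users on the
// filesystem that contains path (Unix-like systems only; uses statfs(2)).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	// Field types differ between platforms, so convert both explicitly.
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// hasHeadroom reports whether at least minFree bytes remain at path,
// along with the measured free space.
func hasHeadroom(path string, minFree uint64) (bool, uint64, error) {
	free, err := freeBytes(path)
	if err != nil {
		return false, 0, err
	}
	return free >= minFree, free, nil
}

func main() {
	repo := os.Getenv("LOTUS_PATH")
	if repo == "" {
		repo = "."
	}
	const minFree = 50 << 30 // 50 GiB: an arbitrary example threshold

	ok, free, err := hasHeadroom(repo, minFree)
	if err != nil {
		fmt.Fprintln(os.Stderr, "statfs failed:", err)
		os.Exit(1)
	}
	if !ok {
		fmt.Printf("WARNING: only %d bytes free under %s\n", free, repo)
		os.Exit(2) // non-zero exit so a cron job or alerting system can react
	}
	fmt.Printf("ok: %d bytes free under %s\n", free, repo)
}
```

If the chain blockstore lives on a separate volume, point the check at that mount instead of the repo root.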

@TippyFlitsUK TippyFlitsUK added area/chain Area: Chain need/author-input Hint: Needs Author Input and removed need/triage kind/bug Kind: Bug labels Sep 12, 2022
@ArseniiPetrovich
Contributor Author

@TippyFlitsUK that's not so easy for archival nodes that hold all of the chain state :)
Sure, disk space needs to be monitored, and it's purely our fault that we overlooked this alert in our systems. However, chain corruption caused by a lack of disk space should still be considered a bug, at least from my point of view, whether or not it is a "surprise", because it lets even a simple mistake have severe consequences. Couldn't Lotus verify the available space before writing, or at least provide some kind of recovery tool that lets you roll back several blocks behind the chain head and resync?
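One way such a recovery tool could choose a rollback point is sketched below in Go: scan downward from the last recorded head height and return the newest run of `depth` consecutive tipsets whose blocks are all still present, then resync from there. This is a hypothetical design, not an existing Lotus feature. It assumes an index of tipset block keys by height (byHeight) and a presence set (present), both invented for illustration, and it only checks block headers, whereas an archival node would also need its state trees verified.

```go
package main

import (
	"errors"
	"fmt"
)

// allPresent reports whether every block key of a tipset is in the store.
func allPresent(present map[string]bool, keys []string) bool {
	if len(keys) == 0 {
		return false
	}
	for _, k := range keys {
		if !present[k] {
			return false
		}
	}
	return true
}

// rollbackTarget (hypothetical) scans from head down to height 0 and returns
// the top height of the newest run of `depth` consecutive intact tipsets.
// Heights missing from byHeight are treated as null rounds and skipped.
func rollbackTarget(present map[string]bool, byHeight map[int64][]string, head, depth int64) (int64, error) {
	if depth < 1 {
		return 0, errors.New("depth must be at least 1")
	}
	var run, top int64
	for h := head; h >= 0; h-- {
		keys, ok := byHeight[h]
		if !ok {
			continue // null round: no tipset at this height
		}
		if !allPresent(present, keys) {
			run = 0 // a damaged tipset breaks the run
			continue
		}
		if run == 0 {
			top = h
		}
		run++
		if run == depth {
			return top, nil
		}
	}
	return 0, errors.New("no intact run of tipsets found")
}

func main() {
	// Simulated index: the head block at height 10 was lost.
	byHeight := map[int64][]string{
		10: {"a10"},
		9:  {"a9", "b9"},
		8:  {"a8"},
		7:  {"a7"},
	}
	present := map[string]bool{"a9": true, "b9": true, "a8": true, "a7": true}

	h, err := rollbackTarget(present, byHeight, 10, 3)
	if err != nil {
		fmt.Println("cannot roll back:", err)
		return
	}
	fmt.Println("roll back to height", h) // prints "roll back to height 9"
}
```

Requiring several intact tipsets rather than one gives some confidence that the damage is confined to the most recent writes before the disk filled up.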

@TippyFlitsUK
Contributor

Thanks for the clarification, @ArseniiPetrovich! Agreed that this presents a far bigger problem for archival nodes. I don't agree that it represents a bug, though.
Could you please file a new ticket using the enhancement request form and provide the additional info requested?
Many thanks! 🙏
