Running out of disk space causes blockchain disk corruption #9292

ArseniiPetrovich · 2022-09-12T17:04:21Z

Checklist

This is not a security-related bug/issue. If it is, please follow the security policy.
This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
This is not a new feature request. If it is, please file a feature request instead.
This is not an enhancement request. If it is, please file an improvement suggestion instead.
I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
I am running the latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
I did not make any code changes to lotus.

Lotus component

Lotus Version

lotus version 1.17.2-dev+calibnet+git.29fff4f

Describe the Bug

Here at Lotus nodes, we unfortunately ran out of disk space recently on one of our archival nodes on calibrationnet. It was running 1.16.0, and when we restarted it, it failed with the following error:

2022-09-12T16:56:48.469Z	WARN	modules	modules/chain.go:89	loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

I tried upgrading to 1.17 as suggested in #8916, but it didn't help. Is there any way to recover from this condition?
Thank you!

Logging Information

2022-09-12T16:56:46.115Z	INFO	badger	v2@v2.2007.3/levels.go:183	All 0 tables opened in 0s

2022-09-12T16:56:46.116Z	INFO	badger	v2@v2.2007.3/value.go:1158	Replaying file id: 0 at offset: 0

2022-09-12T16:56:46.116Z	INFO	badger	v2@v2.2007.3/value.go:1178	Replay took: 3.572µs

2022-09-12T16:56:46.126Z	INFO	badger	v2@v2.2007.3/levels.go:183	All 0 tables opened in 0s

2022-09-12T16:56:46.128Z	INFO	badger	v2@v2.2007.3/value.go:1158	Replaying file id: 0 at offset: 0

2022-09-12T16:56:46.128Z	INFO	badger	v2@v2.2007.3/value.go:1178	Replay took: 3.369µs

ERROR: cannot dial address ws://0.0.0.0:1234/rpc/v0 for dial tcp 0.0.0.0:1234: connect: connection refused: dial tcp 0.0.0.0:1234: connect: connection refused

2022-09-12T16:56:48.022Z	INFO	badgerbs	v2@v2.2007.3/levels.go:183	All 144 tables opened in 1.88s

2022-09-12T16:56:48.239Z	INFO	badgerbs	v2@v2.2007.3/value.go:1158	Replaying file id: 186 at offset: 97039571

2022-09-12T16:56:48.464Z	INFO	badgerbs	v2@v2.2007.3/value.go:1178	Replay took: 225.549956ms

2022-09-12T16:56:48.469Z	WARN	modules	modules/chain.go:89	loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

goroutine 1 [running]:
github.com/filecoin-project/lotus/chain/types.(*TipSet).ParentState(...)
	/go/lotus/chain/types/tipset.go:223
github.com/filecoin-project/lotus/node/modules.NetworkName({0x7f47042b4800?, 0xc0008ed3b0?}, {0x4b4cca0?, 0xc000011380?}, 0x6?, {0x4b539f0, 0x6874aa0}, 0xc003425080?, {0xc000389180, 0xd, ...}, ...)
	/go/lotus/node/modules/chain.go:131 +0xd9
reflect.Value.call({0x3a9d6e0?, 0x4869160?, 0x2?}, {0x3db45ef, 0x4}, {0xc0008986e0, 0x7, 0x203000?})
	/usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0x4869160?, 0x6727a5?}, {0xc0008986e0, 0x7, 0x7})
	/usr/local/go/src/reflect/value.go:339 +0xbf
github.com/filecoin-project/lotus/node.as.func2({0xc0008986e0?, 0x3a052c0?, 0x10?})
	/go/lotus/node/options.go:140 +0xf0
reflect.Value.call({0x3a9d6e0?, 0xc000534930?, 0x6727a5?}, {0x3db45ef, 0x4}, {0xc0008c8370, 0x7, 0x30?})
	/usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0xc000534930?, 0x672b07?}, {0xc0008c8370, 0x7, 0x7})
	/usr/local/go/src/reflect/value.go:339 +0xbf
go.uber.org/dig.defaultInvoker({0x3a9d6e0?, 0xc000534930?, 0xc0004e8e70?}, {0xc0008c8370?, 0x7?, 0x4b6fd58?})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:355 +0x28
go.uber.org/dig.(*node).Call(0xc00083a140, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:806 +0x259
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x39180c0}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3a9d7e0}, {0xc0004e8cb0, 0x7, 0x7}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080db80, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3b2bf20}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3986300}, {0xc0004c6e40, 0x2, 0x2}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080cd20, {0x4b6fd58?, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3c06c40}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x394b380}, {0xc00041e7a0, 0x1, 0x1}}, {0x4b6fd58, 0xc003232af0})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*Container).Invoke(0xc003232af0, {0x394b380?, 0xc0012bf600}, {0x189153a?, 0x1?, 0x1?})
	/go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:503 +0x2b9
go.uber.org/fx.(*App).executeInvoke(0xc00346dad0, {{0x394b380, 0xc0012bf600}, {0xc003457a40, 0x7, 0x8}})
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:964 +0x39f
go.uber.org/fx.(*App).executeInvokes(...)
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:929
go.uber.org/fx.New({0xc000541458, 0x3, 0x1c?})
	/go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:596 +0xa4b
github.com/filecoin-project/lotus/node.New({0x4b60f58, 0xc000c669f0}, {0xc0032325f0, 0x9, 0x9})
	/go/lotus/node/builder.go:361 +0x477
main.glob..func5(0xc000c68700)
	/go/lotus/cmd/lotus/daemon.go:317 +0x1609
github.com/urfave/cli/v2.(*App).RunAsSubcommand(0xc000583ba0, 0xc000c68200)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:495 +0xaff
github.com/urfave/cli/v2.(*Command).startApp(0x64ee5e0, 0xc000c68200)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:287 +0x77b
github.com/urfave/cli/v2.(*Command).Run(0xc0000cc140?, 0xc0000cc140?)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:95 +0xba
github.com/urfave/cli/v2.(*App).RunContext(0xc000583860, {0x4b60ee8?, 0xc000128000}, {0xc000126000, 0x2, 0x2})
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:341 +0xbc8
github.com/urfave/cli/v2.(*App).Run(...)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:247
github.com/filecoin-project/lotus/cli.RunApp(0x39efa00?)
	/go/lotus/cli/helper.go:35 +0x4e
main.main()
	/go/lotus/cmd/lotus/main.go:111 +0x90c

Repo Steps

Run lotus
Run out of disk space
See error


TippyFlitsUK · 2022-09-12T18:44:09Z

Can you elaborate on why you see this as an issue, @ArseniiPetrovich? It is no surprise to me that running out of chain disk space results in chain corruption, and disk space needs to be monitored to avoid it. It can also be easily resolved by importing a new lightweight snapshot.

ArseniiPetrovich · 2022-09-12T20:13:05Z

@TippyFlitsUK not so easy for archival nodes that hold all the chain state :)
Sure, disk space needs to be monitored, and it's purely our fault that we overlooked this alert in our systems. However, chain corruption caused by a lack of disk space should still be considered a bug, at least from my point of view, whether it is a "surprise" or not, because it lets a simple mistake have severe consequences. Can't we verify the available space before writing, or at least provide some kind of recovery tool that lets you roll back several blocks behind the chain head and resync?
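
To illustrate the "verify the available space before writing" idea, here is a minimal sketch in Go. It is not taken from Lotus: `freeBytes`, `hasHeadroom`, the 10 GiB reserve, and the 1 MiB write size are all hypothetical names and values chosen for illustration, and it assumes a Unix-like system where `syscall.Statfs` is available.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// freeBytes reports how many bytes are available to unprivileged users
// on the filesystem that contains path. (Hypothetical helper, Unix only.)
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, fmt.Errorf("statfs %s: %w", path, err)
	}
	// Field types differ between platforms, so convert both explicitly.
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// hasHeadroom reports whether a write of writeSize bytes would still leave
// at least reserve bytes free. The subtraction form avoids uint64 overflow.
func hasHeadroom(free, writeSize, reserve uint64) bool {
	if free < reserve {
		return false
	}
	return free-reserve >= writeSize
}

func main() {
	// Assumption: in a real node this would be the chain datastore directory.
	path := "."
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	free, err := freeBytes(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	const reserve = 10 << 30  // illustrative safety margin: 10 GiB
	const writeSize = 1 << 20 // illustrative pending write: 1 MiB

	if hasHeadroom(free, writeSize, reserve) {
		fmt.Printf("ok to write: %d bytes free on %s\n", free, path)
	} else {
		fmt.Printf("refusing to write: only %d bytes free on %s\n", free, path)
	}
}
```

A check like this only narrows the window: another process can fill the disk between the check and the write, so it would complement monitoring and a recovery path rather than replace them.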

TippyFlitsUK · 2022-09-12T20:18:38Z

Thanks for the clarification @ArseniiPetrovich! Agreed that this presents a far bigger problem for archival nodes. I don't agree that it represents a bug, though.
Can you please file a new ticket using the enhancement request form and provide the additional info requested?
Many thanks! 🙏

ArseniiPetrovich added kind/bug Kind: Bug need/triage labels Sep 12, 2022

TippyFlitsUK added area/chain Area: Chain need/author-input Hint: Needs Author Input and removed need/triage kind/bug Kind: Bug labels Sep 12, 2022

TippyFlitsUK closed this as completed Sep 12, 2022
