Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/utils: Handle graceful shutdown on low disk space #21884

Closed
wants to merge 2 commits into from
Closed

cmd/utils: Handle graceful shutdown on low disk space #21884

wants to merge 2 commits into from

Conversation

vyrwu
Copy link

@vyrwu vyrwu commented Nov 21, 2020

Currently, the program will panic when the node runs out of memory, sometimes leading to corrupting the databases.
This commit introduces a goroutine that checks for available disk space every 5 seconds:

  • If less than 500 MB, prints warning
  • If less than 100 MB, writes SIGTERM to channel that is used to handle graceful termination of a node

Currently, the program will panic when the node runs out of memory, sometimes leading to corrupting the databases.
This commit introduces a goroutine that checks for available disk space every 5 seconds:
  - If less than 500 MB, prints warning
  - If less than 100 MB, writes SIGTERM to channel that is used to handle graceful termination of a node
@vyrwu vyrwu changed the title Handle graceful shutdown on low disk space cmd/utils: Handle graceful shutdown on low disk space Nov 21, 2020
@vyrwu
Copy link
Author

vyrwu commented Nov 21, 2020

Addresses #20965

@holiman One requirement was unclear to me: "check the disk where the datadir resides". I can get this path from Config, but why do I need it? Or did I miss-understood the requirements?

@holiman
Copy link
Contributor

holiman commented Nov 23, 2020

@vyrwu thanks for picking up this task!

One requirement was unclear to me: "check the disk where the datadir resides". I can get this path from Config, but why do I need it?

You're currently checking os.Getwd(). However, in many deployments, people don't have the data directory in the same location as where they're executing from. E.g. if you run dockerized geth, then you have probably mounted your data directory from the host. And the data directory is the concern here, not the working directory.

Otherwise I think it looks good, and clever idea to use the sigterm channel to simulate a regular exit!

We'll have to check whether 100Mb is sufficient. Whenever geth exits, it has quite a lot of data held in memory which must be persisted, so 100Mb might be on the low side.

I would assume the level for low-disk-exit should be on the same order of magnitude as the cache limit.

@vyrwu
Copy link
Author

vyrwu commented Nov 23, 2020

Thanks for fast reply. I will work on it next weekend. I'll try testing geth a bit too, and see maybe I can simulate OOM somehow to verify that the fix does what it supposed to.

cmd/utils/cmd.go Outdated
@@ -312,3 +313,28 @@ func ExportPreimages(db ethdb.Database, fn string) error {
log.Info("Exported preimages", "file", fn)
return nil
}

func ensureSufficientMemory(sigc chan os.Signal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing to have this method named ensureSufficientMemory, as it's disk-space, not RAM that's being checked.

Copy link
Author

@vyrwu vyrwu Nov 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c355ae4

cmd/utils/cmd.go Outdated
log.Info("Available disk space is less than 100 MB. Gracefully shutting down to prevent database corruption.")
sigc <- syscall.SIGTERM
} else if avMemMB < 500 {
log.Warnf("Node is running low on memory. It will terminate if memory runs below 100MB. Remaining: %v MB.", avMemMB)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, low on disk space ... if disk space runs below...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c355ae4

cmd/utils/cmd.go Outdated
go func() {
var stat syscall.Statfs_t
wd, err := os.Getwd(); err != nil {
Fatalf("Error reading available memory of Node: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid Fatalf -- that one causes an immediate os.Exit, which means it will almost certainly cause data loss and/or database corruption. We only use it in cases where things are already irrevocably broken beyond repair.
Just use log.Warn (and perhaps send a SIGTERM?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c355ae4

@holiman
Copy link
Contributor

holiman commented Nov 23, 2020

As for testing, it might be good to make it trigger after ~1h, and right before it triggers, print out the measured disk size. And once geth has exited, you can dump out the same measure again. Doing that a few times should collect some stats on how much disk is used by the shutdown process.

@vyrwu
Copy link
Author

vyrwu commented Nov 28, 2020

This is still untested, I was reading a bit about Geth today and digging more into the codebase. I still need to find the right cache limit for the disk size thresholds, there are a few defined in code but I'm sure I can find the right one after understanding it a little better. 👍

My intuition is that it's 1000 MBs.

@vyrwu
Copy link
Author

vyrwu commented Nov 28, 2020

Also need to look into these CI errors:

# github.com/ethereum/go-ethereum/cmd/utils
cmd\utils\cmd.go:322:12: undefined: syscall.Statfs_t
cmd\utils\cmd.go:323:13: undefined: syscall.Statfs
util.go:47: exit status 2
exit status 1

@holiman
Copy link
Contributor

holiman commented Dec 5, 2020

About the syscall package (https://golang.org/pkg/syscall/)

Deprecated: this package is locked down. Callers should use the corresponding package in the golang.org/x/sys repository instead.

Also particular, Statfs seems to be a unix thing, (but seems to be deprecated in favour of vstatfs ) .

The sys package has statfs in the unix package.

So I think what's needed is,

  • Find a corresponding windows method of figuring out the disk use,
  • Verify that we also have something for darwin (maybe the unix version works there)

And once we have that, create architecture-specific files, with a method e.g. availableSpace(path string), defined in <xx>.go, where there is one <xxx>_windows.go and one <xxx>_linux.go with appropriate compiler directives.

@holiman
Copy link
Contributor

holiman commented Jan 20, 2021

Superseded by #22103

@holiman holiman closed this Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants