-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli,base: surface stored critical errors at the right moment #47756
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for cleaning this up!
pkg/base/store_spec.go
Outdated
// These files typically request operator intervention after a | ||
// corruption event by preventing the affected node(s) from starting | ||
// back up. | ||
func (ssl StoreSpecList) AssertNoPriorCriticalAlert() (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Assert
here strikes me as weird, since the "assertion" is returned via an error. Have you considered something like PriorCriticalAlertError()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take it. Thanks for the suggestion.
❌ The GitHub CI (Cockroach) build has failed on cd63487f. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
If/when storage detects an important error, the text for this error is stored in a file named `_CRITICAL_ERROR.txt` in the auxiliary directory. The intention is to block subsequent server restarts until the error is investigated and the file (manually) removed. Prior to this patch, this check was done in the startup sequence 1) before logging was fully initialized 2) using a `log.Fatal` to announce the critical error. The first aspect is problematic because it logs before logging flags are applied. The second is problematic because it makes the failure super-verbose and buries the lede. This patch simplifies the code and makes the error reported at the right place. Example, before: ``` kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node F200421 14:24:02.303675 1 cli/start.go:478 From .../auxiliary/_CRITICAL_ALERT.txt: boom goroutine 1 [running]: github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6a7ca00, 0xed630f902, 0x0, 0x1000) ...//src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8 github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x6a79800, 0xc000000004, 0x33eb7b9, 0xc, 0x1de, 0xc000ba8180, 0x76) ...//src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:210 +0xa92 github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x15e3420, 0xc000078168, 0x4, 0x2, 0x0, 0x0, 0xc00063f7e8, 0x1, 0x1) ...//src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9 ... [147/245] **************************************************************************** This node experienced a fatal error (printed above), and as a result the process is terminating. Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a ... support@cockroachlabs.com The Cockroach Labs team appreciates your feedback. ``` Example, after: ``` kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node * * ERROR: ERROR: startup forbidden by prior critical alert * DETAIL: From /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/cockroach-data/auxiliary/_CRITICAL_ALERT.txt: * boom * ``` Release note (cli change): The error message displayed upon `cockroach start` / `cockroach start-single-node` when manual intervention is needed in the store directory is now clearer.
bors r=tbg |
❌ The GitHub CI (Cockroach) build has failed on 110f0587. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
❌ The GitHub CI (Cockroach) build has failed on 110f0587. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Build succeeded |
Fixes #44041
If/when storage detects an important error, the text for this error is
stored in a file named
_CRITICAL_ERROR.txt
in the auxiliarydirectory. The intention is to block subsequent server restarts until
the error is investigated and the file (manually) removed.
Prior to this patch, this check was done in the startup sequence 1)
before logging was fully initialized 2) using a
log.Fatal
toannounce the critical error.
The first aspect is problematic because it logs before logging flags
are applied. The second is problematic because it makes the failure
super-verbose and buries the lede.
This patch simplifies the code and makes the error reported at the
right place.
Example, before:
Example, after:
Release note (cli change): The error message displayed upon
cockroach start
/cockroach start-single-node
when manual intervention isneeded in the store directory is now clearer.