Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli,base: surface stored critical errors at the right moment #47756

Merged
merged 1 commit into from
Apr 21, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 21, 2020

Fixes #44041

If/when storage detects an important error, the text for this error is
stored in a file named _CRITICAL_ERROR.txt in the auxiliary
directory. The intention is to block subsequent server restarts until
the error is investigated and the file (manually) removed.

Prior to this patch, this check was done in the startup sequence 1)
before logging was fully initialized 2) using a log.Fatal to
announce the critical error.

The first aspect is problematic because it logs before logging flags
are applied. The second is problematic because it makes the failure
super-verbose and buries the lede.

This patch simplifies the code and makes the error reported at the
right place.

Example, before:

kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node
F200421 14:24:02.303675 1 cli/start.go:478  From .../auxiliary/_CRITICAL_ALERT.txt:

boom

goroutine 1 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6a7ca00, 0xed630f902, 0x0, 0x1000)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x6a79800, 0xc000000004, 0x33eb7b9, 0xc, 0x1de, 0xc000ba8180, 0x76)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:210 +0xa92
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x15e3420, 0xc000078168, 0x4, 0x2, 0x0, 0x0, 0xc00063f7e8, 0x1, 0x1)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
...
[147/245]
****************************************************************************

This node experienced a fatal error (printed above), and as a result the
process is terminating.

Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a
...
    support@cockroachlabs.com

The Cockroach Labs team appreciates your feedback.

Example, after:

kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node
*
* ERROR: startup forbidden by prior critical alert
* DETAIL: From /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/cockroach-data/auxiliary/_CRITICAL_ALERT.txt:
* boom
*

Release note (cli change): The error message displayed upon cockroach start / cockroach start-single-node when manual intervention is
needed in the store directory is now clearer.

@knz knz requested a review from tbg April 21, 2020 14:39
@knz knz requested a review from a team as a code owner April 21, 2020 14:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning this up!

// These files typically request operator intervention after a
// corruption event by preventing the affected node(s) from starting
// back up.
func (ssl StoreSpecList) AssertNoPriorCriticalAlert() (err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Assert here strikes me as weird, since the "assertion" is returned via an error. Have you considered something like PriorCriticalAlertError()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take it. Thanks for the suggestion.

@blathers-crl
Copy link

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on cd63487f.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

If/when storage detects an important error, the text for this error is
stored in a file named `_CRITICAL_ERROR.txt` in the auxiliary
directory. The intention is to block subsequent server restarts until
the error is investigated and the file (manually) removed.

Prior to this patch, this check was done in the startup sequence 1)
before logging was fully initialized 2) using a `log.Fatal` to
announce the critical error.

The first aspect is problematic because it logs before logging flags
are applied. The second is problematic because it makes the failure
super-verbose and buries the lede.

This patch simplifies the code and makes the error reported at the
right place.

Example, before:
```
kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node
F200421 14:24:02.303675 1 cli/start.go:478  From .../auxiliary/_CRITICAL_ALERT.txt:

boom

goroutine 1 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6a7ca00, 0xed630f902, 0x0, 0x1000)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x6a79800, 0xc000000004, 0x33eb7b9, 0xc, 0x1de, 0xc000ba8180, 0x76)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:210 +0xa92
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x15e3420, 0xc000078168, 0x4, 0x2, 0x0, 0x0, 0xc00063f7e8, 0x1, 0x1)
        ...//src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
...
[147/245]
****************************************************************************

This node experienced a fatal error (printed above), and as a result the
process is terminating.

Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a
...
    support@cockroachlabs.com

The Cockroach Labs team appreciates your feedback.
```

Example, after:

```
kena@kenax ....com/cockroachdb/cockroach % ./cockroach start-single-node
*
* ERROR: ERROR: startup forbidden by prior critical alert
* DETAIL: From /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/cockroach-data/auxiliary/_CRITICAL_ALERT.txt:
* boom
*
```

Release note (cli change): The error message displayed upon `cockroach
start` / `cockroach start-single-node` when manual intervention is
needed in the store directory is now clearer.
@knz
Copy link
Contributor Author

knz commented Apr 21, 2020

bors r=tbg

@blathers-crl
Copy link

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on 110f0587.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl
Copy link

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on 110f0587.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build succeeded

@craig craig bot merged commit 998abbe into cockroachdb:master Apr 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli: the call to GetPreventedStartupMessage is not in the right place
3 participants