opt: fix panic recovery for error handling #38570

Merged 1 commit on Jul 9, 2019

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Jun 29, 2019

The major entry points in the optimizer catch all panics that throw an
error and convert them to errors. Unfortunately, this also catches
runtime errors (in which case we convert them to errors and lose the
stack trace).

This change adds a ShouldCatch helper which determines if we should
return a thrown object as an error. If the object is a
runtime.Error, it gets wrapped by an AssertionFailed error which
will cause correct error handling (stack trace, sentry reporting, etc).

As part of this change, we are also removing wrappers like
builderError, which are no longer useful. We fix the opt tester to
fail with the full error information (using %+v) for assertion
errors.

Release note: None

@RaduBerinde RaduBerinde requested a review from a team as a code owner June 29, 2019 03:26
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@knz knz left a comment

So I think this PR is misguided. When I wrote the code, I intended to catch runtime.Error panics and let them flow through. The reason is that runtime.Error panics are recoverable, and there is no reason to let a cluster go down when they occur.

FYI I even went through the go source code to validate the following:

  • runtime.Error is only emitted for "soft" errors like out-of-bound accesses, assertion failures, etc
  • for "serious" internal errors e.g. in the scheduler, bad goroutine state, allocator problem etc, the runtime throws a string which does not implement error and thus will not be captured here.

So, can you explain a little better why you thought this PR was a good idea?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @knz, and @rytaft)

@RaduBerinde
Member Author

Today if you are working on a change that results in a nil dereference or out-of-bound access, you get a one line error with no stack trace. Good luck debugging that. IMO that is not acceptable, both for development workflow and customer support (what will we do when we get a report from a customer which just says "out of bounds" with no other context?)

When we agreed to catch assertion errors thrown by the optimizer, it was with the condition that we will still always get stack traces for them. The discussion was mostly focused on assertions generated by our code, I don't think we specifically discussed catching runtime errors (at least not to my knowledge). I am ok catching them but only if we don't lose the stack trace.

@knz
Contributor

knz commented Jun 29, 2019 via email

@RaduBerinde
Member Author

It doesn't work. The stack trace isn't shown in important cases:

In cockroach demo:

root@127.68.126.34:45519/defaultdb> select 1 as lolomg;
pq: runtime error: index out of range
root@127.68.126.34:45519/defaultdb> 

In an opt test:

--- FAIL: TestBuilder (0.00s)
    --- FAIL: TestBuilder/select (0.00s)
        builder_test.go:60: 
            testdata/select:25: SELECT 1 AS lolomg
            expected:
            
            found:
            error: runtime error: index out of range
FAIL

@RaduBerinde
Member Author

I put the patch which I ran above in https://github.com/RaduBerinde/cockroach/tree/opt-err-fix-2

@RaduBerinde
Member Author

Maybe I should try NewAssertionErrorWithWrappedErrf?

@knz
Contributor

knz commented Jul 1, 2019

oh yes, absolutely. I hadn't thought of that but indeed it's the best way to ensure we get telemetry, etc.

@RaduBerinde
Member Author

Just leaving a note with the status of this PR - converting to AssertionFailed didn't quite work because it still doesn't print the stack trace in tests (with %+v); @knz is going to fix that in the error library first.

craig bot pushed a commit that referenced this pull request Jul 8, 2019
38710: errors: fix the formatting with %+v r=knz a=knz

(found by @RaduBerinde; needed to complete #38570)

The new library `github.com/cockroachdb/errors` was not implementing
`%+v` formatting properly for assertion and unimplemented errors.
The wrong implementation was hiding the details of the cause
of these two error types from the formatting logic.

Fixing this bug comprehensively required completing the investigation
of the Go 2 / `xerrors` error proposal. This revealed that the
implementation of `fmt.Formatter` for wrapper errors (a `Format()`
method) is required in all cases, at least until Go's stdlib
learns about `errors.Formatter`. More details at
golang/go#29934 and this commit message: cockroachdb/errors@78b6caa.

This patch bumps the dependency `github.com/cockroachdb/errors` to
pick up the fixes to assertion failures and unimplemented errors.

The new definition of `errors.FormatError()` subsequently required
re-implementing `Format()` for `pgerror.withCandidateCode`, which is
also done here.

Finally, this patch also picks up `errors.As()` and the new
streamlined `fmt.Formatter` / `errors.Formatter` interaction, so this
patch also simplifies a few custom error types in CockroachDB
accordingly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@RaduBerinde
Member Author

Updated, using NewAssertionErrorWithWrappedErrf now.

@RaduBerinde RaduBerinde force-pushed the opt-err-fix branch 2 times, most recently from 54b69fb to 525b9ff Compare July 9, 2019 01:40
Contributor

@knz knz left a comment

Reviewed 19 of 19 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, and @rytaft)

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

			// Convert runtime errors to internal errors, which display the stack and
			// get reported to Sentry.
			err = errors.NewAssertionErrorWithWrappedErrf(err, "")

That's what's creating the surprising result.
Until I fix this you can make the surprising errors with safe detail disappear (and also introduce a clarification about where the runtime error comes from) as follows:

err = errors.HandledWithMessage(err, "Go runtime error")
err = errors.WithAssertionFailure(err)
err = errors.WithStack(err)

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

That's what's creating the surprising result.
Until I fix this you can make the surprising errors with safe detail disappear (and also introduce a clarification about where the runtime error comes from) as follows:

err = errors.HandledWithMessage(err, "Go runtime error")
err = errors.WithAssertionFailure(err)
err = errors.WithStack(err)

see https://github.com/cockroachdb/errors/pull/3



Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)


pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

see cockroachdb/errors#3

Then you can use err = errors.HandleAsAssertionFailure(err) instead of the 3 lines I listed above.

@RaduBerinde RaduBerinde requested a review from a team July 9, 2019 14:14
@RaduBerinde
Member Author

Bumped the dep and switched to HandleAsAssertionFailure.

Contributor

@knz knz left a comment

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @RaduBerinde, and @rytaft)

@RaduBerinde
Member Author

TFTR!

bors r+

craig bot pushed a commit that referenced this pull request Jul 9, 2019
38570: opt: fix panic recovery for error handling r=RaduBerinde a=RaduBerinde

The major entry points in the optimizer catch all panics that throw an
error and convert them to errors. Unfortunately, this also catches
runtime errors (in which case we convert them to errors and lose the
stack trace).

This change adds a `ShouldCatch` helper which determines if we should
return a thrown object as an error. If the object is a
`runtime.Error`, it gets wrapped by an AssertionFailed error which
will cause correct error handling (stack trace, sentry reporting, etc).

As part of this change, we are also removing wrappers like
`builderError`, which are no longer useful. We fix the opt tester to
fail with the full error information (using `%+v`) for assertion
errors.

Release note: None

38660: opt: push limit into offset r=ridwanmsharif a=ridwanmsharif

This change pushes the limit into an offset whenever possible.
This shouldn't worsen any plan but does allow the `GetLimitedScans`
rule to fire in more scenarios.

Fixes #30416.
~~This is currently blocked on #38659.~~

Release note: None

38743: roachtest: skip jepsen/multi-register r=god a=nvanbenschoten

There's no use running this every night until #36431 is fixed.

Release note: None

38746: roachtest: don't reuse clusters after test failure r=andreimatei a=andreimatei

We've had a case where a cluster got messed up somehow and then a bunch
of tests that tried to reuse it failed. This patch employs a big hammer
and makes it so that we don't reuse a cluster after a test failure
(whether or not the failure is cluster related).

Release note: None

38766: scripts/release-notes.py: help the user with --from/--until r=lhirata a=knz

Requested by @lhirata

Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig
Contributor

craig bot commented Jul 9, 2019

Build succeeded

@craig craig bot merged commit 5ab44a9 into cockroachdb:master Jul 9, 2019
@RaduBerinde RaduBerinde deleted the opt-err-fix branch July 10, 2019 18:19