Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/pretty: mitigate stack overflows of Pretty #110374

Merged
merged 3 commits into from
Sep 20, 2023

Conversation

mgartner
Copy link
Collaborator

@mgartner mgartner commented Sep 11, 2023

errorutil: moved kv-specific errors out of errorutil

Errors for missing store and node descriptors have been moved from the
errorutil package to the kvpb package, because the errorutil package is
a low-level package to aid in working with errors in general. It should
not contain facilities for creating specific errors as this muddles the
package and can lead to import cycles.

Release note: None

errorutil: move SendReport to new sentryutil package

The SendReport function has been moved out of the errorutil package
and into a new sentryutil package. This avoids muddling the errorutil
package with a Sentry-specific function, and it breaks errorutil's
dependence on pkg/settings and pkg/util/log/logcrash.

Release note: None

util/pretty: mitigate stack overflows of Pretty

This commit reduces the chance of a stack overflow from recursive calls
of *beExec.be. The Pretty function will now return an internal error
if the recursive depth of be surpasses 10,000.

Informs #91197

Release note: None

@mgartner mgartner requested review from a team as code owners September 11, 2023 20:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach taken here works if we choose to error out when the SQL is sufficiently complex.
But is that the right UX wrt user expectations?

We are looking here at a formatter. If the SQL is sufficiently complex, we still have an option to render the SQL -- using the simple formatter (the one that's not pretty).
What of avoiding an error return in Pretty() and instead have it call .AST.String() if memory was exceeded?

Reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/util/pretty/pretty.go line 119 at r1 (raw file):

}

const maxBeDepth = 100_000

arguably i could see us use 10.000 here (anything above 1000 looks suspicious to me).

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I agree with @knz that we should make an effort to try to render the SQL even if it is slightly ugly.

Reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/cmd/reduce/reduce/reducesql/reducesql.go line 443 at r1 (raw file):

		p, err := cfg.Pretty(stmt)
		if err != nil {
			panic(err)

Maybe instead of panic we should revert to using stmt?


pkg/internal/sqlsmith/sqlsmith.go line 195 at r1 (raw file):

		p, err := prettyCfg.Pretty(stmt)
		if err != nil {
			panic(err)

Same comment here.


pkg/sql/explain_bundle.go line 266 at r1 (raw file):

		b.stmt, err = cfg.Pretty(b.plan.stmt.AST)
		if err != nil {
			return err

If we hit this error, instead of giving up on saving the statement, I propose we copy this logic from the b.plan.stmt.AST == nil case above so that the bundle still has something useful:

		b.stmt = stmtRawSQL
		// If we're collecting a redacted bundle, redact the raw SQL completely.
		if b.flags.RedactValues && b.stmt != "" {
			b.stmt = string(redact.RedactedMarker())
		}

If you want to save the formatting error, could add something like "\n-- failed to format SQL: %v" to the b.stmt.


pkg/util/pretty/pretty.go line 119 at r1 (raw file):

}

const maxBeDepth = 100_000

Personally I would go even smaller (maybe 10k) but this is also fine.


pkg/util/pretty/pretty.go line 125 at r1 (raw file):

	defer func() { b.beDepth-- }()
	if b.beDepth > maxBeDepth {
		return nil, errors.AssertionFailedf("max call stack depth of be exceeded")

Another idea, if @knz doesn't like this, is to panic instead of adding error handling to every stack frame.

@knz
Copy link
Contributor

knz commented Sep 11, 2023

Another idea, if @knz doesn't like this, is to panic instead of adding error handling to every stack frame.

lol, yes I am known to advertise for this type of error handling, which would be super justified here.

Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the UX could improve, but I'd argue this is already a vast improvement. An internal error is strictly better than a crashed node.

IMO it should be up to the caller of Pretty to decide what to do if there is an error, but if you both feel strongly that it should just spit out a "non-pretty" string I can make that change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


pkg/util/pretty/pretty.go line 119 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Personally I would go even smaller (maybe 10k) but this is also fine.

Heh, you two are on the same page today, huh? 😉 Done.


pkg/util/pretty/pretty.go line 119 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

arguably i could see us use 10.000 here (anything above 1000 looks suspicious to me).

Done.


pkg/util/pretty/pretty.go line 125 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Another idea, if @knz doesn't like this, is to panic instead of adding error handling to every stack frame.

A panic here won't be guaranteed to be caught by a recover. I originally wanted to do this but from what I could tell the code path for the explain bundle would not recover from the panic, so it'd end up crashing the node.

@mgartner
Copy link
Collaborator Author

A panic here won't be guaranteed to be caught by a recover. I originally wanted to do this but from what I could tell the code path for the explain bundle would not recover from the panic, so it'd end up crashing the node.

Or are you suggesting to add recover to the Pretty function?

@knz
Copy link
Contributor

knz commented Sep 11, 2023

Or are you suggesting to add recover to the Pretty function?

yes, that. panic-based error handling should never cross package boundaries.

@mgartner mgartner requested a review from a team as a code owner September 12, 2023 14:11
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that. panic-based error handling should never cross package boundaries.

👍 Done.

I added two more commits to avoid some import cycles. PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


pkg/cmd/reduce/reduce/reducesql/reducesql.go line 443 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Maybe instead of panic we should revert to using stmt?

Done.


pkg/internal/sqlsmith/sqlsmith.go line 195 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Same comment here.

Done.


pkg/sql/explain_bundle.go line 266 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If we hit this error, instead of giving up on saving the statement, I propose we copy this logic from the b.plan.stmt.AST == nil case above so that the bundle still has something useful:

		b.stmt = stmtRawSQL
		// If we're collecting a redacted bundle, redact the raw SQL completely.
		if b.flags.RedactValues && b.stmt != "" {
			b.stmt = string(redact.RedactedMarker())
		}

If you want to save the formatting error, could add something like "\n-- failed to format SQL: %v" to the b.stmt.

Done.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 28 of 28 files at r3, 10 of 10 files at r4, 14 of 14 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


-- commits line 18 at r4:
💯 nice


pkg/kv/kvpb/errors.go line 1631 at r3 (raw file):

// StoreDescNotFoundError is reported when a stored descriptor is missing.
type StoreDescNotFoundError struct {

For the sake of error encode/decode, I recommend you keep a single data type (maybe with a boolean var to distinguish node/store); then you will want to add this to the init function:

errorutilpath := reflect.TypeOf(errorutil.XXX_TBD{}).PkgPath()
errors.RegisterTypeMigration(errorutilpath, "*errorutil.descriptorNotFound", &DescNotFoundError)

where XXX_TBD is a sentinel type remaining behind in errorutil to enable retrieving the PkgPath via reflect.


pkg/sql/explain_bundle.go line 261 at r5 (raw file):

		b.stmt, err = cfg.Pretty(b.plan.stmt.AST)
		if err != nil {
			// Use the raw statement string if pretty-printing fails.

I'm not keen on having a blanket ignore of the object in err. This feels like a candidate for errors.Is. If there's any other error later in the Pretty interface that is more "Serious" we won't want that to get ignored.


pkg/sql/show_create_clauses.go line 173 at r5 (raw file):

		}
		query, err = cfg.Pretty(parsed.AST)
		if err != nil {

ditto my previous comment about errors.Is


pkg/sql/show_create_clauses.go line 174 at r5 (raw file):

		query, err = cfg.Pretty(parsed.AST)
		if err != nil {
			log.Warningf(ctx, "error pretty-printing query for view %s (%v): %+v",

What is this warning going to help with? What should an operator do if they see it?


pkg/sql/sem/builtins/builtins.go line 11162 at r5 (raw file):

		p, err := p.Pretty(stmts[idx].AST)
		if err != nil {
			return "", err

I think this is a case where we probably want an alternate representation if this one fails. This code is used when displaying queries in DB Console.


pkg/util/pretty/pretty.go line 141 at r5 (raw file):

	defer func() { b.beDepth-- }()
	if b.beDepth > maxBeDepth {
		panic(errors.AssertionFailedf("max call stack depth of be exceeded"))

possibly a custom error marker here to enable errors.Is.

Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


-- commits line 18 at r4:

Previously, knz (Raphael 'kena' Poss) wrote…

💯 nice

👍


pkg/kv/kvpb/errors.go line 1631 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

For the sake of error encode/decode, I recommend you keep a single data type (maybe with a boolean var to distinguish node/store); then you will want to add this to the init function:

errorutilpath := reflect.TypeOf(errorutil.XXX_TBD{}).PkgPath()
errors.RegisterTypeMigration(errorutilpath, "*errorutil.descriptorNotFound", &DescNotFoundError)

where XXX_TBD is a sentinel type remaining behind in errorutil to enable retrieving the PkgPath via reflect.

Done.


pkg/sql/explain_bundle.go line 261 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I'm not keen on having a blanket ignore of the object in err. This feels like a candidate for errors.Is. If there's any other error later in the Pretty interface that is more "Serious" we won't want that to get ignored.

Great catch. Done.


pkg/sql/show_create_clauses.go line 173 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ditto my previous comment about errors.Is

Done.


pkg/sql/show_create_clauses.go line 174 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

What is this warning going to help with? What should an operator do if they see it?

It wasn't intended for an operator. I thought it might aid us in debugging in the future, but I'm not sure exactly how. I've removed it.


pkg/sql/sem/builtins/builtins.go line 11162 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I think this is a case where we probably want an alternate representation if this one fails. This code is used when displaying queries in DB Console.

Makes sense, done.


pkg/util/pretty/pretty.go line 141 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

possibly a custom error marker here to enable errors.Is.

Done.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 31 files at r6, 9 of 10 files at r7, 15 of 15 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/kv/kvpb/errors.go line 1662 at r8 (raw file):

func (e *DescNotFoundError) SafeFormatError(p errors.Printer) (next error) {
	if e.isStore {
		p.Printf("store descriptor with store ID %d was not found", e.storeID)

nit: maybe

s := redact.SafeString("node")
id := uint64(e.NodeID)
if e.isStore {
  s = "store"
  id = uint64(e.StoreID)
}
p.Printf("%s descriptor with %s ID %d was not found", s, s, id)

(you could even collapse the two ID members into a single integer field)


pkg/sql/sem/builtins/builtins.go line 11161 at r8 (raw file):

	for idx := range stmts {
		p, err := p.Pretty(stmts[idx].AST)
		if err != nil {

errors.Is

Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


pkg/kv/kvpb/errors.go line 1662 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: maybe

s := redact.SafeString("node")
id := uint64(e.NodeID)
if e.isStore {
  s = "store"
  id = uint64(e.StoreID)
}
p.Printf("%s descriptor with %s ID %d was not found", s, s, id)

(you could even collapse the two ID members into a single integer field)

👍 Done.


pkg/sql/sem/builtins/builtins.go line 11161 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

errors.Is

Done.


pkg/util/pretty/pretty.go line 135 at r8 (raw file):

// maxBeDepth is the maximum allowed recursive call depth of be. If the depth
// exceeds this value, be will panic.
const maxBeDepth = 10_000

I raised this back to 100,000 because this test was failing to pretty-round-trip.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo settling on the max size constant.

Reviewed 1 of 19 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/util/pretty/pretty.go line 135 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I raised this back to 100,000 because this test was failing to pretty-round-trip.

I would rather change the test than the constant here. These tests were not principled.

@mgartner
Copy link
Collaborator Author

pkg/util/pretty/pretty.go line 135 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I would rather change the test than the constant here. These tests were not principled.

The test looks like a regression test and it's not a terribly obscure case—it's an insert of ~400 rows. Shouldn't we be able to handle such a statement?

@knz
Copy link
Contributor

knz commented Sep 15, 2023

Shouldn't we be able to handle such a statement?

I would expect so, but not with a recursion depth of 100000. What is the effective recursion depth used for it?

@mgartner
Copy link
Collaborator Author

The depth for that test grows to 19,361.

@mgartner mgartner force-pushed the pretty-stack-overflow-catcher branch 2 times, most recently from 141d2d7 to dfd056c Compare September 15, 2023 19:53
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thank you for fixing this!

Reviewed 8 of 28 files at r3, 6 of 31 files at r6, 25 of 25 files at r9, 11 of 11 files at r10, 15 of 19 files at r11, 1 of 1 files at r12, 3 of 3 files at r13, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @mgartner)


pkg/util/pretty/pretty.go line 135 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The test looks like a regression test and it's not a terribly obscure case—it's an insert of ~400 rows. Shouldn't we be able to handle such a statement?

🤷 I'm ok either way.

(Personally I think Go's default stack limit of 1 GiB is giant but I'm coming from a previous codebase where we could only use 20 KiB stacks, so I'm biased.)

@mgartner mgartner force-pushed the pretty-stack-overflow-catcher branch 3 times, most recently from ef5ae0d to 209b8ef Compare September 19, 2023 14:40
@mgartner
Copy link
Collaborator Author

pkg/util/pretty/pretty.go line 135 at r8 (raw file):

Previously, michae2 (Michael Erickson) wrote…

🤷 I'm ok either way.

(Personally I think Go's default stack limit of 1 GiB is giant but I'm coming from a previous codebase where we could only use 20 KiB stacks, so I'm biased.)

I updated it to 50,000, which allows the test to pass with plenty of buffer and also catches the stack overflow. Let me know what y'all think.

@knz
Copy link
Contributor

knz commented Sep 19, 2023

i'm ok with it.

Errors for missing store and node descriptors have been moved from the
errorutil package to the kvpb package, because the errorutil package is
a low-level package to aid in working with errors in general. It should
not contain facilities for creating specific errors as this muddles the
package and can lead to import cycles.

Release note: None
The `SendReport` function has been moved out of the errorutil package
and into a new sentryutil package. This avoids muddling the errorutil
package with a Sentry-specific function, and it breaks errorutil's
dependence on `pkg/settings` and `pkg/util/log/logcrash`.

Release note: None
This commit reduces the chance of a stack overflow from recursive calls
of `*beExec.be`. The `Pretty` function will now return an internal error
if the recursive depth of `be` surpasses 10,000.

Informs cockroachdb#91197

Release note: None
@mgartner
Copy link
Collaborator Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 19, 2023

Build failed:

@mgartner
Copy link
Collaborator Author

Flaked.

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 20, 2023

Build succeeded:

@craig craig bot merged commit 9cc17d7 into cockroachdb:master Sep 20, 2023
3 checks passed
@mgartner mgartner deleted the pretty-stack-overflow-catcher branch September 20, 2023 19:46
kvoli added a commit to kvoli/cockroach that referenced this pull request Sep 25, 2023
`StoreNotFoundError` and `NodeNotFoundError` errors were moved to the
`kvpb` pkg in cockroachdb#110374. As part of the move, `crdb_internal` functions
which checked if the error were `DescNotFoundError` were also updated so
that node/store not found errors would be recognized e.g.

```
errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{})
```

This didn't work, because the error doesn't match the reference error
variable being given. It does match the type. Update these error
assertions to use `HasType` instead.

Resolves: cockroachdb#111084
Epic: none
Release note: None
craig bot pushed a commit that referenced this pull request Sep 25, 2023
…111232

110967: asim: enable random zone config event generation r=kvoli a=wenyihu6

Previously, zone config event generation used hardcoded span configurations.
This limits our ability to test the allocator more thoroughly.

To improve this, this patch enables random span configs to be generated and
applied as part of the simulation. These configurations are generated by
randomly selecting the primary region, region V.S. zone survival goal, and
leaseholder preference.

```
The following command is now supported:
"rand_events" [cycle_via_random_survival_goals]
```

Part of: #106192
Release Note: none
Epic: none

111192: bulk: allow caller-configurable priority in SSTBatcher r=adityamaru a=stevendanna

This adds the ability for some callers to use a higher admission priority in SSTBatcher. This is helpful for streaming where we want to run at a priority that isn't subject to the elastic admission regime.

Epic: none

Release note: None

111206: kv: fix (store|node) not found err checking r=erikgrinaker a=kvoli

`StoreNotFoundError` and `NodeNotFoundError` errors were moved to the `kvpb` pkg in #110374. As part of the move, `crdb_internal` functions which checked if the error were `DescNotFoundError` were also updated so that node/store not found errors would be recognized e.g.

```
errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{})
```

This didn't work, because the error doesn't match the reference error variable being given. It does match the type. Update these error assertions to use `HasType` instead.

Resolves: #111084
Epic: none
Release note: None

111214: release: fix roachtest artifacts name r=srosenberg a=rail

This fixes the roachtest artifacts directory name.

Epic: none
Release note: None

111217: cloud/azure: Fix azure schemes r=benbardin a=benbardin

Part of: https://cockroachlabs.atlassian.net/browse/CRDB-31120

Release note (bug fix): Fixes azure schemes in storage, kms and external conns.

111223: storage: populate timestamp field in lock table values r=nvanbenschoten a=nvanbenschoten

Informs #109645.

This commit starts writing the `Timestamp` field in lock table `MVCCMetadata` values for shared and exclusive locks. This mirrors the behavior of intent locks. This is not strictly needed, as the timestamp is always equal to `Txn.WriteTimestamp`, but it is cheap to do and helps unify some stats logic, which uses this field to compute "lock age".

Maybe we'll get rid of this for all lock strengths one day...

Release note: None

111230: authors: add xhesikam to authors r=xhesikam a=xhesikam

Release note: None
Epic: None

111231: backupccl: add missing ctx cancel check r=msbutler a=adityamaru

In #111159 we deduced from the stacks a
situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the
writer was not listening for a ctx cancelation.
This manifests as a stuck restore when using
the non-default make simple import spans implementation.

Fixes: #111159
Release note: None

111232: kv: deflake TestLeaseExpirationBasedDrainTransferWithExtension r=nvanbenschoten a=nvanbenschoten

Informs #110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit.

This commit deflakes `TestLeaseExpirationBasedDrainTransferWithExtension` by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30 second `server.time_after_store_suspect`, which is longer than the 5 second `server.shutdown.lease_transfer_wait`. This would cause the test to fail.

Before this patch, the test would fail under stress race in about 8 minutes. Since the patch, it hasn't failed in over 30 minutes.

Release note: None

Co-authored-by: wenyihu6 <wenyi.hu@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: xhesikam <xhesika@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants