Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexec: v20.2.2: index out of bounds when Getting from Bytes vector #57297

Closed
cockroach-teamcity opened this issue Dec 1, 2020 · 4 comments · Fixed by #59028
Closed

colexec: v20.2.2: index out of bounds when Getting from Bytes vector #57297

cockroach-teamcity opened this issue Dec 1, 2020 · 4 comments · Fixed by #59028
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.

Comments

@cockroach-teamcity
Copy link
Member

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/2066108371/?referrer=webhooks_plugin

Panic message:

error.go:90: unexpected error from the vectorized engine: ×
--
*barriers.barrierError
*errutil.withPrefix: unexpected error from the vectorized engine (1)
error.go:90: *withstack.withStack (top exception)
*assert.withAssertionFailure
*contexttags.withContext: n2 (2)
*contexttags.withContext: n1 (3)
*colexecerror.notInternalError
inbox.go:248: *withstack.withStack (4)
(check the extra data payloads)

Stacktrace (expand for inline code snippets):

i.close()
colexecerror.InternalError(log.PanicAsError(0, err))
}
in pkg/sql/colflow/colrpc.(*Inbox).Next.func1
/usr/local/go/src/runtime/panic.go#L678-L680 in runtime.gopanic
func ExpectedError(err error) {
panic(newNotInternalError(err))
}
in pkg/sql/colexecbase/colexecerror.ExpectedError
// DrainMeta.
colexecerror.ExpectedError(meta.Err)
}
in pkg/sql/colflow/colrpc.(*Inbox).Next
func (n *noopOperator) Next(ctx context.Context) coldata.Batch {
return n.input.Next(ctx)
}
in pkg/sql/colexec.(*noopOperator).Next
// Get a fresh batch.
m.batch = m.input.Next(m.Ctx)
if m.batch.Length() == 0 {
in pkg/sql/colexec.(*Materializer).next
func (m *Materializer) nextAdapter() {
m.outputRow = m.next()
}
in pkg/sql/colexec.(*Materializer).nextAdapter
}()
operation()
return retErr
in pkg/sql/colexecbase/colexecerror.CatchVectorizedRuntimeError
for m.State == execinfra.StateRunning {
if err := colexecerror.CatchVectorizedRuntimeError(m.nextAdapter); err != nil {
m.MoveToDraining(err)
in pkg/sql/colexec.(*Materializer).Next
for {
row, meta := src.Next()
// Emit the row; stop if no more rows are needed.
in pkg/sql/execinfra.Run
ctx = pb.self.Start(ctx)
Run(ctx, pb.self, pb.Out.output)
}
in pkg/sql/execinfra.(*ProcessorBase).Run
}
headProc.Run(ctx)
return nil
in pkg/sql/flowinfra.(*FlowBase).Run
// TODO(radu): this should go through the flow scheduler.
if err := flow.Run(ctx, func() {}); err != nil {
log.Fatalf(ctx, "unexpected error from syncFlow.Start(): %v\n"+
in pkg/sql.(*DistSQLPlanner).Run
recv.expectedRowsRead = int64(physPlan.TotalEstimatedScannedRows)
return dsp.Run(planCtx, txn, physPlan, recv, evalCtx, nil /* finishedSetupFn */)
}
in pkg/sql.(*DistSQLPlanner).PlanAndRun
// the planner whether or not to plan remote table readers.
cleanup := ex.server.cfg.DistSQLPlanner.PlanAndRun(
ctx, evalCtx, planCtx, planner.txn, planner.curPlan.main, recv,
in pkg/sql.(*connExecutor).execWithDistSQLEngine
ex.sessionTracing.TraceExecStart(ctx, "distributed")
stats, err := ex.execWithDistSQLEngine(
ctx, planner, stmt.AST.StatementType(), res, distributePlan.WillDistribute(), progAtomic,
in pkg/sql.(*connExecutor).dispatchToExecutionEngine
p.autoCommit = os.ImplicitTxn.Get() && !ex.server.cfg.TestingKnobs.DisableAutoCommit
if err := ex.dispatchToExecutionEngine(ctx, p, res); err != nil {
return nil, nil, err
in pkg/sql.(*connExecutor).execStmtInOpenState
} else {
ev, payload, err = ex.execStmtInOpenState(ctx, stmt, res, pinfo)
}
in pkg/sql.(*connExecutor).execStmt
if !portal.exhausted {
ev, payload, err = ex.execStmt(stmtCtx, curStmt, stmtRes, pinfo)
// Portal suspension is supported via a "side" state machine
in pkg/sql.(*connExecutor).execPortal
res = stmtRes
ev, payload, err = ex.execPortal(ctx, portal, portalName, stmtRes, pinfo)
return err
in pkg/sql.(*connExecutor).execCmd.func2
return err
}()
// Note: we write to ex.statsCollector.phaseTimes, instead of ex.phaseTimes,
in pkg/sql.(*connExecutor).execCmd
var err error
if err = ex.execCmd(ex.Ctx()); err != nil {
if errors.IsAny(err, io.EOF, errDrainingComplete) {
in pkg/sql.(*connExecutor).run
}()
return h.ex.run(ctx, s.pool, reserved, cancel)
}
in pkg/sql.(*Server).ServeConn
reservedOwned = false // We're about to pass ownership away.
retErr = sqlServer.ServeConn(ctx, connHandler, reserved, cancelConn)
}()
in pkg/sql/pgwire.(*conn).processCommandsAsync.func1
/usr/local/go/src/runtime/asm_amd64.s#L1356-L1358 in runtime.goexit

pkg/sql/colflow/colrpc/inbox.go in pkg/sql/colflow/colrpc.(*Inbox).Next.func1 at line 248
/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 679
pkg/sql/colexecbase/colexecerror/error.go in pkg/sql/colexecbase/colexecerror.ExpectedError at line 186
pkg/sql/colflow/colrpc/inbox.go in pkg/sql/colflow/colrpc.(*Inbox).Next at line 289
pkg/sql/colexec/operator.go in pkg/sql/colexec.(*noopOperator).Next at line 238
pkg/sql/colexec/materializer.go in pkg/sql/colexec.(*Materializer).next at line 222
pkg/sql/colexec/materializer.go in pkg/sql/colexec.(*Materializer).nextAdapter at line 247
pkg/sql/colexecbase/colexecerror/error.go in pkg/sql/colexecbase/colexecerror.CatchVectorizedRuntimeError at line 93
pkg/sql/colexec/materializer.go in pkg/sql/colexec.(*Materializer).Next at line 253
pkg/sql/execinfra/base.go in pkg/sql/execinfra.Run at line 170
pkg/sql/execinfra/processorsbase.go in pkg/sql/execinfra.(*ProcessorBase).Run at line 775
pkg/sql/flowinfra/flow.go in pkg/sql/flowinfra.(*FlowBase).Run at line 392
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).Run at line 422
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).PlanAndRun at line 1002
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execWithDistSQLEngine at line 1001
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).dispatchToExecutionEngine at line 872
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmtInOpenState at line 639
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmt at line 114
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execPortal at line 203
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd.func2 at line 1533
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd at line 1535
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).run at line 1391
pkg/sql/conn_executor.go in pkg/sql.(*Server).ServeConn at line 508
pkg/sql/pgwire/conn.go in pkg/sql/pgwire.(*conn).processCommandsAsync.func1 at line 626
/usr/local/go/src/runtime/asm_amd64.s in runtime.goexit at line 1357
Tag Value
Cockroach Release v20.2.2
Cockroach SHA: 92d9495
Platform linux amd64
Distribution CCL
Environment v20.2.2
Command server
Go Version ``
# of CPUs
# of Goroutines
@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels Dec 1, 2020
@yuzefovich yuzefovich changed the title sentry: error.go:90: unexpected error from the vectorized engine: × -- *barriers.barrierError *errutil.withPrefix: unexpected error from the vectorized engine (1) error.go:90: *withstack.withStack (top exception) *assert.withAssertionFailure *contexttags.withContext: n2 (2) *contexttags.withContext: n1 (3) *colexecerror.notInternalError inbox.go:248: *withstack.withStack (4) (check the extra data payloads) colflow: v20.2.2: wrongfully propagating expected error as an internal one Dec 1, 2020
@yuzefovich
Copy link
Member

My initial interpretation of the root cause of the issue is wrong - the actual error has the following stack trace:

error.go:90 (func1): *barriers.barrierError: unexpected error from the vectorized engine: ×
via *withstack.withStack
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go", line 90, in func1
  File "/usr/local/go/src/runtime/panic.go", line 679, in gopanic
  File "/usr/local/go/src/runtime/panic.go", line 103, in goPanicSliceB
  File "/go/src/github.com/cockroachdb/cockroach/pkg/col/coldata/bytes.go", line 110, in Get
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/selection_ops.eg.go", line 379, in Next
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/selection_ops.eg.go", line 5996, in Next
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/simple_project.go", line 124, in Next
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/simple_project.go", line 124, in Next
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/deselector.go", line 61, in Next
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go", line 240, in func1
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go", line 93, in CatchVectorizedRuntimeError
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go", line 232, in sendBatches
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go", line 316, in runWithStream
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go", line 180, in Run
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go", line 595, in func1
  File "/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go", line 1204, in 1
  File "/usr/local/go/src/runtime/asm_amd64.s", line 1357, in goexit

so we have an index out of bounds panic when getting from a Bytes vector on a remote node, and then the gateway node propagates that error further (the stack trace printed above).

I now believe that our error propagation in the inbox works correctly, and we don't have much to go on with the Bytes.Get issue, so I think we should close the issue as unactionable. cc @asubiotto @jordanlewis

@yuzefovich yuzefovich changed the title colflow: v20.2.2: wrongfully propagating expected error as an internal one colexec: v20.2.2: index out of bounds when Getting from Bytes vector Dec 8, 2020
@asubiotto
Copy link
Contributor

Fair enough, but isn't there still a problem in the inbox if we propagate an ExpectedError, recover, and then propagate that panic as an InternalError in the deferred func?

@yuzefovich
Copy link
Member

No, I think the logic works correctly. With ExpectedError we wrap err object with notInternalError wrapper, then we catch it with recover and propagate it further, unchanged, as InternalError (which will be essentially panic(notInternalError(err))), and this is exactly what we want.

@asubiotto
Copy link
Contributor

Ah right, because we don't want to wrap it with anything else. Gotcha. I agree there's nothing to do here.

craig bot pushed a commit that referenced this issue Jan 19, 2021
59028: coldata: fix Bytes invariant in some cases r=yuzefovich a=yuzefovich

**execgen: remove SLICE method**

`execgen.SLICE` directive is used only in one place, and we can use
`execgen.WINDOW` there instead (which will have the same effect).

Release note: None

**coldata: fix updating offsets of bytes in Batch.SetLength**

In `SetLength` method we are maintaining the invariant of `Bytes`
vectors that the offsets are non-decreasing sequences. Previously, this
was done incorrectly when a selection vector is present on the batch
which could lead to out of bounds errors (caught by our panic-catcher)
some time later. This is now fixed by correctly paying attention to the
selection vector.

I neither can easily come up with an example query that would trigger
this condition nor can I prove that it won't occur, but I think we have
seen a single sentry report that could be explained by this bug, so I
think it's worth backporting.

Additionally, this commit uses the assumption that the selection vectors
are increasing sequences in order to calculate the largest index
accessed by the batch.

Fixes: #57297.

Release note (bug fix): Previously, CockroachDB could encounter an
internal error when executing queries with BYTES or STRING types via the
vectorized engine in rare circumstances, and now this is fixed.

59127: importccl: skip BenchmarkUserFileImport r=RaduBerinde a=RaduBerinde

This benchmark causes repeated CI failures.

Informs #59126.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants