Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: v21.2.7: nil pointer in gRPC transport when sending snapshots #81227

Closed
cockroach-teamcity opened this issue May 13, 2022 · 14 comments · Fixed by #88470 or #93867
Closed

kvserver: v21.2.7: nil pointer in gRPC transport when sending snapshots #81227

cockroach-teamcity opened this issue May 13, 2022 · 14 comments · Fixed by #88470 or #93867
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 13, 2022

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3269587070/?referrer=webhooks_plugin

Panic message:

panic.go:965: runtime error: invalid memory address or nil pointer dereference
--
runtime.errorString
panic.go:965: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets):

/usr/local/go/src/runtime/panic.go#L964-L966 in runtime.gopanic
/usr/local/go/src/runtime/panic.go#L211-L213 in runtime.panicmem
/usr/local/go/src/runtime/signal_unix.go#L733-L735 in runtime.sigpanic
https://github.com/cockroachdb/cockroach/blob/37dee546a7c52870a8dc58826f0cffe2afa8d47a/vendor/google.golang.org/grpc/internal/transport/transport.go#L308-L310 in google.golang.org/grpc/internal/transport.(*Stream).compareAndSwapState
https://github.com/cockroachdb/cockroach/blob/37dee546a7c52870a8dc58826f0cffe2afa8d47a/vendor/google.golang.org/grpc/internal/transport/http2_client.go#L927-L929 in google.golang.org/grpc/internal/transport.(*http2Client).Write
https://github.com/cockroachdb/cockroach/blob/37dee546a7c52870a8dc58826f0cffe2afa8d47a/vendor/google.golang.org/grpc/stream.go#L845-L847 in google.golang.org/grpc.(*clientStream).CloseSend.func1
https://github.com/cockroachdb/cockroach/blob/37dee546a7c52870a8dc58826f0cffe2afa8d47a/vendor/google.golang.org/grpc/stream.go#L661-L663 in google.golang.org/grpc.(*clientStream).withRetry
https://github.com/cockroachdb/cockroach/blob/37dee546a7c52870a8dc58826f0cffe2afa8d47a/vendor/google.golang.org/grpc/stream.go#L852-L854 in google.golang.org/grpc.(*clientStream).CloseSend

defer func() {
if err := stream.CloseSend(); err != nil {
log.Warningf(ctx, "failed to close snapshot stream: %+v", err)
in pkg/kv/kvserver.(*RaftTransport).SendSnapshot.func1
}()
return sendSnapshot(
ctx, t.st, stream, storePool, header, snap, newBatch, sent,
in pkg/kv/kvserver.(*RaftTransport).SendSnapshot
ctx, "send-snapshot", sendSnapshotTimeout, func(ctx context.Context) error {
return r.store.cfg.Transport.SendSnapshot(
ctx,
in pkg/kv/kvserver.(*Replica).sendSnapshot.func4
defer cancel()
err := fn(ctx)
if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
in pkg/util/contextutil.RunWithTimeout
}
err = contextutil.RunWithTimeout(
ctx, "send-snapshot", sendSnapshotTimeout, func(ctx context.Context) error {
in pkg/kv/kvserver.(*Replica).sendSnapshot
// these, it would be susceptible to future similar issues.
if err := r.sendSnapshot(ctx, rDesc, SnapshotRequest_INITIAL, priority); err != nil {
return nil, err
in pkg/kv/kvserver.(*Replica).initializeRaftLearners
var err error
desc, err = r.initializeRaftLearners(
ctx, desc, priority, reason, details, adds, roachpb.LEARNER,
in pkg/kv/kvserver.(*Replica).changeReplicasImpl
// queue is active.
if _, err := repl.changeReplicasImpl(ctx, desc, priority, reason, details, chgs); err != nil {
return err
in pkg/kv/kvserver.(*replicateQueue).changeReplicas
if err := rq.changeReplicas(
ctx,
in pkg/kv/kvserver.(*replicateQueue).considerRebalance
case AllocatorConsiderRebalance:
return rq.considerRebalance(ctx, repl, voterReplicas, nonVoterReplicas, canTransferLeaseFrom, dryRun)
case AllocatorFinalizeAtomicReplicationChange:
in pkg/kv/kvserver.(*replicateQueue).processOneChange
for {
requeue, err := rq.processOneChange(ctx, repl, rq.canTransferLeaseFrom, false /* dryRun */)
if isSnapshotError(err) {
in pkg/kv/kvserver.(*replicateQueue).process
realRepl, _ := repl.(*Replica)
processed, err := bq.impl.process(ctx, realRepl, confReader)
if err != nil {
in pkg/kv/kvserver.(*baseQueue).processReplica.func1
defer cancel()
err := fn(ctx)
if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
in pkg/util/contextutil.RunWithTimeout
defer span.Finish()
return contextutil.RunWithTimeout(ctx, fmt.Sprintf("%s queue process replica %d", bq.name, repl.GetRangeID()),
bq.processTimeoutFunc(bq.store.ClusterSettings(), repl), func(ctx context.Context) error {
in pkg/kv/kvserver.(*baseQueue).processReplica
start := timeutil.Now()
err := bq.processReplica(ctx, repl)
in pkg/kv/kvserver.(*baseQueue).processLoop.func2.1
f(ctx)
}()
in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
/usr/local/go/src/runtime/asm_amd64.s#L1370-L1372 in runtime.goexit

/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 965
/usr/local/go/src/runtime/panic.go in runtime.panicmem at line 212
/usr/local/go/src/runtime/signal_unix.go in runtime.sigpanic at line 734
vendor/google.golang.org/grpc/internal/transport/transport.go in google.golang.org/grpc/internal/transport.(*Stream).compareAndSwapState at line 309
vendor/google.golang.org/grpc/internal/transport/http2_client.go in google.golang.org/grpc/internal/transport.(*http2Client).Write at line 928
vendor/google.golang.org/grpc/stream.go in google.golang.org/grpc.(*clientStream).CloseSend.func1 at line 846
vendor/google.golang.org/grpc/stream.go in google.golang.org/grpc.(*clientStream).withRetry at line 662
vendor/google.golang.org/grpc/stream.go in google.golang.org/grpc.(*clientStream).CloseSend at line 853
pkg/kv/kvserver/raft_transport.go in pkg/kv/kvserver.(*RaftTransport).SendSnapshot.func1 at line 691
pkg/kv/kvserver/raft_transport.go in pkg/kv/kvserver.(*RaftTransport).SendSnapshot at line 695
pkg/kv/kvserver/replica_command.go in pkg/kv/kvserver.(*Replica).sendSnapshot.func4 at line 2525
pkg/util/contextutil/context.go in pkg/util/contextutil.RunWithTimeout at line 89
pkg/kv/kvserver/replica_command.go in pkg/kv/kvserver.(*Replica).sendSnapshot at line 2523
pkg/kv/kvserver/replica_command.go in pkg/kv/kvserver.(*Replica).initializeRaftLearners at line 1680
pkg/kv/kvserver/replica_command.go in pkg/kv/kvserver.(*Replica).changeReplicasImpl at line 1049
pkg/kv/kvserver/replicate_queue.go in pkg/kv/kvserver.(*replicateQueue).changeReplicas at line 1363
pkg/kv/kvserver/replicate_queue.go in pkg/kv/kvserver.(*replicateQueue).considerRebalance at line 1124
pkg/kv/kvserver/replicate_queue.go in pkg/kv/kvserver.(*replicateQueue).processOneChange at line 482
pkg/kv/kvserver/replicate_queue.go in pkg/kv/kvserver.(*replicateQueue).process at line 304
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processReplica.func1 at line 959
pkg/util/contextutil/context.go in pkg/util/contextutil.RunWithTimeout at line 89
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processReplica at line 918
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processLoop.func2.1 at line 838
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 442
/usr/local/go/src/runtime/asm_amd64.s in runtime.goexit at line 1371
Tag Value
Cockroach Release v21.2.7
Cockroach SHA: 37dee54
Platform linux amd64
Distribution CCL
Environment v21.2.7
Command server
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-15285

@cockroach-teamcity cockroach-teamcity added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report. labels May 13, 2022
@yuzefovich
Copy link
Member

Huh, nil pointer in the gRPC transport. Probably unactionable.

@yuzefovich yuzefovich changed the title sentry: panic.go:965: runtime error: invalid memory address or nil pointer dereference -- runtime.errorString panic.go:965: *withstack.withStack (top exception) kvserver: v21.2.7: nil pointerin in gRPC transport when sending snapshots May 16, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jun 9, 2022

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

FWIW, thought this might be related to gRPC issues that Andrei has been looking at over in #80878, but doesn't seem to be the case (those were caused by calls to ClientStream.Context() that were also addressed upstream in grpc/grpc-go#5323).

@pav-kv pav-kv changed the title kvserver: v21.2.7: nil pointerin in gRPC transport when sending snapshots kvserver: v21.2.7: nil pointer in in gRPC transport when sending snapshots Sep 22, 2022
@pav-kv pav-kv changed the title kvserver: v21.2.7: nil pointer in in gRPC transport when sending snapshots kvserver: v21.2.7: nil pointer in gRPC transport when sending snapshots Sep 22, 2022
@pav-kv
Copy link
Collaborator

pav-kv commented Sep 22, 2022

It looks like the attempt field is nil despite the promise in this comment:

https://github.com/cockroachdb/vendored/blob/78199fff2f3cf02d4cb350424d731d9d256a11e9/google.golang.org/grpc/stream.go#L471-L478

UPD: More likely attempt.s is nil.

@pav-kv
Copy link
Collaborator

pav-kv commented Sep 22, 2022

This seems to be fixed in grpc/grpc-go#5323.

@pav-kv
Copy link
Collaborator

pav-kv commented Sep 22, 2022

There are no recent re-occurrences of this panic in Sentry.

@pav-kv pav-kv closed this as completed Sep 22, 2022
@pav-kv
Copy link
Collaborator

pav-kv commented Sep 22, 2022

Although it doesn't look like we picked up this change. Reopening to investigate further.

@pav-kv pav-kv reopened this Sep 22, 2022
@pav-kv
Copy link
Collaborator

pav-kv commented Sep 22, 2022

Need to bump grpc from 1.46 to 1.47.

pav-kv added a commit to pav-kv/cockroach that referenced this issue Sep 22, 2022
Fixes cockroachdb#81227

Release note: upgrade grpc from v1.46.0 to v.1.47.0 which fixes a rare bug that
hit us with panic on a nil pointer.
craig bot pushed a commit that referenced this issue Sep 22, 2022
88454: ui: insights transaction details support multiple blocking transactions r=j82w a=j82w

This adds support for multiple blocking transactions for a single waiting transaction. The cards were merged into the table, and the data was piped through to show multiple rows. The total contention time was also fixed to aggregate the contention time instead of just picking the latest.

before:
https://loom.com/share/0384ed937a344e2fb0105fefbc313acb

after:
https://www.loom.com/share/78e906f50a694cd59ac893ddb9c2239a

closes #88264

Release justification: Category 2: Bug fixes and
low-risk updates to new functionality

Release note: (ui change): Add support for multiple
 blocking transaction on insights transaction
 details page. Merged the cards into the table,
 and fixed the total contention time.

88470: *: upgrade grpc to v1.47.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81227

Release note: upgrade grpc from v1.46.0 to v.1.47.0 which fixes a subtle bug
causing panic on a nil pointer.

88477: keys: mark 49 as reserved r=ajwerner a=ajwerner

Release note: None

88496: persistedsqlstats: speed up a test r=yuzefovich a=yuzefovich

Previously, a single unit test could take on the order of 4 minutes (or even exceed 5 minute timeout, rarely) because the job monitor checks whether a cluster setting has been updated only every minute, and we update the cluster setting twice in a unit test. This commit makes it so that in a testing setup the check happens every second.

Release note: None

88499: bazel: upgrade `rules_go` r=rail a=healthy-pod

Pull in cockroachdb/rules_go#8.

Closes #88048

Release note: None

Co-authored-by: j82w <jwilley@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
@craig craig bot closed this as completed in a2089f9 Sep 22, 2022
@erikgrinaker
Copy link
Contributor

Reopening, since we had to roll back to 1.46.0 due to other test failures (#88745).

@erikgrinaker erikgrinaker reopened this Sep 27, 2022
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Sep 27, 2022

@nvanbenschoten I'm handing this over to KV, since it's a general bug in gRPC which requires an upgrade. We should try to get this in early in the 23.1 cycle, and fix the issues that motivated #88745.

@yuzefovich
Copy link
Member

I also run into the panic in the gRPC that was fixed upstream in grpc/grpc-go#5323. Looking forward to getting the gRPC dependency bumped :)

@erikgrinaker
Copy link
Contributor

I'll pick this up again.

@blathers-crl
Copy link

blathers-crl bot commented Nov 15, 2022

cc @cockroachdb/replication

@blathers-crl
Copy link

blathers-crl bot commented Dec 18, 2022

cc @cockroachdb/replication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.
Projects
None yet
5 participants