roachtest: acceptance/version-upgrade failed #43957

Closed
tbg opened this issue Jan 14, 2020 · 12 comments · Fixed by #44102
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@tbg
Member

tbg commented Jan 14, 2020

I broke the issue filing (fixing in #43956) so these fell off the radar:

https://teamcity.cockroachdb.com/viewLog.html?buildId=1686776&buildTypeId=Cockroach_UnitTests
https://teamcity.cockroachdb.com/viewLog.html?buildId=1687798&buildTypeId=Cockroach_UnitTests

Both are from today and have a node crash of this kind (TL;DR: no inbound stream connection):

That error unfortunately just means "the other node didn't connect to us". Could it be failing to do that because of some version incompatibility that we introduced?

@solongordon I think this must have become a bug when you rewrote this method in f8faf89? Perhaps it's just a bad idea to use DistSQL in a migration: if there's a version bump between the nodes, some old nodes might refuse the inbound connection (or so I imagine this breaks).

F200113 16:42:41.482752 101 server/server.go:1623  [n3] error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:477
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:574
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).queryInternal
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:252
    github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryWithUser
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:269
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.migrateSystemNamespace
    	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:708
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
    	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:573
    github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
    	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1617
    github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2
    	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698
    github.com/cockroachdb/cockroach/pkg/cli.runStart.func3
    	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1357
  - error with embedded safe details: read-deprecated-namespace-table
  - read-deprecated-namespace-table:
  - no inbound stream connection
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.init
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow_registry.go:30
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5222
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.main
    	/usr/local/go/src/runtime/proc.go:190
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1357
failed to run migration "migrate system.namespace_deprecated entries into system.namespace"
github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:574
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1617
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357
goroutine 101 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6ea9c01, 0xed5ae9501, 0x0, 0x47a6ea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x6ea6a60, 0xc000000004, 0x65311ef, 0x10, 0x657, 0xc003d80c00, 0xb52)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:211 +0xa0c
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x47a6da0, 0xc0006b9380, 0x4000000000000004, 0x2, 0x3f66d30, 0x3, 0xc0039a2d38, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x47a6da0, 0xc0006b9380, 0x1, 0x4, 0x3f66d30, 0x3, 0xc0039a2d38, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc0003aa800, 0x47a6da0, 0xc0006bc1e0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1623 +0x2b54
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2(0xc000852120, 0xc0003de118, 0xc0001f8060, 0x47a6da0, 0xc0006bc1e0, 0x0, 0x2bbd3c01, 0xed5ae94f6, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698 +0x107
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3(0xc0003de118, 0x47a6da0, 0xc0006bc1e0, 0x480d4e0, 0xc00027a580, 0xc000852120, 0xc0001f8060, 0x0, 0x2bbd3c01, 0xed5ae94f6, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814 +0x181
created by github.com/cockroachdb/cockroach/pkg/cli.runStart
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:654 +0x9d8

@tbg tbg added C-test-failure Broken test (automatically or manually discovered). O-roachtest labels Jan 14, 2020
@solongordon
Contributor

solongordon commented Jan 14, 2020

Interesting. I would expect that if there were a version incompatibility, the new node wouldn't even attempt to plan a DistSQL query on the incompatible nodes. It looks like the older node is failing to connect to the newer one, but it's not clear from the logs why:

I200113 16:42:31.478241 7675 sql/flowinfra/outbox.go:230  [n1] outbox: connection dial error: initial connection heartbeat failed:
  - rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26261: connect: connection refused"
failed to connect to n3 at 127.0.0.1:26261
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:169
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).DialNoBreaker
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:105
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*Outbox).mainLoop
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/outbox.go:225
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*Outbox).run
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/outbox.go:429
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337

I suppose I'll try disabling DistSQL for the migration query to see whether that makes the sporadic roachtest failures go away, but it would be nice to understand why the connection fails.

@tbg
Member Author

tbg commented Jan 14, 2020

The nodes cycle back and forth between binary versions in this test, so possibly the planning node still considers the other node to be running a binary that supports the new functionality, even though that node has cycled back to the old binary by the time the plan is actually executed.

This couldn't happen if DistSQL used the active cluster version on remote nodes to decide whether to include them in planning (since that version can only ever increase). But maybe it isn't smart like that? cc @andreimatei
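The monotonicity argument can be sketched as follows. This is an illustrative toy, not CockroachDB's actual planner code: because the active cluster version only ever ratchets forward, gating plan participation on it would be robust to individual binaries rolling back.

```go
package main

import "fmt"

// canIncludeInPlan is a hypothetical gate: include a remote node in a
// distributed plan only if the active cluster version (which is
// monotonic and never decreases, even when a binary restarts into an
// older version) has reached the version the plan feature requires.
func canIncludeInPlan(activeClusterVersion, requiredVersion int) bool {
	return activeClusterVersion >= requiredVersion
}

func main() {
	// With version 20 active, a feature requiring 19 is safe to plan,
	// but one requiring 21 is not.
	fmt.Println(canIncludeInPlan(20, 19))
	fmt.Println(canIncludeInPlan(20, 21))
}
```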

@solongordon
Contributor

DistSQL uses the DistSQLVersion to determine which nodes should be included in the plan. It doesn't look like this has changed between 19.2 and HEAD.
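The DistSQLVersion-based node selection described above can be sketched like this. A minimal illustration with made-up names, not the actual planner code: the gateway accepts a version range and simply skips nodes outside it, letting their work fall back to the gateway.

```go
package main

import "fmt"

// nodeInfo is an illustrative stand-in for a node's advertised DistSQL
// version (the real planner consults gossiped server info).
type nodeInfo struct {
	id      int
	version int
}

// compatibleNodes keeps only nodes whose DistSQLVersion lies within
// [minAccepted, current]; excluded nodes are not given flows, and their
// work is planned on the gateway instead.
func compatibleNodes(nodes []nodeInfo, minAccepted, current int) []int {
	var out []int
	for _, n := range nodes {
		if n.version >= minAccepted && n.version <= current {
			out = append(out, n.id)
		}
	}
	return out
}

func main() {
	nodes := []nodeInfo{{1, 28}, {2, 28}, {3, 27}}
	// Gateway at version 28, accepting nothing older than 28:
	// node 3 is excluded from the plan.
	fmt.Println(compatibleNodes(nodes, 28, 28))
}
```

Note that this check is about plan-protocol compatibility, not about whether the remote process is reachable at plan time, which is why it cannot catch the failure seen in this issue.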

@tbg
Member Author

tbg commented Jan 14, 2020

OK, so DistSQL isn't equipped to handle the fact that "newer" binaries may restart into the old binary again. Unfortunately, I looked, and this test isn't actually doing that. Could an old node be trying to run the migration on a newer node? That seems unlikely. Looks like someone from DistSQL-land needs to look into this further.

@andreimatei
Contributor

FWIW, I got one failure in 23 runs with

bin/roachtest run --parallelism=10 --count=30 acceptance/version-upgrade --roachprod=bin/roachprod

@tbg
Member Author

tbg commented Jan 15, 2020

#44005 is where the official test runner tracks this now.

@andreimatei
Contributor

Solon, I'll leave this in your hands.

@andreimatei andreimatei added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Jan 15, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 15, 2020
Very flaky, apparently because of some problem with a recent migration.
Touches cockroachdb#43957, cockroachdb#44005

Release note: None
@andreimatei
Contributor

EnsureMigrations runs before the DistSQL server is hooked up to the socket:

netutil.FatalIfUnexpected(s.grpc.Serve(anyL))

So DistSQL RPCs can't be used during migrations.

@andreimatei
Contributor

I'll try to fix it.

craig bot pushed a commit that referenced this issue Jan 15, 2020
43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich

We would always Get and then Set a value while Append'ing without paying
attention to whether the value is actually NULL. This can lead to
problems in case of flat bytes if the necessary invariant is
unmaintained. Now this is fixed by explicitly enforcing the invariant.
Additionally, this commit ensures that the destination slice has the
desired capacity before appending one value at a time (in case of
a present selection vector).

I tried an approach that pays attention to whether the value is NULL
before appending it and saw a significant performance hit, so I think
this approach is the lesser evil.

Fixes: #42774.

Release note: None

43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea

It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback to be the final steps of
Resume. This should help ensure that these stages are run once and only
once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei

Very flaky, apparently because of some problem with a recent migration.
Touches #43957, #44005

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@andreimatei
Contributor

I fooled myself into thinking I understood the initialization order. The gRPC server is started before the migrations run, but perhaps it's not hooked up to the mux yet?
I'm no longer sure that the DistSQL server not having started is the problem. I'll look more tomorrow.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 17, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101
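The shape of the fix in that patch can be sketched as follows, with hypothetical names (the real change pins the migration queries' session settings to local execution; the type and field here are illustrative, not CockroachDB's actual API):

```go
package main

import "fmt"

// sessionData is an illustrative stand-in for per-session execution
// settings; distSQLMode mirrors the "auto"/"on"/"off" session setting.
type sessionData struct {
	distSQLMode string
}

// forMigration returns a copy of the session data with distribution
// disabled, so a migration query can never plan flows on remote nodes,
// regardless of the cluster default.
func forMigration(sd sessionData) sessionData {
	sd.distSQLMode = "off"
	return sd
}

func main() {
	sd := sessionData{distSQLMode: "auto"}
	fmt.Println(forMigration(sd).distSQLMode)
}
```

Local-only execution sidesteps both the startup-ordering problem and the general fragility of distributed queries that the commit message calls out.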
@andreimatei
Contributor

Root cause in #44101
Narrow fix in #44102

@tbg
Member Author

tbg commented Jan 17, 2020 via email

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 18, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 21, 2020
@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
craig bot pushed a commit that referenced this issue Jan 23, 2020
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei


Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in e12735f Jan 23, 2020