roachtest: acceptance/version-upgrade failed #43957

Closed
tbg opened this issue Jan 14, 2020 · 12 comments · Fixed by #44102
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@tbg
Member

tbg commented Jan 14, 2020

I broke the issue filing (fixing in #43956) so these fell off the radar:

https://teamcity.cockroachdb.com/viewLog.html?buildId=1686776&buildTypeId=Cockroach_UnitTests
https://teamcity.cockroachdb.com/viewLog.html?buildId=1687798&buildTypeId=Cockroach_UnitTests

Both are from today and have a node crash of this kind (TL;DR: no inbound stream connection):

That error unfortunately just means "the other node didn't connect to us". Could it be failing to do that because of some version incompatibility that we introduced?

@solongordon I think this must have become a bug when you rewrote this method in f8faf89? Perhaps it's just a bad idea to use DistSQL in a migration: if there's a version bump between the nodes, some old nodes might refuse the inbound connection (or so I imagine this breaks).

F200113 16:42:41.482752 101 server/server.go:1623  [n3] error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:477
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:574
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).queryInternal
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:252
    github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryWithUser
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:269
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.migrateSystemNamespace
    	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:708
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
    	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:573
    github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
    	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1617
    github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2
    	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698
    github.com/cockroachdb/cockroach/pkg/cli.runStart.func3
    	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1357
  - error with embedded safe details: read-deprecated-namespace-table
  - read-deprecated-namespace-table:
  - no inbound stream connection
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.init
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow_registry.go:30
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5222
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.doInit
    	/usr/local/go/src/runtime/proc.go:5217
    runtime.main
    	/usr/local/go/src/runtime/proc.go:190
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1357
failed to run migration "migrate system.namespace_deprecated entries into system.namespace"
github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
	/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:574
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1617
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357
goroutine 101 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6ea9c01, 0xed5ae9501, 0x0, 0x47a6ea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x6ea6a60, 0xc000000004, 0x65311ef, 0x10, 0x657, 0xc003d80c00, 0xb52)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:211 +0xa0c
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x47a6da0, 0xc0006b9380, 0x4000000000000004, 0x2, 0x3f66d30, 0x3, 0xc0039a2d38, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x47a6da0, 0xc0006b9380, 0x1, 0x4, 0x3f66d30, 0x3, 0xc0039a2d38, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc0003aa800, 0x47a6da0, 0xc0006bc1e0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1623 +0x2b54
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2(0xc000852120, 0xc0003de118, 0xc0001f8060, 0x47a6da0, 0xc0006bc1e0, 0x0, 0x2bbd3c01, 0xed5ae94f6, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698 +0x107
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3(0xc0003de118, 0x47a6da0, 0xc0006bc1e0, 0x480d4e0, 0xc00027a580, 0xc000852120, 0xc0001f8060, 0x0, 0x2bbd3c01, 0xed5ae94f6, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:814 +0x181
created by github.com/cockroachdb/cockroach/pkg/cli.runStart
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:654 +0x9d8

@tbg tbg added C-test-failure Broken test (automatically or manually discovered). O-roachtest labels Jan 14, 2020
@solongordon
Contributor

solongordon commented Jan 14, 2020

Interesting. I would expect that if there were a version incompatibility, the new node wouldn't even attempt to plan a DistSQL query on the incompatible nodes. It looks like the older node is failing to connect to the newer one, but it's not clear from the logs why:

I200113 16:42:31.478241 7675 sql/flowinfra/outbox.go:230  [n1] outbox: connection dial error: initial connection heartbeat failed:
  - rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26261: connect: connection refused"
failed to connect to n3 at 127.0.0.1:26261
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:169
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).DialNoBreaker
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:105
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*Outbox).mainLoop
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/outbox.go:225
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*Outbox).run
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/outbox.go:429
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337

I suppose I'll try disabling DistSQL for the migration query to see whether that makes the sporadic roachtest failures go away, but it would be nice to understand why the connection fails.

@tbg
Member Author

tbg commented Jan 14, 2020

The nodes cycle back and forth between binary versions in this test, so possibly the planning node still considers the other node to be running a binary that supports the new functionality, even though that node has cycled back to the old binary by the time the plan is actually executed.

This couldn't happen if DistSQL used the active cluster version on remote nodes to decide whether to include them in planning (since that version can only ever increase). But maybe it isn't smart like that? cc @andreimatei
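The monotonicity argument can be sketched as follows. This is an illustrative toy, not CockroachDB's actual planner code: because the active cluster version only ever ratchets forward, gating plan participation on it would be robust to individual binaries rolling back.

```go
package main

import "fmt"

// canIncludeInPlan is a hypothetical gate: include a remote node in a
// distributed plan only if the active cluster version (which is
// monotonic and never decreases, even when a binary restarts into an
// older version) has reached the version the plan feature requires.
func canIncludeInPlan(activeClusterVersion, requiredVersion int) bool {
	return activeClusterVersion >= requiredVersion
}

func main() {
	// With version 20 active, a feature requiring 19 is safe to plan,
	// but one requiring 21 is not.
	fmt.Println(canIncludeInPlan(20, 19))
	fmt.Println(canIncludeInPlan(20, 21))
}
```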

@solongordon
Contributor

DistSQL uses the DistSQLVersion to determine which nodes should be included in the plan. It doesn't look like this has changed between 19.2 and HEAD.
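The DistSQLVersion-based node selection described above can be sketched like this. A minimal illustration with made-up names, not the actual planner code: the gateway accepts a version range and simply skips nodes outside it, letting their work fall back to the gateway.

```go
package main

import "fmt"

// nodeInfo is an illustrative stand-in for a node's advertised DistSQL
// version (the real planner consults gossiped server info).
type nodeInfo struct {
	id      int
	version int
}

// compatibleNodes keeps only nodes whose DistSQLVersion lies within
// [minAccepted, current]; excluded nodes are not given flows, and their
// work is planned on the gateway instead.
func compatibleNodes(nodes []nodeInfo, minAccepted, current int) []int {
	var out []int
	for _, n := range nodes {
		if n.version >= minAccepted && n.version <= current {
			out = append(out, n.id)
		}
	}
	return out
}

func main() {
	nodes := []nodeInfo{{1, 28}, {2, 28}, {3, 27}}
	// Gateway at version 28, accepting nothing older than 28:
	// node 3 is excluded from the plan.
	fmt.Println(compatibleNodes(nodes, 28, 28))
}
```

Note that this check is about plan-protocol compatibility, not about whether the remote process is reachable at plan time, which is why it cannot catch the failure seen in this issue.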

@tbg
Member Author

tbg commented Jan 14, 2020

OK, so DistSQL isn't equipped to handle the fact that "newer" binaries may restart into the old binary again. Unfortunately, I looked, and this test isn't actually doing that. Could an old node be trying to run the migration on a newer node? That seems unlikely. Looks like someone from DistSQL-land needs to look into this further.

@andreimatei
Contributor

FWIW, I got one failure in 23 runs with

bin/roachtest run --parallelism=10 --count=30 acceptance/version-upgrade --roachprod=bin/roachprod

@tbg
Member Author

tbg commented Jan 15, 2020

#44005 is where the official test runner tracks this now.

@andreimatei
Contributor

Solon, I'll leave this in your hands.

@andreimatei andreimatei added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Jan 15, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 15, 2020
Very flaky, apparently because of some problem with a recent migration.
Touches cockroachdb#43957, cockroachdb#44005

Release note: None
@andreimatei
Contributor

EnsureMigrations runs before the DistSQL server is hooked up to the socket:

netutil.FatalIfUnexpected(s.grpc.Serve(anyL))

So DistSQL RPCs can't be used during migrations.

@andreimatei
Contributor

I'll try to fix it.

craig bot pushed a commit that referenced this issue Jan 15, 2020
43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich

We would always Get and then Set a value while Append'ing without paying
attention to whether the value is actually NULL. This can lead to
problems in case of flat bytes if the necessary invariant is
unmaintained. Now this is fixed by explicitly enforcing the invariant.
Additionally, this commit ensures that the destination slice has the
desired capacity before appending one value at a time (in case of
a present selection vector).

I tried an approach that pays attention to whether the value is NULL
before appending it and saw a significant performance hit, so I think
this approach is the lesser evil.

Fixes: #42774.

Release note: None

43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea

It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback to be the final steps of
Resume. This should help ensure that these stages are run once and only
once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei

Very flaky, apparently because of some problem with a recent migration.
Touches #43957, #44005

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@andreimatei
Contributor

I fooled myself into thinking I understood the initialization order. The gRPC server is started before the migrations run, but perhaps it's not hooked up to the mux yet?
I'm no longer sure that the DistSQL server not having started is the problem. I'll look more tomorrow.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 17, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101
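The shape of the fix in that patch can be sketched as follows, with hypothetical names (the real change pins the migration queries' session settings to local execution; the type and field here are illustrative, not CockroachDB's actual API):

```go
package main

import "fmt"

// sessionData is an illustrative stand-in for per-session execution
// settings; distSQLMode mirrors the "auto"/"on"/"off" session setting.
type sessionData struct {
	distSQLMode string
}

// forMigration returns a copy of the session data with distribution
// disabled, so a migration query can never plan flows on remote nodes,
// regardless of the cluster default.
func forMigration(sd sessionData) sessionData {
	sd.distSQLMode = "off"
	return sd
}

func main() {
	sd := sessionData{distSQLMode: "auto"}
	fmt.Println(forMigration(sd).distSQLMode)
}
```

Local-only execution sidesteps both the startup-ordering problem and the general fragility of distributed queries that the commit message calls out.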
@andreimatei
Contributor

Root cause in #44101
Narrow fix in #44102

@tbg
Member Author

tbg commented Jan 17, 2020 via email

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 18, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 21, 2020
@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
craig bot pushed a commit that referenced this issue Jan 23, 2020
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei


Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in e12735f Jan 23, 2020