roachtest: acceptance/version-upgrade failed #43957
Comments
Interesting. I would expect that, if there were a version incompatibility, the new node wouldn't even attempt to plan a DistSQL query involving the incompatible nodes. It looks like the older node is failing to connect to the newer one, but it's not clear why from the logs:
I suppose I'll try disabling distsql for the migration query to see if that makes the sporadic roachtest failures go away, but it would be nice to understand why the connection fails.
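For reference, distribution can be turned off per session via the `distsql` session setting. Below is a minimal Go sketch assuming a local single-node cluster; the connection string and the query are placeholders, and only `SET distsql = off` is the actual knob.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Force local execution for this session so queries never fan out to
	// other nodes, regardless of what the planner would otherwise prefer.
	if _, err := db.Exec("SET distsql = off"); err != nil {
		log.Fatal(err)
	}

	// Any statement on this session now runs on the gateway only.
	rows, err := db.Query("SELECT count(*) FROM system.jobs")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
}
```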
The nodes cycle back and forth in this test, so possibly the planning node still considers the other node to be running a binary that supports the new stuff, but then that node cycles back to the old binary by the time the plan actually gets applied. This can't happen if DistSQL uses the active cluster version on remote nodes to determine whether to use them for planning (since that can only ever increase). But maybe it isn't smart like that? cc @andreimatei
DistSQL uses the DistSQLVersion to determine which nodes should be included in the plan. It doesn't look like this has changed between 19.2 and HEAD.
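Roughly, the gateway only includes a remote node in a distributed plan if the version that node advertises falls inside the window the gateway accepts. A hedged sketch of that check; the type and field names are illustrative, not the actual planner API:

```go
package main

import "fmt"

// DistSQLVersionInfo is a stand-in for the version pair a node advertises:
// the execution-engine version it runs and the oldest version it still accepts.
type DistSQLVersionInfo struct {
	Version            uint32 // version this node's execution engine speaks
	MinAcceptedVersion uint32 // oldest version it can still serve
}

// compatible reports whether the gateway can include the remote node in a
// distributed plan: the remote must accept the plan version the gateway
// produces, and the gateway must accept what the remote speaks.
func compatible(gateway, remote DistSQLVersionInfo) bool {
	return gateway.Version >= remote.MinAcceptedVersion &&
		remote.Version >= gateway.MinAcceptedVersion
}

func main() {
	oldNode := DistSQLVersionInfo{Version: 28, MinAcceptedVersion: 27}
	newNode := DistSQLVersionInfo{Version: 29, MinAcceptedVersion: 28}
	fmt.Println(compatible(newNode, oldNode)) // true: still inside the accepted window
}
```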
OK, so DistSQL isn't equipped to handle the fact that "newer" binaries may restart into the old binary again. Unfortunately, I looked and that isn't what this test is doing. Could an old node be trying to run the migration on a newer node? That seems unlikely. Looks like someone from DistSQL-land needs to look into this more.
FWIW, I got one failure in 23 runs with
#44005 is where the official test runner tracks this now. |
Solon, I'll leave this in your hands.
Very flaky, apparently because of some problem with a recent migration. Touches cockroachdb#43957, cockroachdb#44005 Release note: None
cockroach/pkg/server/server.go Line 1828 in 46a38c9
So
I'll try to fix it.
43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich

We would always Get and then Set a value while Append'ing, without paying attention to whether the value is actually NULL. This can lead to problems in the case of flat bytes if the necessary invariant is not maintained. This is now fixed by explicitly enforcing the invariant. Additionally, this commit ensures that the destination slice has the desired capacity before appending one value at a time (in the case of a present selection vector). I tried an approach that pays attention to whether the value is NULL before appending it and saw a significant performance hit, so I think this approach is the least evil.

Fixes: #42774.
Release note: None

43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea

It seems that jobs today do not ensure that the OnSuccess callback is called exactly once. This PR moves the cleanup stages of RESTORE, formerly located in the OnSuccess callback, to be the final steps of Resume. This should help ensure that these stages are run once and only once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei

Very flaky, apparently because of some problem with a recent migration.

Touches #43957, #44005

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
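As an aside on the coldata change above, the described approach (grow the destination once, then append element by element while skipping NULL slots rather than reading possibly-garbage values) can be sketched roughly as follows; the types are simplified stand-ins, not the actual coldata API:

```go
package main

import "fmt"

// appendWithNulls appends the selected elements of src to dst, writing a zero
// placeholder for NULL slots instead of reading src (which may hold garbage
// at NULL positions). Simplified stand-in for the behavior described above.
func appendWithNulls(dst, src []int64, nulls []bool, sel []int) []int64 {
	// Grow the destination once up front so the per-element appends below
	// don't reallocate repeatedly.
	if need := len(dst) + len(sel); cap(dst) < need {
		grown := make([]int64, len(dst), need)
		copy(grown, dst)
		dst = grown
	}
	for _, i := range sel {
		if nulls[i] {
			dst = append(dst, 0) // placeholder; a null bitmap would mark it NULL
			continue
		}
		dst = append(dst, src[i])
	}
	return dst
}

func main() {
	dst := []int64{1, 2}
	src := []int64{10, 20, 30}
	nulls := []bool{false, true, false}
	fmt.Println(appendWithNulls(dst, src, nulls, []int{0, 1, 2})) // [1 2 10 0 30]
}
```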
I fooled myself thinking I understood the initialization order. The gRPC server is started before the migrations start. But perhaps it's not hooked up to the mux yet?
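To illustrate the hazard being discussed: the gRPC server object can exist well before the shared listener actually routes traffic to it, so anything started in between (like migrations that end up planning distributed queries) races with the listener hookup. A generic sketch using cmux-style port sharing; this is not the actual server.go code:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}

	grpcServer := grpc.NewServer() // the server object exists here...

	// ...but until the mux below is serving, inbound gRPC connections to this
	// port go nowhere. Anything kicked off at this point (e.g. migrations that
	// plan distributed queries) can race with the listener hookup.
	// runMigrations() // hypothetical: too early if it relies on inbound gRPC

	m := cmux.New(lis)
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
	httpL := m.Match(cmux.HTTP1Fast())

	go grpcServer.Serve(grpcL)
	go http.Serve(httpL, http.DefaultServeMux)

	// Only once m.Serve() is running does traffic actually reach the gRPC server.
	log.Fatal(m.Serve())
}
```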
This patch inhibits DistSQL distribution for the queries that the migrations run. This was prompted by cockroachdb#44101, which is causing a distributed query issued soon after a node starts up to sometimes fail. I considered more bluntly disabling distribution for any query for a short period after node startup, but I went with the more targeted change to migrations because I think it's a bad idea for migrations to use query distribution even apart from cockroachdb#44101: distributed queries are more fragile than local execution in general (for example, because of DistSender retries), and migrations can't tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101
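In effect, the fix pins migration queries to the gateway by running them with distribution turned off. A simplified stand-in for that decision; the names are illustrative, not CockroachDB's actual planner types:

```go
package main

import "fmt"

// distributionMode mirrors the off / auto / always values of the session
// setting; the names here are illustrative.
type distributionMode int

const (
	distModeOff distributionMode = iota
	distModeAuto
	distModeAlways
)

// shouldDistribute is a simplified stand-in for the decision of whether to
// fan a query out to other nodes. Running migrations with distModeOff keeps
// their plans on the gateway even when distribution would be recommended.
func shouldDistribute(mode distributionMode, recommended bool) bool {
	switch mode {
	case distModeOff:
		return false
	case distModeAlways:
		return true
	default: // distModeAuto: follow the planner's recommendation
		return recommended
	}
}

func main() {
	fmt.Println(shouldDistribute(distModeOff, true))  // false: migration queries stay local
	fmt.Println(shouldDistribute(distModeAuto, true)) // true: normal queries may distribute
}
```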
Oof, that must have been painful to dig up. Thanks Andrei!
On Fri, Jan 17, 2020, Andrei Matei wrote:
Root cause in #44101. Narrow fix in #44102.
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei

Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
I broke the issue filing (fixing in #43956) so these fell off the radar:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1686776&buildTypeId=Cockroach_UnitTests
https://teamcity.cockroachdb.com/viewLog.html?buildId=1687798&buildTypeId=Cockroach_UnitTests
Both are from today and have a node crash of this kind (TLDR: no inbound stream connection)
That error unfortunately just means "the other node didn't connect to us". Could it not be doing that because of some version incompatibility that we introduced?
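For context, that error is the gateway giving up after scheduling remote flows that never dial back within the setup timeout. A hedged sketch of the waiting side; the names and the timeout are illustrative, not the actual DistSQL flow-setup code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoInboundStream is a stand-in for the "no inbound stream connection"
// error: the gateway scheduled a remote flow, but that node never opened a
// stream back before the setup timeout expired.
var errNoInboundStream = errors.New("no inbound stream connection")

// waitForInboundStreams waits until all expected remote flows have connected,
// or gives up after the timeout. connected would be signalled by the RPC
// handler that accepts each inbound stream.
func waitForInboundStreams(ctx context.Context, connected <-chan struct{}, expected int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for i := 0; i < expected; i++ {
		select {
		case <-connected:
		case <-ctx.Done():
			return errNoInboundStream
		}
	}
	return nil
}

func main() {
	connected := make(chan struct{}) // nobody ever connects in this demo
	err := waitForInboundStreams(context.Background(), connected, 1, 100*time.Millisecond)
	fmt.Println(err) // no inbound stream connection
}
```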
@solongordon I think this must've become a bug when you rewrote this method in f8faf89? Perhaps it's just a bad idea to use DistSQL in a migration: if there's a version bump between the nodes, some old nodes might refuse the inbound connection (or that's how I imagine this breaks).