
release: v19.2.0-beta.20190930 #41128

Closed · 11 of 18 tasks
maddyblue opened this issue Sep 26, 2019 · 28 comments


Candidate SHA: 250f4c3
Deployment status: http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster
Older: http://mjibson-release-v1920-beta20190930-0001.roachprod.crdb.io:26258/
Even older: http://mjibson-release-v1920-alpha20190805-0001.roachprod.crdb.io:26258/

Release qualification:

Nightly Suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1514704&buildTypeId=Cockroach_Nightlies_NightlySuite
Old nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1511951&buildTypeId=Cockroach_Nightlies_NightlySuite
Even older nightly suite: https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite

Release process checklist

Prep date: 2019-09-30

Release date: 2019-10-02

@maddyblue:

Changing the SHA to cf5c2bd from 77f26d1 to pick up the fix for #41145.

@maddyblue:

Test clusters are still running 77f26d1.

@maddyblue commented Sep 27, 2019

Failed roachtest nightlies (https://teamcity.cockroachdb.com/viewLog.html?buildId=1509990&buildTypeId=Cockroach_Nightlies_NightlySuite):

  • import/tpcc/warehouses=1000/nodes=32 @dt
  • restore2TB/nodes=10 @dt
  • schemachange/mixed/tpcc @dt
  • tpcc/mixed-headroom/n5cpu16
  • version/mixed/nodes=5
  • cancel/tpcc/distsql/w=10,nodes=3
  • clearrange/checks=false
  • clearrange/checks=true
  • hibernate
  • kv50/rangelookups/relocate/nodes=8
  • tpccbench/nodes=9/cpu=4/chaos/partition
  • typeorm
  • acceptance/bank/zerosum-splits

Edit 2019-10-01: this list is replaced by the one below #41128 (comment)

@maddyblue:

This release is canceled due to some new bugs. I'm going to start a new roachprod cluster comparison with master as of right now. On Monday the release manager and others will decide what to do. I'll post links to those test clusters here.

@maddyblue:

Latency on the master cluster is going up, while it stays below 20ms on the previous-release cluster. Read ops are similar on both.

(screenshot: latency graph from the master cluster)

@maddyblue commented Sep 30, 2019

This is true for most metrics that are related to SQL or storage.

@maddyblue:

(screenshot)

For comparison, the same image of the 0805 release cluster.

@maddyblue:

Current status: we are going to try to release 250f4c3 on Wednesday (Oct 2).

@knz commented Oct 1, 2019

(sorry wrong button)
Regarding the 3rd issue on the list: @lucy-zhang found that VALIDATE CONSTRAINT running concurrently with TPC-C 1K reveals invalid FK relations. The output is OK when TPC-C is not running concurrently, or when running lighter TPC-C workloads. This means that there are isolation problems.

I am tempted to interpret this as a real-world instance of #41173, @andreimatei what do you think?
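
As an illustrative aside (not from the original thread): the statement involved is CockroachDB's ALTER TABLE ... VALIDATE CONSTRAINT, run while the TPC-C 1K workload is active. A minimal sketch, assuming the standard TPC-C schema loaded into a tpcc database; the constraint name is an assumption, not taken from the report:

    -- Hypothetical example: revalidate an existing TPC-C foreign key
    -- while the workload is running; table and constraint names are illustrative only.
    ALTER TABLE tpcc.district VALIDATE CONSTRAINT fk_d_w_id_ref_warehouse;

With no concurrent workload (or a lighter one), the same statement reportedly succeeds.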

@knz commented Oct 1, 2019

From @lucy-zhang, offline:

I'm not sure if Andrei's issue is the same thing. The rows that are supposed to be missing are in tpcc.warehouse, which we never update (AFAIK) after we restore the fixtures, so even if the read timestamp were slightly behind for some parts of the reads, it shouldn't matter.
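
To make the quoted claim concrete (again, not from the original thread): "invalid FK relations" here means the validation reports referencing rows whose referenced tpcc.warehouse row appears to be missing. A hedged sketch of such an orphan check, using standard TPC-C table and column names that are assumed rather than taken from the report:

    -- Hypothetical orphan check: district rows whose warehouse appears missing.
    SELECT d.d_w_id, d.d_id
    FROM tpcc.district AS d
    LEFT JOIN tpcc.warehouse AS w ON w.w_id = d.d_w_id
    WHERE w.w_id IS NULL;

A non-empty result against a table that is never modified after restore is why this reads as an isolation problem rather than a data problem.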

@knz commented Oct 1, 2019

Failed nightlies as of 2019-10-01:

Note that at 10.00am CEST the test suite is still running (19 hours after starting). There may be more failures incoming.

Edit: at 1pm CEST it's still running.

Edit: at 3.30pm CEST it finished running, and one additional issue dripped out (tpccbench/chaos/partition was added to the list above).

@knz commented Oct 1, 2019

For reference, the following failures from the original list on 2019-09-27 are not there any more:

  • tpcc/mixed-headroom/n5cpu16
  • version/mixed/nodes=5
  • tpccbench/nodes=9/cpu=4/chaos/partition
  • typeorm
  • acceptance/bank/zerosum-splits

The following are new:

  • network/tpcc/nodes=4
  • TestRandomSyntaxSQLSmith (although I suspect we're going to ignore this since this test is dripping new issues on every run by design)

@knz commented Oct 1, 2019

My analysis of the issues (same order as above):

@knz commented Oct 1, 2019

Regarding this:

#37259: inter-node network mishaps cause a tpcc run to fail.
My understanding: the test uses toxiproxy between nodes to introduce network partitions, and then asserts that there is no goroutine peak in the server. However, the partition also causes legitimate SQL errors in clients, and that makes the test abort in error, in a way that's not relevant to the primary thing being tested. @ajwerner do you agree with my analysis?

Answer from @ajwerner:

That initial analysis seems sound.

(thus classifying the issue as not a release blocker)

@nvanbenschoten:

#40359 can be ignored. It appears to be a testing error caused by new retryable errors that we see during RELOCATE RANGE statements. See #41106 (which I'm planning to return to later today).

@jordanlewis:

Signing off on the cancel test. It's a testing error caused by the operation in question finishing too quickly.

@knz commented Oct 1, 2019

Regarding #40935, an update from @lucy-zhang:

Current update is that @jordan and I have tried to reproduce this by turning loadgen back on for one of the clusters where I got this test failure, and running that select query manually, and neither of us has seen it again.
So it looks like there is some additional state necessary to repro aside from "heavy load" (possibly related to the early state of the cluster?).
So this does seem rarer than I first expected.

Separately, @andy-kimball states, with agreement from @bdarnell:

If this repros only under heavy stress, and we have seen it nowhere else, I don't think it should be classified as a beta blocker. The bar should be very high now. I'd only block beta for "high severity" (which this is) and "common" (which this isn't).

So I'm checking this off as signed off by Lucy, Jordan, and Andy.

@knz self-assigned this on Oct 1, 2019
@rafiss commented Oct 1, 2019

The Hibernate issues (#40538) are not a beta blocker. The tests are concerningly flaky though, and we will continue investigating this during the rest of the release period. The issues stem from the Hibernate tests occasionally being unable to connect to the DB.

@maddyblue:

I checked off the sqlsmith failure because it should never block a release.

@knz commented Oct 2, 2019

There's discussion about whether to adopt the latest changes that improve on #41206 (perf regression).
If we bump the SHA, here is the diff:

New features:
#41190 - decommissioning via atomic replication changes
#40954 - SHOW RANGE FOR ROW
#41138 - stats collection in movr

Bug fixes:
#41153 - rocksdb assert revert
#41244 - rocksdb compaction bug fix
#41194 - addsstable bug fix
#41195 - addsstable bug fix
#41196 - addsstable bug fix
#41217 - libroach iterator bug fix
#41187 - sql planning fix
#41212 - sql planning fix
#41241 - FastIntSet bug fix (sql planning & others)
#41231 - mem leak bug fix

Perf:
#41220 - sql planning perf / mem usage improvement

Polish:
#40493 - sql polish zone config introspection
#40948 - sql polish
#41129 - distsql plan viz polish
#41192 - roachtest improvement
#41215 - code polish
#41221 - k8s conf update
#41235 - test fixes
#41237 - movr polish

@knz commented Oct 2, 2019

Input from @awoods187 and @bdarnell: continue with the same SHA.

@knz commented Oct 2, 2019

Note I have checked the cluster health at http://mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88-0001.roachprod.crdb.io:26258/#/metrics/overview/cluster

The cluster displays a performance anomaly for the period up to and including Sept 30.

Then yesterday (Oct 1) Nathan uploaded a new binary with the fix from #41220 (and other fixes), which demonstrates that the anomaly disappears. Note, however, that these fixes are not present in today's release.

@knz commented Oct 2, 2019

@dt about the clearrange tests:

I’m fine with clearrange failures for now — we’re investigating but AFAIK, it is just slow, not wrong.

(considering this as sign-off)

@knz commented Oct 2, 2019

@dt about the restore2TB test:

The RESTORE one I took a very cursory look at and didn't see what killed it, just exit 255.
Might have OOMed, I guess; I wonder if it was the rocks logging leak? I dunno. I'm fine signing off on that too, I guess.

@knz commented Oct 2, 2019

@irfansharif and @andreimatei do not have anything to say on the remaining roachtest failure.

So I went and investigated the log files myself. I am not seeing errors in the CockroachDB logs themselves other than the expected "cannot connect to node" (the node being shut down under chaos).

@bdarnell says "go"

@knz commented Oct 2, 2019

The version is tagged, the binaries are uploaded, and the docker image works, so from engineering's side the release is ready to go.
The release notes PR is here: cockroachdb/docs#5250

Handing this off to docs.

@knz closed this as completed on Oct 3, 2019