release: v19.1.0-beta.20190318 #35660

Closed
13 of 17 tasks
ajwerner opened this issue Mar 12, 2019 · 15 comments

ajwerner commented Mar 12, 2019

Candidate SHA: a512e39
Deployment status: Ran roachprod-based clusters, deployed to adriatic late (2019/04/18 12:40 EST)
Nightly Suite: TeamCity Job

Release process checklist

Prep date: 3/12/2019

Release date: 3/18/2019

ajwerner self-assigned this Mar 12, 2019
@ajwerner (Contributor Author)

I kicked off two clusters last night, one with the provisional release and one with the last release. I loaded up tpc-c 1k onto each and left them running when I went to bed. The new release seems to have died a little over an hour in: there was a big spike in CPU/memory as well as in txn restarts, and eventually an error hit the client and the load stopped.

There are a couple of interesting things in the logs, but on the whole they're pretty sparse. There are a couple of warnings like

ip-172-31-35-120> W190313 05:06:27.719833 13603764 storage/intentresolver/intent_resolver.go:822  [n4,s4,r3044/1:/Table/57/1/{80/0-90/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/87/6/0: context canceled
ip-172-31-46-235> W190313 05:06:27.796814 11838596 storage/intentresolver/intent_resolver.go:822  [n6,s6,r2953/3:/Table/57/1/8{50/0-60/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/852/7/0: context canceled

and then

ip-172-31-44-162> I190313 05:07:32.666042 22749 sql/distsql_running.go:149  [n2,client=172.31.35.96:44164,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=0f1ce342 key=/Table/57/1/625/2/0 rw=true pri=0.02300479 stat=ABORTED epo=0 ts=1552453630.311500840,1 orig=1552453626.561107837,0 max=1552453627.061107837,0 wto=false seq=1

Then, a few minutes later, there's a cascade of deadline exceeded errors and the load stops.
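For context on the error hitting the client: load generators typically wrap each transaction in a retry loop that absorbs retryable errors like the TransactionRetryWithProtoRefreshError above, so a run only dies once an error escapes that loop. Below is a minimal sketch of that pattern in Go using the cockroach-go helper; the connection string and the newOrder body are illustrative, not taken from this run.

// Hedged sketch: the usual client-side retry pattern around CockroachDB
// transactions. Retryable errors surface to the client as SQLSTATE 40001
// and are absorbed by a loop like this; the load stopping suggests the
// workload eventually saw an error it could not retry.
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/cockroachdb/cockroach-go/crdb"
	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection string; not the cluster from this issue.
	const connString = "postgresql://root@localhost:26257/tpcc?sslmode=disable"
	db, err := sql.Open("postgres", connString)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ExecuteTx re-runs the closure whenever the server returns a
	// retryable (40001) error.
	err = crdb.ExecuteTx(context.Background(), db, nil, func(tx *sql.Tx) error {
		return newOrder(tx) // hypothetical TPC-C new-order transaction body
	})
	if err != nil {
		// A non-retryable error ends up here and, in a load generator,
		// typically stops the run.
		log.Fatal(err)
	}
}

func newOrder(tx *sql.Tx) error {
	_, err := tx.Exec("SELECT 1") // placeholder for the real statements
	return err
}

Whether the tpc-c harness gave up on retries or hit a genuinely non-retryable error isn't clear from the logs above.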


tbg commented Mar 13, 2019 via email

@ajwerner (Contributor Author)

Unfortunately no, the heap_profiler directories are empty.
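For future runs, a heap profile can still be grabbed from a live node even when the automatic heap_profiler output is missing, since the process serves the standard Go pprof endpoints on its HTTP port. A minimal sketch, assuming an insecure test cluster with the HTTP port on 8080; the host name is just one of the nodes from the logs above, used illustratively.

// Hedged sketch: pull a heap profile straight from a node's pprof endpoint
// when the on-disk heap_profiler output is missing. Assumes an insecure
// cluster with the HTTP port on 8080.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	const url = "http://ip-172-31-35-120:8080/debug/pprof/heap"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof") // inspect later with `go tool pprof heap.pprof`
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}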


tbg commented Mar 13, 2019

OK. Poking around a bit: the I/O latencies went up, but that could just be fallout.

[image: I/O latencies]

Raft time also goes up:

[image: Raft processing time]

So something expensive was likely happening, but it's unclear what. There is no increase in MsgApp but at the end we're seeing more heartbeats, a further sign of system overload.

@ajwerner (Contributor Author)

Resource usage is steady until around 05:05:43 and then starts to explode.

ip-172-31-33-131> I190313 05:05:43.996810 487 server/status/runtime.go:464  [n1] runtime stats: 12 GiB RSS, 1110 goroutines, 415 MiB/120 MiB/864 MiB GO alloc/idle/total, 7.8 GiB/9.9 GiB CGO alloc/total, 48632.4 CGO/sec, 248.4/49.4 %(u/s)time, 0.0 %gc (7x), 46 MiB/430 MiB (r/w)net
ip-172-31-44-162> I190313 05:05:47.079751 360 server/status/runtime.go:464  [n2] runtime stats: 12 GiB RSS, 1413 goroutines, 750 MiB/385 MiB/1.4 GiB GO alloc/idle/total, 7.8 GiB/9.8 GiB CGO alloc/total, 66019.7 CGO/sec, 649.4/76.2 %(u/s)time, 1.9 %gc (10x), 305 MiB/464 MiB (r/w)net
ip-172-31-42-34> I190313 05:05:47.722956 373 server/status/runtime.go:464  [n3] runtime stats: 12 GiB RSS, 1389 goroutines, 1.0 GiB/1.1 GiB/2.4 GiB GO alloc/idle/total, 7.7 GiB/9.9 GiB CGO alloc/total, 54711.4 CGO/sec, 716.7/100.3 %(u/s)time, 2.1 %gc (9x), 722 MiB/38 MiB (r/w)net
ip-172-31-35-120> I190313 05:05:50.455167 334 server/status/runtime.go:464  [n4] runtime stats: 13 GiB RSS, 2228 goroutines, 1.7 GiB/222 MiB/2.2 GiB GO alloc/idle/total, 7.8 GiB/10 GiB CGO alloc/total, 63641.5 CGO/sec, 1135.1/107.3 %(u/s)time, 2.5 %gc (9x), 575 MiB/784 MiB (r/w)net

There's nothing interesting in the logs at the beginning of this increase. Over the next two minutes, resource utilization spikes:

ip-172-31-42-34> I190313 05:06:58.320226 373 server/status/runtime.go:464  [n3] runtime stats: 17 GiB RSS, 1994 goroutines, 5.8 GiB/150 MiB/6.3 GiB GO alloc/idle/total, 7.7 GiB/10 GiB CGO alloc/total, 16079.4 CGO/sec, 1496.8/49.0 %(u/s)time, 0.0 %gc (2x), 63 MiB/3.7 MiB (r/w)net

This includes RAM, CPU, and network. Disk reads spike and writes fall off a cliff.
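As an aside, the runtime-stats lines quoted above are regular enough to chart directly from the merged logs without the admin UI. A rough sketch; the log path and the focus on the RSS field are illustrative:

// Hedged sketch: extract per-node RSS over time from runtime-stats log
// lines in the merged format quoted above, to chart the memory spike.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
)

// Matches lines like:
//   ip-172-31-42-34> I190313 05:06:58.320226 373 server/status/runtime.go:464  [n3] runtime stats: 17 GiB RSS, ...
var runtimeStats = regexp.MustCompile(
	`^(\S+)> I(\d{6} \d{2}:\d{2}:\d{2}\.\d+) .*\[(n\d+)\] runtime stats: ([\d.]+ \w+) RSS`)

func main() {
	f, err := os.Open("cockroach.log") // path is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // runtime-stats lines are long
	for sc.Scan() {
		if m := runtimeStats.FindStringSubmatch(sc.Text()); m != nil {
			// host, timestamp, node, RSS as CSV
			fmt.Printf("%s,%s,%s,%s\n", m[1], m[2], m[3], m[4])
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}

Feeding the output into any plotting tool makes the jump visible; per the lines quoted here, n3 goes from 12 GiB to 17 GiB RSS between 05:05:47 and 05:06:58.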

The primary event that correlates with the beginning of the spike is a rise in active DistSQL flows, which also seems to coincide with a period of transaction restart errors.
[screenshot: SQL Dashboard, Cockroach Console]
[screenshot: Distributed Dashboard, Cockroach Console]

@ajwerner (Contributor Author)

We see increases in transaction restarts, which seem to lead to DistSQL flows running much longer than in the steady state. We don't have a lot to go on to determine what caused this restart storm. The logging we do get shows:

ip-172-31-33-131> I190313 05:07:00.066230 62109 sql/distsql_running.go:149  [n1,client=172.31.35.96:42848,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=ec0bfaf7 key=/Table/57/1/948/8/0 rw=true pri=0.02702558 stat=ABORTED epo=0 ts=1552453586.736337287,2 orig=1552453559.687657007,0 max=1552453560.187657007,0 wto=false seq=1

I worry this isn't the most interesting of the restarts.

@ajwerner (Contributor Author)

I'm worried I somewhat dropped the ball on getting sign-off on roachtest failures. Many scary-sounding roachtests failed when I ran them last week. I'm running them all again, but I'll post the list here as I dig into it.

[screenshot: Cockroach Nightlies Roachtest Nightly - GCE #764 (13 Mar 19 04:42) Overview — TeamCity]

Many logic tests also failed, apparently due to:

panic: unhandled node type: *sql.commentOnTableNode [recovered]
	panic: panic while executing 1 statements: COMMENT ON TABLE _ IS 'waa'; caused by unhandled node type: *sql.commentOnTableNode

see here
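That panic has the shape of a plan-tree walker whose type switch was never taught about the newly added node. A minimal, hypothetical sketch of the failure mode follows; the types and walker are illustrative, not the actual sql package code.

// Hedged sketch of the failure mode behind the panic above: a pass over the
// plan tree with an exhaustive-by-convention type switch panics when a newly
// added node type (here, something like commentOnTableNode) has no case.
package main

import "fmt"

type planNode interface{ plan() }

type scanNode struct{}
type commentOnTableNode struct{} // the newly added statement's node

func (scanNode) plan()           {}
func (commentOnTableNode) plan() {}

// walkPlan stands in for any pass that must enumerate every node type it
// can encounter.
func walkPlan(n planNode) {
	switch n.(type) {
	case *scanNode:
		// ... handle the known node ...
	default:
		panic(fmt.Sprintf("unhandled node type: %T", n))
	}
}

func main() {
	walkPlan(&scanNode{})           // fine
	walkPlan(&commentOnTableNode{}) // panics: unhandled node type: *main.commentOnTableNode
}

Presumably the fix is just adding the missing case for the new node type; the sketch only illustrates why every logic test exercising COMMENT ON TABLE trips the same panic.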


tbg commented Mar 18, 2019

@andreimatei or @bdarnell can qualify the jepsen failures (IIRC the exit code 255 ones aren't "real failures" -- but I haven't opened the artifacts). kv/gracefuldraining has been failing for ages (I mean it -- ages) and has no weight on the releasability (except if nodes died with scary errors). The CDC tests have been flaky for a long time, sort of strategically due to switching to RangeFeed early in the process. @danhhz can tell you what his expectations are. Mine are that they are still known flaky and can't really tell you much.

I'd worry about three of them: kv0, clearrange/checks=true (will look at that one), splits/load/...

For the rest I'd look for crashes only.


tbg commented Mar 18, 2019

clearrange is fine. It's a crash, but one triggered by an assertion only active in that special test and one that I know can fail.


tbg commented Mar 18, 2019

kv0 died in roachprod start, so nothing to see there.
splits/load/uniform/nodes=3 actually tests load-based splitting and has been broken since I rewrote LBS. It's broken only in the sense that there are sometimes a few fewer splits than it expects. This isn't new and also not critical, though at some point I'll want to understand it.

@ajwerner (Contributor Author)

Thank you for jumping in so quickly to help qualify these failures!


danhhz commented Mar 18, 2019

Yeah, the CDC failures are expected as of the SHA you cut.

@bdarnell (Contributor)

The jepsen failures are fine. I don't think all "exit code 255" errors are benign, but both of these were NTP flakes (maybe we should fix the skew nemeses to not hit NTP as hard):

Caused by: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; ntpdate -b ntp.ubuntu.com" returned non-zero exit status 1 on 10.128.0.116. STDOUT:


STDERR:
+ cd /
+ ntpdate -b ntp.ubuntu.com
13 Mar 09:21:45 ntpdate[4845]: 91.189.89.198 rate limit response from server.
13 Mar 09:21:46 ntpdate[4845]: 91.189.91.157 rate limit response from server.
13 Mar 09:21:47 ntpdate[4845]: 91.189.94.4 rate limit response from server.
13 Mar 09:21:49 ntpdate[4845]: 91.189.89.199 rate limit response from server.
13 Mar 09:21:49 ntpdate[4845]: no server suitable for synchronization found


tbg commented Mar 18, 2019

What about the third and fourth one (from jepsen-batch3)? Or am I counting those wrong?

@bdarnell (Contributor)

Sorry, got confused by the TeamCity * putting things out of order (what does this mean?). All four of these are the same NTP rate-limit issue.
