release: v19.1.0-beta.20190318 #35660

Closed
13 of 17 tasks
ajwerner opened this issue Mar 12, 2019 · 15 comments

ajwerner commented Mar 12, 2019

Candidate SHA: a512e39
Deployment status: Ran roachprod-based clusters, deployed to adriatic late (2019/04/18 12:40 EST)
Nightly Suite: TeamCity Job

Release process checklist

Prep date: 3/12/2019

Release date: 3/18/2019

ajwerner self-assigned this Mar 12, 2019
@ajwerner (Contributor Author)

I kicked off two clusters last night, one with the provisional release and one with the last release. I loaded up tpc-c 1k onto each and left them running when I went to bed. The new release seems to have died a little over an hour in: there was a big spike in CPU/memory as well as in txn restarts, and eventually an error hit the client and the load stopped.

There are a couple of interesting things in the logs, but on the whole they're pretty sparse. There are a couple of warnings like

ip-172-31-35-120> W190313 05:06:27.719833 13603764 storage/intentresolver/intent_resolver.go:822  [n4,s4,r3044/1:/Table/57/1/{80/0-90/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/87/6/0: context canceled
ip-172-31-46-235> W190313 05:06:27.796814 11838596 storage/intentresolver/intent_resolver.go:822  [n6,s6,r2953/3:/Table/57/1/8{50/0-60/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/852/7/0: context canceled

and then

ip-172-31-44-162> I190313 05:07:32.666042 22749 sql/distsql_running.go:149  [n2,client=172.31.35.96:44164,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=0f1ce342 key=/Table/57/1/625/2/0 rw=true pri=0.02300479 stat=ABORTED epo=0 ts=1552453630.311500840,1 orig=1552453626.561107837,0 max=1552453627.061107837,0 wto=false seq=1

Then, a few minutes later, there's a cascade of deadline exceeded errors and the load stops.
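For context on the error hitting the client: load generators typically wrap each transaction in a retry loop that absorbs retryable errors like the TransactionRetryWithProtoRefreshError above, so a run only dies once an error escapes that loop. Below is a minimal sketch of that pattern in Go using the cockroach-go helper; the connection string and the newOrder body are illustrative, not taken from this run.

// Hedged sketch: the usual client-side retry pattern around CockroachDB
// transactions. Retryable errors surface to the client as SQLSTATE 40001
// and are absorbed by a loop like this; the load stopping suggests the
// workload eventually saw an error it could not retry.
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/cockroachdb/cockroach-go/crdb"
	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection string; not the cluster from this issue.
	const connString = "postgresql://root@localhost:26257/tpcc?sslmode=disable"
	db, err := sql.Open("postgres", connString)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ExecuteTx re-runs the closure whenever the server returns a
	// retryable (40001) error.
	err = crdb.ExecuteTx(context.Background(), db, nil, func(tx *sql.Tx) error {
		return newOrder(tx) // hypothetical TPC-C new-order transaction body
	})
	if err != nil {
		// A non-retryable error ends up here and, in a load generator,
		// typically stops the run.
		log.Fatal(err)
	}
}

func newOrder(tx *sql.Tx) error {
	_, err := tx.Exec("SELECT 1") // placeholder for the real statements
	return err
}

Whether the tpc-c harness gave up on retries or hit a genuinely non-retryable error isn't clear from the logs above.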


tbg commented Mar 13, 2019 via email

@ajwerner (Contributor Author)

Unfortunately no, the heap_profiler directories are empty.
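For future runs, a heap profile can still be grabbed from a live node even when the automatic heap_profiler output is missing, since the process serves the standard Go pprof endpoints on its HTTP port. A minimal sketch, assuming an insecure test cluster with the HTTP port on 8080; the host name is just one of the nodes from the logs above, used illustratively.

// Hedged sketch: pull a heap profile straight from a node's pprof endpoint
// when the on-disk heap_profiler output is missing. Assumes an insecure
// cluster with the HTTP port on 8080.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	const url = "http://ip-172-31-35-120:8080/debug/pprof/heap"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof") // inspect later with `go tool pprof heap.pprof`
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}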


tbg commented Mar 13, 2019

OK. Poking around a bit: the I/O latencies went up, but that could just be fallout.

[image: I/O latencies]

Raft time also goes up:

[image: Raft processing time]

So something expensive was likely happening, but it's unclear what. There is no increase in MsgApp but at the end we're seeing more heartbeats, a further sign of system overload.

@ajwerner (Contributor Author)

Resource usage is steady until around 05:05:43 and then starts to explode.

ip-172-31-33-131> I190313 05:05:43.996810 487 server/status/runtime.go:464  [n1] runtime stats: 12 GiB RSS, 1110 goroutines, 415 MiB/120 MiB/864 MiB GO alloc/idle/total, 7.8 GiB/9.9 GiB CGO alloc/total, 48632.4 CGO/sec, 248.4/49.4 %(u/s)time, 0.0 %gc (7x), 46 MiB/430 MiB (r/w)net
ip-172-31-44-162> I190313 05:05:47.079751 360 server/status/runtime.go:464  [n2] runtime stats: 12 GiB RSS, 1413 goroutines, 750 MiB/385 MiB/1.4 GiB GO alloc/idle/total, 7.8 GiB/9.8 GiB CGO alloc/total, 66019.7 CGO/sec, 649.4/76.2 %(u/s)time, 1.9 %gc (10x), 305 MiB/464 MiB (r/w)net
ip-172-31-42-34> I190313 05:05:47.722956 373 server/status/runtime.go:464  [n3] runtime stats: 12 GiB RSS, 1389 goroutines, 1.0 GiB/1.1 GiB/2.4 GiB GO alloc/idle/total, 7.7 GiB/9.9 GiB CGO alloc/total, 54711.4 CGO/sec, 716.7/100.3 %(u/s)time, 2.1 %gc (9x), 722 MiB/38 MiB (r/w)net
ip-172-31-35-120> I190313 05:05:50.455167 334 server/status/runtime.go:464  [n4] runtime stats: 13 GiB RSS, 2228 goroutines, 1.7 GiB/222 MiB/2.2 GiB GO alloc/idle/total, 7.8 GiB/10 GiB CGO alloc/total, 63641.5 CGO/sec, 1135.1/107.3 %(u/s)time, 2.5 %gc (9x), 575 MiB/784 MiB (r/w)net

There's nothing interesting in the logs at the beginning of this increase. Over the next two minutes, resource utilization spikes:

ip-172-31-42-34> I190313 05:06:58.320226 373 server/status/runtime.go:464  [n3] runtime stats: 17 GiB RSS, 1994 goroutines, 5.8 GiB/150 MiB/6.3 GiB GO alloc/idle/total, 7.7 GiB/10 GiB CGO alloc/total, 16079.4 CGO/sec, 1496.8/49.0 %(u/s)time, 0.0 %gc (2x), 63 MiB/3.7 MiB (r/w)net

This includes RAM, CPU, and network. Disk reads spike and writes fall off a cliff.
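As an aside, the runtime-stats lines quoted above are regular enough to chart directly from the merged logs without the admin UI. A rough sketch; the log path and the focus on the RSS field are illustrative:

// Hedged sketch: extract per-node RSS over time from runtime-stats log
// lines in the merged format quoted above, to chart the memory spike.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
)

// Matches lines like:
//   ip-172-31-42-34> I190313 05:06:58.320226 373 server/status/runtime.go:464  [n3] runtime stats: 17 GiB RSS, ...
var runtimeStats = regexp.MustCompile(
	`^(\S+)> I(\d{6} \d{2}:\d{2}:\d{2}\.\d+) .*\[(n\d+)\] runtime stats: ([\d.]+ \w+) RSS`)

func main() {
	f, err := os.Open("cockroach.log") // path is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // runtime-stats lines are long
	for sc.Scan() {
		if m := runtimeStats.FindStringSubmatch(sc.Text()); m != nil {
			// host, timestamp, node, RSS as CSV
			fmt.Printf("%s,%s,%s,%s\n", m[1], m[2], m[3], m[4])
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}

Feeding the output into any plotting tool makes the jump visible; per the lines quoted here, n3 goes from 12 GiB to 17 GiB RSS between 05:05:47 and 05:06:58.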

The primary event that correlates with the beginning of the spike is a rise in active DistSQL flows, which also seems to coincide with a period of transaction restart errors.
[screenshot: SQL Dashboard, Cockroach Console]
[screenshot: Distributed Dashboard, Cockroach Console]

@ajwerner (Contributor Author)

We see increases in transaction restarts, which seem to lead to DistSQL flows running much longer than in the steady state. We don't have a lot to go on to determine what caused this restart storm. The logging we do get shows:

ip-172-31-33-131> I190313 05:07:00.066230 62109 sql/distsql_running.go:149  [n1,client=172.31.35.96:42848,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=ec0bfaf7 key=/Table/57/1/948/8/0 rw=true pri=0.02702558 stat=ABORTED epo=0 ts=1552453586.736337287,2 orig=1552453559.687657007,0 max=1552453560.187657007,0 wto=false seq=1

I worry this isn't the most interesting of the restarts.

@ajwerner (Contributor Author)

I'm worried I somewhat dropped the ball on getting sign-off on roachtest failures. Many scary-sounding roachtests failed when I ran them last week. I'm running them all again, but I'll post the list here as I dig into it.

[screenshot: Cockroach Nightlies Roachtest Nightly - GCE #764 (13 Mar 19 04:42) Overview — TeamCity]

Many logic tests also failed, apparently due to:

panic: unhandled node type: *sql.commentOnTableNode [recovered]
	panic: panic while executing 1 statements: COMMENT ON TABLE _ IS 'waa'; caused by unhandled node type: *sql.commentOnTableNode

see here
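That panic has the shape of a plan-tree walker whose type switch was never taught about the newly added node. A minimal, hypothetical sketch of the failure mode follows; the types and walker are illustrative, not the actual sql package code.

// Hedged sketch of the failure mode behind the panic above: a pass over the
// plan tree with an exhaustive-by-convention type switch panics when a newly
// added node type (here, something like commentOnTableNode) has no case.
package main

import "fmt"

type planNode interface{ plan() }

type scanNode struct{}
type commentOnTableNode struct{} // the newly added statement's node

func (scanNode) plan()           {}
func (commentOnTableNode) plan() {}

// walkPlan stands in for any pass that must enumerate every node type it
// can encounter.
func walkPlan(n planNode) {
	switch n.(type) {
	case *scanNode:
		// ... handle the known node ...
	default:
		panic(fmt.Sprintf("unhandled node type: %T", n))
	}
}

func main() {
	walkPlan(&scanNode{})           // fine
	walkPlan(&commentOnTableNode{}) // panics: unhandled node type: *main.commentOnTableNode
}

Presumably the fix is just adding the missing case for the new node type; the sketch only illustrates why every logic test exercising COMMENT ON TABLE trips the same panic.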


tbg commented Mar 18, 2019

@andreimatei or @bdarnell can qualify the jepsen failures (IIRC the exit code 255 ones aren't "real failures" -- but I haven't opened the artifacts). kv/gracefuldraining has been failing for ages (I mean it -- ages) and has no weight on the releasability (except if nodes died with scary errors). The CDC tests have been flaky for a long time, sort of strategically due to switching to RangeFeed early in the process. @danhhz can tell you what his expectations are. Mine are that they are still known flaky and can't really tell you much.

I'd worry about three of them: kv0, clearrange/checks=true (will look at that one), splits/load/...

For the rest I'd look for crashes only.


tbg commented Mar 18, 2019

clearrange is fine. It's a crash, but one triggered by an assertion only active in that special test and one that I know can fail.


tbg commented Mar 18, 2019

kv0 died in roachprod start, so nothing to see there.
splits/load/uniform/nodes=3 actually tests load-based splitting and has been broken since I rewrote LBS. It's broken only in the sense that there are sometimes a few fewer splits than it expects. This isn't new and also not critical, though at some point I'll want to understand it.

@ajwerner (Contributor Author)

Thank you for jumping in so quickly to help qualify these failures!


danhhz commented Mar 18, 2019

Yeah, the CDC failures are expected as of the SHA you cut.

@bdarnell (Contributor)

The jepsen failures are fine. I don't think all "exit code 255" errors are benign, but both of these were NTP flakes (maybe we should fix the skew nemeses to not hit NTP as hard):

Caused by: java.lang.RuntimeException: sudo -S -u root bash -x -c "cd /; ntpdate -b ntp.ubuntu.com" returned non-zero exit status 1 on 10.128.0.116. STDOUT:


STDERR:
+ cd /
+ ntpdate -b ntp.ubuntu.com
13 Mar 09:21:45 ntpdate[4845]: 91.189.89.198 rate limit response from server.
13 Mar 09:21:46 ntpdate[4845]: 91.189.91.157 rate limit response from server.
13 Mar 09:21:47 ntpdate[4845]: 91.189.94.4 rate limit response from server.
13 Mar 09:21:49 ntpdate[4845]: 91.189.89.199 rate limit response from server.
13 Mar 09:21:49 ntpdate[4845]: no server suitable for synchronization found


tbg commented Mar 18, 2019

What about the third and fourth one (from jepsen-batch3)? Or am I counting those wrong?

@bdarnell (Contributor)

Sorry, got confused by the TeamCity * putting things out of order (what does this mean?). All four of these are the same NTP rate-limit issue.
