release: v19.1.0-beta.20190318 #35660
Comments
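I kicked off two clusters last night, one with the provisional release and one with the last release. I loaded tpc-c 1k onto each and left them running when I went to bed. The new release seems to have died a little over an hour in. There was a big spike in CPU and memory, as well as in txn restarts; eventually an error hit the client and the load stopped. There are a couple of interesting things in the logs, but on the whole they're pretty sparse.
There are a couple of these:
ip-172-31-35-120> W190313 05:06:27.719833 13603764 storage/intentresolver/intent_resolver.go:822 [n4,s4,r3044/1:/Table/57/1/{80/0-90/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/87/6/0: context canceled
ip-172-31-46-235> W190313 05:06:27.796814 11838596 storage/intentresolver/intent_resolver.go:822 [n6,s6,r2953/3:/Table/57/1/8{50/0-60/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/852/7/0: context canceled
and then
ip-172-31-44-162> I190313 05:07:32.666042 22749 sql/distsql_running.go:149 [n2,client=172.31.35.96:44164,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=0f1ce342 key=/Table/57/1/625/2/0 rw=true pri=0.02300479 stat=ABORTED epo=0 ts=1552453630.311500840,1 orig=1552453626.561107837,0 max=1552453627.061107837,0 wto=false seq=1
Then a few minutes later there's a cascade of deadline exceeded errors and then the load stops. |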
Any heap profiles (`heap_profiler` log subdir)?
…On Wed, Mar 13, 2019 at 1:47 PM ajwerner ***@***.***> wrote:
I kicked off two clusters last night, one with the provisional release,
one with the last release. I loaded up tpc-c 1k on to each and left them
running when I went to bed. The new release seems to have died a little
over an hour in. There was a big spike in CPU/memory as well as in txn
restarts then eventually an error hits the client and the load stops.
Then there are a couple of interesting things in the logs but on the whole
it's pretty sparse
There are a couple of
ip-172-31-35-120> W190313 05:06:27.719833 13603764 storage/intentresolver/intent_resolver.go:822 [n4,s4,r3044/1:/Table/57/1/{80/0-90/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/87/6/0: context canceled
ip-172-31-46-235> W190313 05:06:27.796814 11838596 storage/intentresolver/intent_resolver.go:822 [n6,s6,r2953/3:/Table/57/1/8{50/0-60/0}] failed to gc transaction record: could not GC completed transaction anchored at /Table/57/1/852/7/0: context canceled
and then
ip-172-31-44-162> I190313 05:07:32.666042 22749 sql/distsql_running.go:149 [n2,client=172.31.35.96:44164,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=0f1ce342 key=/Table/57/1/625/2/0 rw=true pri=0.02300479 stat=ABORTED epo=0 ts=1552453630.311500840,1 orig=1552453626.561107837,0 max=1552453627.061107837,0 wto=false seq=1
Then a few minutes later there's a cascade of deadline exceeded and then
the load stops.
|
Unfortunately no, the heap_profiler directories are empty |
We see increases in transaction restarts, which seem to lead to distsql flows running much longer than in the steady state. We don't have a lot to go on to determine what caused this restart storm. The logging we do get shows:
I worry this isn't the most interesting of the restarts. |
I'm worried I somewhat dropped the ball on getting sign-off on roachtest failures. Many scary-sounding roachtests failed when I ran them last week. I am running them all again, and I'll post the list here as I dig into it. Many logic tests also failed, seemingly due to:
see here |
@andreimatei or @bdarnell can qualify the jepsen failures (IIRC the exit code 255 ones aren't "real failures" -- but I haven't opened the artifacts). I'd worry about three of them: kv0, clearrange/checks=true (will look at that one), splits/load/... For the rest I'd look for crashes only. |
|
kv0 died in roachprod start, so nothing to see there. |
Thank you for jumping in so quickly to help qualify these failures! |
Yeah, the cdc failures are expected as of the sha you cut |
The jepsen failures are fine. I don't think all "exit code 255" errors are benign, but both of these were ntp flakes (maybe we should be fixing the skews nemeses to not hit ntp as hard):
|
What about the third and fourth one (from jepsen-batch3)? Or am I counting those wrong? |
Sorry, got confused by the teamcity |
Candidate SHA: a512e39
Deployment status: Ran roachprod based clusters, deployed to adriatic late (2019/04/18 12:40 EST)
Nightly Suite: TeamCity Job
Release process checklist
Prep date: 3/12/2019
Pick a SHA, fill in Candidate SHA above, notify #release-process of SHA.
Tag the provisional SHA
Publish provisional binaries
Check binaries
Deploy to test clusters
Start nightly suite
Verify node crash reports
Fill in Deployment status above with clusters and Nightly Suite with the link to Nightly TeamCity Job
Keep an eye on clusters until release date. Do not proceed below until the release date.
Release date: 3/18/2019
Check cluster status
Tag release
Bless provisional binaries
For production or stable releases in the latest major release series
Update docs
External communications for release