closedts: shorten target_duration from 30s to 5s #39643
cc @awoods187 |
Awesome! |
Are we concerned about transactions that take more than 5s retrying endlessly? This was a problem previously, though only with 48s+ transactions. Users' batch jobs may go haywire. Arguably, if the closed timestamp prevents a txn once, it will do so again (in the absence of contention at least). |
Also, please run a few manual |
Yes, IIUC that's true for transactions which observe the clock. Transactions which don't should just get pushed and then refresh, right?
I'll wait until the stress passes on critical packages before attempting to merge this. |
The test failure seems to not be directly related to learners as something similar happens when learners are disabled. I'm bisecting before typing up an issue.
|
Yes, though if the txn takes more than 5s, the refresh may fail. Definitely txns which don't read are safe.
Uhoh, that doesn't look good. Curious what you'll find. |
(My hope is that this is #39604) |
I've been stressing things and saw:
Which I don't think is related. Stressing it a while longer and then I'm going to bors this. Please don't hesitate to revert it. |
While stressing storage for cockroachdb#39643 I encountered TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic failing under stress. The test also complained that the parent testing.T had already been failed when a child was subsequently used, so I fixed that too, though in a perhaps messy way.

```
--- FAIL: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic (2.29s)
    --- PASS: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic/initial_run (0.14s)
        client_test.go:1280: [NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=4 start=1566532897.322355813,1 exp=1566532898.378802707,0 pro=1566532897.478824123,0
    --- FAIL: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic/after_restart (1.02s)
        testing.go:820: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
```

Release note: None
I also stressed master just to see what happens and observed some other flakes:
and
and
(#39838) Stressing storage and fixing the various rare failures would be good. In the meantime it's not clear that this made anything worse. If it becomes apparent that's not true, please do revert it. bors r+ |
Build failed |
9c29e06 to a3d077c
Fixed a constant in the follower read test. bors r+ |
Build failed |
This PR dramatically shortens the closed timestamp target interval. With this setting, experimental_follower_read_timestamp() will now be 8.5 seconds in the past, as opposed to the current 48 seconds. This relatively aggressive setting of 5s is intended to shake out issues.

This value has been verified to work on the `schemachange/mixed/tpcc` roachtest introduced in cockroachdb#39096 and for vanilla TPC-C runs around where we have established a baseline in the past using tpccbench. Specifically, several runs at 2300 warehouses on 3x c5d.4xlarge nodes, which is right at the passing boundary without the change, were run, and only a very small difference in efficiency or tail latency was observed.

It seems reasonable to attempt to live with this value on master for a while and see what happens.

Fixes cockroachdb#37083.

Release note: None
a3d077c to 3e97e57
Generated the settings html file. bors r+ |
Build failed |
Flaked on #39610 bors r+ |
39643: closedts: shorten target_duration from 30s to 5s r=ajwerner a=ajwerner

This PR dramatically shortens the closed timestamp target interval. With this setting, experimental_follower_read_timestamp() will now be 8.5 seconds in the past, as opposed to the current 48 seconds. This relatively aggressive setting of 5s is intended to shake out issues as we head into the stability period.

This value has been verified to work on the `schemachange/mixed/tpcc` roachtest introduced in #39096 and for vanilla TPC-C runs around where we have established a baseline in the past using tpccbench. Specifically, several runs at 2300 warehouses on 3x c5d.4xlarge nodes, which is right at the passing boundary without the change, were run, and only a very small difference in efficiency or tail latency was observed.

It seems reasonable to attempt to live with this value on master for a while and see what happens.

Fixes #37083.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded |
39832: changefeedccl: allow base64-encoded client certificate r=rolandcrosby a=rolandcrosby

Adds `client_cert` and `client_key` options to the `kafka://` changefeed URI scheme. Works like the existing `ca_cert` option: the user base64-encodes the contents of a PEM certificate and private key, and passes those base64 values as parameters in the Kafka URI.

Fixes #39817.

Release note (enterprise change): Client certificates are now supported for Kafka changefeed authentication.

39838: storage: fix stress flake r=ajwerner a=ajwerner

While stressing storage for #39643 I encountered TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic failing under stress. It also complained about the parent testing.T having been failed when then using a child so fixed that too though in a perhaps messy way.

```
--- FAIL: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic (2.29s)
    --- PASS: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic/initial_run (0.14s)
        client_test.go:1280: [NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=4 start=1566532897.322355813,1 exp=1566532898.378802707,0 pro=1566532897.478824123,0
    --- FAIL: TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic/after_restart (1.02s)
        testing.go:820: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
```

Release note: None

Co-authored-by: Roland Crosby <roland@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
This reverts commit 3e97e57. The last couple of weeks of experience show that tests become flaky with the target_duration set to 5s. This isn't overly surprising given that updates to table descriptors observe their own timestamp and thus can never be refreshed. See https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/sqlbase/structured.go#L1496-L1509 References cockroachdb#39643. Release note: None
40527: Revert "closedts: shorten target_duration from 30s to 5s" r=andy-kimball a=ajwerner This reverts commit 3e97e57. The last couple of weeks of experience show that tests become flaky with the target_duration set to 5s. This isn't overly surprising given that updates to table descriptors observe their own timestamp and thus can never be refreshed. See https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/sqlbase/structured.go#L1496-L1509 References #39643. Release note: None Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
This PR dramatically shortens the closed timestamp target interval. With this
setting the experimental_follower_read_timestamp() will now be 8.5 seconds in
the past as opposed to the current 48 seconds in the past. This relatively
aggressive setting of 5s is intended to shake out issues as we head into the stability
period.
This value has been verified to work on the `schemachange/mixed/tpcc` roachtest
introduced in #39096 and for vanilla TPC-C runs around where we have
established a baseline in the past using tpccbench. Specifically, several runs
at 2300 warehouses on 3x c5d.4xlarge nodes, which is right at the passing
boundary without the change, were run, and only a very small difference in
efficiency or tail latency was observed.
It seems reasonable to attempt to live with this value on master for a while
and see what happens.
Fixes #37083.
Release note: None