roachtest: jepsen subcritical-skews tests are skipped due to ntp rate limiting #35599

Closed
cockroach-teamcity opened this issue Mar 11, 2019 · 7 comments · Fixed by #112710 · May be fixed by #92125
Labels
branch-master Failures and bugs on the master branch. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. O-roachtest O-robot Originated from a bot. skipped-test T-testeng TestEng Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 11, 2019

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank-multitable/subcritical-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	jepsen.go:247,jepsen.go:308,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1170795-jepsen-batch1:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.38 -n 10.142.0.9 -n 10.142.0.41 -n 10.142.0.27 -n 10.142.0.26 \
		   --test bank-multitable --nemesis subcritical-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

Jira issue: CRDB-4573

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Mar 11, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 11, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5ebfeec052f9cee4e63757defe7c9120643293db

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank-multitable/subcritical-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1174810&tab=buildLog

The test failed on release-2.1:
	jepsen.go:247,jepsen.go:308,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1174810-jepsen-batch1:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.47 -n 10.142.0.38 -n 10.142.0.44 -n 10.142.0.36 -n 10.142.0.41 \
		   --test bank-multitable --nemesis subcritical-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7ce9188c6e64465d9dcb9f0ca0f113dd0e584da0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank-multitable/subcritical-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1178908&tab=buildLog

The test failed on release-2.1:
	jepsen.go:247,jepsen.go:308,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1178908-jepsen-batch1:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.142.0.39 -n 10.142.0.159 -n 10.142.0.38 -n 10.142.0.36 -n 10.142.0.160 \
		   --test bank-multitable --nemesis subcritical-skews \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 255
		: exit status 1

@bdarnell bdarnell changed the title from "roachtest: jepsen-batch1/bank-multitable/subcritical-skews failed" to "roachtest: jepsen subcritical-skews rate limiting" Mar 18, 2019
@bdarnell
Contributor

The subcritical-skews nemesis resynchronizes with ntp frequently. This has recently started failing because we're getting rate-limited by the NTP server (it hard-codes ntp.ubuntu.com).

We need to either

  • Track the offsets we've applied and undo them without going back to NTP
  • Find an NTP server with higher rate limits (maybe gcp's internal ones?)
  • Run our own ntp server on the jepsen controller node and sync against it?
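The first option could be sketched roughly as follows. This is a minimal illustration in Python, not the Jepsen nemesis code (which is written in Clojure): `SkewTracker` and its method names are hypothetical, and offsets are modeled as plain integers in milliseconds rather than applied to a real system clock.

```python
class SkewTracker:
    """Hypothetical helper: remembers every clock offset the nemesis applied
    to a node so the total can be cancelled locally, without contacting NTP."""

    def __init__(self):
        # node name -> net offset (milliseconds) currently applied
        self._applied_ms = {}

    def apply(self, node, offset_ms):
        """Record that offset_ms was added to the node's clock."""
        self._applied_ms[node] = self._applied_ms.get(node, 0) + offset_ms

    def net_offset(self, node):
        """Net skew still outstanding for the node (0 if none recorded)."""
        return self._applied_ms.get(node, 0)

    def undo(self, node):
        """Return the correction that cancels all recorded skew, then forget it."""
        return -self._applied_ms.pop(node, 0)


tracker = SkewTracker()
tracker.apply("n1", 100)   # nemesis bumps n1 forward by 100 ms
tracker.apply("n1", -30)   # then back by 30 ms
correction = tracker.undo("n1")  # single adjustment that cancels both
```

Such bookkeeping would only cancel skew that the nemesis itself injected; any natural clock drift during the test would still need an occasional real sync.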

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 18, 2019
craig bot pushed a commit that referenced this issue Mar 18, 2019
35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei

Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to happen enough
in our tests, for example as described here: #18684 (comment)

This patch marks the error as retriable, since it technically is.

This patch also changes the semantics of the
EndTransactionRequest.Deadline field to make it exclusive so that it
matches the nature of SQL leases. No migration needed.

Touches #18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.

35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner

This PR addresses a test flake introduced by enabling follower reads in
conjunction with #35130 which makes follower reads more generally possible
in the face of lease transfer.

Fixes #35758.

Release note: None

35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell

See #35599

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@tbg tbg removed the C-test-failure Broken test (automatically or manually discovered). label Mar 19, 2019
@tbg tbg changed the title from "roachtest: jepsen subcritical-skews rate limiting" to "roachtest: jepsen subcritical-skews tests are skipped due to ntp rate limiting" Mar 19, 2019
@jordanlewis jordanlewis added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Apr 11, 2019
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@petermattis petermattis removed their assignment Nov 4, 2021
@cucaroach cucaroach removed this from the 19.1 milestone Jan 6, 2022
@cucaroach
Contributor

Clearing the milestone so this gets re-triaged.

@aliher1911 aliher1911 self-assigned this Jan 10, 2022
@aliher1911
Contributor

While looking at other issues connected to jepsen tests, I found that recent jepsen packages use pool.ntp.org instead of ntp.ubuntu.com.

I changed it and gave it a try and, surprisingly, we are not throttled by the pool, and I see no more complaints in the log.

Since the server address is hardcoded in our tests, this should be a quick win that lets us re-enable them.
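One way to avoid repeating this problem would be to make the NTP server a parameter rather than a hardcoded constant. The snippet below is only a sketch under assumptions: it supposes the resync step shells out to `ntpdate -b <server>` (the `-b` flag steps the clock instead of slewing it), and the function name `ntp_resync_command` is hypothetical, not something taken from the jepsen or roachtest code.

```python
import shlex

# Default taken from the comment above; any other server name is an assumption.
DEFAULT_NTP_SERVER = "pool.ntp.org"


def ntp_resync_command(server=DEFAULT_NTP_SERVER):
    """Build the argv for a one-shot clock resync against the given server.

    Hypothetical helper: assumes the nemesis resyncs by running ntpdate.
    """
    return ["ntpdate", "-b", server]


cmd = ntp_resync_command()
# Quote the argv safely before handing it to a remote shell.
print(shlex.join(cmd))
```

Switching servers then becomes a one-argument change, for example `ntp_resync_command("ntp.example.internal")` with a placeholder hostname.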

@aliher1911
Contributor

With the jepsen change in place, I'll put up a diff and see whether it works. Running those tests with roachtest from dev looked fine.

@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Nov 15, 2022
@blathers-crl

blathers-crl bot commented Nov 15, 2022

cc @cockroachdb/test-eng

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Nov 21, 2022
@srosenberg srosenberg added the branch-master Failures and bugs on the master branch. label Jul 7, 2023
craig bot pushed a commit that referenced this issue Oct 20, 2023
112710: roachtest: reenable Jepsen subcritical-skews test r=DarrylWong a=renatolabs

The Jepsen version we are using already moved from `ntp.ubuntu.com` to `pool.ntp.org`. We should be able to run these tests again.

https://github.com/cockroachdb/jepsen/blob/cdeef40a0cd24af0c989e0a7990ee1c7fa948f43/cockroachdb/src/jepsen/cockroach/time.clj#L27

Fixes: #35599

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
@craig craig bot closed this as completed in 32f7a71 Oct 20, 2023
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2023
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2023
Status: Done
9 participants