Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/splits/nodes=3/quiesce=true failed #62063

Closed
cockroach-teamcity opened this issue Mar 16, 2021 · 11 comments · Fixed by #62108
Closed

roachtest: kv/splits/nodes=3/quiesce=true failed #62063

cockroach-teamcity opened this issue Mar 16, 2021 · 11 comments · Fixed by #62108
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).kv/splits/nodes=3/quiesce=true failed on release-21.1@4a47e0e305cbdabf963f896c1cb571a28b34e63d:

		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2688,kv.go:589,test_runner.go:767: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.registerKVSplits.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:589
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1667,context.go:140,cluster.go:1656,test_runner.go:848: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2780761-1615874220-12-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 5369
		3: 5314
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

More

Artifacts: /kv/splits/nodes=3/quiesce=true

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 16, 2021
@knz knz removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 16, 2021
@knz
Copy link
Contributor

knz commented Mar 16, 2021

Node spontaneously aborted due to a clock offset error.

This seems to suggest that the time synchronization may not be configured properly on these machines?

teamcity-2780761-1615874220-12-n4cpu4-0001> F210316 06:25:32.421638 1271 1@server/server.go:322 â‹® [n1] 62225  clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 2 are within the offset)

The workload command also detected this:

		  | W210316 06:25:24.052369 261 workload/workloadsql/workloadsql.go:150  [-] 185  ALTER TABLE kv SCATTER FROM (-1013675952503954432) TO (-1013675952503954432): pq: remote wall time is too far ahead (997.307724ms) to be trustworthy
		  | W210316 06:25:24.057565 137 workload/workloadsql/workloadsql.go:150  [-] 186  ALTER TABLE kv SCATTER FROM (-1015213176052684800) TO (-1015213176052684800): pq: remote wall time is too far ahead (963.358948ms) to be trustworthy
		  | W210316 06:25:24.058499 351 workload/workloadsql/workloadsql.go:150  [-] 187  ALTER TABLE kv SCATTER FROM (-1013983397213700096) TO (-1013983397213700096): pq: remote wall time is too far ahead (996.520502ms) to be trustworthy
		  | W210316 06:25:24.074307 218 workload/workloadsql/workloadsql.go:150  [-] 188  ALTER TABLE kv SCATTER FROM (-1015090198168785920) TO (-1015090198168785920): pq: remote wall time is too far ahead (994.917869ms) to be trustworthy
		  | W210316 06:25:24.077727 126 workload/workloadsql/workloadsql.go:150  [-] 189  ALTER TABLE kv SCATTER FROM (-1013552974620055552) TO (-1013552974620055552): pq: remote wall time is too far ahead (985.276807ms) to be trustworthy
		  | W210316 06:25:24.083863 321 workload/workloadsql/workloadsql.go:150  [-] 190  ALTER TABLE kv SCATTER FROM (-1014413819807344640) TO (-1014413819807344640): pq: remote wall time is too far ahead (988.916505ms) to be trustworthy
		  | W210316 06:25:24.098371 198 workload/workloadsql/workloadsql.go:150  [-] 191  ALTER TABLE kv SCATTER FROM (-1015336153936582656) TO (-1015336153936582656): pq: remote wall time is too far ahead (985.03805ms) to be trustworthy
		  | W210316 06:25:24.114842 270 workload/workloadsql/workloadsql.go:150  [-] 192  ALTER TABLE kv SCATTER FROM (-1024559495228966912) TO (-1024559495228966912): pq: remote wall time is too far ahead (974.053225ms) to be trustworthy

@knz
Copy link
Contributor

knz commented Mar 16, 2021

@jlinder @rickystewart do you know how roachtest-created VMs ensure their clock is properly synchronized?

@rail
Copy link
Member

rail commented Mar 16, 2021

I just took a loot at one of the PR agents and it looks like it runs chrony/chronyd:

rail@gce-agent-ppr-20210224-preempt-20:~$ systemctl status chronyd.service 
● chrony.service - chrony, an NTP client/server
   Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-03-16 14:22:49 UTC; 1h 1min ago
     Docs: man:chronyd(8)
           man:chronyc(1)
           man:chrony.conf(5)
 Main PID: 2812 (chronyd)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/chrony.service
           └─2812 /usr/sbin/chronyd
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 systemd[1]: Starting chrony, an NTP client/server...
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: chronyd version 3.2 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +SI
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: Frequency -90.471 +/- 0.061 ppm read from /var/lib/chrony/chrony.drift
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 systemd[1]: Started chrony, an NTP client/server.
Mar 16 14:22:53 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: Selected source 169.254.169.254
rail@gce-agent-ppr-20210224-preempt-20:~$ systemctl status chrony.service 
● chrony.service - chrony, an NTP client/server
   Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-03-16 14:22:49 UTC; 1h 2min ago
     Docs: man:chronyd(8)
           man:chronyc(1)
           man:chrony.conf(5)
 Main PID: 2812 (chronyd)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/chrony.service
           └─2812 /usr/sbin/chronyd
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 systemd[1]: Starting chrony, an NTP client/server...
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: chronyd version 3.2 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +SI
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: Frequency -90.471 +/- 0.061 ppm read from /var/lib/chrony/chrony.drift
Mar 16 14:22:49 gce-agent-ppr-20210224-preempt-20 systemd[1]: Started chrony, an NTP client/server.
Mar 16 14:22:53 gce-agent-ppr-20210224-preempt-20 chronyd[2812]: Selected source 169.254.169.254

@rail
Copy link
Member

rail commented Mar 16, 2021

Just realized that the host is created by roachprod, let me peek there too.

@rail
Copy link
Member

rail commented Mar 16, 2021

Just used one of the roachprod created machines: ./bin/roachprod ssh teamcity-2780739-1615874978-42-n8cpu4:1

  • It's based on ubuntu 16.04
  • No chrony/chronyd running
  • There is google_clock_skew_daemon, I'm not familiar with it tbh

I wonder if we should use chrony/ntp instead.

@knz
Copy link
Contributor

knz commented Mar 16, 2021

@bdarnell do you know whether google_clock_skew_daemon is also performing general time sync? Or are we supposed to use it in combination with chrony/ntp?

@bdarnell
Copy link
Contributor

I don't know what exactly google_clock_skew_daemon does, but it is run in addition to chrony/ntpd/systemd-timesyncd and doesn't replace them (#13447 was caused by google_clock_skew_daemon not running).

We should be running exactly one of chrony, ntpd, or systemd-timesyncd. I believe we're just relying on the default behavior of ubuntu here. (if we're installing a non-default time sync tool we need to be sure to disable the default; this is something to watch out for when we upgrade to a newer ubuntu because i think the default has changed since 16.04).

@rail rail self-assigned this Mar 16, 2021
@rail
Copy link
Member

rail commented Mar 16, 2021

Just verified that we run ntpd on the GCE VMs and chrony in AWS. Looks like we explicitly added chrony in #31577, but only for AWS. We can do something similar in GCE.

@rail
Copy link
Member

rail commented Mar 16, 2021

BTW, the gce instances use this in ntp.conf

# Google NTP source. This was added as part of the GCE image build process.
server metadata.google.internal iburst

@knz
Copy link
Contributor

knz commented Mar 17, 2021

BTW, the gce instances use this in ntp.conf

What is this "iburst" about?

Also, I recall that ntpd in some deployments intendedly does not adjust the clock upon system startup if the skew is larger than some threshold. To achieve this, it is necessary to also run ntpdate prior to starting ntpd. This used to be the case at some point in the past in Debian and so I assume Ubuntu too.

Given the large skew value in the error message at the top, this is worth checking.

@rail
Copy link
Member

rail commented Mar 17, 2021

That is my understanding as well. ntpd panics if the skew is too large. https://chrony.tuxfamily.org/comparison.html compares them and it sounds like it's worth switching to chrony and at the same time have the same setup as in AWS.

craig bot pushed a commit that referenced this issue Mar 22, 2021
62108: roachprod: Install and configure chrony on GCE clusters r=rail a=rail

Fixes #62063

In #31577 we switched to `chrony` for AWS, but not for CGE. By default
they GCE clusters based on Ubuntu 16.04 use `ntp`.

This patch installs `chrony` (and automatically removes `ntp`) on GCE
and configures `chrony` to use Google's time server.

Release note: None

Co-authored-by: Rail Aliiev <rail@iqchoice.com>
@craig craig bot closed this as completed in df7b3cf Mar 22, 2021
rail added a commit to rail/cockroach that referenced this issue Mar 29, 2021
Fixes cockroachdb#62063

In cockroachdb#31577 we switched to `chrony` for AWS, but not for CGE. By default
they GCE clusters based on Ubuntu 16.04 use `ntp`.

This patch installs `chrony` (and automatically removes `ntp`) on GCE
and configures `chrony` to use Google's time server.

Release note: None
tbg pushed a commit to tbg/cockroach that referenced this issue May 14, 2021
Fixes cockroachdb#62063

In cockroachdb#31577 we switched to `chrony` for AWS, but not for CGE. By default
they GCE clusters based on Ubuntu 16.04 use `ntp`.

This patch installs `chrony` (and automatically removes `ntp`) on GCE
and configures `chrony` to use Google's time server.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants