roachtest: roachprod put cockroach fails with Connection closed by ... #37113

Closed
cockroach-teamcity opened this issue Apr 25, 2019 · 7 comments
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity X-stale

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/event-log PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	cluster.go:1201,event_log.go:36,acceptance.go:92,test.go:1245: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1260033-acceptance /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach returned:
		stderr:
		
		stdout:
		teamcity-1260033-acceptance: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach
		...............................................................................................................................
		   1: done
		   2: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine root@35.231.2.99:./cockroach root@35.196.233.145:./cockroach
		Connection closed by 35.231.2.99 port 22
		: exit status 1
		   3: done
		   4: done
		I190425 06:22:39.419528 1 cluster_synced.go:986  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 failed
		: exit status 1

cockroach-teamcity added this to the 19.2 milestone Apr 25, 2019
cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered), O-roachtest, and O-robot (Originated from a bot) labels Apr 25, 2019
andreimatei changed the title from "roachtest: acceptance/event-log failed" to "roachtest: roachprod put cockroach fails with Connection closed by ..." Apr 25, 2019
@andreimatei
Contributor

This run had #37001, which I think we were hoping would fix this, but it didn't. cc @nvanbenschoten
cc @ajwerner - did you have a plan for adding more verbosity to scp?

@andreimatei
Contributor

cc @bdarnell too. These ssh connections getting closed like this seems to be a very common failure. We really need to do something here - but what? Does anybody know what logging can be added to ssh and/or sshd?

ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 25, 2019
This single '-v' flag enables debug1 level logging in ssh in hopes
of helping root cause issues like cockroachdb#37113.
@ajwerner
Contributor

Added a PR to at least add verbosity to the scp call. We should also consider adding logging to some other uses of SSH but this is the easiest place to inject extra logging. Logging at a higher level felt excessive as it would log about every block that gets transferred.
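
As a rough illustration of the verbosity knob discussed here, the sketch below builds an scp command line with a configurable number of `-v` flags (ssh and scp accept up to three). The remaining options are copied from the failing invocation in the log above. The function name `scp_cmd` is hypothetical and is not part of roachprod.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not roachprod code: print an scp command line with
# 0-3 levels of -v. One -v gives debug1 output; -vvv is the most verbose.
scp_cmd() {
  local verbosity="$1"
  local src="$2"
  local dst="$3"
  local vflag=""
  case "$verbosity" in
    0) vflag="" ;;
    1) vflag="-v" ;;
    2) vflag="-vv" ;;
    3) vflag="-vvv" ;;
    *) echo "verbosity must be 0-3" >&2; return 2 ;;
  esac
  # Options other than -v mirror the scp call shown in the failure log.
  printf '%s\n' "scp ${vflag:+$vflag }-r -C -o StrictHostKeyChecking=no $src $dst"
}
```

For example, `scp_cmd 1 root@HOST_A:./cockroach root@HOST_B:./cockroach` prints the debug1 variant, and passing `3` yields the per-block logging described above as excessive.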

It'd be nice to know what the server is logging when this happens. Maybe we should add some logic to capture journalctl logs when tests fail to set up a cluster.
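
A minimal sketch of what capturing journalctl logs on a failed setup might look like. It assumes systemd hosts whose sshd unit is named `ssh` (as on Debian/Ubuntu; other distributions may use `sshd`), and the names `collect_sshd_logs` and `SSH_BIN` are hypothetical, not existing roachprod or roachtest code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pull the last hour of sshd journal entries from each
# node into an artifacts directory. SSH_BIN can be overridden so the
# function can be exercised without real hosts.
SSH_BIN="${SSH_BIN:-ssh}"

collect_sshd_logs() {
  local outdir="$1"
  shift
  mkdir -p "$outdir" || return 1
  local host
  local failed=0
  for host in "$@"; do
    # Assumption: the sshd unit is called "ssh" on the remote host.
    if ! "$SSH_BIN" -o StrictHostKeyChecking=no "$host" \
        journalctl -u ssh --no-pager --since=-1h \
        >"$outdir/$host.sshd.log" 2>&1; then
      # The collecting ssh can fail too; note it and continue with the rest.
      echo "could not collect sshd logs from $host" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The function returns non-zero if any host could not be reached, but still writes whatever it did collect, so a partial set of logs survives in the artifacts directory.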

How do we classify this as a cluster creation failure rather than a test failure?

@tbg
Member

tbg commented Apr 25, 2019

> Logging at a higher level felt excessive as it would log about every block that gets transferred.

We could swallow the output if the command doesn't fail, nobody cares what scp says if it reports success (different for ssh, which sees the same problems)
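
The idea of swallowing verbose output on success can be sketched as follows. This is an illustration only, not a description of how roachprod does it; `run_quiet_unless_failed` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a command, buffer its stdout and stderr, and
# replay the buffer on stderr only if the command exits non-zero.
run_quiet_unless_failed() {
  local log
  local rc=0
  log="$(mktemp)" || return 1
  "$@" >"$log" 2>&1 || rc=$?
  if [ "$rc" -ne 0 ]; then
    cat "$log" >&2
  fi
  rm -f "$log"
  # Preserve the wrapped command's exit status for the caller.
  return "$rc"
}
```

Wrapping a verbose scp call this way keeps successful transfers silent while still surfacing the full debug output for a failed one.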

> It'd be nice to know what the server is logging when this happens. Maybe we should add some logic to capture journalctl logs when tests fail to set up a cluster.

That's the thing I really hope can fix this issue - can't we just set up our clusters with verbose sshd logging in the first place, and capture the logs on failed tests? I'd like to avoid tracking these flakes as cluster creation failures because that adds a lot of mess to roachtest and it may even distract from fixing the root cause.
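
One way verbose sshd logging could be switched on at cluster setup is by setting `LogLevel` in `sshd_config`. The sketch below is an assumption about how that might be scripted, not something taken from roachprod; `set_sshd_loglevel` is a hypothetical name. It writes the directive at the top of the file because sshd uses the first value it finds for each keyword, and a line appended at the end could fall inside a `Match` block.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: force a given sshd LogLevel in a config file.
# Valid levels include INFO, VERBOSE, DEBUG1, DEBUG2 and DEBUG3.
set_sshd_loglevel() {
  local conf="$1"
  local level="${2:-VERBOSE}"
  local tmp
  [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
  tmp="$(mktemp)" || return 1
  {
    # First occurrence wins, so the new directive goes on the first line.
    printf 'LogLevel %s\n' "$level"
    # Drop existing LogLevel lines, commented or not, to avoid confusion.
    grep -v -i -E '^[[:space:]]*#?[[:space:]]*LogLevel[[:space:]]' "$conf" || true
  } >"$tmp"
  # Overwrite in place so the original file's permissions are kept.
  cat "$tmp" >"$conf"
  rm -f "$tmp"
}
```

On a real host this would be run against `/etc/ssh/sshd_config` as root, followed by a config check (`sshd -t`) and a reload of the sshd service; the service name depends on the distribution.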

craig bot pushed a commit that referenced this issue Apr 25, 2019
37125: roachprod: enable verbose logging for scp r=ajwerner a=ajwerner

This single '-v' flag enables debug1 level logging in ssh in hopes
of helping root cause issues like #37113.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@ajwerner
Contributor

> We could swallow the output if the command doesn't fail, nobody cares what scp says if it reports success (different for ssh, which sees the same problems)

We already do. Let's see where we are when we have some output from #37125; if it's not enough, we can change the setting to -vvv.

> That's the thing I really hope can fix this issue - can't we just set up our clusters with verbose sshd logging in the first place, and capture the logs on failed tests?

I suspect the interesting sshd logging is on the remote host, not the local runner. We're not guaranteed that a later ssh to go collect sshd logs will succeed but I'll type up a diff to go collect them.

@tbg
Member

tbg commented Apr 25, 2019

> We're not guaranteed that a later ssh to go collect sshd logs will succeed but I'll type up a diff to go collect them.

The dead node detection usually seems to get in just fine, so I think our chances are good that this will just work (🤞)

ajwerner added a commit to ajwerner/cockroach that referenced this issue May 15, 2019
This single '-v' flag enables debug1 level logging in ssh in hopes
of helping root cause issues like cockroachdb#37113.
kenliu added the A-testing (Testing tools and infrastructure) label and removed the C-test-failure (Broken test, automatically or manually discovered), O-roachtest, and O-robot (Originated from a bot) labels Jul 12, 2019
kenliu removed this from the 19.2 milestone Jul 12, 2019
awoods187 added the C-enhancement label (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) Aug 30, 2019
@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
