Skip to content

Commit

Permalink
Merge #40821
Browse files Browse the repository at this point in the history
40821: cli/interactive_tests: deflake some tests r=andreimatei a=knz

Release justification: see individual commits

### cli/interactive_tests: make server shutdown more resilient

Fixes #40776.
Fixes #40788.
Fixes #40746.

This patch intends to resolve test flakiness in the CLI interactive
tests. The flakiness was observed while shutting down servers at the
end of tests. The symptom are:

- the `cockroach quit` command terminates successfully;
- after 5 seconds, the process is still there (its PID is still present);
- immediately when the 5-second loop stops and the script attempts
  to forcefully kill the process, the PID is not there any more;
- the server log file indicates the server has shut down successfully.

This combination of symptoms suggests that the server shutdown takes
approximately 5 seconds. Flakiness occurs because most of the time it
shuts down within 5 seconds, and only occasionally takes more than 5
seconds.

The patch thus bumps the delay for forceful shutdown up to 30
seconds. Since an escape from the loop (server fails to shut down)
incurs a failure of the test, we are expecting this to only occur
rarely.

Separately, the patch also changes the definition of the separate
action `force_stop_server` to immediately kill the process. The 3
callers of this action do not expect a graceful `quit` to succeed
anyway.

### cli/interactive_tests: deflake test_log_config_msg

Fixes #39720.

This test checks that the cluster ID is reported at the start of the
log file(s). Prior to this patch, it would check this by starting a
server, then quitting it immediately, and then looking at the
resulting log file.

This logic was flawed, because it is possible for `cockroach quit` to
instruct a server to shut down before it has completed
its (informational) report in the log file.

This patch resolves this bug by waiting for the expected log output
*before* quitting the server instead of afterwards.




Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
  • Loading branch information
craig[bot] and knz committed Sep 17, 2019
2 parents d21dad8 + 6c1e8f3 commit ad4d6f5
Show file tree
Hide file tree
Showing 2 changed files with 12 additions and 15 deletions.
20 changes: 7 additions & 13 deletions pkg/cli/interactive_tests/common.tcl
Original file line number Diff line number Diff line change
Expand Up @@ -96,20 +96,21 @@ proc start_server {argv} {
proc stop_server {argv} {
report "BEGIN STOP SERVER"
# Trigger a normal shutdown.
system "$argv quit"
# If after 5 seconds the server hasn't shut down, trigger an error.
system "for i in `seq 1 5`; do
# If after 30 seconds the server hasn't shut down, kill the process and trigger an error.
# Note: kill -CONT tests whether the PID exists (SIGCONT is a no-op for the process).
system "kill -TERM `cat server_pid` 2>/dev/null;
for i in `seq 1 30`; do
kill -CONT `cat server_pid` 2>/dev/null || exit 0
echo still waiting
sleep 1
done
echo 'server still running?'
# Send an unclean shutdown signal to trigger a stack trace dump.
kill -ABRT `cat server_pid`
kill -ABRT `cat server_pid` 2>/dev/null
# Sleep to increase the probability that the stack trace actually
# makes it to disk before we force-kill the process.
sleep 1
kill -KILL `cat server_pid`
kill -KILL `cat server_pid` 2>/dev/null
exit 1"

report "END STOP SERVER"
Expand All @@ -131,13 +132,6 @@ proc flush_server_logs {} {

proc force_stop_server {argv} {
report "BEGIN FORCE STOP SERVER"
system "$argv quit & sleep 1
if kill -CONT `cat server_pid` 2>/dev/null; then
kill -TERM `cat server_pid`
sleep 1
if kill -CONT `cat server_pid` 2>/dev/null; then
kill -KILL `cat server_pid`
fi
fi"
system "kill -KILL `cat server_pid`"
report "END FORCE STOP SERVER"
}
7 changes: 5 additions & 2 deletions pkg/cli/interactive_tests/test_log_config_msg.tcl
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,15 @@ source [file join [file dirname $argv0] common.tcl]
# flushed and not in the middle of a rotation.

start_server $argv
stop_server $argv

start_test "Check that the cluster ID is reported at the start of the first log file."
system "grep -qF '\[config\] clusterID:' logs/db/logs/cockroach.log"
spawn tail -n 1000 -F logs/db/logs/cockroach.log
eexpect "\\\[config\\\] clusterID:"
eexpect "node startup completed"
end_test

stop_server $argv


# Make a server with a tiny log buffer so as to force frequent log rotation.
system "$argv start-single-node --insecure --pid-file=server_pid --background -s=path=logs/db --log-file-max-size=2k >>logs/expect-cmd.log 2>&1;
Expand Down

0 comments on commit ad4d6f5

Please sign in to comment.