roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [empty prevented startup file] #67471

Closed
cockroach-teamcity opened this issue Jul 11, 2021 · 3 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 9ba3068987a973a92757db53800dfafbf8a262e7:

		  |   | # because memory limiting doesn't work in that mode. Instead we pass the uid and
		  |   | # gid that the process will run under.
		  |   | # The "notify" service type means that systemd-run waits until cockroach
		  |   | # notifies systemd that it is ready; NotifyAccess=all is needed because this
		  |   | # notification doesn't come from the main PID (which is bash).
		  |   | sudo systemd-run --unit cockroach \
		  |   |   --same-dir --uid $(id -u) --gid $(id -g) \
		  |   |   --service-type=notify -p NotifyAccess=all \
		  |   |   -p MemoryMax=95% \
		  |   |   -p LimitCORE=infinity \
		  |   |   -p LimitNOFILE=65536 \
		  |   | 	bash $0 run
		  |   | EOF
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210711 15:14:34.625080 1 (gostd) cluster_synced.go:1675  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError

	z_cluster.go:1230,context.go:89,z_cluster.go:1218,z_test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3170584-1625983725-62-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 5: 1591
		3: 1572
		6: dead (exit status 1)
		4: 1576
		1: 1548
		2: 1565
		7: skipped
		Error: UNCLASSIFIED_PROBLEM: 6: dead (exit status 1)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1154
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2071
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 6: dead (exit status 1)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh tpccbench/nodes=6/cpu=16/multi-az

/cc @cockroachdb/kv-triage


@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 11, 2021
@tbg

tbg commented Jul 12, 2021

It looks as though n6 refused to restart due to the presence of a PreventedStartupFile. We see

ERROR: startup forbidden by prior critical alert
DETAIL: From /mnt/data1/cockroach/auxiliary/_CRITICAL_ALERT.txt:
Failed running "start"

which is printed from

b, err := ioutil.ReadFile(path)
if err != nil {
	if !oserror.IsNotExist(err) {
		addError(errors.Wrapf(err, "%s", path))
	}
	continue
}
addError(errors.Newf("From %s:\n\n%s\n", path, b))

so the file was empty. It looks as though that file is only ever written on replica inconsistencies (search for .PreventedStartupFile) as determined by the consistency checker, or on "unexpected" corruption errors:

func (r *Replica) setCorruptRaftMuLocked(
	ctx context.Context, cErr *roachpb.ReplicaCorruptionError,
) *roachpb.Error {
	r.readOnlyCmdMu.Lock()
	defer r.readOnlyCmdMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	log.ErrorfDepth(ctx, 1, "stalling replica due to: %s", cErr.ErrorMsg)
	cErr.Processed = true
	r.mu.destroyStatus.Set(cErr, destroyReasonRemoved)
	auxDir := r.store.engine.GetAuxiliaryDir()
	_ = r.store.engine.MkdirAll(auxDir)
	path := base.PreventedStartupFile(auxDir)
	preventStartupMsg := fmt.Sprintf(`ATTENTION:
this node is terminating because replica %s detected an inconsistent state.
Please contact the CockroachDB support team. It is not necessarily safe
to replace this node; cluster data may still be at risk of corruption.
A file preventing this node from restarting was placed at:
%s
`, r, path)
	if err := fs.WriteFile(r.store.engine, path, []byte(preventStartupMsg)); err != nil {
		log.Warningf(ctx, "%v", err)
	}
	log.FatalfDepth(ctx, 1, "replica is corrupted: %s", cErr)
	return roachpb.NewError(cErr)
}

Had the consistency checker been responsible, $ cockroach debug merge-logs logs/ | grep inconsistent would have produced results (the checker logs before it actually goes on to write the startup prevention file), so I am guessing that it must have been the *ReplicaCorruptionError handling. That code path does not write an empty file either. Nor do we find evidence of the "replica is corrupted" log message in the logs.
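
As a rough illustration of why the DETAIL line points at an empty file, the sketch below mirrors the shape of the ReadFile snippet quoted earlier using only the Go standard library. The names describeAlert and emptyAlertMessage are hypothetical stand-ins, not CockroachDB functions, and the temporary directory only imitates the auxiliary directory. With a zero-length file, nothing follows the "From <path>:" header, which matches what n6 printed.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// describeAlert is a hypothetical stand-in for the startup check quoted
// above: a missing file means "no alert", any other read error is reported,
// and an existing file has its contents appended after a "From <path>:" header.
func describeAlert(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return "", nil // no alert file, so startup would be allowed
		}
		return "", fmt.Errorf("%s: %w", path, err)
	}
	return fmt.Sprintf("From %s:\n\n%s\n", path, b), nil
}

// emptyAlertMessage creates a zero-length alert file in a temporary
// directory and returns the message describeAlert builds for it.
func emptyAlertMessage() (string, error) {
	dir, err := os.MkdirTemp("", "auxiliary")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "_CRITICAL_ALERT.txt")
	if err := os.WriteFile(path, nil, 0o644); err != nil {
		return "", err
	}
	return describeAlert(path)
}

func main() {
	msg, err := emptyAlertMessage()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// %q makes the trailing newlines visible: the header is followed by
	// nothing but line breaks when the file is empty.
	fmt.Printf("%q\n", msg)
}
```

Note that an empty file still blocks startup in this sketch, because only a missing file is treated as "no alert".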

At the end of all this, my best hypothesis is:

  • overloaded node produces some error that leads to replica corruption
  • gets stuck on the disk write while we reset the VM, leading to empty file (or something like that)
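
The second bullet can be sketched in isolation. The example below is only an assumption about the mechanism, not a description of what fs.WriteFile or the storage engine actually does: it shows that a writer which creates a file and is stopped before any bytes land leaves a zero-length file behind, and contrasts that with a write-to-temp, sync, then rename pattern where the final path only ever appears with full contents. All function names and the alert text are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// alertText is placeholder content, not the real alert message.
const alertText = "example alert text\n"

// interruptedWrite simulates a writer that creates (or truncates) the file
// and is then stopped before writing anything, e.g. by a VM reset.
func interruptedWrite(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	return f.Close() // the "crash" happens before any Write call
}

// atomicWrite writes to a temporary file in the same directory, syncs it,
// and renames it into place, so the final path never exists half-written.
func atomicWrite(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".alert-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demoSizes runs both writers in a temporary directory and returns the
// resulting file sizes in bytes.
func demoSizes() (interrupted, atomic int64, err error) {
	dir, err := os.MkdirTemp("", "auxiliary")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	p1 := filepath.Join(dir, "interrupted.txt")
	p2 := filepath.Join(dir, "atomic.txt")
	if err := interruptedWrite(p1); err != nil {
		return 0, 0, err
	}
	if err := atomicWrite(p2, []byte(alertText)); err != nil {
		return 0, 0, err
	}
	i1, err := os.Stat(p1)
	if err != nil {
		return 0, 0, err
	}
	i2, err := os.Stat(p2)
	if err != nil {
		return 0, 0, err
	}
	return i1.Size(), i2.Size(), nil
}

func main() {
	a, b, err := demoSizes()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("interrupted: %d bytes, atomic: %d bytes\n", a, b)
}
```

A fully durable variant would also sync the parent directory after the rename; that step is omitted here to keep the sketch short.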

There are only three places where we create a ReplicaCorruptionError:

func checkIfTxnAborted(
	ctx context.Context, rec batcheval.EvalContext, reader storage.Reader, txn roachpb.Transaction,
) *roachpb.Error {
	var entry roachpb.AbortSpanEntry
	aborted, err := rec.AbortSpan().Get(ctx, reader, txn.ID, &entry)
	if err != nil {
		return roachpb.NewError(roachpb.NewReplicaCorruptionError(
			errors.Wrap(err, "could not read from AbortSpan")))
	}

if err != nil {
	return result.Result{}, roachpb.NewReplicaCorruptionError(err)
}

newMS, res, err := splitTrigger(
	ctx, rec, batch, *ms, ct.SplitTrigger, txn.WriteTimestamp,
)
if err != nil {
	return result.Result{}, roachpb.NewReplicaCorruptionError(err)
}
*ms = newMS
return res, nil

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 12, 2021
@tbg tbg changed the title roachtest: tpccbench/nodes=6/cpu=16/multi-az failed roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [empty prevented startup file] Jul 12, 2021
@tbg

tbg commented Jul 12, 2021

Going to leave this open, even though the error is not actionable, so that if it recurs within, say, two weeks, the new failure gets linked to the analysis above.

@tbg

tbg commented Aug 26, 2021

I have to assume this was also a failed consistency check, such as the one here: #69414 (comment)

Going to close this issue as a duplicate.

@tbg tbg closed this as completed Aug 26, 2021