Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: TestNewTruncateDecision failed under stress [skipped] #38584

Closed
cockroach-teamcity opened this issue Jul 1, 2019 · 5 comments · Fixed by #98792
Closed

storage: TestNewTruncateDecision failed under stress [skipped] #38584

cockroach-teamcity opened this issue Jul 1, 2019 · 5 comments · Fixed by #98792
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss skipped-test
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 1, 2019

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

TAGS=
GOFLAGS=-race -parallel=2

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestNewTruncateDecision PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368520&tab=buildLog

=== RUN   TestNewTruncateDecision
I190701 12:12:45.945806 12152 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
I190701 12:12:45.946055 12152 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] intent_resolver_ir_batcher
--- FAIL: TestNewTruncateDecision (0.24s)
    raft_log_queue_test.go:570: truncation should not have occurred, but truncatableIndexes changed from 1 -> 2
    raft_log_queue_test.go:573: truncation should not have occurred, but oldestIndex changed from 116 -> 117

Jira issue: CRDB-5625

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 1, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jul 1, 2019
@tbg
Copy link
Member

tbg commented Jul 3, 2019

Looks like an empty entry is proposed sometimes here:

// Propose an empty command which will wake the leader.
data := encodeRaftCommand(raftVersionStandard, makeIDKey(), nil)
_ = r.mu.internalRaftGroup.Propose(data)
}

I initially thought this test had to have gotten flaky as a result of #38484 but I just reverted that locally and still get the repro, so now I'm thinking it's fallout from #38343

because this diff fixes it

diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 1266771077..22fa9b375a 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -428,13 +428,14 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
        defer r.updateProposalQuotaRaftMuLocked(ctx, lastLeaderID)

        err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
-               if err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup); err != nil {
+               nEnts, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(raftGroup)
+               if err != nil {
                        return false, err
                }
                if hasReady = raftGroup.HasReady(); hasReady {
                        rd = raftGroup.Ready()
                }
-               return hasReady /* unquiesceAndWakeLeader */, nil
+               return hasReady && nEnts == 0 /* unquiesceAndWakeLeader */, nil
        })
        r.mu.Unlock()
        if err != nil {

What's happening is that the request that unquiesces the range sometimes flushes proposals, but then we were adding an extra empty proposal due to returning true from withRaftGroupLocked. The diff above prevents that extra entry (i.e. only sends it when there isn't already an entry).

I'll send a PR.

82444 runs so far, 0 failures, over 39m40s

tbg added a commit to tbg/cockroach that referenced this issue Jul 3, 2019
If proposals were stepped into Raft, we don't have to also step another
empty entry in. This was harmless, but it was confusing
TestNewTruncateDecision (which admittedly is not a very robust test,
though we have to thank it for catching this).

Fixes cockroachdb#38584.

Release note: None
@nvanbenschoten
Copy link
Member

Nice job tracking this down!

tbg added a commit to tbg/cockroach that referenced this issue Jul 3, 2019
We've identified the root cause, but some cleanup presents itself, and
in the meantime the test is too flaky to leave unskipped.

See cockroachdb#38652.

Touches cockroachdb#38584.

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38655: storage: skip TestNewTruncateDecision r=nvanbenschoten a=tbg

We've identified the root cause, but some cleanup presents itself, and
in the meantime the test is too flaky to leave unskipped.

See #38652.

Touches #38584.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@tbg tbg changed the title storage: TestNewTruncateDecision failed under stress storage: TestNewTruncateDecision failed under stress [skipped] Mar 11, 2020
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg tbg added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Feb 1, 2022
@tbg tbg removed their assignment May 31, 2022
@shralex shralex added T-kv-replication and removed T-kv KV Team labels Mar 3, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 3, 2023

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

This test no longer exists.

@erikgrinaker
Copy link
Contributor

Oops, was looking at the wrong repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss skipped-test
Projects
None yet
6 participants