Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tikv: fix infinitely rebirthed secondary keys commit retry goroutine during tikv error #16061

Closed
wants to merge 5 commits into from
Closed

tikv: fix infinitely rebirthed secondary keys commit retry goroutine during tikv error #16061

wants to merge 5 commits into from

Conversation

lysu
Copy link
Contributor

@lysu lysu commented Apr 3, 2020

What problem does this PR solve?

TestCommitRetryLimit can not finish(it will wait secondary goroutines return after failpoint enabled, for normal user should not block but will see many goroutines), due to recursive retry call create new backoff every time and fork new goroutine every time, and backoff never meet max....

Issue Number: refer #15995

there are also some questions in 2pc rate limit impl need relook.

Problem Summary:

What is changed and how it works?

What's Changed:

secondary commit backoff should inherit totalSleep from parent.

but we can use backoff.fork method in here, because parent go routine exit should not can child retry goroutine here.

How it Works:

deep copy a backoff, only one line change https://github.com/pingcap/tidb/pull/16061/files#diff-499c236856cd9ce3300d3f5ccde41a23L506

other code make it reproducible and testable

Related changes

  • Need to cherry-pick to the release branch(2.1/3.0/3.1/4.0 have this problem)

Check List

Tests

  • Unit test

Side effects

  • N/A

Release note

fix infinitely rebirthed secondary keys commit retry goroutine during tikv error


This change is Reviewable

@lysu lysu added type/bugfix This PR fixes a bug. sig/transaction SIG:Transaction labels Apr 3, 2020
@sre-bot sre-bot added the contribution This PR is from a community contributor. label Apr 3, 2020
@lysu
Copy link
Contributor Author

lysu commented Apr 3, 2020

/run-all-tests

@lysu lysu removed the contribution This PR is from a community contributor. label Apr 3, 2020
@lysu lysu requested review from jackysp and tiancaiamao April 3, 2020 10:39
@codecov
Copy link

codecov bot commented Apr 3, 2020

Codecov Report

Merging #16061 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #16061   +/-   ##
===========================================
  Coverage   80.6390%   80.6390%           
===========================================
  Files           506        506           
  Lines        138185     138185           
===========================================
  Hits         111431     111431           
  Misses        18178      18178           
  Partials       8576       8576           

@lysu lysu linked an issue Apr 3, 2020 that may be closed by this pull request
@jackysp jackysp added the require-LGT3 Indicates that the PR requires three LGTM. label Apr 3, 2020
Copy link
Contributor

@tiancaiamao tiancaiamao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the goroutine is created by the region error here?

https://github.com/pingcap/tidb/pull/16061/files#diff-499c236856cd9ce3300d3f5ccde41a23R1007

handleSingleBatch -->
regionErr != nil -->
c.commitMutations -->
doActionOnMutations -->
go doActionOnBatches -->
handleSingleBatch -->

The goroutine increase one, but the work is all the same?

sender := NewRegionRequestSender(c.store.regionCache, c.store.client)
resp, err := sender.SendReq(bo, req, batch.region, readTimeoutShort)

// If we fail to receive response for the request that commits primary key, it will be undetermined whether this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, extracting this function is not a good idea, because this logic belongs to commit, it's not a general one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for extracting this function is mockable, make a piece of code(send request) can be replaced by a test logic a good reason to extract a method in the practice.

if not extract this function we need a new "if else" block to the parent method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember I had read some test-driven books about the question: "when to extract a method? or when to use interface?", they said "extract method or interface when you need to replace it" and when logic need replaced in a test, it has high probability be replaced in future requirement :D

store/tikv/2pc_test.go Outdated Show resolved Hide resolved
@lysu
Copy link
Contributor Author

lysu commented Apr 7, 2020

I guess the goroutine is created by the region error here?

https://github.com/pingcap/tidb/pull/16061/files#diff-499c236856cd9ce3300d3f5ccde41a23R1007

handleSingleBatch -->
regionErr != nil -->
c.commitMutations -->
doActionOnMutations -->
go doActionOnBatches -->
handleSingleBatch -->

The goroutine increase one, but the work is all the same?

yes, the question here is backoff maxSleep never can be reach, so handleSingleBatch-->...-> handleSingleBatch will never stop, maybe better we should avoid recreate goroutine in retry call, but let maxSleep works also solve question

@lysu lysu requested a review from cfzjywxk April 20, 2020 02:33
@sre-bot
Copy link
Contributor

sre-bot commented Apr 22, 2020

@tiancaiamao, @cfzjywxk, @jackysp, PTAL.

Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sre-bot
Copy link
Contributor

sre-bot commented Apr 24, 2020

@tiancaiamao, @jackysp, @cfzjywxk, PTAL.

1 similar comment
@sre-bot
Copy link
Contributor

sre-bot commented Apr 26, 2020

@tiancaiamao, @jackysp, @cfzjywxk, PTAL.

@tiancaiamao
Copy link
Contributor

LGTM

@tiancaiamao tiancaiamao added the status/LGT2 Indicates that a PR has LGTM 2. label Apr 26, 2020
@lysu
Copy link
Contributor Author

lysu commented Apr 27, 2020

we have a better solution at #16849, so close this

@lysu lysu closed this Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
require-LGT3 Indicates that the PR requires three LGTM. sig/transaction SIG:Transaction status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The goroutine count surge when region miss or TiKV server is busy
4 participants