executor: fill extra partition ID column in UnionScan executor #28666

Closed
wants to merge 1 commit into from


tiancaiamao
Contributor

What problem does this PR solve?

Issue Number: close #28073

Problem Summary:

What is changed and how it works?

Before this commit, the UnionScan executor did not fill the extra partition ID column for the chunk.
As a result, the extra PID column was 0 and the lock key was incorrect,
so some cases like #28073 went wrong.

A typical case is begin; insert into pt values (...); select * from pt for update:
the keys modified in the transaction are not locked correctly.

What's Changed:

  • Modify the UnionScan executor to support filling the extra PID column.
  • Add more tests for the left join case.

How it Works:

The extra PID column of the chunk data will be set correctly, so the SelectLock can use it to construct the lock key.
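
To make the failure mode concrete, here is a minimal Go sketch. It is not TiDB's actual code: the `row` type, the `lockKey` function, and the readable `t<ID>_r<handle>` key layout are all simplified stand-ins assumed for illustration (the real executor reads the column from `chunk.Chunk` rows, and the real key encoding is binary). It only shows why an unfilled PID column yields a key that points at physical table 0 instead of the partition.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one chunk row: the row handle plus the
// extra partition ID column appended by the reader.
type row struct {
	handle int64
	pid    int64 // extra PID column; stays 0 when the executor never fills it
}

// lockKey builds a simplified, human-readable lock key from the physical
// (partition) ID and the row handle. The layout is illustrative only.
func lockKey(physicalID, handle int64) string {
	return fmt.Sprintf("t%d_r%d", physicalID, handle)
}

func main() {
	unfilled := row{handle: 1}        // PID column left at its zero value
	filled := row{handle: 1, pid: 57} // PID column set to the partition's ID (57 is made up)

	fmt.Println(lockKey(unfilled.pid, unfilled.handle)) // key for a non-existent table 0
	fmt.Println(lockKey(filled.pid, filled.handle))     // key inside the intended partition
}
```

With the column filled, the two sessions in the typical case above would contend on the same partition key; with it unfilled, the lock lands on a key nobody else will ever look at.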

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the bug that the lock key is incorrect when using 'select for update' on partitioned tables inside a transaction that has uncommitted modifications

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 8, 2021
@@ -931,6 +931,15 @@ func (e *SelectLockExec) Next(ctx context.Context, req *chunk.Chunk) error {
	// The partition ID is returned as an extra column from the table reader.
	if offset, ok := e.tblID2PIDColumnIndex[id]; ok {
		physicalID = row.GetInt64(offset)

		if physicalID == 0 {
Contributor

There could be unexpected errors if physicalID is zero, and this condition is a bit confusing. Is there another way to detect the left join result case?

Contributor Author

Yes, I agree it's confusing here. physicalID == 0 may be caused by a left join, or it may be caused by bugs.

Distinguishing those cases is unrealistic, because left join has several implementations (hash join, merge join, nested loop join, index join, etc.), and left join is only one of the cases we have found that generates empty or null rows. There might be other cases that fill in empty rows, and it's hard to find them all.

So ... let's take a step back.
In the past, we had a bug in locking on partitions ... (that's bad)
Then, we fixed it ... #14921
Then, we found more bugs (that's bad)
Then, we tried to fix them ... #21148
And that solution caused more serious problems and introduced more critical bugs ... (wow! worse)

After this change, we come back from worse to bad, and that's big progress!
I mean, we fix some problems and make the solution (at least) no worse than before.

Contributor

@qw4990
Do you have any ideas about this?

Contributor

My worry is that although the earlier bugs made the query panic, they had no lasting impact on the data in storage. If we cannot verify which case is expected in a write statement, wrong data could be written into storage, just like the issue listed above, where an invalid key is locked and the lock record is persisted.

Contributor Author

Yes... I have the same worry that bugs could corrupt the data.
If there were a better way to fix this problem, I'd choose it, but I can't come up with a better idea.
So we have to fix the current problem and add tests to cover more scenarios.

Contributor

It's necessary to add more tests now, since there could be more unknown issues. BTW, do we or our QA team have the bandwidth to improve the coverage?

Contributor

Can we let all OuterJoins explicitly set this column to a specified value (e.g. -1) when there is no match?
Then we can define pid=0 as the uninitialized state and know it must be caused by some bug,
and we can return an error like "pid is not initialized" in this case.
We don't have to find all OuterJoins at once; we can find as many as we can this time, then wait for the uninitialized error to surface and fix the rest.
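
A rough Go sketch of that three-way check, to show what the lock executor side could look like. Everything here is hypothetical: the constant names, the `resolvePID` helper, and the error text are made up for illustration, and -1 is just the example sentinel from the suggestion, not a value TiDB defines.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	pidUninitialized int64 = 0  // column was never filled: treated as a bug
	pidNoMatch       int64 = -1 // hypothetical sentinel an outer join writes for unmatched rows
)

var errPIDUninitialized = errors.New("pid is not initialized")

// resolvePID classifies the value read from the extra PID column.
// It returns the physical ID to lock, whether the row needs a lock at all,
// and an error when the column was evidently never filled.
func resolvePID(pid int64) (physicalID int64, needLock bool, err error) {
	switch pid {
	case pidNoMatch:
		// Null-extended row from an outer join: nothing to lock.
		return 0, false, nil
	case pidUninitialized:
		// Nobody wrote the column: surface it instead of locking a bogus key.
		return 0, false, errPIDUninitialized
	default:
		return pid, true, nil
	}
}

func main() {
	for _, pid := range []int64{57, pidNoMatch, pidUninitialized} {
		id, needLock, err := resolvePID(pid)
		fmt.Println(id, needLock, err)
	}
}
```

The point of the split is that a missed OuterJoin implementation shows up as an error at query time rather than as a silently persisted lock on an invalid key.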

}()

// Give chance for the goroutines to run first.
time.Sleep(80 * time.Millisecond)
Contributor

This may be unstable in the CI environment.

Contributor Author

It should be stable.
The test wants to check that 2 and 3 are blocked by 1.
Here we give 2 and 3 a chance to run first by letting 1 sleep for a while.
The purpose is to verify that 2 and 3 are blocked and can't run; then we check that the final order is
1 2 3 or 1 3 2, which achieves the test goal: 2 and 3 are blocked by 1.

2 and 3 being blocked by 1 means the partition pessimistic lock works as expected and the partition key is constructed correctly.
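
The shape of that argument can be sketched in standalone Go. This is not the test from the PR: a `sync.Mutex` stands in for the pessimistic lock, the goroutines stand in for sessions 2 and 3, and `runOrder` is a made-up name. If the lock blocks correctly, 1 is always recorded first no matter how the scheduler behaves; the sleep only matters for catching a broken lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runOrder simulates session 1 holding a lock while sessions 2 and 3 wait
// for it, and returns the order in which the sessions finished.
func runOrder() []int {
	var lock sync.Mutex // stands in for the pessimistic lock on the row
	var mu sync.Mutex   // protects order
	var order []int
	record := func(id int) {
		mu.Lock()
		order = append(order, id)
		mu.Unlock()
	}

	lock.Lock() // session 1 takes the lock before 2 and 3 start

	var wg sync.WaitGroup
	for _, id := range []int{2, 3} {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			lock.Lock() // blocks until session 1 releases the lock
			record(id)
			lock.Unlock()
		}(id)
	}

	// Give 2 and 3 a chance to run first. If the lock did not block them,
	// they would record themselves before session 1 does.
	time.Sleep(80 * time.Millisecond)
	record(1)
	lock.Unlock()

	wg.Wait()
	return order
}

func main() {
	fmt.Println(runOrder()) // expected to start with 1; 2 and 3 may come in either order
}
```

Read this way, a slow CI machine can at worst let a broken lock slip through unnoticed; it should not make a correct lock fail the check.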

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 11, 2021
@ti-chi-bot
Member

@tiancaiamao: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tiancaiamao
Contributor Author

Fixed in another way, see #30732

Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Invalid keys may get locked
4 participants