This repository has been archived by the owner on Dec 8, 2021. It is now read-only.
lightning: split large csv file if possible #272
Merged
kennytm
reviewed
Feb 26, 2020
XuHuaiyu force-pushed the split_large_csv branch from 4dd08ab to 6a74575 (February 27, 2020 02:34)
kennytm
reviewed
Feb 27, 2020
Rest LGTM
PTAL @3pointer
kennytm
added
3.0-release-note
Should include in release note for next 3.0 release. Remove after release.
3.1-release-note
Should include in release note for next 3.1 release. Remove after release.
4.0-release-note
Should include in release note for next 4.0 release. Remove after release.
Should Update Ansible
The config in TiDB-Ansible should be updated
Should Update Docs
Should update docs after this PR is merged. Remove this label once the docs are updated
status/PTAL
This PR is ready for review. Add this label back after committing new changes
type/enhancement
Performance improvement or refactoring
priority/normal
labels
Feb 27, 2020
kennytm
added
status/LGT1
One reviewer already commented LGTM (LGTM1)
and removed
status/PTAL
This PR is ready for review. Add this label back after committing new changes
labels
Feb 28, 2020
Have we used a checksum mechanism to ensure the correctness of all data?
3pointer
reviewed
Mar 3, 2020
lightning/mydump/region.go
Outdated
	ioWorker *worker.Pool,
) (prevRowIdMax int64, regions []*TableRegion, dataFileSizes []float64, err error) {
	maxRegionSize := cfg.Mydumper.MaxRegionSize
	dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize)
Suggested change
- dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize)
+ dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize+1)
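For illustration, a minimal Go sketch of why the `+1` in the suggestion matters: integer division rounds down, so a file whose size is not an exact multiple of `maxRegionSize` needs one more slot than the quotient. The helper name `regionCapacity` and the `int64` parameter types are assumptions made for this sketch, not code from the PR.

```go
package main

import "fmt"

// regionCapacity is a hypothetical helper mirroring the suggested expression
// dataFileSize/maxRegionSize+1. Integer division truncates, so the +1 leaves
// room for the trailing partial region.
func regionCapacity(dataFileSize, maxRegionSize int64) int64 {
	return dataFileSize/maxRegionSize + 1
}

func main() {
	const mib = int64(1) << 20
	// A 600 MiB file cut into 256 MiB regions needs 3 slots,
	// but plain integer division yields only 2.
	fmt.Println(600*mib/(256*mib), regionCapacity(600*mib, 256*mib))
}
```

When the file size is an exact multiple of the region size, the extra slot is simply unused capacity, which is harmless for a preallocated slice.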
3pointer
reviewed
Mar 3, 2020
…to split_large_csv
…ghtning into split_large_csv
PTAL @3pointer
3pointer
approved these changes
Mar 12, 2020
LGTM
for record:
note the progress
kennytm
removed
the
Should Update Docs
Should update docs after this PR is merged. Remove this label once the docs are updated
label
Jun 17, 2020
What problem does this PR solve?
Split a large csv file if possible.
What is changed and how it works?
Split a large CSV file into multiple regions of about 256MB each.
The regions can then be processed concurrently, which should accelerate the import phase.
Before this commit, restoring a 33G TPC-C CSV file took 40m15s.
After this commit, restoring the same file takes only 8m15s.
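The idea can be sketched roughly as follows. This is not the code from this PR: the names `region`, `splitRegions`, and `countRows` are hypothetical, the data is held in memory rather than read from a file, and the sketch assumes that no CSV field contains an embedded newline, so every `\n` ends a row. Each tentative boundary is pushed forward to the end of the current line so that no row is split across two regions, and the regions are then handled by separate goroutines.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// region is a half-open byte range [start, end) of the CSV data.
type region struct{ start, end int }

// splitRegions cuts data into regions of roughly maxSize bytes, extending
// each region to the end of the current line so that no row is split.
// Assumption: no field contains an embedded newline.
func splitRegions(data []byte, maxSize int) []region {
	var regions []region
	start := 0
	for start < len(data) {
		end := start + maxSize
		if end >= len(data) {
			end = len(data)
		} else if i := bytes.IndexByte(data[end:], '\n'); i >= 0 {
			end += i + 1 // include the newline in this region
		} else {
			end = len(data) // last row has no trailing newline
		}
		regions = append(regions, region{start, end})
		start = end
	}
	return regions
}

// countRows counts the rows of every region concurrently and sums the results.
func countRows(data []byte, regions []region) int {
	counts := make([]int, len(regions))
	var wg sync.WaitGroup
	for i, r := range regions {
		wg.Add(1)
		go func(i int, r region) {
			defer wg.Done()
			counts[i] = bytes.Count(data[r.start:r.end], []byte("\n"))
		}(i, r)
	}
	wg.Wait()
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	data := []byte("1,a\n2,b\n3,c\n4,d\n5,e\n")
	regions := splitRegions(data, 6)
	fmt.Println(len(regions), countRows(data, regions))
}
```

Because the regions are contiguous and each ends on a row boundary, the per-region row counts sum to the row count of the whole file, which is the same property the manual test below checks with `select count(*)`.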
Check List
Tests
Unit test
Manual test (add detailed scripts or steps below)
case 1:
import data using lightning with this feature
select count(*) from t; --> 60,000,000
import data using lightning without this feature
select count(*) from tbl_name; --> 60,000,000
case 2:
Side effects
N/A
Related changes
N/A