This repository has been archived by the owner on Dec 8, 2021. It is now read-only.

lightning: split large csv file if possible #272

Merged
merged 21 commits into from
Mar 12, 2020

Conversation

XuHuaiyu
Contributor

@XuHuaiyu XuHuaiyu commented Feb 26, 2020

What problem does this PR solve?

Split a large csv file if possible.

What is changed and how it works?

Split a large CSV file into multiple regions of about 256MB each.
The regions can then be processed concurrently, which should accelerate the import phase.

Before this commit, restoring a 33G tpcc CSV file took 40m15s.
After this commit, restoring the same file takes only 8m15s.

Check List

Tests

  • Unit test

  • Manual test (add detailed scripts or steps below)

case 1:

  1. Import data using lightning with this feature:
    select count(*) from t; --> 60,000,000

  2. Import data using lightning without this feature:
    select count(*) from tbl_name; --> 60,000,000

case 2:

mysql> select c_since from bmsql_customer where c_w_id = 18 and c_d_id=1 and c_id=1;
+---------------------+
| c_since             |
+---------------------+
| 2020-02-26 20:06:00 | -- select the first line of csv file, result is correct
+---------------------+
1 row in set (2.26 sec)

mysql> select c_since from bmsql_customer where c_w_id = 2000 and c_d_id=10 and c_id=3000;
+---------------------+
| c_since             |
+---------------------+
| 2020-02-26 20:42:29 | -- select the last line of csv file, result is correct
+---------------------+

Side effects

N/A

Related changes

N/A

lightning/mydump/csv_parser.go (outdated, resolved)
lightning/mydump/region.go (outdated, resolved)
lightning/mydump/region.go (outdated, resolved)
Collaborator

@kennytm kennytm left a comment

Rest LGTM

PTAL @3pointer

lightning/config/config.go (outdated, resolved)
lightning/mydump/csv_parser.go (outdated, resolved)
@kennytm kennytm added 3.0-release-note Should include in release note for next 3.0 release. Remove after release. 3.1-release-note Should include in release note for next 3.1 release. Remove after release. 4.0-release-note Should include in release note for next 4.0 release. Remove after release. Should Update Ansible The config in TiDB-Ansible should be updated Should Update Docs Should update docs after this PR is merged. Remove this label once the docs are updated status/PTAL This PR is ready for review. Add this label back after committing new changes type/enhancement Performance improvement or refactoring priority/normal labels Feb 27, 2020
@kennytm kennytm added status/LGT1 One reviewer already commented LGTM (LGTM1) and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Feb 28, 2020
@kennytm kennytm requested review from 3pointer and removed request for kennytm March 1, 2020 10:14
@siddontang
Member

Have we used any checksum mechanism to ensure the correctness of all data?

ioWorker *worker.Pool,
) (prevRowIdMax int64, regions []*TableRegion, dataFileSizes []float64, err error) {
maxRegionSize := cfg.Mydumper.MaxRegionSize
dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize)
Contributor

Suggested change
dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize)
dataFileSizes = make([]float64, 0, dataFileSize/maxRegionSize+1)
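
A small sketch of why the `+1` in the suggestion matters, assuming both sizes are `int64` byte counts (the types are not visible in the excerpt above). Integer division rounds down, so a file that is not an exact multiple of the region size needs one more slot than the quotient. The helper `regionCount` and the sizes used here are hypothetical.

```go
package main

import "fmt"

// regionCount returns how many regions a file of dataFileSize bytes
// yields when each region holds at most maxRegionSize bytes.
// Hypothetical helper for illustration; it is not part of the PR.
func regionCount(dataFileSize, maxRegionSize int64) int64 {
	return (dataFileSize + maxRegionSize - 1) / maxRegionSize
}

func main() {
	const maxRegionSize = 256 << 20  // assumed region size: 256 MiB
	dataFileSize := int64(600 << 20) // hypothetical 600 MiB file

	floorCap := dataFileSize / maxRegionSize    // rounds down
	plusOneCap := dataFileSize/maxRegionSize + 1 // the suggested capacity
	needed := regionCount(dataFileSize, maxRegionSize)

	fmt.Println("floor capacity:", floorCap)
	fmt.Println("suggested capacity:", plusOneCap)
	fmt.Println("regions needed:", needed)

	// With the suggested capacity, appending one entry per region
	// never has to grow the backing array.
	sizes := make([]float64, 0, plusOneCap)
	for i := int64(0); i < needed; i++ {
		sizes = append(sizes, 0)
	}
	fmt.Println("len:", len(sizes), "cap:", cap(sizes))
}
```

The capacity is only a hint, so the original code is still correct without the `+1`; the suggestion just avoids a reallocation in the common case. When regions are stretched to row boundaries the real count can differ slightly from this estimate.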

@CLAassistant

CLA assistant check
All committers have signed the CLA.

@XuHuaiyu XuHuaiyu requested a review from 3pointer March 12, 2020 08:30
@XuHuaiyu
Contributor Author

PTAL @3pointer

Contributor

@3pointer 3pointer left a comment

LGTM

@kennytm kennytm merged commit 3c8f4d7 into pingcap:master Mar 12, 2020
@july2993
Contributor

For the record:

[2020/03/13 00:15:19.981 +08:00] [INFO] [restore.go:492] [progress] [files="1280/40 (3200.0%)"] [tables="0/10 (0.0%)"] [speed(MiB/s)=117.93346221716635] [state=post-processing] []

Note that the progress files="1280/40 (3200.0%)" is wrong. I guess it's caused by the split files, but I haven't checked it.
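
To illustrate the guess above (it is unverified, and this is not the actual code in restore.go): if the numerator counts finished regions while the denominator still counts source files, the ratio exceeds 100% as soon as files are split. The `progress` function and the figure of 32 regions per file are assumptions chosen only to reproduce the logged numbers.

```go
package main

import "fmt"

// progress formats a "finished/total (percent%)" string.
// Hypothetical helper; the real log formatting may differ.
func progress(finished, total int) string {
	return fmt.Sprintf("%d/%d (%.1f%%)", finished, total, float64(finished)/float64(total)*100)
}

func main() {
	files := 40          // denominator: number of source files, as in the log
	regionsPerFile := 32 // assumption: each file was split into 32 regions

	// Counting regions against files overshoots 100%.
	fmt.Println(progress(files*regionsPerFile, files))

	// Counting regions against regions stays within 100%.
	fmt.Println(progress(files*regionsPerFile, files*regionsPerFile))
}
```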

@XuHuaiyu XuHuaiyu deleted the split_large_csv branch March 13, 2020 02:37
@kennytm kennytm removed the Should Update Docs Should update docs after this PR is merged. Remove this label once the docs are updated label Jun 17, 2020
Labels
3.0-release-note Should include in release note for next 3.0 release. Remove after release. 3.1-release-note Should include in release note for next 3.1 release. Remove after release. 4.0-release-note Should include in release note for next 4.0 release. Remove after release. Should Update Ansible The config in TiDB-Ansible should be updated status/LGT1 One reviewer already commented LGTM (LGTM1) type/enhancement Performance improvement or refactoring
6 participants