feat: improve the COPY filtering copied files performance #8586

Merged
BohuTANG merged 3 commits into databendlabs:main from dev-stat-file-parallel on Nov 1, 2022

Conversation

BohuTANG
Member

@BohuTANG BohuTANG commented Nov 1, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Reuse the StageFileInfo obtained while listing the stage when filtering out already-copied files, rather than issuing an extra stat per file.
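
As a rough illustration of the idea in the summary, here is a minimal sketch that filters using only metadata already returned by the listing, with no per-file stat. The `StageFileInfo` fields, the `CopiedFile` record, and the `filter_copied_files` helper are all hypothetical stand-ins for this example; they are not taken from the Databend codebase, and the size/etag comparison is an assumed change-detection rule.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-file metadata a stage listing returns.
#[derive(Debug, Clone, PartialEq)]
struct StageFileInfo {
    path: String,
    size: u64,
    etag: Option<String>,
}

/// Hypothetical record of a file that an earlier COPY already loaded.
#[derive(Debug, Clone)]
struct CopiedFile {
    size: u64,
    etag: Option<String>,
}

/// Keep only the files that still need copying.
///
/// The decision uses the metadata captured during the listing, so no
/// additional stat request is made for any file.
fn filter_copied_files(
    listed: Vec<StageFileInfo>,
    copied: &HashMap<String, CopiedFile>,
) -> Vec<StageFileInfo> {
    listed
        .into_iter()
        .filter(|file| match copied.get(&file.path) {
            // Seen before: copy again only if size or etag changed (assumed rule).
            Some(prev) => prev.size != file.size || prev.etag != file.etag,
            // Never copied: keep it.
            None => true,
        })
        .collect()
}
```

Under these assumptions, the cost of the filter step depends on one list call plus in-memory lookups, instead of one remote stat round trip per listed file.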

List 300 files and run a forced copy (force=true):

--rows: 1000000, size: 100*43MB
mysql> COPY into bendlog.log from @log pattern='.*[.]csv$' FILE_FORMAT = (TYPE = 'CSV'  FIELD_DELIMITER = '\t' RECORD_DELIMITER = '\n' SKIP_HEADER=1 COMPRESSION=AUTO) force=true;
Query OK, 0 rows affected (17.80 sec)

--rows: 1000000, size: 100*23MB
mysql> COPY into bendlog.log from @log pattern='.*[.]csv.gz$' FILE_FORMAT = (TYPE = 'CSV'  FIELD_DELIMITER = '\t' RECORD_DELIMITER = '\n' SKIP_HEADER=1 COMPRESSION=AUTO) force=true;
Query OK, 0 rows affected (15.68 sec)

Fixes #8574

@BohuTANG BohuTANG marked this pull request as draft November 1, 2022 12:39
@BohuTANG BohuTANG requested a review from Xuanwo November 1, 2022 12:39
@mergify mergify bot added the pr-feature label (this PR introduces a new feature to the codebase) Nov 1, 2022
@BohuTANG BohuTANG force-pushed the dev-stat-file-parallel branch from ae59c8e to 7e6e2f2 Compare November 1, 2022 12:48
@Xuanwo
Member

Xuanwo commented Nov 1, 2022

We can eliminate the extra stat here by reusing the StageFileInfo we already got during list.

@BohuTANG
Member Author

BohuTANG commented Nov 1, 2022

> We can eliminate the extra stat here by reusing the StageFileInfo we already got during list.

That's cool, and if you have an idea, I plan to continue this parallel after your PR.
In my tests, this parallel version hangs; I will address that tomorrow :/

@Xuanwo
Member

Xuanwo commented Nov 1, 2022

> That's cool, and if you have an idea, I plan to continue this parallel after your PR.

I will implement this tomorrow~

@BohuTANG
Member Author

BohuTANG commented Nov 1, 2022

Tested again. This PR does not seem to help (:).
This PR:

mysql> COPY into bendlog.log from @log pattern='.*[.]csv.gz$' FILE_FORMAT = (TYPE = 'CSV'  FIELD_DELIMITER = '\t' RECORD_DELIMITER = '\n' SKIP_HEADER=1 COMPRESSION=auto) force=true;
Query OK, 0 rows affected (17.36 sec)

main:

mysql> COPY into bendlog.log from @log pattern='.*[.]csv.gz$' FILE_FORMAT = (TYPE = 'CSV'  FIELD_DELIMITER = '\t' RECORD_DELIMITER = '\n' SKIP_HEADER=1 COMPRESSION=auto) force=true;
Query OK, 0 rows affected (16.57 sec)

@BohuTANG BohuTANG closed this Nov 1, 2022
@BohuTANG
Member Author

BohuTANG commented Nov 1, 2022

> > That's cool, and if you have an idea, I plan to continue this parallel after your PR.
>
> I will implement this tomorrow~

I've got the point, let me have a try.

@BohuTANG BohuTANG reopened this Nov 1, 2022
@BohuTANG BohuTANG force-pushed the dev-stat-file-parallel branch from 6a009b1 to a6371c8 Compare November 1, 2022 14:20
@BohuTANG BohuTANG changed the title from "feat: improve the stat performance with parallel" to "feat: improve the COPY filter performance" Nov 1, 2022
@BohuTANG BohuTANG force-pushed the dev-stat-file-parallel branch from a6371c8 to b8957e0 Compare November 1, 2022 14:30
@BohuTANG BohuTANG force-pushed the dev-stat-file-parallel branch from b8957e0 to f71d1be Compare November 1, 2022 14:31
@BohuTANG BohuTANG changed the title from "feat: improve the COPY filter performance" to "feat: improve the COPY filtering copied files performance" Nov 1, 2022
@BohuTANG BohuTANG marked this pull request as ready for review November 1, 2022 14:52
@BohuTANG
Member Author

BohuTANG commented Nov 1, 2022

@Xuanwo Done

Member

@Xuanwo Xuanwo left a comment

Great!

@yufan022
Contributor

yufan022 commented Nov 1, 2022

A marked improvement!

before this PR:

COPY INTO import   FROM 's3://xx/10000-100/'   credentials=(aws_key_id='xx' aws_secret_key='xx')   pattern ='.*[.]tsv'   file_format = (type = 'tsv')   force=true;
Query OK, 0 rows affected (1 min 30.55 sec)

after this PR:

COPY INTO import   FROM 's3://xx/10000-100/'   credentials=(aws_key_id='xx' aws_secret_key='xx')   pattern ='.*[.]tsv'   file_format = (type = 'tsv')   force=true;
Query OK, 0 rows affected (12.19 sec)

@BohuTANG BohuTANG merged commit 44b2779 into databendlabs:main Nov 1, 2022
Development

Successfully merging this pull request may close these issues.

bug: COPY INTO CPU load takes a long time to rise
3 participants