feat: Add decompress support for COPY INTO and streaming loading #5655

Merged: 16 commits merged into databendlabs:main on May 30, 2022

Conversation

Xuanwo
Member

@Xuanwo Xuanwo commented May 29, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR adds decompression support when reading files for COPY INTO and streaming loading! 🚀

This means the following query and request will now work:

  • copy from a location/internal stage/external stage:
copy into ontime200 from '@s1' FILES = ('ontime_200.csv.gz') FILE_FORMAT = (type = 'CSV' field_delimiter = ',' compression = 'gzip'  record_delimiter = '\n' skip_header = 1);
  • load via HTTP request:
curl -H "insert_sql:insert into ontime_streaming_load format Csv" -H "skip_header:1" -H "compression:zstd" -F  "upload=@/tmp/ontime_200.csv.zst" -u root: -XPUT "http://localhost:8000/v1/streaming_load"
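To try the COPY INTO example above, a gzip-compressed CSV has to exist in the stage first. The following is a minimal Python sketch, not part of this PR: `gzip_file` and `copy_into_sql` are hypothetical helper names, and the statement template simply mirrors the query shown above. It uses only the standard library.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip-compressed copy of `src` next to it and return its path."""
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream the file in blocks so large CSVs are not loaded into memory.
        shutil.copyfileobj(fin, fout)
    return dst


def copy_into_sql(table: str, stage: str, file_name: str, compression: str) -> str:
    """Build a COPY INTO statement shaped like the one in the PR description.

    Hypothetical helper: values are interpolated without escaping, so it is
    only suitable for trusted, hand-written inputs.
    """
    return (
        f"copy into {table} from '@{stage}' FILES = ('{file_name}') "
        f"FILE_FORMAT = (type = 'CSV' field_delimiter = ',' "
        f"compression = '{compression}' record_delimiter = '\\n' skip_header = 1);"
    )
```

For example, `copy_into_sql("ontime200", "s1", gzip_file(Path("ontime_200.csv")).name, "gzip")` would produce a statement equivalent to the one above, assuming the compressed file is then uploaded to stage `s1`.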

This PR adds the basic support along with stateful tests.

We support the following compression algorithms: gzip, bz2, zstd, brotli, deflate
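As a rough illustration of what "decompress on read" means, here is a small Python sketch that selects a decompressor by name and decodes a stream chunk by chunk. It is not Databend's implementation (which is written in Rust); `decompress_stream` and `_new_decompressor` are hypothetical names. Only the algorithms available in the Python standard library (gzip, bz2, deflate) are covered, so zstd and brotli are left out, and treating "deflate" as zlib-wrapped data is an assumption.

```python
import bz2
import io
import zlib
from typing import BinaryIO, Iterator


def _new_decompressor(compression: str):
    """Return an incremental decompressor for the given algorithm name."""
    if compression == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompressobj(16 + zlib.MAX_WBITS)
    if compression == "deflate":
        # Assumption: "deflate" means zlib-wrapped deflate data.
        return zlib.decompressobj()
    if compression == "bz2":
        return bz2.BZ2Decompressor()
    raise ValueError(f"unsupported compression in this sketch: {compression}")


def decompress_stream(
    src: BinaryIO, compression: str, chunk_size: int = 64 * 1024
) -> Iterator[bytes]:
    """Yield decompressed chunks while reading `src` incrementally."""
    decoder = _new_decompressor(compression)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out = decoder.decompress(chunk)
        if out:
            yield out
    # zlib objects may hold a buffered tail; BZ2Decompressor has no flush().
    flush = getattr(decoder, "flush", None)
    if flush is not None:
        tail = flush()
        if tail:
            yield tail
```

A caller would feed the yielded chunks to the CSV parser, e.g. `b"".join(decompress_stream(io.BytesIO(data), "gzip"))` for a small in-memory payload.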

There is still a lot of work to do:

  • Support for more compression algorithms (auto, lzo, snappy, ...)
  • Performance (we will benchmark decompressed loading in the future)
  • Code refactoring (the stage async source and the multipart load with the format trait currently use different logic)

But I think this PR is a good start toward getting this feature working!

Changelog

  • New Feature

Related Issues

Fixes #5380

Signed-off-by: Xuanwo <github@xuanwo.io>
@mergify
Contributor

mergify bot commented May 29, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label May 29, 2022
@sundy-li
Member

Impressive!

@BohuTANG BohuTANG requested a review from sundy-li May 29, 2022 05:38
@BohuTANG
Member

💯

@Xuanwo

This comment was marked as resolved.

@BohuTANG

This comment was marked as resolved.

@drmingdrmer

This comment was marked as resolved.

@sundy-li
Member

/LGTM

@BohuTANG BohuTANG merged commit b90b74e into databendlabs:main May 30, 2022
@Xuanwo Xuanwo deleted the decompress branch May 30, 2022 01:00
@BohuTANG
Member

Hi @Xuanwo
It would be nice to document this under the formattypeoptions section:
https://databend.rs/doc/reference/sql/dml/dml-copy#formattypeoptions

@wubx
Member

wubx commented May 30, 2022

nice

Labels
need-review pr-feature this PR introduces a new feature to the codebase
Projects
Status: 📦 Done
Development

Successfully merging this pull request may close these issues.

Support read compressed files in COPY statemente
7 participants