Support writing hive style partitioned files in COPY command #8493

Closed
alamb opened this issue Dec 11, 2023 · 5 comments · Fixed by #9240
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

alamb commented Dec 11, 2023

Is your feature request related to a problem or challenge?

A user asked on ASF Slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1702248979379239

Does the COPY command support creating parquet files that are partitioned using hive style partitioning?

The use case is creating Hive-style partitioned datasets (e.g. as described here)

DataFusion does not support this today, but you can use an external table, like this: https://github.com/apache/arrow-datafusion/blob/93b21bdcd3d465ed78b610b54edf1418a47fc497/datafusion/sqllogictest/test_files/insert.slt#L45-L57

Describe the solution you'd like

@devinjdangelo notes that

The COPY statement does not currently have a built-in PARTITION BY clause in its syntax, but we could support syntax like:

COPY table to 'folder/location' (format parquet, partition_by year)

which is the same syntax that DuckDB supports for this.
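
For readers unfamiliar with the layout being requested, here is a minimal Python sketch of how rows map to Hive-style `key=value` directories. It is only an illustration of the directory convention, not DataFusion's implementation; the helper name `hive_partition_paths` and the sample rows are hypothetical, and dropping the partition column from the stored rows is the usual Hive convention rather than something this issue specifies.

```python
from collections import defaultdict


def hive_partition_paths(rows, partition_cols, base_dir):
    """Group rows by partition column values and map each group to a Hive-style directory.

    Hypothetical helper for illustration only; not part of DataFusion.
    """
    groups = defaultdict(list)
    for row in rows:
        # Build the key=value path segments in the order the partition columns were given.
        segments = [f"{col}={row[col]}" for col in partition_cols]
        directory = "/".join([base_dir.rstrip("/")] + segments)
        # Assumption: partition values live in the path, so drop them from the stored row.
        remaining = {k: v for k, v in row.items() if k not in partition_cols}
        groups[directory].append(remaining)
    return dict(groups)


# Made-up sample data, partitioned by `year` as in the proposed COPY syntax.
rows = [
    {"year": 2022, "city": "Oslo", "sales": 10},
    {"year": 2023, "city": "Oslo", "sales": 12},
    {"year": 2023, "city": "Lima", "sales": 7},
]
layout = hive_partition_paths(rows, ["year"], "folder/location")
for directory, part_rows in sorted(layout.items()):
    print(directory, part_rows)
```

With `partition_by year`, each distinct year would get its own directory under `folder/location`, and the data files inside would hold only the remaining columns.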

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Dec 11, 2023
alamb commented Dec 11, 2023

I think this is a relatively good project for intermediate contributors. It could be done in a few PRs, as all the underlying code exists and we already have an example of writing to partitioned datasets. Implementing this would be a matter of hooking up the APIs correctly.

There are also already good examples of copy tests in https://github.com/apache/arrow-datafusion/blob/93b21bdcd3d465ed78b610b54edf1418a47fc497/datafusion/sqllogictest/test_files/copy.slt that can be extended

@alamb alamb added the good first issue Good for newcomers label Dec 11, 2023
Veeupup commented Dec 11, 2023

hi @alamb Can I try this issue? It seems very interesting!

alamb commented Dec 12, 2023

> hi @alamb Can I try this issue? It seems very interesting!

Can't wait to see what you come up with @Veeupup 🚀

JacobOgle commented

@Veeupup are you still working on this?

Veeupup commented Jan 4, 2024

@JacobOgle Hi, sorry about that! I have been a little busy these days and I'll start on it soon ~
