Parallel Vacuum command #393

Closed
Dandandan opened this issue Aug 17, 2021 · 3 comments
Labels
binding/rust Issues for the Rust crate enhancement New feature or request help wanted Extra attention is needed storage/aws AWS S3 storage related

Comments

@Dandandan
Contributor

Dandandan commented Aug 17, 2021

Description

Currently, the vacuum command deletes files one by one, which is very slow on stores such as S3, especially when there are hundreds of thousands of files. I had a case (with Databricks/Spark) with more than 8 million stale files, which took days even with parallel calls (using spark.databricks.delta.vacuum.parallelDelete.enabled, I got around 80 deletes per second).
The delete calls could be parallelized (e.g. 100, 1,000, or 10,000 concurrent deletes) to speed up processing.
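
To make the idea concrete, here is a minimal sketch of bounded parallel deletes. It uses scoped OS threads from the Rust standard library purely as a stand-in so that it stays self-contained; a real implementation would more likely use an async runtime. `delete_object` and `parallel_delete` are hypothetical names, not part of any existing crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for one storage delete request (e.g. an S3 DELETE).
/// It fails on empty paths only so that the error path can be exercised.
fn delete_object(path: &str) -> Result<(), String> {
    if path.is_empty() {
        Err("empty path".to_string())
    } else {
        Ok(())
    }
}

/// Deletes `paths` with at most `concurrency` requests in flight.
/// Returns the number of successful deletes and the paths that failed.
fn parallel_delete(paths: &[String], concurrency: usize) -> (usize, Vec<String>) {
    // Shared work queue: each worker pulls the next path when it is idle.
    let queue = Mutex::new(paths.iter());
    let failed = Mutex::new(Vec::new());
    let deleted = AtomicUsize::new(0);

    thread::scope(|scope| {
        for _ in 0..concurrency.max(1) {
            scope.spawn(|| loop {
                // The lock is released at the end of this statement,
                // so the delete itself runs without holding it.
                let next = queue.lock().unwrap().next();
                let path = match next {
                    Some(path) => path,
                    None => break,
                };
                match delete_object(path) {
                    Ok(()) => {
                        deleted.fetch_add(1, Ordering::Relaxed);
                    }
                    Err(_) => failed.lock().unwrap().push(path.clone()),
                }
            });
        }
    });

    (deleted.into_inner(), failed.into_inner().unwrap())
}
```

Calling `parallel_delete(&paths, 100)` would keep up to 100 deletes in flight; failures are collected rather than aborting the run, so a retry pass could pick them up.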

Use Case
More performant vacuum.

Related Issue(s)

@Dandandan Dandandan added the enhancement New feature or request label Aug 17, 2021
@houqp
Member

houqp commented Aug 17, 2021

If only there were a Rust distributed batch compute framework that we could leverage here ;)

On a serious note, I fully agree that we should parallelize the delete calls. We should easily be able to make a couple thousand concurrent async calls from a single Rust process, which should help bring the vacuum time down to under an hour for 8M items.
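
As a rough sanity check on these numbers, the arithmetic can be written down directly. This is only a back-of-the-envelope sketch; the helper names are hypothetical and it assumes a constant, sustained delete rate.

```rust
/// Integer division rounded up. Assumes `b` is non-zero.
fn div_ceil(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Seconds needed to delete `files` objects at a sustained `deletes_per_sec`.
fn vacuum_seconds(files: u64, deletes_per_sec: u64) -> u64 {
    div_ceil(files, deletes_per_sec)
}

/// Minimum sustained deletes per second to finish `files` within `budget_secs`.
fn required_rate(files: u64, budget_secs: u64) -> u64 {
    div_ceil(files, budget_secs)
}
```

With the figures from this thread, `vacuum_seconds(8_000_000, 80)` gives 100,000 seconds (roughly 28 hours), while `required_rate(8_000_000, 3600)` gives 2,223 deletes per second to finish within an hour.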

@Dandandan
Contributor Author

It might also be wise, and a good start, to make better use of the storage APIs, e.g. multi-object delete, which performs up to 1000 deletes in one call: https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
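
A minimal sketch of the batching side of this, assuming the actual request is issued by a caller-supplied callback. `delete_in_batches` and `delete_batch` are hypothetical names; no real S3 client API is shown here.

```rust
/// Maximum number of keys accepted by one S3 DeleteObjects request.
const MAX_KEYS_PER_REQUEST: usize = 1000;

/// Splits `keys` into batches of at most 1000 and hands each batch to
/// `delete_batch`, a hypothetical callback that would issue one
/// multi-object delete request. Stops at the first error.
/// Returns the number of requests made.
fn delete_in_batches<F>(keys: &[String], mut delete_batch: F) -> Result<usize, String>
where
    F: FnMut(&[String]) -> Result<(), String>,
{
    let mut requests = 0;
    for batch in keys.chunks(MAX_KEYS_PER_REQUEST) {
        delete_batch(batch)?;
        requests += 1;
    }
    Ok(requests)
}
```

For 8 million keys this comes to 8,000 requests instead of 8 million, and the batches themselves could still be sent concurrently.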

@houqp
Member

houqp commented Aug 17, 2021

Good call @Dandandan, filed #394 for the batch delete support. With batch delete, we should be able to get it down from 1 hour to minutes!

@houqp houqp added binding/rust Issues for the Rust crate help wanted Extra attention is needed storage/aws AWS S3 storage related labels Aug 17, 2021
wjones127 pushed a commit that referenced this issue Jul 27, 2023
# Description
Bulk delete was added to the object store in
apache/arrow-rs#2615, which deletes multiple
files within a single API call if the underlying store supports it. If
it is not supported, concurrent requests are performed underneath.

This PR updates vacuum with the object store changes. Currently only S3
will see any benefit, since the default bulk delete is not overridden
for other backends.
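
The "default that a backend may override" pattern described here can be illustrated with a simplified sketch. The `Store` trait and the `LocalLike` and `S3Like` types below are hypothetical and are not the real object store API; for brevity the fallback issues its per-object requests sequentially rather than concurrently.

```rust
use std::cell::Cell;

/// Hypothetical, simplified storage trait (not the real object store API).
trait Store {
    /// Deletes a single object with one request.
    fn delete(&self, path: &str) -> Result<(), String>;

    /// Deletes many objects. The default falls back to one request per
    /// object; a backend with a native bulk API can override it.
    fn delete_many(&self, paths: &[String]) -> Vec<Result<(), String>> {
        paths.iter().map(|p| self.delete(p)).collect()
    }
}

/// Backend without a bulk API: inherits the default `delete_many`.
struct LocalLike {
    requests: Cell<usize>,
}

impl Store for LocalLike {
    fn delete(&self, _path: &str) -> Result<(), String> {
        self.requests.set(self.requests.get() + 1);
        Ok(())
    }
}

/// Backend with a bulk API: one request per chunk of up to 1000 keys.
struct S3Like {
    requests: Cell<usize>,
}

impl Store for S3Like {
    fn delete(&self, _path: &str) -> Result<(), String> {
        self.requests.set(self.requests.get() + 1);
        Ok(())
    }

    fn delete_many(&self, paths: &[String]) -> Vec<Result<(), String>> {
        let mut results = Vec::with_capacity(paths.len());
        for chunk in paths.chunks(1000) {
            // Count one simulated bulk request for the whole chunk.
            self.requests.set(self.requests.get() + 1);
            results.extend(chunk.iter().map(|_| Ok(())));
        }
        results
    }
}
```

In this sketch, deleting 2,500 paths costs 2,500 simulated requests on `LocalLike` but only 3 on `S3Like`, which is why only the backend that overrides the bulk path sees a benefit.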

# Related Issue(s)
- progresses #393
@rtyler rtyler closed this as completed Oct 25, 2023