Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.18.1 reintroduces S3 multipart upload bug #2605

Closed
Zan-L opened this issue Jun 16, 2024 · 5 comments · Fixed by #2606
Closed

0.18.1 reintroduces S3 multipart upload bug #2605

Zan-L opened this issue Jun 16, 2024 · 5 comments · Fixed by #2606
Labels
bug Something isn't working

Comments

@Zan-L
Copy link

Zan-L commented Jun 16, 2024

Environment

Delta-rs version: 0.18.1

Binding: Python

Environment:

  • Cloud provider: AWS
  • OS: Linux/Windows
  • Other:

Bug

What happened: Same error as #890 but on S3 directly instead of non-S3

What you expected to happen:

How to reproduce it: Write regular size data to S3 with write_deltalake()

More details: 0.18.0 works fine

@Zan-L Zan-L added the bug Something isn't working label Jun 16, 2024
@abhiaagarwal
Copy link
Contributor

Can you give a reproducible example / any details about the size of the table / amount of writes? A log report? I wrote the modified code for 0.18.1 and it seems to have fixed other people's problems — I'll take a look at it :)

@Zan-L
Copy link
Author

Zan-L commented Jun 16, 2024

Hi,

Thank you for the prompt response. Unfortunately, that happened in our enterprise dev environment so I can't provide the proprietary data. I can provide two more observations though:

  • Extreme small data (less than thousands of rows) could still be written into S3, but after a certain threshold everything fails with that multipart upload error.
  • I checked the 0.18.1 change source code and there does seem to be removal of the old multipart upload related code. Maybe that old code contained the fix and a search with key word "multipart" is good to start with.

@abhiaagarwal
Copy link
Contributor

Understandable if you can't share enterprise data, but even a censored error code/log would be great!

But yeah, the behavior in 0.18.1 changed so that any writes buffer into an in-memory buffer, which flushes when it exceeds the threshold set. The threshold per the config is const DEFAULT_MAX_BUFFER_SIZE: usize = 4 * 1024 * 1024 ~ 4 MiB. Can you try tuning it higher by setting max_buffer_size key the storage_options dict when loading a table to a higher value?

object_store documentation says it should be at least 5 MiB, so this value should probably be tuned on our side anyways.

@Zan-L
Copy link
Author

Zan-L commented Jun 16, 2024

That is the root cause. I did a test write with 5*2**20 and it worked this time. Can you push another release with this fix?

Btw, the error message from before:

OSError: Generic S3 error: Error performing complete multipart request: Client error with status 400 Bad Request: <Error><Code>EntityTooSmall</Code><Message>Your proposed upload is smaller than the minimum allowed size</Message><ProposedSize>4328982</ProposedSize><MinSizeAllowed>5242880</MinSizeAllowed><PartNumber>1</PartNumber>

@abhiaagarwal
Copy link
Contributor

abhiaagarwal commented Jun 16, 2024

I don't control the release cycles, but you can compile from source with that fix! The current version will also work if you set that option.

@Zan-L Zan-L closed this as completed Jun 16, 2024
ion-elgreco pushed a commit that referenced this issue Jun 19, 2024
# Description

Object stores expected fixed lengths for all multipart upload parts
right up until the last part. The original logic just flushed when it
exceeded the threshold. Now, it flushes when the threshold is met
exclusively with the same fixed buffer, unless we're completing the
transaction, in which case the last piece is allowed to be smaller.

Bumps the constant to reflect that the minimum expected size by most
object stores is 5MiB. Also adds a UserWarning if a constant is
specified to be less.

Also releases the GIL in more places by moving the flushing logic to a
free function.

# Related Issue(s)
<!---
For example:

- closes #106
--->

Closes #2605 

# Documentation

<!---
Share links to useful documentation
--->

See:
[MultipartUpload](https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html)
docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants