
Python write_deltalake() to Non-AWS S3 failing #890

Closed
shazamkash opened this issue Oct 17, 2022 · 15 comments
Labels
bug Something isn't working

Comments

@shazamkash

Environment

Delta-rs version: 0.6.2

Binding: Python

Environment:
Docker container:
Python: 3.10.7
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)


Bug

What happened:

The Delta Lake write fails when trying to write a table to Ceph-based S3 (non-AWS). I am writing the table to a path that did not previously contain a Delta table or any other files.

I have also tried different write modes, but the write still does not work and throws the same error.

My code:

storage_options = {"AWS_ACCESS_KEY_ID": f"{credentials.access_key}", 
                   "AWS_SECRET_ACCESS_KEY": f"{credentials.secret_key}",
                   "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net"
                  }
df = pd.DataFrame({'x': [1, 2, 3]})
table_uri = "s3a://<bucket-name>/delta_test"
dl.writer.write_deltalake(table_uri, df, storage_options=storage_options)

Fails with the following error:

[screenshot of the error message]

Any idea what might be the problem? I am able to read the delta tables with the same storage_options.
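
For completeness, a minimal sketch of the read that does work for me with the same storage_options (the table path below is just a placeholder):

from deltalake import DeltaTable

# Reading an existing table with the same credentials and endpoint succeeds
dt = DeltaTable("s3a://<bucket-name>/<existing-table>", storage_options=storage_options)
existing_df = dt.to_pandas()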

@shazamkash shazamkash added the bug Something isn't working label Oct 17, 2022
@joshuarobinson

Possibly very related to the issue I recently filed: #883

@roeap
Collaborator

roeap commented Oct 19, 2022

I am not too deep into the S3 side of things, but from the error message it seems the underlying file system is trying to get credentials from an ECS metadata endpoint, which seems strange since that is not configured in the snippet.

Just to rule it out - could there be an environment variable configured that causes this?

Then again, I might be completely off - this is just from a quick scan.

@Thelin90

Thelin90 commented Nov 1, 2022

I currently get SignatureDoesNotMatch when providing credentials.

@Thelin90

Thelin90 commented Nov 2, 2022

I am working around this now by:

  • Creating the delta table locally
  • Uploading the table to S3 via s3fs (sketch below)
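
A minimal sketch of that workaround (s3fs needs to be installed; the endpoint, keys, bucket, and local path below are placeholders):

import pandas as pd
import s3fs
from deltalake import write_deltalake

# 1. Create the delta table locally
df = pd.DataFrame({"x": [1, 2, 3]})
write_deltalake("/tmp/delta_test", df)

# 2. Upload the table directory to S3 via s3fs
fs = s3fs.S3FileSystem(
    key="<access-key>",
    secret="<secret-key>",
    client_kwargs={"endpoint_url": "https://xxx.yyy.zzz.net"},
)
fs.put("/tmp/delta_test", "<bucket-name>/delta_test", recursive=True)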

@shazamkash
Author

I am working around this now by:

  • Create the delta table locally
  • Upload table to s3 via s3fs

Thanks, I will give this a try.

@wjones127
Collaborator

We just released 0.6.4, which includes several fixes related to passing down credentials. Could you check whether writing is working for you on that version?
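
If it helps, a quick way to confirm that the upgraded version is the one actually being picked up before retrying (just a suggestion, using only the standard library):

from importlib.metadata import version

# Should print 0.6.4 (or newer) after upgrading
print(version("deltalake"))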

@joshuarobinson

joshuarobinson commented Nov 29, 2022

I have tried 0.6.4 and hit a new error, which seems to be related to multi-part uploads. But I still need to confirm whether this is an issue with our object store (SwiftStack) or not. For reference, I am able to successfully write the same dataframe to S3 with pyarrow directly (using write_dataset).

Traceback (most recent call last):
  File "/delta_write.py", line 28, in <module>
    write_deltalake('s3://joshuarobinson/test_deltalake/delta/', df, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/deltalake/writer.py", line 250, in write_deltalake
    ds.write_dataset(
  File "/usr/local/lib/python3.10/site-packages/pyarrow/dataset.py", line 988, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 2859, in pyarrow._dataset._filesystemdataset_write
deltalake.PyDeltaTableError: Generic S3 error: Error performing complete multipart request: response error "<?xml version='1.0' encoding='UTF-8'?>
<Error><Code>EntityTooSmall</Code><Message>Your proposed upload is smaller than the minimum allowed object size.</Message></Error>", after 0 retries: HTTP status client error (400 Bad Request) for url (https://pbss.s8k.io/joshuarobinson/test_deltalake/delta/0-899c5178-cb3a-4f75-9072-aa68ce4b162d-0.parquet?uploadId=MWUwZWRiOTAtYTRlMC00MzcxLTk1NDItYzc1MjA4OGVmYzAy)

In the meantime, I wasn't able to find any options on the PyArrow S3 filesystem to configure multi-part uploads, so please let me know if you are aware of any and I'll test with different configs.

Update: our current theory is that the issue might be related to the "complete-multipart-upload" request not including all the necessary XML components.

Is it correct that the write path uses the rusoto S3 library, or is it using the Arrow C++ S3 implementation?

Update 2: if I write a very small table (14 KB), everything works! So I think this points to multi-part uploads being the issue, since presumably they are disabled below a certain size threshold.
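
For reference, the direct pyarrow write that succeeds against the same endpoint looks roughly like this (keys are placeholders and the target prefix is made up for illustration):

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# pyarrow's own S3 filesystem completes the multi-part upload without errors
s3 = fs.S3FileSystem(
    access_key="<access-key>",
    secret_key="<secret-key>",
    endpoint_override="https://pbss.s8k.io",
    region="us-east-1",
)
table = pa.table({"x": [1, 2, 3]})
ds.write_dataset(table, "joshuarobinson/test_pyarrow", filesystem=s3, format="parquet")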

@joshuarobinson

joshuarobinson commented Nov 29, 2022

I am also hitting another issue:

Traceback (most recent call last):
  File "/delta_write.py", line 30, in <module>
    write_deltalake('s3://joshuarobinson/test_deltalake/delta/', df, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/deltalake/writer.py", line 164, in write_deltalake
    storage_options = dict(
TypeError: dict() got multiple values for keyword argument 'AWS_ENDPOINT_URL'

storage_options is set like this:

storage_options = {"AWS_ENDPOINT_URL": ENDPOINT_URL, "AWS_REGION": 'us-east-1'}

The failure happens here: https://github.com/delta-io/delta-rs/blob/main/python/deltalake/writer.py#L164

I guess this is happening because my source data uses a pyarrow filesystem that also defines the endpoint_url field, so the key ends up duplicated. Possibly?
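
A tiny standalone illustration of how that TypeError arises when the same key comes in from two places (this is just my guess at the shape of the failing call, not the actual writer.py code):

# Unpacking two mappings that share a key into one call raises exactly this error
filesystem_options = {"AWS_ENDPOINT_URL": "https://from-the-filesystem"}
storage_options = {"AWS_ENDPOINT_URL": "https://from-the-user", "AWS_REGION": "us-east-1"}

merged = dict(**filesystem_options, **storage_options)
# TypeError: dict() got multiple values for keyword argument 'AWS_ENDPOINT_URL'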

@wjones127
Collaborator

Yes, that will be fixed in #912.

@Thelin90

Hello! I will give it a go, will let you know as soon as possible!

@shazamkash
Author

shazamkash commented Nov 29, 2022

I can confirm that I have a similar problem to @joshuarobinson's:


PyDeltaTableError: Generic S3 error: Error performing complete multipart request: response error "<?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooSmall</Code><BucketName>kafka-shazam</BucketName><RequestId>tx00000b6e0ebdcf5b90da8-0063869453-35b55ff45-default</RequestId><HostId>35b55ff45-default-default</HostId></Error>", after 0 retries: HTTP status client error (400 Bad Request) for url (https://xxxxxx/kafka-shazam/delta_test/rs_test/0-4b41fe34-3605-4b3c-8cd9-2ab61eb8f34a-0.parquet?uploadId=2%7EhY7nplOyckHBl7ePcT67gt77ySSzIpJ)

@shazamkash
Author

I am also facing the error below with some datasets; I am not entirely sure whether this is related or whether I should open a new issue. These datasets read fine with pandas or pyarrow, and I am able to write them to Delta Lake using pyspark.


IndexError: 0 out of bounds
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/deltalake/writer.py", line 208, in visitor
    stats = get_file_stats_from_metadata(written_file.metadata)
  File "/opt/conda/lib/python3.9/site-packages/deltalake/writer.py", line 369, in get_file_stats_from_metadata
    name = metadata.row_group(0).column(column_idx).path_in_schema
  File "pyarrow/_parquet.pyx", line 769, in pyarrow._parquet.FileMetaData.row_group
  File "pyarrow/_parquet.pyx", line 506, in pyarrow._parquet.RowGroupMetaData.__cinit__
IndexError: 0 out of bounds

@wjones127
Collaborator

@shazamkash please open a new issue for that error.

@wjones127
Collaborator

The EntityTooSmall error is a bug in the S3 implementation, and it's triggered if any files in the table are over 5 MB. (It doesn't seem to happen in the local emulators we use for testing, but does happen in AWS S3.) I have a fix ready in apache/arrow-rs#3234, which will hopefully be included in the next release. Thanks for reporting this!
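
For anyone who wants to try to reproduce against a real (non-emulated) endpoint, a rough sketch that should cross that threshold (keys, endpoint, and bucket are placeholders; the row count is just chosen so the resulting parquet file is well over 5 MB):

import numpy as np
import pandas as pd
from deltalake import write_deltalake

# Random doubles compress poorly, so this comfortably exceeds 5 MB on disk
df = pd.DataFrame({"x": np.random.rand(10_000_000)})

storage_options = {
    "AWS_ACCESS_KEY_ID": "<access-key>",
    "AWS_SECRET_ACCESS_KEY": "<secret-key>",
    "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
}
write_deltalake("s3://<bucket-name>/entity_too_small_repro", df, storage_options=storage_options)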

@wjones127
Collaborator

This will be fixed in the next release.
