cannot read from public GCS bucket if non logged in #2859

Closed
lostmygithubaccount opened this issue Sep 7, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@lostmygithubaccount

Environment

Delta-rs version: deltalake==0.19.2

Binding: Python

Environment:

  • Cloud provider: GCS for storage, but running locally
  • OS: MacOS
  • Other:

Bug

What happened:

I have a public GCS bucket with a bunch of Delta Lake tables. The bucket grants viewer access to allUsers, meaning unauthenticated users can read from it. You can easily test this with pandas or other libraries (I hit this with Ibis):

[ins] In [1]: import gcsfs

[ins] In [2]: import pandas as pd

[ins] In [3]: from deltalake import DeltaTable

[nav] In [4]: pd.read_parquet("gs://ibis-analytics/penguins.parquet", storage_options={"token": "anon"})
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
..         ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009

[344 rows x 8 columns]

[nav] In [5]: pd.read_csv("gs://ibis-analytics/penguins.csv", storage_options={"token": "anon"})
Out[5]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3       Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
..         ...        ...             ...            ...                ...          ...     ...   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009

[344 rows x 8 columns]

But trying to read a Delta Lake table from the same bucket, apparently whenever the session is not authenticated with GCP, results in an error:

[ins] In [6]: DeltaTable("gs://ibis-analytics/penguins.delta", storage_options={"token": "anon"})
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[6], line 1
----> 1 DeltaTable("gs://ibis-analytics/penguins.delta", storage_options={"token": "anon"})

File ~/repos/ibis-analytics/.venv/lib/python3.12/site-packages/deltalake/table.py:380, in DeltaTable.__init__(self, table_uri, version, storage_options, without_files, log_buffer_size)
    360 """
    361 Create the Delta Table from a path with an optional version.
    362 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    377
    378 """
    379 self._storage_options = storage_options
--> 380 self._table = RawDeltaTable(
    381     str(table_uri),
    382     version=version,
    383     storage_options=storage_options,
    384     without_files=without_files,
    385     log_buffer_size=log_buffer_size,
    386 )

OSError: Generic GCS error: Error performing token request: Error after 10 retries in 8.200125916s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken)

there's not a lot in the stacktrace to go on
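
That said, the URL in the error message is a hint. 169.254.169.254 is a link-local address, and the path is the Compute Engine metadata endpoint for service-account tokens, so the client appears to be asking instance metadata for credentials rather than making an anonymous request. A minimal sketch that pulls the host out of the message and checks this; `token_request_host` is a hypothetical helper name, not part of deltalake:

```python
import ipaddress
import re
from typing import Optional
from urllib.parse import urlparse

# Error text copied from the OSError in the traceback above.
ERROR = (
    "Generic GCS error: Error performing token request: Error after 10 retries "
    "in 8.200125916s, max_retries:10, retry_timeout:180s, source:error sending "
    "request for url (http://169.254.169.254/computeMetadata/v1/instance/"
    "service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com"
    "%2Foauth2%2Fv4%2Ftoken)"
)


def token_request_host(message: str) -> Optional[str]:
    """Return the host of the URL quoted in a 'for url (...)' error message."""
    match = re.search(r"for url \((\S+?)\)", message)
    if match is None:
        return None
    return urlparse(match.group(1)).hostname


host = token_request_host(ERROR)
# Link-local addresses (169.254.0.0/16) are only reachable from the local
# network segment, which is why this request fails on a laptop.
is_link_local = host is not None and ipaddress.ip_address(host).is_link_local
print(host, is_link_local)
```

On a machine outside GCP nothing answers at that address, which matches the retries and eventual failure in the error.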

What you expected to happen:

The DeltaTable call above should work without authentication, the same way the pandas reads do.

How to reproduce it:

You can try it out on the bucket noted above: gs://ibis-analytics has penguins.csv, penguins.parquet, and penguins.delta in it

More details:

this was reproduced by others as well

@lostmygithubaccount added the bug label Sep 7, 2024
@ion-elgreco
Collaborator

We simply use the object store crate in Rust. If it's not working, then it's because anon is not a supported config. You should try asking about this in arrow-rs, where object store belongs.

@lostmygithubaccount
Author

is there any documentation on what is supported in storage_options?

lostmygithubaccount added a commit to ibis-project/ibis-analytics that referenced this issue Sep 7, 2024
@ion-elgreco
Collaborator

> is there any documentation on what is supported in storage_options?

You can find that here: https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html
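
To make the mismatch concrete: `{"token": "anon"}` is a gcsfs-style option, which is why it works with pandas, while the page linked above lists differently named keys. A rough sketch that flags options not on that list; the key set below is my reading of the GoogleConfigKey docs and is an assumption (it may be incomplete or out of date), and `unrecognized_keys` is a hypothetical helper, not a deltalake or object_store API:

```python
# Option names as I read them from the GoogleConfigKey documentation.
# Assumption: this set may be incomplete or differ between versions.
DOCUMENTED_GCS_KEYS = {
    "google_service_account",
    "google_service_account_key",
    "google_bucket",
    "google_application_credentials",
}


def unrecognized_keys(storage_options: dict) -> list:
    """Return the option names that are not in the documented key set."""
    return sorted(
        key for key in storage_options if key.lower() not in DOCUMENTED_GCS_KEYS
    )


# The options from the failing DeltaTable call in this issue.
print(unrecognized_keys({"token": "anon"}))
```

This only compares names; it says nothing about how deltalake itself treats an option it does not recognize.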

@ion-elgreco closed this as not planned Sep 8, 2024

2 participants