
pl.read_parquet cannot read signed GCS signed url on 0.20, but can on <0.19 #14908

Closed
2 tasks done
hugokitano opened this issue Mar 7, 2024 · 16 comments · Fixed by #17774 or #18274
Labels: A-io-cloud (Area: reading/writing to cloud storage), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

@hugokitano

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

signed_url = "https://storage.googleapis.com/bucket/path/to/file.parquet?X-Goog-Algorithm..."

pl.read_parquet(signed_url)

This is on 0.20.14.

Log output

File ~/.../polars/_utils/deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~/.../polars/_utils/deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~/.../polars/io/parquet/functions.py:171, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    158         return pl.DataFrame._read_parquet(
    159             source_prep,
    160             columns=columns,
   (...)
    167             rechunk=rechunk,
    168         )
    170 # For other inputs, defer to `scan_parquet`
--> 171 lf = scan_parquet(
    172     source,  # type: ignore[arg-type]
    173     n_rows=n_rows,
    174     row_index_name=row_index_name,
    175     row_index_offset=row_index_offset,
    176     parallel=parallel,
    177     use_statistics=use_statistics,
    178     hive_partitioning=hive_partitioning,
    179     rechunk=rechunk,
    180     low_memory=low_memory,
    181     cache=False,
    182     storage_options=storage_options,
    183     retries=retries,
    184 )
    186 if columns is not None:
    187     if is_int_sequence(columns):

File ~/.../polars/_utils/deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~/.../polars/_utils/deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~/.../polars/io/parquet/functions.py:311, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, cache, storage_options, retries)
    308 else:
    309     source = [normalize_filepath(source) for source in source]
--> 311 return pl.LazyFrame._scan_parquet(
    312     source,
    313     n_rows=n_rows,
    314     cache=cache,
    315     parallel=parallel,
    316     rechunk=rechunk,
    317     row_index_name=row_index_name,
    318     row_index_offset=row_index_offset,
    319     storage_options=storage_options,
    320     low_memory=low_memory,
    321     use_statistics=use_statistics,
    322     hive_partitioning=hive_partitioning,
    323     retries=retries,
    324 )

File ~/.../polars/lazyframe/frame.py:466, in LazyFrame._scan_parquet(cls, source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, retries)
    463     storage_options = None
    465 self = cls.__new__(cls)
--> 466 self._ldf = PyLazyFrame.new_from_parquet(
    467     source,
    468     sources,
    469     n_rows,
    470     cache,
    471     parallel,
    472     rechunk,
    473     _prepare_row_index_args(row_index_name, row_index_offset),
    474     low_memory,
    475     cloud_options=storage_options,
    476     use_statistics=use_statistics,
    477     hive_partitioning=hive_partitioning,
    478     retries=retries,
    479 )
    480 return self

ComputeError: Generic HTTP error: Request error: Client error with status 405 Method Not Allowed: No Body

Issue description

This URL is a signed URL generated with the google.cloud.storage.Client package:

blob = bucket.blob(filename)
blob.generate_signed_url(
    expiration=expiration, method="GET", version="v4", response_disposition=response_disposition
)
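
For context, a hedged, self-contained sketch of the same flow (the bucket name, object path, and one-hour expiration below are placeholders, not taken from the report):

import datetime

import polars as pl
from google.cloud import storage

client = storage.Client()                    # assumes default GCP credentials
bucket = client.bucket("my-bucket")          # placeholder bucket name
blob = bucket.blob("path/to/file.parquet")   # placeholder object path

signed_url = blob.generate_signed_url(
    expiration=datetime.timedelta(hours=1),  # placeholder expiration
    method="GET",
    version="v4",
)

df = pl.read_parquet(signed_url)  # works on <=0.19.12, raises ComputeError (405) on 0.20.x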

Expected behavior

On 0.19.12, for example, the reproducible example above works.

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.10.7 (v3.10.7:6cc6b13308, Sep  5 2022, 14:02:52) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              14.0.2
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.50
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@hugokitano hugokitano added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 7, 2024
@stinodego stinodego added the A-io-cloud Area: reading/writing to cloud storage label Mar 29, 2024
@rahij

rahij commented Jun 27, 2024

Hi @stinodego, we are having the same issue with the new version. Any idea when this might be prioritized? If you have any pointers on where we could start looking, we're happy to dig in as well. Thanks!

@stinodego
Member

It stopped working because read_parquet now uses our native engine rather than fsspec. We'll have to implement support ourselves, but it seems this doesn't have priority right now.

If you cannot live without signed URL support, you'll have to use fsspec for now to load the data and feed it to Polars, e.g. something like:

import fsspec
import polars as pl

with fsspec.open(url) as f:
    df = pl.read_parquet(f)

@rahij

rahij commented Jun 27, 2024

I have the same issue with scan_parquet as well, which unfortunately won't work with fsspec. I was under the impression that the object_store crate was used for that?

@rahij

rahij commented Jul 9, 2024

Hi @stinodego just following up on this again in case you had any ideas. Thanks!

@rahij

rahij commented Jul 17, 2024

@stinodego friendly ping on this - would really appreciate it if you had any ideas about why scan_parquet does not work with signed urls.

@rahij

rahij commented Jul 17, 2024

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/dist-packages/polars/lazyframe/frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: Generic HTTP error: Request error: Client error with status 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>BadRequest</Code><Message>An error occurred when parsing the HTTP request PROPFIND at &#39;/lakefs-bucket/0190c1e7-73db-7203-af6a-9a33cc73d5da/data/gerg24hf8rpc738bs8gg/cqc0k69f8rpc738bs9vg,_AoRc1dMYeTHaY3sd5J6HK537SeTDUJ_pf5m_UoqCmA&#39;</Message><Resource>/lakefs-bucket/0190c1e7-73db-7203-af6a-9a33cc73d5da/data/gerg24hf8rpc738bs8gg/cqc0k69f8rpc738bs9vg,_AoRc1dMYeTHaY3sd5J6HK537SeTDUJ_pf5m_UoqCmA</Resource><RequestId></RequestId><HostId></HostId></Error>

@ritchie46
Member

@nameexhaustion can you take a look here?

@deanm0000
Collaborator

deanm0000 commented Jul 21, 2024

My guess is that Polars sees the "=", infers that it is a Hive partition, and so tries to get a list of files, which is where the 405 Method Not Allowed comes from. This is just a wild guess, though; I'm on mobile, so I haven't verified it at all.

@rahij

rahij commented Jul 21, 2024

The only "=" I see in the URL are for the query params (e.g ?X-Amz-Algorithm=x&X-Amz-Credential=y) - everything else is already url encoded. But the theory makes sense to me, as I can read and scan public http urls which don't have query params just fine.

@deanm0000
Collaborator

I think this needs to be more robust. Perhaps split by "/" and don't look for "=" in the last part (the file and query-parameter part). That, or use a full-on URL parsing function to recognize query parameters.
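
A hedged Python sketch of the latter suggestion (illustrative only; this is not the Rust code Polars actually uses): parse the URL first so that "=" signs in the query string never reach the Hive-partition check.

# Illustrative sketch, not Polars internals: only the path component of a URL
# is inspected for key=value segments, so query-string "=" signs are ignored.
from urllib.parse import urlsplit

def hive_partition_segments(url):
    path = urlsplit(url).path  # everything after "?" is dropped here
    return [seg for seg in path.split("/") if "=" in seg]

signed = "https://storage.googleapis.com/bucket/data.parquet?X-Goog-Algorithm=GOOG4-RSA-SHA256"
hived = "https://storage.googleapis.com/bucket/year=2024/data.parquet"

print(hive_partition_segments(signed))  # [] -> no partition keys, nothing to list
print(hive_partition_segments(hived))   # ['year=2024']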

@rahij

rahij commented Jul 21, 2024

In case this helps, I tried calling read_parquet with a URL whose hostname points at a local server, so I could capture the request logs.

Without query params (and hence no "="), it does a HEAD request to the URL.
With query params (and hence some "="), it does a PROPFIND request to the URL, which matches the stacktrace I posted earlier in this thread. This seems weird, since PROPFIND is not even part of the core HTTP spec; it comes from WebDAV.
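
A hedged sketch of how that kind of capture can be reproduced (the port and handler name are arbitrary choices, not taken from the thread): a tiny server that prints the HTTP method of every incoming request.

# Minimal request-logging server: prints the method and path of each request
# and replies 404, so the client's request pattern (HEAD vs PROPFIND vs GET)
# becomes visible.
from http.server import BaseHTTPRequestHandler, HTTPServer

class LogHandler(BaseHTTPRequestHandler):
    def _log_and_404(self):
        print(f"{self.command} {self.path}")
        self.send_response(404)
        self.end_headers()

    # Route the methods of interest through the same logger.
    do_GET = do_HEAD = do_PROPFIND = _log_and_404

HTTPServer(("127.0.0.1", 8080), LogHandler).serve_forever()

Pointing pl.read_parquet at http://127.0.0.1:8080/... then shows the observation above: a plain URL produces a HEAD, while a URL with query params produces a PROPFIND, before the call errors out on the 404 response.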

@deanm0000
Collaborator

The thing I thought needed changing is here:

let path_parts = $e[hive_start_idx..].split(sep);

After further review, it looks like object_store makes a PROPFIND request when trying to list a directory. On the Polars side, it tries to list when it thinks there's a glob pattern:

.list(Some(&Path::from(prefix)))

I can't tell on mobile why it thinks it's a glob pattern.

@nameexhaustion nameexhaustion self-assigned this Jul 22, 2024
@nameexhaustion nameexhaustion added P-medium Priority: medium accepted Ready for implementation and removed needs triage Awaiting prioritization by a maintainer labels Jul 22, 2024
@rahij

rahij commented Jul 22, 2024

Thank you for the prompt fix! I'll report back once this change is released.

@rahij

rahij commented Aug 6, 2024

I've just tried this with 1.4.1 and unfortunately it still does not work, though with a different error (the same URL works fine with pandas).
Stacktrace with read_parquet:

  File ".../python3.8/site-packages/polars/io/parquet/functions.py", line 208, in read_parquet
    return lf.collect()
  File "...//python3.8/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: Generic HTTP error: Request error: Client error with status 403 Forbidden: No Body

I'm able to create a LazyFrame, but when I collect it:

  File "<stdin>", line 1, in <module>
  File ".../python3.8/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: Generic HTTP error: Request error: Client error with status 403 Forbidden: No Body

@rahij

rahij commented Aug 6, 2024

I've tried this with a CSV file from the same object store and, weirdly, read_csv works fine, whereas scan_csv followed by collect() fails with the same error as above.

@mrdaulet

Hello, I'm still getting this problem when reading signed URLs from S3. I think there are two issues here, and only the one concerning path expansion was resolved; the other problem with signed URLs remains.

I believe this started when Polars switched to object_store in #15069. The code first does a HEAD request to get the size of the Parquet file, followed by a GET, but a signed URL is only valid for one method; that's why we see the "405 Method Not Allowed" error (other S3-compatible providers may return 403 in this case). Typically you wouldn't generate a separate signed URL for the HEAD method. Could scan_parquet use GET instead and set the Range header to avoid downloading any data?
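
A hedged way to check that theory with plain requests (the URL below is a placeholder; exact status codes depend on the provider):

# Probe a URL that was signed for GET only. A HEAD should be rejected because
# the verb is part of the signature, while a ranged GET fetches only a few
# bytes, which is the kind of fallback suggested above.
import requests

signed_url = "https://storage.googleapis.com/bucket/file.parquet?X-Goog-Algorithm=..."  # placeholder

head = requests.head(signed_url)
print(head.status_code)  # expected 403/405: HEAD is not the method the URL was signed for

ranged = requests.get(signed_url, headers={"Range": "bytes=0-3"})
print(ranged.status_code, ranged.content)  # expected 206 with b"PAR1", the Parquet magic bytes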
