docs(python): Add credential provider section and examples to user guide #19487

Merged 1 commit on Oct 29, 2024
50 changes: 42 additions & 8 deletions docs/source/src/python/user-guide/io/cloud-storage.py
@@ -7,7 +7,16 @@
df = pl.read_parquet(source)
# --8<-- [end:read_parquet]

-# --8<-- [start:scan_parquet]
+# --8<-- [start:scan_parquet_query]
+import polars as pl

+source = "s3://bucket/*.parquet"

+df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id","value").collect()
+# --8<-- [end:scan_parquet_query]


+# --8<-- [start:scan_parquet_storage_options_aws]
import polars as pl

source = "s3://bucket/*.parquet"
@@ -17,17 +26,42 @@
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}
-df = pl.scan_parquet(source, storage_options=storage_options)
-# --8<-- [end:scan_parquet]
+df = pl.scan_parquet(source, storage_options=storage_options).collect()
+# --8<-- [end:scan_parquet_storage_options_aws]

+# --8<-- [start:credential_provider_class]
+lf = pl.scan_parquet(
+    "s3://.../...",
+    credential_provider=pl.CredentialProviderAWS(
+        profile_name="...",
+        assume_role={
+            "RoleArn": "...",
+            "RoleSessionName": "...",
+        },
+    ),
+)

-# --8<-- [start:scan_parquet_query]
-import polars as pl
+df = lf.collect()
+# --8<-- [end:credential_provider_class]

-source = "s3://bucket/*.parquet"
+# --8<-- [start:credential_provider_custom_func]
+def get_credentials() -> pl.CredentialProviderFunctionReturn:
+    expiry = None

+    return {
+        "aws_access_key_id": "...",
+        "aws_secret_access_key": "...",
+        "aws_session_token": "...",
+    }, expiry

-df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id","value").collect()
-# --8<-- [end:scan_parquet_query]

+lf = pl.scan_parquet(
+    "s3://.../...",
+    credential_provider=get_credentials,
+)

+df = lf.collect()
+# --8<-- [end:credential_provider_custom_func]

# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
12 changes: 9 additions & 3 deletions docs/source/src/rust/user-guide/io/cloud-storage.rs
@@ -31,12 +31,18 @@ async fn main() {
}
// --8<-- [end:read_parquet]

-// --8<-- [start:scan_parquet]
-// --8<-- [end:scan_parquet]

// --8<-- [start:scan_parquet_query]
// --8<-- [end:scan_parquet_query]

+// --8<-- [start:scan_parquet_storage_options_aws]
+// --8<-- [end:scan_parquet_storage_options_aws]

+// --8<-- [start:credential_provider_class]
+// --8<-- [end:credential_provider_class]

+// --8<-- [start:credential_provider_custom_func]
+// --8<-- [end:credential_provider_custom_func]

// --8<-- [start:scan_pyarrow_dataset]
// --8<-- [end:scan_pyarrow_dataset]

32 changes: 24 additions & 8 deletions docs/source/user-guide/io/cloud-storage.md
@@ -18,23 +18,39 @@ To read from cloud storage, additional dependencies may be needed depending on t

## Reading from cloud storage

-Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.
+Polars supports reading Parquet, CSV, IPC and NDJSON files from cloud storage:

{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.

## Scanning from cloud storage with query optimisation

-Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source url such as authentication details or storage region. Polars looks for these as environment variables but we can also do this manually by passing a `dict` as the `storage_options` argument.
+Reading from cloud storage with the `pl.scan_*` functions can benefit from [predicate and projection pushdowns](../lazy/optimizations.md): the query optimizer applies them before the file is downloaded, which can significantly reduce the amount of data that needs to be downloaded. Query evaluation is triggered by calling `collect`.

-{{code_block('user-guide/io/cloud-storage','scan_parquet',['scan_parquet'])}}
+{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}

-This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the `object_store.rs` library internally to manage the interface with the cloud storage providers and so no extra dependencies are required in Python to scan a cloud Parquet file.
+## Cloud authentication

-If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimizer will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.
+Polars can automatically load default credential configurations for some cloud providers. When
+this does not happen, the credentials that Polars uses for authentication can be configured
+manually. This can be done in a few ways:

-{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}
+### Using `storage_options`

+- Credentials can be passed as configuration keys in a dict with the `storage_options` parameter:

+{{code_block('user-guide/io/cloud-storage','scan_parquet_storage_options_aws',['scan_parquet'])}}
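
As a minimal sketch of the `storage_options` approach, the dict could be built from environment variables rather than from hard-coded secrets. The helper name `storage_options_from_env` and the environment variable names are assumptions made for illustration and are not part of this PR; only the dict keys are taken from the snippets in this change.

```python
import os


def storage_options_from_env(environ=os.environ):
    """Build a storage_options dict from environment variables (hypothetical helper)."""
    # Mapping from storage_options key to the (assumed) environment variable name.
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_region": "AWS_REGION",
    }
    # Keep only the keys whose variables are actually set.
    return {key: environ[var] for key, var in mapping.items() if var in environ}


storage_options = storage_options_from_env()
# The dict could then be passed on in the same way as in the PR's example:
# pl.scan_parquet("s3://bucket/*.parquet", storage_options=storage_options)
```

Keeping the lookup in one function means the secrets never appear in the source file, and unset variables are simply left out of the dict.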

+### Using one of the available `CredentialProvider*` utility classes

+- There may be a utility class `pl.CredentialProvider*` that provides the required authentication functionality. For example, `pl.CredentialProviderAWS` supports selecting AWS profiles, as well as assuming an IAM role:

+{{code_block('user-guide/io/cloud-storage','credential_provider_class',['scan_parquet'])}}

+### Using a custom `credential_provider` function

+- Some environments may require custom authentication logic (e.g. AWS IAM role-chaining). For these cases a Python function can be provided for Polars to use to retrieve credentials:

+{{code_block('user-guide/io/cloud-storage','credential_provider_custom_func',['scan_parquet'])}}
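
A sketch of what such a function might look like when the credentials are temporary. It assumes that the second element of the returned tuple is the expiry as a Unix timestamp in seconds (with `None` meaning no expiry), which should be checked against the Polars API reference. `fetch_temporary_credentials` and `lifetime_seconds` are hypothetical names standing in for environment-specific logic; they are not Polars or AWS APIs.

```python
import time
from typing import Dict, Optional, Tuple


def fetch_temporary_credentials() -> Dict[str, str]:
    # Hypothetical placeholder: real code would call the environment's
    # own authentication service here (e.g. a role-chaining step).
    return {
        "aws_access_key_id": "...",
        "aws_secret_access_key": "...",
        "aws_session_token": "...",
    }


def get_credentials(lifetime_seconds: int = 3600) -> Tuple[Dict[str, str], Optional[int]]:
    credentials = fetch_temporary_credentials()
    # Assumption: expiry is expressed as seconds since the Unix epoch.
    expiry = int(time.time()) + lifetime_seconds
    return credentials, expiry
```

The function takes no required arguments, so it could be passed as `credential_provider=get_credentials` in the same way as in the PR's snippet.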

## Scanning with PyArrow
