# AWS S3 Storage Backend

`delta-rs` offers native support for using AWS S3 as an object storage backend.

You don't need to install any extra dependencies to read/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.

## Note for boto3 users

Many Python engines use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to connect to AWS. This library automatically reads credentials from your local `.aws/config` or `.aws/credentials` file.

For example, if you're running locally with the proper credentials in your local `.aws/config` or `.aws/credentials` file, you can write a Parquet file to S3 with pandas like this:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df.to_parquet("s3://avriiil/parquet-test-pandas")
```

The `delta-rs` writer does not use `boto3` and therefore does not read credentials from your `.aws/config` or `.aws/credentials` file. If you're used to working with writers from Python engines like Polars, pandas, or Dask, this may mean a small change to your workflow.

## Passing AWS Credentials

You can pass your AWS credentials explicitly using:

- the `storage_options` kwarg
- environment variables
- EC2 instance metadata, if running on EC2 instances
- AWS profiles
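As a minimal sketch of the environment-variable option (the `<...>` values are placeholders, not real credentials):

```python
import os

# delta-rs picks these up automatically when no storage_options are given.
# Placeholder values; substitute your real credentials and region.
os.environ["AWS_ACCESS_KEY_ID"] = "<key_id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<access_key>"
os.environ["AWS_REGION"] = "<region_name>"
```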

## Example

Let's work through an example with Polars. The same logic applies to other Python engines like pandas, Daft, and Dask.

Follow the steps below to use Delta Lake on S3 with Polars:

1. Install Polars and deltalake:

   ```sh
   pip install polars deltalake
   ```

2. Create a dataframe with some toy data:

   ```python
   import polars as pl

   df = pl.DataFrame({'x': [1, 2, 3]})
   ```
3. Set your `storage_options` correctly:

   ```python
   storage_options = {
       "AWS_REGION": "<region_name>",
       "AWS_ACCESS_KEY_ID": "<key_id>",
       "AWS_SECRET_ACCESS_KEY": "<access_key>",
       "AWS_S3_LOCKING_PROVIDER": "dynamodb",
       "DELTA_DYNAMO_TABLE_NAME": "delta_log",
   }
   ```

4. Write data to the Delta table using the `storage_options` kwarg:

   ```python
   df.write_delta(
       "s3://bucket/delta_table",
       storage_options=storage_options,
   )
   ```

## Delta Lake on AWS S3: Safe Concurrent Writes

You need a locking provider to ensure safe concurrent writes when writing Delta tables to AWS S3. This is because AWS S3 does not guarantee mutual exclusion.

A locking provider guarantees that only one writer can create a given file. This prevents corrupted or conflicting data.

`delta-rs` uses DynamoDB to guarantee safe concurrent writes.

Run the command below in your terminal to create a DynamoDB table that will act as your locking provider:

```sh
aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```

If for some reason you don't want to use DynamoDB as your locking mechanism, you can set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` to enable unsafe S3 writes. Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.
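The unsafe-rename setting is passed like any other storage option (a sketch with placeholder credentials; this is only safe when you can guarantee a single writer):

```python
# Storage options for a single-writer setup without a DynamoDB lock table.
# WARNING: concurrent writers can corrupt the table with this setting.
storage_options = {
    "AWS_ACCESS_KEY_ID": "<key_id>",
    "AWS_SECRET_ACCESS_KEY": "<access_key>",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
```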