Commit 7f7e3cd: create gcs docs

avriiil authored and rtyler committed Sep 24, 2024
1 parent 7783f66

Showing 1 changed file with 87 additions and 0 deletions: docs/integrations/object-storage/gcs.md

# GCS Storage Backend

`delta-rs` offers native support for using Google Cloud Storage (GCS) as an object storage backend.

You don’t need to install any extra dependencies to read/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.

## Note for boto3 users

Many Python engines use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to connect to AWS. This library supports reading credentials automatically from your local `.aws/config` or `.aws/credentials` file.

For example, if you’re running locally with the proper credentials in your local `.aws/config` or `.aws/credentials` file, then you can write a Parquet file to S3 like this with pandas:

```python
import pandas as pd

# credentials are read automatically from your local AWS config
df = pd.DataFrame({'x': [1, 2, 3]})
df.to_parquet("s3://avriiil/parquet-test-pandas")
```

The `delta-rs` writer does not use `boto3` and therefore does not read credentials from your `.aws/config` or `.aws/credentials` file. If you’re used to working with writers from Python engines like Polars, pandas, or Dask, this may mean a small change to your workflow.
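For comparison, here is a minimal sketch of the same pandas write going through `deltalake` instead, with placeholder credentials and a hypothetical table path:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({'x': [1, 2, 3]})

# deltalake ignores .aws/config and .aws/credentials, so pass credentials explicitly
storage_options = {
    "AWS_ACCESS_KEY_ID": "<key_id>",
    "AWS_SECRET_ACCESS_KEY": "<access_key>",
    "AWS_REGION": "<region_name>",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",  # see the locking section below
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
}

write_deltalake("s3://avriiil/delta-test-pandas", df, storage_options=storage_options)
```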

## Passing AWS Credentials

You can pass your AWS credentials explicitly by using:

- the `storage_options` kwarg
- environment variables (sketched below)
- EC2 metadata if using EC2 instances
- AWS profiles
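For instance, a minimal sketch of the environment-variable approach with Polars (placeholder values; `delta-rs` reads the standard AWS variables at write time):

```python
import os
import polars as pl

# delta-rs picks up the standard AWS environment variables automatically
os.environ["AWS_ACCESS_KEY_ID"] = "<key_id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<access_key>"
os.environ["AWS_REGION"] = "<region_name>"

df = pl.DataFrame({"x": [1, 2, 3]})
df.write_delta("s3://bucket/delta_table")  # still needs a locking provider, see below
```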

## Example

Let's work through an example with Polars. The same logic applies to other Python engines like pandas, Daft, Dask, etc.

Follow the steps below to use Delta Lake on S3 with Polars:

1. Install Polars and deltalake, for example using pip:

`pip install polars deltalake`

2. Create a dataframe with some toy data.

```python
import polars as pl

df = pl.DataFrame({'x': [1, 2, 3]})
```

3. Set your `storage_options` correctly.

```python
storage_options = {
    "AWS_REGION": "<region_name>",
    "AWS_ACCESS_KEY_ID": "<key_id>",
    "AWS_SECRET_ACCESS_KEY": "<access_key>",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
}
```
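The last two options tell `delta-rs` to use DynamoDB as its locking provider so that concurrent writes are safe; the section below explains how to set this up.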

4. Write data to Delta table using the `storage_options` kwarg.

```python
df.write_delta(
    "s3://bucket/delta_table",
    storage_options=storage_options,
)
```
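To verify the write, you can read the table back with the same `storage_options` (a sketch, assuming the placeholders above are filled in):

```python
df_loaded = pl.read_delta(
    "s3://bucket/delta_table",
    storage_options=storage_options,
)
print(df_loaded)
```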

## Delta Lake on AWS S3: Safe Concurrent Writes

You need a locking provider to ensure safe concurrent writes when writing Delta tables to AWS S3, because S3 does not guarantee mutual exclusion for concurrent writers.

A locking provider guarantees that only one writer can create a given file, which prevents corrupted or conflicting data.

`delta-rs` uses DynamoDB to guarantee safe concurrent writes.

Run the command below in your terminal to create a DynamoDB table that will act as your locking provider:

```bash
aws dynamodb create-table \
--table-name delta_log \
--attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
--key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```
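Note that the `--table-name` used here (`delta_log`) matches the `DELTA_DYNAMO_TABLE_NAME` value passed via `storage_options` in the example above.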
If for some reason you don't want to use DynamoDB as your locking mechanism, you can instead set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` to enable unsafe S3 writes, as sketched below.

Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.
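A sketch of this single-writer setup (placeholder credentials; not safe with concurrent writers):

```python
storage_options = {
    "AWS_ACCESS_KEY_ID": "<key_id>",
    "AWS_SECRET_ACCESS_KEY": "<access_key>",
    "AWS_REGION": "<region_name>",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",  # skips the locking provider entirely
}

df.write_delta("s3://bucket/delta_table", storage_options=storage_options)
```
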
## Delta Lake on GCS: Required permissions
