This repository contains modified example code for the "Data Preparation for Spark Analytics" tutorial.
The changes were made to ensure the code can run in the cloud using the Pathway BYOL or Dockerhub container. This version of the code is designed to extract data from GitHub, transform it by removing sensitive information, and load the cleaned data into a Delta Lake hosted on S3.
To run this example locally, follow these steps:
- Obtain your GitHub Personal Access Token (PAT) from the Personal access tokens page.
- Insert the token into the
personal_access_token
field in the./github-config.yaml
file. A comment in the file provides guidance. Alternatively, you can store the token in theGITHUB_PERSONAL_ACCESS_TOKEN
environment variable. - Get your Pathway License Key here (free tier of "Scale" plan) and store it in the
PATHWAY_LICENSE_KEY
environment variable. - Run the Python code in the
main.py
file. Ensure that the required environment variables are set. If you prefer to set them only for a single run, you can use the following command:PATHWAY_LICENSE_KEY=YOUR_LICENSE_KEY GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN python main.py
.
If you have pathway
installed, there's an easier way to run the code. You can use the --repository-url
parameter in the command-line tool to launch it directly from the repository. However, you still need to provide your GitHub credentials and Pathway's license key. The launch command will look like this:
PATHWAY_LICENSE_KEY=YOUR_LICENSE_KEY \
GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
pathway spawn --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py
Since this example is intended to run in the cloud, local output isn't very useful. The data will remain on the cloud machine and be deleted once the container terminates. To retain the output, it's better to save it to S3 bucket. You can enable S3 output by specifying the following environment variables:
AWS_S3_OUTPUT_PATH
: The full output path in S3AWS_S3_ACCESS_KEY
: Your S3 access keyAWS_S3_SECRET_ACCESS_KEY
: Your S3 secret access keyAWS_BUCKET_NAME
: The name of your S3 bucketAWS_REGION
: The region of your S3 bucket
To run Docker locally, use the following command:
docker run \
-e PATHWAY_LICENSE_KEY=YOUR_LICENSE_KEY \
-e GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
-e AWS_S3_OUTPUT_PATH=YOUR_OUTPUT_PATH_IN_S3_BUCKET \
-e AWS_S3_ACCESS_KEY=YOUR_S3_ACCESS_KEY \
-e AWS_S3_SECRET_ACCESS_KEY=YOUR_S3_SECRET_ACCESS_KEY \
-e AWS_BUCKET_NAME=YOUR_BUCKET_NAME \
-e AWS_REGION=YOUR_AWS_REGION \
-e PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py' \
-t pathwaycom/pathway:latest
For cloud deployments, the command will vary depending on the cloud provider you use.