This workshop will show you how to get data from Kaggle and process it for use in your projects.
The following advanced Flyte features will be covered:
- Raw ContainerTasks
- AWS Secrets Manager integration
- ImageSpec
- Integration Testing
- CI/CD
Prerequisites:
- Docker
- Flyte
- Access to ghcr.io or another hosted image registry
- An AWS-based Flyte cluster (this workshop will support GCP and Azure in the future)
Setup:
- Clone this repository
- Create a Kaggle Account
- Create a Kaggle API Token
- Update images.py with your image registry information; you may rename both packages
- Ensure that both the deduplication package and the get_dataset package are publicly available so that Flyte can pull them
- Create an AWS secret containing your Kaggle API credentials by following this guide
- Run
docker build --platform linux/amd64 -f Dockerfile -t your_image_registry.com/dedupe:latest .
- Run
docker push your_image_registry.com/dedupe:latest
- Install the local dependencies
pip install -r requirements.txt
- Update images.py with your image registry information
- Run
pyflyte register kaggle_data_processing --project <your-project-name> --domain <your-domain>
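The build, push, install, and register commands above can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and the image, project, and domain values are the same placeholders used in the steps. It defaults to a dry run that only prints each command, and it does not cover the manual step of editing images.py.

```python
import shlex
import subprocess


def build_commands(image, project, domain):
    """Return the setup commands from the steps above as argv lists."""
    return [
        ["docker", "build", "--platform", "linux/amd64",
         "-f", "Dockerfile", "-t", image, "."],
        ["docker", "push", image],
        ["pip", "install", "-r", "requirements.txt"],
        ["pyflyte", "register", "kaggle_data_processing",
         "--project", project, "--domain", domain],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder values: substitute your own registry, project, and domain.
    commands = build_commands(
        image="your_image_registry.com/dedupe:latest",
        project="your-project-name",
        domain="your-domain",
    )
    run_all(commands, dry_run=True)
```

Run from the repository root so that Dockerfile and requirements.txt resolve; pass `dry_run=False` only after the printed commands look right.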
Testing:
- Perform all steps in Setup
- Update test_workflows.py with the project and domain of your registered workflow
- Run
pytest
and wait for the tests to complete