Getting Data from Kaggle Workshop

This workshop will show you how to get data from Kaggle and process it for use in your projects.

Features

The following advanced Flyte features will be covered:

  • Raw ContainerTasks
  • AWS Secrets Manager integration
  • ImageSpec
  • Integration Testing
  • CI/CD

Prerequisites

  • Docker
  • Flyte
  • Access to ghcr.io or another hosted image registry
  • An AWS-based Flyte cluster (support for GCP and Azure is planned for a future version of this workshop)
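
Before starting, it can help to confirm that the command-line prerequisites are installed. The following is a minimal sketch, not part of the workshop code: it assumes Docker is exposed as the docker command and Flyte's client tooling as pyflyte (the CLI used later in Setup), and the helper name missing_tools is hypothetical. It does not check registry access or the cluster itself.

```python
import shutil

# Assumed CLI names: "docker" for Docker, "pyflyte" for the Flyte CLI
# that the Setup section invokes. Adjust if your tools are named differently.
REQUIRED_TOOLS = ("docker", "pyflyte")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All command-line prerequisites found.")
```

The which parameter is injectable only so the check can be exercised without touching the real PATH.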

Setup

  1. Clone this repository
  2. Create a Kaggle account
  3. Create a Kaggle API token
  4. Update images.py with your image registry information; you may redefine both package names
  5. Ensure that both the deduplication package and the get_dataset package are publicly available so that Flyte can access them
  6. Create an AWS secret containing your Kaggle API credentials by following this guide
  7. Run docker build --platform linux/amd64 -f Dockerfile -t your_image_registry.com/dedupe:latest .
  8. Run docker push your_image_registry.com/dedupe:latest
  9. Install the local dependencies with pip install -r requirements.txt
  10. Update images.py with your image registry information (if not already done in step 4)
  11. Run pyflyte register kaggle_data_processing --project <your-project-name> --domain <your-domain>
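
The command-line steps above (7, 8, 9 and 11) can be scripted so that the registry, project and domain are supplied once. The sketch below is illustrative only: the function names setup_commands and run_setup are hypothetical, and the image name dedupe and tag latest are taken from the commands in the list. It builds the same commands as argument lists and only prints them unless run_setup is called.

```python
import shlex
import subprocess


def setup_commands(registry, project, domain, image="dedupe", tag="latest"):
    """Return the Setup commands, in order, as argv lists."""
    image_ref = f"{registry}/{image}:{tag}"
    return [
        # Step 7: build the image for linux/amd64.
        ["docker", "build", "--platform", "linux/amd64",
         "-f", "Dockerfile", "-t", image_ref, "."],
        # Step 8: push the image to the registry.
        ["docker", "push", image_ref],
        # Step 9: install the local dependencies.
        ["pip", "install", "-r", "requirements.txt"],
        # Step 11: register the workflows with Flyte.
        ["pyflyte", "register", "kaggle_data_processing",
         "--project", project, "--domain", domain],
    ]


def run_setup(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        print("+ " + shlex.join(argv))
        runner(argv, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # The project and domain values here are placeholders.
    for argv in setup_commands("your_image_registry.com", "my-project", "development"):
        print(shlex.join(argv))
```

Passing the result of setup_commands to run_setup executes the steps from the repository root; the manual steps (Kaggle account, API token, AWS secret, editing images.py) still have to be completed by hand.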

Pytest

  1. Perform all steps in Setup
  2. Update test_workflows.py with the project and domain of the registered workflow
  3. Run pytest and wait for the tests to complete
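
As an alternative to editing test_workflows.py by hand in step 2, the project and domain could be read from the environment. This is a sketch of that idea only, not how the repository's tests are written: the environment variable names FLYTE_TEST_PROJECT and FLYTE_TEST_DOMAIN, the WorkflowTarget class and the load_target helper are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowTarget:
    """Project and domain under which the workflow was registered."""
    project: str
    domain: str


def load_target(env=None):
    """Read the target from the environment, failing loudly if unset."""
    env = os.environ if env is None else env
    # Hypothetical variable names; choose whatever suits your CI setup.
    values = {
        "FLYTE_TEST_PROJECT": env.get("FLYTE_TEST_PROJECT", "").strip(),
        "FLYTE_TEST_DOMAIN": env.get("FLYTE_TEST_DOMAIN", "").strip(),
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("Set these environment variables: " + ", ".join(missing))
    return WorkflowTarget(values["FLYTE_TEST_PROJECT"], values["FLYTE_TEST_DOMAIN"])
```

A test module could call load_target() once at import time and use the returned project and domain wherever the registered workflow is looked up, which keeps cluster-specific values out of version control.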