🤖 Data Engineering Project

Learning Data Engineering.

  • Overview
  • Technologies and Tools Used
  • Project Structure
  • Getting Started
  • What I Learned
  • Author

🚧 Data Engineering Project 🚀 Finished 🚧


Overview

This project demonstrates how Kubernetes and Apache Airflow can be used to manage DAGs (Directed Acyclic Graphs) for data pipelines in an automated, scalable environment. The primary goal is to provide robust orchestration for data engineering tasks, with straightforward scheduling, monitoring, and error handling.

Technologies and Tools Used

  • Kubernetes: A system for automating deployment, scaling, and management of containerized applications.
  • Apache Airflow: Used to manage and orchestrate workflows through DAGs, making it easier to monitor and schedule pipelines.
  • Python: The primary language used for scripting the data pipelines and interactions with Airflow.
  • Docker: Ensures consistent deployment environments and allows seamless development across machines.
  • Helm: Used for managing Kubernetes packages, simplifying the deployment of Airflow.

Project Structure

├── dags/                          # DAGs picked up by the Airflow scheduler
│   ├── hello.py                   # Simple example DAG (two BashOperator tasks)
│   └── fetch_and_preview.py       # Fetches sales data and previews it with Pandas
├── k8s/                           # Kubernetes Dashboard access configuration
│   ├── dashboard-adminuser.yaml   # ServiceAccount for the dashboard admin user
│   ├── dashboard-clusterrole.yaml # ClusterRoleBinding to the cluster-admin role
│   └── dashboard-secret.yaml      # Token Secret for the admin-user ServiceAccount
├── README.md                      # Project documentation
└── .gitignore                     # Files ignored by Git

Scripts Overview

  • dashboard-adminuser.yaml: This file creates a ServiceAccount called admin-user in the kubernetes-dashboard namespace. Service accounts are used to provide an identity for processes that run inside pods, and here it's specifically for admin-level access to the Kubernetes dashboard.
  • dashboard-clusterrole.yaml: This file creates a ClusterRoleBinding, which binds the admin-user ServiceAccount to the cluster-admin role. The cluster-admin role grants the highest level of access to the Kubernetes cluster, allowing full control over all resources.
  • dashboard-secret.yaml: This file generates a Secret containing the token for the admin-user ServiceAccount. This token is used to authenticate and access the Kubernetes Dashboard with admin privileges.
  • fetch_and_preview.py: Defines an Airflow DAG that fetches sales data from a URL, processes it with Pandas, and previews the results (a sketch of such a DAG appears after this list).
  • hello.py: Defines a simple Airflow DAG that schedules two tasks. The tasks use BashOperator to execute bash commands, demonstrating a basic workflow where one task prints "Hello World" and the other prints "Hello Data Mastery Lab" (see the sketch after this list).
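
Below is a minimal sketch of what fetch_and_preview.py might look like. The data URL, dag_id, and schedule are illustrative assumptions, not the repository's actual values; only the fetch-process-preview flow comes from the description above.

    from datetime import datetime

    import pandas as pd
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical URL for illustration; the real source is defined in the DAG file.
    DATA_URL = "https://example.com/sales.csv"

    def fetch_and_preview():
        df = pd.read_csv(DATA_URL)  # Pandas can read a CSV directly from a URL
        print(df.head())            # the preview shows up in the Airflow task log

    with DAG(
        dag_id="fetch_and_preview",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # run on demand from the UI
        catchup=False,
    ) as dag:
        PythonOperator(task_id="fetch_and_preview", python_callable=fetch_and_preview)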
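
And a minimal sketch of hello.py, assuming the standard Airflow 2 BashOperator. Only the two echoed strings come from the description above; the dag_id, schedule, and task ids are illustrative.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="hello",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        hello_world = BashOperator(
            task_id="hello_world",
            bash_command='echo "Hello World"',
        )
        hello_lab = BashOperator(
            task_id="hello_lab",
            bash_command='echo "Hello Data Mastery Lab"',
        )
        hello_world >> hello_lab  # run the two tasks in sequence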

Getting Started

To get started with this project, follow these steps:

  1. Clone the Repository:

    git clone <repository_url>
    cd <repository_directory>
  2. Installing the Kubernetes Dashboard:

    # Deploys the Kubernetes Dashboard to the cluster (this also creates the
    # kubernetes-dashboard namespace that the manifests in the next step target).
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
    
    # Starts a proxy server that allows you to access the Kubernetes Dashboard locally at:
    # http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
    kubectl proxy
  3. Configuring the Dashboard Admin User:

    # Create the admin-user ServiceAccount.
    kubectl apply -f k8s/dashboard-adminuser.yaml
    # Bind the admin-user ServiceAccount to the cluster-admin role.
    kubectl apply -f k8s/dashboard-clusterrole.yaml
    # Generate the token Secret.
    kubectl apply -f k8s/dashboard-secret.yaml
    
    # Retrieve the authentication token required to log in to the Dashboard
    # as admin-user (a Python equivalent is sketched after these steps).
    kubectl get secret admin-user -n kubernetes-dashboard -o jsonpath="{.data.token}" | base64 -d
  4. Configuring Airflow:

    # Makes the Airflow charts available for installation.
    helm repo add apache-airflow https://airflow.apache.org
    
    # Ensures you have the latest version of Airflow charts.
    helm repo update
    
    # Deploys Apache Airflow into your Kubernetes cluster.
    helm install airflow apache-airflow/airflow --namespace airflow --create-namespace --debug
    
    # Enables you to access the Airflow web UI locally through port 8080
    # (a quick Python health check is sketched after these steps).
    kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow
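
For completeness, the Dashboard token lookup from step 3 can also be done with the official kubernetes Python client. A minimal sketch, assuming the client is installed (pip install kubernetes) and your kubeconfig points at the cluster:

    import base64

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (the same context kubectl uses).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Read the token Secret created by k8s/dashboard-secret.yaml.
    secret = v1.read_namespaced_secret("admin-user", "kubernetes-dashboard")

    # Secret values are base64-encoded; decode to recover the login token.
    print(base64.b64decode(secret.data["token"]).decode())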
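
With the port-forward from step 4 running, you can sanity-check the deployment against the Airflow webserver's /health endpoint. A minimal sketch, assuming the requests library and the default 8080 mapping above:

    import requests

    # The Airflow webserver's /health endpoint reports metadatabase
    # and scheduler status as JSON.
    health = requests.get("http://localhost:8080/health", timeout=10).json()

    for component, info in health.items():
        print(f"{component}: {info.get('status')}")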

What I Learned

  • Kubernetes: Mastered deployment and orchestration of containerized applications using Kubernetes.
  • Airflow: Developed an understanding of workflow orchestration and scheduling using Apache Airflow.
  • Helm: Gained experience in managing Kubernetes packages and simplifying complex deployments.
  • Docker: Improved my ability to create consistent environments for development, testing, and production.
  • Python: Enhanced my skills in Python for data pipeline automation and management.

Author


Source: https://www.youtube.com/@CodeWithYu/videos
