Progress Stage1

🌦️ METAR Data Engineering and Machine Learning Project 🛫

Technologies · About the project · Conceptual architecture · Phase 1 · Phase 2 · Phase 3 - Final Stage · Data source · Looker report · Setup


Technologies

Python Docker Terraform Google Cloud Pandas Shell Script Jupyter Notebook


About the project

An educational project to build an end-to-end pipeline for near real-time and batch processing of data, which is then used for visualisation 👀 and a machine learning model 🧠.

The project is designed to produce an analytical summary of how METAR weather reports for airports in European countries have varied over the years.

Read more about METAR here ➡️ METAR

In addition, the aim is to build a web application that uses the Streamlit library and machine learning algorithms to predict the trend of change in upcoming METAR reports.


Conceptual architecture

view1


The project is divided into 3 phases according to the attached diagrams:

👉 Phase 1

• Retrieval of archive data from the source.
• Initial transformation.
• Transfer of data to the Data Lake - Google Cloud Storage.
• Transfer of data to the Data Warehouse.
• Transformations using PySpark on a Dataproc cluster.
• Visualisation of aggregated data on an interactive dashboard in Looker.

view2


👉 Phase 2

• Preparing the environment for near real-time data retrieval.
• Transformations of archived and live data using PySpark, and preparation of data for the machine learning model.
• Training and tuning of the model.

view3


👉 Phase 3 - Final stage 🥳

• Collection of analytical reports for historical data.
• Preparation of a web dashboard able to display the prediction of the next METAR report for a given airport and the likely trend of change.

view4


Data source

💿 IOWA STATE UNIVERSITY ASOS-AWOS-METAR Data


📊 Looker report

The report generated in Looker provides averages of METAR data, broken down by temperature, winds, directions, and weather phenomena, with accompanying charts. The data was scraped via URL and stored in raw form in Cloud Storage. PySpark and Dataproc were then used to prepare SQL tables with aggregation functions, which were saved in BigQuery. The Looker report directly utilizes these tables from BigQuery.
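
For orientation, below is a minimal sketch of this GCS -> BigQuery aggregation step. It assumes the IEM export column names (e.g. station, tmpf for temperature, sknt for wind speed) and reuses the bucket and table names that appear later in the Setup section; the project's actual logic lives in pyspark_sql.py.

    # Minimal sketch of the aggregation job; column names follow the IEM ASOS
    # export format and may differ from those used in pyspark_sql.py.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("metar-aggregation").getOrCreate()

    # The Spark-BigQuery connector stages data through a temporary GCS bucket.
    spark.conf.set("temporaryGcsBucket", "dataproc-staging-bucket-metar-bucket-2")

    # Read the raw CSV files previously uploaded to the data lake bucket.
    raw = spark.read.csv(
        "gs://batch-metar-bucket-2/data/ES__ASOS/*/*",
        header=True,
        inferSchema=True,
    )

    # Example aggregation: average temperature and wind speed per station.
    agg = raw.groupBy("station").agg(
        F.avg("tmpf").alias("avg_temperature_f"),
        F.avg("sknt").alias("avg_wind_speed_kt"),
    )

    # Save the aggregated table to BigQuery (dataset.table).
    agg.write.format("bigquery").option("table", "reports.ES__ASOS").mode("overwrite").save()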

Additionally, it's possible to prepare a similar report for other networks. Below is an example for PL__ASOS.

Check: PL__ASOS

For more information, please refer to the "Setup" section.

view5


🛠️ Setup

  1. Make sure you have Spark, PySpark, Google Cloud Platform SDK, Prefect and Terraform installed and configured.

  2. Clone the repo

    $ git clone https://github.com/MarieeCzy/METAR-Data-Engineering-and-Machine-Learning-Project.git
  3. Create a new python virtual environment.

    $ python -m venv venv
  4. Activate the new virtual environment using source (Unix systems) or .\venv\Scripts\activate (Windows systems).

    $ source venv/bin/activate
  5. Install packages from requirements.txt using pip. Make sure the requirements.txt file is in your current working directory.

    $ pip install -r requirements.txt
  6. Create a new project on the GCP platform, set it as the default and authorize:

    $ gcloud config set project <your_project_name>
    $ gcloud auth login
  7. Configure variables for Terraform:

    7.1. In terraform.tfvars, replace the project name with the name of the project you created on Google Cloud Platform:

    project = <your_project_name>

    7.2. Go to the terraform directory:

    $ cd terraform/

    7.3. Initialize, plan and apply the cloud resource creation:

    $ terraform init
    $ terraform plan
    $ terraform apply
  8. Configure the data upload: go to ~/prefect_orchestration/deployments/flows/config.json

    8.1. Fill in the variables:

    • network - select one network, e.g. FR__ASOS,

    • start_year, start_month, start_day - fill in the start date; make sure the values are not zero-padded (e.g. 7, not 07),

    • batch_bucket_name - enter the name of the Google Cloud Storage bucket you created

  9. Set up Prefect, the task orchestration tool:

    9.1. Generate a new KEY for the storage service account:

    On the Google Cloud Platform console go to IAM & Admin > Service Accounts, click on "storage-service-acc", go to KEYS and click ADD KEY > Create new key, choosing the JSON format.

    Save it in a safe place; do not share it on GitHub or in any other public place.

    To avoid changing the code in gcp_credentials_blocks.py, create a .secrets directory (~/METAR-Data-Engineering-and-Machine-Learning-Project/.secrets) and put the downloaded key in it under the name gcp_credentials_key.json.

    9.2. Run the Prefect server:

    $ prefect orion start

    Go to: http://127.0.0.1:4200

    9.3. In ~/prefect_orchestration/prefect_blocks, run the commands below in the console to create the Credentials and GCS Bucket blocks:

    $ python gcp_credentials_blocks.py
    $ python gcs_buckets_blocks.py 
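
    These scripts register Prefect blocks that the flows load later. A rough, hypothetical sketch of what such a script might contain with prefect-gcp is shown below; the block names and bucket placeholder are assumptions, and the repository's own scripts are authoritative:

    # Hypothetical sketch of a prefect-gcp block registration script.
    from prefect_gcp import GcpCredentials
    from prefect_gcp.cloud_storage import GcsBucket

    # Credentials block pointing at the service account key saved in .secrets/.
    credentials = GcpCredentials(service_account_file=".secrets/gcp_credentials_key.json")
    credentials.save("gcp-credentials", overwrite=True)

    # Bucket block for the batch data bucket named in config.json.
    bucket = GcsBucket(bucket="<your_batch_bucket_name>", gcp_credentials=credentials)
    bucket.save("batch-bucket", overwrite=True)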

    9.4. Configure the Prefect Deployment:

    $ python prefect_orchestration/deployments/deployments_config.py
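
    Under the hood, a configuration script like this typically builds deployments from the flow functions and registers them with the server. A minimal, hypothetical Prefect 2.x sketch (the flow import path and deployment name are assumptions, not the repository's actual code):

    # Hypothetical sketch of building and registering a Prefect 2.x deployment.
    from prefect.deployments import Deployment

    from flows.batch_flow import batch_flow  # assumed flow module and function name

    deployment = Deployment.build_from_flow(
        flow=batch_flow,
        name="S1-batch-upload",       # assumed deployment name
        work_queue_name="default",    # matches the queue the agent listens on
    )
    deployment.apply()  # register the deployment with the Prefect server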

    9.5. Run a Prefect agent to pick up deployment runs from the "default" queue:

    $ prefect agent start -q "default"
  10. Start deployment stage 1 - S1: Downloading data and uploading to the Google Cloud Storage bucket

    Go to ~/prefect_orchestration/deployments and run in the command line:

    $ python deployments_run.py --stage="S1"
    

    ☝️ You can observe the running deployment flow in the Prefect UI:

    view6

    After the deployment is complete, you will find the data in the GCS bucket.

  11. Configure and run stage 2 - S2: data transformation using PySpark and loading into BigQuery using Dataproc

    11.1. Go to ~/prefect_orchestration/deployments and, in gcloud_submit_job.sh, check that the paths and names are correct:

    As long as you haven't changed any names or settings other than those listed in this guide, everything should be fine.

    $ gcloud dataproc jobs submit pyspark \
    --cluster=metar-cluster \
    --region=europe-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --files=gs://code-metar-bucket-2/code/sql_queries_config.yaml \
    gs://code-metar-bucket-2/code/pyspark_sql.py \
    -- \
        --input=gs://batch-metar-bucket-2/data/ES__ASOS/*/* \
        --bq_output=reports.ES__ASOS \
        --temp_bucket=dataproc-staging-bucket-metar-bucket-2

    11.2. Upload pyspark_sql.py and the config file sql_queries_config.yaml to the code bucket.

    In ~/prefect_orchestration/deployments/flows:

    $ gsutil cp pyspark_sql.py gs://code-metar-bucket-2/code/pyspark_sql.py
    $ gsutil cp sql_queries_config.yaml gs://code-metar-bucket-2/code/sql_queries_config.yaml

    11.3. Run deployment stage S2 (GCS -> BigQuery on the Dataproc cluster):

    $ python deployments_run.py --stage="S2"

    If the Job was successful, you can go to BigQuery, where the generated data is located. Now you can copy my Looker report and replace the data sources, or prepare your own. 😎
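
    As a quick sanity check, you can also query the aggregated table from Python with the BigQuery client library (a sketch, assuming your default project is set and the reports.ES__ASOS table produced by the job above):

    # Quick check that the Dataproc job wrote data to BigQuery.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project configured with gcloud
    query = "SELECT * FROM `reports.ES__ASOS` LIMIT 10"

    for row in client.query(query).result():
        print(dict(row))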
