An end-to-end data pipeline for building Data Lake and supporting report using Apache Spark.
-
Updated
Jan 31, 2023 - Python
An end-to-end data pipeline for building Data Lake and supporting report using Apache Spark.
Script to run and find similarities between movies from a movie lens data set using Python & Spark Clustering.
👷🌇 Set up and build a big data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift) Terraform to setup the infrastructure and Integration Airflow to automate workflows🥊
This is my contribution in the project Diastema
In this study, we propose to use a distributed storage and computation system in order to track money transfers instantly. In particular, we keep our transaction history in a distributed file system as a graph data structure. We try to detect illegal activities by using Graph Neural Networks (GNN) in distributed manner.
To facilitate the initial setup of Apache Spark, this repository provides a beginner-friendly, step-by-step guide on setting up a master node and two worker nodes.
Add a description, image, and links to the spark-cluster topic page so that developers can more easily learn about it.
To associate your repository with the spark-cluster topic, visit your repo's landing page and select "manage topics."