An end-to-end data pipeline for building a Data Lake and supporting reports using Apache Spark.
Repository containing Docker images for creating a Spark cluster on Hadoop YARN.
KMeans, CURE, and Canopy clustering algorithms demonstrated using PySpark.
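For orientation, a minimal sketch of the K-Means flow in PySpark's `spark.ml` API; the input path and column names (`x`, `y`) are hypothetical, not taken from the repository above:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Hypothetical input: a CSV of numeric point coordinates
df = spark.read.csv("data/points.csv", header=True, inferSchema=True)

# spark.ml expects features packed into a single vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(df)

model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a "prediction" column
clustered.show(5)
```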
A Spark cluster based on docker-compose.
Script to find similarities between movies in the MovieLens dataset using Python and Spark clustering.
👷🌇 Set up and build a big data processing pipeline with Apache Spark and 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), using Terraform to set up the infrastructure and Airflow to automate workflows 🥊
Steps to deploy a Spark app to a Kubernetes cluster using spark-submit or a pod template.
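Besides spark-submit in cluster mode, Spark on Kubernetes can also be driven in client mode from a Python session. A minimal sketch, assuming a reachable API server; the server URL, namespace, and container image below are placeholders, not values from the repository:

```python
from pyspark.sql import SparkSession

# Client-mode sketch; URL, namespace, and image are placeholders.
spark = (SparkSession.builder
         .master("k8s://https://kubernetes.example.com:6443")
         .appName("spark-on-k8s-demo")
         .config("spark.kubernetes.namespace", "spark")
         .config("spark.kubernetes.container.image", "apache/spark:3.5.0")
         .config("spark.executor.instances", "2")
         .getOrCreate())

# Trivial job to confirm executor pods were scheduled
print(spark.range(10_000).count())

spark.stop()
```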
My contribution to the Diastema project.
Self-documented notes on learning distributed data storage, parallel processing, and Linux using Apache Hadoop, Apache Spark, and Raspbian OS. In this project, a 3-node cluster is set up on Raspberry Pi 4 boards, HDFS is installed, and Spark processing jobs are run via YARN.
A walkthrough of launching a cluster manually in Spark standalone deploy mode, connecting an app to the cluster, launching the app, and finding the monitoring and logging UIs.
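As a rough sketch of the "connecting an app" step: a PySpark session pointed at a standalone master (the hostname `spark-master` is a placeholder; 7077 is the default standalone master port, the master web UI defaults to 8080 and the running app's UI to 4040):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")  # placeholder host, default port
         .appName("standalone-demo")
         .config("spark.executor.memory", "1g")
         .getOrCreate())

# Trivial job to confirm the executors are reachable
print(spark.range(1_000_000).count())

spark.stop()
```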
A docker-compose-based Spark cluster containing multiple Spark masters.
Steps to deploy a local Spark cluster with Docker. Bonus: a ready-to-use notebook for model prediction in PySpark using a spark.ml Pipeline() on a well-known dataset.
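A minimal sketch of the `Pipeline()` pattern that notebook refers to; the toy data, column names, and choice of classifier here are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Hypothetical training data: two numeric features and a binary label
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 0.9, 1), (0.1, 1.2, 0)],
    ["f1", "f2", "label"],
)

# Chain feature assembly, scaling, and a classifier into one Pipeline
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="raw"),
    StandardScaler(inputCol="raw", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()
```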
In this study, we propose a distributed storage and computation system to track money transfers in real time. In particular, transaction history is kept in a distributed file system as a graph data structure, and we detect illegal activity using Graph Neural Networks (GNNs) in a distributed manner.
A distributed application that identifies the top 50 taxi pickup locations in New York by analyzing over 1 billion records using Apache Spark, HDFS, and Scala.
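That repository is in Scala; a PySpark sketch of the equivalent top-N aggregation looks roughly like this (the HDFS path and column name are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-pickups").getOrCreate()

# Hypothetical trip records with a pickup location column
trips = spark.read.parquet("hdfs:///data/nyc_taxi/")  # placeholder path

(trips.groupBy("pickup_location_id")
      .agg(F.count("*").alias("trips"))
      .orderBy(F.desc("trips"))
      .limit(50)
      .show(50, truncate=False))
```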
Terraform module to create Azure HDInsight, a managed, full-spectrum, open-source analytics service. The module provisions Apache Hadoop, Apache Spark, Apache HBase, Interactive Query (Apache Hive LLAP), and Apache Kafka clusters.
Start clusters in VirtualBox VMs.
Spark standalone and local architectures, and reading Hadoop file formats, i.e. Avro, Parquet, and ORC.
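For reference, reading those three formats in PySpark; the file paths are placeholders, and Avro needs the separately shipped spark-avro package (the version in `spark.jars.packages` is an assumption, match it to your Spark build):

```python
from pyspark.sql import SparkSession

# spark-avro is not bundled with Spark; pulling it in via
# spark.jars.packages is one common approach
spark = (SparkSession.builder
         .appName("formats-demo")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-avro_2.12:3.5.0")
         .getOrCreate())

parquet_df = spark.read.parquet("data/events.parquet")  # placeholder paths
orc_df = spark.read.orc("data/events.orc")
avro_df = spark.read.format("avro").load("data/events.avro")

parquet_df.printSchema()
```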
To facilitate the initial setup of Apache Spark, this repository provides a beginner-friendly, step-by-step guide on setting up a master node and two worker nodes.
A spark-submit extension based on bde2020/spark-submit for Scala projects built with SBT.