Spark

Spark Documentation
Databricks Spark Knowledge Base
Spark Programming Guide
advanced dependency management
Custom API Examples For Apache Spark - The examples are basic and only for newbies in Scala and Spark
Welcome to Spark Python API Docs!
github.com/apache/spark
SparkTutorials.net - Apache Spark For the Common * Man!
sparkjava.com/tutorials
learn hadoop spark by examples
Running Spark Korean flintrock, pyspark, aws s3, spark sql, jupyter, hadoop, yarn, tuning
Spark 시작하기 (유용한 사이트 링크)
Learning Spark With Scala
Apache Spark Scala Tutorial For Korean
Apache Spark Tutorial 2018 | Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training
Big Data and Hadoop Tutorial For Beginners | Hadoop Spark Tutorial For Beginners
Apache Spark Tutorial
Apache Spark Tutorials
Apache Spark 101
(16) Learn Apache Spark ( Databricks ) - Step by Step Guide | LinkedIn
Spark Internals
Introduction to Spark Internals
Start Your Journey with Apache Spark — Part 1
Start Your Journey with Apache Spark — Part 2
Start your Journey with Apache Spark — Part 3
Getting started with Spark & batch processing frameworks | by Hoa Nguyen | Insight
Spark Internal
- Part 1. RDD의 내부 동작
- Part 2. Spark의 메모리 관리(1)
- Part 2. Spark의 메모리 관리(2)
- Spark Internal Part 3. Spark SQL’s Catalyst Optimizer
52. Apache Spark Internal architecture jobs stages and tasks || Spark Cluster Architecture Explained - YouTube
pubdata.tistory.com/category/Lecture_SPARK
Apache Spark - Executive Summary
Teach yourself Apache Spark – Guide for nerds!
Apache Spark - cyber.dbguide.net
Stanford CS347 Guest Lecture: Apache Spark
BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark
- mooc-setup
- Spark로 빅데이터 입문, 1-2주차 노트
- Spark로 빅데이터 입문, 3주차 노트
bigdatauniversity.com
- Spark Fundamentals I
- Spark Fundamentals II
Apache Spark Full Course | Spark Tutorial For Beginners | Complete Spark Tutorial | Simplilearn - YouTube
Top 5 Online Courses to Learn Apache Spark in 2022 - Best of Lot
Introduction to Spark
Introduction to Apache Spark with Scala
Python and Bigdata - An Introduction to Spark (PySpark)
Spark Programming
Intro to Apache Spark Training - Part 1
Cloudera
- Cloudera Engineering Blog · Spark Posts
- How-to: Tune Your Apache Spark Jobs (Part 1)
- How-to: Tune Your Apache Spark Jobs (Part 2)
- LSA-ing Wikipedia with Apache Spark
- Making Apache Spark Testing Easy with Spark Testing Base
- Getting Apache Spark Customers to Production
- Why Your Apache Spark Job is Failing
- How to use Apache Spark with CDP Operational Database Experience - Cloudera Blog
The Apache Spark @youtube
Apache spark 소개 및 실습
Spark 소개 1부
Spark 소개 2부
RE: ShootingStar TV 1회 - 아파치 스파크와 RDD
- 스터디용 아파치 스파크 환경구성 - 윈도우
- 스터디용 아파치 스파크 환경구성 - 인텔리J
databricks
- sparkhub.databricks.com
- Examples for Learning Spark
- Project Tungsten: Bringing Spark Closer to Bare Metal
- Simplifying Big Data Analytics with Apache Spark
- Databricks Announces General Availability of Its Cloud Platform
- A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
- DEVOPS ADVANCED CLASS
- 스파크의 사용 환경 내용 - data bricks
- databricks community edition Hands-On Training for Data Science and Machine Learning - YouTube
What is shuffle read & shuffle write in Apache Spark
Spark Shuffle Partition과 최적화 – tech.kakao.com
Scrap your MapReduce! (Or, Introduction to Apache Spark)
Learning Spark
Introduction to Data Science with Apache Spark
HPC is dying, and MPI is killing it
Spark은 왜 이렇게 유명해지고 있을까?
Analytics With Apache Spark Is Coming
Interactive Analytics using Apache Spark
bicdata
- 고급 분석을 '현실'로 만드는 스파크 -> 머신런닝 알고리즘이 포함 있지만, 고급분석가의 관점으로는 기초적인 알고리즘만 포함
- 모든 것을 더 편하게 만들어주는 스파크 -> M/R 형식의 프로그램은 많이 편해짐. MPI 방식은 지원하지 않음
- 하나 이상의 언어를 말하는 스파크 -> scala, java, python을 지원하지만, scala에 최적화되어 있고 나머지 언어는 좀 불편
- 더 빨리 결과를 도출하는 스파크 -> 성능 테스트를 해보면, SparkStream은 storm보다 느리고, SparkSQL은 Hive보다 느림. 일반적인 Spark 프로그램이 성능이 좋음
- 하둡 개발업체를 가리지 않는 스파크 -> 오픈소스는 대부분 업체를 가리지 않고, 용도와 장단점이 다름
- 실시간 고급 분석 -> 기존(하둡)보다는 빠른 고급분석(??)이기 하지만, 준실시간
VCNC가 Hadoop대신 Spark를 선택한 이유
(25) 라인플러스 게임보안개발실...스파크+메소스로 10분 당 15TB 처리
bcho.tistory.com/tag/Apache Spark
- Spark 노트
- Apache Spark이 왜 인기가 있을까?
- Apache Spark 설치 하기
- Apache Spark 소개 - 스파크 스택 구조
- Apache Spark 클러스터 구조
- Apache Spark - RDD (Resilient Distributed DataSet) 이해하기 - #1/2
- Apache Spark RDD 이해하기 #2 - 스파크에서 함수 넘기기 (Passing function to Spark)
- Apache Spark(스파크) - RDD Persistence (스토리지 옵션에 대해서)
- Apache Spark - Key/Value Paris (Pair RDD)
- Apache Spark-Python vs Scala 성능 비교
blog.madhukaraphatak.com
- Introduction to Spark Data Source API - Part 1
Spark Summit
- Using Cascading to Build Data-centric Applications on Spark
  - Expressing ETL workflows via Cascading
- spark-summit.org/2015
- spark-summit.org/east-2016/schedule
  - Spark Summit East 2016 첫 날 덤프
  - Spark Summit East 2016 둘째 날 덤프
- spark-summit.org/2016/schedule
  - A Deep Dive into Structured Streaming
  - How-to: Analyze Fantasy Sports using Apache Spark and SQL
- Spark Summit 2016 West Training
- Spark Summit Europe 2016 참관기
- OrderedRDD: A Distributed Time Series Analysis Framework for Spark (Larisa Sawyer)
- Just Enough Scala for Spark (Dean Wampler)
- TensorFrames: Deep Learning with TensorFlow on Apache Spark (Tim Hunter)
- SPARK SUMMIT EAST 2017
- SPARK SUMMIT 2017 DATA SCIENCE AND ENGINEERING AT SCALE
- 비트윈 데이터팀의 Spark Summit EU 2017 참가기
- 2018-spark-summit-ai-keynotes-2
- Netflix at Spark+AI Summit 2018
Spark(1.2.1 -> 1.3.1) 을 위한 Mesos(0.18 -> 0.22.rc) - Upgrade
RDDS ARE THE NEW BYTECODE OF APACHE SPARK
Spark RDD Operations-Transformation & Action with Example
Microbenchmarking Big Data Solutions on the JVM – Part 1
Spark, Mesos, Zeppelin, HDFS를 활용한 대용량 보안 데이터 분석
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
IBM, 오픈소스 커뮤니티에 머신러닝 기술 기증
Productionizing Spark and the Spark Job Server
is Hadoop dead and is it time to move to Spark
Spark + S3 + R3 을 이용한 데이터 분석 시스템 만들기 by VCNC
Parallel Programming with Spark (Part 1 & 2) - Matei Zaharia
3 Methods for Parallelization in Spark
Stream All the Things! Architectures for Data Sets that Never End
- 스트리밍 중심 응용 프로그램 및 데이터 플랫폼 구축
- 서비스를 함께 연결하는 단순성을 보여줌으로써 이벤트 소싱 아키텍처에 대해 동기를 부여
- 실시간 및 분석 사례에 대한 다양한 시스템(Akka, Spark, Flink 및 기타)의 절충에 대해 설명
Petabyte-Scale Text Processing with Spark
Combining Druid and Spark: Interactive and Flexible Analytics at Scale
Interactive Audience Analytics With Spark and HyperLogLog
Apache Spark Creator Matei Zaharia Interview
New Developments in Spark
Spark와 Hadoop, 완벽한 조합 (한국어)
Spark Architecture: Shuffle
Deep-dive into Spark Internals & Architecture
Naytev Wants To Bring A Buzzfeed-Style Social Tool To Every Publisher With Spark
Spinning up a Spark Cluster on Spot Instances: Step by Step
Spark Meetup at Uber
Bay Area Apache Spark Meetup @ Intel
- Easy, scalable, fault tolerant stream processing with structured streaming - spark meetup at intel in santa clara
Can Apache Spark process 100 terabytes of data in interactive mode?
넷플릭스 빅데이터 플랫폼 아파치 스팍 통합 경험기
Succinct Spark from AMPLab: Queries on Compressed RDDs
How-to: Build a Complex Event Processing App on Apache Spark and Drools
Tuning Spark
Tuning Java Garbage Collection for Spark Applications
Improving Spark application performance
Spark performance tuning eng
Spark performance tuning Part#2 병렬처리
Spark performance tuning from the trenches
Spark tuning for Enterprise System Administrators
SPARK 설정 Tuning 하기 : 네이버 블로그
“Fast food” and tips for RDD
스칼라ML - 스칼라를 이용한 기계학습 기초(+Spark)
Secondary Sorting in Spark
Distributed computing with spark
Comparing the Dataflow/Beam and Spark Programming Models
Apache Spark Architecture
Scala vs. Python for Apache Spark
Natural Language Processing With Apache Spark
맵알, ‘아파치 스파크’ 교육 과정 무료로 공개
Spark HDFS Integration
spark textfile load file instead of lines
Reading Text Files by Lines
Evening w/ Martin Odersky! (Scala in 2016) +Spark Approximates +Twitter Algebird
ScalaJVMBigData-SparkLessons.pdf
Introduction to Spark 2.0 : A Sneak Peek At Next Generation Spark
- Spark Release 2.0.0
- Spark SQL, DataFrames and Datasets Guide
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - When to use them and why
- Introducing Apache Spark 2.0
- Spark 2.0 Technical Preview: Easier, Faster, and Smarter
- Apache Spark 2.0 presented by Databricks co-founder Reynold Xin
- APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
  - RDD 보다 DataFrame, DataSet 이 속도는 두배 이상, 메모리 사용량은 1/4 미만
  - 도저히 DataFrame, DataSet 을 쓸수 없는 데이타(예를들어 기본 API 가 제공하지 않는 변환작업을 해야 하거나, 데이타가 뉴스 본문같은 구조화 할수없는 데이타이거나)가 아니면 RDD 말고 DataFrame, DataSet 사용
  - RDD; 자유도가 높음(Programming)
  - DataFrame; 자유도가 낮음(SQL-like) 대신 데이터 저장공간, 병렬화, 메모리 사용, 복합 쿼리 실행 플랜 등 아주 여러 부분에서 최적화 작업이 가능하고, 많이 최적화 작업이 되어있음
- Spark 2.0 – Datasets and case classes
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- Generating Flame Graphs for Apache Spark
- Apache Spark 2.0 Tuning Guide
- Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data
- Modern Spark DataFrame & Dataset | Apache Spark 2.0 Tutorial
- Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
- Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
- Spark 2.0 - by Matei Zaharia
- Spark 2.x Troubleshooting Guide
Introducing Apache Spark 2.1 Now available on Databricks
What's New in the Upcoming Apache Spark 2.3 Release?
Introducing Stream-Stream Joins in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
The easiest way to run Spark in production
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Takes On Dataflow in Benchmark Test
Stock inference engine using Spring XD, Apache Geode / GemFire and Spark ML Lib. http://pivotal-open-source-hub.github.io/StockInference-Spark
Learning Spark - 아키텍트를 꿈꾸는 사람들
- 2015_LearningSpark
Tutorial: Spark-GPU Cluster Dev in a Notebook A tutorial on ad-hoc, distributed GPU development on any Macbook Pro
GPU Acceleration on Apache Spark™
Spark에서 GPU를 사용해야하는 이유는 무엇입니까?
Cluster - spark
Apache Spark Key Terms, Explained
스파크 클라우데라 하둡 클러스터 원격 입출력 예제
이렇게 코딩 하면 안된다
spark를 이용한 hadoop cluster 원격 입출력
Best Practices for Using Apache Spark on AWS
Working effectively with Apache Spark on AWS - Singapore Apache Spark+AI Meetup
How to export millions of records from Mysql to AWS S3?
Build a Prediction Engine Using Spark, Kudu, and Impala
Deep Dive: Apache Spark Memory Management
Deep Dive: Apache Spark Memory Management
Apache Spark Memory Management: Deep Dive | LinkedIn
A Developer’s View into Spark's Memory Model - Wenchen Fan
option
- spark.executor.cores; node의 코어수
- spark.cores.max 전체 갯수
- e.g.
  - worker node가 2개이고 각 node당 8core cpu인데 spark.cores.max를 8로 주면 1개의 노드만 동작
  - 두개의 node에서 동작하게 하려면 spark.cores.max를 16으로
Apache Spark @Scale: A 60 TB+ production use case
How Do In-Memory Data Grids Differ from Spark?
Spark에서의 Data Skew 문제
Skew Mitigation For Facebook PetabyteScale Joins - YouTube
처음해보는 스파크(spark)로 24시간안에 부동산 과열 분석해보기
Intro to Apache Spark for Java and Scala Developers - Ted Malaska (Cloudera)
Achieving a 300% speedup in ETL with Apache Spark
- Spark의 CSV 파일 작업에 대한 스니펫 소개
- non-distributed version에 비해 Spark는 뛰어난 속도 향상 기능을 제공하며 Parquet과 같은 최적화된 형식으로 변환 할 수 있는 기능을 제공
Parsing CSV Files in Spark
Diving into Spark and Parquet Workloads, by Example
parquet 사용 예제
Apache Spark에서 컬럼 기반 저장 포맷 Parquet(파케이) 제대로 활용하기
입 개발 Spark에서 Parquet 파일 Custom Schema 로 읽어들이기 | Charsyam's Blog
Writing parquet on HDFS using Spark Streaming
Experimenting with Neo4j and Apache Zeppelin (Neo4j)-[:LOVES]-(Zeppelin)
Time-Series Missing Data Imputation In Apache Spark
Tempo: Distributed Time Series Analysis with Apache Spark™ and Delta Lake - YouTube
Data Science How-To: Using Apache Spark for Sports Analytics
- Using Spark To Analyze the NBA and the 3-point Shot
Hive on Spark: Getting Started
Working with UDFs in Apache Spark
- Python, Java, Scala에서 Apache Spark의 UDF, UDAF를 사용하는 간단한 예제
Apache Spark은 어떻게 가장 활발한 빅데이터 프로젝트가 되었나
Using Apache Spark for large-scale language model training
- Facebook에서 ngram 모델의 traing pipeline을 Apach Hive에서 Apache Spark으로 전환 시도 중
- 두 가지 솔루션에 대한 설명과 Spark DSL 과 Hive QL의 유연성 비교 및 성능 수치
Hive and Spark Integration Tutorial
Working with multiple partition formats within a Hive table with Spark
- Hive는 파티션별로 다른 데이터 형식을 지원, 데이터를 쓰기 최적화된 형식에서 읽기 최적화된 형식으로 변환할 때 사용 가능
- Spark에서 멀티 포맷 테이블을 쿼리할 때 실행 계획이 어떻게 동작하는지 내부 동작 방식에 대해 설명
On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies
Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
How to access Hive from Spark2 on HDP3?
WRITING TO A DATABASE FROM SPARK
Processing Solr data with Apache Spark SQL in IBM IOP 4.3
- Apache Spark을 Apach Solr로 연결하는 방법 소개
Blacklisting in Apache Spark
Tracking the Money — Scaling Financial Reporting at Airbnb
The Benefits of Migrating HPC Workloads To Apache Spark
- Spark 작업을 실행하기위한 Apache Zeppelin과 Livy 작업 서버 간의 통합에 대한 최근 개선 사항 설명
데이터분석 인프라 구축기 (1/4)
데이터분석 인프라 구축기 (2/4)
데이터분석 인프라 구축기 (3/4)
데이터분석 인프라 구축기 (4/4)
zipWithIndex, for-yield 예제
Cloudera session seoul - Spark bootcamp
Benchmarking Big Data SQL Platforms in the Cloud
- Vanilla Spark, Presto, Impala 보다 DataBricks 플랫폼이 더 빠르다는 주장
Building QDS: AIR Infrastructure
- Qubole이 Data Platforms 2017 conference 발표한 Air라는 플랫폼에 대한 내용입니다.
스파크 스터디 ParkS
- ParkS
Cost Based Optimizer in Apache Spark 2.2
- Apache Spark 2.2의 Cost Based Optimizer와 TPC-DS benchmark에서 CBO 사용 여부에 관계없이 쿼리 수행 시간을 비교한 결과와 통계 정보 수집 방법 등에 대해 설명
Apache Spark Core-Deep Dive-Proper Optimization - Daniel Tomes, Databricks
spark 프레임워크를 활용해 자바 기반 웹 애플리케이션 개발 맛보기
Bay Area Apache Spark Meetup at HPE/Aruba Networks Summary
- Aruba에서 PySpark 및 GraphFrames의 Databricks를 사용한 데이터 상관 관계에 관한 프레젠테이션
Apache Spark Professional Training with Hands On Lab
IBM Cloud 환경에서 DSX Spark를 사용한 데이터 분석 시작하기
Spark-overflow - A collection of Spark related information, solutions, debugging tips and tricks, etc. PR are always welcome! Share what you know about Apache Spark
Debugging a long-running Apache Spark application: A War Story
- 장기간 실행되는 Apache Spark 응용 프로그램의 성능 문제를 디버깅하는 방법 대해 설명
- JVM 내부 (예 : 사용자 정의 클래스 로더 및 GC), Spark internal (예 : driver가 broadcast data를 정리하는 방법) 및 이러한 버그를 찾아 내고 확인하는 메트릭 및 모니터링 전략
A step-by-step guide for debugging memory leaks in Spark Applications | by Shivansh Srivastava | disney-streaming | Nov, 2020 | Medium
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Using Apache Spark to Analyze Large Neuroimaging Datasets
Goal Based Data Production: The Spark of a Revolution - Sim Simeonov
Spark Job On Mesos - Log Handling programtic하게 log level별로 원하는 장소에 로그 남기기
How to log in Apache Spark log4j
Spark: Web Server Logs Analysis with Scala
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day
Locality Sensitive Hashing By Spark
Partition Index - Selective Queries On Really Big Tables Hive, Impala, Spark 등으로 데이터를 조회할 때 전체 테이블을 검색하지 않도록 클라이언트와 데몬 사이에서 인덱스 맵을 만들고 유지 관리를 하며 쿼리를 파싱해주는 balancer를 구현
Practical Apache Spark in 10 minutes
- Part 1 - Ubuntu installation
- Part 2 - RDD
- Part 3 - Data Frames and SQL
- Part 4 — MLlib
- Part 5 - Streaming
- Part 6 - GraphX
partition의 개수가 지나치게 적게 잡혀서 worker 역시 부족하게 할당되면서 성능 하락 problem e.g
- 다양한 경우에서 자주 발생
- spark sql optimizer가 업그레이드 되는 게 가장 확실한 해법이지만 그걸 기다릴 수 없기 때문에 repartition을 사용해 강제로 partition 수를 증가
```
val dataset: Dataset[XXX] = ...
dataset.repartition(dataset.rdd.getNumPartitions * 2).map(YYY)...
```
Apache Spark Scheduler
Deep Dive into the Apache Spark Scheduler - Xingbo Jiang
Apache Spark: Scala vs. Java v. Python vs. R vs. SQL
- Scala, Java, Python, R 및 SQL에서 Apache Spark API의 차이점 설명
- 예상대로, JVM 언어를 사용하면 성능 향상
Exploratory Data Analysis in Spark with Jupyter
아파치 스팍 관련 문제점 이야기 + 자바로 게으른 초기화 (2018-07-06) 케빈TV Live
Working with Nested JSON Using Spark | Parsing Nested JSON File in Spark
Working with JSON in Apache Spark
practice - sc.textFile로 gzipped hdfs file을 읽을 경우 성능 저하 or job 실패
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Spark 2.4!
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
- Uber가 Hadoop과 Spark을 이용하여 빅데이터를 수집, 관리, 분석하는 방법 정리
Spark study notes: core concepts visualized
- Spark과 YARN이 상호 작용하는 방식과 작업이 각 단계에서 어떻게 작동하는지 설명하는 기초 문서
Just Enough Spark! Core Concepts Revisited !! | LinkedIn
Python vs. Scala
- a comparison of the basic commands (Part I)
- Pandas vs. Spark: how to handle dataframes (Part II)
Points to remember while processing streaming timeseries data in order using Kafka and Spark
A Journey Into Big Data with Apache Spark
- Part 1
- Part 2
Write to multiple outputs by key Spark - one Spark job
Things I Wish I’d Known About Spark When I Started (One Year Later Edition)
Brian Clapper—Spark for Scala Developers
Movie recommendation using Apache Spark
NPE from Spark App that extends scala.App
입 개발 spark-submit 시에 –properties-file 와 파라매터에서의 우선 순위
Which Language to choose when working with Apache Spark
Procesando Datos con Spark
Which Career Should I Choose — Hadoop Admin or Spark Developer?
Efficient geospatial analysis with Spark
Dealing with null in Spark
'.NET for Apache Spark' Debuts for C#/F# Big Data
Announcing Version 1.0 of .NET for Apache Spark | .NET Blog
Parallel Cross Validation in Spark
Vedant Jain: Smart Streams: A Real-time framework for scoring Big and Fast Data | PyData Miami 2019
Jakub Hava: Productionizing H2O Models with Apache Spark | PyData Miami 2019
Spark로 알아보는 빅데이터 처리
Multi Source Data Analysis using Spark and Tellius : Meetup Video
- spark2.0-examples/src/main/scala/com/madhukaraphatak/examples/sparktwo/multisource
- Multi Source Data Analysis using Spark and Tellius
Spark 성능 최적화 및 튜닝 방법 - Part 1
Master Spark fundamentals & optimizations
Apache Spark Optimization Techniques | by Nabarun Chakraborti | Jun, 2020 | Medium
ClickHouse Clustering for Spark Developer
Data Modeling in Apache Spark - Part 1 : Date Dimension
Data Modeling in Apache Spark - Part 2 : Working With Multiple Dates
Concurrency in Spark
How we reduced our Apache Spark cluster cost using best practices
Big Data file formats explained
Why Spark on Ceph? (Part 1 of 3)
Why Spark on Ceph? (Part 2 of 3)
Why Spark on Ceph? (Part 3 of 3)
Spark Delight — We’re building a better Apache Spark UI | by Jean Yves | Jun, 2020 | Towards Data Science
Overcoming Apache Spark’s biggest pain points | by Edson Hiroshi Aoki | Oct, 2020 | Towards Data Science
Speeding Time to Insight with a Modern ETL Approach - YouTube ETL -> ELT
Scale-Out Using Spark in Serverless Herd Mode! - YouTube
DBIOTransactionalCommit - Databricks
입 개발 EMR에서는 sc.addFile, Databricks에서는 그냥 dbfs 폴더를 이용하자. | Charsyam's Blog
Spark interview Q&As with coding examples in Scala - part 1 | Java-Success.com
How to Extract Deeper Value from Data in Legacy Applications with Analytics in a Cloud Data Lake - YouTube
Scala 3 and Spark?. After the release of Scala 3, one of… | by Filip Zybała | VirtusLab | Oct, 2021 | Medium
Using Scala 3 with Spark | 47 Degrees
Apache Spark #1 - 아키텍쳐 및 기본 개념
Practical Spark – Intro (1) – 1ambda
Practical Spark – Tutorial (2) – 1ambda
Practical Spark – Concept (3) – 1ambda
Practical Spark – Architecture (4) – 1ambda
Practical Spark – DataFrame (5) – 1ambda
Practical Spark – Persistence (6) – 1ambda
Practical Spark – Cache (7) – 1ambda
Practical Spark – SQL & Table (8) – 1ambda
Practical Spark – Join (9) – 1ambda
Practical Spark – Memory (10) – 1ambda
Practical Spark – Versions (11) – 1ambda
Practical Spark – 자주 묻는 질문들 (12) – 1ambda

Apache Livy

Apache Livy A REST Service for Apache Spark
Apache Livy에서 Spark job stdout log를 보는 법 - Nephtyw’S Programming Stash

API

Spark Programming Model : Resilient Distributed Dataset (RDD) - 2015
Apache Spark: Examples Of Transformations
The RDD API By Example
backtobazics.com/category/big-data/spark example of API
APACHE SPARK: RDD, DATAFRAME OR DATASET?
Learn Spark Scala with Clara: RDD
A modern guide to Spark RDDs
Anatomy of Apache Spark's RDD | LinkedIn
Apache Spark’s Hidden REST API
Spark Session vs Spark Context
Exploring Spark DataSource V2
- Part 1 : Limitations of Data Source V1 API
- Part 2 : Anatomy of V2 Read API
- Part 3 : In-Memory DataSource
- Part 4 : In-Memory DataSource with Partitioning
- Part 5 : Filter Push
- Part 6 : Anatomy of V2 Write API
- Part 7 : Meetup Talk
Migrating to Spark 2.4 Data Source API
Map Filter Reduce & Lambda in Python & Scala| Comparison of Lambda Syntaxes Step by Step| Beginners - YouTube

aggregate

scala> val rdd = sc.parallelize(List(1, 2, 3, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21

scala> rdd.aggregate((0, 0))((x, y) => (x._1 + y, x._2 - y), (x, y) => (x._1 + y._1, x._2 + y._2))
res10: (Int, Int) = (9,-9)

scala> rdd.map(t => (t, -t)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
res11: (Int, Int) = (9,-9)

aggregateByKey
- AggregateByKey implements Collect_list in Spark 1.4
Array Deep Dive into Apache Spark Array Functions | by Neeraj Bhadani | Expedia Group Technology | Medium
combineByKey
- Using combineByKey in Apache-Spark
- Spark PairRDDFunctions: CombineByKey
- Apache Spark combineByKey Explained
DataFrames
- Spark SQL, DataFrames and Datasets Guide
- spark2.0 dataframe의 filter,where,isin,select,contains,col,between,withColumn, 예제
- Spark: Connecting to a jdbc data-source using dataframes
- 입 개발 Spark 에서 Database 빨리 덤프하는 법(Parallelism) | Charsyam's Blog Spark JDBC
- where과 filter의 차이
- Using spark data frame for sql
- Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)
- Spark: Elegantly Aggregate DataFrame by One Key Column
- A practical introduction to Spark’s Column- part 1
- A practical introduction to Spark’s Column- part 2
- Different approaches to manually create Spark DataFrames
- Sending Spark DataFrame via mail
- How I achieved 3x speedup for joins over Spark dataframes
- Deep dive into Apache Spark Window Functions | by Neeraj Bhadani | Expedia Group Technology | Medium
- Making the Spark DataFrame composition type safe(r) | by Iaroslav Zeigerman | Feb, 2021 | Medium
- How to add row numbers to a Spark DataFrame? | Data Programmers
Datasets
- Introducing Spark Datasets
- Spark SQL, DataFrames and Datasets Guide
- RDDs, DataFrames and Datasets in Apache Spark - NE Scala 2016
- Spark2.0 New Features
  - (1) DataSet
  - (2) Structured Streaming – 1편
- Transforming Spark Datasets using Scala transformation functions
- (2) Solution to Spark Auto Schema inference (String) for JSON Array / JSON Object/Record/Row Problem | LinkedIn
DateTime Deep Dive into Apache Spark DateTime Functions
distinct
- ~~동일성, 동등성, Spark의 distinct~~
groupByKey
- Avoid GroupByKey
HashPartitioner
- Apache Spark - HashPartitioner : How does it work?
- Partition by Hash on Keys
join
- RDD join 예제
- join 예제
- Joins in Apache Spark — Part 1
- Joins in Apache Spark — Part 2
- Joins in Apache Spark — Part 3
persist
- RDD persist() or cache() 시 주의사항
SQL
- Spark SQL, DataFrames and Datasets Guide
  - Column
  - Dataset
  - Row
- spark-csv - CSV Data Source for Apache Spark 1.x
  - TextFileSuite.scala
- Spark SQL CSV Examples
- github.com/yhuai/spark/tree/eb77ee39b8616cb367541503baf7c07695ef1ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv
- Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration
- Spark 2.0 read csv number of partitions (PySpark)
- How to read csv file as DataFrame?
- How to change column types in Spark SQL's DataFrame?
- Working with Nested Data Using Higher Order Functions in SQL on Databricks
  - Hadoop과 Spark은 nested structs, array, map 등과 같은 복잡하고 다양한 데이터를 처리하는 훌륭한 도구이지만 SQL에서 사용하는 건 어려움
  - Databricks 3.0에 추가된 TRANSFORM 연산과 Spark SQL에 추가된 "Higher Order Functions"를 소개(SPARK-19480)
- Spark SQL under the hood – part I
- Five Spark SQL Utility Functions to Extract and Explore Complex Data Types - Tutorial on how to do ETL on data from Nest and IoT Devices
- Querying our Data Lake in S3 using Zeppelin and Spark SQL
- Learning Spark SQL with Zeppelin
- SQL Pivot: Converting Rows to Columns 2.4
- SQL at Scale with Apache Spark SQL and DataFrames — Concepts, Architecture and Examples
- A Deep Dive into Query Execution Engine of Spark SQL - Maryann Xue
- A Deep Dive into Spark SQL's Catalyst Optimizer - Yin Huai
trigger Spark Trigger Options

Book

더북(TheBook): 스파크를 다루는 기술 4~6장만
Mastering Apache Spark 2.0
Advanced Analytics with Spark Source Code
Best Apache Spark and Scala Books for Mastering Spark Scala
Spark for Data Analyst Spark SQL

Conference

Spark Day 2017@Seoul(Spark Bootcamp)
Spark Day 2017- Spark 의 과거, 현재, 미래
Spark & Zeppelin을 활용한 한국어 텍스트 분류
- Spark & Zeppelin을 활용한 한국어 텍스트 분류
Zeppelin 노트북: NSMC Word2Vec & Sentiment Classification
Spark, Mesos, Zeppelin, HDFS를 활용한 대용량 보안 데이터 분석
2020 데이터 컨퍼런스 "Spark+Cassandra 기반 빅데이터를 활용한 추천시스템 서빙 파이프라인 최적화" / 박수성 SSG.COM 파트너 - YouTube
Tale of Scaling Zeus to Petabytes of Shuffle Data @Uber - YouTube
Sub-Second Analytics for User-Facing Applications with Apache Spark and Rockset - YouTube

Deep Learning

yahoo/CaffeOnSpark
CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters
Large Scale Distributed Deep Learning on Hadoop Clusters
SparkNet: Training Deep Networks in Spark
- Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet
large scale deep-learning_on_spark
DeepSpark: Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility
The Unreasonable Effectiveness of Deep Learning on Spark
GPU Acceleration in Databricks Speeding Up Deep Learning on Apache Spark
Deep Learning on Databricks - Integrating with TensorFlow, Caffe, MXNet, and Theano
Deep Learning With Apache Spark
- Part 1
- Part 2
Deep Learning Pipelines for Apache Spark

Docker

practice
- practice - install zeppelin docker image, read and adjust json files
DIT4C image for Apache Zeppelin
hub.docker.com/r/k3vin/polynote-spark
spark-scala-tutorial A free tutorial for Apache Spark docker jupyter notebook
Apache Spark on Docker
Distributed Pricing Engine using Dockerized Spark on YARN w/ HDP 3.0
- Part 1/4
- Part 2/4
- Part 3/4
- Part 4/4
Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker
DIY: Apache Spark & Docker. Set up a Spark cluster in Docker from… | by Shane De Silva | Towards Data Science

GraphX

GraphX
Spark Streaming and GraphX at Netflix - Apache Spark Meetup, May 19, 2015
스사모 테크톡 - GraphX
Computing Shortest Distances Incrementally with Spark
Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
Processing Hierarchical Data using Spark Graphx Pregel API
- GraphX API를 사용하는 예제와 방법
Community detection in graph Girvan newman algorithm

Hbase

example
- HBaseTest.scala, hbase_inputformat.py
I simple API to interact with HBase with Spark
Apache Spark Comes to Apache HBase with HBase-Spark Module
HBase Integration with Spark | How to Integrate HBase with Spark | Spark Integration with HBase
How to create Spark Dataframe on HBase table

Ignite

Ignite - Spark Shared RDDs

Installation

Installing Apache Spark 2.3.0 on macOS High Sierra
How to install and run Spark 2.0 on HDP 2.5 Sandbox
Apache Spark installation on Windows 10
Spark StandAlone 설치부터 예제 테스트까지
Hadoop, Spark 설치
Spark (scala) 개발환경 설정 (window)
How to Install Scala and Apache Spark on MacOS
Apache Spark setup with Gradle, Scala and IntelliJ
Create Spark Scala SBT project in Intellij Idea. 1-minute tutorial - YouTube
pocketcluster - One-Step Spark/Hadoop Installer v0.1.0
Spark 2: How to install it on Windows in 5 steps
Apache Spark Setup in Windows|Intellij IDE|CommandLine|Databricks|Zeppelin|All Methods Covered 2021. - YouTube

Kubernetes

Introduction to Spark on Kubernetes
What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release
- 2.4 preview. Kubernetes 지원 강화, PySpark/Spark R 지원 추가 등
Spark day 2017@Seoul - Spark on Kubernetes
Scalable Spark Deployment using Kubernetes
Auto Scaling Spark in Kubernetes
The anatomy of Spark applications on Kubernetes
- Kubernetes에 대한 Spark의 실험적 지원과 인-클러스터 클라이언트 모드에 대한 향후 지원에 대해 설명
- Spark driver, Executor, Executor Shuffle Service, Resource Staging Server
How to build Spark from source and deploy it to a Kubernetes cluster in 60 minutes
Apache Spark workloads on Kubernetes
Apache Spark Streaming in K8s with ArgoCD & Spark Operator - YouTube
Spark on Kubernetes - Gang Scheduling with YuniKorn - Cloudera Blog
Superworkflow of Graph Neural Networks with K8S and Fugue - YouTube word2vec node2vec

Library

Hadoop Tutorial: the new beta Notebook app for Spark & SQL
AWS Athena Data Source for Apache Spark
BigDL: Distributed Deep learning on Apache Spark
- BigDL: Distributed Deep learning on Apache Spark
CLOUD DATAPROC - Google Cloud Dataproc is a managed Spark and Hadoop service that is fast, easy to use, and low cost
- 구글, 스파크·하둡 관리 클라우드 서비스 공개
- [Google Cloud Dataproc 사용하기(http://whitechoi.tistory.com/48)
couchbase-spark-connector - The Official Couchbase Spark Connector
CueSheet - a framework for writing Apache Spark 2.x applications more conveniently
- No More "sbt assembly": Rethinking Spark-Submit using CueSheet
Delta Lake - Reliable Data Lakes at Scale
- Delta Lake on Databricks - Databricks
- Tutorial: How Delta Lake Supercharges Data Lakes - YouTube
- SmartSQL Queries powered by Delta Engine on Lakehouse - YouTube
- Making Apache Spark™ Better with Delta Lake - YouTube
- Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake on Databricks - YouTube
- Delta Lakehouse Data Profiler and SQL Analytics Demo - YouTube
- Optimising Geospatial Queries with Dynamic File Pruning - YouTube
- Demystifying Delta Lake. Data Brew | Episode 3 - YouTube
- Delta Lake on Databricks Demo - YouTube
- Make Reliable ETL Easy on Delta Lake - YouTube
- Building Lakehouses on Delta Lake with SQL Analytics Primer - YouTube
- Massive Data Processing in Adobe Experience Platform Using DeltaLake | by Jaemi Bremner | Adobe Tech Blog | Medium
- Multi-Table Transactions with LakeFS and Delta Lake - YouTube
Dr. Elephant Self-Serve Performance Tuning for Hadoop and Spark
EMR
- Large-Scale Machine Learning with Spark on Amazon EMR
- Amazon EMR, Apache Spark 지원 시작
- Spark on EMR
- (BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
- Starburst’s Presto on AWS up to 18x faster than EMR Presto의 엔터프라이즈 빌드를 제공하는 Starbust에서 AWS와 EMR 환경에서 벤치마크한 결과 소개
- Optimize Spark jobs on EMR Cluster
Envelope - a configuration-driven framework for Apache Spark that makes it easy to develop Spark-based data processing pipelines on a Cloudera EDH
- Envelope과 함께 Apache Spark, Apache Kudu 및 Apache Impala를 사용하여 Cloudera enterprise data hub (EDH)에 구현하는 방법
- Configuration specification
- Bi-temporal data modeling with Envelope
- Cloudera Enterprise Data Hub - Our flagship can now be yours
flambo - A Clojure DSL for Apache Spark
GraphFrames: DataFrame-based Graphs
- On-Time Flight Performance with GraphFrames for Apache Spark
- An introduction to Spark GraphFrame with examples analyzing the Wikipedia link graph
Hail: Scalable Genomics Analysis with Apache Spark
- Apache Spark로 유전체 분석을 수행하는 도구 인 Hail에 대한 개요
- 샘플의 품질을 계산하고 간단한 게놈 차원의 연관 연구를 수행하는 예제 실행으로 시연하는 간단하고 강력한 프로그래밍 모델을 보유
Hudi - Spark Library for Hadoop Upserts And Incrementals https://uber.github.io/hudi
- The Evolution of Uber’s 100+ Petabyte Big Data Platform
- Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop
- Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads
- Building an analytical data lake with Apache Spark and Apache Hudi - Part 1
Hydrogen
- Project Hydrogen: State of the Art Deep Learning on Apache Spark - Singapore Spark+AI Meetup
Infinispan Spark connector 0.1 released!
- infinispan-spark
- infinispan-spark-connector-examples
IMLLIB - Factorization Machines (LibFM) Field-Awared Factorization Machine (FFM) Conditional Random Fields (CRF) Adaptive learning rate optimizer (AdaGrad, Adam)
Lighthouse - a library for data lakes built on top of Apache Spark
- Organize your data lake using Lighthouse
Livy, the Open Source REST Service for Apache Spark, Joins Cloudera Labs
- Livy: A REST Web Service For Apache Spark
MapR-DB Spark Connector with Secondary Indexes
- MapR-DB Spark Connector Performance Tests
- Extending MapR Database Queries Using Scala Polymorphic Types
native_spark - new arguably faster implementation of Apache Spark from scratch in Rust
- Spark implemented in Rust with promising results
snappydata - Unified Online Transactions + Analytics + Probabilistic Data Platform
- SnappyData: OLTP + OLAP Database built on Apache Spark http://www.snappydata.io
spark-annoy: Building Annoy Index on Apache Spark
spark cassandra connector - 스파크에 카산드라 연동하는 라이브러리
spark-fatJAR-example: scala-spark build fat-jar example
- spark-submit을 위한 스파크 앱 JAR 생성하기 - Mk’s Blog
spark-indexed - An efficient updatable key-value store for Apache Spark
Sparkline SNAP
- Introducing Sparkline SNAP: An Integrated OLAP platform on Spark
spark-nkp Natural Korean Processor for Apache Spark
Spark Notebook
SparMysqlSample
spark-nlp - Natural Language Understanding Library for Apache Spark
- Spark NLP: Getting Started With The World’s Most Widely Used NLP Library In The Enterprise
- Spark NLP 101: Document Assembler
- Spark NLP: Installation on Mac and Linux (Part-II)
- Introduction to Spark NLP: Foundations and Basic Components
- Spark NLP 101: LightPipeline
- Spark in Docker in Kubernetes: A Practical Approach for Scalable NLP | by Jürgen Schmidl | Towards Data Science
spark-packages - A community index of packages for Apache Spark
- 스칼라 의존성, 패키지 검색하는 웹 - http://spark-packages.org
spark-ts - Time Series for Spark (The spark-ts Package)
spark-xml - XML data source for Spark SQL and DataFrames
StreamSets Transformer - an execution engine within the StreamSets DataOps platform that allows any user to create data processing pipelines that execute on Spark
- Custom Scala Project for StreamSets Transformer
zio
- SF Scala: Enhancing Spark's Power with ZIO, Qubism and NLP at Scale, Using Nix for Haskell
- Accelerating Spark with ZIO by Leo Benkel - YouTube

Library Monitoring

Deep Dive into Monitoring Spark Applications Using Web UI and SparkListeners (Jacek Laskowski)
Apache Spark performance - All relevant key performance metrics about your Apache Spark instance in minutes
HTRACE TUTORIAL: HOW TO MONITOR YOUR DISTRIBUTED SYSTEMS
delight: A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source
- Delight: An Improved Apache Spark UI, Free, and Cross-Platform - YouTube
spark-dependencies - Spark job for dependency links http://jaegertracing.io
spark-jobs-rest-client - Fluent client for interacting with Spark Standalone Mode's Rest API for submitting, killing and monitoring the state of jobs
Sparklint - The missing Spark Performance Debugger that can be drag and dropped into your spark application!
- SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs (Simon Whitear)
sparkoscope - Enabling Spark Optimization through Cross-stack Monitoring and Visualization
zipkin-dependencies - Spark job that aggregates zipkin spans for use in the UI

Machine Learning

BerkeleyX: CS190.1x Scalable Machine Learning
- Spark: Cluster Computing with Working Sets
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Feature Engineering at Scale With Spark
Audience Modeling With Spark ML Pipelines
Spark + AI Summit 2018 — Overview
Using Native Math Libraries to Accelerate Spark Machine Learning Applications
- Spark ML용 네이티브 라이브러리를 사용해 모델 훈련 속도를 높이는 방법
- 네이티브 라이브러리가 Spark ML에 이점이 되는 이유
- CDH Spark로 네이티브 라이브러리를 활성화하는 방법
- 여타 네이티브 라이브러리 사용 시 Spark ML 성능과의 비교 분석
Machine Learning with Jupyter using Scala, Spark and Python: The Setup
Spark Day 2017 Machine Learning & Deep Learnig With Spark
Building a Big Data Machine Learning Spark Application for Flight Delay Prediction
Apache Spark 2.0 Preview: Machine Learning Model Persistence by Databricks
Ranking Algorithms for Spark Machine Learning Pipeline BM 25 + Wilson score on spark 2.2.0
An Introduction to Machine Learning with Apache Spark™
Multiple Column Feature Transformations in Spark ML
End to End Spark TensorFlow PyTorch Pipelines with Databricks DeltaJim Dowling Logical Clocks ABKim
Accelerating Deep Learning on the JVM with Apache Spark and NVIDIA GPUs
Spark ML hyperparameter tuning
Scaling and Unifying SciKit Learn and Apache Spark Pipelines - YouTube
Sawtooth Windows for Feature Aggregations - YouTube
Run Your Queries Instantly in One of the Most Optimized Environments - YouTube Nephos
KeystoneML - Machine Learning Pipeline
Meson: Netflix's framework for executing machine learning workflows
MLflow
- Manage your Machine Learning Lifecycle with MLflow — Part 1
- MLflow: An Open Platform to Simplify the Machine Learning Lifecycle
MLLib
- Decision Trees
- MLlib: Machine Learning in Apache Spark
- movie recommendation with mllib
- WSO2 Machine Learner: Why would You care?
- Strata 2016 - This repo is for MLlib/GraphX tutorial in Strata 2016
- Spark ML Lab
- Machine Learning with Spark (Spark로 머신러닝하기)
- Apache Spark로 시작하는 머신러닝 입문
  - Apache Spark 입문에서 머신러닝까지
- Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
- Introduction to Machine Learning on Apache Spark MLlib
- Introduction to Machine learning with Spark
  - Introduction to Machine Learning with Spark
  - Code and setup information for Introduction to Machine Learning with Spark
- Introduction to ML with Apache Spark MLib by Taras Matyashovskyy
- pipelineio - End-to-End Spark ML and Tensorflow AI Data Pipelines
- Extend Spark ML for your own model/transformer types
- Accelerating Apache Spark MLlib with Intel® Math Kernel Library (Intel® MKL)
- Improving BLAS library performance for MLlib
- Extend Spark ML for your own model/transformer types
- Machine Learning with Apache Spark
- Building A Linear Regression with PySpark and MLlib
- Building Custom ML PipelineStages for Feature Selection - Marc Kaminski
- Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem
- Dataset deduplication using spark’s MLlib
- Deep Learning with Apache Spark and TensorFlow
- TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters - Andy Feng & Lee Yang
  - BigData와 결합한, 분산 Deep Learning 그 의미와 접근 방법에 대하여
- github.com/yahoo/TensorFlowOnSpark
  - Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters
- Deep learning for Apache Spark
- Spark machine learning & deep learning
- Spark Deep Learning Pipelines
- Deep Learning With Apache Spark
  - Part 1
  - Part 2
- Converting Spark ML Vector to Numpy Array
- PyData Tel Aviv Meetup: Learning Large Scale Models for Content Recommendation - Sonya Liberman
MMLSpark - Microsoft Machine Learning for Apache Spark
- Accelerated Spark on GPU-enabled clusters in Azure
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning http://oryx.io
- Production Recommendation Systems with Cloudera 기계 학습 기능을 위한 인프라 및 데이터 파이프라인을 구축하기 위해 Cloudera Oryx 프로젝트를 사용하는 예제
- Kafka + Spark + Cloudera Hadoop 를 이용한 추천시스템
raydp: RayDP: Distributed data processing library that provides simple APIs for running Spark on Ray and integrating Spark with distributed deep learning and machine learning frameworks
- Build Large-Scale Data Analytics and AI Pipeline Using RayDP - YouTube
spark-vlbfgs - an implementation of the Vector-free L-BFGS solver and some scalable machine learning algorithms for Apache Spark
TransmogrifAI Chetan Khatri - TransmogrifAI - Automate ML Workflow with power of Scala and Spark at massive scale

Mesos

Spark + Mesos cluster mode, who uploads the jar?

PySpark

PySpark
PySpark & Hadoop: 1) Ubuntu 16.04에 설치하기
PySpark & Hadoop: 2) EMR 클러스터 띄우고 PySpark로 작업 던지기
PySpark Cheat Sheet: Spark in Python
Big Data Analytics using Python and Apache Spark | Machine Learning Tutorial
troubleshooting
- A Beginner's Guide on Troubleshooting Spark Applications
- Caused by: java.lang.ClassNotFoundException: * org.elasticsearch.spark.package sbt configuration such as resolvers
  - Spark Runtime Error - ClassDefNotFound: SparkConf
- java.lang.OutOfMemoryError: GC overhead limit exceeded increase driver memory
- org.apache.spark.SparkException: Could not find BlockManagerEndpoint1 or it has been stopped 검색해도 특별히 나오는게 없음
- spark java.io.IOException: Filesystem closed usually result RDD is too big
- Task not serializable
- TypeError: 'bool' object is not callable Use PYSPARK_PYTHON=...
  - Check Python version in worker before run PySpark job
  - spark-runs-in-local-but-not-in-yarn
- yarn.scheduler.maximum.allocation-mb
  - increase configuration for yarn-site.xml
  - empty disk (not enough free space may cause this too)
- Cannot submit Spark app to cluster, stuck on “UNDEFINED”
  - yarn.nodemanager.resource.memory-mb 조정 후 동작 확인
- contains a task of very large size warning
  - 문제; Dataframe으로 읽어 온 row들을 텍스트 처리 해서 row끼리 비교를 해야 하는데, a task of very large size warning 발생
  - 해결; 텍스트 처리 된 중간 결과물을 Redis에 저장한 뒤 별도 Spark 애플리케이션을 사용해서 Row by Row 처리
  - 원인
    - Spark는 각 Executor가 수행해야 할 작업을 Task라는 단위로 관리
    - RDD에 가해지는 연산을 상호 의존성에 따라 묶은 뒤 (Logical Planning) 여기에 최적화 룰을 적용해서 실제로 Executor가 처리해야 할 Task의 형태로 생성 (Physical Planning)
    - 이걸 내부 queue에 넣어 뒀다가 순차적으로 Executor에 보내서 처리
    - 이 과정을 좀 더 구체적으로 설명하자면, Driver 프로세스가 작업 루틴과 작업 대상 위치를 TaskDescription 객체로 만든 뒤 Serialize를 해서 Worker 프로세스에 네트워크 상으로 전송
    - 문제는 Task당 100kb를 넘으면 "contains a task of very large size warning" 경고 발생
    - 이 제한은 소스코드 안에 하드 코딩되어 있어 변경 불가능
    - broadcast 기능을 사용할 경우 상황은 더 악화
    - broadcast 기능은 task를 전송할 때와는 달리 데이터 값 그 자체를 Worker에 하나하나 보내는 방식으로 동작
    - 이 경우 보내야 할 row가 한두 개가 아니므로, 당연히 성능에 문제 발생
    - 이런 이유 때문에 자연어 처리가 된 중간 결과물을 별도 스토리지에 저장한 뒤 별도 애플리케이션에서 읽어와서 처리하는 방법만 가능
    - 여러 storage 중에서 굳이 Redis를 추천하는 이유는 빠르고, Key-Value Store라 관리하기 좋고, Sharding 기능 덕분에 읽기 분산도 잘 동작하기 때문
    - 최근 Spark ML에서 학습된 모델이 Redis에 저장되는 식으로 개발되고 있음
- Spark Interpreter 이슈 해결
Getting started with PySpark - Part 1
Getting started with PySpark - Part 2
PySpark Internals
Fast Data Analytics with Spark and Python
pyspark-hbase.py
Deploying PySpark on Red Hat Storage GlusterFS
practice - weird case from pyspark-hbase (utf8 & unicode mixed)
Python Versus R in Apache Spark
biospark
Plagiarizing and Paraphrasing Code From an Online Class for Content Marketing
How-to: Use IPython Notebook with Apache Spark
Configuring IPython Notebook Support for PySpark
pyADAM - This is a wrapper to load Parquet data in PySpark
PySpark: 손상된 parquet파일 무시하기
Accessing PySpark in PyCharm
pyspark-project-example - A simple example for PySpark based project
Recommendation Systems for Implicit Feedback
Hassle Free ETL with PySpark
안명호 : Python + Spark, 머신러닝을 위한 완벽한 결혼 - PyCon APAC 2016
Fully Arm Your Spark with Ipython and Jupyter in Python 3
- Installation
PySpark Cheat Sheet: Spark in Python
Apache Spark for Data Science
BigDL on CDH and Cloudera Data Science Workbench BigDL (Apache Spark의 심층 학습 라이브러리)을 워크 벤치와 함께 사용하는 방법
Distributed Deep Learning At Scale On Apache Spark With BigDL
Deep Learning to Big Data Analytics on Apache Spark Using BigDL - Yuhao Yang & Xianyan Jia
Deep Learning on Qubole Using BigDL for Apache Spark – Part 2
- 딥러닝 라이브러리인 BigDL을 사용하여 모델을 학습하고 평가하는 방법을 보여주는 간단한 자습서
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench Python 라이브러리를 사용하는 PySpark 작업을 작성하는 방법
Install Spark on Windows (PySpark)
pyspark 로컬 설치
Get Started with PySpark and Jupyter Notebook in 3 Minutes
Best Practices Writing Production-Grade PySpark Jobs
- PySpark-Boilerplate
How to use PySpark on your computer
Spark Python Performance Tuning
Getting The Best Performance With PySpark
Improving Python and Spark Performance and Interoperability: Spark Summit East talk by Wes McKinney
- Improving Python and Spark Performance and Interoperability: Spark Summit East talk by: Wes McKinney
High Performance Python On Spark
- High Performance Python on Apache Spark
Comparing Performance between Apache Spark and PySpark
Keynote: Making the Big Data ecosystem work together with Python - Holden Karau
Downloading spark and getting started with python notebooks (jupyter) locally on a single computer
A Brief Introduction to PySpark - A primer on PySpark for data science
Introducing Pandas UDF for PySpark
Reading CSV & JSON files in Spark – Word Count Example
How to Upload/Download Files to/from Notebook in my Local machine
Analyze MongoDB Logs Using PySpark
Real-world Python workloads on Spark: EMR clusters
First Steps With PySpark and Big Data Processing
New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0™
How to create a simple ETL Job locally with PySpark, PostgreSQL and Docker
Data Collab Lab: Automate Data Pipelines with PySpark SQL - YouTube
Data Quality: Especially important with the medallion architecture with PySpark data testing - YouTube
How to Manage Python Dependencies in Spark - The Databricks Blog
데이터 분석 라이브러리 개발기 (1)
데이터 분석 라이브러리 개발기 (2) - 통합 테스팅과 문서화를 동시에 잡는 방법
04b: Databricks – Spark SCD Type 2 with Merge | Java-Success.com
웹로그 히스토리 데이터를 통한 데이터 분석 꼼수 : 네이버 블로그
Pandas API on Apache Spark - Part 1: Introduction
Pandas API on Apache Spark - Part 2: Hello World
Simplifying Testing of Spark Applications - Megan Yow | PyData Global 2021 - YouTube
Pyspark Functions - YouTube
Koalas: pandas API on Apache Spark
- Koalas: Easy Transition from pandas to Apache Spark
- 10 Minutes from pandas to Koalas on Apache Spark With demonstrable Python how-to Koalas code snippets and Koalas best practices
- New Developments in the Open Source Ecosystem: Apache Spark 3 0, Delta Lake, and Koalas
- pandas 코드로 대규모 클러스터에서 더 빠르게 빅데이터를 분석 해보자 - Koalas - 박현우 - PyCon Korea 2020 - YouTube
- The Jungle of Koalas, Pandas, Optimus and Spark | by Favio Vázquez | Towards Data Science
- Project Zen: Making Spark Pythonic | Reynold Xin | Keynote Data + AI Summit EU 2020 - YouTube
Petastorm - a library enabling the use of Parquet storage from Tensorflow, Pytorch, and other Python-based ML training frameworks
- Introducing Petastorm: Uber ATG's Data Access Library for Deep Learning | Uber Engineering Blog
Snowflake
- Read from Kafka & Write to Snowflake via Spark Databricks | LinkedIn
- 입 개발 Spark SQL Query to Snowflake Query | Charsyam's Blog

R

Spark 1.4 for RStudio
Python Versus R in Apache Spark
SparkR 설치 사용기 1 - Installation Guide On Yarn Cluster & Mesos Cluster & Stand Alone Cluster
MS R(구 Revolution R) on Spark - 설치 및 가능성 엿보기(feat. SparkR)
sparklyr — R interface for Apache Spark
sparklyr — R interface for Apache Spark
sparklyr
xwMOOC 기계학습 - dplyr을 Spark 위에 올린 sparklyr
sparklyr – An R interface for Apache Spark
spark + R
빅데이터 분석을 위한 스파크 2 프로그래밍 : 대용량 데이터 처리부터 머신러닝까지
On-Demand Webinar and FAQ: Parallelize R Code Using Apache Spark
Vectorized R Execution in Apache Spark - Hyukjin Kwon (Databricks)
How to Improve R Performance in SparkR at Apache Spark 3.0

Spark 3.0

Data Source V2 API in Spark 3.0 - Part 1 : Motivation for New Abstractions
Data Source V2 API in Spark 3.0 - Part 2 : Anatomy of V2 Read API
Data Source V2 API in Spark 3.0 - Part 3 : In-Memory Data Source
Data Source V2 API in Spark 3.0 - Part 4 : In-Memory Data Source with Partitioning
Data Source V2 API in Spark 3.0 - Part 5 : Anatomy of V2 Write API
Data Source V2 API in Spark 3.0 - Part 6 : MySQL Source
Introduction to Spark 3.0 - Part 1 : Multi Character Delimiter in CSV Source
Introduction to Spark 3.0 - Part 2 : Multiple Column Feature Transformations in Spark ML
Introduction to Spark 3.0 - Part 3 : Data Loading From Nested Folders
Introduction to Spark 3.0 - Part 4 : Handling Class Imbalance Using Weights
Introduction to Spark 3.0 - Part 5 : Easier Debugging of Cached Data Frames
Introduction to Spark 3.0 - Part 6 : Min and Max By Functions
Introduction to Spark 3.0 - Part 7 : Dynamic Allocation Without External Shuffle Service
Introduction to Spark 3.0 - Part 8 : DataFrame Tail Function
Introduction to Spark 3.0 - Part 9 : Join Hints in Spark SQL
Introduction to Spark 3.0 - Part 10 : Ignoring Data Locality in Spark
Spark Plugin Framework in 3.0 - Part 1: Introduction
Spark Plugin Framework in 3.0 - Part 2 : Anatomy of the API
Spark Plugin Framework in 3.0 - Part 3 : Dynamic Stream Configuration using Driver Plugin
Spark Plugin Framework in 3.0 - Part 4 : Custom Metrics
Spark Plugin Framework in 3.0 - Part 5 : RPC Communication
Adaptive Query Execution in Spark 3.0 - Part 1 : Introduction
Adaptive Query Execution in Spark 3.0 - Part 2 : Optimising Shuffle Partitions
AQE: Coalescing Post Shuffle Partitions – tech.kakao.com
Distributed TensorFlow on Apache Spark 3.0
Barrier Execution Mode in Spark 3.0 - Part 1 : Introduction
Barrier Execution Mode in Spark 3.0 - Part 2 : Barrier RDD
Webinar: A preview of Apache Spark 3.0
Spark & AI summit and a glimpse of Spark 3.0 - Towards Data Science
Spark 3.0에 새로 추가된 기능 소개 및 설명 - Nephtyw’S Programming Stash
NVIDIA Accelerates Spark Data Analytics Platform | NVIDIA Blog
Spark 3.0 — New Functions in a Nutshell - Javarevisited - Medium
Spark & AI summit and a glimpse of Spark 3.0 | by Adi Polak | Towards Data Science
Apache Spark 3.0 변경 사항

Spark DL

A Vision for Making Deep Learning Simple From Machine Learning Practitioners to Business Analysts

Spark SQL

Spark SQL, DataFrames and Datasets Guide
Deep Dive into Spark SQL’s Catalyst Optimizer
- DataFrame이 RDD와 다르게 최적화를 적용할 수 있는 이유
SparkSQL cacheTable 메소드 사용 성능 비교 - default vs cacheTable vs cacheTable (with columnar Compression)
SparkSQL Internals
Spark Data Source API. Extending Our Spark SQL Query Engine
Five Spark SQL Utility Functions to Extract and Explore Complex Data Types
- JSON 및 중첩 구조를 처리하기 위해 탑재된 Spark SQL 함수를 사용하기 위한 튜토리얼
Spark SQL: Another 16x Faster After Tungsten
Windowing Functions in Spark SQL Part 1 | Lead and Lag Functions | Windowing Functions Tutorial
Windowing Functions in Spark SQL Part 2 | First Value & Last Value Functions | Window Functions
Windowing Functions in Spark SQL Part 3 | Aggregation Functions | Windowing Functions Tutorial
Windowing Functions in Spark SQL Part 4 | Row_Number, Rank and Dense_Rank in SQL
Simplifying Change Data Capture with Databricks Delta
Spark DataFrameWriter에서 saveAsTable 의 동작
Dynamic Shuffle Partitions in Spark SQL
Tech Chat: Faster Spark SQL: Adaptive Query Execution in Databricks - YouTube
Sentiment Analysis on Demonetization in India using Apache Spark - Projects Based Learning
FLARE: SCALE UP SPARK SQL WITH NATIVE COMPILATION AND SET YOUR DATA ON FIRE!
- 실험 단계
- 쿼리플랜을 native code로 바꾸고 spark runtime system도 수정해 Spark SQL성능을 대폭 향상
- Flare: Native Compilation for Heterogeneous Workloads in Apache Spark
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL
- Apache Spark™ Distributed Matrix Computation

Streaming

Improved Fault-tolerance and Zero Data Loss in Spark Streaming
Four Things to know about Reliable Spark Streaming
Improved Fault-tolerance and Zero Data Loss in Spark Streaming
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real-Time Analytics with Spark Streaming
- Diving into Spark Streaming’s Execution Model
Can Spark Streaming survive Chaos Monkey?
RecoPick 실시간 데이터 처리 시스템 전환기 (Storm에서 Spark Streaming으로 전환)
From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 2
Spark Streaming으로 유실 없는 스트림 처리 인프라 구축하기
Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
Handling empty batches in Spark streaming
Spark Streaming Example(예제로 알아보는 Spark Streaming)
Long-running Spark Streaming Jobs on YARN Cluster
- spark-submit으로 장기간 streaming 분석 작업 실행하기
Spark Streaming 운영과 회고
Deep Learning and Streaming in Apache Spark 2 x - Matei Zaharia & Sue Ann Hong
24/7 Spark Streaming on YARN in Production
Running multiple Spark Streaming jobs of different DStreams in parallel
Arbitrary Stateful Processing in Apache Spark’s Structured Streaming
- 'exactly once' 주제에서 Apache Spark의 Structured Streaming으로 중복 제거를 구현하는 방법에 대해 설명
- 워터마크 기반으로 한 중복 제거 외에도 mapGroupsWithState를 사용하여 상태 저장 집계에 사용자 정의 로직을 추가 할 수 있는 방법에 대해 간략하게 설명
Internals of Spark Streaming
Why is My Stream Processing Job Slow?
How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming 월마트 랩에서 Apache Kafka, Spark Streaming/Batch로 Lambda 아키텍처를 구현하기 위해 구축된 A/B 테스트 플랫폼 소개
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Ingesting Raw Data with Kafka-connect and Spark Datasets
Introduction to Spark Structured Streaming - Part 15: Meetup Talk on Time and Window API
번역글 Spark Streaming의 내부
Comparing Apache Spark, Storm, Flink and Samza stream processing engines - Part 1 Apache Spark, Storm, Flink, Samze를 비교 분석
Kafka Streams vs. Spark Structured Streaming
Kafka Streams vs. Spark Structured Streaming (extended)
Kafka offset committer for Spark structured streaming
- Structured Streaming은 Kafka 에서 데이터를 가져올 때 사용하는 경우가 많음
- Spark가 Kafka consumer group ID를 임의로 지정하고 commit도 하지 않아 별도의 streaming query listener를 구현해 추적하는 방안 외에는 적당한 방도가 없음
- commit할 group ID를 지정하면 개별 batch의 commit된 offset정보를 Kafka로 commit, 기존 Kafka 툴들과 조합하면 lag등을 추적하는 데 도움
Scaling Spark Streaming for Logging Event Ingestion
State Storage in Spark Structured Streaming
State Management in Spark Structured Streaming
Watermarking in Spark Structured Streaming
Structured streaming in a flash
File sink and Out-Of-Memory risk on waitingforcode.com - articles about Apache Spark Structured Streaming
입 개발 Kafka 와 Spark Structured Streaming 에서 checkpoint 에서 아주 과거의 Offset이 있으면 어떻게 동작할까? | Charsyam's Blog
입 개발 Spark Structured Streaming 에서 Offset 은 어떻게 관리되는가(아주 간략한 버전)? | Charsyam's Blog
입 개발 Spark Kafka Streaming 에서의 BackPressure 에 대한 아주 간단한 정리. | Charsyam's Blog
Structured Streaming Use-Cases at Apple - YouTube

TDD, Test

Unit Testing Apache Spark Applications using Hive Tables
How I test with Apache Spark?

YARN

Running Spark on YARN
Apache Spark Resource Management and YARN App Models
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark Yarn Cluster vs Spark Mesos Cluster (vs 기타 다양한 모드) 수행성능 및 활용성 비교
Dynamic Resource Allocation Spark on YARN
Investigation of Dynamic Allocation in Spark
Spark Cluster Settings On Yarn : Spark 1.4.1 + Hadoop 2.7.1
Spark logging configuration in YARN
Understanding Apache Spark on YARN
Spark on YARN: a Deep Dive - Sandy Ryza, Cloudera
Apache Spark Performance Benchmarks show Kubernetes has caught up with YARN - Data Mechanics Blog

Zeppelin

Zeppelin
Apache Zeppelin Release 0.7.0
www.zepl.com previously www.zeppelinhub.com
practice
- meetup
Introduction to Zeppelin
Zeppelin overview
Zepplin (제플린) 설치하기
도커로 간단 설치하는 Zeppelin
5. 웹 기반 명령어 해석기 Zeppelin Install
How-to: Install Apache Zeppelin on CDH
Angular display system dashboard on Zeppelin
Apache Zeppelin으로 데이터 분석하기 by VCNC
Zeppelin Context
Apache Tajo 데스크탑 + Zeppelin 연동 하기
제플린 탑재한 이엠알 16년 4월
Zeppelin at Twitter
아파치 제플린, 한국에서 세계로 가기까지
Zeppelin Lab
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 시스템 구축 사례(1)
Serving Shiro enabled Apache Zeppelin with Apache mod_proxy + SSL (https)
Analyzing BigQuery datasets using BigQuery Interpreter for Apache Zeppelin
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
- 제플린 걸음마 서울시립대학교 데이터마이닝 활용사례 제플린 노트북 통계 추출 코드
노트7의 소셜 반응을 분석해 보았다. #3 제플린 노트북을 이용한 상세 분석
9월 발렌타인 웨비너 - 민경국님의 Apache Zeppelin 입문 온라인 헨즈온강의
오픈소스 일기 2: Apache Zeppelin 이란 무엇인가?
How Apache Zeppelin runs a paragraph
Spark & Zeppelin을 활용한 머신러닝 실전 적용기
- Zeppelin 화재 뉴스 기사 분류 예제
스파크-제플린으로 통계 그래프 출력하기(윈도우환경) 실패 이야기
Apache Zeppelin Data Science Environment 1/21/16
Zeppelin Build and Tutorial Notebook
zdairi is zeppelin CLI tool
Zeppelin Paragraph 공유 시 자동 로그인 구현
25분 만에 Apache Zeppelin 으로 대시보드 만들기 - 박훈(@1ambda)
Using Amazon Athena with Apache Zeppelin
ZEPL - How to Configure a JDBC Interpreter
www.zepl.com/resources how-to videos
Spark Scala Note 1
Journey to the Continuous and Scalable Big Data Platform
Big Data Tools 소개 – IntelliJ IDEA 내에서 Spark 통합 및 Zeppelin 노트북 지원
K-Means clustering with Apache Spark and Zeppelin notebook on Docker
Zeppelin notebook shortcuts - Mk’s Blog
Using Apache Zeppelin with SQL Server | by Mike Moritz | Medium
Zeplin ML: a ML plugin for Zeplin - YouTube
📊데이터 시각화 플랫폼 제플린 #zeppelin #dataviz - YouTube
- zeppelin
📊제플린 쉽게 시작하기 #zeppelin #dataviz - YouTube
📊제플린과 DB 연결하기 #mysql #dataviz - YouTube
Setup Zeppelin with K8S mode on NAVER Container Cluster | by EuiYul Song | Apr, 2021 | Medium
Dynamic Forms 동작 시 랜덤하게 paragraph 내용이 사라지는 문제와 임시 해결안 | by Sinjin | Feb, 2021 | Medium
Incorporating Plotly into your Zeppelin notebooks with Spark and Scala

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spark.md

spark.md

Spark

Apache Livy

API

Book

Conference

Deep Learning

Docker

GraphX

Hbase

Ignite

Installation

Kubernetes

Library

Library Monitoring

Machine Learning

Mesos

PySpark

R

Spark 3.0

Spark DL

Spark SQL

Streaming

TDD, Test

YARN

Zeppelin

Files

spark.md

Latest commit

History

spark.md

File metadata and controls

Spark

Apache Livy

API

Book

Conference

Deep Learning

Docker

GraphX

Hbase

Ignite

Installation

Kubernetes

Library

Library Monitoring

Machine Learning

Mesos

PySpark

R

Spark 3.0

Spark DL

Spark SQL

Streaming

TDD, Test

YARN

Zeppelin