Apache

apache bigdata europe
Apache 프로젝트 만들기(1)
Apache 프로젝트 만들기(2)
Projects by category

Airflow

Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
- Shows an example of managing Databricks clusters from Apache Airflow via the REST API
데이타 워크플로우 관리를 위한 Apache Airflow #1 - 소개
Airflow Tutorial for Data Pipelines
- A tutorial worth referring to when getting started with Apache Airflow
Apache Airflow를 이용한 데이터 워크플로우 자동화
ETL best practices with Airflow documentation site
Integrating Apache Airflow with Apache Ambari
Modern Data Pipelines with Apache Airflow (Momentum 2018 talk) Apache Airflow concepts and a few examples
When Airflow isn’t fast enough: Distributed orchestration of multiple small workloads with Celery
Apache Airflow in the Cloud: Programmatically orchestrating workloads w/ Py - Satyasheel, Kaxil Naik
Advanced Data Engineering Patterns with Apache Airflow How the Airbnb data engineering team builds on Apache Airflow to implement A/B tests, AutoDAG, Engagement & Growth metrics, scaling, and more
How to start automating your data pipelines with Airflow
Building a Big Data Pipeline With Airflow, Spark and Zeppelin
Airflow: Lesser Known Tips, Tricks, and Best Practises
Airflow를 이용한 데이터 Workflow 관리
우분투(Ubuntu)에 아파치 에어플로우 (Apache Airflow) 설치
실무에 바로 사용하는 Airflow 2.0 설치
docker-compose로 Airflow 한방에 설치하기
jwon.org/tag/airflow
Getting started with Apache Airflow
Data pipelines, Luigi, Airflow: everything you need to know
Cloud Composer 에서 Airflow Web Server REST API 로 외부에서 DAG 트리거하기
입 개발 airflow 의 schedule_interval 에 대해서
AWS EMR과 Airflow를 이용한 Batch Data Processing | by Min Jo | 101-devs | Aug, 2020 | Medium
CLASS101에서 Airflow와 Amazon EMR을 활용한 ETL 파이프라인 구축 - 조민구(CLASS101) :: 제32회 AWSKRUG DataScience모임 - YouTube
Introducing Airflow 2.0 | Astronomer
Airflow 실패여부 slack알람으로 받기 (python)
airflow CPU가 높게 점유되는 현상
airflow dag의 task를 실행하고 동작하지 않는 현상
Airflow의 execution_date에 대하여 - Nephtyw’S Programming Stash
버킷플레이스 Airflow 도입기 - 오늘의집 블로그
쏘카 데이터 그룹 - Airflow와 함께한 데이터 환경 구축기(feat. Airflow on Kubernetes) - SOCAR Tech Blog
- Started with Rundeck; after adopting Airflow, used Composer, GCP's managed service; problems emerged as the company and the data team grew
- Eventually decided to build and run Airflow on Kubernetes; explains in detail how to operate it there
Hello, Apache Airflow
후기 이미지 자동 검수 모델, 어떻게 서비스할까? | by MUSINSA tech | Medium | MUSINSA tech
airflow 파라미터 튜닝
나만의 Airflow 클러스터 만들기 (feat. k3d)
Apache Airflow와 Amazon SageMaker Feature Store 연동하기 | by Sungin Lee | Cloud Villains | Sep, 2021 | Medium
Misconfigured, old Airflow instances leak Slack, AWS credentials | ZDNet
ETL Pipelines with Airflow: the Good, the Bad and the Ugly | Airbyte
배치 파이프라인 도입을 위한 Workflow 리서치 (Airflow VS Azkaban VS Oozie)
Apache Airflow Tutorials for Beginner

Ambari

3 GREAT REASONS TO TRY APACHE HIVE VIEW 2.0
- Introduces Hive View 2.0, a new way to interact with Apache Hive from Apache Ambari
- Lets you view and compute the table and column statistics used by the optimizer; includes explain plan visualization
WHY SHOULD YOU CARE ABOUT AMBARI 2.5?
- Apache Ambari 2.5 released. Includes automatic service restart, log rotation and log search, improved configuration management, and new monitoring features
How to upgrade Apache Ambari 2.6.2 to Apache Ambari 2.7.3

Apex

Apex - stream and batch processing engine
Real-time Stream Processing using Apache Apex
Throughput, Latency, and Yahoo! Performance Benchmarks. Is there a winner?
SQL on Apache Apex
Writing to Apache Kudu from Apache Apex
- How to write data from Apache Kafka to Apache Kudu using Apache Apex

Arrow

Arrow
Apache Arrow - Powering Columnar In-Memory Analytics - Arrow is a set of technologies that enable big-data systems to process and move data fast
Why pandas users should be excited about Apache Arrow
Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow
Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard
Improving Python and Spark Performance and Interoperability with Apache Arrow
- The Apache Arrow project implements cross-language columnar in-memory analytics
- Most developers never work with Arrow directly, but it can (reportedly) speed up many workloads such as PySpark
- This presentation introduces what Arrow is and how it achieves those speedups
Apache Arrow (Python)
- Reading and Writing the Apache Parquet Format
Wes McKinney: Ursa Labs and Apache Arrow in 2019 | PyData Miami 2019
Apache Arrow and Java: Lightning Speed Big Data Transfer
Apache Arrow: Read DataFrame With Zero Memory | Towards Data Science

Atlas

Atlas - data governance, standards, and lineage management platform
Apache Atlas - Using the v2 Rest API How to record data using the Atlas REST API

Beam

Former DataFlow
The Beam Model : Streams & Tables
- Covers the Apache Beam model as built on streams and tables
bcho.tistory.com/search/dataflow
구글 데이타 스트리밍 데이타 분석 플랫폼 dataflow - #1 소개
데이타 스트리밍 분석 플랫폼 Dataflow 개념 잡기 #1/2
데이타 스트리밍 분석 플랫폼 Dataflow 개념 잡기 #2/2
GOOGLE DATA FLOW - Google's Data Flow concepts and how to program with it
데이타 플로우 #4 개발환경 설정하기
데이타 플로우 #5 프로그래밍 모델의 이해
Face recognition Image Cropping and Filtering notebook
- Preprocessing code based on Apache Beam
Comparing the Dataflow/Beam and Spark Programming Models
Type safe BigQuery in Apache Beam with Spotify’s Scio

BookKeeper

Apache BookKeeper: A High Performance and Low Latency Storage Service

Brooklyn

Brooklyn

Camel

Camel
Apache Camel 소개
Streaming in the Cloud With Camel and Strimzi
How Apache Camel simplified our process integrations
Top 5 Courses to Learn Apache Camel in 2022 - Best of Lot
5 Best Apache Camel Courses for Java Developers in 2022 | Java67

Commons

Commons

Cordova

Apache Cordova: after 10 months, I won't be using it anymore
Cordova 환경 구성 & Git Ignore 설정
ionic cordova emulate 실행 시 Cannot read property 'replace' of undefined 에러 해결하기

Crunch

Crunch

Drill

Drill
Apache Drill SQL Query Optimization | Whiteboard Walkthrough
A Gentle introduction to Apache Drill

Druid

Druid
druid.io
임플라이, 드루이드 기반 오픈소스 분석 플랫폼 공개
Imply - Exploratory Analytics Powered By Druid
Druid is a high-performance, column-oriented, distributed data store
An Introduction to Druid
Aggregated queries with Druid on terrabytes and petabytes of data
Combining Druid and Spark: Interactive and Flexible Analytics at Scale
Time series OLAP
- Druid 입문(1)
- Druid 실시간 수집(2)
- Druid Batch Ingestion(3)
- Druid Segment(4)
- Glue Architecture(5)
- Druid Trouble Shooting(6)
Scalable Real-time analytics using Druid
Druid 성능 엿보기. Spark이랑 같이 보자
JDBC를 통한 하둡 적재, 알면 도움되는 삽질 이야기 1편
Hive 와 Druid로 울트라-빠른 OLAP 분석하기
벤치마크 Apache Hive와 Druid를 통한 sub-second 분석 -2편
Ultra-Fast OLAP Analytics With Apache Hive and Druid (Part 1)
Ultra-Fast OLAP Analytics With Apache Hive and Druid (Part 2)
4th Druid Meetup 참석 후기
Comparison of the Open Source OLAP Systems for Big Data: ClickHouse, Druid and Pinot
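- Compares the open source distributed storage engines ClickHouse, Druid, and Pinot
- Explains the similarities and differences between the systems in storage and indexing, performance characteristics, data ingestion, data replication, and query execution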
Web analytics at scale with Druid at naver.com
An introduction to Druid, your Interactive Analytics at (big) Scale
How Druid enables analytics at Airbnb
- Airbnb's experience using Druid for analytics
- Explains how Druid complements their other big data systems, how data is ingested with Spark Streaming, how Presto is integrated, monitoring, and open problems and planned improvements
Realtime Data in Apache Druid — Choosing the Right Strategy
How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience
What Makes Apache Druid Great for Realtime Analytics?
입개발 Druid에서 transform 시 알아야 할 팁. | Charsyam's Blog
PyData Triangle January 2022 Meetup - YouTube
metatron.app Self-service Solution for Big Data Discovery. All-in-one analytics from easy data preparation to fast visualization
- github.com/metatron-app

Eagle

Apache Eagle

Falcon

Falcon - Simplifying Managing Data Jobs on Hadoop

Flink

Flink
Apache Flink Training
Juggling with Bits and Bytes
스사모 테크톡 - Apache Flink 둘러보기
Off-heap Memory in Apache Flink and the curious JIT compiler
Stream Processing with Apache Flink
High-throughput, low-latency, and exactly-once stream processing with Apache Flink
Continuous Processing with Apache Flink - Strata London 2016
Introduction to Flink Streaming
- Part 1 : WordCount
- Part 2 : Discretization of Stream using Window API
- Part 3 : Running Streaming Applications in Flink Local Mode
- Part 4 : Understanding Flink's Advanced Stream Processing using Google Cloud Dataflow
- Part 5 : Window API in Flink
- Part 6 : Anatomy of Window API
- Part 7 : Implementing Session Windows using Custom Trigger
- Part 8 : Understanding Time in Flink Streaming
- Part 9 : Event Time in Flink
- Part 10 : Meetup Talk
- Introduction to Flink Streaming
- Flink Examples
A Deep Dive into Rescalable State in Apache Flink
- Explains how to rescale a job (e.g., increase or decrease its parallelism) using the checkpointing feature
Stream Processing with Apache Flink and DC/OS
- Introduces how to run Apache Flink streaming jobs on Mesos using DC/OS
StreamING Machine Learning Models: How ING Adds Fraud Detection Models at Runtime with Apache Flink®
- Explains how ING uses Apache Flink as its risk analysis engine
- Apache Spark, Knime, and Apache Zeppelin are used for batch modeling, while the real-time component uses Flink
PREDICTIVE MAINTENANCE WITH APACHE FLINK
- How a time-series prediction model built with Keras was integrated with Flink
- How to use models built with Python deep learning libraries (TensorFlow, Keras) on the JVM
- What advantages Apache Flink has over Apache Spark
Complex Event Processing with Flink: An Update on the State of Flink CEP
- Flink supports complex event processing through a high-level API for detecting event patterns
- An overview of the API, with an example of shipment tracking for an online retailer
An Overview of End-to-End Exactly-Once Processing in Apache Flink® (with Apache Kafka, too!)
Apache Flink Basic Transformation Example An example that reads data from a file, converts it to uppercase, and writes it to another file
Flink Forward San Francisco 2018 Videos and Slides
STREAM ANALYTICS PLATFORM FOR A TELCO
- Describes Apache Flink and a case study of building a stream processing system on top of it
- Also explains how to test the system with synthetic data and how to monitor it with ELK
- PART 1
- PART 2
Flink at netflix paypal speaker series
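- Netflix's stream processing system (thousands of machines) handles roughly 4 trillion or more events per day (36 GB/sec)
- The system is built as self-service infrastructure based on Apache Flink and Apache Kafka
- Explains why they use Flink, and how it is implemented and operated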
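As a rough sanity check on the figures quoted for the Netflix talk, the arithmetic below converts a daily event count into a per-second rate and an average event size. It assumes exactly 4 trillion events per day, 1 GB = 10^9 bytes, and traffic spread evenly over the day; those assumptions are for illustration and do not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events_per_day: float) -> float:
    """Average event rate, assuming traffic is spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def average_event_bytes(bytes_per_second: float, events_per_day: float) -> float:
    """Average size of one event implied by a byte rate and a daily event count."""
    return bytes_per_second / events_per_second(events_per_day)


# Assumed inputs: 4 trillion events/day and 36 GB/sec with 1 GB = 10**9 bytes.
rate = events_per_second(4e12)
size = average_event_bytes(36e9, 4e12)
print(f"{rate:,.0f} events/sec, about {size:.0f} bytes per event")
```

Under these assumptions the two quoted numbers are mutually consistent: tens of millions of events per second at well under 1 KB each.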
State TTL for Apache Flink: How to Limit the Lifetime of State TTL support in Flink 1.6.0
Flink Forward Berlin 2018: Dongwon Kim - "Real-time driving score service using Flink"
- Real-time driving score service using Flink
Automatic Apache Flink deployments in Golang
Automating Flink Deployments to Kubernetes
Introduction to Apache Flink
Flink or Flunk? Why Ele.me Is Developing a Taste for Apache Flink
- How the Ele.me team at Alibaba adopted Apache Flink as its data stream processing system
- Explains why Flink was chosen over Apache Storm and Apache Spark
Introduction of apache flink kosscon2018
Introduction to Flink in 30 minutes - YouTube
About Flink streaming
A Brief History of Flink: Tracing the Big Data Engine’s Open-source Development
Patterns of Streaming Applications
Better to Give and to Receive: Alibaba’s Open-source Contributions to Flink
Running Apache Flink on Kubernetes
- Connecting monitoring (Prometheus): move the Prometheus jar from Flink's /opt to /lib, configure the metrics section in flink-conf.yaml, then set only the prometheus.io/port and prometheus.io/scrape annotations on the job/task pods, and Prometheus service discovery collects the metrics
Berlin 2019
europe-2019.flink-forward.org/conference-program
Flink Forward Global 2021
Apache Flink® SQL Training
Do Flink on Web with FLOW
0x90e.github.io/tags/Flink Reportedly a post explaining, at the code level, how user code is turned into a graph and submitted to the JobManager; written in Chinese
T map에 Flink 이식하기
Flink Source 부터 Sink 까지
Deep dive into flink interval join
Here’s What Makes Apache Flink scale A glance at the Memory management and Network flow control
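- To reduce GC, reserves a large heap at startup and manages it itself (memory manager)
  - When an operator needs memory, it requests memory (segments) from the memory manager, uses them, and returns them
  - Can also switch to off-heap memory to speed up network and disk I/O (stateful)
  - Large segments can be written to disk and read back, which prevents OOM
- Minimizes data movement by using operator chains
- Implements its own serialization/deserialization. Can store an object, its related key (?), and hash value adjacently, which allows data prefetching
  - Because value ordering is preserved, sorting needs no ser/deser. Presumably this is the code built on Values
- Credit-based flow control prevents work from blocking under backpressure when load piles up on one SubTask
  - Each SubTask tells the upstream SubTask how much buffer it has left, and the upstream SubTask takes this into account when distributing work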
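The credit-based flow control idea in the notes above can be sketched in a few lines. This is a toy simulation, not Flink's network stack: the `Receiver` and `Sender` classes and their method names are hypothetical, and a "credit" here is simply the number of free buffers the downstream side announces.

```python
from collections import deque


class Receiver:
    """Downstream subtask with a fixed number of buffers (hypothetical model)."""

    def __init__(self, buffers: int):
        self.free = buffers
        self.queue = deque()

    def announce_credit(self) -> int:
        # Credit = number of buffers that are currently free.
        return self.free

    def receive(self, record) -> None:
        if self.free == 0:
            raise RuntimeError("sender ignored the announced credit")
        self.free -= 1
        self.queue.append(record)

    def process_one(self) -> None:
        # Processing a record frees its buffer, which becomes new credit.
        if self.queue:
            self.queue.popleft()
            self.free += 1


class Sender:
    """Upstream subtask that never sends more than the announced credit."""

    def __init__(self, records):
        self.backlog = deque(records)

    def send(self, receiver: Receiver) -> int:
        credit = receiver.announce_credit()
        sent = 0
        while sent < credit and self.backlog:
            receiver.receive(self.backlog.popleft())
            sent += 1
        return sent


receiver = Receiver(buffers=2)
sender = Sender(range(5))
first = sender.send(receiver)   # fills both free buffers
second = sender.send(receiver)  # no credit left, nothing is sent
receiver.process_one()          # frees one buffer
third = sender.send(receiver)   # exactly one more record goes through
```

The sender holds back records in its own backlog instead of overrunning a slow receiver, which is the property the notes attribute to the mechanism.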
OptimizedText.java
Improving throughput and latency with Flink's network stack - Nico Kruber flink flow mechanism
Apache Flink Virtual Meetup Seoul July 23, 2020 - YouTube
- Flink on Kubernetes operator
Enriching your Data Stream Asynchronously with Apache Flink - YouTube
Keynote | Flink Ahead 2.0: The Sequel - Konstantin Knauf - YouTube
Flink SQL in 2020: Time to show off! - Fabian Hueske & Timo Walther - YouTube
Unified APIs for Batch and Stream Processing on Flink - YouTube
2021 Apache Flink Meetup - Hosted by Netflix - YouTube
Flink setup for development (and some IntelliJ Idea cool tricks)
Flink Concept - Operator 간 데이터 교환 | leeyh0216's devlog
Flink Concept - Checkpointing(1) | leeyh0216's devlog
Flink Concept - pipeline.object_reuse | leeyh0216's devlog
Flink Concept - Flink의 Kafka Consumer 동작 방식(1) | leeyh0216's devlog
글로벌 기업이 더 주목하는 스트림 프로세싱 프레임워크 - 플링크(Flink) 이해하기 : 네이버 포스트
5 years of Flink at Mux | Mux blog
Docker를 사용한 Apache Flink와 Flink Job 올리기(1) - Docker Setting | woolog - 개발자 울이
Docker를 사용한 Apache Flink와 Flink Job 올리기(2) - Flink Job Example | woolog - 개발자 울이
flink-ai-extended
flink_feature_radar.svg at feature_radar · StephanEwen/flink-web Features slated to be removed from or added to Flink
Flink Job Listener: Run a task After Flink Job is Completed | CodersTea
HRFS On-demand low-latency feature generation at Hyperconnect - YouTube

Flume

Scaling a flume agent to handle 120K events/sec
- Describes the "Round-Robin Channel Selector", a new channel selector for Apache Flume
- With this selector, throughput scales to about 10 times the default batch throughput
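To make the round-robin idea concrete, here is a minimal Python sketch of a selector that hands each event to the next channel in turn. It is not the Flume API (Flume selectors are Java classes); `RoundRobinSelector`, `distribute`, and the channel names are hypothetical and only illustrate the distribution pattern.

```python
from itertools import cycle


class RoundRobinSelector:
    """Picks channels in a fixed rotation, one per event (illustrative only)."""

    def __init__(self, channel_names):
        if not channel_names:
            raise ValueError("at least one channel is required")
        self._rotation = cycle(channel_names)

    def select(self, event):
        # The event content is ignored; only the rotation decides the channel.
        return next(self._rotation)


def distribute(events, channels):
    """Append each event to the channel chosen by the selector."""
    selector = RoundRobinSelector(list(channels))
    for event in events:
        channels[selector.select(event)].append(event)
    return channels


channels = {"c1": [], "c2": [], "c3": []}
distribute(range(7), channels)
```

Spreading events evenly like this lets several channels (and the sinks draining them) work in parallel, which is the scaling effect the article describes.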

Geode

Geode - an open source, distributed, in-memory database for scale-out applications
Apache Geode Lab

Goblin

Goblin

HAWQ - advanced enterprise SQL-on-Hadoop query engine and analytic database

The Apache Software Foundation Announces Apache® HAWQ® as a Top-Level Project
Apache HAWQ 2.4.0.0 Release

Hivemall

Hivemall
hivemall.incubator.apache.org/userguide/index.html
Scalable machine learning library for Hive/Hadoop
Apache Hivemall: Machine Learning Library for Apache Hive/Spark/Pig

Iceberg

Iceberg - a table format for large, slow-moving tabular data
넷플릭스, 대용량 자료 저장공간을 빠른 DB 테이블처럼 사용하는 기술
Iceberg: a fast table format for S3

Ignite

Ignite - Spark Shared RDDs
Accelerate Apache Spark SQL Queries
Performance Tuning of an Apache Kafka/Spark Streaming System

Impala

Impala
Apache Impala (Incubating)
Contributing to Impala
The Impala Cookbook
What’s Next for Impala: More Reliability, Usability, and Performance at Even Greater Scale
How-to: Prepare Unstructured Data in Impala for Analysis
New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance
Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive
Build a Prediction Engine Using Spark, Kudu, and Impala
Visualize your massive data with Impala and Redash
Latest Impala Cookbook
Ibis on Impala: Python at Scale for Data Science
SQL-on-Hadoop: Impala vs Drill
- Introduces the main components and query processing mechanisms of Apache Impala and Apache Drill
Apache Impala Leads Traditional Analytic Database
- TPC-DS benchmark comparison against Hive, Spark, and Presto
How to read Impala query plan and profile? Part 1 and 2
Faster Performance for Selective Queries
Performance Optimizations in Apache Impala
- Query optimization, ordered scans (ordering scan & Top-N), join patterns and choosing the ideal join type and join order, hash joins, LLVM codegen for aggregation, runtime Bloom filters
Benchmarking Impala on Kudu vs Parquet
Hotspotting In Hadoop — Impala Case Study
Apache Impala: My Insights and Best Practices
How to read Impala query plan and profile Part 1 by Juan Yu
5 Main Missing Features in Impala (Opinion)
Assessment of Apache Impala Performance using Cloudera Manager Metrics – Part 1 of 3
- How to troubleshoot Impala performance issues using the charts and metrics in Cloudera Manager
Impala At Scale - 임상배 이사 (Cloudera)
practice - extract hour from unixtimestamp

Jena

Jena

Kafka

Kafka
kafka-tutorials.confluent.io
Confluent Developer: Your Apache Kafka® Journey begins here
Docker Quick Start
practice - Kafka on Python
- Kafka Python and Google Analytics
- Getting started with Apache Kafka in Python
Kafka For Beginners
주니어 개발자의 storm kafka 시작하기
Understanding Kafka with Factorio | by Ruurtjan Pul | Medium
Kafka 시작하기 | FUREWEB
Learn Kafka - Apache Kafka Tutorials and Resources | Confluent Developer
Apache Kafka and Confluent Platform examples and demos
Apache Kafka Best Practices
kafka-console-consumer.sh
Kafka - kafka-console-consumer
Vertically scaling Kafka consumers
- A look at the inner workings of the Kafka consumers, with some real world recommendations for deploying them when there's high latency in talking to the Kafka cluster and/or a large number of partitions. There are tips on important metrics to monitor, configurations, garbage collector settings, and changing the partition.class to improve unbalanced consumers.
KAFKA TUTORIAL: USING KAFKA FROM THE COMMAND LINE
Kafka Tutorial - Quick Start Demo
ClickHouse Kafka Engine Tutorial
Introduction to Apache Kafka by James Ward
Kafka frequent commands
Kafka in a Nutshell
빅데이터의 기본 아파치 카프카! 개요 및 설명 | What is apache kafka?
How To Install Apache Kafka on Ubuntu 14.04
Apache Kafka. MacOS installation guide
Install Kafka in RHEL 7
Using Apache Kafka Docker
Kafka Docker - Run multiple Kafka brokers in Docker
kafka-stack-docker-compose
A Simple Apache Kafka Cluster With Docker, Kafdrop, and Python | by Leo Brack | Better Programming | Oct, 2020 | Medium
HANDS-FREE KAFKA REPLICATION: A LESSON IN OPERATIONAL SIMPLICITY
Distributed Consensus Reloaded: Apache ZooKeeper and Replication in Apache Kafka
Changing Replication Factor of a Topic in Apache Kafka
Bottled Water: Real-time integration of PostgreSQL and Kafka
Apache Kafka, Samza, and the Unix Philosophy of Distributed Data
Apache Kafka: Case of Large Messages, Many Partitions, Few Consumers
The Power of Kafka Partitions : How to Get the Most out of Your Kafka Cluster
From Kafka to ZeroMQ for real-time log aggregation
SQL on Kafka
Kafka at HubSpot: Critical Consumer Metrics
빅데이터 윤활유 '아파치 카프카', 왜 주목받나
Why I am not a fan of Apache Kafka
What’s New in Cloudera’s Distribution of Apache Kafka?
Apache Kafka 성능 테스트
Using Golang and JSON for Kafka Consumption With High Throughput
Golang에서 카프카 컨슈머 그룹과 재시도로 결과적 일관성 구현하기 | Popit
대용량 스트리밍 데이터 실시간 분석
Monitoring Kafka performance metrics
How to Monitor Kafka
MONITORING APACHE KAFKA WITH GRAFANA / INFLUXDB VIA JMX
카프카 커넥트 JMX + 로그스태시로 모니터링 하기
Monitoring Kafka Consumer Offsets
- A simple way to monitor Kafka consumer offsets
- Exports Kafka consumer offsets over HTTP and visualizes them in Grafana using Prometheus
MONITORING KAFKA CONSUMER LAG IN SECONDS
Apache Kafka Monitoring – Methods & Tools
Just Enough Kafka for the Elastic Stack, Part 1
Elastic Stack에는 Kafka면 충분합니다 - 2부
Kafka New Producer API를 활용한 유실 없는 비동기 데이터 전송
Kafka 0.9 Consumer 클라이언트 소개
Presto SQL을 이용하여 Kafka topic 데이터 조회하기
New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest
Kafka Python client 성능 테스트
Understanding of Apache Kafka – Part.1
From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 1
Apache Kafka, Data Pipelines, and Functional Reactive Programming with Node.js
Building/Running Netflix's Data Pipeline using Apache Kafka
코드 한줄 없이 서비스 Dashboard 만들기(1)
코드 한줄 없이 서비스 Dashboard 만들기(2)
Kafka 운영자가 말하는
- 처음 접하는 Kafka
- Kafka Consumer Group
- Producer ACKS
- 카프카 서버 실전 로그 분석
- TIP
- Topic Replication
- Replication Factor 변경
- 카프카 매니저 소개
Kafka Summit Americas 2021 Recap | Confluent
Kafka Summit New York
Kafka Summit New York 2019 Session Videos
Kafka Summit San Francisco
- Kafka Needs no Keeper
  - A session on how ZooKeeper is removed from Kafka, with Kafka controller brokers taking over its role, and what changes as a result
  - For those who have operated Elasticsearch, the Kafka controller feels similar to an ES master-eligible node
- Please Upgrade Apache Kafka. Now
  - A session in which Gwen, also an author of Kafka: The Definitive Guide, walks through various bugs and vulnerabilities in old Kafka versions to explain why you should upgrade
Martin Kleppmann | Kafka Summit London 2019 Keynote | Is Kafka a Database?
- Online Event Processing Achieving consistency where distributed transactions have failed
The First Annual State of Apache Kafka Client Use Survey Which languages are most used with Kafka, and why
Benchmarking Kafka Performance Part 1: Write Throughput
Securing the Confluent Schema Registry for Apache Kafka
- How to secure the Confluent Schema Registry and configure it to connect securely to ZooKeeper and the Kafka cluster
Introduction to Apache Kafka Security
Apache Kafka Security | Need and Components of Kafka
- How to handle authorization, authentication, and encryption in combination with ZooKeeper
Kafka Needs No Keeper - Removing ZooKeeper Dependency
- Apache Kafka, ZooKeeper 의존성을 제거 | GeekNews
Kafka Without ZooKeeper: A Sneak Peek At the Simplest Kafka Yet
- Kafka Without ZooKeeper 첫 배포 | GeekNews
Kafka Needs no Keeper - Confluent
- kafka/config/kraft at trunk · apache/kafka
Kafka 보안 (1) - JAAS 및 SASL
Kafka 보안 (2) - SASL/PLAIN
Apache Kafka지도 시간
Exactly-once Support in Apache Kafka
Exactly-once Semantics are Possible: Here’s How Kafka Does it
kafka exactly-once delivery를 지원하기 위한 transaction
Upgrading Apache Kafka Clients Just Got Easier
- Recent versions add forward/backward compatibility for Kafka clients
- Explains how to use this feature and what happens when clients and brokers run different versions
How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka
- Discusses use cases for building a centralized, scalable architecture for mission-critical real-time applications
Benchmarking Message Queue Latency
How Apache Kafka Inspired Our Platform Events Architecture
How to know if Apache Kafka is right for you
URP? Excuse You! The Three Kafka Metrics You Need to Know Explains the Kafka metrics for monitoring under-replicated partitions, request handlers, and request time
Top 5 Things Every Apache Kafka Developer Should Know
일상 협업 이야기: 참조 아키텍처 써먹기 편
Scalability of Kafka Messaging using Consumer Groups
Announcing AMQ Streams: Apache Kafka on OpenShift
Robust Message Serialization in Apache Kafka Using Apache Avro, Part 1
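- In Apache Kafka, Java applications called producers write structured messages and send them to a Kafka cluster (made up of brokers). Reading those messages from the same cluster is likewise done by Java applications called consumers. Depending on the organization, different groups or departments may be responsible for writing and managing the producers and the consumers
  - This raises a significant issue: producers and consumers must coordinate on an agreed message format
  - The example uses Apache Avro to serialize records produced to Apache Kafka while evolving the schema, and shows how to update producer and consumer applications asynchronously
- Serialization and deserialization
  - A Kafka record (formerly called a 'message') consists of one key, one value, and headers. Kafka is unaware of the structure of the data in a record's key and value; it treats them as byte arrays
  - For systems reading records from Kafka, however, the data in those records matters, so the data must be produced in a readable format
  - Properties the data format should have
    - Compact
    - Fast to encode and decode
    - Allows evolution
    - Allows upstream systems (those writing to the Kafka cluster) and downstream systems (those reading from the same Kafka cluster) to upgrade to a new schema at different times
  - JSON, for example, is self-describing, but it is not a compact data format and is slow to parse
  - Avro is a fast serialization framework that produces relatively compact output. Reading an Avro record, however, requires the schema that was used to serialize the data
  - One option is to store and transmit the schema with the record itself. This works when the schema is stored once and used for many records. Storing a schema in every Kafka record adds significant overhead in storage space and network utilization
  - Another option is to agree in advance on a set of identifier-to-schema mappings and refer to each schema by the identifier carried in the record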
Robust Message Serialization in Apache Kafka Using Apache Avro, Part 2
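- Implementing a schema store; implements a schema provider that works with Apache Kafka as its storage
- In-memory SchemaStore
  - First, an in-memory store for schemas can be implemented. This helps in understanding the requirements of such a store and of the cache for the Kafka-backed store. Because the SchemaStore must look up VersionedSchema entries quickly, a separate map is built to support each lookup method. With ConcurrentHashMap, these maps can be accessed from multiple threads without locking
- Writing to and reading from a Kafka topic
  - The other half of the Kafka-based SchemaProvider is a class that handles all communication with Kafka. It need not be tied to the schema concept, so it can be written as generic code. To read all schemas at startup and keep polling for new ones, the consumer is configured as follows
  - enable.auto.commit=false, because all schemas are re-read at startup
  - Assign all partitions to the consumer manually so that it does not share messages with another consumer that happens to have the same group.id
  - Seek to the oldest message before reading
  - Poll until the latest record has been read, then allow the schema provider to be used
  - Keep polling on a background thread to receive new schemas
- One important problem is generating schema identifiers
  - Kafka has no sequence object like an RDBMS, so each added schema needs a unique integer. One simple solution is to look up the next available positive integer. In that case, nothing prevents two administrators from adding schemas with the same identifier at almost the same time. To prevent this, the options are as follows
  - Use a Kafka topic with a single partition as a sequence: produce a single message and use its offset
  - "Lock" using a ZooKeeper ephemeral node
  - Introduce a service that adds schemas; this application can take the lock in memory
  - Delegate responsibility by allowing writes to the topic that stores the schemas only to a principal that few people can access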
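The in-memory SchemaStore and the "single service that locks in memory" option for identifier generation can be sketched together. The article's implementation is Java with ConcurrentHashMap; the Python version below is only an illustration, and the class name `InMemorySchemaStore`, its methods, and the fields of `VersionedSchema` are assumptions rather than the article's actual code.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedSchema:
    # Field names are assumed for this sketch.
    schema_id: int
    name: str
    version: int
    definition: str


class InMemorySchemaStore:
    """Keeps one map per lookup method and allocates ids under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_id = {}
        self._by_name_version = {}

    def add(self, name: str, definition: str) -> VersionedSchema:
        # The lock makes id and version allocation atomic, so two callers
        # adding schemas at nearly the same time cannot get the same id.
        with self._lock:
            schema_id = len(self._by_id) + 1
            version = 1 + sum(1 for (n, _) in self._by_name_version if n == name)
            schema = VersionedSchema(schema_id, name, version, definition)
            self._by_id[schema_id] = schema
            self._by_name_version[(name, version)] = schema
            return schema

    def get_by_id(self, schema_id: int):
        return self._by_id.get(schema_id)

    def get_by_name_version(self, name: str, version: int):
        return self._by_name_version.get((name, version))

    def all(self):
        with self._lock:
            return list(self._by_id.values())


store = InMemorySchemaStore()
threads = [
    threading.Thread(target=store.add, args=("user", f"definition-{i}"))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
ids = sorted(schema.schema_id for schema in store.all())
```

This only protects callers inside one process. With several independent writers, one of the other listed options (a single-partition topic used as a sequence, or a ZooKeeper lock) would be needed instead.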
Understanding the ‘enable.auto.commit’ Kafka Consumer property
Robust Message Serialization in Apache Kafka Using Apache Avro, Part 3
Interview with Jay Kreps about Apache Kafka
RDBMS to Kafka: Stories from the Message Bus Stop
카프카, 산전수전 노하우
Kafka timestamp offset
Resetting first dirty offset to log start offset since the checkpointed offset is invalid
Kafka 0.10 Compression Benchmark
How to use Apache Kafka to transform a batch pipeline into a real-time one
Kafka Korea meetup
- About 1st Conference
- KafkaKRU(Kafka 한국사용자 모임) 2회 미니밋업 후기
- Kafka kru CONFERENCE SEOUL 2019 후기 & 정리글
- mini-meetup2
- About 3rd Mini-Meetup
- Kafka Conference Seoul 2019
- 카프카 기반의 대규모 모니터링 플랫폼 개발이야기
Moving From Legacy To Event-Driven With Kafka
CDC & CDC Sink Platform 개발 1편 - CDC Platform 개발 | Hyperconnect Tech Blog Event Bus, Event Driven
CDC & CDC Sink Platform 개발 2편 - CDC Sink Platform 개발 및 CQRS 패턴의 적용 | Hyperconnect Tech Blog
CDC & CDC Sink Platform 개발 3편 - CDC Event Application Consuming 및 Event Stream Join의 구현 | Hyperconnect Tech Blog
카프카 컨슈머 애플리케이션 배포 전략
cloudurable.com/categories/kafka
- Kafka Tutorial 13: Creating Advanced Kafka Producers in Java
  - lz4 is a good compression choice; supply the size when decompressing, otherwise it ends up slower than snappy
Introduction to Schemas in Apache Kafka with the Confluent Schema Registry
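- Kafka does not provide a serde for JSON (though one can be implemented)
- Reasons to use Avro rather than JSON
  - Based on the Confluent Schema Registry (the store that holds schema information)
    - 1. Smaller data: field names need not be sent >> data: magic byte + schemaID + value
    - 2. When the schema of produced data changes, it only needs to be registered or updated in the Schema Registry, so consumers will most likely not need changes
  - With a schema-free format like JSON, frequent schema changes on the producing side force consumer changes, and it is easy to lose track of schema information, with no history available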
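The "magic byte + schemaID + value" layout noted above can be shown with a small framing helper. This sketch follows the Confluent wire format as commonly documented (a zero magic byte, then a 4-byte big-endian schema ID, then the serialized value); the function names `frame` and `unframe` are made up for this example, and the payload is treated as opaque bytes rather than real Avro output.

```python
import struct

MAGIC_BYTE = 0
# 1-byte magic followed by a 4-byte big-endian unsigned schema id.
HEADER = struct.Struct(">bI")


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an already-serialized value with the magic byte and schema id."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes):
    """Split a framed message into (schema_id, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to contain a header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


framed = frame(42, b"abc")           # b"abc" stands in for Avro-encoded bytes
schema_id, payload = unframe(framed)
```

Only five bytes of overhead travel with each record; the consumer uses the ID to fetch the full schema from the registry, which is why field names never need to be sent.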
(Kafka) 객체를 JSON 타입으로 넘겨보자 :: 당근케잌
Securing the Confluent Schema Registry for Apache Kafka
Kafka 스키마 관리, Schema Registry
- With Avro, however, indiscriminate producing of data with a changed schema can be prevented
Apache Kafka Supports 200K Partitions Per Cluster
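- On the number of partitions a Kafka cluster can hold
- Before 1.1.0, roughly 2,000 to 4,000 partitions per broker was considered appropriate; from the 1.1.0 release on, up to about 200,000 are possible
- The reason for this large change is that ZooKeeper updates are now processed asynchronously and new leader information is sent to brokers in batches, making it faster than versions before the 1.1.0 release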
Kafka 생태계 들여다보기
Big Data, Fast Data @ PayPal
- The story of PayPal's data platform. Explains the architecture, including why CDC (Change Data Capture), Kafka, and Avro should be used together
An Overview of Kafka Distributed Message System
- Explains Apache Kafka concepts
Kafka의 디스크가 모자랄 때
New Features of Kafka 2.1
카프카를 활용한 워크 큐
- More about what to build and in which direction than about the technology
How to Lose Messages on a Kafka Cluster
- Part 1
- Part 2
Kafka 클러스터 메세지 발행 및 문제 해결 :: 당근케잌
Kafka Using Java. Part 1
Kafka Using Java. Part 2
blog.voidmainvoid.net/category/.../Kafka
- blog.voidmainvoid.net/tag/kafka
- Kafka broker와 java client의 버젼 하위호환성 정리
Finding Kafka’s throughput limit in Dropbox infrastructure
Kafka, Producer 부터 Consumer 까지
kafka-multiprocessing-producer.py (needs checking that it works correctly)
kafka-tutorials.com
Kafka, Java, and Bitcoin
What's New in Kafka 2.2?
Understanding Kafka with Factorio
Kerberos 인증 #1
Kerberos 인증 #2
카프카 설치 시 가장 중요한 설정 4가지
kafka 운영 - 기본적인 환경 설정 경험담
KAFKA와 그 친구들 monitoring, 운영, test tool 소개
How to use reassign partition tool in Apache Kafka
How to move Kafka Partition log directory within a Broker Node
KAFKA ARCHITECTURE: LOG COMPACTION
Log Compacted Topics in Apache Kafka
- Consumer offset information is stored in a topic named __consumer_offsets, whose cleanup.policy is set to compact
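What a compact cleanup.policy means for a topic like __consumer_offsets can be illustrated with a simplified model: only the latest value per key survives, and a null value (tombstone) removes the key. The `compact` function and the key layout below are assumptions for this sketch; real brokers compact segment by segment and keep tombstones for a retention period before dropping them.

```python
def compact(log):
    """Return the log with only the most recent value for each key.

    `log` is a list of (key, value) pairs in offset order. A value of None
    is treated as a tombstone and removes the key from the result.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


# Keys modelled as (group, topic, partition); values are committed offsets.
log = [
    (("g", "t", 0), 5),
    (("g", "t", 1), 3),
    (("g", "t", 0), 9),     # newer commit for partition 0 replaces offset 5
    (("g", "t", 1), None),  # tombstone removes partition 1
]
compacted = compact(log)
```

Because only the newest commit per group and partition matters, compaction keeps the offsets topic small without losing the current position of any consumer group.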
Log Management in Apache Kafka - Speaker Deck
kafka 운영 - kafka의 Exception들 - (1)
Kafka 로그 종류 및 로그 샘플에 대한 설명
kafka 개발 - AdminClient 로 관리 기능 개발하기 - Broker 정보 보기
카프카 서버 디스크 최적화
BUILDING A RELATIONAL DATABASE USING KAFKA KarelDB, KCache, Avro, Calcite, Omid, Avatica
devidea.tistory.com/category/Big Data/Kafka
- 컨트롤러 분석
How LinkedIn customizes Apache Kafka for 7 trillion messages per day
LINE에서 Kafka를 사용하는 방법 – 1편
LINE에서 Kafka를 사용하는 방법 – 2편
카프카를 쿠버네티스 위에 올리는게 좋은 선택일까?
Running Apache Kafka on Kubernetes
아파치 카프카🚀를 알아야하는 이유! 카프카의 미래? 앞으로 어떻게될까?
Serverless Kafka on Kubernetes | DevNation Live
Apache Kafka Producer Improvements with the Sticky Partitioner
KafkaProducer Client Internals
Incremental Cooperative Rebalancing in Apache Kafka: Why Stop the World When You Can Change It?
강의 - 아파치 카프카
kubernetes, python, kafka 메모
Using graph algorithms to optimize Kafka operations, Part 1
Using graph algorithms to optimize Kafka operations, Part 2
Apache Kafka as a Service with Confluent Cloud Now Available on Azure Marketplace
카프카 컨슈머 멀티쓰레드 애플리케이션 예제코드(for scala)
링크드인은 왜 카프카를 만들었나
링크드인이 카프카를 직접 개발한 이유 - 테크잇
Disaster Recovery Plans for Apache Kafka
Resiliency and Disaster Recovery with Kafka | by eBay TechBlog | eBayTech | Medium
카프카 클러스터 클러스터ip DNS 연동방법. use_all_dns_ips 사용(in AWS, route53)
Is Apache Kafka a Database? - The 2020 Update
Kafka-client client.dns.lookup 옵션 정리
기본 개념잡기
Ordering of events in Kafka
Why Kafka Is so Fast. Discover the deliberate design… | by Emil Koutanov | The Startup | Medium
Is Apache Kafka a Database?. Can and should Apache Kafka replace a… | by Kai Waehner | Medium
kafka 아는 척하기 (개발자용) :: 자바캔(Java Can Do IT)
Kafka is not a Database – Materialize
Sizing Calculator for Apache Kafka and Confluent Platform
Thread-Per-Core Buffer Management for a modern Kafka-API storage system - Vectorized
Introducing Confluent’s Parallel Consumer Message Processing Client
Intro to Apache Kafka: How Kafka Works
Kafka Operations(Production Deployment) – Sori-Nori
Disaster Recovery for Multi-Region Kafka at Uber | Uber Engineering Blog
- summary
How Zendesk Secures Kafka with Self-Hosted mTLS Authentication System
Property Based Testing Confluent Cloud Storage for Fun and Safety
Kafka on Kubernetes, minimal configuration
Beyond the Brokers: A Tour of the Kafka Ecosystem
스케일아웃없이 순간 급증하는 주문 처리하기 (Microservice with Kafka)
Kafka for Engineers. Here are things about Kafka that you… | by Dave Taubler | Level Up Coding
Kafka 운영 컨슈머 그룹 정보는 언제 사라질까? :: 언제나 김김
a-great-day-out-with/a-great-day-out-with.github.io
- A Great Day Out With... Apache Kafka
KafkaConsumer Client Internals
Apache Kafka for Industrial IoT and Manufacturing 4.0 - Kai Waehner
Cannot get state store TOPIC because the stream thread is STARTING, not RUNNING 에러 해결 ktable
A gentle introduction to Apache Kafka
Event Driven Architecture using Kafka | LinkedIn
Kafka in the Wild • Laura Schornack & Maureen Penzenik • GOTO 2021 - YouTube Domain Driven Design for Realtime, Ubiquitous, Distributed Data
How Agoda manages 1.5 Trillion Events per day on Kafka | by Shaun Sit | Agoda Engineering & Design | Jul, 2021 | Medium
Kafka 는 왜 빠를까? - 상구리의 기술 블로그
Kafka 클러스터 구성 및 장애 해결 :: 당근케잌
Integrate Apache Kafka and SAP with the Kafka Connect ODP Source Connector
Logstash의 Kafka Input 성능 개선 이야기
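- Describes the improvement process for fixing a sharp increase in Kafka lag while using Logstash
- First increased the partition count, which did not help; a closer look showed that consumers were not evenly attached to partitions
- Applied round robin via partition_assignment_strategy, but lag grew again as traffic increased
- Looking more closely at what lag means, it turned out to be the difference between the last produced message and the offset the consumer has marked as consumed, so reducing auto_commit_interval_ms from 5 seconds to 1 second resolved the lag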
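The definition of lag used in that write-up (latest produced offset minus the offset the consumer has committed) explains why a shorter commit interval shrinks the reported number. The sketch below uses plain dictionaries instead of a Kafka client; the function name `consumer_lag`, the offsets, and the choice to treat a missing commit as offset 0 are all assumptions made for illustration.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus last committed offset.

    Partitions with no commit yet are treated as committed at 0, which is
    an assumption of this sketch.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


end_offsets = {0: 1000, 1: 800}

# The consumer has actually processed up to these positions...
processed = {0: 990, 1: 795}
# ...but with a long auto-commit interval the last commit is stale.
committed_slow = {0: 500, 1: 400}
# With a shorter interval the commit tracks the real position closely.
committed_fast = {0: 985, 1: 790}

reported_slow = consumer_lag(end_offsets, committed_slow)
reported_fast = consumer_lag(end_offsets, committed_fast)
actual_backlog = consumer_lag(end_offsets, processed)
```

In this toy example the consumer is equally far behind in both cases; only the reported lag differs, because monitoring sees committed offsets rather than the consumer's in-memory position.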
Scaling Kafka Consumer for Billions of Events | by Archit Agarwal | The PayPal Technology Blog | Nov, 2021 | Medium
Postgres, Kafka, and the Mysterious 100 GB – Coding, Climbing, and Commentary
The Top 5 Apache Kafka Use Cases and Architectures in 2022
A series of posts summarizing what you need to know to use Kafka on AWS
- Practical Kafka – Intro (1) – 1ambda
  - Explains the Kafka architecture and elements such as Broker, Producer, and Consumer
  - Explains how partitioning and replication are done across brokers using topics
- Practical Kafka – Concept (2) – 1ambda
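  - Explains partition assignment, which decides which partitions a consumer takes, and how reassignment proceeds
  - Explains the newer rebalancing improvements: Static Membership, which skips reassignment even when a consumer is removed and re-added, and the Incremental Rebalancing Protocol, which rebalances only the consumers that need it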
컨플루언트 김현수 상무 I 이벤트 기반 마이크로서비스 아키텍처에서의 Apache Kafka 역할 on Vimeo
Building and Scaling a Control Plane for 1000s of Kafka Clusters - YouTube
Consuming over 1 billion Kafka messages per day at Ifood | by felipe volpone | Nov, 2021 | Medium
Kafka로 메시지와 이벤트 처리하기 - (1) Kafka 세팅하기 | woolog - 개발자 울이
Kafka로 메시지와 이벤트 처리하기 - (2) Python으로 consumer, producer 만들기 | woolog - 개발자 울이
Working with Data in a Connected World - Clair J. Sullivan | PyData Global 2021 - YouTube
APACHE-KAFKA - YouTube
Kafka NetworkClient Internals
Apache Kafka in the Automotive Industry - YouTube
Kafka Tutorial - Spring Boot Microservices - YouTube
Top 5 Courses to Crack Confluent Apache Kafka Developer Certification (CCDAK) in 2022 - Best of Lot
‘아파치 카프카’, 개념부터 사용례까지 - CIO Korea
Kafka Lag 없는 실시간 데이터 파이프라인을 위한 아키텍처 개선기 - AB180 엔지니어링 베이스 | 기술블로그
Kafka- Best practices & Lessons Learned | By Inder | by Inder Singh | Medium
Make a real-time query across multiple microservices using Kafka | by Mohammed Ragab | Nerd For Tech | Medium
Kafka on The Microservice Architecture | by Andhika Yusup | Medium
간단한 카프카 환경 구성하기

Kafka Library

aiokafka - asyncio client for kafka http://aiokafka.readthedocs.io
akhq: Kafka GUI for Apache Kafka to manage topics, topics data, consumers group, schema registry, connect and more...
- 카프카 매니저를 대체할 수 있을까?! AKHQ (Apache Kafka HQ) :: 언제나 김김
burrow - Kafka Consumer Lag Checking
- Burrow - an effective open source tool for monitoring Kafka consumer lag
- Revisiting Burrow: Burrow 1.1 A consumer monitoring tool for Apache Kafka, built and open-sourced by LinkedIn's SRE team
- Apache Kafka Lag Monitoring and Metrics at AppsFlyer
- kafka-lag-dashboard
Conduktor - the ultimate Apache Kafka Desktop Client
Cruise-control - the first of its kind to fully automate the dynamic workload rebalance and self-healing of a kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters
Flafka: Apache Flume Meets Apache Kafka for Event Processing
Greyhound - Rich Kafka client library
- Kafka Cron using wix/greyhound. I think one of the best ways to learn… | by Algimantas Krasauskas | Wix Engineering | Dec, 2020 | Medium
hive
- Kafka Storage Handler Module
kafka-docker: Dockerfile for Apache Kafka
Kafka Manager - A tool for managing Apache Kafka
- hub.docker.com/r/sheepkiller/kafka-manager
- Kafka Manager Consumer Lag Exporter
- When you want a detailed view of the per-partition offsets of the destination topic, connect the destination topic to a monitoring system
kafka-monitor - Monitor the availability of Kafka clusters with generated messages
- URP? Excuse You! The Three Metrics You Have to Know (Todd Palino, Linkedin) Kafka Summit 2018
Kafka Offset Monitor - an app to monitor your kafka consumers and their position (offset) in the queue
Kafka-Sprout: Web GUI for Kafka Cluster Management
kafka-statsd-metrics2
kafka tools - A collection of tools for working with Apache Kafka
Kafractive - interactive CLI tool for kafka admin, built on top of Spring Shell
kowl: Kafka WebUI for exploring messages, consumers, configurations and more with a focus on a good UI & UX
KSETL로 Kafka 스트림 ETL 시스템을 빠르게 구성하기 - 2021 Korean version - - YouTube
KubeMQ: A Modern Alternative to Kafka - DZone Microservices
librdkafka: The Apache Kafka C/C++ library
MAADS Machine Learning and AI at Scale with MAADS-VIPER and Apache Kafka
rest proxy
- 카프카의 토픽 데이터를 REST api로 주고받자 - Kafka rest proxy 사용
- Confluent REST Proxy 6.0: Putting Apache Kafka to REST
spring-kafka-example: Example source code for KafkaKRU meetup
Trifecta - a web-based and Command Line Interface (CLI) tool that enables users to quickly and easily inspect, verify and even query Kafka messages
trivup - Trivially Up a cluster of applications
- A tool for programmatically bringing up and tearing down Kafka clusters. Supports Kafka's SSL authentication and encryption for client applications
uGroup Introducing uGroup: Uber’s Consumer Management Framework
zoe: The missing companion for Kafka

Kafka Stream

카프카 스트림즈 All stream threads have died. 오류 해결 방안
REACTIVE STREAMS FOR APACHE KAFKA
This is a Kafka-Storm-Esper example on vagrant
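1. With Kafka, streams are sent using Producer.send; does the legacy system need separate code for this? => With Kafka you normally implement both a producer and a consumer. In a Kafka-Storm setup, KafkaSpout plays the consumer role.
2. How can a stream created by KafkaSpout be debugged as it enters a Storm Bolt? => There is no remote debugging; debug through the log files configured with -Dstorm.log.dir.
3. How can duplicate streams arriving at a bolt be reduced to unique data? => Use Trident for unique data processing; Trident supports Storm implementations (aggregation and so on). -> Esper can build queries such as group by, which may look like it overlaps with Trident, but it should be possible to receive unique data through Trident and then run Esper queries on it.
4. Are there any expected problems when integrating with zmq instead of Kafka? Both zmq and Kafka act as queues, so unless there is a particular reason not to, using zmqspout seems like a good option.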
Distributed, Real-time Joins and Aggregations on User Activity Events using Kafka Streams
Tweeter: Processing Tweets with Kafka Streams
내부 데이터 파이프라인에 Kafka Streams 적용하기
- Line: Applying Kafka Streams for internal message delivery pipeline
Quick Recipe for #Kafka Streams in #Clojure
Perfecting Lambda Architecture with Oracle Data Integrator (and Kafka / MapR Streams)
- The process of configuring Oracle Data Integrator and Apache Kafka / MapR Streams to capture changes in a MySQL database as a stream
Streaming databases in realtime with MySQL, Debezium, and Kafka
- An article about WePay using Debezium as a MySQL change data capture solution that streams data into Kafka
Kafka + Spark-Streaming with Python으로 실시간 분석시스템 만들기
Kafka + Spark-Streaming with Python으로 실시간 분석시스템 만들기(2)
Reading data securely from Apache Kafka to Apache Spark
- Cloudera recently integrated Apache Kafka, Apache Spark, and Apache Ranger to provide encryption and authorization for Spark jobs that work with Kafka
- Explains how this was implemented and why it was designed this way
Kafka Connect vs StreamSets: advantages and disadvantages?
- A comparison of Kafka Connect and StreamSets Data Collector
Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector
- Posted on the StreamSets Dataflow Performance Blog
- Explains synchronization with the Confluent Schema Registry for storing Avro schema versions
- Explains how to serialize/deserialize data with a schema-aware producer using the StreamSets Data Collector tool
Performance Tuning of an Apache Kafka/Spark Streaming System - Telecom Case Study
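- Performance tuning of a real-world application involving Apache Kafka, Spark Streaming, and Apache Ignite (RDD caching)
- Covers increasing the number of Kafka partitions, changing RPC timeout settings, tuning both Spark and Ignite memory, adjusting the batch interval, and more
Build Services on a Backbone of Events
- Argues that Apache Kafka is more transformative than simply being a faster ETL
- Highlights the benefits Kafka offers: integration across streaming, applications, and databases; distributed ETL (rather than a centralized monolith); scale; and reliability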
Recent Evolution of Zero Data Loss Guarantee in Spark Streaming With Kafka
Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
Getting Started with the Kafka Streams API using Confluent Docker Images
Real-time Financial Alerts at Rabobank with Apache Kafka’s Streams API
- Describes how Rabobank moved its customer alerting system from the mainframe to Apache Kafka (multi-data-center deployment, built with Kafka Streams)
Real-Time Anomaly Detection Streaming Microservices with H2O and MapR – Part 1: Architecture
- Introduces an architecture that streams IoT sensor data to detect anomalies
Streaming Kafka Messages to MySQL Database in combination with Flume
Integrating Kafka and Spark Streaming: Code Examples and State of the Game
Spark Streaming with Kafka and Cassandra
Ranking Websites in Real-time with Apache Kafka’s Streams API
- Introduces how Zalando, Europe's largest online fashion retailer, uses Apache Kafka to index and rank information from fashion websites
- The system uses the HITS (Hyperlink-Induced Topic Search) algorithm and is built on Kafka Streams
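For readers unfamiliar with HITS, the following is a minimal batch sketch of the algorithm on a tiny hypothetical link graph. It does not reflect how the linked post implements ranking on Kafka Streams; the function name, the graph, and the iteration count are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]],
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed graph of adjacency lists."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to a node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages a node links to.
        hub = {n: sum(auth[dst] for dst in links.get(n, [])) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: pages "a" and "b" link to pages "c" and "d".
graph = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}
hub_scores, auth_scores = hits(graph)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(auth_scores, key=auth_scores.get)
print(best_hub, best_authority)
```

In this graph "c" is linked from both hubs and "b" links to both authorities, so they end up with the highest authority and hub scores respectively.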
Using Kafka Streams API for predictive budgeting
lenses - a Streaming Data Management Platform for Apache Kafka
- How to explore data in Kafka topics with Lenses - part 1
- stream-reactor - Streaming reference architecture for ETL with Kafka and Kafka-Connect. You can find more on http://landoop.com on how we provide a unified solution to manage your connectors, most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more http://www.landoop.com/kafka/connectors
Kafka & Redis Streams
Enabling Exactly-Once in Kafka Streams
Migrating Batch ETL to Stream Processing: A Netflix Case Study with Kafka and Flink
- Describes the talk on Netflix's stream processing system given at QCon New York 2017
- Built with Apache Kafka, Apache Flink, Apache Mesos, and others
- Analyzes data from video playback / search events
- Also covers the challenges Netflix faced and the strategies implemented in response
Of Streams and Tables in Kafka and Stream Processing, Part 1 Explains the concepts of streams and tables
Kafka streams Java application to aggregate messages using a session window A basic Java Kafka Streams example
Neha Narkhede | Kafka Summit 2017 Keynote (Go Against the Flow: Databases and Stream Processing) KSQL demo
Neha Narkhede | Kafka Summit 2018 Keynote (The Present and Future of the Streaming Platform) London
Kafka Summit London
Introducing Hortonworks Streams Messaging Manager (SMM)
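- An operations management tool and API for Apache Kafka
- Shows metrics for Kafka's four entities (producer, topic, broker, consumer), and provides a unified platform for one or more (secure) Kafka clusters as well as a REST API for each class
- Highly compatible with the vendor's own product family: Apache Atlas, Ranger, and Ambari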
Testing Kafka Streams Applications
Kafka Streams for Stream processing A few words about how Kafka works
Building Secure and Governed Microservices with Kafka Streams
- How a trucking freight company builds an application with Kafka Streams that captures and analyzes geo-event sensor data
Learn kafka streams by making the tests pass
- A workshop for learning Apache Kafka Streams
Apache Kafka leaves the Zoo
Using Graph Processing for Kafka Stream Visualizations
Making sense of Avro, Kafka, Schema Registry, and Spark Streaming
Kafka Spark Streaming Integration in java from scratch | Code walk through - YouTube
Streaming the last few minutes from Kafka using Akka Streams
How to Test Kafka Streams Applications
Streaming With Probabilistic Data Structures: Why & How | by Eliav Lavi | Riskified Technology | Oct, 2020 | Medium
Batch to Real-Time Streams: 8 Years of Event Streaming with Apache Kafka
카프카 스트림즈 Exactly-once 설정하는 방법과 내부 동작
카프카 스트림즈! 대용량, 폭발적인 성능의 실시간 데이터 처리! - YouTube
카프카 스트림즈에서 stateful window 처리를 다루는 방법 그리고 커밋타이밍
Kafka Streams 101 - Rock the JVM Blog
Deep Dive into Apache Kafka: Your go-to Event Streaming Framework. | by Jay | Nov, 2021 | Medium
Event Sourcing with Kafka Streams in Production — Lessons Learned | by Nico | comsystoreply | Medium
Hands-on Kafka Streams in Scala
brooklin - An extensible distributed system for reliable nearline data streaming at scale
- Open Sourcing Brooklin: Near Real-Time Data Streaming at Scale
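- A general-purpose framework developed as an alternative to Kafka Connect + MirrorMaker. Besides being scalable, it supports a variety of storage / streaming systems in addition to Kafka
- Requires setting up its own cluster; it was released in July 2019, so there is almost no material on it yet
- As with MirrorMaker 1/2, monitoring is done by connecting over JMX to the process, which uses a Kafka producer internally, and checking the producer sender metrics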
Debezium - Stream changes from your database
- debezium: Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
- How Debezium & Kafka Streams Can Help You Write CDC Solution How to set up a platform that captures data written to MySQL and MongoDB using Debezium and Kafka
- DevNation Live: Kafka and Debezium
- Change Data Streaming Patterns for Microservices with Debezium
- Using Debezium, CDC for Apache Kafka, with PostgreSQL and MongoDB – Flant blog
- Practical Change Data Streaming Use Cases with Apache Kafka & Debezium
- Configuring Topic Auto-Creation with the Debezium UI - YouTube
- Hans-Peter Grahsl&Gunnar Morling - Dissecting our Legacy: The Strangler Fig Pattern with ... - YouTube
- Scheduling Millions Of Messages With Kafka & Debezium | by Elia Rohana | Yotpo Engineering | Medium
Decaton Kafka를 이용한 작업 큐 라이브러리 'Decaton' 활용 사례 - LINE ENGINEERING
kafka connect
- Kafka Connect S3 Source Connector
- Presto Kafka connector 개선 실패기
- Splunking Kafka with Kafka Connect
  - Describes a new Kafka Connect plugin for sending data from Kafka to Splunk (including architecture and design choices)
  - Includes a tutorial on configuring Kafka Connect to stream data from a Kafka topic to a Splunk Heavy Forwarder
- The Simplest Useful Kafka Connect Data Pipeline In The World … or Thereabouts
  - Part 1
    - Explains by example how to use Apache Kafka Connect for change data capture from an RDBMS (MySQL in this case)
  - Part 2
  - Part 3
- Getting started with the Kafka Connect Cassandra Source Introduces how to set up streaming into Kafka using the Cassandra Source Connector provided by Landoop
- Connecting Kafka to MinIO. How to connect data being distributed… | by Alex | The Startup | Medium
- How to Write a Kafka Connector with Proper Configuration Handling
- kafka-connect-datagen: Connector that generates data for demos
  - kafka-connect-datagen 커넥터로 테스트 데이터 생성하기
- Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka
  - Alpakka Kafka connector — an open-source Reactive Enterprise Integration library for Java and Scala
  - Retrying consumer architecture with Alpakkas
- MirrorMaker2 kafka/connect/mirror at trunk · apache/kafka
  - How to run Kafka Mirror Maker using Kerberos clusters
  - MirrorMaker Performance Tuning Tuning Kafka for Cross Data Center Replication
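    - Set compression.type
    - Using it on the producer saves network bandwidth and broker-side CPU
    - This tool has traditionally shipped inside the Kafka project, but its design is old and does not scale well, so it is not recommended unless you are migrating the contents of a fairly old cluster
  - Kafka Replication: The case for MirrorMaker 2.0
    - Developed by a Cloudera engineer as an alternative to MirrorMaker 1. Much better than version 1, but not yet officially bundled, so documentation is lacking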
  - MirrorMaker2 가 release되었습니다
  - MirrorMaker2 마이그레이션
kafka-spark-consumer High Performance Kafka Consumer for Spark Streaming. Now Support Spark 2.0 and Kafka 0.10
Kafka Streams examples
kafka-streams-viz - Kafka Streams Topology Visualizer
KSQL
- Introducing KSQL: Open Source Streaming SQL for Apache Kafka
  - A replacement for Spark Streaming?
  - Provides an interface for using SQL on Apache Kafka
- Getting Started Analyzing Twitter Data in Apache Kafka through KSQL
  - An example that filters Twitter streaming data with KSQL predicates and builds aggregations such as the number of tweets per user per hour
- KSQL: Streaming SQL for Apache Kafka
- Taking KSQL for a Spin Using Real-time Device Data
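  - A post demonstrating a simple streaming program with KSQL
  - The input is a stream of digital sensor data from a driving game steering wheel
- Building a Microservices Ecosystem with Kafka Streams and KSQL
  - An example of building a synchronous transactional system with Kafka Streams
  - Mentions the idea of using KSQL, through the sidecar pattern, to implement the patterns for non-JVM languages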
- KSQL January release: Streaming SQL for Apache Kafka
- How to Write a User Defined Function (UDF) for KSQL
  - Runtime configuration of user-defined functions (UDFs) is not yet supported, but you can write and build your own functions
- KSQL in Action: Real-Time Streaming ETL from Oracle Transactional Data
- Secure Stream Processing with Apache Kafka, Confluent Platform and KSQL
- We ❤ syslogs: Real-time syslog Processing with Apache Kafka and KSQL
  - Part 1: Filtering Syslog Explains how to bring logs into Kafka in Avro format using an Apache Kafka Connect plugin and then analyze them with KSQL
  - Part 2: Event-Driven Alerting with Slack
  - Part 3: Enriching events with external data
    - Describes building a streaming application that applies KSQL to syslog data in Apache Kafka, joined with MongoDB data
    - Slack for alerting, ES for visualization
- How to Build a UDF and/or UDAF in KSQL 5.0 How to use user-defined aggregate functions in KSQL 5.0
- ATM Fraud Detection with Apache Kafka and KSQL
  - ATM Fraud Detection with Kafka and KSQL - Hands on Guide
- Real-Time Sysmon Processing via KSQL and HELK — Part 1: Initial Integration
  - HELK: a stack that extends the standard ELK to analyze security event logs
  - This post explains how to do further analysis through KSQL
- Machine learning & Kafka KSQL stream processing — bug me when I’ve left the heater on
- 아파치 카프카 테스트용 data generator 소개 - ksql-datagen
- KSQL - 효과적이고 간단한 스트리밍 프로세스 SQL엔진
- How to handle deserialization errors using ksqlDB
  - ksqldb-handle-deserialization-errors: How to handle deserialization errors.
- ksqlDB - The event streaming database purpose-built for stream processing applications
  - How Real-Time Stream Processing Safely Scales with ksqlDB
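One of the KSQL posts listed above filters a tweet stream with a predicate and counts tweets per user per hour. The sketch below shows the same idea (a filter followed by an hourly tumbling-window count) in plain Python over an in-memory list. It is not KSQL; the tweet tuples, the keyword predicate, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

Tweet = Tuple[str, str, datetime]  # (user, text, created_at)


def hourly_tweet_counts(tweets: Iterable[Tweet],
                        keyword: str) -> Dict[Tuple[str, datetime], int]:
    """Count matching tweets per (user, hour window start)."""
    counts: Counter = Counter()
    for user, text, created_at in tweets:
        # Predicate: keep only tweets mentioning the keyword (case-insensitive).
        if keyword.lower() not in text.lower():
            continue
        # Tumbling window: truncate the timestamp to the start of its hour.
        window_start = created_at.replace(minute=0, second=0, microsecond=0)
        counts[(user, window_start)] += 1
    return dict(counts)


def at(hour: int, minute: int) -> datetime:
    """Helper for building sample timestamps on an arbitrary day."""
    return datetime(2021, 1, 1, hour, minute, tzinfo=timezone.utc)


tweets = [
    ("alice", "Kafka is fun", at(10, 5)),
    ("alice", "more kafka", at(10, 40)),
    ("alice", "lunch time", at(10, 50)),
    ("bob", "KAFKA streams", at(10, 59)),
    ("alice", "kafka again", at(11, 1)),
]
counts = hourly_tweet_counts(tweets, "kafka")
print(counts)
```

A streaming engine keeps these counts updated continuously as events arrive, whereas this sketch processes a finished list once.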
mockedstreams - Scala DSL for Unit-Testing Processing Topologies in Kafka Streams
stream-reactor Streaming reference architecture built around Kafka. http://datamountaineer.com/2016/01/12/streamliner

Kudu

Kudu
getkudu.io
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Apache Kudu as a More Flexible And Reliable Kafka-style Queue
Big Data: current trends & next big thing 'Apache Kudu' - my takeaways from Strata + Hadoop 2016 @San Jose
#bbuzz 2016: Todd Lipcon - Apache Kudu (incubating): Fast Analytics on Fast Data
Build a Prediction Engine Using Spark, Kudu, and Impala
Creating a Post-Lambda World with Apache Kudu
Up and running with Apache Spark on Apache Kudu
Apache Kudu 1.3.0 was released
- Apache Kudu 1.3.0 release
- Adds new features such as Kerberos authentication, encrypted transport using TLS, and coarse-grained authorization
- Includes several optimizations, such as switching to LZ4 compression
Apache Kudu Read & Write Paths
kudu-master clustering
```
kudu-master \
  --master_addresses=172.23.30.101,172.23.30.102,172.23.30.103 \
  --fs_data_dirs=/data1/kudu/master/data \
  --fs_wal_dir=/data1/kudu/master/wal \
  --log_dir=/opt/log/kudu \
  --raft_get_node_instance_timeout_ms=60000
```
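- When started on three machines as above, the masters reach consensus and, after a leader is elected, a separate 000000000000000000 file is created under /data1/kudu/master/data
- Once the masters have come up successfully, restarting them does not produce errors even if cluster nodes go down
- If an error does occur, delete the /data1/kudu/master/data and /data1/kudu/master/wal directories, then make sure the processes on the cluster IPs are started again within raft_get_node_instance_timeout_ms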
Low latency high throughput streaming using Apache Apex and Apache Kudu
- Explains a high-performance stream processing approach using Apache Kudu and Apache Apex
A brave new world in mutable big data relational storage (Strata NYC 2017)
Kudu를 이용한 빅데이터 다차원 분석 시스템 개발
Guide to Using Apache Kudu and Performance Comparison with HDFS
Transparent Hierarchical Storage Management with Apache Kudu and Impala
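- Hierarchical storage management using Apache Kudu and Impala
- A sliding window pattern that uses Apache Impala with data stored in Apache Kudu and Apache HDFS
- With this pattern, the benefits of multiple storage tiers can all be realized in a way that is transparent to users
- Apache Kudu is designed for fast analytics on rapidly changing data. It combines fast inserts/updates with efficient columnar scans, supporting multiple real-time analytic workloads on a single storage tier. This makes it a good fit in a data pipeline as the place to store real-time data that can be queried at any time. It also supports row updates and row deletes in real time, which allows for late-arriving data and data corrections
- Apache HDFS is designed for unlimited scalability at low cost. It is therefore optimized for batch-oriented use cases where data is immutable. Paired with the Apache Parquet file format, it gives access to structured data with very high throughput and efficiency
- When data is small and constantly changing, as with dimension tables, it is common to keep all of it in Kudu. Even large tables are kept in Kudu when the data fits within Kudu's scaling limits, because they can benefit from Kudu's unique features. When data is large, batch-oriented, and immutable, storing it in HDFS in the Parquet format is preferred. When the benefits of both storage tiers are needed, the sliding window pattern is an effective solution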
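The sliding window pattern described above keeps recent, still-mutable partitions in Kudu and older, immutable partitions in HDFS as Parquet. A minimal sketch of that routing decision follows. The 90-day window, the tier labels, and the function name are assumptions for illustration, not values from the linked post.

```python
from datetime import date, timedelta


def storage_tier(partition_day: date, today: date, window_days: int = 90) -> str:
    """Return the tier a daily partition is expected to live in.

    Partitions on or after the boundary stay in the mutable tier ("kudu");
    older ones are assumed to have been moved to the immutable tier
    ("hdfs_parquet"). The 90-day default is a hypothetical window length.
    """
    boundary = today - timedelta(days=window_days)
    return "kudu" if partition_day >= boundary else "hdfs_parquet"


today = date(2024, 6, 30)
recent = storage_tier(date(2024, 6, 1), today)
on_boundary = storage_tier(date(2024, 4, 1), today)
old = storage_tier(date(2024, 3, 31), today)
print(recent, on_boundary, old)
```

Queries would then read both tiers through a single view split at the same boundary, so users do not need to know where a given day is stored.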
Testing Apache Kudu Applications on the JVM
Kudu as Storage Layer to Digitize Credit Processes

Kylin

Kylin Extreme OLAP Engine for Big Data
빅데이터 다차원 분석 플랫폼, Kylin
Apache Kylin 2.2.0 is released
- Includes features such as managing ACLs at the table level using Apache Ranger
Using Hue to interact with Apache Kylin in your cluster or on AWS Explains how to query Apache Kylin from Hue through the JDBC driver, including on AWS EMR

Kyuubi

Kyuubi Project Incubation Status - Apache Incubator
- distributed multi-tenant Thrift JDBC/ODBC server for large-scale data management, processing, and analytics, built on top of Apache Spark and designed to support more engines (i.e., Apache Flink)

Mesos

Mesos
Advanced Mesos Course
Spark(1.2.1 -> 1.3.1) 을 위한 Mesos(0.18 -> 0.22.rc) - Upgrade
mesos, omega, borg: a survey
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
minmesos - Testing infrastructure for Mesos frameworks
메소스(mesos) 공부

Metron

Metron A security-focused analytics system

Nifi

Nifi Apache nifi is an easy to use, powerful, and reliable system to process and distribute data
NiFi를 이용한 빅데이터 플랫폼 개선
NSA의 Dataflow 엔진 Apache NiFi 소개와 설치
NiFi vs Falcon vs Oozie
NiFi 소개 발표 자료
Introduction to Apache NiFi and Storm
Apache NiFi 1.x Cheatsheet
- Apache NiFi has so many Processors that you often have to search for the right one; this cheatsheet introduces the most commonly used Processors
- Also covers NiFi's REST API
NiFi User Interface Overview
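Setting up a real-time Kafka consumer cluster
- Read a post saying that it starts up quickly, is easy to monitor, manages complex transforms through a GUI, and handles partitioning on its own
- Would it also suit simple jobs?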
Apache NiFi 소개 및 Tensorflow 연동
HORTONWORKS DATAFLOW (HDF) 3.1 BLOG SERIES PART 5: INTRODUCING APACHE NIFI-ATLAS INTEGRATION Briefly explains how Apache NiFi and Apache Atlas are integrated in Hortonworks DataFlow to track data in Kafka, Hive, and other systems
What’s new in Hortonworks DataFlow (HDF) 3.2?
Best practices for using Apache NiFi in real world projects - 3 takeaways
- Introduces practices needed to go from a PoC to a production deployment
Building an IIoT system using Apache NiFi, MiNiFi, C2 Server, MQTT and Raspberry Pi An example of using Apache NiFi for IoT
HDF/HDP Twitter Sentiment Analysis End-to-End Solution
IoT with Apache MXNet and Apache NiFi and MiniFi
Introduction to Apache NiFi dws19 DWS - DC 2019
Using Apache NiFi for Speech Processing: Speech to Text with Mozilla/Baidu's Deep Search in Tensorflow
NiFi & NiFi Registry on the Google Cloud Platform with Cloud Source Repositories
Building Data Pipelines on Apache NiFi with Python An introduction, but very rich in content
How Apache Nifi works — surf on your dataflow, don’t drown in it
Processing one billion events per second with NiFi
- Importing RDBMS Data Into Hive Using NiFi on CDP Public Cloud
NiFi as a Function in DataFlow Service - Cloudera Blog

Nutch

Nutch
Apache Nutch - an open source web search engine

Oozie

Oozie
How-to: Use the New Apache Oozie Database Migration Tool
Jailbreak Oozie Spark action

Ozone

Introducing Apache Hadoop Ozone: An Object Store for Apache Hadoop
- Introduces Apache Hadoop Ozone, which sits on top of the Hadoop storage layer. An alpha version was released recently
- Core concepts
  - SCALABLE
    - Ozone is designed to scale to tens of billions of files and blocks and, in the future, even more
    - Small files or huge number of datanodes are no longer a limitation
  - CONSISTENT; Storage Layer uses RAFT protocol for consistency
  - CLOUD-NATIVE; Hadoop Ozone is designed to work well in containerized environments like YARN and Kubernetes
Apache Hadoop Ozone – Object Store Architecture
One billion files in Ozone

Parquet

Parquet
Using Apache Parquet at AppNexus
Dremel made simple with Parquet
Benchmarking Apache Parquet: The Allstate Experience
fastparquet - A Python interface to the Parquet file format
Sorting and Parquet
- Sorting data before serializing it to Apache Parquet can make a large difference in query performance
- This post explains why and offers ideas for identifying which columns to sort by
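One common reason sorting helps is that Parquet keeps min/max statistics per row group, so a reader can skip row groups whose range cannot match a filter. The sketch below simulates that effect with plain Python lists and does not read or write Parquet. The row-group size, the data, and the function names are hypothetical, and the linked post may emphasize other effects as well.

```python
import random
from typing import List, Tuple


def row_group_stats(values: List[int], group_size: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size row groups and return (min, max) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_scan(stats: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for g_min, g_max in stats if g_max >= lo and g_min <= hi)


random.seed(0)
values = list(range(10_000))
random.shuffle(values)

# Same data and the same filter, with and without sorting before "writing".
unsorted_scan = groups_to_scan(row_group_stats(values, 1_000), 2_000, 2_099)
sorted_scan = groups_to_scan(row_group_stats(sorted(values), 1_000), 2_000, 2_099)
print(unsorted_scan, sorted_scan)
```

With sorted input only the row group that actually contains the requested range overlaps the filter, while shuffled input leaves every row group with a wide min/max range that cannot be ruled out.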
Parquet Internal Part 1. Google Dremel(1)
🌲Parquet(파케이)란? 컬럼기반 포맷 장점/구조/파일생성 및 열기
Working with Data in a Connected World - Clair J. Sullivan | PyData Global 2021 - YouTube

Phoenix

Phoenix High performance relational database layer over HBase for low latency applications
Apache Phoenix Joins Cloudera Labs
Apache Phoenix: Use Cases and New Features
- Introduces use cases including Argus, which uses HBase + Phoenix as a time series DB; Apache Tephra, which enables ACID transactions; and Apache Calcite, a cost-based query optimizer
The Apache Software Foundation: Column Mapping and Immutable Data Encoding of Apache Phoenix 4.10
- Apache Phoenix 4.10 release
- Introduces the new column mapping and immutable data encoding features
- Considerable speedups and space savings on the TPC-H benchmark
Apache Spark Plugin
3 Steps for Bulk Loading 1M Records in 20 Seconds Into Apache Phoenix Explains how to use Apache Spark to generate HFiles compatible with Apache HBase and Apache Phoenix
Apache Phoenix for CDH
- Apache Phoenix for CDH

Pig

Pig
A Simple Explanation of COGROUP in Apache Pig
practice - gist.github.com/hyunjun/55f83bfd91e2b1e24f46
huge number of part files
Hadoop Tutorial: Pig Part 2 -- Joining Data Sets and Other Advanced Topics
Hadoop Pig Tutorial

Pinot

Apache Pinot™ (Incubating): Realtime distributed OLAP datastore | Apache Pinot™ (Incubating)
Introducing Apache Pinot 0.5.0. We are excited to announce that Apache… | by Ting Chen | Apache Pinot Developer Blog | Sep, 2020 | Medium
Intro to Apache Pinot - YouTube

PredictionIO

PredictionIO
incubator-predictionio - PredictionIO, a machine learning server for developers and ML engineers. Built on Apache Spark, HBase and Spray. http://prediction.io

Pulsar

Apache Pulsar A distributed pub-sub system started at Yahoo to address the shortcomings of existing messaging/streaming systems
Geo-replication in Apache Pulsar
- part 1: concepts and features
- part 2: patterns and practices
- Explains how to perform cross-data center replication with Apache Pulsar
- Covers the commands needed to set up replication, how to override it per application, how to monitor it, how to throttle replication bandwidth, and more
Comparing Pulsar and Kafka: how a segment-based architecture delivers better performance, scalability, and resilience
Querying Data Streams with Apache Pulsar SQL
- Covers the architecture, performance, and a review of querying streaming data with SQL through Apache Pulsar
Apache Pulsar. MacOS installation Guide
Apache Pulsar Using Java
Rendezvous Architecture for Data Science in Production
Apache Pulsar as One Storage System for Both Real-time and Historical Data Analysis
Pulsar vs. Kafka — Part 1 — A More Accurate Perspective on Performance, Architecture, and Features
Event-driven railway network based on Pulsar - I'm Pavels, welcome! scala
Scale By The Bay 2020: Keynote: Karthik Ramasamy, Apache Pulsar @ Splunk - YouTube
Event Streaming with Apache Pulsar and Scala - Rock the JVM Blog
Apache Pulsar Tutorial with Scala - YouTube

Ranger

Ranger
IT’S MORPHING TIME: APACHE RANGER GRADUATES TO A TOP LEVEL PROJECT – PART 2
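- Introduces the key features of Apache Ranger, which has graduated to an Apache top-level project
- Includes attribute-based access control, a policy engine, a key management service that can be combined with hardware management modules, and more
INTRODUCING ROW/ COLUMN LEVEL ACCESS CONTROL FOR APACHE SPARK
- Hortonworks explains, with a simple demo, how Apache Ranger supports row/column-level data access and data masking in Hive or Spark SQL
Apache Ranger Vs Sentry A comparison of Apache Ranger and Apache Sentry, which provide authentication and various security features for the Hadoop ecosystem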

River

River

Samza

REAL-TIME FULL-TEXT SEARCH WITH LUWAK AND SAMZA
Apache Kafka, Samza, and the Unix Philosophy of Distributed Data
Concourse: Generating Personalized Content Notifications in Near-Real-Time
- Introduces the design of Concourse, LinkedIn's personalized notification system
- Uses a batch system built on Apache Kafka and Apache Samza
- Designed so that data is processed within each data center to improve throughput

SeaTunnel

incubator-seatunnel: SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time)
Apache SeaTunnel - 분산, 고성능 데이터 통합 플랫폼 | GeekNews

SINGA

SINGA a general distributed deep learning platform for training big deep learning models over large datasets

Slider

Slider Project Incubation Status - Apache Incubator
DEVIEW 2018 :: C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터
- 212 C3, 데이터 처리에서 서빙까지 가능한 하둡 클러스터

Solr

gooper.com/검색엔진-solr

Spot

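Spot Used to detect infosec threats by analyzing network data
Apache Spot (incubating) and Cloudera on AWS in 60 Minutes
- Introduces the architecture of Apache Spot, which is based on Apache Kafka (for processing), Apache Spark (for processing and ML analytics), Apache Hadoop (for processing and storage), and others
- Spot uses the Python Watchdog library, which detects file system changes and raises events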

Sqoop

An HDFS Tutorial for Data Analysts Stuck With Relational Databases PostgreSQL to HDFS
SQOOP으로 MYSQL 데이터 가져오기
How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines
- Posted on the StreamSets Dataflow Performance Blog
- Briefly explains the migration approach and considerations for replacing Apache Sqoop
Using Sqoop to Import Data from MySQL to Cloudera Data Warehouse
An in-depth introduction to SQOOP architecture

Storm

Apache Storm을 이용한 실시간 데이타 처리
Scaling Apache Storm - Strata + Hadoop World 2014
주니어 개발자의 storm kafka 시작하기
Real-Time Analytics with Apache Storm
대용량 스트리밍 데이터 실시간 분석
Reading and Understanding the Storm UI
Introduction to Apache NiFi and Storm

Superset

Superset a data exploration and visualization web application
Supercharging Apache Superset | by Airbnb | Airbnb Engineering & Data Science
Use Apache Superset for open source business intelligence reporting | Opensource.com

SystemML

SystemML A machine learning library built to extend Apache Spark and Apache Hadoop
IBM's SystemML Machine Learning - Now Apache SystemML
The Apache Software Foundation Announces Apache® SystemML™ as a Top-Level Project

Tajo

Tajo
Introduction to Apache Tajo
누구나 따라할 수 있는 Tajo 시작하기 : How to install Apache Tajo
아즈카반으로 타조 워크플로우 구성하기 : How to schedule Tajo Job using Azkaban
Collaborate Apache Tajo + Elasticsearch
아파치 타조(Apache Tajo)를 이용한 코호트(Cohort) 분석
아파치 타조 (Apache Tajo) 한글 문서 프로젝트 리소스 및 진행 공유
Big data analysis with R and Apache Tajo (in Korean)
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Apache Tajo 데스크탑 + Zeppelin 연동 하기
Expanding Your Data Warehouse with Tajo
AWS + Tajo를 이용한 '테라 렉 로그 분석 이야기'
Python 에서 Tajo 사용하기
MelOn 빅데이터 플랫폼과 Tajo 이야기

Thrift

Apache Thrift
아파치 쓰리프트의 bool 타입 관련 제한 값

Tika

Tika

Toree

Toree

Traffic Server

Apache Traffic Server

UIMA

UIMA

WEEX

WEEX A framework for building Mobile cross-platform UIs

Zookeeper

Zookeeper
Zoom: Reactive Programming with Zookeeper
The Discovery of Apache ZooKeeper’s Poison Packet
Mining Zookeeper’s transaction log to track down bugs
Apache ZooKeeper Four Letter Words and Security
- A brief look at Apache ZooKeeper's four-letter-word (4lw) support
- These admin commands go over the normal ZK port and have no good security solution
- As an alternative, ZooKeeper supports JMX, and the 3.5.x releases provide an AdminServer on a separate port
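A four-letter-word command is just four ASCII characters written to the ZooKeeper client port, after which the server replies and closes the connection. The helper below is a minimal sketch using only the standard library; the function name and the timeout default are assumptions.

```python
import socket


def four_letter_word(host: str, port: int, command: str,
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    if len(command) != 4:
        raise ValueError("command must be exactly four characters")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Against a running server, `four_letter_word("localhost", 2181, "ruok")` is expected to return `imok` when the server is healthy (host and port here are the usual defaults, assumed for illustration). Recent ZooKeeper releases only answer commands enabled through `4lw.commands.whitelist`. Nothing in this exchange authenticates the caller, which is the security concern the post raises.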
Zookeeper 클러스터 및 컨트롤러 선출 :: 당근케잌
consul.io
- HashiCorp사의 Consul, Consul Template 소개
- Real-time Service Configuration으로 Consul을 신주소 서비스에 적용한 사례
- Mitchell Hashimoto on Consul since 1.2 and its Role as a Modern Service Mesh
- Announcing HashiCorp’s Homebrew Tap
- /usr/bin/consul-template -consul-retry-attempts=1 -template "./dynamic.ctmpl:./dynamic.conf" -config="/etc/consul.d/template/config.json" -once An example of generating a conf file from a template
