
microbatch2cassandra

A simple example of using Spark Structured Streaming for an imaginary case of aggregating customers' bank transactions.

Case description

Customers can make debit and credit transactions. Our task is to aggregate them in one-hour windows and save the aggregates to a database. "To aggregate" means to group transactions by customer, time window, and type, and to sum the amounts. Transactions arrive as CSV files. Apache Cassandra is used as the aggregate store.
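A minimal sketch of the data model this description implies; the field names and types below are assumptions for illustration, not taken from the repository:

```scala
import java.sql.Timestamp

// Hypothetical shape of one incoming transaction; the field names are
// assumptions, not taken from the repository.
case class Transaction(
  customerId: String,
  transactionId: String, // used later for deduplication
  txType: String,        // "debit" or "credit"
  amount: BigDecimal,
  timestamp: Timestamp
)

// One aggregate row per (customer, one-hour window, transaction type).
case class HourlyAggregate(
  customerId: String,
  windowStart: Timestamp,
  txType: String,
  totalAmount: BigDecimal
)
```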

A few complications:

  1. Transactions may be incorrect (malformed CSV, zero amount).
  2. A transaction may be duplicated within one hour, and we must not sum the same transaction twice (see the deduplication sketch after this list).
  3. Transactions may arrive a few hours late, but no more than 24 hours.
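In Structured Streaming, complications 2 and 3 are commonly handled together with a watermark plus dropDuplicates. A minimal sketch, assuming the column names from the data model above:

```scala
import org.apache.spark.sql.DataFrame

// Complication 3: tolerate events up to 24 hours late, after which
// Spark can discard the deduplication state.
// Complication 2: drop repeated (transactionId, timestamp) pairs within
// that horizon, so a duplicate is never summed twice.
def deduplicate(transactions: DataFrame): DataFrame =
  transactions
    .withWatermark("timestamp", "24 hours")
    .dropDuplicates("transactionId", "timestamp")
```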

Solution:

  1. Read a stream of CSV files and filter out incorrect records.
  2. Apply a time-window aggregation.
  3. Write batches of aggregates to Cassandra via spark-cassandra-connector (the sketch after this list shows the whole pipeline).
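A sketch of the whole pipeline under the assumptions above; the input path, schema, keyspace, and table names are invented for illustration, and the repository's actual code may differ:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, window}

val spark = SparkSession.builder()
  .appName("microbatch2cassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
  .getOrCreate()

// 1. Read a stream of CSV files and filter out incorrect records.
//    The schema and input path are assumptions.
val transactions = spark.readStream
  .schema("customerId STRING, transactionId STRING, txType STRING, " +
          "amount DECIMAL(18,2), timestamp TIMESTAMP")
  .option("mode", "DROPMALFORMED")   // drop malformed CSV lines
  .csv("/input/transactions")
  .filter(col("amount") =!= 0)       // drop zero-amount records

// 2. One-hour window aggregation, with the watermark/deduplication step
//    from the earlier sketch applied first.
val aggregates = transactions
  .withWatermark("timestamp", "24 hours")
  .dropDuplicates("transactionId", "timestamp")
  .groupBy(col("customerId"), window(col("timestamp"), "1 hour"), col("txType"))
  .agg(sum("amount").as("totalAmount"))

// 3. Write each micro-batch to Cassandra; keyspace and table are assumptions.
aggregates.writeStream
  .outputMode("update")
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch
      .withColumn("windowStart", col("window.start"))
      .drop("window")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "bank", "table" -> "hourly_aggregates"))
      .mode("append")
      .save()
  }
  .start()
  .awaitTermination()
```

foreachBatch (Spark 2.4+) is what makes step 3 work: it hands each micro-batch to the ordinary batch writer from spark-cassandra-connector inside a streaming query.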

New requirement:

  1. Transactions now arrive as JSON messages from Kafka.

New solution:

  1. Read the stream of JSON messages via spark-sql-kafka-0-10_2.11 (see the sketch after this list).
  2. Apply the same time-window aggregation as before.
  3. Write batches of aggregates to Cassandra via spark-cassandra-connector as before.
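A sketch of the replacement source; only step 1 changes. The broker address and topic name are assumptions, the spark-sql-kafka-0-10_2.11 artifact must be on the classpath, and the same SparkSession as in the CSV pipeline is reused:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Same (assumed) schema as in the CSV case.
val txSchema = StructType(Seq(
  StructField("customerId", StringType),
  StructField("transactionId", StringType),
  StructField("txType", StringType),
  StructField("amount", DecimalType(18, 2)),
  StructField("timestamp", TimestampType)
))

// 1. Read JSON messages from Kafka; broker address and topic are assumptions.
val transactions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "transactions")
  .load()
  .select(from_json(col("value").cast("string"), txSchema).as("tx"))
  .select("tx.*")
  .filter(col("amount") =!= 0)

// Steps 2 and 3 (windowed aggregation and the foreachBatch Cassandra sink)
// stay exactly as in the CSV pipeline above.
```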
