A containerized approach using Apache Kafka, Spark, Cassandra, Hive, PostgreSQL, Jupyter, and Docker Compose.
Medium article: https://medium.com/@weslleylc/building-a-feature-store-4a2dffee17fe
Clone the repo:

```bash
git clone https://github.com/weslleylc/Feature-Store.git
cd Feature-Store
```
Run docker-compose and wait until all containers have been instantiated:

```bash
docker-compose up -d
```
Open and run the following notebooks in spark/notebooks:
- Kafka.ipynb
- Starbucks_ETL.ipynb
The Kafka notebook produces a stream of JSON messages on the Kafka topic queueing.transactions (a producer sketch is shown below). Starbucks_ETL.ipynb consumes the topic and builds the feature store.
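For reference, a producer along these lines can be written with kafka-python; this is a minimal sketch, and the broker address and event fields are assumptions based on the output schema described below:

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

# Broker address is an assumption; match it to your docker-compose setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical order event; field names follow the output schema below.
event = {
    "id_employer": 1,
    "name_employer": "Alice",
    "name_client": "Bob",
    "payment": "credit_card",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "product_name": "Caffe Latte",
    "product_size": "Tall",
}

producer.send("queueing.transactions", event)
producer.flush()
```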
We have a streaming JSON data source with Starbucks order events captured in real time, and a CSV data set with additional information about the drinks.
Objective:
We want to parse the JSON from the streaming source, perform aggregation operations, store all rows in cheap storage (like S3), and serve the most recent transactions from a low-latency database like Cassandra. The desired output schema is as follows (a parsing sketch appears after the list):
- id_employer: int
- name_employer: string
- name_client: string
- payment: string
- timestamp: timestamp
- product_name: string
- product_size: string
- product_price: int
- percent_carbo: float
- final_price: float
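As a sketch of the consuming side, here is the parse-and-enrich step in plain Spark Structured Streaming (not Butterfree-specific code); the broker address, CSV path, and join keys are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    IntegerType, StringType, StructField, StructType, TimestampType,
)

spark = SparkSession.builder.appName("starbucks-etl").getOrCreate()

# Fields carried by the order events; product_price, percent_carbo and
# final_price are derived later from the drinks CSV.
event_schema = StructType([
    StructField("id_employer", IntegerType()),
    StructField("name_employer", StringType()),
    StructField("name_client", StringType()),
    StructField("payment", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("product_name", StringType()),
    StructField("product_size", StringType()),
])

# Read the raw Kafka stream and parse each JSON value into typed columns.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumption
    .option("subscribe", "queueing.transactions")
    .load()
)
orders = raw.select(
    from_json(col("value").cast("string"), event_schema).alias("data")
).select("data.*")

# Enrich with the drinks CSV (path and join keys are assumptions).
drinks = spark.read.csv("data/starbucks.csv", header=True, inferSchema=True)
enriched = orders.join(drinks, on=["product_name", "product_size"], how="left")
```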
Solution using the Butterfree library and the architecture below (a pipeline sketch follows the list):
- Apache Kafka as the data source (streaming input data);
- A Hive metastore to store metadata (such as table schemas and locations) in a relational database (PostgreSQL in this tutorial);
- Apache Cassandra to store the most recent data;
- Amazon S3 to store historical features or table views for debug mode.
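To illustrate, here is a minimal sketch of the transform/load side using Butterfree's documented FeatureSet and Sink classes. The entity name, descriptions, and default writer configuration are assumptions (connection settings normally come via db_config or environment variables), and import paths may vary with the Butterfree version:

```python
from butterfree.constants import DataType
from butterfree.load import Sink
from butterfree.load.writers import (
    HistoricalFeatureStoreWriter,
    OnlineFeatureStoreWriter,
)
from butterfree.transform import FeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature

# Feature set mirroring the output schema above (features abbreviated).
feature_set = FeatureSet(
    name="starbucks_orders",
    entity="orders",  # hypothetical entity name
    description="Starbucks order transactions enriched with drink data.",
    keys=[
        KeyFeature(
            name="id_employer",
            description="Employer id.",
            dtype=DataType.INTEGER,
        ),
    ],
    timestamp=TimestampFeature(),
    features=[
        Feature(
            name="final_price",
            description="Final order price.",
            dtype=DataType.FLOAT,
        ),
    ],
)

# Two writers: historical (S3-backed Hive table) and online (Cassandra).
sink = Sink(
    writers=[HistoricalFeatureStoreWriter(), OnlineFeatureStoreWriter()],
)
```

A Source built from Butterfree's KafkaReader, this feature set, and this sink would then be assembled into a FeatureSetPipeline and run; see the Butterfree documentation and Starbucks_ETL.ipynb for the full wiring.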