Querying Big Data using Taipy and Dask

This project uses Taipy to create a Dask pipeline to run queries on a 24Gb dataset of Yelp Reviews and build a web app to run these queries and display the results.

Taipy is a great way to manage and display the results of Dask applications, as its backend is built for large-scale applications and can handle caching, parallelization, scenario management, pipeline versioning, data scoping, etc.

Why Taipy?

Taipy is an open-source Python library that manages both front and back-end:

Taipy GUI helps create web apps quickly using only Python code
Taipy Core manages data pipelines through a visual editor where parallelization, caching and scoping are easily defined

Why Dask?

Pandas is an excellent library for data analysis, but it is not designed to handle large datasets. Dask is a library that extends Pandas to parallelize computations and handle ("out-of-core") datasets.

Datasets

The datasets used in this project are based on Yelp Reviews datasets:

data/yelp_business.csv contains the reference information about businesses (mainly name and id)
data/yelp_review.csv is a 24Gb dataset containing Yelp reviews (mainly business id, text, and stars)

The goal will be to join these two datasets and run queries to find average stars and reviews for a specific business.

Data Pipeline

The Data Pipeline is built using Taipy Studio in VSCode and looks like this:

Blue nodes are Data Nodes that store Python variables or datasets:

business_data is the yelp_business.csv dataset represented as a Pandas DataFrame
business_dict is a dictionary mapping business ids to business names
business_name is the name of the business we want to query
business_id is the id of the business we want to query
review_data is the yelp_review.csv dataset as a Dask DataFrame

review_data is a generic data node that calls a Python function read_fct to read the dataset with a read_fct_params argument to specify the path.

raw_reviews contains the reviews that we queried
parsed_reviews is raw_reviews but filtered only to contain relevant columns

Between the data nodes (in blue) are Task Nodes (in orange). Task Nodes take Data Nodes as inputs and return Data Nodes as outputs using Python functions.

These Task Nodes are combined into a pipeline using a green node called the Pipeline** Node**, which is the entry point of the pipeline. (Note that Taipy allows for several Pipelines Nodes to co-exist)

Caching

Task Nodes have a skippable property that allows them to be skipped if the output was already computed and cached.

Example:

Let’s set the parameter skippable to True for the get_reviews task.
Let’s run the pipeline once
Then run it a second time, Taipy will log:
[2023-05-16 04:40:06,858][Taipy][INFO] job JOB_getread_reviews_39d2bb45-8901-4081-b877-2e308507bb90 is skipped.
meaning it did not reread the dataset but used the cached result instead.

Scenarios

Taipy also allows you not to re-run scenarios that have already been run. For example, if you query the reviews for "Mon Ami Gabi". The next time you want to query the reviews for "Mon Ami Gabi", Taipy will log:

[2023-05-25 15:07:05,230][Taipy][INFO] job JOB_preprocessing_e0dc369a-a082-4524-8421-bef572e3643c is skipped.
[2023-05-25 15:07:05,275][Taipy][INFO] job JOB_get_business_id_6b8c7eb2-90c1-400a-acd8-1570e96464c3 is skipped.
[2023-05-25 15:07:05,317][Taipy][INFO] job JOB_get_reviews_e2a4f00c-1de7-4b05-8615-ebb778fbf5dc is skipped.
[2023-05-25 15:07:05,344][Taipy][INFO] job JOB_parse_reviews_3770539d-6fde-4a28-9ba1-bd0607758f1f is skipped.

Signaling that the pipeline did not run any tasks because the scenario was already run. The results will be displayed immediately.

Web App

The web app is built using Taipy GUI and looks like this:

The app allows you to select a business from a dropdown menu. This will call the pipeline, run the query in 5 minutes and display the results: average stars and reviews for the selected business.

How to Run

You can run the app using this repo which contains a smaller version of the dataset (30Mb):

Clone the repository

git clone https://github.com/AlexandreSajus/Taipy-Dask-Demo.git

Install the requirements

pip install -r requirements.txt

Run the web app

python app.py

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.vscode		.vscode
algos		algos
config		config
data		data
media		media
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Querying Big Data using Taipy and Dask

Table of Contents

Why Taipy?

Why Dask?

Datasets

Data Pipeline

Caching

Scenarios

Web App

How to Run

About

Releases

Packages

Contributors 2

Languages

License

AlexandreSajus/Taipy-Dask-Demo

Folders and files

Latest commit

History

Repository files navigation

Querying Big Data using Taipy and Dask

Table of Contents

Why Taipy?

Why Dask?

Datasets

Data Pipeline

Caching

Scenarios

Web App

How to Run

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages