Readme updates nirmit #44

Merged
merged 4 commits into from
May 1, 2024
74 changes: 49 additions & 25 deletions README.md
@@ -1,6 +1,6 @@


<h1 align="center">Data Prep LAB </h1>
<h1 align="center">Data Prep Lab </h1>

<div align="center">

@@ -11,11 +11,15 @@

---

Data Prep Lab is a cloud native [Ray](https://docs.ray.io/en/latest/index.html)
based toolkit that allows a user to quickly prepare their data for building LLM applications using a set of available transforms.
This toolkit gives the user flexibility to run data prep from laptop-scale to cluster-scale,
and provides automation via [KFP pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/). Moreover, a user can add in their own data prep module and scale it
using Ray without having to worry about Ray internals.
Data Prep Lab is a community project to democratize and accelerate unstructured data preparation for LLM app developers.
With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use-case-specific unstructured data to fine-tune or instruct-tune LLMs.
As the variety of use cases grows, so does the need to support:
- New modalities of data (code, language, speech, visual)
- New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Lab offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
The goal is to offer high-level APIs for developers to quickly get started in working with their data without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
@@ -25,30 +29,48 @@ using Ray without having to worry about Ray internals.
- [Acknowledgments](#acknowledgement)

## &#x1F4D6; About <a name = "about"></a>
Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning.
Data Prep Lab contributes a set of available modules that the user can get started with to easily build data pipelines suitable for their use case.
These modules have been tested in producing pre-training datasets for the [Granite](https://huggingface.co/instructlab/granite-7b-lab) open models.

The modules are built on common frameworks (for Spark and Ray), together called the *data processing library*, which allows developers to build new custom modules that readily scale across a variety of runtimes.
Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)

Current support matrix for the above runtimes is shown in the table below.

|Transform | Python-only | Ray | Spark | KFP on Ray |
|------------------------------ |-------------------|------------------|--------------------|------------------------|
|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Programming language annotation |:white_check_mark: |:white_check_mark:|                    |:white_check_mark:      |
|Exact dedup filter | |:white_check_mark:| |:white_check_mark: |
|Fuzzy dedup filter | |:white_check_mark:| |:white_check_mark: |
|Code quality annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Malware annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Filter on annotations |:white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
|Tokenize |:white_check_mark: |:white_check_mark:| |:white_check_mark: |


Data Prep LAB is a toolkit that enables users to prepare their data for building LLM applications.
This toolkit comes with a set of available modules (known as transforms) that the user can get started
with to easily build customized data pipelines.
This set of transforms is built on a framework, known as the data processing library,
that allows a user to quickly build in their own new transforms and then scale them as needed.
Users can incorporate their logic for custom data transformation and then use the included Ray-based
distributed computing framework to scalably apply the transform to their data.

Features of the toolkit:
- Collection of [scalable transformations](transforms) to expedite user onboarding
- [Data processing library](data-processing-lib) designed to facilitate effortless addition and deployment of new scalable transformations
- Operate efficiently and seamlessly from laptop-scale to cluster-scale supporting data processing at any data size
- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) of transforms.
- Aims to accelerate unstructured data prep for the "long tail" of LLM use cases
- Growing set of data transform implementations across multiple runtimes and scales of data
- [Data processing library](data-processing-lib) to enable contribution of new custom modules targeting new use cases
- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) for no-code data prep

Data modalities supported:
* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like code transforms will operate on parquet files.

Support for additional data formats is expected.
Support for additional data modalities is expected in the future.

### Toolkit Design:
The toolkit is a Python-based library that has ready-to-use, scalable Ray-based transforms.
### Data Processing Library:
A Python-based library with ready-to-use transforms that can be supported across a variety of runtimes.
We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set
[schema](tools/ingest2parquet/).
@@ -58,20 +80,22 @@ tool that also adds the necessary fields in the schema.
A user can use one or more of the [available transforms](transforms) to process their data.

#### Transform design:
A transform can follow one of the two patterns: filter or annotator pattern.
In the annotator design pattern, a transform adds information during the processing by adding one more column to the parquet file.
A transform can follow one of the two patterns: annotator or filter.

- **Annotator** An annotator transform adds information during the processing by adding one more column to the parquet file.
The annotator design also allows a user to verify the results of the processing before actual filtering of the data.
When a transform acts as a filter, it processes the data and outputs the transformed data (example exact deduplication).

- **Filter** A filter transform processes the data and outputs the transformed data (for example, exact deduplication).
A general purpose [SQL-based filter transform](transforms/filter) enables a powerful mechanism for identifying
columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).
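
As a rough illustration of the two patterns, the sketch below uses only the Python standard library. The row layout, the column names (`document_id`, `contents`, `num_chars`), and the helper functions are hypothetical and are not part of the Data Prep Lab API. An in-memory list of dicts stands in for a parquet table, and `sqlite3` stands in for whatever engine the SQL-based filter transform actually uses.

```python
import sqlite3

# Hypothetical stand-in for a parquet table: one dict per row.
rows = [
    {"document_id": 1, "contents": "print('hello')"},
    {"document_id": 2, "contents": "print('hello')"},
    {"document_id": 3, "contents": "int main() { return 0; }"},
]


def annotate_length(table):
    """Annotator pattern: keep every row and add one new column."""
    return [{**row, "num_chars": len(row["contents"])} for row in table]


def exact_dedup(table):
    """Filter pattern: keep only the first row for each distinct `contents`."""
    seen = set()
    kept = []
    for row in table:
        if row["contents"] not in seen:
            seen.add(row["contents"])
            kept.append(row)
    return kept


def sql_filter(table, where_clause):
    """Filter pattern: keep rows matching a SQL WHERE clause on annotated columns.

    The clause is interpolated directly, so this sketch assumes trusted input.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (document_id INTEGER, contents TEXT, num_chars INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs VALUES (:document_id, :contents, :num_chars)", table
    )
    cursor = conn.execute(
        f"SELECT document_id, contents, num_chars FROM docs "
        f"WHERE {where_clause} ORDER BY document_id"
    )
    columns = [description[0] for description in cursor.description]
    result = [dict(zip(columns, values)) for values in cursor.fetchall()]
    conn.close()
    return result


annotated = annotate_length(rows)                 # same rows, one extra column
deduped = exact_dedup(annotated)                  # duplicate contents dropped
selected = sql_filter(deduped, "num_chars > 20")  # rows chosen via annotation
```

Running the annotator first leaves all rows in place, so the new `num_chars` column can be inspected before either filter drops any data.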

#### Scaling of transforms:
The distributed infrastructure, based on [Ray](https://docs.ray.io/en/latest/index.html), is used to scale out the transformation process.
To enable processing large volumes of data leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.
A generalized workflow is shown [here](doc/data-processing.md).
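
The idea of wrapping a plain Python transform so it can run on many workers can be sketched as follows. This is not the Ray or Spark wrapper shipped with the toolkit: it assumes a hypothetical per-document transform (`count_tokens`) and uses the standard library's `concurrent.futures` thread pool as a local stand-in for a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def count_tokens(document):
    """Hypothetical single-document transform: add a whitespace token count."""
    return {**document, "num_tokens": len(document["contents"].split())}


def run_scaled(transform, documents, max_workers=4):
    """Apply `transform` to each document on a worker pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform, documents))


documents = [
    {"contents": "def f(): pass"},
    {"contents": "x = 1"},
    {"contents": ""},
]
results = run_scaled(count_tokens, documents)
```

The transform itself knows nothing about the pool, which mirrors the design goal stated above: the processing logic stays plain Python, and the wrapper decides how the work is distributed.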

#### Bring Your Own Transform:
One can add new transforms by bringing in their own processing logic and using the framework to build scalable transforms.
One can add new transforms by bringing in their own Python-based processing logic and using the Data Processing Library to build and contribute transforms.
More details on the data processing library [here](data-processing-lib/doc/overview.md).

#### Automation: