From 0372f20182ec22e18e1483395a3c8bf02a7ec6b6 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 11:48:10 -0400
Subject: [PATCH 1/3] Updated top-level Readme to describe the project mission
 more holistically

---
 README.md | 75 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index c31546439..fa0dab69d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-
-Data Prep LAB
-
+
+Data Prep Lab
+
@@ -11,11 +11,15 @@

---

-Data Prep Lab is a cloud native [Ray](https://docs.ray.io/en/latest/index.html)
-based toolkit that allows a user to quickly prepare their data for building LLM applications using a set of available transforms.
-This toolkit gives the user flexibility to run data prep from laptop-scale to cluster-scale,
-and provides automation via [KFP pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/). Moreover, a user can add in their own data prep module and scale it
-using Ray without having to worry about Ray internals.
+Data Prep Lab is a community project to democratize and accelerate unstructured data preparation for LLM app developers.
+With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use-case-specific unstructured data to fine-tune or instruct-tune LLMs.
+As the variety of use cases grows, so does the need to support:
+- New modalities of data (code, language, speech, visual)
+- New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case
+- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale
+
+Data Prep Lab offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
+The goal is to offer high-level APIs for developers to quickly get started working with their data without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
@@ -25,30 +29,49 @@ using Ray without having to worry about Ray internals.
- [Acknowledgments](#acknowledgement)

## 📖 About
+Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning.
+Data Prep Lab contributes a set of available modules that the user can get started with to easily build data pipelines suitable for their use case.
+These modules have been tested in producing pre-training datasets for the [Granite](https://huggingface.co/instructlab/granite-7b-lab) open models.
+
+The modules are built on common frameworks (for Spark and Ray), called the *data processing library*, that allow developers to build new custom modules that readily scale across a variety of runtimes.
+Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:
+1. Python runtime
+2. Ray runtime (local and distributed)
+3. Spark runtime (local and distributed)
+4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
+
+Current support matrix for the above runtimes is shown in the table below.
+
+|Transform | Python-only | Ray | Spark | KFP on Ray |
+|------------------------------ |-------------------|------------------|--------------------|------------------------|
+|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Programming language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Exact dedup | |:white_check_mark:| |:white_check_mark: |
+|Fuzzy dedup | |:white_check_mark:| |:white_check_mark: |
+|Code quality annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Malware annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Filter on annotations |:white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
+|Tokenize |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+
-Data Prep LAB is an toolkit that enables users to prepare their data for building LLM applications.
-This toolkit comes with a set of available modules (known as transforms) that the user can get started
-with to easily build customized data pipelines.
-This set of transforms is built on a framework, known as the data processing library,
-that allows a user to quickly build in their own new transforms and then scale them as needed.
-Users can incorporate their logic for custom data transformation and then use the included Ray-based
-distributed computing framework to scalably apply the transform to their data.

Features of the toolkit:
-- Collection of [scalable transformations](transforms) to expedite user onboarding
-- [Data processing library](data-processing-lib) designed to facilitate effortless addition and deployment of new scalable transformations
-- Operate efficiently and seamlessly from laptop-scale to cluster-scale supporting data processing at any data size
-- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) of transforms.
+- Aiming to ease the unstructured data prep burden for the "long tail" of LLM use cases
+- Growing set of data transform implementations across multiple runtimes and scales of data
+- [Data processing library](data-processing-lib) to enable contribution of new custom modules targeting new use cases
+- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) for no-code data prep

Data modalities supported:
* Code - support for code datasets as downloaded .zip files of github repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like code transforms will operate on parquet files.
-Support for additional data formats are expected.
+Support for additional data modalities is expected in the future.

-### Toolkit Design:
-The toolkit is a python-based library that has ready to use scalable Ray based transforms.
+### Data Processing Library:
+A Python-based library of ready-to-use transforms that can be supported across a variety of runtimes.
We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](tools/ingest2parquet/).
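+
+For illustration, a minimal sketch of inspecting that schema with `pyarrow` (the file path is hypothetical):
+
+```python
+# Peek at the schema of a parquet input file (the path is hypothetical).
+import pyarrow.parquet as pq
+
+schema = pq.read_schema("input/docs.parquet")
+print(schema)  # the column names and types the transforms will operate on
+```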
@@ -58,20 +81,22 @@ tool that also adds the necessary fields in the schema.
A user can use one or more of the [available transforms](transforms) to process their data.

#### Transform design:
-A transform can follow one of the two patterns: filter or annotator pattern.
-In the annotator design pattern, a transform adds information during the processing by adding one more column to the parquet file.
+A transform can follow one of two patterns: annotator or filter.
+
+- **Annotator**: An annotator transform adds information during processing by adding one or more columns to the parquet file.
The annotator design also allows a user to verify the results of the processing before actual filtering of the data.
+
+- **Filter**: A filter transform processes the data and outputs the transformed data (for example, exact deduplication).
A general purpose [SQL-based filter transform](transforms/filter) enables a powerful mechanism for identifying
columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).
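+
+For intuition only, here is a conceptual sketch of the two patterns in plain `pyarrow`; the column names are hypothetical, and the project's actual transform API differs:
+
+```python
+# Annotate, then filter, a parquet table (hypothetical column names).
+import pyarrow.compute as pc
+import pyarrow.parquet as pq
+
+table = pq.read_table("input/docs.parquet")
+
+# Annotator pattern: derive information and append it as a new column.
+lengths = pc.utf8_length(table.column("contents"))
+annotated = table.append_column("doc_length", lengths)
+
+# Filter pattern: keep only the rows whose annotation passes a predicate.
+filtered = annotated.filter(pc.greater(annotated.column("doc_length"), 100))
+pq.write_table(filtered, "output/docs.parquet")
+```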

#### Scaling of transforms:
-The distributed infrastructure, based on [Ray](https://docs.ray.io/en/latest/index.html), is used to scale out the transformation process.
+To enable processing large volumes of data on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.
A generalized workflow is shown [here](doc/data-processing.md).
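+
+As a rough sketch of the idea (not the project's wrapper API), a per-file transform can be fanned out with Ray as follows; the file list and the transform body are hypothetical:
+
+```python
+# Fan a per-file transform out across a Ray cluster (conceptual sketch).
+import ray
+import pyarrow.parquet as pq
+
+ray.init()  # local by default; connect to a cluster for distributed runs
+
+@ray.remote
+def transform_file(path: str) -> int:
+    table = pq.read_table(path)
+    # ... apply an annotator or filter to `table` here ...
+    return table.num_rows
+
+files = ["part-0.parquet", "part-1.parquet"]  # hypothetical input shards
+row_counts = ray.get([transform_file.remote(f) for f in files])
+print(f"processed {sum(row_counts)} rows")
+```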

#### Bring Your Own Transform:
-One can add new transforms by bringing in their own processing logic and using the framework to build scalable transforms.
+One can add new transforms by bringing in their own Python-based processing logic and using the Data Processing Library to build and contribute transforms.
More details on the data processing library [here](data-processing-lib/doc/overview.md).

#### Automation:

From d38d6ceebf496a1f27dca3fa1435d4230d11bdf2 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 12:03:38 -0400
Subject: [PATCH 2/3] Remove lang ID from the top-level Readme as it is not in
 scope yet

---
 README.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index fa0dab69d..b192afaf8 100644
--- a/README.md
+++ b/README.md
@@ -46,10 +46,9 @@ Current support matrix for the above runtimes is shown in the table below.
|------------------------------ |-------------------|------------------|--------------------|------------------------|
|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|Language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Programming language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|Exact dedup | |:white_check_mark:| |:white_check_mark: |
-|Fuzzy dedup | |:white_check_mark:| |:white_check_mark: |
+|Exact dedup filter | |:white_check_mark:| |:white_check_mark: |
+|Fuzzy dedup filter | |:white_check_mark:| |:white_check_mark: |
|Code quality annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Malware annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Filter on annotations |:white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |

From c8b01883dcd6aeea3cf2832083d61b24dfa71190 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 13:09:47 -0400
Subject: [PATCH 3/3] Updating global Readme to address conflicts and make
 minor corrections

---
 README.md | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index b192afaf8..661d4452b 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ As the variety of use cases grows, so does the need to support:
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Lab offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
-The goal is to offer high-level APIs for developers to quickly get started working with their data without needing expertise in the underlying runtimes and frameworks.
+The goal is to offer high-level APIs for developers to quickly get started working with their data, without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
@@ -30,7 +30,7 @@ ## 📖 About
Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning.
-Data Prep Lab contributes a set of available modules that the user can get started with to easily build data pipelines suitable for their use case.
+Data Prep Lab contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case.
These modules have been tested in producing pre-training datasets for the [Granite](https://huggingface.co/instructlab/granite-7b-lab) open models.

The modules are built on common frameworks (for Spark and Ray), called the *data processing library*, that allow developers to build new custom modules that readily scale across a variety of runtimes.
Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)

-Current support matrix for the above runtimes is shown in the table below.
+Current matrix of modules and supported runtimes is shown in the table below.
+Contributors are welcome to add new modules as well as add runtime support for existing modules!
+
-|Transform | Python-only | Ray | Spark | KFP on Ray |
+|Modules | Python-only | Ray | Spark | KFP on Ray |
|------------------------------ |-------------------|------------------|--------------------|------------------------|
|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
@@ -58,11 +60,13 @@ Features of the toolkit:
- Aiming to ease the unstructured data prep burden for the "long tail" of LLM use cases
-- Growing set of data transform implementations across multiple runtimes and scales of data
+- Growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing
+- Growing set of sample pipelines developed for real enterprise use cases
- [Data processing library](data-processing-lib) to enable contribution of new custom modules targeting new use cases
- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) for no-code data prep

Data modalities supported:
+
* Code - support for code datasets as downloaded .zip files of github repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like code transforms will operate on parquet files.
@@ -85,9 +89,8 @@ A transform can follow one of two patterns: annotator or filter.

- **Annotator**: An annotator transform adds information during processing by adding one or more columns to the parquet file.
The annotator design also allows a user to verify the results of the processing before actual filtering of the data.

-- **Filter**: A filter transform processes the data and outputs the transformed data (for example, exact deduplication).
-A general purpose [SQL-based filter transform](transforms/filter) enables a powerful mechanism for identifying
-columns and rows of interest for downstream processing.
+- **Filter**: A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
+A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

#### Scaling of transforms:
@@ -145,12 +148,13 @@ Get started by running the noop transform that performs an identity operation by
Get started by building a data pipeline with our example pipeline (link to be added) that can run on a laptop.

### Build your own sequence of transforms
-Follow the documentation [here](doc/overview.md) to build your own pipelines.
+Follow the documentation [here](data-processing-lib/doc/overview.md) to build your own pipelines.

### Automate the pipeline
The data preprocessing can be automated by running transformers as a KubeFlow pipeline (KFP).
-See a simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md), and [multi-steps pipeline](kfp/doc/multi_transform_pipeline.md)
-if you want to combine several data transformation steps.
+See a simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md).
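+
+For a feel of what a pipeline definition looks like, here is a rough, hypothetical sketch in the style of the KFP v1 SDK (the image name and arguments are made up; the tutorial shows the real pipelines):
+
+```python
+# Hypothetical single-step KFP v1 pipeline wrapping a transform container.
+import kfp
+from kfp import dsl
+
+@dsl.pipeline(name="noop-pipeline", description="Run one transform step.")
+def noop_pipeline():
+    dsl.ContainerOp(
+        name="noop-transform",
+        image="example.org/noop-transform:latest",  # hypothetical image
+        arguments=["--input", "/data/in", "--output", "/data/out"],
+    )
+
+if __name__ == "__main__":
+    kfp.compiler.Compiler().compile(noop_pipeline, "noop_pipeline.yaml")
+```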
+Future releases of Data Prep Lab will demonstrate how multiple simple transform pipelines can be combined into a single KFP pipeline.

The project facilitates the creation of a local Kind cluster with all the required software and test data.
To work with the Kind cluster and KFP, you need to install several pre-required software packages. Please refer to