From 0372f20182ec22e18e1483395a3c8bf02a7ec6b6 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 11:48:10 -0400
Subject: [PATCH 1/3] Updated top-level Readme to describe the project mission
 more holistically

---
 README.md | 75 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index c31546439..fa0dab69d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-
-Data Prep LAB
-
+
+Data Prep Lab
+
@@ -11,11 +11,15 @@

---

-Data Prep Lab is a cloud native [Ray](https://docs.ray.io/en/latest/index.html)
-based toolkit that allows a user to quickly prepare their data for building LLM applications using a set of available transforms.
-This toolkit gives the user flexibility to run data prep from laptop-scale to cluster-scale,
-and provides automation via [KFP pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/). Moreover, a user can add in their own data prep module and scale it
-using Ray without having to worry about Ray internals.
+Data Prep Lab is a community project to democratize and accelerate unstructured data preparation for LLM app developers.
+With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use-case-specific unstructured data to fine-tune or instruct-tune LLMs.
+As the variety of use cases grows, so does the need to support:
+- New modalities of data (code, language, speech, visual)
+- New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case
+- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale
+
+Data Prep Lab offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
+The goal is to offer high-level APIs for developers to quickly get started working with their data without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
@@ -25,30 +29,49 @@ using Ray without having to worry about Ray internals.
- [Acknowledgments](#acknowledgement)

## 📖 About
+Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning.
+Data Prep Lab contributes a set of available modules that the user can get started with to easily build data pipelines suitable for their use case.
+These modules have been tested in producing pre-training datasets for the [Granite](https://huggingface.co/instructlab/granite-7b-lab) open models.
+
+The modules are built on common frameworks (for Spark and Ray), called the *data processing library*, that allow developers to build new custom modules that readily scale across a variety of runtimes.
+Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:
+1. Python runtime
+2. Ray runtime (local and distributed)
+3. Spark runtime (local and distributed)
+4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
+
+Current support matrix for the above runtimes is shown in the table below.
+
+|Transform | Python-only | Ray | Spark | KFP on Ray |
+|------------------------------ |-------------------|------------------|--------------------|------------------------|
+|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Programming language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Exact dedup | |:white_check_mark:| |:white_check_mark: |
+|Fuzzy dedup | |:white_check_mark:| |:white_check_mark: |
+|Code quality annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Malware annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+|Filter on annotations |:white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |
+|Tokenize |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
+
-Data Prep LAB is an toolkit that enables users to prepare their data for building LLM applications.
-This toolkit comes with a set of available modules (known as transforms) that the user can get started
-with to easily build customized data pipelines.
-This set of transforms is built on a framework, known as the data processing library,
-that allows a user to quickly build in their own new transforms and then scale them as needed.
-Users can incorporate their logic for custom data transformation and then use the included Ray-based
-distributed computing framework to scalably apply the transform to their data.

Features of the toolkit:
-- Collection of [scalable transformations](transforms) to expedite user onboarding
-- [Data processing library](data-processing-lib) designed to facilitate effortless addition and deployment of new scalable transformations
-- Operate efficiently and seamlessly from laptop-scale to cluster-scale supporting data processing at any data size
-- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) of transforms.
+- Aiming to ease the unstructured data prep burden for the "long tail" of LLM use cases
+- Growing set of data transform implementations across multiple runtimes and scales of data
+- [Data processing library](data-processing-lib) to enable contribution of new custom modules targeting new use cases
+- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) for no-code data prep

Data modalities supported:
* Code - support for code datasets as downloaded .zip files of github repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like code transforms will operate on parquet files.
-Support for additional data formats are expected.
+Support for additional data modalities is expected in the future.

-### Toolkit Design:
-The toolkit is a python-based library that has ready to use scalable Ray based transforms.
+### Data Processing Library:
+A Python-based library of ready-to-use transforms that can be supported across a variety of runtimes.
We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](tools/ingest2parquet/).
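+
+For illustration, a minimal sketch of inspecting that schema with `pyarrow` (the file path is hypothetical):
+
+```python
+# Peek at the schema of a parquet input file (the path is hypothetical).
+import pyarrow.parquet as pq
+
+schema = pq.read_schema("input/docs.parquet")
+print(schema)  # the column names and types the transforms will operate on
+```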
@@ -58,20 +81,22 @@ tool that also adds the necessary fields in the schema.
A user can use one or more of the [available transforms](transforms) to process their data.

#### Transform design:
-A transform can follow one of the two patterns: filter or annotator pattern.
-In the annotator design pattern, a transform adds information during the processing by adding one more column to the parquet file.
+A transform can follow one of two patterns: annotator or filter.
+
+- **Annotator**: An annotator transform adds information during processing by adding one or more columns to the parquet file.
The annotator design also allows a user to verify the results of the processing before actual filtering of the data.
+
+- **Filter**: A filter transform processes the data and outputs the transformed data (for example, exact deduplication).
A general purpose [SQL-based filter transform](transforms/filter) enables a powerful mechanism for identifying
columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).
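+
+For intuition only, here is a conceptual sketch of the two patterns in plain `pyarrow`; the column names are hypothetical, and the project's actual transform API differs:
+
+```python
+# Annotate, then filter, a parquet table (hypothetical column names).
+import pyarrow.compute as pc
+import pyarrow.parquet as pq
+
+table = pq.read_table("input/docs.parquet")
+
+# Annotator pattern: derive information and append it as a new column.
+lengths = pc.utf8_length(table.column("contents"))
+annotated = table.append_column("doc_length", lengths)
+
+# Filter pattern: keep only the rows whose annotation passes a predicate.
+filtered = annotated.filter(pc.greater(annotated.column("doc_length"), 100))
+pq.write_table(filtered, "output/docs.parquet")
+```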

#### Scaling of transforms:
-The distributed infrastructure, based on [Ray](https://docs.ray.io/en/latest/index.html), is used to scale out the transformation process.
+To enable processing large volumes of data on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.
A generalized workflow is shown [here](doc/data-processing.md).
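+
+As a rough sketch of the idea (not the project's wrapper API), a per-file transform can be fanned out with Ray as follows; the file list and the transform body are hypothetical:
+
+```python
+# Fan a per-file transform out across a Ray cluster (conceptual sketch).
+import ray
+import pyarrow.parquet as pq
+
+ray.init()  # local by default; connect to a cluster for distributed runs
+
+@ray.remote
+def transform_file(path: str) -> int:
+    table = pq.read_table(path)
+    # ... apply an annotator or filter to `table` here ...
+    return table.num_rows
+
+files = ["part-0.parquet", "part-1.parquet"]  # hypothetical input shards
+row_counts = ray.get([transform_file.remote(f) for f in files])
+print(f"processed {sum(row_counts)} rows")
+```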

#### Bring Your Own Transform:
-One can add new transforms by bringing in their own processing logic and using the framework to build scalable transforms.
+One can add new transforms by bringing in their own Python-based processing logic and using the Data Processing Library to build and contribute transforms.
More details on the data processing library [here](data-processing-lib/doc/overview.md).

#### Automation:

From d38d6ceebf496a1f27dca3fa1435d4230d11bdf2 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 12:03:38 -0400
Subject: [PATCH 2/3] Remove lang ID from the top-level Readme as it is not in
 scope yet

---
 README.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index fa0dab69d..b192afaf8 100644
--- a/README.md
+++ b/README.md
@@ -46,10 +46,9 @@ Current support matrix for the above runtimes is shown in the table below.
|------------------------------ |-------------------|------------------|--------------------|------------------------|
|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|Language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Programming language annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
-|Exact dedup | |:white_check_mark:| |:white_check_mark: |
-|Fuzzy dedup | |:white_check_mark:| |:white_check_mark: |
+|Exact dedup filter | |:white_check_mark:| |:white_check_mark: |
+|Fuzzy dedup filter | |:white_check_mark:| |:white_check_mark: |
|Code quality annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Malware annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Filter on annotations |:white_check_mark: |:white_check_mark:|:white_check_mark: |:white_check_mark: |

From c8b01883dcd6aeea3cf2832083d61b24dfa71190 Mon Sep 17 00:00:00 2001
From: Nirmit Desai
Date: Wed, 1 May 2024 13:09:47 -0400
Subject: [PATCH 3/3] Updating global Readme to address conflicts and make
 minor corrections

---
 README.md | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index b192afaf8..661d4452b 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ As the variety of use cases grows, so does the need to support:
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Lab offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
-The goal is to offer high-level APIs for developers to quickly get started working with their data without needing expertise in the underlying runtimes and frameworks.
+The goal is to offer high-level APIs for developers to quickly get started working with their data, without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
@@ -30,7 +30,7 @@ ## 📖 About
Data Prep Lab is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning.
-Data Prep Lab contributes a set of available modules that the user can get started with to easily build data pipelines suitable for their use case.
+Data Prep Lab contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case.
These modules have been tested in producing pre-training datasets for the [Granite](https://huggingface.co/instructlab/granite-7b-lab) open models.

The modules are built on common frameworks (for Spark and Ray), called the *data processing library*, that allow developers to build new custom modules that readily scale across a variety of runtimes.
Eventually, Data Prep Lab will offer consistent APIs and configurations across the following underlying runtimes:
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)

-Current support matrix for the above runtimes is shown in the table below.
+Current matrix of modules and supported runtimes is shown in the table below.
+Contributors are welcome to add new modules as well as add runtime support for existing modules!
+
-|Transform | Python-only | Ray | Spark | KFP on Ray |
+|Modules | Python-only | Ray | Spark | KFP on Ray |
|------------------------------ |-------------------|------------------|--------------------|------------------------|
|No-op / template |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
|Doc ID annotation |:white_check_mark: |:white_check_mark:| |:white_check_mark: |
@@ -58,11 +60,13 @@ Features of the toolkit:
- Aiming to ease the unstructured data prep burden for the "long tail" of LLM use cases
-- Growing set of data transform implementations across multiple runtimes and scales of data
+- Growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing
+- Growing set of sample pipelines developed for real enterprise use cases
- [Data processing library](data-processing-lib) to enable contribution of new custom modules targeting new use cases
- [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp) for no-code data prep

Data modalities supported:
+
* Code - support for code datasets as downloaded .zip files of github repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like code transforms will operate on parquet files.
@@ -85,9 +89,8 @@ A transform can follow one of two patterns: annotator or filter.

- **Annotator**: An annotator transform adds information during processing by adding one or more columns to the parquet file.
The annotator design also allows a user to verify the results of the processing before actual filtering of the data.

-- **Filter**: A filter transform processes the data and outputs the transformed data (for example, exact deduplication).
-A general purpose [SQL-based filter transform](transforms/filter) enables a powerful mechanism for identifying
-columns and rows of interest for downstream processing.
+- **Filter**: A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
+A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

#### Scaling of transforms:
@@ -145,12 +148,13 @@ Get started by running the noop transform that performs an identity operation by
Get started by building a data pipeline with our example pipeline (link to be added) that can run on a laptop.

### Build your own sequence of transforms
-Follow the documentation [here](doc/overview.md) to build your own pipelines.
+Follow the documentation [here](data-processing-lib/doc/overview.md) to build your own pipelines.

### Automate the pipeline
The data preprocessing can be automated by running transformers as a KubeFlow pipeline (KFP).
-See a simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md), and [multi-steps pipeline](kfp/doc/multi_transform_pipeline.md)
-if you want to combine several data transformation steps.
+See a simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md).
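+
+For a feel of what a pipeline definition looks like, here is a rough, hypothetical sketch in the style of the KFP v1 SDK (the image name and arguments are made up; the tutorial shows the real pipelines):
+
+```python
+# Hypothetical single-step KFP v1 pipeline wrapping a transform container.
+import kfp
+from kfp import dsl
+
+@dsl.pipeline(name="noop-pipeline", description="Run one transform step.")
+def noop_pipeline():
+    dsl.ContainerOp(
+        name="noop-transform",
+        image="example.org/noop-transform:latest",  # hypothetical image
+        arguments=["--input", "/data/in", "--output", "/data/out"],
+    )
+
+if __name__ == "__main__":
+    kfp.compiler.Compiler().compile(noop_pipeline, "noop_pipeline.yaml")
+```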
+Future releases of Data Prep Lab will demonstrate how multiple simple transform pipelines can be combined into a single KFP pipeline.

The project facilitates the creation of a local Kind cluster with all the required software and test data.
To work with the Kind cluster and KFP, you need to install several pre-required software packages. Please refer to