add new lines in markdown documents before lists/bullets #119

Merged · 1 commit · May 13, 2024
4 changes: 2 additions & 2 deletions .secrets.baseline
@@ -3,7 +3,7 @@
"files": null,
"lines": null
},
"generated_at": "2024-04-25T18:36:43Z",
"generated_at": "2024-05-13T11:04:51Z",
"plugins_used": [
{
"name": "AWSKeyDetector"
@@ -92,7 +92,7 @@
"hashed_secret": "f15e1014a6f4234f0d394a979a45f5983c9fbc7f",
"is_secret": false,
"is_verified": false,
"line_number": 53,
"line_number": 55,
"type": "Secret Keyword",
"verified_result": null
}
3 changes: 3 additions & 0 deletions data-processing-lib/doc/advanced-transform-tutorial.md
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show the
the operation of our _noop_ transform.

The complete task involves the following:

* EdedupTransform - class that implements the specific transformation
* EdedupRuntime - class that implements custom TransformRuntime to create supporting Ray objects and enhance job output
statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
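
To make the shape of this contract concrete, here is a minimal sketch of such a class. It is illustrative only: the import path is inferred from the link above, the `doc_column` parameter is hypothetical, and the real ededup transform is considerably more involved (it hashes documents via Ray actors).

```python
from typing import Any

import pyarrow as pa

# Import path assumed from the AbstractTableTransform link above.
from data_processing.transform.table_transform import AbstractTableTransform


class EdedupTransform(AbstractTableTransform):
    """Illustrative sketch only; not the tutorial's verbatim implementation."""

    def __init__(self, config: dict[str, Any]):
        # Configuration arrives as a plain dictionary, built in this tutorial
        # from the command line arguments defined below.
        self.doc_column = config.get("doc_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        # Keep only the first occurrence of each document's content.
        seen: set[str] = set()
        keep: list[bool] = []
        for doc in table[self.doc_column].to_pylist():
            keep.append(doc not in seen)
            seen.add(doc)
        filtered = table.filter(pa.array(keep))
        # The metadata dictionary is aggregated into the job's statistics.
        return [filtered], {"rows_in": table.num_rows, "rows_out": filtered.num_rows}
```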
@@ -138,6 +140,7 @@ First, let's define the transform runtime class. To do this we extend
the base abstract/interface class
[DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
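
A corresponding runtime sketch follows. This is a guess at the shape, not the library's confirmed API: the `get_transform_config` hook name and its signature are assumptions based on this tutorial's description of a runtime that creates supporting Ray objects and hands them to the transform.

```python
from typing import Any

import ray

# Import path assumed from the DefaultTableTransformRuntime link above.
from data_processing.runtime.ray.transform_runtime import DefaultTableTransformRuntime


class EdedupRuntime(DefaultTableTransformRuntime):
    """Illustrative sketch only."""

    def __init__(self, params: dict[str, Any]):
        super().__init__(params)
        self.params = params

    def get_transform_config(self, files: list[str]) -> dict[str, Any]:
        # Hypothetical hook: place shared state (e.g., a hash cache) in the
        # Ray object store and hand the reference to every worker through the
        # transform's configuration dictionary.
        shared_cache = ray.put({})
        return {**self.params, "hash_cache": shared_cache}
```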
10 changes: 7 additions & 3 deletions data-processing-lib/doc/architecture.md
@@ -32,32 +32,36 @@ It uses the following components, all of which can/do define CLI configuration p
After all parameters are validated, the ray cluster is started and the DataAccessFactory, TransformOrchestratorConfiguration
and TransformConfiguration are given to the Ray Orchestrator, via Ray remote() method invocation.
The Launcher waits for the Ray Orchestrator to complete.
* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of

* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
the data processing job. It creates the actors, determines the set of input data and distributes the
references to the data files to be processed by the workers. More specifically, it performs the following:

1. Uses the DataAccess instance created by the DataAccessFactory to determine the set of files
to be processed.
2. uses the TransformConfiguration to create the TransformRuntime instance
3. Uses the TransformRuntime to optionally apply additional configuration (ray object storage, etc) for the configuration
and operation of the Transform.
3. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
4. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
to execute transformers in parallel, providing the following to each worker:
* Ray worker configuration
* DataAccessFactory
* Transform class and its TransformConfiguration containing the CLI parameters and any TransformRuntime additions.
4. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
5. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.

Additionally, to provide monitoring of long-running transforms, the orchestrator is instrumented with
[custom metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html) that are exported to localhost:8080 (the endpoint that
Prometheus would be configured to scrape).
Once all data is processed, the orchestrator collects execution statistics (from the statistics actor)
and builds and saves them in the form of execution metadata (`metadata.json`). Finally, it returns the execution
result to the Launcher.

* [Ray worker](../src/data_processing/runtime/ray/transform_table_processor.py) is responsible for
reading files (as [PyArrow Tables](https://levelup.gitconnected.com/deep-dive-into-pyarrow-understanding-its-features-and-benefits-2cce8b1466c8))
assigned by the orchestrator, applying the transform to the input table and writing out the
resulting table(s). Metadata produced by each table transformation is aggregated into
Transform Statistics (below).

* [Transform Statistics](../src/data_processing/runtime/ray/transform_statistics.py) is a general
purpose data collector actor aggregating the numeric metadata from different places of
the framework (especially metadata produced by the transform).
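
The division of labor above can be summarized in a deliberately stripped-down sketch. All names here are hypothetical stand-ins for the orchestrator, worker, and statistics pieces linked above:

```python
import ray

ray.init()


@ray.remote
class Worker:
    """Stand-in for a Ray worker holding a transform and a DataAccessFactory."""

    def process(self, file_name: str) -> dict:
        # A real worker reads the file as a PyArrow Table, applies the
        # transform, writes the resulting table(s), and returns metadata.
        return {"file": file_name, "rows": 0}


def orchestrate(files: list[str], n_workers: int = 2) -> list[dict]:
    # Steps 4/5 above: create the workers, then hand out the input file
    # names in a round-robin fashion.
    workers = [Worker.remote() for _ in range(n_workers)]
    futures = [workers[i % n_workers].process.remote(name) for i, name in enumerate(files)]
    # The metadata returned by each worker is what Transform Statistics aggregates.
    return ray.get(futures)


print(orchestrate(["part-0.parquet", "part-1.parquet", "part-2.parquet"]))
```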
3 changes: 3 additions & 0 deletions data-processing-lib/doc/simplest-transform-tutorial.md
@@ -15,11 +15,13 @@ one table to another. That said, we will show the following:
the operation of our _noop_ transform.

We will **not** be showing the following:

* The creation of a custom TransformRuntime that would enable more global
state and/or coordination among the transforms running in other ray actors.
This will be covered in an advanced tutorial.

The complete task involves the following:

* NOOPTransform - class that implements the specific transformation
* NOOPTableTransformConfiguration - class that provides configuration for the
NOOPTransform, specifically the command line arguments used to configure it.
@@ -37,6 +39,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
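
Sketched out, with the import path assumed from the link above, the class might start like this (illustrative, not the tutorial's verbatim code):

```python
from typing import Any

import pyarrow as pa

# Import path assumed from the AbstractTableTransform link above.
from data_processing.transform.table_transform import AbstractTableTransform


class NOOPTransform(AbstractTableTransform):
    """A transform that passes its input table through unchanged."""

    def __init__(self, config: dict[str, Any]):
        # The dictionary is populated from the command line arguments defined
        # below; a no-op needs nothing from it, but we keep it for reference.
        self.config = config

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        # One table in, the same table out, plus metadata that the framework
        # folds into the job's Transform Statistics.
        return [table], {"nrows": table.num_rows}
```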
3 changes: 2 additions & 1 deletion data-processing-lib/doc/testing-e2e-transform.md
@@ -1,4 +1,5 @@
# Testing End-to-End Transform operation
WIP - Points to discuss

1. Reading input files and writing output files.
2. Testing of the transform runtime and use of ray components in the transform
2. Testing of the transform runtime and use of ray components in the transform
1 change: 1 addition & 0 deletions data-processing-lib/doc/transform-external-resources.md
@@ -8,6 +8,7 @@ In addition to actually loading the resource(s), the transform needs to define t
defines the location of the domain list.

In the next sections we cover the following:

1. How to define the transform-specific resource location(s) as command line arguments
2. How to load the transform-specific resources, either or both of:
1. During transform initialization - this is useful for testing outside of ray, and optionally
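
As a rough sketch of both halves (the argument name and transform class below are hypothetical, echoing the domain-list example above):

```python
from argparse import ArgumentParser
from typing import Any


def add_input_params(parser: ArgumentParser) -> None:
    # 1. Expose the resource location as a transform-specific CLI argument.
    parser.add_argument(
        "--blocklist_domain_list_path",
        type=str,
        default="blocklist/domains.txt",
        help="Location of the file holding the blocked domains",
    )


class BlockListTransform:
    def __init__(self, config: dict[str, Any]):
        # 2.i. Load the resource during transform initialization; this is
        # handy when testing the transform outside of ray.
        path = config["blocklist_domain_list_path"]
        with open(path) as file:
            self.blocked_domains = {line.strip() for line in file if line.strip()}
```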
1 change: 1 addition & 0 deletions data-processing-lib/doc/transform-standalone-testing.md
@@ -15,6 +15,7 @@ transform implementation tests will easily leverage.

The first (currently the only test) is the `test_transform()` method that takes the
following inputs:

* the transform implementation being tested, properly configured with the configuration
dictionary for the associated test data.
* a list of N (1 or more) input tables to be processed with the transform's `transform(Table)` method.
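
The excerpt above does not show the method's exact signature, so the following is only a guess at its shape; all names are illustrative:

```python
import pyarrow as pa


def test_transform(transform, in_tables: list[pa.Table], expected_tables: list[pa.Table]) -> None:
    # Run every input table through the configured transform and compare the
    # produced table(s) against the expected ones.
    for src, expected in zip(in_tables, expected_tables):
        produced, _metadata = transform.transform(src)
        assert produced[0].equals(expected)
```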
3 changes: 3 additions & 0 deletions data-processing-lib/doc/transform-tutorials.md
@@ -44,16 +44,19 @@ not need this feature, a default implementation is provided to return an empty l
The [TransformConfiguration](../src/data_processing/transform/transform_configuration.py)
serves as an interface and must be implemented by any `AbstractTableTransform`
implementation to provide the following configuration:

* the transform class to be used
* the command line arguments used to initialize the Transform Runtime and, generally, the Transform
* the Transform Runtime class to use
* the transform short name

It is expected that transforms are initialized with a fixed name, the class of their corresponding
`AbstractTableTransform` implementation and optionally the configuration keys that should not
be exposed as metadata for a run.
To support command line configuration, the `TransformConfiguration` extends the
[CLIArgumentProvider](../src/data_processing/utils/cli_utils.py) class.
The set of methods of interest are:

* ```__init__(self, name: str, transform_class: type[AbstractTableTransform], remove_from_metadata: list[str])``` - sets the required fields
* ```add_input_params(self, parser:ArgumentParser)``` - adds transform-specific command line options that will
be made available in the dictionary provided to the transform's initializer.
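
Pulling those methods together, a configuration for the noop transform from the simplest-transform tutorial might look roughly like this. The argument name and metadata key are hypothetical, the import path is assumed from the link above, and `NOOPTransform` refers to the class sketched in that tutorial:

```python
from argparse import ArgumentParser

# Import path assumed from the TransformConfiguration link above.
from data_processing.transform.transform_configuration import TransformConfiguration


class NOOPTransformConfiguration(TransformConfiguration):
    def __init__(self):
        # Fixed short name, transform class, and keys to keep out of run metadata.
        super().__init__(
            name="noop",
            transform_class=NOOPTransform,  # the AbstractTableTransform sketched earlier
            remove_from_metadata=["noop_secret"],
        )

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Options added here appear in the dictionary handed to the
        # transform's initializer.
        parser.add_argument("--noop_sleep_sec", type=int, default=0,
                            help="Seconds to sleep per table (illustrative)")
```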
4 changes: 3 additions & 1 deletion data-processing-lib/doc/transformer-utilities.md
@@ -2,6 +2,7 @@

A class [TransformUtils](../src/data_processing/utils/transform_utils.py) provides several methods that simplify
a transformer's implementation. Currently, it includes the following methods:

* `deep_get_size` is a method to get the complete size of a Python object, based on
https://www.askpython.com/python/built-in-methods/variables-memory-size-in-python
It supports the Python structures list, tuple, and set
@@ -17,8 +18,9 @@ be removed before it is added
removes URL encoding

It also contains two variables:

* `RANDOM_SEED` - the number used as a seed by methods that require one
* `LOCAL_TO_DISK` - a rough conversion of local size to size on disk/S3
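
A generic recursive sizing helper in the spirit of `deep_get_size` might look like this (illustrative, not the library's implementation):

```python
import sys
from typing import Any


def deep_get_size(obj: Any, seen: set | None = None) -> int:
    """Roughly compute the complete size of a Python object, including children."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0  # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_get_size(k, seen) + deep_get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_get_size(item, seen) for item in obj)
    return size


print(deep_get_size({"docs": ["a" * 100, "b" * 100]}))
```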

This class should be extended with additional methods that are generally useful across multiple transformers, and documentation
should be added here
should be added here
1 change: 1 addition & 0 deletions doc/data-processing.md
@@ -14,6 +14,7 @@ to, for example:
* Filter the table to remove or edit rows and/or columns, for example to remove rows from blocked domains.

The table is generally expected to have something like the following minimal set of columns:

* URL source of the document (can be used for domain block listing)
* Document id
* Contents of the actual document to be used for LLM training
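
For illustration, a table with that minimal shape could be built as follows (column names here are placeholders, not a required schema):

```python
import pyarrow as pa

table = pa.table(
    {
        "url": ["https://example.com/page"],  # source; usable for domain block listing
        "document_id": ["doc-0001"],
        "contents": ["Text of the document to be used for LLM training ..."],
    }
)
print(table.schema)
```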
2 changes: 2 additions & 0 deletions kfp/kfp_ray_components/README.md
@@ -12,6 +12,7 @@ A pipeline component is a self-contained set of code that performs one step in a
````

The first step in the creation of components is their implementation. The framework automation includes the following 3 components:

* [Create Ray cluster](src/create_ray_cluster.py) is responsible for creation of the Ray cluster. Its implementation is
based on the [RayRemoteJobs class](../kfp_support_lib/src/kfp_support/workflow_support/README.md)
* [execute Ray job](src/execute_ray_job.py) is responsible for submission of the Ray job, watching its execution,
@@ -29,6 +30,7 @@ command-line arguments to pass to your component’s code.
* The component’s metadata, such as the name and description.

Component specifications are provided here:

* [Create Ray cluster Component](createRayComponent.yaml)
* [execute Ray job component](executeRayJobComponent.yaml)
* [clean up Ray cluster component](cleanupRayComponent.yaml)
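
Loading these specifications in a pipeline definition is standard KFP usage; a minimal sketch, assuming the yaml files above are on the local path:

```python
import kfp.components as comp

# Each yaml carries the image, command-line arguments, and metadata described
# in this README; loading it yields a reusable pipeline op factory.
create_ray_op = comp.load_component_from_file("createRayComponent.yaml")
execute_job_op = comp.load_component_from_file("executeRayJobComponent.yaml")
cleanup_ray_op = comp.load_component_from_file("cleanupRayComponent.yaml")
```

Each factory is then called inside a `@dsl.pipeline` function with whatever inputs its specification declares.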
1 change: 1 addition & 0 deletions kfp/kfp_support_lib/README.md
@@ -2,6 +2,7 @@

This provides support for implementing KFP pipelines that automate transform execution.
It comprises 2 main modules:

* [api server client](src/kfp_support/api_server_client/README.md)
* [workflow support](src/kfp_support/workflow_support/README.md)

1 change: 1 addition & 0 deletions transforms/README.md
@@ -120,6 +120,7 @@ To push all the images run `make push`, or `make -C <path to transform directory

### IDE Setup
When running in an IDE, such as PyCharm, the following are generally required:

* From the command line, build the venv using `make venv`.
* In the IDE
* Set your project/run configuration to use the venv/bin/python as your runtime virtual environment.