diff --git a/.secrets.baseline b/.secrets.baseline
index a61a47652..9648681e1 100644
--- a/.secrets.baseline
+++ b/.secrets.baseline
@@ -3,7 +3,7 @@
     "files": null,
     "lines": null
   },
-  "generated_at": "2024-04-25T18:36:43Z",
+  "generated_at": "2024-05-13T11:04:51Z",
   "plugins_used": [
     {
       "name": "AWSKeyDetector"
@@ -92,7 +92,7 @@
         "hashed_secret": "f15e1014a6f4234f0d394a979a45f5983c9fbc7f",
         "is_secret": false,
         "is_verified": false,
-        "line_number": 53,
+        "line_number": 55,
         "type": "Secret Keyword",
         "verified_result": null
       }
diff --git a/data-processing-lib/doc/advanced-transform-tutorial.md b/data-processing-lib/doc/advanced-transform-tutorial.md
index 886e8d192..d7bad46fb 100644
--- a/data-processing-lib/doc/advanced-transform-tutorial.md
+++ b/data-processing-lib/doc/advanced-transform-tutorial.md
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show the
 the operation of our _noop_ transform.
 The complete task involves the following:
+
 * EdedupTransform - class that implements the specific transformation
 * EdedupRuntime - class that implements custom TransformRuntime to create
   supporting Ray objects and enhance job output statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
 the base abstract/interface class
 [AbstractTableTransform](../src/data_processing/transform/table_transform.py),
 which requires definition of the following:
+
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration data.
   For this example, the configuration data will only be defined by command line arguments (defined below).
@@ -138,6 +140,7 @@ First, let's define the transform runtime class. To do this we extend
 the base abstract/interface class
 [DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py),
 which requires definition of the following:
+
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration data.
   For this example, the configuration data will only be defined by command line arguments (defined below).
diff --git a/data-processing-lib/doc/architecture.md b/data-processing-lib/doc/architecture.md
index 34ca36258..e66cff872 100644
--- a/data-processing-lib/doc/architecture.md
+++ b/data-processing-lib/doc/architecture.md
@@ -32,20 +32,22 @@ It uses the following components, all of which can/do define CLI configuration p
 After all parameters are validated, the ray cluster is started and the DataAccessFactory,
 TransformOrchestratorConfiguration and TransformConfiguration are given to the Ray Orchestrator
 via Ray remote() method invocation. The Launcher waits for the Ray Orchestrator to complete.
-* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
+
+* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
   the data processing job. It creates the actors, determines the set of input data and distributes the
   references to the data files to be processed by the workers. More specifically, it performs the following:
+
   1. Uses the DataAccess instance created by the DataAccessFactory to determine the set of files to be processed.
   2. uses the TransformConfiguration to create the TransformRuntime instance
   3. Uses the TransformRuntime to optionally apply additional configuration (ray object storage, etc) for the
      configuration and operation of the Transform.
-  3. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
+  4. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
      to execute transformers in parallel, providing the following to each worker:
      * Ray worker configuration
      * DataAccessFactory
      * Transform class and its TransformConfiguration containing the CLI parameters and any TransformRuntime additions.
-  4. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
+  5. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
   Additionally, to provide monitoring of long-running transforms, the orchestrator is instrumented with
   [custom metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html),
   which are exported to localhost:8080 (this is the endpoint that
@@ -53,11 +55,13 @@ It uses the following components, all of which can/do define CLI configuration p
 Once all data is processed, the orchestrator will collect execution statistics (from the statistics actor)
 and build and save them in the form of execution metadata (`metadata.json`). Finally, it will return the
 execution result to the Launcher.
+
 * [Ray worker](../src/data_processing/runtime/ray/transform_table_processor.py) is responsible for reading files
   (as [PyArrow Tables](https://levelup.gitconnected.com/deep-dive-into-pyarrow-understanding-its-features-and-benefits-2cce8b1466c8))
   assigned by the orchestrator, applying the transform to the input table and writing out the resulting table(s).
   Metadata produced by each table transformation is aggregated into Transform Statistics (below).
+
 * [Transform Statistics](../src/data_processing/runtime/ray/transform_statistics.py) is a general purpose
   data collector actor aggregating the numeric metadata from different places of the framework
   (especially metadata produced by the transform).
diff --git a/data-processing-lib/doc/simplest-transform-tutorial.md b/data-processing-lib/doc/simplest-transform-tutorial.md
index 940c795a9..1bcea0ffd 100644
--- a/data-processing-lib/doc/simplest-transform-tutorial.md
+++ b/data-processing-lib/doc/simplest-transform-tutorial.md
@@ -15,11 +15,13 @@ one table to another. That said, we will show the following:
 the operation of our _noop_ transform.
 We will **not** be showing the following:
+
 * The creation of a custom TransformRuntime that would enable more global state and/or coordination
   among the transforms running in other ray actors. This will be covered in an advanced tutorial.
 The complete task involves the following:
+
 * NOOPTransform - class that implements the specific transformation
 * NOOPTableTransformConfiguration - class that provides configuration for the NOOPTransform,
   specifically the command line arguments used to configure it.
@@ -37,6 +39,7 @@ First, let's define the transform class. To do this we extend
 the base abstract/interface class
 [AbstractTableTransform](../src/data_processing/transform/table_transform.py),
 which requires definition of the following:
+
 * an initializer (i.e. `init()`) that accepts a dictionary of configuration data. For this example,
   the configuration data will only be defined by command line arguments (defined below).
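To make the class shape described in these tutorial hunks concrete, here is a minimal sketch of a no-op transform. The import path, base-class name, and the `transform()` return shape are assumptions inferred from the links and prose above, not a verified API; the `noop_sleep_sec` key is a hypothetical configuration parameter.

```python
# Illustrative sketch only: AbstractTableTransform's module path and the
# expected transform() return shape are assumptions taken from the docs above.
from typing import Any

import pyarrow as pa

from data_processing.transform import AbstractTableTransform  # assumed path


class NOOPTransform(AbstractTableTransform):
    """Returns each input table unchanged, plus simple per-table metadata."""

    def __init__(self, config: dict[str, Any]):
        # The dictionary is typically populated from the CLI arguments
        # defined by the corresponding TransformConfiguration (see below).
        self.sleep_sec = config.get("noop_sleep_sec", 0)  # hypothetical key

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        # The metadata returned here is what gets aggregated into the
        # Transform Statistics actor described in the architecture document.
        return [table], {"nrows": table.num_rows}
```

The one-to-many return type (`list[pa.Table]`) matches the tutorials' note that a transform converts one table to another but may also produce multiple output tables.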
diff --git a/data-processing-lib/doc/testing-e2e-transform.md b/data-processing-lib/doc/testing-e2e-transform.md
index 657826cd0..8ae4f7d73 100644
--- a/data-processing-lib/doc/testing-e2e-transform.md
+++ b/data-processing-lib/doc/testing-e2e-transform.md
@@ -1,4 +1,5 @@
 # Testing End-to-End Transform operation
 WIP - Points to discuss
+
 1. Reading input files and writing output files.
-2. Testing of the transform runtime and use of ray components in the transform
\ No newline at end of file
+2. Testing of the transform runtime and use of ray components in the transform
diff --git a/data-processing-lib/doc/transform-external-resources.md b/data-processing-lib/doc/transform-external-resources.md
index 89e2b2b9c..30f9e90b8 100644
--- a/data-processing-lib/doc/transform-external-resources.md
+++ b/data-processing-lib/doc/transform-external-resources.md
@@ -8,6 +8,7 @@ In addition to actually loading the resource(s), the transform needs to define t
 defines the location of the domain list.
 In the next sections we cover the following:
+
 1. How to define the transform-specific resource location(s) as command line arguments
 2. How to load the transform-specific resources, either or both of:
    1. During transform initialization - this is useful for testing outside of ray, and optionally
diff --git a/data-processing-lib/doc/transform-standalone-testing.md b/data-processing-lib/doc/transform-standalone-testing.md
index 2d001f965..02070df2d 100644
--- a/data-processing-lib/doc/transform-standalone-testing.md
+++ b/data-processing-lib/doc/transform-standalone-testing.md
@@ -15,6 +15,7 @@ transform implementation tests will easily leverage.
 The first (currently the only test) is the `test_transform()` method that takes the following inputs:
+
 * the transform implementation being tested, properly configured with the configuration dictionary
   for the associated test data.
 * a list of N (1 or more) input tables to be processed with the transform's `transform(Table)` method.
diff --git a/data-processing-lib/doc/transform-tutorials.md b/data-processing-lib/doc/transform-tutorials.md
index 1bf383107..3b41b52dd 100644
--- a/data-processing-lib/doc/transform-tutorials.md
+++ b/data-processing-lib/doc/transform-tutorials.md
@@ -44,16 +44,19 @@ not need this feature, a default implementation is provided to return an empty l
 The [TransformConfiguration](../src/data_processing/transform/transform_configuration.py) serves as an interface
 and must be implemented by any `AbstractTableTransform` implementation to provide the following configuration:
+
 * the transform class to be used,
 * command line arguments used to initialize the Transform Runtime and generally, the Transform.
 * Transform Runtime class to use
 * transform short name
+
 It is expected that transforms are initialized with a fixed name, the class of their corresponding
 `AbstractTableTransform` implementation and optionally the configuration keys that should not be
 exposed as metadata for a run.
 To support command line configuration, the `TransformConfiguration` extends the
 [CLIArgumentProvider](../src/data_processing/utils/cli_utils.py) class.
 The set of methods of interest are:
+
 * ```__init__(self, name:str, transform_class:type[AbstractTableTransform], remove_from_metadata:list[str])``` - sets the required fields
 * ```add_input_params(self, parser:ArgumentParser)``` - adds transform-specific command line options that will be made available
   in the dictionary provided to the transform's initializer.
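As a concrete illustration of the two `TransformConfiguration` methods quoted above, a configuration for the no-op transform might look like the following sketch. The base-class import path and keyword argument names mirror the signatures in the hunk above but are otherwise unverified assumptions, and `NOOPTransform` is the class sketched in the earlier example.

```python
# Sketch of a TransformConfiguration subclass, assuming the __init__ and
# add_input_params signatures quoted above (module path is an assumption).
from argparse import ArgumentParser

from data_processing.transform import TransformConfiguration  # assumed path


class NOOPTransformConfiguration(TransformConfiguration):
    def __init__(self):
        super().__init__(
            name="noop",                    # fixed transform short name
            transform_class=NOOPTransform,  # class from the earlier sketch
            remove_from_metadata=[],        # config keys to hide from run metadata
        )

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Options added here end up in the dictionary passed to
        # NOOPTransform.__init__() by the framework.
        parser.add_argument(
            "--noop_sleep_sec", type=int, default=0,
            help="Seconds to sleep per table (hypothetical option).",
        )
```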
diff --git a/data-processing-lib/doc/transformer-utilities.md b/data-processing-lib/doc/transformer-utilities.md
index 41632bad6..52ec866b4 100644
--- a/data-processing-lib/doc/transformer-utilities.md
+++ b/data-processing-lib/doc/transformer-utilities.md
@@ -2,6 +2,7 @@ A class [TransformUtils](../src/data_processing/utils/transform_utils.py) provides several
 methods that simplify a transformer's implementation.
 Currently it includes the following methods:
+
 * `deep_get_size` is a method to get the complete size of a Python object, based on
   https://www.askpython.com/python/built-in-methods/variables-memory-size-in-python.
   It supports the Python structures list, tuple and set.
@@ -17,8 +18,9 @@ be removed before it is added
 removes URL encoding
 It also contains two variables:
+
 * `RANDOM_SEED` - the number that is used as a seed by methods that require one
 * `LOCAL_TO_DISK` - a rough ratio of local size to size on disk/S3
 This class should be extended with additional methods that are generally useful across multiple transformers, and documentation
-should be added here
\ No newline at end of file
+should be added here
diff --git a/doc/data-processing.md b/doc/data-processing.md
index cc3de6e47..efc7d1a68 100644
--- a/doc/data-processing.md
+++ b/doc/data-processing.md
@@ -14,6 +14,7 @@ to, for example:
 * Filter the table to remove or edit rows and/or columns, for example to remove rows from blocked domains.
 The table is generally expected to have something like the following minimal set of columns:
+
 * URL source of the document (can be used for domain block listing)
 * Document id
 * Contents of the actual document to be used for LLM training
diff --git a/kfp/kfp_ray_components/README.md b/kfp/kfp_ray_components/README.md
index 86a55d2f9..e9159b27d 100644
--- a/kfp/kfp_ray_components/README.md
+++ b/kfp/kfp_ray_components/README.md
@@ -12,6 +12,7 @@ A pipeline component is a self-contained set of code that performs one step in a
 ````
 The first step in the creation of components is their implementation.
 The framework automation includes the following 3 components:
+
 * [Create Ray cluster](src/create_ray_cluster.py) is responsible for creation of the Ray cluster. Its implementation
   is based on the [RayRemoteJobs class](../kfp_support_lib/src/kfp_support/workflow_support/README.md)
 * [execute Ray job](src/execute_ray_job.py) is responsible for submission of the Ray job, watching its execution,
@@ -29,6 +30,7 @@ command-line arguments to pass to your component’s code.
 * The component’s metadata, such as the name and description.
 Component specifications are provided here:
+
 * [Create Ray cluster Component](createRayComponent.yaml)
 * [execute Ray job component](executeRayJobComponent.yaml)
 * [clean up Ray cluster component](cleanupRayComponent.yaml)
diff --git a/kfp/kfp_support_lib/README.md b/kfp/kfp_support_lib/README.md
index f42ce9808..86f3f4360 100644
--- a/kfp/kfp_support_lib/README.md
+++ b/kfp/kfp_support_lib/README.md
@@ -2,6 +2,7 @@
 This provides support for implementing KFP pipelines that automate a transform's execution.
 It comprises two main modules:
+
 * [api server client](src/kfp_support/api_server_client/README.md)
 * [workflow support](src/kfp_support/workflow_support/README.md)
diff --git a/transforms/README.md b/transforms/README.md
index 09e923284..159bd2c4d 100644
--- a/transforms/README.md
+++ b/transforms/README.md
@@ -120,6 +120,7 @@ To push all the images run `make push`, or `make -C