add new lines in markdown documents before lists/bullets #119

Merged · 1 commit · May 13, 2024
4 changes: 2 additions & 2 deletions .secrets.baseline
@@ -3,7 +3,7 @@
"files": null,
"lines": null
},
"generated_at": "2024-04-25T18:36:43Z",
"generated_at": "2024-05-13T11:04:51Z",
"plugins_used": [
{
"name": "AWSKeyDetector"
@@ -92,7 +92,7 @@
"hashed_secret": "f15e1014a6f4234f0d394a979a45f5983c9fbc7f",
"is_secret": false,
"is_verified": false,
"line_number": 53,
"line_number": 55,
"type": "Secret Keyword",
"verified_result": null
}
3 changes: 3 additions & 0 deletions data-processing-lib/doc/advanced-transform-tutorial.md
@@ -13,6 +13,7 @@ removes duplicate documents across all files. In this tutorial, we will show the
the operation of our _noop_ transform.

The complete task involves the following:

* EdedupTransform - class that implements the specific transformation
* EdedupRuntime - class that implements custom TransformRuntime to create supporting Ray objects and enhance job output
statistics
@@ -39,6 +40,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
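
To make the shape of this contract concrete, here is a minimal sketch of such a class. It is illustrative only: the import path is inferred from the link above, the `doc_column` parameter is hypothetical, and the real ededup transform is considerably more involved (it hashes documents via Ray actors).

```python
from typing import Any

import pyarrow as pa

# Import path assumed from the AbstractTableTransform link above.
from data_processing.transform.table_transform import AbstractTableTransform


class EdedupTransform(AbstractTableTransform):
    """Illustrative sketch only; not the tutorial's verbatim implementation."""

    def __init__(self, config: dict[str, Any]):
        # Configuration arrives as a plain dictionary, built in this tutorial
        # from the command line arguments defined below.
        self.doc_column = config.get("doc_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        # Keep only the first occurrence of each document's content.
        seen: set[str] = set()
        keep: list[bool] = []
        for doc in table[self.doc_column].to_pylist():
            keep.append(doc not in seen)
            seen.add(doc)
        filtered = table.filter(pa.array(keep))
        # The metadata dictionary is aggregated into the job's statistics.
        return [filtered], {"rows_in": table.num_rows, "rows_out": filtered.num_rows}
```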
@@ -138,6 +140,7 @@ First, let's define the transform runtime class. To do this we extend
the base abstract/interface class
[DefaultTableTransformRuntime](../src/data_processing/runtime/ray/transform_runtime.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
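
A corresponding runtime sketch follows. This is a guess at the shape, not the library's confirmed API: the `get_transform_config` hook name and its signature are assumptions based on this tutorial's description of a runtime that creates supporting Ray objects and hands them to the transform.

```python
from typing import Any

import ray

# Import path assumed from the DefaultTableTransformRuntime link above.
from data_processing.runtime.ray.transform_runtime import DefaultTableTransformRuntime


class EdedupRuntime(DefaultTableTransformRuntime):
    """Illustrative sketch only."""

    def __init__(self, params: dict[str, Any]):
        super().__init__(params)
        self.params = params

    def get_transform_config(self, files: list[str]) -> dict[str, Any]:
        # Hypothetical hook: place shared state (e.g., a hash cache) in the
        # Ray object store and hand the reference to every worker through the
        # transform's configuration dictionary.
        shared_cache = ray.put({})
        return {**self.params, "hash_cache": shared_cache}
```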
10 changes: 7 additions & 3 deletions data-processing-lib/doc/architecture.md
@@ -32,32 +32,36 @@ It uses the following components, all of which can/do define CLI configuration p
After all parameters are validated, the ray cluster is started and the DataAccessFactory, TransformOrchestratorConfiguration
and TransformConfiguration are given to the Ray Orchestrator, via Ray remote() method invocation.
The Launcher waits for the Ray Orchestrator to complete.
* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of

* [Ray Orchestrator](../src/data_processing/runtime/ray/transform_orchestrator.py) is responsible for overall management of
the data processing job. It creates the actors, determines the set of input data and distributes the
references to the data files to be processed by the workers. More specifically, it performs the following:

1. Uses the DataAccess instance created by the DataAccessFactory to determine the set of files
to be processed.
2. uses the TransformConfiguration to create the TransformRuntime instance
3. Uses the TransformRuntime to optionally apply additional configuration (ray object storage, etc) for the configuration
and operation of the Transform.
3. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
4. uses the TransformOrchestratorConfiguration to determine the set of RayWorkers to create
to execute transformers in parallel, providing the following to each worker:
* Ray worker configuration
* DataAccessFactory
* Transform class and its TransformConfiguration containing the CLI parameters and any TransformRuntime additions.
4. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.
5. in a load-balanced, round-robin fashion, distributes the names of the input files to the workers for them to transform/process.

Additionally, to provide monitoring of long-running transforms, the orchestrator is instrumented with
[custom metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html) that are exported to localhost:8080 (the endpoint that
Prometheus would be configured to scrape).
Once all data is processed, the orchestrator collects execution statistics (from the statistics actor)
and builds and saves them in the form of execution metadata (`metadata.json`). Finally, it returns the execution
result to the Launcher.

* [Ray worker](../src/data_processing/runtime/ray/transform_table_processor.py) is responsible for
reading files (as [PyArrow Tables](https://levelup.gitconnected.com/deep-dive-into-pyarrow-understanding-its-features-and-benefits-2cce8b1466c8))
assigned by the orchestrator, applying the transform to the input table and writing out the
resulting table(s). Metadata produced by each table transformation is aggregated into
Transform Statistics (below).

* [Transform Statistics](../src/data_processing/runtime/ray/transform_statistics.py) is a general
purpose data collector actor aggregating the numeric metadata from different places of
the framework (especially metadata produced by the transform).
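
The division of labor above can be summarized in a deliberately stripped-down sketch. All names here are hypothetical stand-ins for the orchestrator, worker, and statistics pieces linked above:

```python
import ray

ray.init()


@ray.remote
class Worker:
    """Stand-in for a Ray worker holding a transform and a DataAccessFactory."""

    def process(self, file_name: str) -> dict:
        # A real worker reads the file as a PyArrow Table, applies the
        # transform, writes the resulting table(s), and returns metadata.
        return {"file": file_name, "rows": 0}


def orchestrate(files: list[str], n_workers: int = 2) -> list[dict]:
    # Steps 4/5 above: create the workers, then hand out the input file
    # names in a round-robin fashion.
    workers = [Worker.remote() for _ in range(n_workers)]
    futures = [workers[i % n_workers].process.remote(name) for i, name in enumerate(files)]
    # The metadata returned by each worker is what Transform Statistics aggregates.
    return ray.get(futures)


print(orchestrate(["part-0.parquet", "part-1.parquet", "part-2.parquet"]))
```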
3 changes: 3 additions & 0 deletions data-processing-lib/doc/simplest-transform-tutorial.md
@@ -15,11 +15,13 @@ one table to another. That said, we will show the following:
the operation of our _noop_ transform.

We will **not** be showing the following:

* The creation of a custom TransformRuntime that would enable more global
state and/or coordination among the transforms running in other ray actors.
This will be covered in an advanced tutorial.

The complete task involves the following:

* NOOPTransform - class that implements the specific transformation
* NOOPTableTransformConfiguration - class that provides configuration for the
NOOPTransform, specifically the command line arguments used to configure it.
@@ -37,6 +39,7 @@ First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
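
Sketched out, with the import path assumed from the link above, the class might start like this (illustrative, not the tutorial's verbatim code):

```python
from typing import Any

import pyarrow as pa

# Import path assumed from the AbstractTableTransform link above.
from data_processing.transform.table_transform import AbstractTableTransform


class NOOPTransform(AbstractTableTransform):
    """A transform that passes its input table through unchanged."""

    def __init__(self, config: dict[str, Any]):
        # The dictionary is populated from the command line arguments defined
        # below; a no-op needs nothing from it, but we keep it for reference.
        self.config = config

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        # One table in, the same table out, plus metadata that the framework
        # folds into the job's Transform Statistics.
        return [table], {"nrows": table.num_rows}
```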
3 changes: 2 additions & 1 deletion data-processing-lib/doc/testing-e2e-transform.md
@@ -1,4 +1,5 @@
# Testing End-to-End Transform operation
WIP - Points to discuss

1. Reading input files and writing output files.
2. Testing of the transform runtime and use of ray components in the transform
2. Testing of the transform runtime and use of ray components in the transform
1 change: 1 addition & 0 deletions data-processing-lib/doc/transform-external-resources.md
@@ -8,6 +8,7 @@ In addition to actually loading the resource(s), the transform needs to define t
defines the location of the domain list.

In the next sections we cover the following:

1. How to define the transform-specific resource location(s) as command line arguments
2. How to load the transform-specific resources, either or both of:
1. During transform initialization - this is useful for testing outside of ray, and optionally
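
As a rough sketch of both halves (the argument name and transform class below are hypothetical, echoing the domain-list example above):

```python
from argparse import ArgumentParser
from typing import Any


def add_input_params(parser: ArgumentParser) -> None:
    # 1. Expose the resource location as a transform-specific CLI argument.
    parser.add_argument(
        "--blocklist_domain_list_path",
        type=str,
        default="blocklist/domains.txt",
        help="Location of the file holding the blocked domains",
    )


class BlockListTransform:
    def __init__(self, config: dict[str, Any]):
        # 2.i. Load the resource during transform initialization; this is
        # handy when testing the transform outside of ray.
        path = config["blocklist_domain_list_path"]
        with open(path) as file:
            self.blocked_domains = {line.strip() for line in file if line.strip()}
```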
1 change: 1 addition & 0 deletions data-processing-lib/doc/transform-standalone-testing.md
@@ -15,6 +15,7 @@ transform implementation tests will easily leverage.

The first (currently the only test) is the `test_transform()` method that takes the
following inputs:

* the transform implementation being tested, properly configured with the configuration
dictionary for the associated test data.
* a list of N (1 or more) input tables to be processed with the transform's `transform(Table)` method.
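
The excerpt above does not show the method's exact signature, so the following is only a guess at its shape; all names are illustrative:

```python
import pyarrow as pa


def test_transform(transform, in_tables: list[pa.Table], expected_tables: list[pa.Table]) -> None:
    # Run every input table through the configured transform and compare the
    # produced table(s) against the expected ones.
    for src, expected in zip(in_tables, expected_tables):
        produced, _metadata = transform.transform(src)
        assert produced[0].equals(expected)
```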
3 changes: 3 additions & 0 deletions data-processing-lib/doc/transform-tutorials.md
@@ -44,16 +44,19 @@ not need this feature, a default implementation is provided to return an empty l
The [TransformConfiguration](../src/data_processing/transform/transform_configuration.py)
serves as an interface and must be implemented by any `AbstractTableTransform`
implementation to provide the following configuration:

* the transform class to be used
* the command line arguments used to initialize the Transform Runtime and, generally, the Transform
* the Transform Runtime class to use
* the transform short name

It is expected that transforms are initialized with a fixed name, the class of their corresponding
`AbstractTableTransform` implementation and optionally the configuration keys that should not
be exposed as metadata for a run.
To support command line configuration, the `TransformConfiguration` extends the
[CLIArgumentProvider](../src/data_processing/utils/cli_utils.py) class.
The set of methods of interest are:

* ```__init__(self, name: str, transform_class: type[AbstractTableTransform], remove_from_metadata: list[str])``` - sets the required fields
* ```add_input_params(self, parser:ArgumentParser)``` - adds transform-specific command line options that will
be made available in the dictionary provided to the transform's initializer.
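
Pulling those methods together, a configuration for the noop transform from the simplest-transform tutorial might look roughly like this. The argument name and metadata key are hypothetical, the import path is assumed from the link above, and `NOOPTransform` refers to the class sketched in that tutorial:

```python
from argparse import ArgumentParser

# Import path assumed from the TransformConfiguration link above.
from data_processing.transform.transform_configuration import TransformConfiguration


class NOOPTransformConfiguration(TransformConfiguration):
    def __init__(self):
        # Fixed short name, transform class, and keys to keep out of run metadata.
        super().__init__(
            name="noop",
            transform_class=NOOPTransform,  # the AbstractTableTransform sketched earlier
            remove_from_metadata=["noop_secret"],
        )

    def add_input_params(self, parser: ArgumentParser) -> None:
        # Options added here appear in the dictionary handed to the
        # transform's initializer.
        parser.add_argument("--noop_sleep_sec", type=int, default=0,
                            help="Seconds to sleep per table (illustrative)")
```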
4 changes: 3 additions & 1 deletion data-processing-lib/doc/transformer-utilities.md
@@ -2,6 +2,7 @@

A class [TransformUtils](../src/data_processing/utils/transform_utils.py) provides several methods that simplify
a transformer's implementation. Currently, it includes the following methods:

* `deep_get_size` is a method to get the complete size of a Python object, based on
https://www.askpython.com/python/built-in-methods/variables-memory-size-in-python
It supports the Python structures list, tuple, and set
@@ -17,8 +18,9 @@ be removed before it is added
removes URL encoding

It also contains two variables:

* `RANDOM_SEED` - the number used as a seed by methods that require one
* `LOCAL_TO_DISK` - a rough conversion of local size to size on disk/S3
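
A generic recursive sizing helper in the spirit of `deep_get_size` might look like this (illustrative, not the library's implementation):

```python
import sys
from typing import Any


def deep_get_size(obj: Any, seen: set | None = None) -> int:
    """Roughly compute the complete size of a Python object, including children."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0  # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_get_size(k, seen) + deep_get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_get_size(item, seen) for item in obj)
    return size


print(deep_get_size({"docs": ["a" * 100, "b" * 100]}))
```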

This class should be extended with additional methods that are generally useful across multiple transformers, and documentation
should be added here
should be added here
1 change: 1 addition & 0 deletions doc/data-processing.md
@@ -14,6 +14,7 @@ to, for example:
* Filter the table to remove or edit rows and/or columns, for example to remove rows from blocked domains.

The table is generally expected to have something like the following minimal set of columns:

* URL source of the document (can be used for domain block listing)
* Document id
* Contents of the actual document to be used for LLM training
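
For illustration, a table with that minimal shape could be built as follows (column names here are placeholders, not a required schema):

```python
import pyarrow as pa

table = pa.table(
    {
        "url": ["https://example.com/page"],  # source; usable for domain block listing
        "document_id": ["doc-0001"],
        "contents": ["Text of the document to be used for LLM training ..."],
    }
)
print(table.schema)
```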
2 changes: 2 additions & 0 deletions kfp/kfp_ray_components/README.md
@@ -12,6 +12,7 @@ A pipeline component is a self-contained set of code that performs one step in a
````

The first step in the creation of components is their implementation. The framework automation includes the following 3 components:

* [Create Ray cluster](src/create_ray_cluster.py) is responsible for creation of the Ray cluster. Its implementation is
based on the [RayRemoteJobs class](../kfp_support_lib/src/kfp_support/workflow_support/README.md)
* [execute Ray job](src/execute_ray_job.py) is responsible for submission of the Ray job, watching its execution,
@@ -29,6 +30,7 @@ command-line arguments to pass to your component’s code.
* The component’s metadata, such as the name and description.

Component specifications are provided here:

* [Create Ray cluster Component](createRayComponent.yaml)
* [execute Ray job component](executeRayJobComponent.yaml)
* [clean up Ray cluster component](cleanupRayComponent.yaml)
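
Loading these specifications in a pipeline definition is standard KFP usage; a minimal sketch, assuming the yaml files above are on the local path:

```python
import kfp.components as comp

# Each yaml carries the image, command-line arguments, and metadata described
# in this README; loading it yields a reusable pipeline op factory.
create_ray_op = comp.load_component_from_file("createRayComponent.yaml")
execute_job_op = comp.load_component_from_file("executeRayJobComponent.yaml")
cleanup_ray_op = comp.load_component_from_file("cleanupRayComponent.yaml")
```

Each factory is then called inside a `@dsl.pipeline` function with whatever inputs its specification declares.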
1 change: 1 addition & 0 deletions kfp/kfp_support_lib/README.md
@@ -2,6 +2,7 @@

This provides support for implementing KFP pipelines that automate transform execution.
It comprises 2 main modules:

* [api server client](src/kfp_support/api_server_client/README.md)
* [workflow support](src/kfp_support/workflow_support/README.md)

1 change: 1 addition & 0 deletions transforms/README.md
@@ -120,6 +120,7 @@ To push all the images run `make push`, or `make -C <path to transform directory

### IDE Setup
When running in an IDE, such as PyCharm, the following are generally required:

* From the command line, build the venv using `make venv`.
* In the IDE
* Set your project/run configuration to use the venv/bin/python as your runtime virtual environment.