From 0f7ce8f761b5bca7bd00b2b764fdcb2a8b80220b Mon Sep 17 00:00:00 2001 From: Dean Wampler Date: Fri, 10 May 2024 13:26:57 -0500 Subject: [PATCH 1/2] Typos and formatting suggestions. Signed-off-by: Dean Wampler --- .../doc/simplest-transform-tutorial.md | 73 +++++++++---------- transforms/universal/noop/README.md | 4 +- 2 files changed, 38 insertions(+), 39 deletions(-) diff --git a/data-processing-lib/doc/simplest-transform-tutorial.md b/data-processing-lib/doc/simplest-transform-tutorial.md index 9d16c7117..83b84fd40 100644 --- a/data-processing-lib/doc/simplest-transform-tutorial.md +++ b/data-processing-lib/doc/simplest-transform-tutorial.md @@ -15,33 +15,33 @@ one table to another. That said, we will show the following: the operation of our _noop_ transform. We will **not** be showing the following: -* The creation of a custom TransformRuntime that would enable more global - state and/or coordination among the transforms running in other ray actors. +* The creation of a custom `TransformRuntime` that would enable more global + state and/or coordination among the transforms running in other Ray actors. This will be covered in an advanced tutorial. The complete task involves the following: -* NOOPTransform - class that implements the specific transformation -* NOOPTableTransformConfiguration - class that provides configuration for the - NOOPTransform, specifically the command line arguments used to configure it. -* main() - simple creation and use of the TransformLauncher. +* `NOOPTransform` - class that implements the specific transformation +* `NOOPTableTransformConfiguration` - class that provides configuration for the + `NOOPTransform`, specifically the command line arguments used to configure it. +* `main()` - simple creation and use of the `TransformLauncher`. (Currently, the complete code for the noop transform used for this tutorial can be found in the [noop transform](../../transforms/universal/noop) directory. 
-Finally, we show to use the command line to run the transform in a local ray cluster +Finally, we show how to use the command line to run the transform in a local ray cluster. -## NOOPTransform +## `NOOPTransform` First, let's define the transform class. To do this we extend the base abstract/interface class -[AbstractTableTransform](../src/data_processing_ibm/transform/table_transform.py), -which requires definition of the following: +[`AbstractTableTransform`](../src/data_processing_ibm/transform/table_transform.py), +which requires defining the following: * an initializer (i.e. `init()`) that accepts a dictionary of configuration data. For this example, the configuration data will only be defined by command line arguments (defined below). -* the `transform()` method itself that takes an input table produces an output - table and any associated metadata for that table transformation. +* the `transform()` method itself that takes an input table and produces an output + table with any associated metadata for that table transformation. Other methods such as `flush()` need not be overridden/redefined for this simple example. @@ -75,7 +75,7 @@ with an amount of seconds to sleep/delay during the call to `transform()`. Configuration is provided by the framework in a dictionary provided to the initializer. Below we will cover how this `sleep` argument is made available to the initializer. -Note that in more complex transforms that might, for example, load a hugging face or other model, +Note that in more complex transforms that might, for example, load a Hugging Face or other model, or perform other deep initializations, these can be done in the initializer. Next we define the `transform()` method itself, which includes the addition of some @@ -90,22 +90,22 @@ almost trivial metadata. return [table], metadata ``` The single input to this method is the in-memory pyarrow table to be transformed. -The return of this function is a list of tables and optional metadata. 
In this -case of simple 1:1 table conversion the list will contain a single table, the input. +The return value of this method is a list of tables and optional metadata. In this +case, we are doing a simple 1:1 table conversion, so the list will contain a single table, the input table. The metadata is a free-form dictionary of keys with numeric values that will be aggregated by the framework and reported as aggregated job statistics metadata. If there is no metadata then simply return an empty dictionary. -## NOOPTransformConfiguration +## `NOOPTransformConfiguration` -Next we define the `NOOPTransformConfiguration` class and its initializer that define the following: +Next we define the `NOOPTransformConfiguration` class and its initializer that defines the following: * The short name for the transform -* The class implementing the transform - in our case NOOPTransform +* The class implementing the transform - in our case `NOOPTransform` * Command line argument support. -* The transform runtime class be used. We will use the `DefaultTableTransformRuntime` +* The transform runtime class to be used. We will use the `DefaultTableTransformRuntime`, which is sufficient for most 1:1 table transforms. Extensions to this class can be - used when more complex interactions among transform is required.* + used when more complex interactions among transforms are required. First we define the class and its initializer, @@ -119,18 +119,19 @@ class NOOPTransformConfiguration(DefaultTableTransformConfiguration): super().__init__(name=short_name, transform_class=NOOPTransform) self.params = {} ``` -The initializer extends the DefaultTableTransformConfiguration which provides simple -capture of our configuration data and enables picklability through the network. +The initializer extends the `DefaultTableTransformConfiguration` which provides simple +capture of our configuration data and enables the data to be pickled and sent over the network.
It also adds a `params` field that will be used below to hold the transform's configuration data (used in `NOOPTransform.init()` above). Next, we provide two methods that define and capture the command line configuration that -is specific to the `NOOPTransform`, in this case the number of seconds to sleep during transformation -and an example command line, `pwd`, option holding sensitive data that we don't want reported -in the job metadata produced by the ray orchestrator. -First we define the method establishes the command line arguments. -This method is given a global argument parser to which the `NOOPTransform` arguments are added. -It is good practice to include a common prefix to all transform-specific options (i.e. pii, lang, etc). +is specific to the `NOOPTransform`. In this case, the parameters are the number of seconds to sleep during transformation +and an example command line option, `pwd` ("password"), holding sensitive data that we don't want reported +in the job metadata produced by the Ray orchestrator. + +The first method establishes the command line arguments. +It is given a global argument parser to which the `NOOPTransform` arguments are added. +It is a good practice to include a common prefix to all transform-specific options (e.g. pii, lang, etc.). In our case we will use `noop_`. ```python @@ -150,8 +151,7 @@ In our case we will use `noop_`. help="A dummy password which should be filtered out of the metadata", ) ``` -Next we implement a method that is called after the framework has parsed the CLI args -and which allows us to capture the `NOOPTransform`-specific arguments, optionally validate them +The second method is called after the framework has parsed the CLI args, which allows us to capture the `NOOPTransform`-specific arguments, optionally validate them, and flag that the `pwd` parameter should not be reported in the metadata. ```python @@ -168,7 +168,7 @@ and flag that the `pwd` parameter should not be reported in the metadata.
return True ``` -## main() +## `main()` Next, we show how to launch the framework with the `NOOPTransform` using the framework's `TransformLauncher` class. @@ -178,21 +178,20 @@ if __name__ == "__main__": launcher = TransformLauncher(transform_runtime_config=NOOPTransformConfiguration()) launcher.launch() ``` -The launcher requires only an instance of DefaultTableTransformConfiguration +The launcher requires only an instance of `DefaultTableTransformConfiguration` (our `NOOPTransformConfiguration` class). A single method `launch()` is then invoked to run the transform in a Ray cluster. ## Running -Assuming the above `main()` is placed in `noop_main.py` we can run the transform on data -in COS as follows: +Assuming the above `main` code is placed in `noop_main.py` we can run the transform on data in S3 as follows: ```shell python noop_main.py --noop_sleep_msec 2 \ --run_locally True \ - --s3_cred "{'access_key': 'KEY', 'secret_key': 'SECRET', 'cos_url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud'}" \ - --s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}" + --s3_cred "{'access_key': 'KEY', 'secret_key': 'SECRET', 's3_url': 'https://s3.us-east.amazonaws.com'}" \ + --s3_config "{'input_folder': 'mybucket/path/input/', 'output_folder': 'mybucket/path/output/'}" ``` -This is a minimal set of options to run locally. +This is a minimal set of options to run locally against data in S3. See the [launcher options](launcher-options.md) for a complete list of transform-independent command line options. diff --git a/transforms/universal/noop/README.md b/transforms/universal/noop/README.md index ff31ae07b..a00db270a 100644 --- a/transforms/universal/noop/README.md +++ b/transforms/universal/noop/README.md @@ -5,9 +5,9 @@ for details on general project conventions, transform configuration, testing and IDE set up.
## Summary -This transforms serves as a template for transform writers as it does +This transform serves as a template for transform writers as it does not perform any transformations on the input (i.e., a no-operation transform). -As such it simply copies the input parquet files to the output directory. +As such, it simply copies the input parquet files to the output directory. It shows the basics of creating a simple 1:1 table transform. It also implements a single configuration value to show how configuration of the transform is implemented. From 608e4002ba8fa258f67c353c6e290a675c9e1a9e Mon Sep 17 00:00:00 2001 From: Dean Wampler Date: Fri, 10 May 2024 13:58:40 -0500 Subject: [PATCH 2/2] Added a callout to do additional setup steps. Signed-off-by: Dean Wampler --- README.md | 2 ++ data-processing-lib/doc/simplest-transform-tutorial.md | 3 +++ 2 files changed, 5 insertions(+) diff --git a/README.md b/README.md index fe6bb1279..26c92ea20 100644 --- a/README.md +++ b/README.md @@ -144,6 +144,8 @@ Refer to [Minio install instructions](data-processing-lib/doc/using_s3_transform There are various entry points that you can choose based on the use case. Below are a few demos to get you started. +> **Note:** You will need to run the setup commands in the [`data-processing-lib/README`](data-processing-lib/README.md) before running the following examples. 
+ ### Run a Single Transform on Local Ray Get started by running the "noop" transform that performs an identity operation by following the [tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) and associated diff --git a/data-processing-lib/doc/simplest-transform-tutorial.md b/data-processing-lib/doc/simplest-transform-tutorial.md index 539414746..41e89020f 100644 --- a/data-processing-lib/doc/simplest-transform-tutorial.md +++ b/data-processing-lib/doc/simplest-transform-tutorial.md @@ -31,6 +31,9 @@ tutorial can be found in the Finally, we show how to use the command line to run the transform in a local ray cluster. +> **Note:** You will need to run the setup commands in the [`../README`](..) before running the following examples. + + ## `NOOPTransform` First, let's define the transform class. To do this we extend