Commit

Merge pull request #109 from deanwampler/dev
Doc fixes
shahrokhDaijavad authored May 13, 2024
2 parents 489bec8 + c49f5ce commit af3591e
Showing 2 changed files with 38 additions and 31 deletions.
65 changes: 36 additions & 29 deletions data-processing-lib/doc/simplest-transform-tutorial.md
@@ -15,36 +15,38 @@ one table to another. That said, we will show the following:
the operation of our _noop_ transform.

We will **not** be showing the following:

* The creation of a custom TransformRuntime that would enable more global
state and/or coordination among the transforms running in other ray actors.
* The creation of a custom `TransformRuntime` that would enable more global
state and/or coordination among the transforms running in other Ray actors.
This will be covered in an advanced tutorial.

The complete task involves the following:

* NOOPTransform - class that implements the specific transformation
* NOOPTableTransformConfiguration - class that provides configuration for the
NOOPTransform, specifically the command line arguments used to configure it.
* main() - simple creation and use of the TransformLauncher.
* `NOOPTransform` - class that implements the specific transformation
* `NOOPTableTransformConfiguration` - class that provides configuration for the
`NOOPTransform`, specifically the command line arguments used to configure it.
* `main()` - simple creation and use of the `TransformLauncher`.

(Currently, the complete code for the noop transform used for this
tutorial can be found in the
[noop transform](../../transforms/universal/noop) directory.

Finally, we show to use the command line to run the transform in a local ray cluster
Finally, we show how to use the command line to run the transform in a local ray cluster.

> **Note:** You will need to run the setup commands in the [`../README`](..) before running the following examples.
## NOOPTransform

## `NOOPTransform`

First, let's define the transform class. To do this we extend
the base abstract/interface class
[AbstractTableTransform](../src/data_processing_ibm/transform/table_transform.py),
[`AbstractTableTransform`](../src/data_processing_ibm/transform/table_transform.py),
which requires definition of the following:

* an initializer (i.e. `init()`) that accepts a dictionary of configuration
data. For this example, the configuration data will only be defined by
command line arguments (defined below).
* the `transform()` method itself that takes an input table produces an output
table and any associated metadata for that table transformation.
* the `transform()` method itself that takes an input table and produces an output
table with any associated metadata for that table transformation.

Other methods such as `flush()` need not be overridden/redefined for this simple example.

@@ -78,7 +80,7 @@ with an amount of seconds to sleep/delay during the call to `transform()`.
Configuration is provided by the framework in a dictionary provided to the initializer.
Below we will cover how this `sleep` argument is made available to the initializer.

Note that in more complex transforms that might, for example, load a hugging face or other model,
Note that in more complex transforms that might, for example, load a Hugging Face or other model,
or perform other deep initializations, these can be done in the initializer.

Next we define the `transform()` method itself, which includes the addition of some
@@ -93,19 +95,18 @@ almost trivial metadata.
return [table], metadata
```
The single input to this method is the in-memory pyarrow table to be transformed.
The return of this function is a list of tables and optional metadata. In this
case of simple 1:1 table conversion the list will contain a single table, the input.
The return value of this method is a list of tables and optional metadata. In this
case, we are doing a simple 1:1 table conversion, so the list will contain a single table, the input table.
The metadata is a free-form dictionary of keys with numeric values that will be aggregated
by the framework and reported as aggregated job statistics metadata.
If there is no metadata then simply return an empty dictionary.
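
To make the return contract described above concrete, here is a minimal sketch that uses only the standard library. `SketchNoopTransform`, the `Table` alias (a list of row dictionaries standing in for a pyarrow table), the metadata keys, and the `aggregate()` helper are all hypothetical. The sketch also assumes that "aggregated" means summing numeric values per key, which the tutorial does not spell out.

```python
import time
from collections import Counter
from typing import Any

# Stand-in for an in-memory pyarrow table (assumption, keeps the sketch stdlib-only).
Table = list[dict[str, Any]]


class SketchNoopTransform:
    """Hypothetical stand-in for illustration; not the library's NOOPTransform."""

    def __init__(self, config: dict[str, Any]):
        # "sleep" is the configuration value the tutorial mentions; the default of 0 is an assumption.
        self.sleep = config.get("sleep", 0)

    def transform(self, table: Table) -> tuple[list[Table], dict[str, int]]:
        if self.sleep > 0:
            time.sleep(self.sleep)
        # 1:1 conversion: return the input table unchanged, wrapped in a list,
        # together with a dictionary of numeric metadata (keys are illustrative).
        metadata = {"nfiles": 1, "nrows": len(table)}
        return [table], metadata


def aggregate(stats: list[dict[str, int]]) -> dict[str, int]:
    """Sum per-call metadata key by key (assumed aggregation rule)."""
    total: Counter = Counter()
    for entry in stats:
        total.update(entry)  # Counter.update adds values for matching keys
    return dict(total)


transform = SketchNoopTransform({"sleep": 0})
tables_a, meta_a = transform.transform([{"id": 1}, {"id": 2}])
tables_b, meta_b = transform.transform([{"id": 3}])
job_stats = aggregate([meta_a, meta_b])
print(job_stats)
```

A transform with nothing to report would return `[table], {}`, which leaves the aggregated totals unchanged.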

## NOOPTransformConfiguration
## `NOOPTransformConfiguration`

Next we define the `NOOPTransformConfiguration` and
classes and there initializer that define the following:
Next we define the `NOOPTransformConfiguration` class and its initializer that defines the following:

* The short name for the transform
* The class implementing the transform - in our case NOOPTransform
* The class implementing the transform - in our case `NOOPTransform`
* Command line argument support.

We also define the `NOOPRayTransformationConfiguration` so we can run the transform
@@ -128,18 +129,20 @@ class NOOPTransformConfiguration(TransformConfiguration):
remove_from_metadata=[pwd_key],
)
```
The initializer extends the TransformConfiguration which provides simple
capture of our configuration data and enables picklability through the network.

The initializer extends the `TransformConfiguration` that provides simple
capture of our configuration data and enables the ability to pickle through the network.
It also adds a `params` field that will be used below to hold the transform's
configuration data (used in `NOOPTransform.init()` above).

Next, we provide two methods that define and capture the command line configuration that
is specific to the `NOOPTransform`, in this case the number of seconds to sleep during transformation
and an example command line, `pwd`, option holding sensitive data that we don't want reported
in the job metadata produced by the ray orchestrator.
First we define the method establishes the command line arguments.
This method is given a global argument parser to which the `NOOPTransform` arguments are added.
It is good practice to include a common prefix to all transform-specific options (i.e. pii, lang, etc).
is specific to the `NOOPTransform`, in this case the parameters are the number of seconds to sleep during transformation
and an example command line parameter, `pwd` ("password"), option holding sensitive data that we don't want reported
in the job metadata produced by the Ray orchestrator.

The first method establishes the command line arguments.
It is given a global argument parser to which the `NOOPTransform` arguments are added.
It is a good practice to include a common prefix to all transform-specific options (i.e. pii, lang, etc).
In our case we will use `noop_`.

```python
@@ -162,6 +165,7 @@ In our case we will use `noop_`.
```
Next we implement a method that is called after the CLI args are parsed (usually by one
of the runtimes) and which allows us to capture the `NOOPTransform`-specific arguments.
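
The two steps described above (adding prefixed options to a shared parser, then capturing only the transform's own values after parsing) can be sketched with plain `argparse`. This is an illustration under assumptions, not the library's code: the function names, the option names `--noop_sleep_sec` and `--noop_pwd`, their defaults, and the `reportable()` helper that mimics the effect of `remove_from_metadata` are all hypothetical.

```python
import argparse
from typing import Any

CLI_PREFIX = "noop_"   # common prefix for all transform-specific options
SLEEP_KEY = "sleep_sec"  # hypothetical key name
PWD_KEY = "pwd"          # example option holding sensitive data


def add_input_params(parser: argparse.ArgumentParser) -> None:
    """Add the transform-specific options to a parser owned by the caller."""
    parser.add_argument(f"--{CLI_PREFIX}{SLEEP_KEY}", type=int, default=1,
                        help="seconds to sleep during each transform call")
    parser.add_argument(f"--{CLI_PREFIX}{PWD_KEY}", type=str, default="",
                        help="example sensitive value, excluded from reports")


def capture_params(args: argparse.Namespace) -> dict[str, Any]:
    """Keep only options carrying the prefix, with the prefix stripped."""
    return {name[len(CLI_PREFIX):]: value
            for name, value in vars(args).items()
            if name.startswith(CLI_PREFIX)}


def reportable(params: dict[str, Any], remove_from_metadata: list[str]) -> dict[str, Any]:
    """Drop keys that should not appear in job metadata."""
    return {k: v for k, v in params.items() if k not in remove_from_metadata}


parser = argparse.ArgumentParser()
parser.add_argument("--run_name", default="demo")  # stands in for a transform-independent option
add_input_params(parser)

args = parser.parse_args(["--noop_sleep_sec", "2", "--noop_pwd", "secret"])
params = capture_params(args)
public_params = reportable(params, [PWD_KEY])
print(params)
print(public_params)
```

The prefix is what lets the capture step separate this transform's options from everything else on the shared parser, which is one reason to keep it consistent across all of a transform's options.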


```python

@@ -179,13 +183,16 @@ To run the transform on a set of input data, we use one of the runtimes, each de
### Python Runtime
To run in the python runtime, we need to create the instance of `PythonTransformLauncher`
using the `NOOPTransformConfiguration`, and launch it as follows:

```python
if __name__ == "__main__":
launcher = PythonTransformLauncher(transform_config=NOOPTransformConfiguration())
launcher.launch()
```

To run this on some test data, we'll use data in the repo for the noop transform
## Running

Assuming the above `main` code is placed in `noop_main.py` we can run the transform on some test data. We'll use data in the repo for the noop transform
and create a temporary directory to hold the output:
```shell
export DPK_REPOROOT=...
@@ -194,7 +201,7 @@ export NOOP_INPUT=$DPK_REPOROOT/transforms/universal/noop/test-data/input
To run
```shell
python noop_main.py --noop_sleep_msec 2 \
--data_local_config "{'input_folder': '"$NOOP_INPUT"', 'output_folder': '/tmp/noop-output'}"
--data_local_config "{'input_folder': '"$NOOP_INPUT"', 'output_folder': '/tmp/noop-output'}"
```
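
Note that the `--data_local_config` value above is written as a Python dictionary literal with single-quoted strings, so it is not valid JSON. The sketch below shows one way such a string could be parsed, using `ast.literal_eval` from the standard library. Whether the launcher parses the option this way is an assumption, and the input path is a hypothetical placeholder.

```python
import argparse
import ast
import json

parser = argparse.ArgumentParser()
# ast.literal_eval safely evaluates Python literals (dicts, lists, strings, numbers).
parser.add_argument("--data_local_config", type=ast.literal_eval, default=None)

raw = "{'input_folder': '/data/noop/input', 'output_folder': '/tmp/noop-output'}"
args = parser.parse_args(["--data_local_config", raw])
local_config = args.data_local_config
print(local_config["input_folder"], local_config["output_folder"])

# The same string is rejected by a JSON parser because of the single quotes.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False
print(parsed_as_json)
```

This is also why the shell command wraps the whole value in double quotes: the single quotes inside must reach the program unchanged.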
See the [python launcher options](python-launcher-options) for a complete list of
transform-independent command line options.
4 changes: 2 additions & 2 deletions transforms/universal/noop/README.md
@@ -5,9 +5,9 @@ for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This transforms serves as a template for transform writers as it does
This transform serves as a template for transform writers as it does
not perform any transformations on the input (i.e., a no-operation transform).
As such it simply copies the input parquet files to the output directory.
As such, it simply copies the input parquet files to the output directory.
It shows the basics of creating a simple 1:1 table transform.
It also implements a single configuration value to show how configuration
of the transform is implemented.
