feat(docs): tutorial for writing a custom transformer #2959

Merged
merged 10 commits into from
Jul 28, 2021
3 changes: 3 additions & 0 deletions docs-website/sidebars.js
@@ -22,6 +22,8 @@ function list_ids_in_directory(directory) {
return ids;
}

// note: to handle errors where you don't want a markdown file in the sidebar, add it as a comment.
// this will fix errors like `Error: File not accounted for in sidebar: ...`
module.exports = {
// users
// architects
@@ -73,6 +75,7 @@ module.exports = {
"docs/docker/development",
"metadata-ingestion/adding-source",
"metadata-ingestion/s3-ingestion",
//"metadata-ingestion/examples/transforms/README"
"metadata-ingestion/transformers",
//"docs/what/graph",
//"docs/what/search-index",
4 changes: 2 additions & 2 deletions metadata-ingestion/README.md
@@ -950,7 +950,7 @@ sink:

If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can use a transformer to write your own module and integrate it with DataHub.

Check out the [transformers guide](./transformers.md) for more info!.
Check out the [transformers guide](./transformers.md) for more info!

## Using as a library

@@ -1018,4 +1018,4 @@ In order to use this example, you must first configure the Datahub hook. Like in

## Developing

See the [developing guide](./developing.md), [adding a source guide](./adding-source.md) and the [using transformers](./transformers.md) guides.
See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./transformers.md).
5 changes: 5 additions & 0 deletions metadata-ingestion/examples/transforms/README.md
@@ -0,0 +1,5 @@
# Custom transformer script

This script sets up a transformer that reads in a list of owner URNs from a JSON file specified via `owners_json` and appends these owners to every MCE.

See the transformers tutorial (https://datahubproject.io/docs/metadata-ingestion/transformers) for how this module is built and run.
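
As a rough illustration of the behaviour described above, the sketch below loads owner URNs from a JSON file and appends them to a record. It uses a plain dictionary in place of DataHub's MCE classes, and `load_owner_urns`, `append_owners`, the `owners` key and the sample URNs are hypothetical names and values chosen for this example, not taken from the actual script.

```python
import json
import tempfile
from typing import Dict, List


def load_owner_urns(path: str) -> List[str]:
    """Read a JSON file assumed to contain a flat list of owner URN strings."""
    with open(path) as f:
        return json.load(f)


def append_owners(record: Dict, owner_urns: List[str]) -> Dict:
    """Append owner URNs to a dict-shaped record.

    Hypothetical stand-in: the real script operates on MCE objects.
    """
    record.setdefault("owners", []).extend(owner_urns)
    return record


# Write a small owners file, then apply it to one example record.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(["urn:li:corpuser:jdoe", "urn:li:corpGroup:groupname"], tmp)

record = append_owners(
    {"owners": ["urn:li:corpuser:existing"]},
    load_owner_urns(tmp.name),
)
```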
24 changes: 20 additions & 4 deletions metadata-ingestion/transformers.md
@@ -25,11 +25,14 @@ transformers:
- "urn:li:tag:Legacy"
```

:::tip

If you'd like to add more complex logic for assigning tags, you can use the more generic `add_dataset_tags` transformer, which calls a user-provided function to determine the tags for each dataset.

:::
```yaml
transformers:
- type: "add_dataset_tags"
config:
get_tags_to_add: "<your_module>.<your_function>"
```
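
The function referenced by `get_tags_to_add` could look something like the sketch below. The argument and return types DataHub expects are not shown in this section, so the example assumes the function receives an object exposing a `urn` attribute and returns a list of tag URN strings. `FakeDataset`, `tags_for_dataset` and the `urn:li:tag:Hive` value are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeDataset:
    """Hypothetical stand-in for the dataset object the transformer passes in."""
    urn: str


def tags_for_dataset(dataset: FakeDataset) -> List[str]:
    """Choose tags with conditional logic based on the dataset URN (assumed input shape)."""
    tags = []
    if "legacy" in dataset.urn.lower():
        tags.append("urn:li:tag:Legacy")
    if "urn:li:dataPlatform:hive" in dataset.urn:
        tags.append("urn:li:tag:Hive")
    return tags


example = FakeDataset(urn="urn:li:dataset:(urn:li:dataPlatform:hive,legacy_orders,PROD)")
print(tags_for_dataset(example))
```

In a recipe, a function like this would be referenced by its dotted path in place of `<your_module>.<your_function>`.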

### Setting ownership

@@ -47,6 +50,15 @@ transformers:
- "urn:li:corpGroup:groupname"
```

If you'd like to add more complex logic for assigning ownership, you can use the more generic `add_dataset_ownership` transformer, which calls a user-provided function to determine the ownership of each dataset.

```yaml
transformers:
- type: "add_dataset_ownership"
config:
get_owners_to_add: "<your_module>.<your_function>"
```
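
One possible shape for the logic behind `get_owners_to_add` is a lookup keyed on the dataset name. This is only a sketch: it assumes a dataset name string can be derived from whatever argument the transformer passes, and `OWNERS_BY_PREFIX`, `owners_for_dataset` and the prefixes and URNs in the table are illustrative, not part of DataHub.

```python
from typing import Dict, List

# Hypothetical mapping from a dataset-name prefix to the owner URNs responsible for it.
OWNERS_BY_PREFIX: Dict[str, List[str]] = {
    "finance.": ["urn:li:corpGroup:groupname"],
    "marketing.": ["urn:li:corpuser:username"],
}


def owners_for_dataset(dataset_name: str) -> List[str]:
    """Return owner URNs for the first matching prefix, or an empty list."""
    for prefix, owners in OWNERS_BY_PREFIX.items():
        if dataset_name.startswith(prefix):
            # Return a copy so callers cannot mutate the shared table.
            return list(owners)
    return []


print(owners_for_dataset("finance.invoices"))
```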

## Writing a custom transformer from scratch

In the above couple of examples, we use classes that have already been implemented in the ingestion framework. However, it’s common for more advanced cases to pop up where custom code is required, for instance if you'd like to utilize conditional logic or rewrite properties. In such cases, we can add our own module and define the arguments it takes as a custom transformer.
@@ -182,8 +194,10 @@ def transform_one(self, mce: MetadataChangeEventClass) -> MetadataChangeEventCla

### Installing the package

Now that we've defined the transformer, we need to make it visible to DataHub. This can be done by making sure the Python file is available as a local import.
Now that we've defined the transformer, we need to make it visible to DataHub. The easiest way to do this is to just place it in the same directory as your recipe, in which case the module name is the same as the file – in this case, `custom_transform_example`.
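
To make the "module name is the same as the file" point concrete: a file named `custom_transform_example.py` can be imported as `custom_transform_example` as long as its directory is on Python's import path, and a dotted string can then be resolved to an object inside it. The sketch below shows one generic way to do that with `importlib`. `resolve_dotted_path` is a hypothetical helper, and this is not necessarily how DataHub performs the lookup internally.

```python
import importlib
from typing import Any


def resolve_dotted_path(path: str) -> Any:
    """Import the module part of `module.attr` and return the named attribute."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected '<module>.<name>', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A standard-library module is used here so the example runs anywhere.
loads = resolve_dotted_path("json.loads")
print(loads("[1, 2]"))
```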

<details>
<summary>Advanced: installing as a package</summary>
Alternatively, create a `setup.py` in the same directory as our transform script to make it visible globally. After installing this package (e.g. with `python setup.py install` or `pip install -e .`), our module will be installed and importable as `custom_transform_example`.

```python
@@ -198,6 +212,8 @@ setup(
)
```

</details>

### Running the transform

```yaml