add lineage doc #331

Merged
merged 1 commit into from
Sep 30, 2024
37 changes: 37 additions & 0 deletions DOCUMENTATION.md
@@ -1834,6 +1834,43 @@ mim transform import simple-job --file=transform1.js
NOTE: The `transform_entities` function must be exported when using the above command. However, when generating and inserting a base64-encoded script, the function MUST NOT be exported. We aim to fix this.
## Lineage
Jobs are the primary way of moving data around in the data hub, so together they form a lineage graph of datasets.
In this graph, a job's source is where a dataset's lineage starts and its sink is where it ends. Datasets read by transform queries are additional inputs, and datasets written by transform transactions are additional outputs.
The lineage graph can be queried using the lineage API.
### Lineage API
Get only the lineage nodes directly connected to a single dataset:
```
GET /lineage/{dataset}
```
Get the complete lineage graph:
```
GET /lineage
```
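As a rough illustration of calling the two endpoints above, the following sketch builds the request URL and fetches the edge list. The helper names `lineageUrl` and `fetchLineage`, the base URL, and the absence of authentication are assumptions made for this example, not part of the data hub; a secured deployment will most likely need an `Authorization` header.

```javascript
// Hypothetical helper (not part of the data hub): builds the lineage URL
// for a single dataset, or for the complete graph when no dataset is given.
function lineageUrl(baseUrl, dataset) {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return dataset === undefined
    ? `${root}/lineage`
    : `${root}/lineage/${encodeURIComponent(dataset)}`;
}

// Hypothetical helper: fetches the list of lineage edges.
// Authentication is omitted here and is an assumption of this sketch.
// `fetchImpl` defaults to the global fetch (available in Node.js 18+ and browsers).
async function fetchLineage(baseUrl, dataset, fetchImpl = fetch) {
  const response = await fetchImpl(lineageUrl(baseUrl, dataset));
  if (!response.ok) {
    throw new Error(`lineage request failed with status ${response.status}`);
  }
  return response.json();
}
```

Assuming a data hub reachable at `http://localhost:8080` (an example address), `fetchLineage("http://localhost:8080", "sink-dataset")` would resolve to the edges touching `sink-dataset`, and `fetchLineage("http://localhost:8080")` to the complete graph.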
The response is a list of graph edges. Callers can use it, for example, to build a Graphviz visualization of the lineage.
```javascript
[
{ "From": "source-dataset", "To": "sink-dataset", "Type": "copy" },
{ "From": "source-dataset", "To": "sink-dataset", "Type": "transform" },
{ "From": "other-dataset", "To": "sink-dataset", "Type": "transform-hop" }
]
```
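One possible way to turn an edge list like the one above into Graphviz input is sketched below. The function name `toDot` and the mapping from edge type to line style are illustrative choices for this example, not something the data hub provides.

```javascript
// Line style per edge type; this mapping is an arbitrary, illustrative choice.
const EDGE_STYLE = { copy: "solid", transform: "bold", "transform-hop": "dashed" };

// Hypothetical helper: converts lineage edges into a Graphviz DOT digraph.
function toDot(edges) {
  const lines = edges.map((edge) => {
    const style = EDGE_STYLE[edge.Type] || "solid"; // fall back for unknown types
    // JSON.stringify yields double-quoted strings, which DOT accepts as identifiers.
    const from = JSON.stringify(edge.From);
    const to = JSON.stringify(edge.To);
    return `  ${from} -> ${to} [label=${JSON.stringify(edge.Type)}, style=${style}];`;
  });
  return ["digraph lineage {", ...lines, "}"].join("\n");
}
```

The resulting string can be written to a file and rendered with the Graphviz `dot` tool, for example `dot -Tsvg lineage.dot -o lineage.svg`.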
The `Type` of an edge can be one of the following:
- `copy` - the sink dataset is a copy of the source dataset
- `transform` - the sink dataset is the result of a transform of the source dataset
- `transform-hop` - the sink dataset is enriched with data from the `From` dataset in a transform
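Because every edge type points from an input dataset to the dataset it feeds, the edge list can also answer questions such as "which datasets does this one depend on?". The sketch below is one way to do that; `upstreamOf` is a hypothetical helper, and it assumes the edges come from the complete graph (`GET /lineage`), since the per-dataset endpoint only returns directly connected nodes.

```javascript
// Hypothetical helper: returns every dataset that feeds into `target`,
// directly or indirectly, by walking the edges backwards (breadth-first).
function upstreamOf(edges, target) {
  const seen = new Set();
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const edge of edges) {
      if (edge.To === current && !seen.has(edge.From)) {
        seen.add(edge.From);
        queue.push(edge.From);
      }
    }
  }
  seen.delete(target); // a cycle could lead back to the target; leave it out
  return [...seen].sort();
}
```

All three edge types are treated alike here; filtering on `edge.Type` before the walk would restrict the result, for example to `copy` and `transform` edges only.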
## Configuration
The Datahub can be configured in several ways, but it should work for testing purposes without any setup needed. However, once you are ready to deploy into a production environment, you need to configure security as a minimum.