Static Lineage Support in Marquez using OpenLineage #2624

wslulciuc · 2023-09-26T07:47:39Z

Background

OpenLineage 0.29.2 added support for static lineage events: DatasetEvent and JobEvent. Before 0.29.2, lineage events required a unique runID and were emitted only at the job run level. With static lineage events, the spec now allows run-less lineage events: events can be emitted outside the context of a job run, and the unique runID is optional.
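To make the distinction concrete, here is a minimal sketch of the three event shapes as plain Python dictionaries. The top-level keys follow my reading of the OpenLineage spec (required fields such as `producer` and `schemaURL` are omitted for brevity); the namespace, names, and the `classify_event` helper are hypothetical and are not how Marquez itself dispatches events.

```python
from typing import Any, Dict


def classify_event(event: Dict[str, Any]) -> str:
    """Classify a lineage event by its top-level keys (illustrative only)."""
    if "run" in event:
        return "RunEvent"      # run-level event: carries a runId
    if "dataset" in event:
        return "DatasetEvent"  # static event describing a single dataset
    if "job" in event:
        return "JobEvent"      # static event describing a job definition
    raise ValueError("unrecognized lineage event")


# Run-level event: the runId is present.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2023-09-26T07:47:39Z",
    "run": {"runId": "3f5e83fa-3480-44ff-99c5-ff943904e5e8"},
    "job": {"namespace": "example", "name": "daily_orders"},
    "inputs": [{"namespace": "example", "name": "public.raw_orders"}],
    "outputs": [{"namespace": "example", "name": "public.orders"}],
}

# Static events: no "run" object at all.
dataset_event = {
    "eventTime": "2023-09-26T07:47:39Z",
    "dataset": {"namespace": "example", "name": "public.orders"},
}
job_event = {
    "eventTime": "2023-09-26T07:47:39Z",
    "job": {"namespace": "example", "name": "daily_orders"},
    "inputs": [{"namespace": "example", "name": "public.raw_orders"}],
    "outputs": [{"namespace": "example", "name": "public.orders"}],
}
```

The check for `run` comes first because a RunEvent also contains a `job` object.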

What is Static Lineage?

In Marquez, static lineage would represent the latest (or current) lineage metadata for a given job. That is, each job node in the lineage graph will reference the current metadata for the job as well as the current metadata for its input and output dataset(s). Below, we outline the Job graph node definition for static lineage:

Note: We use latest (or current) metadata and latest version interchangeably for a job or dataset.

Job: Object

ID - The globally unique ID of the job (namespace + job name)
Inputs: Set[InputDataset] - A set of input datasets (latest version(s)) (=inEdges)
Outputs: Set[OutputDataset] - A set of output datasets (latest version(s)) (=outEdges)
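The node definition above could be sketched with dataclasses as follows. This is only an illustration: `DatasetRef`, the `version` field, and the `:` separator in the ID are assumptions and are not taken from the Marquez codebase.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DatasetRef:
    """A dataset pinned to its latest (current) version."""
    namespace: str
    name: str
    version: str  # hypothetical identifier of the latest version


@dataclass(frozen=True)
class JobNode:
    """A job node in the static lineage graph."""
    namespace: str
    name: str
    inputs: FrozenSet[DatasetRef] = frozenset()   # inEdges
    outputs: FrozenSet[DatasetRef] = frozenset()  # outEdges

    @property
    def id(self) -> str:
        # Globally unique ID: namespace + job name (separator is an assumption).
        return f"{self.namespace}:{self.name}"


node = JobNode(
    namespace="example",
    name="daily_orders",
    inputs=frozenset({DatasetRef("example", "public.raw_orders", "v3")}),
    outputs=frozenset({DatasetRef("example", "public.orders", "v1")}),
)
```

Using sets for the edges mirrors the definition: a dataset appears at most once as an input and at most once as an output of a given job.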

Now, if we look at our LineageAPI, you'll notice that the contract does not change! That is, the query used for the lineage graph API call always depended on the latest lineage metadata for a given job. But, given that the runID was required by OpenLineage, we needed to query the dataset_versions, job_versions, and runs tables to return the current lineage graph.

How does Marquez use Static Lineage events emitted by OpenLineage?

With runID not required for static lineage events, Marquez will handle DatasetEvent and JobEvent differently. Below, we outline the processing logic for each static lineage event.

`DatasetEvent`

Within a RunEvent, datasets were either inputs or outputs to a job run. A lookup for the current version of the input datasets (assumed to be present) would ensure Marquez associated the input versions with the runID; output datasets are handled slightly differently, but more or less the same in terms of processing logic. In the case of output datasets, Marquez applies the following versioning logic:

If the output dataset has not been registered with Marquez, the dataset will be created.
A version for the output dataset is created when:
- The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).
- A job run completes or fails.
The dataset version is associated with the runID.

For static lineage events, the dataset is no longer required to be part of a RunEvent and will not be associated with a runID. Therefore, the versioning logic is simplified to:

If the output dataset has not been registered with Marquez, the dataset will be created.
A version for the output dataset is created when:
- The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).

The dataset can then be used by any job or run as its inputs or outputs.
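The simplified versioning rule for a DatasetEvent can be sketched as below. This is an in-memory illustration under stated assumptions, not Marquez's implementation: `DatasetRegistry`, `on_dataset_event`, and the representation of a schema as ordered (field name, field type) pairs are all hypothetical.

```python
from typing import Dict, List, Tuple

# A schema as ordered (field name, field type) pairs (assumed representation).
Schema = Tuple[Tuple[str, str], ...]


class DatasetRegistry:
    """Tracks dataset versions without any reference to a runID."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Schema]] = {}

    def on_dataset_event(self, name: str, schema: Schema) -> int:
        """Create the dataset if needed and return its current version index."""
        # Register the dataset on first sight.
        versions = self._versions.setdefault(name, [])
        # A new version is created only when the schema changes;
        # for a newly created dataset this yields v0.
        if not versions or versions[-1] != schema:
            versions.append(schema)
        return len(versions) - 1


registry = DatasetRegistry()
v_first = registry.on_dataset_event("public.orders", (("id", "INTEGER"),))
v_same = registry.on_dataset_event("public.orders", (("id", "INTEGER"),))
v_changed = registry.on_dataset_event(
    "public.orders", (("id", "INTEGER"), ("total", "DECIMAL"))
)
```

Re-sending an event with an unchanged schema leaves the version untouched, which is what makes repeated static events safe to process.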

`JobEvent`

A JobEvent will contain metadata about the job definition: for example, the source code location of the job, as well as its input and output datasets. In the case of job metadata collected outside the context of a run, Marquez will apply the following logic:

If the job has not been registered with Marquez, the job will be created.
If the job's input datasets have not been registered with Marquez, the datasets will be created.
If the job's output datasets have not been registered with Marquez, the datasets will be created.
A version for the input/output datasets is created when:
- The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).

Now, this logic will be applied on each JobEvent; therefore, any existing metadata for the job will be overwritten. In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.
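A minimal sketch of the JobEvent handling described above, including the overwrite behavior, might look like this. The class `StaticLineageStore`, its method names, and the stored fields are hypothetical and serve only to illustrate the steps.

```python
from typing import Dict, List, Tuple

# A schema as ordered (field name, field type) pairs (assumed representation).
Schema = Tuple[Tuple[str, str], ...]


class StaticLineageStore:
    """In-memory stand-in for job and dataset metadata."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}
        self.dataset_versions: Dict[str, List[Schema]] = {}

    def _upsert_dataset(self, name: str, schema: Schema) -> None:
        # Create the dataset if missing; add a version only on schema change.
        versions = self.dataset_versions.setdefault(name, [])
        if not versions or versions[-1] != schema:
            versions.append(schema)

    def on_job_event(
        self,
        job: str,
        location: str,
        inputs: Dict[str, Schema],
        outputs: Dict[str, Schema],
    ) -> None:
        for name, schema in list(inputs.items()) + list(outputs.items()):
            self._upsert_dataset(name, schema)
        # Each JobEvent replaces whatever metadata was stored for the job.
        self.jobs[job] = {
            "location": location,
            "inputs": set(inputs),
            "outputs": set(outputs),
        }


orders_schema: Schema = (("id", "INTEGER"),)
store = StaticLineageStore()
store.on_job_event("daily_orders", "repo@v1", {"raw": orders_schema}, {"clean": orders_schema})
store.on_job_event("daily_orders", "repo@v2", {"raw": orders_schema}, {"report": orders_schema})
```

After the second event the job points only at `report` as its output; the `clean` dataset still exists, but the job no longer references it.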

What data model changes are needed in Marquez to support Static Lineage?

To signify that a dataset version has been created outside the context of a job run, the run_uuid column will be made nullable (as will any other relationship where run_uuid was required). We will also need to modify the job_versions_io_mapping table to include job_uuid:

CREATE TABLE job_versions_io_mapping (
    job_version_uuid UUID REFERENCES job_versions(uuid) ON DELETE CASCADE,
    dataset_uuid     UUID REFERENCES datasets(uuid) ON DELETE CASCADE,
    io_type          VARCHAR(64) NOT NULL,
    job_uuid         UUID REFERENCES jobs(uuid) ON DELETE CASCADE
);

With the job_uuid, we can now query for lineage without the need for a job_version or run_uuid. Again, in the proposal following this issue, we will dive deeper into how job_versions_io_mapping will be used to serve static lineage queries.
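As a rough illustration of querying lineage by job_uuid alone, the sketch below loads a simplified copy of the table into an in-memory SQLite database. SQLite stands in for Marquez's actual database, the foreign keys are dropped, and the `INPUT`/`OUTPUT` values, the sample UUIDs, and both helper functions are assumptions made for the example.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job_versions_io_mapping (
        job_version_uuid TEXT,          -- left NULL: no job version needed
        dataset_uuid     TEXT,
        io_type          TEXT NOT NULL,
        job_uuid         TEXT
    );
    INSERT INTO job_versions_io_mapping VALUES
        (NULL, 'ds-raw',   'INPUT',  'job-a'),
        (NULL, 'ds-clean', 'OUTPUT', 'job-a'),
        (NULL, 'ds-clean', 'INPUT',  'job-b');
    """
)


def job_io(job_uuid: str) -> List[Tuple[str, str]]:
    """Return (io_type, dataset_uuid) pairs for a job, keyed only by job_uuid."""
    return conn.execute(
        "SELECT io_type, dataset_uuid FROM job_versions_io_mapping "
        "WHERE job_uuid = ? ORDER BY io_type, dataset_uuid",
        (job_uuid,),
    ).fetchall()


def downstream_jobs(job_uuid: str) -> List[str]:
    """Return jobs that read a dataset written by the given job."""
    rows = conn.execute(
        "SELECT DISTINCT i.job_uuid "
        "FROM job_versions_io_mapping o "
        "JOIN job_versions_io_mapping i ON i.dataset_uuid = o.dataset_uuid "
        "WHERE o.job_uuid = ? AND o.io_type = 'OUTPUT' AND i.io_type = 'INPUT' "
        "ORDER BY i.job_uuid",
        (job_uuid,),
    ).fetchall()
    return [row[0] for row in rows]
```

Neither query touches job_versions or runs, which is the point of adding job_uuid to the mapping table.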


davidjgoss · 2023-11-06T09:08:27Z

> In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.

Just to make sure I understand - does this refer to static vs runtime lineage being effectively two separate graphs so they can evolve in parallel without overwriting one another?

wslulciuc added the proposal label Sep 26, 2023

pawel-big-lebowski self-assigned this Oct 5, 2023

pawel-big-lebowski mentioned this issue Oct 6, 2023

Runless events - consume dataset event #2641

Merged

7 tasks

davidjgoss mentioned this issue Oct 28, 2023

Support for a cumulative lineage graph at both levels #2670

Open

wslulciuc closed this as completed in #2641 Nov 6, 2023
