Background
OpenLineage 0.29.2 added support for static lineage events: DatasetEvent and JobEvent. Before 0.29.2, lineage events required a unique runID and were emitted only at the job run level. With static lineage events, the spec now allows run-less lineage events: events can be emitted outside the context of a job run, and the unique runID is optional.
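As a rough illustration of the distinction, the three event kinds can be told apart by which top-level object a payload carries. This is a minimal sketch, assuming payloads arrive as plain dictionaries with a `run`, `job`, or `dataset` key; the function name `classify_event` is hypothetical and is not part of OpenLineage or Marquez.

```python
def classify_event(event: dict) -> str:
    """Return the event kind implied by the payload's top-level keys (assumed shape)."""
    if "run" in event:
        # Run-level event: carries a run identifier alongside the job.
        return "RunEvent"
    if "job" in event:
        # Static event describing a job definition, with no run attached.
        return "JobEvent"
    if "dataset" in event:
        # Static event describing a dataset, with no run or job attached.
        return "DatasetEvent"
    raise ValueError("unrecognized lineage event")
```

The check for `run` comes first because a run-level event also carries a `job` object.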
What is Static Lineage?
In Marquez, static lineage would represent the latest (or current) lineage metadata for a given job. That is, each job node in the lineage graph will reference the current metadata for the job as well as the current metadata for its input and output dataset(s). Below, we outline the Job graph node definition for static lineage:
Note: We use latest (or current) metadata and latest version interchangeably for a job or dataset.
Job: Object
ID - The globally unique ID of the job (namespace + job name)
Inputs: Set[InputDataset] - A set of input datasets (latest version(s)) (=inEdges)
Outputs: Set[OutputDataset] - A set of output datasets (latest version(s)) (=outEdges)
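The node definition above could be sketched as a pair of data classes. This is an illustration only: the class names `DatasetRef` and `JobNode`, the `version` field, and the `namespace:name` ID format are assumptions, not Marquez's actual model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    """Reference to the latest version of a dataset (hypothetical shape)."""
    namespace: str
    name: str
    version: str


@dataclass
class JobNode:
    """Job graph node: the job plus its latest input and output datasets."""
    namespace: str
    name: str
    inputs: set[DatasetRef] = field(default_factory=set)   # inEdges
    outputs: set[DatasetRef] = field(default_factory=set)  # outEdges

    @property
    def id(self) -> str:
        # Assumed ID format: namespace and job name joined by a colon.
        return f"{self.namespace}:{self.name}"
```

`DatasetRef` is frozen so that instances are hashable and can be stored in the `inputs` and `outputs` sets.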
Now, if we look at our LineageAPI, you'll notice that the contract does not change! That is, the query used for the lineage graph API call has always depended on the latest lineage metadata for a given job. But, given that the runID was required by OpenLineage, we needed to query the dataset_versions, job_versions, and runs tables to return the current lineage graph.
How does Marquez use Static Lineage events emitted by OpenLineage?
With runID not required for static lineage events, Marquez will handle DatasetEvent and JobEvent differently. Below, we outline the processing logic for each static lineage event.
DatasetEvent
Within a RunEvent, datasets were either inputs or outputs to a job run. A lookup for the current version of the input datasets (assumed to be present) would ensure Marquez associated the input versions with the runID; output datasets are handled slightly differently, but the processing logic is more or less the same. In the case of output datasets, Marquez applies the following versioning logic:
If the output dataset has not been registered with Marquez, the dataset will be created.
A version for the output dataset is created when:
The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).
A job run completes or fails.
The dataset version is associated with the runID.
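The run-level rules above can be sketched as follows. This is a simplified in-memory model, not Marquez's implementation: the `registry` dictionary, the tuple-of-field-names schema, and the function name `record_run_output` are all assumptions made for illustration. The `COMPLETE` and `FAIL` strings stand in for "a job run completes or fails".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetVersion:
    schema: tuple
    run_id: Optional[str]


@dataclass
class Dataset:
    name: str
    versions: list = field(default_factory=list)


# Run states that trigger a new output dataset version (assumed names).
TERMINAL_STATES = {"COMPLETE", "FAIL"}


def record_run_output(registry, name, schema, run_id, event_type):
    """Apply the run-level versioning rules to one output dataset."""
    dataset = registry.get(name)
    if dataset is None:
        # Rule 1: register the dataset if Marquez has not seen it yet.
        dataset = registry[name] = Dataset(name)
    latest = dataset.versions[-1] if dataset.versions else None
    # A brand-new dataset counts as a schema change (its first version is v0).
    schema_changed = latest is None or latest.schema != schema
    if schema_changed or event_type in TERMINAL_STATES:
        # Rules 2 and 3: create a version and associate it with the run.
        dataset.versions.append(DatasetVersion(schema, run_id))
    return dataset
```

In this sketch, a first event for an unseen dataset yields v0, a later `COMPLETE` event with the same schema yields a second version, and an intermediate event with an unchanged schema yields none.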
For static lineage events, the dataset is no longer required to be part of a RunEvent and will not be associated with a runID. Therefore, the versioning logic is simplified to:
If the output dataset has not been registered with Marquez, the dataset will be created.
A version for the output dataset is created when:
The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).
The dataset can then be used by any job or run as its inputs or outputs.
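The simplified static logic reduces to a single schema comparison. Here is a minimal sketch under the same kind of assumptions as before: an in-memory `registry` mapping dataset names to version lists, and a hypothetical function name `record_static_dataset`. Versions carry `run_id` set to `None` to reflect that no run is involved.

```python
def record_static_dataset(registry, name, schema):
    """Apply the static (run-less) versioning rules to one dataset."""
    # Register the dataset if it has not been seen before.
    versions = registry.setdefault(name, [])
    # Create a version only when the schema changes; the first one is v0.
    if not versions or versions[-1]["schema"] != schema:
        versions.append({"schema": schema, "run_id": None})
    return versions
```

Repeating the same event leaves the version list unchanged, so a static DatasetEvent is idempotent in this model.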
JobEvent
A JobEvent will contain metadata about the job definition, for example the source code location of the job, as well as its input and output datasets. In the case of job metadata collected outside the context of a run, Marquez will apply the following logic:
If the job has not been registered with Marquez, the job will be created.
If the job's input datasets have not been registered with Marquez, the datasets will be created.
If the job's output datasets have not been registered with Marquez, the datasets will be created.
A version for an input/output dataset is created when:
The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).
Now, this logic will be applied on each JobEvent; therefore, any existing metadata for the job will be overwritten. In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.
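The JobEvent handling described above, including the overwrite behavior, can be sketched as below. The event shape (a `job` object plus `inputs` and `outputs` lists whose items carry an optional `schema` facet) is an assumption loosely modeled on OpenLineage payloads, and the `state` dictionary and all function names are hypothetical.

```python
def schema_of(dataset):
    """Extract a comparable schema from a dataset's (assumed) schema facet."""
    fields = dataset.get("facets", {}).get("schema", {}).get("fields", [])
    return tuple((f["name"], f.get("type")) for f in fields)


def upsert_dataset(datasets, dataset):
    """Register the dataset if needed and version it on schema change."""
    key = (dataset["namespace"], dataset["name"])
    versions = datasets.setdefault(key, [])
    schema = schema_of(dataset)
    if not versions or versions[-1] != schema:
        versions.append(schema)  # the first entry plays the role of v0
    return key


def apply_job_event(state, event):
    """Apply one JobEvent; existing metadata for the job is overwritten."""
    job = event["job"]
    job_key = (job["namespace"], job["name"])
    inputs = [upsert_dataset(state["datasets"], d) for d in event.get("inputs", [])]
    outputs = [upsert_dataset(state["datasets"], d) for d in event.get("outputs", [])]
    # Plain assignment: the latest event replaces whatever was stored before.
    state["jobs"][job_key] = {
        "facets": job.get("facets", {}),
        "inputs": inputs,
        "outputs": outputs,
    }
    return state["jobs"][job_key]
```

Because the job entry is replaced wholesale, a second JobEvent that lists different outputs drops the earlier ones from the job, while the datasets themselves stay registered.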
What data model changes are needed in Marquez to support Static Lineage?
To signify that a dataset version has been created outside the context of a job run, the run_uuid column will be made nullable (here and in any other relationship where run_uuid was required). We will also need to modify the job_versions_io_mapping table to include job_uuid:
CREATE TABLE job_versions_io_mapping (
  job_version_uuid UUID REFERENCES job_versions(uuid) ON DELETE CASCADE,
  dataset_uuid UUID REFERENCES datasets(uuid) ON DELETE CASCADE,
  io_type VARCHAR(64) NOT NULL,
  job_uuid UUID REFERENCES jobs(uuid) ON DELETE CASCADE
);
With the job_uuid, we can now query for lineage without the need for a job_version or run_uuid. Again, in the proposal following this issue, we will dive deeper into how job_versions_io_mapping will be used to serve static lineage queries.
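To make the idea concrete, here is a small sketch of a lookup keyed on job_uuid alone, using an in-memory SQLite database in place of Marquez's actual database. The foreign keys are omitted, UUIDs are stored as text, the `INPUT` and `OUTPUT` values for io_type are assumed, and leaving job_version_uuid as NULL for static rows is also an assumption. The helper `job_edges` is hypothetical and is not the query Marquez will use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified copy of the table: no foreign keys, TEXT instead of UUID.
conn.execute(
    """
    CREATE TABLE job_versions_io_mapping (
      job_version_uuid TEXT,
      dataset_uuid TEXT,
      io_type TEXT NOT NULL,
      job_uuid TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO job_versions_io_mapping VALUES (?, ?, ?, ?)",
    [
        (None, "ds-a", "INPUT", "job-1"),   # no job version or run involved
        (None, "ds-b", "OUTPUT", "job-1"),
        (None, "ds-b", "INPUT", "job-2"),
    ],
)


def job_edges(conn, job_uuid):
    """Return a job's input and output dataset UUIDs using only job_uuid."""
    rows = conn.execute(
        "SELECT io_type, dataset_uuid FROM job_versions_io_mapping "
        "WHERE job_uuid = ? ORDER BY io_type, dataset_uuid",
        (job_uuid,),
    ).fetchall()
    edges = {"INPUT": [], "OUTPUT": []}
    for io_type, dataset_uuid in rows:
        edges[io_type].append(dataset_uuid)
    return edges
```

The query filters on job_uuid only, so it never touches job_version_uuid or any run table.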
In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.
Just to make sure I understand - does this refer to static vs runtime lineage being effectively two separate graphs so they can evolve in parallel without overwriting one another?