Static Lineage Support in Marquez using OpenLineage #2624

Closed
wslulciuc opened this issue Sep 26, 2023 · 1 comment · Fixed by #2641

@wslulciuc
Member

wslulciuc commented Sep 26, 2023

Background

OpenLineage 0.29.2 added support for static lineage events: DatasetEvent and JobEvent. Before 0.29.2, lineage events required a unique runID and were emitted only at the job run level. With static lineage events, the spec now allows run-less lineage events: events can be emitted outside the context of a job run, and the unique runID is optional.
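To make the distinction concrete, here is a minimal Python sketch that tells the three event types apart by their top-level keys. The payloads are trimmed-down, hypothetical examples (the namespaces, names, producer URL, and runId are made up, and required spec fields such as schemaURL are omitted); classify_event is an illustrative helper, not part of OpenLineage or Marquez.

```python
from typing import Any, Dict


def classify_event(event: Dict[str, Any]) -> str:
    """Guess the lineage event type from the top-level keys that are present."""
    if "run" in event:
        return "RunEvent"      # run-level: carries a runId
    if "job" in event:
        return "JobEvent"      # static: job definition, no run
    if "dataset" in event:
        return "DatasetEvent"  # static: dataset definition, no run
    raise ValueError("unrecognized lineage event")


# Hypothetical, trimmed-down payloads for illustration only.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2023-09-26T00:00:00Z",
    "producer": "https://example.com/my-producer",
    "run": {"runId": "3f5e83fa-3480-44ff-99c5-ff943904e5e8"},
    "job": {"namespace": "my-namespace", "name": "my-job"},
}
job_event = {
    "eventTime": "2023-09-26T00:00:00Z",
    "producer": "https://example.com/my-producer",
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [{"namespace": "my-namespace", "name": "table_a"}],
    "outputs": [{"namespace": "my-namespace", "name": "table_b"}],
}
dataset_event = {
    "eventTime": "2023-09-26T00:00:00Z",
    "producer": "https://example.com/my-producer",
    "dataset": {"namespace": "my-namespace", "name": "table_a"},
}

kinds = [classify_event(e) for e in (run_event, job_event, dataset_event)]
```

The check for "run" comes first because a RunEvent also carries a "job" key.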

What is Static Lineage?

In Marquez, static lineage represents the latest (or current) lineage metadata for a given job. That is, each job node in the lineage graph references the current metadata for the job as well as the current metadata for its input and output dataset(s). Below, we outline the Job graph node definition for static lineage:

Note: We use latest (or current) metadata and latest version interchangeably for a job or dataset.

Job: Object

  • ID - The globally unique ID of the job (namespace + job name)
  • Inputs: Set[InputDataset] - A set of input datasets (latest version(s)) (=inEdges)
  • Outputs: Set[OutputDataset] - A set of output datasets (latest version(s)) (=outEdges)
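The node definition above can be sketched as Python dataclasses. This is only one possible reading: the class and field names (JobId, DatasetRef, JobNode, version) are assumptions chosen for illustration and do not mirror Marquez's actual Java model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class JobId:
    """Globally unique job ID: namespace + job name."""
    namespace: str
    name: str


@dataclass(frozen=True)
class DatasetRef:
    """Reference to the latest (current) version of a dataset."""
    namespace: str
    name: str
    version: str


@dataclass(frozen=True)
class JobNode:
    id: JobId
    inputs: FrozenSet[DatasetRef] = frozenset()   # inEdges
    outputs: FrozenSet[DatasetRef] = frozenset()  # outEdges

    def in_edges(self) -> Set[Tuple[DatasetRef, JobId]]:
        # One edge per input dataset, pointing at the job.
        return {(d, self.id) for d in self.inputs}

    def out_edges(self) -> Set[Tuple[JobId, DatasetRef]]:
        # One edge per output dataset, pointing away from the job.
        return {(self.id, d) for d in self.outputs}


node = JobNode(
    id=JobId("my-namespace", "my-job"),
    inputs=frozenset({DatasetRef("my-namespace", "table_a", "v3")}),
    outputs=frozenset({DatasetRef("my-namespace", "table_b", "v1")}),
)
```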

Now, if we look at our Lineage API, you'll notice that the contract does not change! That is, the query behind the lineage graph API call has always depended on the latest lineage metadata for a given job. But because OpenLineage required the runID, we needed to query the dataset_versions, job_versions, and runs tables to return the current lineage graph.

How does Marquez use Static Lineage events emitted by OpenLineage?

With runID not required for static lineage events, Marquez will handle DatasetEvent and JobEvent differently. Below, we outline the processing logic for each static lineage event.

DatasetEvent

Within a RunEvent, datasets are either inputs or outputs of a job run. For input datasets (assumed to be present), Marquez looks up the current version and associates it with the runID. Output datasets are handled slightly differently, though the processing logic is largely the same. For output datasets, Marquez applies the following versioning logic:

  1. If the output dataset has not been registered with Marquez, the dataset will be created.
  2. A version for the output dataset is created when:
    • The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).
    • A job run completes or fails.
  3. The dataset version is associated with the runID.

For static lineage events, the dataset is no longer required to be part of a RunEvent and will not be associated with a runID. Therefore, the versioning logic is simplified to:

  1. If the output dataset has not been registered with Marquez, the dataset will be created.
  2. A version for the output dataset is created when:
    • The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).

The dataset can then be used by any job or run as one of its inputs or outputs.
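The simplified versioning rule for static dataset events can be sketched as follows. This is an in-memory illustration under stated assumptions, not Marquez's implementation: DatasetStore, DatasetVersion, and the tuple-based Schema type are hypothetical, and the run-level rule that also versions a dataset when a run completes or fails is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed schema representation: ((field name, field type), ...)
Schema = Tuple[Tuple[str, str], ...]


@dataclass
class DatasetVersion:
    number: int                   # 0 is the initial schema (v0)
    schema: Schema
    run_id: Optional[str] = None  # None: created outside the context of a run


@dataclass
class DatasetStore:
    versions: Dict[str, List[DatasetVersion]] = field(default_factory=dict)

    def upsert(self, name: str, schema: Schema,
               run_id: Optional[str] = None) -> DatasetVersion:
        # Step 1: register the dataset if it is not yet known.
        history = self.versions.setdefault(name, [])
        # Step 2: create a version only when the schema changes
        # (a newly registered dataset always gets v0).
        if not history or history[-1].schema != schema:
            history.append(DatasetVersion(len(history), schema, run_id))
        return history[-1]


store = DatasetStore()
v0 = store.upsert("table_a", (("id", "INTEGER"),))
same = store.upsert("table_a", (("id", "INTEGER"),))                     # no new version
v1 = store.upsert("table_a", (("id", "INTEGER"), ("name", "VARCHAR")))  # schema changed
```

Keeping run_id optional on the version mirrors the idea that a version created from a static event has no run to point to.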

JobEvent

A JobEvent contains metadata about the job definition: for example, the source code location of the job, as well as its input and output datasets. For job metadata collected outside the context of a run, Marquez will apply the following logic:

  1. If the job has not been registered with Marquez, the job will be created.
  2. If the job's input datasets have not been registered with Marquez, the datasets will be created.
  3. If the job's output datasets have not been registered with Marquez, the datasets will be created.
  4. A version for each input/output dataset is created when:
    • The dataset schema changes (for newly created datasets, this will be v0 representing the initial schema).

Now, this logic will be applied on each JobEvent; therefore, any existing metadata for the job will be overwritten. In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.
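A rough sketch of the JobEvent steps, including the overwrite behavior, might look like the following. The state layout and the helper names (apply_job_event, _upsert_dataset, _schema_of) are assumptions made for illustration. The schema is read from the dataset's schema facet, and the payloads are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (namespace, name)


def _schema_of(ds: Dict[str, Any]) -> Tuple[Tuple[str, Any], ...]:
    # Read the fields of the dataset's schema facet, if present.
    fields = ds.get("facets", {}).get("schema", {}).get("fields", [])
    return tuple((f["name"], f.get("type")) for f in fields)


def _upsert_dataset(state: Dict[str, Any], ds: Dict[str, Any]) -> Key:
    key = (ds["namespace"], ds["name"])
    history: List[Any] = state["datasets"].setdefault(key, [])  # steps 2-3: create if new
    schema = _schema_of(ds)
    if not history or history[-1] != schema:                    # step 4: version on schema change
        history.append(schema)
    return key


def apply_job_event(state: Dict[str, Any], event: Dict[str, Any]) -> Dict[str, Any]:
    job = event["job"]
    job_key = (job["namespace"], job["name"])
    inputs = [_upsert_dataset(state, d) for d in event.get("inputs", [])]
    outputs = [_upsert_dataset(state, d) for d in event.get("outputs", [])]
    # Step 1 plus overwrite: the latest JobEvent replaces any existing job metadata.
    state["jobs"][job_key] = {"inputs": inputs, "outputs": outputs}
    return state


state: Dict[str, Any] = {"jobs": {}, "datasets": {}}
first = {
    "job": {"namespace": "ns", "name": "my-job"},
    "inputs": [{"namespace": "ns", "name": "table_a"}],
    "outputs": [{"namespace": "ns", "name": "table_b"}],
}
second = {
    "job": {"namespace": "ns", "name": "my-job"},
    "inputs": [{"namespace": "ns", "name": "table_c"}],
    "outputs": [{"namespace": "ns", "name": "table_b"}],
}
apply_job_event(state, first)
apply_job_event(state, second)
```

After the second event, the job's inputs list only table_c, while table_a remains registered as a dataset.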

What data model changes are needed in Marquez to support Static Lineage?

To signify that a dataset version has been created outside the context of a job run, the run_uuid column will be made nullable (as will any other relationship where run_uuid was required). We will also need to modify the job_versions_io_mapping table to include job_uuid:

CREATE TABLE job_versions_io_mapping (
    job_version_uuid UUID REFERENCES job_versions(uuid) ON DELETE CASCADE,
    dataset_uuid     UUID REFERENCES datasets(uuid) ON DELETE CASCADE,
    io_type          VARCHAR(64) NOT NULL,
    job_uuid         UUID REFERENCES jobs(uuid) ON DELETE CASCADE
);

With the job_uuid, we can now query for lineage without the need for a job_version or run_uuid. Again, in the proposal following this issue, we will dive deeper into how job_versions_io_mapping will be used to serve static lineage queries.
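As a rough illustration of such a query, the sketch below uses an in-memory SQLite database with heavily simplified tables. Everything beyond the job_versions_io_mapping columns shown above is an assumption: the jobs and datasets tables are reduced to uuid and name, the 'INPUT'/'OUTPUT' values for io_type are assumed, and leaving job_version_uuid NULL for static rows is a guess rather than a confirmed design. It is not the query Marquez uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (uuid TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE datasets (uuid TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE job_versions_io_mapping (
    job_version_uuid TEXT,  -- assumed NULL when no job version exists
    dataset_uuid     TEXT REFERENCES datasets(uuid) ON DELETE CASCADE,
    io_type          VARCHAR(64) NOT NULL,
    job_uuid         TEXT REFERENCES jobs(uuid) ON DELETE CASCADE
);
INSERT INTO jobs VALUES ('j1', 'my-job');
INSERT INTO datasets VALUES ('d1', 'table_a'), ('d2', 'table_b');
INSERT INTO job_versions_io_mapping VALUES
    (NULL, 'd1', 'INPUT', 'j1'),
    (NULL, 'd2', 'OUTPUT', 'j1');
""")


def job_io(conn: sqlite3.Connection, job_name: str):
    """Return (io_type, dataset name) pairs for a job, joining on job_uuid only."""
    return conn.execute(
        """
        SELECT m.io_type, d.name
        FROM job_versions_io_mapping AS m
        JOIN jobs AS j ON j.uuid = m.job_uuid
        JOIN datasets AS d ON d.uuid = m.dataset_uuid
        WHERE j.name = ?
        ORDER BY m.io_type, d.name
        """,
        (job_name,),
    ).fetchall()


lineage = job_io(conn, "my-job")
```

The join never touches job_versions or runs, which is the point of adding job_uuid to the mapping table.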

@davidjgoss
Contributor

In the proposal following this issue, we will dive deeper into how static lineage events will be used in conjunction with run-level events to fully capture the evolution of lineage metadata for a given job run.

Just to make sure I understand - does this refer to static vs runtime lineage being effectively two separate graphs so they can evolve in parallel without overwriting one another?
