Emit job and dataset runless metadata #1880

Merged (2 commits) on Jun 7, 2023

Conversation

Collaborator

@pawel-big-lebowski pawel-big-lebowski commented May 25, 2023

Problem

Proposal: https://github.com/OpenLineage/OpenLineage/blob/main/proposals/1837/static_lineage.md

Closes: #1868

Solution

  • Add proposed event types to spec,
  • Include client test,
  • Include support in python client,
  • Verify event encoding in Marquez (to be done)

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@pawel-big-lebowski pawel-big-lebowski added the area:spec Specifications and standards for the project label May 25, 2023
@boring-cyborg boring-cyborg bot added the area:client/java openlineage-java label May 25, 2023
@boring-cyborg boring-cyborg bot added the area:client/python openlineage-python label May 25, 2023
"properties": {
"dataset": {
"allOf": [
{ "$ref": "#/$defs/Dataset" }
Collaborator Author

This is kind of a workaround ;)

The current code generation for Java emits base classes as interfaces; only top-level objects become classes. Dataset is only an interface, while InputDataset and OutputDataset are classes. So we cannot use Dataset directly here; instead we create an object with an allOf containing Dataset. The advantage is that the existing generated client code does not get modified. The disadvantage is the method name newDatasetEventDataset for creating an instance of DatasetEventDataset: for unnamed objects, the generated class name is the parent name concatenated with the property name.

Member

maybe rename Dataset to BaseDataset and then add a Dataset type that "extends" BaseDataset?

Member

I'm not sure what's best.

Member

Or create a named StaticDataset type to use here as a ref

Collaborator Author

I will try StaticDataset approach.
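
For readers following the naming discussion, here is a minimal sketch of the difference a named type makes. It is not the real generated client: `ModelFactory` and the method name `newStaticDataset` are assumptions inferred from the naming rule described above (type name for named objects, parent name plus property name for unnamed ones).

```java
// Base schema types are generated as interfaces (per the thread above).
interface Dataset {
    String getNamespace();
    String getName();
}

// A named schema type becomes a concrete class with a predictable name.
final class StaticDataset implements Dataset {
    private final String namespace;
    private final String name;

    StaticDataset(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    @Override public String getNamespace() { return namespace; }
    @Override public String getName() { return name; }
}

// Hypothetical factory: the method name is derived from the type name
// (newStaticDataset) rather than from parent + property (newDatasetEventDataset).
final class ModelFactory {
    StaticDataset newStaticDataset(String namespace, String name) {
        return new StaticDataset(namespace, name);
    }
}
```

Callers can still program against the `Dataset` interface while constructing the concrete `StaticDataset`.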

@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/runless-events branch 3 times, most recently from 29257e3 to 62a11ab Compare May 26, 2023 07:17
codecov-commenter commented May 26, 2023

Codecov Report

Merging #1880 (f1a4af2) into main (b928e2e) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #1880   +/-   ##
=========================================
  Coverage     81.92%   81.92%           
  Complexity      100      100           
=========================================
  Files            85       85           
  Lines          3586     3586           
  Branches         27       27           
=========================================
  Hits           2938     2938           
  Misses          617      617           
  Partials         31       31           

@boring-cyborg boring-cyborg bot added the area:documentation Improvements or additions to documentation label May 29, 2023
@pawel-big-lebowski pawel-big-lebowski force-pushed the spec/runless-events branch 6 times, most recently from 4f8c177 to 5677917 Compare May 30, 2023 07:34
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review May 30, 2023 07:56
@@ -20,4 +20,16 @@ public void emit(OpenLineage.RunEvent runEvent) {
// if DEBUG loglevel is enabled, this will double-log even due to OpenLineageClient also logging
log.info(OpenLineageClientUtils.toJson(runEvent));
}

@Override
public void emit(OpenLineage.DatasetEvent datasetEvent) {
Member

My overcomplicating mind suggests something like

  public <E extends OpenLineage.Event> void emit(E datasetEvent) {

...but I don't think it's a good idea.

Collaborator Author

To make it work, we would need to introduce an Event class/interface within the code that is generated from the spec. To avoid hardcoding, it would have to be included in the spec itself. So the JSON schema would contain not only data definitions, but also some hints on how to implement model classes.

Member

The code generation will create a base class if you use allOf like in facets:
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.json#L5-L8
but I'm not sure that it is necessary.

@@ -23,4 +23,12 @@ enum Type {
Transport(@NonNull final Type type) {}

public abstract void emit(OpenLineage.RunEvent runEvent);

public void emit(OpenLineage.DatasetEvent datasetEvent) {
Member

Why not abstract? Is it to avoid forcing an implementation on Transports written for previous versions?

Collaborator Author

Users may have custom transport implementations. Making it abstract would enforce implementation on their side and would break their code even though they may have no interest in static metadata.
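
To illustrate the compatibility argument, a minimal sketch follows. `RunEvent`, `DatasetEvent`, and `LegacyTransport` are simplified stand-ins rather than the generated OpenLineage classes, and the no-op default body is an assumption; the diff above only shows that the new overload is not abstract.

```java
// Hypothetical stand-ins for the generated event classes.
final class RunEvent {}
final class DatasetEvent {}

abstract class Transport {
    // Existing contract: every transport must handle run events.
    public abstract void emit(RunEvent runEvent);

    // New overload with a concrete body, so subclasses written before
    // DatasetEvent existed still compile. Doing nothing here is an assumption.
    public void emit(DatasetEvent datasetEvent) {
        // intentionally empty by default
    }
}

// A custom transport written against the old API: it overrides only emit(RunEvent).
final class LegacyTransport extends Transport {
    int runEvents = 0;

    @Override
    public void emit(RunEvent runEvent) {
        runEvents++;
    }
}
```

Had `emit(DatasetEvent)` been declared abstract, `LegacyTransport` would no longer compile until its author added an override.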

Member

should we update the transport api to add a lower level void emit(String jsonEvent) ?

Member

What would be the usage of emit(String)? Our clients define models that make the allowed types clear and enforce a structure. Accepting a string type allows for flexibility, but increases the likelihood of invalid events being emitted.

@mobuchowski
Member

🚀 🚀 🚀

Member

@julienledem julienledem left a comment

I added a few cosmetic comments (and a couple questions) but overall this looks great.
Please let me know which of those comments you would consider addressing.

List<ResolvedType> resolvedTypes = oneOfType
.getTypes()
.stream()
.filter(type -> type instanceof RefType)
Member

I would throw an exception when there's another type than the ones you expect (RefType). Maybe just map with a cast to throw if not?

Collaborator Author

That's kind of a workaround for:

"oneOf": [
    { "$ref": "#/$defs/RunEvent" },
    { "$ref": "#/$defs/DatasetEvent" },
    { "$ref": "#/$defs/JobEvent" }
  ]

We don't want to generate a single Java class to handle that (it may not even be doable), but we do want to generate code for each of the types that is a RefType. The method returns the first of the resolved types because it has to align with the

public ResolvedType visit(...) {

contract, and changing the contract to a list or optional would require changes across the whole code generator.
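
The two options discussed in this thread (silently filtering versus failing on an unexpected member) can be contrasted in a small sketch. The `ResolvedType` and `RefType` names come from the diff, but the definitions below, along with `PrimitiveType` and `OneOfResolution`, are hypothetical stand-ins and not the code generator's real classes.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the code generator's type model.
interface ResolvedType {}

final class RefType implements ResolvedType {
    final String ref;
    RefType(String ref) { this.ref = ref; }
}

final class PrimitiveType implements ResolvedType {}

final class OneOfResolution {
    // Lenient: skip any oneOf member that is not a RefType (as in the diff).
    static List<RefType> lenient(List<ResolvedType> types) {
        return types.stream()
            .filter(type -> type instanceof RefType)
            .map(type -> (RefType) type)
            .collect(Collectors.toList());
    }

    // Strict: cast every member, so an unexpected type fails fast
    // with a ClassCastException (the reviewer's suggestion).
    static List<RefType> strict(List<ResolvedType> types) {
        return types.stream()
            .map(RefType.class::cast)
            .collect(Collectors.toList());
    }

    // A visit-style contract returns a single type, so only the first is handed back.
    static ResolvedType first(List<ResolvedType> types) {
        return lenient(types).get(0);
    }
}
```

The strict variant surfaces schema changes early, while the lenient one keeps generation going if a non-reference member ever appears in the `oneOf`.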

@Override
public void emit(@NonNull OpenLineage.DatasetEvent datasetEvent) {
final String eventAsJson = OpenLineageClientUtils.toJson(datasetEvent);
emit(eventAsJson);
Member

nit but I'd do a static import and just emit(toJson(datasetEvent)) for those 3 methods
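
A rough sketch of the delegation pattern being discussed: each typed overload serializes its event and hands the result to a single string-level hook. The event classes, the `Json` helper (standing in for `OpenLineageClientUtils.toJson`), and `InMemoryTransport` are all hypothetical, and placing the typed overloads in a shared base class is an assumption about the final design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event stand-ins; the real classes are generated from the spec.
final class RunEvent {
    final String runId;
    RunEvent(String runId) { this.runId = runId; }
}

final class DatasetEvent {
    final String datasetName;
    DatasetEvent(String datasetName) { this.datasetName = datasetName; }
}

// Stand-in for a JSON mapper; real code would not build JSON by hand.
final class Json {
    static String toJson(RunEvent event) { return "{\"run\":\"" + event.runId + "\"}"; }
    static String toJson(DatasetEvent event) { return "{\"dataset\":\"" + event.datasetName + "\"}"; }
}

abstract class StringTransport {
    // Single low-level hook that concrete transports implement.
    protected abstract void emit(String eventAsJson);

    // Typed overloads only serialize and delegate.
    public void emit(RunEvent event) { emit(Json.toJson(event)); }
    public void emit(DatasetEvent event) { emit(Json.toJson(event)); }
}

// Example transport that records what would have been sent.
final class InMemoryTransport extends StringTransport {
    final List<String> sent = new ArrayList<>();

    @Override
    protected void emit(String eventAsJson) { sent.add(eventAsJson); }
}
```

Keeping the string hook protected means callers still go through the typed overloads, which addresses the concern about emitting invalid events.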

"description": "The JSON Pointer (https://tools.ietf.org/html/rfc6901) URL to the corresponding version of the schema definition for this RunEvent",
"type": "string",
"format": "uri",
"example": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/DatasetEvent"
Member

I'm assuming this is how we decide what type of event this is.

Collaborator Author

yes, the approach has been tested within Marquez PR -> MarquezProject/marquez#2495
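
As an illustration of how a consumer might branch on the schema URL, here is a minimal sketch. It is not Marquez's actual implementation; `EventKind` and `EventKinds` are hypothetical names, and the logic simply inspects the last segment of the JSON Pointer fragment.

```java
// Hypothetical classification of incoming events.
enum EventKind { RUN, DATASET, JOB, UNKNOWN }

final class EventKinds {
    // Uses the last path segment of the schema URL, e.g.
    // ".../OpenLineage.json#/$defs/DatasetEvent" -> "DatasetEvent".
    static EventKind fromSchemaUrl(String schemaUrl) {
        if (schemaUrl == null) {
            return EventKind.UNKNOWN;
        }
        String typeName = schemaUrl.substring(schemaUrl.lastIndexOf('/') + 1);
        switch (typeName) {
            case "RunEvent":
                return EventKind.RUN;
            case "DatasetEvent":
                return EventKind.DATASET;
            case "JobEvent":
                return EventKind.JOB;
            default:
                return EventKind.UNKNOWN;
        }
    }
}
```

A receiver could then pick the matching model class before deserializing the payload.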

Comment on lines 94 to 97
"eventTime",
"dataset",
"producer",
"schemaURL"
Member

Maybe eventTime, producer and schemaURL could be part of a base Event type, added with an allOf like we do for Input/OutputDatasets and facets.

Collaborator Author

This is cool! I will try it out.
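
On the Java side, a shared base type could look roughly like the sketch below. The three accessors mirror the required fields quoted above; the interface shape, the simplified `DatasetEvent`, and the `EventLog` helper are assumptions for illustration, not the generated client.

```java
import java.net.URI;
import java.time.ZonedDateTime;

// Fields common to every event type, exposed through one base interface.
interface BaseEvent {
    ZonedDateTime getEventTime();
    URI getProducer();
    URI getSchemaURL();
}

// Simplified stand-in: the dataset is reduced to a name for brevity.
final class DatasetEvent implements BaseEvent {
    private final ZonedDateTime eventTime;
    private final URI producer;
    private final URI schemaURL;
    private final String datasetName;

    DatasetEvent(ZonedDateTime eventTime, URI producer, URI schemaURL, String datasetName) {
        this.eventTime = eventTime;
        this.producer = producer;
        this.schemaURL = schemaURL;
        this.datasetName = datasetName;
    }

    @Override public ZonedDateTime getEventTime() { return eventTime; }
    @Override public URI getProducer() { return producer; }
    @Override public URI getSchemaURL() { return schemaURL; }
    String getDatasetName() { return datasetName; }
}

// Code that only needs the shared fields can accept any event type.
final class EventLog {
    static String describe(BaseEvent event) {
        return event.getEventTime() + " " + event.getProducer() + " " + event.getSchemaURL();
    }
}
```

With a base type in place, helpers such as `describe` no longer need one overload per event class.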

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Member

@wslulciuc wslulciuc left a comment

Minor comments, but otherwise LGTM 💯 🥇

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski
Collaborator Author

@julienledem Superb comments!

  • I've created StaticDataset
  • I've introduced BaseEvent with schemaURL, eventTime and producer fields
  • I've reduced the number of emit methods by adding emit(String)

@pawel-big-lebowski pawel-big-lebowski merged commit cc3a71e into main Jun 7, 2023
@pawel-big-lebowski pawel-big-lebowski deleted the spec/runless-events branch June 7, 2023 06:00
Labels
area:client/java openlineage-java area:client/python openlineage-python area:documentation Improvements or additions to documentation area:integration/spark area:spec Specifications and standards for the project
Development

Successfully merging this pull request may close these issues.

Add runless event to spec
6 participants