-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emit job and dataset runless metadata #1880
Conversation
90a1967
to
07e0915
Compare
07e0915
to
27c2f15
Compare
spec/OpenLineage.json
Outdated
"properties": { | ||
"dataset": { | ||
"allOf": [ | ||
{ "$ref": "#/$defs/Dataset" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of workaround ;)
So, current code generation for Java works in a way such that base classes are generated as interfaces
and only top level objects are classes. Dataset is only an interface while InputDataset
and OutputDataset
is a class. So, we cannot use Dataset
directly here, instead we create an object with allOf
containing Dataset
. This advantage of this is that the existing code of generated client does not get modified. The disadvantage is a method name newDatasetEventDataset
to create an instance of DatasetEventDataset
as in case of un-named objects, generated class name is a parent name concatenated with a property name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename Dataset to BaseDataset and then add a Dataset type that "extends" BaseDataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or create a named StaticDataset
type to use here as a ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try StaticDataset
approach.
29257e3
to
62a11ab
Compare
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. @@ Coverage Diff @@
## main #1880 +/- ##
=========================================
Coverage 81.92% 81.92%
Complexity 100 100
=========================================
Files 85 85
Lines 3586 3586
Branches 27 27
=========================================
Hits 2938 2938
Misses 617 617
Partials 31 31 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
62a11ab
to
f1a4af2
Compare
f1a4af2
to
9bd8d8a
Compare
4f8c177
to
5677917
Compare
@@ -20,4 +20,16 @@ public void emit(OpenLineage.RunEvent runEvent) { | |||
// if DEBUG loglevel is enabled, this will double-log even due to OpenLineageClient also logging | |||
log.info(OpenLineageClientUtils.toJson(runEvent)); | |||
} | |||
|
|||
@Override | |||
public void emit(OpenLineage.DatasetEvent datasetEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My overcomplicating mind suggests me stuff like
public <E extends OpenLineage.Event> void emit(E datasetEvent) {
...but I don't thing it's a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it work, we would need to introduce Event
class/interface within a code that is generated from spec. To avoid hardcoding, this should have been included within spec. So the json-schema would not only contain data definition, but also some hints on how to implement model classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code generation will create a base class if you use allOf like in facets:
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.json#L5-L8
but I'm not sure that it is necessary.
@@ -23,4 +23,12 @@ enum Type { | |||
Transport(@NonNull final Type type) {} | |||
|
|||
public abstract void emit(OpenLineage.RunEvent runEvent); | |||
|
|||
public void emit(OpenLineage.DatasetEvent datasetEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not abstract
? Do not want to enforce implementation in Transports made for previous versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users may have custom transport implementations. Making it abstract
would enforce implementation on their side and would break their code even though they may have no interest in static metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we update the transport api to add a lower level void emit(String jsonEvent)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the usage of emit(String)
? Our clients define models that make it clear the types allowed and enforces a structure. Allowing for string
type allows for flexibility, but introduces the likelihood of invalid events being emitted.
5677917
to
dda0d85
Compare
🚀 🚀 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few cosmetic comments (and a couple questions) but overall this looks great.
Please let me know which of those comments you would be considering adding.
List<ResolvedType> resolvedTypes = oneOfType | ||
.getTypes() | ||
.stream() | ||
.filter(type -> type instanceof RefType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would throw an exception when there's another type than the ones you expect (RefType). Maybe just map with a cast to throw if not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's kind of workaround for:
"oneOf": [
{ "$ref": "#/$defs/RunEvent" },
{ "$ref": "#/$defs/DatasetEvent" },
{ "$ref": "#/$defs/JobEvent" }
]
We don't want to generate single Java class to handle that (may be undoable), but we want to generate code for each of types which is RefType
. The method returns the first of the resolved types because it has to align with
public ResolvedType visit(...) {
contract and changing contract to a list or optional would require changes across a whole code generator.
@@ -20,4 +20,16 @@ public void emit(OpenLineage.RunEvent runEvent) { | |||
// if DEBUG loglevel is enabled, this will double-log even due to OpenLineageClient also logging | |||
log.info(OpenLineageClientUtils.toJson(runEvent)); | |||
} | |||
|
|||
@Override | |||
public void emit(OpenLineage.DatasetEvent datasetEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code generation will create a base class if you use allOf like in facets:
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.json#L5-L8
but I'm not sure that it is necessary.
@Override | ||
public void emit(@NonNull OpenLineage.DatasetEvent datasetEvent) { | ||
final String eventAsJson = OpenLineageClientUtils.toJson(datasetEvent); | ||
emit(eventAsJson); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit but I'd do a static import and just emit(toJson(datasetEvent))
for those 3 methods
@@ -23,4 +23,12 @@ enum Type { | |||
Transport(@NonNull final Type type) {} | |||
|
|||
public abstract void emit(OpenLineage.RunEvent runEvent); | |||
|
|||
public void emit(OpenLineage.DatasetEvent datasetEvent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we update the transport api to add a lower level void emit(String jsonEvent)
?
spec/OpenLineage.json
Outdated
"properties": { | ||
"dataset": { | ||
"allOf": [ | ||
{ "$ref": "#/$defs/Dataset" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename Dataset to BaseDataset and then add a Dataset type that "extends" BaseDataset?
spec/OpenLineage.json
Outdated
"properties": { | ||
"dataset": { | ||
"allOf": [ | ||
{ "$ref": "#/$defs/Dataset" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's best.
spec/OpenLineage.json
Outdated
"properties": { | ||
"dataset": { | ||
"allOf": [ | ||
{ "$ref": "#/$defs/Dataset" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or create a named StaticDataset
type to use here as a ref
spec/OpenLineage.json
Outdated
"description": "The JSON Pointer (https://tools.ietf.org/html/rfc6901) URL to the corresponding version of the schema definition for this RunEvent", | ||
"type": "string", | ||
"format": "uri", | ||
"example": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/DatasetEvent" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this is how we decide what type of event this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the approach has been tested within Marquez PR -> MarquezProject/marquez#2495
spec/OpenLineage.json
Outdated
"eventTime", | ||
"dataset", | ||
"producer", | ||
"schemaURL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe eventTime, producer and schemaURL are part of a base Event
type added with a AllOf like we do for Input/OutputDatasets and facets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cool! I will try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few cosmetic comments (and a couple questions) but overall this looks great.
Please let me know which of those comments you would be considering adding.
dda0d85
to
f2524a3
Compare
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
f2524a3
to
a8b5040
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, but otherwise LGTM 💯 🥇
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
a8b5040
to
186496a
Compare
@julienledem Superb comments!
|
Problem
Proposal: https://github.com/OpenLineage/OpenLineage/blob/main/proposals/1837/static_lineage.md
Closes: #1868
Solution
If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports
S3
andGCS
filesystem operations, tested with AWS EMR).One-line summary:
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project