-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Job parent hierarchy api changes #1992
Conversation
500105b
to
1f0e2fe
Compare
Codecov Report
@@ Coverage Diff @@
## main #1992 +/- ##
=========================================
Coverage 78.62% 78.62%
Complexity 1003 1003
=========================================
Files 197 197
Lines 5459 5459
Branches 424 424
=========================================
Hits 4292 4292
Misses 723 723
Partials 444 444 📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
UUID symlinkTargetId, | ||
PGobject inputs) { | ||
UUID jobUuid = | ||
upsertJobNoParent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we want the upsertJobNoParent()
call to the JobRow
object similar to other upsert calls? This would keep contracts the same across DAOs but also avoid the subsequent findJobByUuidAsRow()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, do you mean that the upsertJobNoParent
query would go back to RETURNING *
instead of RETURNING uuid
? If that's what you mean, I made this change so that the subsequent findJobByUuidAsRow
call queries the jobs_view
- returning the FQN rather than the simple name.
p -> { | ||
if (event.getJob().getName().startsWith(p.getName() + '.')) { | ||
return event.getJob().getName().substring(p.getName().length() + 1); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: We may want to move this in a DbUtils
class to handle parsing the simple name:
DbUtils.simpleJobNameFor()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@collado-mike left some minor comments, but otherwise great work 💯 💯 🥇
As for keeping the jobs_view
, here are my thoughts (also in response to our offline discussion):
Upside:
Queries remain simple, meaning when querying the jobs
table, the name
column is still the simple name of the job, not the FQN. Also, the web UI should display the simple name of the job and depending on how jobs are named, parsing the FQN for displaying may result in the wrong name being used (not ideal). I think given the scope of the change, jobs_view
allows us to avoid any unknown cases around job naming. The view can also be seen as a migration step to eventually having the name
column in the jobs
table be the FQN. Meaning, we can add a simple_name
column to the jobs
table ensuring the simple name and FQN are clearly defined and possibly dropping the view all together (or keep it arounds as there are clear benefits).
Downside:
I think having the name
and simple_name
column in the jobs
table would ensure the FQN or the simple name would always be referenced correctly (outside just the view). But, a deeper discussion on how much benefits this provides can be had as the REST API is how metadata should be queried for in the first place.
@@ -133,6 +176,237 @@ public void testGetLineageForNonExistantDataset() { | |||
assertThat(response.join()).isEqualTo(404); | |||
} | |||
|
|||
@Test | |||
public void testOpenLineageJobHierarchy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Since this test is Airflow specific, I would name the test testOpenLineageJobHierarchyForAirflow()
a53fe74
to
55e0c6e
Compare
Signed-off-by: Michael Collado <collado.mike@gmail.com>
…s with parents Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
b388088
to
b64b7f9
Compare
Problem
Final PR for #1928, continued from #1980. This updates the behavior of the write APIs to update the job parent field for new events and updates the read APIs to return the
simpleName
field of the job as well as the FQN. Notably, parent jobs and parent runs are created if present in the OpenLineage event but not present in the Marquez database. This handles events from Airflow DAGs where the DAG is a parent job for all tasks even though no event is ever sent for the DAG itself. A few integration tests added to validate the behavior for receiving messages from Airflow and Spark.Closes: #1928
Solution
Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)