Job parent hierarchy api changes #1992

Merged
merged 5 commits into from
May 20, 2022

Conversation

collado-mike
Collaborator

Problem

Final PR for #1928, continued from #1980. This changes the write APIs to set the job parent field for new events and changes the read APIs to return the job's simpleName field as well as its FQN. Notably, parent jobs and parent runs are created if they are present in the OpenLineage event but not in the Marquez database. This handles events from Airflow DAGs, where the DAG is the parent job for all tasks even though no event is ever sent for the DAG itself. A few integration tests were added to validate the behavior when receiving messages from Airflow and Spark.

Closes: #1928
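
The parent-creation behavior described above can be illustrated with a minimal in-memory sketch. This is not the Marquez DAO code: `JobHierarchySketch`, `Job`, `upsert`, and `contains` are hypothetical names, a `HashMap` stands in for the jobs table, and the FQN is assumed to be the parent name, a dot, and the simple name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the jobs table; not Marquez's actual DAO code. */
final class JobHierarchySketch {

  /** A stored job: its simple name plus the FQN of its parent, if any. */
  static final class Job {
    final String simpleName;
    final String parentFqn; // null when the job has no parent

    Job(String simpleName, String parentFqn) {
      this.simpleName = simpleName;
      this.parentFqn = parentFqn;
    }

    /** Assumption: the FQN is "<parent>.<simpleName>" when a parent exists. */
    String fqn() {
      return parentFqn == null ? simpleName : parentFqn + "." + simpleName;
    }
  }

  private final Map<String, Job> jobsByFqn = new HashMap<>();

  /** Stores the job, first creating a placeholder parent if none has been seen yet. */
  Job upsert(String simpleName, String parentName) {
    if (parentName != null) {
      // The parent may never send its own event (e.g. an Airflow DAG),
      // so create it on demand without overwriting an existing entry.
      jobsByFqn.computeIfAbsent(parentName, name -> new Job(name, null));
    }
    Job job = new Job(simpleName, parentName);
    jobsByFqn.put(job.fqn(), job);
    return job;
  }

  boolean contains(String fqn) {
    return jobsByFqn.containsKey(fqn);
  }
}
```

In this sketch, upserting a task with a parent name that has never been seen leaves both the parent and the task in the map, which mirrors the Airflow case where the DAG itself never emits an event.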

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)

@collado-mike collado-mike requested a review from wslulciuc May 16, 2022 19:14
@collado-mike collado-mike force-pushed the job_parent_hierarchy_api_changes branch 3 times, most recently from 500105b to 1f0e2fe Compare May 16, 2022 21:41
@codecov

codecov bot commented May 16, 2022

Codecov Report

Merging #1992 (9d97708) into main (9d97708) will not change coverage.
The diff coverage is n/a.

❗ Current head 9d97708 differs from pull request most recent head b64b7f9. Consider uploading reports for the commit b64b7f9 to get more accurate results

@@            Coverage Diff            @@
##               main    #1992   +/-   ##
=========================================
  Coverage     78.62%   78.62%           
  Complexity     1003     1003           
=========================================
  Files           197      197           
  Lines          5459     5459           
  Branches        424      424           
=========================================
  Hits           4292     4292           
  Misses          723      723           
  Partials        444      444           

UUID symlinkTargetId,
PGobject inputs) {
UUID jobUuid =
upsertJobNoParent(
Member

@wslulciuc wslulciuc May 18, 2022

Wouldn't we want the upsertJobNoParent() call to return the JobRow object, similar to other upsert calls? This would keep contracts the same across DAOs and also avoid the subsequent findJobByUuidAsRow() call.

Collaborator Author

Sorry, do you mean that the upsertJobNoParent query would go back to RETURNING * instead of RETURNING uuid? If that's what you mean, I made this change so that the subsequent findJobByUuidAsRow call queries the jobs_view, which returns the FQN rather than the simple name.

p -> {
if (event.getJob().getName().startsWith(p.getName() + '.')) {
return event.getJob().getName().substring(p.getName().length() + 1);
} else {
Member

Minor: We may want to move this into a DbUtils class to handle parsing the simple name:

DbUtils.simpleJobNameFor()
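
A minimal sketch of what such a helper might look like, based on the prefix-stripping logic in the diff above. The class is hypothetical, the two-argument signature is an assumption (the suggestion gives no parameters), and returning the name unchanged when there is no matching prefix is also an assumption, since the else branch is not shown in the diff.

```java
/** Hypothetical helper following the reviewer's suggested name; not actual Marquez code. */
final class DbUtils {
  private DbUtils() {}

  /**
   * Strips a leading "parentName." prefix from the job name when present.
   * Assumption: when the prefix is absent (or there is no parent), the
   * job name is returned unchanged.
   */
  static String simpleJobNameFor(String jobName, String parentName) {
    if (parentName != null && jobName.startsWith(parentName + '.')) {
      // Skip the parent name and the '.' separator that follows it.
      return jobName.substring(parentName.length() + 1);
    }
    return jobName;
  }
}
```

Matching on the parent name plus the '.' separator, rather than on the parent name alone, keeps a job such as `etl_dag_extra.task` from being treated as a child of `etl_dag`.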

Member

@wslulciuc wslulciuc left a comment

@collado-mike left some minor comments, but otherwise great work 💯 💯 🥇

As for keeping the jobs_view, here are my thoughts (also in response to our offline discussion):

Upside:

Queries remain simple: when querying the jobs table, the name column is still the simple name of the job, not the FQN. Also, the web UI should display the simple name of the job, and depending on how jobs are named, parsing the FQN for display may result in the wrong name being used (not ideal). I think, given the scope of the change, jobs_view allows us to avoid any unknown cases around job naming. The view can also be seen as a migration step toward eventually having the name column in the jobs table be the FQN. That is, we can add a simple_name column to the jobs table, ensuring the simple name and FQN are clearly defined, and possibly drop the view altogether (or keep it around, as there are clear benefits).

Downside:

I think having both the name and simple_name columns in the jobs table would ensure the FQN or the simple name is always referenced correctly (not just through the view). But a deeper discussion on how much benefit this provides can be had, as the REST API is how metadata should be queried in the first place.

api/src/main/java/marquez/db/RunDao.java
@@ -133,6 +176,237 @@ public void testGetLineageForNonExistantDataset() {
assertThat(response.join()).isEqualTo(404);
}

@Test
public void testOpenLineageJobHierarchy()
Member

Minor: Since this test is Airflow-specific, I would name the test testOpenLineageJobHierarchyForAirflow()

@collado-mike collado-mike force-pushed the job_parent_hierarchy_backfills branch from a53fe74 to 55e0c6e Compare May 20, 2022 22:34
Base automatically changed from job_parent_hierarchy_backfills to main May 20, 2022 22:39
Signed-off-by: Michael Collado <collado.mike@gmail.com>
…s with parents

Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike force-pushed the job_parent_hierarchy_api_changes branch from b388088 to b64b7f9 Compare May 20, 2022 22:41
@collado-mike collado-mike enabled auto-merge (squash) May 20, 2022 22:41
@collado-mike collado-mike merged commit dd5f53f into main May 20, 2022
@collado-mike collado-mike deleted the job_parent_hierarchy_api_changes branch May 20, 2022 22:46
Development

Successfully merging this pull request may close these issues.

Supporting Job grouping and hierarchy in Marquez