Removed jobs_fqn table and moved FQN into jobs directly in order to enforce unique constraints #2448

Merged: 1 commit, Mar 14, 2023

collado-mike
Collaborator

Problem

The introduction of parent jobs and the jobs_fqn table was intended to allow Marquez to support jobs that have the same name but are triggered by different parents (e.g., a Spark job fired by different Airflow DAGs). The jobs table tracked the simple name of the job, while the jobs_fqn table tracked the fully qualified name (FQN). In addition, the jobs_fqn table became responsible for tracking the FQN of symlinked jobs, as it was too expensive to determine the new FQN of a job by following symlinks at query time. Instead, the FQN of a symlinked job is updated when the symlink is created, so we return only the FQN of the symlink target rather than the FQN of the original job.
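
To make the naming scheme above concrete, here is a minimal in-memory sketch. It is not Marquez code: the class and method names are hypothetical, and the "parent.child" FQN format is an assumption made for illustration. It shows a stored FQN per job and how creating a symlink rewrites that stored FQN eagerly, so lookups never follow symlinks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: names and the "parent.child" FQN format are assumptions. */
final class JobFqnSketch {
  // Stored FQN per job id, standing in for the jobs_fqn table.
  private final Map<String, String> fqnByJobId = new HashMap<>();

  /** Qualifies a simple job name with its parent's name, if it has one. */
  static String fqn(Optional<String> parentName, String simpleName) {
    return parentName.map(p -> p + "." + simpleName).orElse(simpleName);
  }

  void register(String jobId, Optional<String> parentName, String simpleName) {
    fqnByJobId.put(jobId, fqn(parentName, simpleName));
  }

  /** Rewrites the source job's stored FQN at symlink creation, so reads need no traversal. */
  void symlink(String sourceJobId, String targetJobId) {
    String targetFqn = fqnByJobId.get(targetJobId);
    if (targetFqn == null) {
      throw new IllegalArgumentException("unknown symlink target: " + targetJobId);
    }
    fqnByJobId.put(sourceJobId, targetFqn);
  }

  String lookup(String jobId) {
    return fqnByJobId.get(jobId);
  }

  public static void main(String[] args) {
    JobFqnSketch table = new JobFqnSketch();
    // The same Spark job name fired by two different DAGs gets two distinct FQNs.
    table.register("job-1", Optional.of("dag_a"), "spark_job");
    table.register("job-2", Optional.of("dag_b"), "spark_job");
    System.out.println(table.lookup("job-1"));
    table.symlink("job-1", "job-2");
    System.out.println(table.lookup("job-1"));
  }
}
```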

Unfortunately, this means that neither the jobs table nor the jobs_fqn table can enforce the uniqueness constraint we had on the fully qualified name of a job. Thus, in production, we see errors like the following when trying to load a job by its name:

java.lang.IllegalStateException: Multiple values for optional: ['JobRow(uuid=b971d547-ea9d-44c1-908f-9dcc14faba98, type=BATCH, createdAt=2022-12-10T10:01:56.653991Z, updatedAt=2022-12-10T10:01:56.653991Z, namespaceName=...']
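
The failure mode can be reproduced in miniature. The sketch below is a stand-in, not the actual lookup code: `FindOneSketch`, `JobRow`, and `findOneByFqn` are hypothetical names. It shows a lookup that expects at most one row per FQN and throws as soon as two rows share one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative only: a "find one" lookup over rows that were assumed to be unique by FQN. */
final class FindOneSketch {
  static final class JobRow {
    final String uuid;
    final String fqn;

    JobRow(String uuid, String fqn) {
      this.uuid = uuid;
      this.fqn = fqn;
    }

    @Override
    public String toString() {
      return "JobRow(uuid=" + uuid + ", fqn=" + fqn + ")";
    }
  }

  /** Returns the single matching row, empty if none, and throws if the FQN is not unique. */
  static Optional<JobRow> findOneByFqn(List<JobRow> rows, String fqn) {
    List<JobRow> matches = new ArrayList<>();
    for (JobRow row : rows) {
      if (row.fqn.equals(fqn)) {
        matches.add(row);
      }
    }
    if (matches.size() > 1) {
      throw new IllegalStateException("Multiple values for optional: " + matches);
    }
    return matches.isEmpty() ? Optional.empty() : Optional.of(matches.get(0));
  }

  public static void main(String[] args) {
    List<JobRow> rows = Arrays.asList(
        new JobRow("uuid-1", "etl_dag.load_task"),
        new JobRow("uuid-2", "etl_dag.load_task")); // duplicate FQN, nothing prevented it
    try {
      findOneByFqn(rows, "etl_dag.load_task");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```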

In particular, this happens on two occasions when receiving Airflow OpenLineage events:

  1. We receive a FAIL event with no START event. The parent facet of the run is omitted, so Marquez creates a job with no parent but the same FQN.
  2. We receive a FAIL event prior to the START event. This usually happens when requests are queued by the load balancer, or sometimes when the START event itself is particularly large and deserializing it takes longer than deserializing the FAIL event.
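
The second case can be sketched as follows. This is a simplified model, not the real event handling: the event shapes, the job names, and the "parent.child" FQN format are all assumptions. Rows are matched on (parent, name) only, so a parentless job and a child job can end up with the same FQN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Illustrative only: event shapes, names, and the "parent.child" FQN format are assumptions. */
final class OutOfOrderEventsSketch {
  static final class JobRow {
    final String parentName; // null when the event carried no parent facet
    final String name;
    final String fqn;

    JobRow(String parentName, String name) {
      this.parentName = parentName;
      this.name = name;
      this.fqn = parentName == null ? name : parentName + "." + name;
    }
  }

  private final List<JobRow> jobs = new ArrayList<>();

  /** Matches existing rows on (parent, name) only, so nothing checks the FQN for uniqueness. */
  void onEvent(String parentName, String jobName) {
    for (JobRow row : jobs) {
      if (Objects.equals(row.parentName, parentName) && row.name.equals(jobName)) {
        return; // same job seen before
      }
    }
    jobs.add(new JobRow(parentName, jobName));
  }

  long rowsWithFqn(String fqn) {
    return jobs.stream().filter(row -> row.fqn.equals(fqn)).count();
  }

  public static void main(String[] args) {
    OutOfOrderEventsSketch db = new OutOfOrderEventsSketch();
    db.onEvent(null, "etl_dag.load_task"); // FAIL arrives first, without a parent facet
    db.onEvent("etl_dag", "load_task");    // START arrives later, with the parent facet
    System.out.println(db.rowsWithFqn("etl_dag.load_task")); // prints 2
  }
}
```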

Solution

This change eliminates the jobs_fqn table and reestablishes the uniqueness constraint on the jobs table's name column. It also adds a simple_name column to the table, which is used by the view to return the column of the same name. Tests for the two cases mentioned above are added to ensure we can handle Airflow events that omit the parent facet.
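
As a rough model of the effect, the sketch below keys jobs on their fully qualified name, with a map standing in for the unique constraint. It is not the actual migration or DAO code: the class, the "parent.child" FQN format, and the conflict policy (keep the existing row unchanged) are assumptions, and a real upsert would likely also refresh columns on conflict.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a map keyed on the FQN stands in for a unique constraint on jobs.name. */
final class UniqueNameJobsSketch {
  static final class JobRow {
    final String name;       // fully qualified name, unique across the table
    final String simpleName; // job name as first reported, without any parent prefix added

    JobRow(String name, String simpleName) {
      this.name = name;
      this.simpleName = simpleName;
    }
  }

  private final Map<String, JobRow> jobsByName = new LinkedHashMap<>();

  /** Inserts the job if its FQN is new; otherwise returns the existing row (assumed policy). */
  JobRow upsert(String parentName, String jobName) {
    String fqn = parentName == null ? jobName : parentName + "." + jobName;
    return jobsByName.computeIfAbsent(fqn, key -> new JobRow(key, jobName));
  }

  int size() {
    return jobsByName.size();
  }

  public static void main(String[] args) {
    UniqueNameJobsSketch db = new UniqueNameJobsSketch();
    JobRow fromFail = db.upsert(null, "etl_dag.load_task"); // FAIL without a parent facet
    JobRow fromStart = db.upsert("etl_dag", "load_task");   // START with the parent facet
    System.out.println(db.size());             // prints 1
    System.out.println(fromFail == fromStart); // prints true
  }
}
```

With the FQN as the key, both events resolve to the same row regardless of arrival order, which is the property the two new test cases are described as covering.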

The jobs_view is also updated to omit symlinked jobs so that the read queries no longer have to filter them out. Aliases are moved from the jobs_fqn table to the jobs table so old job names can still be found.
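
A minimal sketch of that read path, under stated assumptions: the field names `symlinkTarget` and `aliases` and both methods are hypothetical, and the real view is SQL rather than Java. The view drops rows that point at a symlink target, and a lookup matches either the current name or one of the aliases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Illustrative only: field and method names are assumptions, not the actual schema. */
final class JobsViewSketch {
  static final class JobRow {
    final String name;
    final String symlinkTarget; // null unless this job has been symlinked to another job
    final List<String> aliases; // old names under which this job can still be found

    JobRow(String name, String symlinkTarget, List<String> aliases) {
      this.name = name;
      this.symlinkTarget = symlinkTarget;
      this.aliases = aliases;
    }
  }

  /** The "view": every job that is not itself a symlink to another job. */
  static List<JobRow> view(List<JobRow> jobs) {
    List<JobRow> visible = new ArrayList<>();
    for (JobRow row : jobs) {
      if (row.symlinkTarget == null) {
        visible.add(row);
      }
    }
    return visible;
  }

  /** Reads go through the view and match on the current name or any alias. */
  static Optional<JobRow> findByNameOrAlias(List<JobRow> jobs, String name) {
    for (JobRow row : view(jobs)) {
      if (row.name.equals(name) || row.aliases.contains(name)) {
        return Optional.of(row);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<JobRow> jobs = Arrays.asList(
        new JobRow("dag.old_name", "dag.new_name", Collections.emptyList()),
        new JobRow("dag.new_name", null, Arrays.asList("dag.old_name")));
    System.out.println(view(jobs).size());                                  // prints 1
    System.out.println(findByNameOrAlias(jobs, "dag.old_name").get().name); // prints dag.new_name
  }
}
```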

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

collado-mike requested a review from wslulciuc on March 6, 2023 19:52
boring-cyborg bot added the api (API layer changes) label on Mar 6, 2023
collado-mike force-pushed the fix/unique_job_fqn branch 2 times, most recently from be4dc89 to 0e6f707, on March 6, 2023 23:48
…nforce unique name constraints

Signed-off-by: Michael Collado <collado.mike@gmail.com>

codecov bot commented Mar 7, 2023

Codecov Report

Merging #2448 (c81e4ed) into main (8d28ed5) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #2448      +/-   ##
============================================
- Coverage     83.61%   83.60%   -0.01%     
+ Complexity     1214     1213       -1     
============================================
  Files           231      231              
  Lines          5522     5520       -2     
  Branches        266      266              
============================================
- Hits           4617     4615       -2     
  Misses          762      762              
  Partials        143      143              
Impacted Files                                     Coverage Δ
api/src/main/java/marquez/db/JobDao.java           100.00% <ø> (ø)
api/src/main/java/marquez/db/RunDao.java           92.40% <ø> (ø)
api/src/main/java/marquez/api/JobResource.java     93.05% <100.00%> (ø)
api/src/main/java/marquez/db/OpenLineageDao.java   96.29% <100.00%> (-0.02%) ⬇️

Member

wslulciuc left a comment

Great work, @collado-mike! Make sure to open a follow-up issue to remove the jobs_fqn table; otherwise 👍
