-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix recursive views perf #2043
Fix recursive views perf #2043
Conversation
…possible Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
…changes can be easily compared in version control Signed-off-by: Michael Collado <collado.mike@gmail.com>
Codecov Report
@@ Coverage Diff @@
## main #2043 +/- ##
============================================
- Coverage 78.81% 78.79% -0.03%
+ Complexity 1013 1011 -2
============================================
Files 200 200
Lines 5579 5573 -6
Branches 422 422
============================================
- Hits 4397 4391 -6
Misses 730 730
Partials 452 452
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few minors comments, but overall the changes seem reasonable (I double checked the SQL 😉). Thanks for the detailed writeup (as usual), @collado-mike! Also, since the DB write performance issues were observed only after deploying to production, having a not too exhaustive load test in CI with a report can help spot any DB bottlenecks sooner. I've opened #2047
api/src/main/resources/marquez/db/migration/V45__update_jobs_view_rule.sql
Outdated
Show resolved
Hide resolved
api/src/main/resources/marquez/db/migration/R__1_jobs_view_and_rewrite_function.sql
Outdated
Show resolved
Hide resolved
api/src/main/resources/marquez/db/migration/R__1_jobs_view_and_rewrite_function.sql
Outdated
Show resolved
Hide resolved
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Problem
In #1928, we introduced a
jobs_view
, which used recursive queries to construct a job's fully qualified name by collecting the job's ancestry and joining the names. Additionally, #1946 introduced job symlinks, which also used recursive queries to find the last link in a chain of symlinks and always return the ultimate target job. Unfortunately, while initial testing showed those queries performed reasonably well on read requests, we didn't sufficiently load test the OpenLineage write API. Unfortunately, under sustained load in a real production environment, the write API (which repeatedly queries thejobs_view
either directly or via theruns_view
) puts considerable strain on our Postgres database.Solution
One of the goals with using a view was to consistently construct job FQNs even when ancestors are renamed (using the symlink feature). This allows us to avoid big migration tasks updating the
runs
table with new job names when, e.g., an Airflow DAG or a Spark application is renamed, thus renaming all of its child tasks. However, constructing the FQN on every read is too computationally expensive, so this PR changes the view to compute the FQN on write. This continues to use the view, but now there is aTRIGGER
onINSERT
to compute the FQN for a job, as well as to traverse the symlink chain and write the new FQN and symlink target to a new table calledjobs_fqn
. This table also collects the past aliases of a job, so we can serve requests that still refer to the old job name, as described in the original symlink issue.I utilized writes to the
jobs_view
in order to capture the write by the trigger, so the code now references the view instead of the originaljobs
table. This also allowed me to return the full row from the insert query rather than just the job UUID, as it previously did. I also simplified the uniqueness constraint by adding aparent_job_id_string
field, which just converts the parent job id to a string or an empty string if null (we couldn't use a null UUID to enforce uniqueness for jobs that have no parents becauseNULL
!=NULL
in Postgres).Given the
jobs_view
and associated write function are likely to change in the future, I made their definitions repeatable migrations so that future changes can be captured in version control (I went ahead and did the same for theruns_view
for the same reason, though unrelated to the purpose of this PR).I also included an additional test for the DAG renaming use case.
The following snapshot shows performance in a load test both before and after the proposed code change. Both tests sustained a rate of 900 requests per minute to the OpenLineage write API. I used the same RDS db.m5.large instance for both tests (cleared the DB structure between tests).
(note that #2041 is included in the test deployment)
With the current code at HEAD, the DB's CPU utilization climbs to 95%+ within about an hour, then hovers just below 100%. The request graph shows the request rate dropping after that because pods are consistently being restarted due to failed health checks (can't access the database), so eventually only about 50% of requests are being handled.
Conversely, the DB CPU utilization never rises about ~50% with the new code changes. The request rate stays pretty steady throughout the test (there are still some OOM restarts, but that's consistent with what we see in production with an older version of Marquez).
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)