-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OL facets - PR2 - read facets from views based on lineage_events table #2355
Conversation
c45ac3a
to
bad157a
Compare
@Slf4j | ||
public class V56_1__FacetViews implements JavaMigration { | ||
|
||
public static final String DATASET_FACETS_DEFINITION = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migration is written as Java code because facet definitions will be reused in a migration that writes data to facets table.
3b63ae1
to
c89504d
Compare
117f18c
to
1624f92
Compare
1624f92
to
0d322ef
Compare
c89504d
to
be0f179
Compare
0d322ef
to
bde93bd
Compare
Codecov Report
@@ Coverage Diff @@
## main #2355 +/- ##
============================================
+ Coverage 76.81% 77.11% +0.30%
- Complexity 1195 1234 +39
============================================
Files 226 228 +2
Lines 5473 5572 +99
Branches 443 447 +4
============================================
+ Hits 4204 4297 +93
- Misses 772 775 +3
- Partials 497 500 +3
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
docker/init-db.sh
Outdated
@@ -8,7 +8,7 @@ | |||
set -eu | |||
|
|||
psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" > /dev/null <<-EOSQL | |||
CREATE USER ${MARQUEZ_USER}; | |||
CREATE USER ${MARQUEZ_USER} SUPERUSER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wslulciuc @collado-mike @mobuchowski This line is worth looking at and perhaps a discussion.
As long we're using PSQL 12, super user is required to install uuid-ossp
extension. We can remove it once upgrading PostgreSQL version to 13.
Additionally, the migration code should work in a way such that it is allowed to run externally sql:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp"
and run migration with no extra privileges.
The CHANGELOG info about this will be included in the PR3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great to upgrade our postgres version, but we'll want to decouple the upgrade from the migration. Meaning, there's no way arounds having to require users to run the migration as superuser
. Since requiring superuser
is only for the migration, providing migration steps in the changelog would be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, requiring the Postgres upgrade and/or running the migration process as a superuser seems like a serious hurdle. What's the principal gain the user gets from this? I.e., why do the facet tables require a UUID primary key to begin with? AFAIK, we only join with those tables based on the foreign keys (run_uuid, job_uuid, dataset_uuid). If we get rid of the primary keys on those facet records, we don't have to worry about new extensions or postgres upgrades and I don't know if there's any downside to eliminating them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uuid as a primary key has been proposed in proposal #2076 some time ago before I started working on that.
For me it looked like some general pattern in Marquez, that we create primary keys for tables so that database schema can be easily extended in future (easy joining of tables). For these tables, primary key cannot be derived from other columns. But I don't see any blocker in removing those columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, yes the migration would be a hurdle for users running Marquez with postgres 12
or below. As @pawel-big-lebowski mentioned, we do have a general pattern for defining a uuid
(=PK) column for most tables (mapping tables excluded ex: dataset_fields_tag_mapping
). It would be great to keep this convention as we don't know how the facet tables will be used or extended.
For the facet tables, the uuid
ensure uniqueness. For example, the schema for dataset_facets
has dataset_uuid
and run_uuid
but a run may have more than one facet defined with the same name, therefore uuid
would ensure we capture both values of the facet. To simply our merging logic for facets, the facet tables are, for now, append only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can still append a new facet for the same run even without the UUID. The primary key is principally useful for fetching a specific record or referencing a specific record from another table. Neither of those is the case here.
Arguing that we don't know how the facet tables will be used doesn't make sense either, as we can always add the UUID column after the fact. When there's a reason to introduce the primary key, we can argue that the extra work is worth the effort. For now, the current migration requires a very substantial change in the system with no benefit. Following the pattern simply for the sake of the pattern isn't a good reason to require these changes to the system or the deployment process that users are following.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll have to update the copyright (2022
-> 2023
):
/*
* Copyright 2018-2023 contributors to the Marquez project
* SPDX-License-Identifier: Apache-2.0
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove metadata.json as it's not used.
973832e
to
2df36cf
Compare
2df36cf
to
80116ab
Compare
Not sure why anything related to |
80116ab
to
1d1fec4
Compare
), selected_dataset_version_facets AS ( | ||
SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet | ||
FROM selected_dataset_version_runs dv | ||
LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid | ||
LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original query returned facets that were attached to the specific dataset version that is being queried. This query returns all facets that are attached to the dataset. This will include things like data quality facets on previous or newer versions of the dataset, which may not reflect the state of the selected dataset version. We need to be able to attach and query facets to specific dataset versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great remark, I've added missing join on run_uuid
.
1d1fec4
to
5967b0c
Compare
), selected_dataset_version_facets AS ( | ||
SELECT dv.uuid, dv.dataset_name, dv.namespace_name, dv.run_uuid, df.lineage_event_time, df.facet | ||
FROM selected_dataset_version_runs dv | ||
LEFT JOIN lineage_events le ON le.run_uuid = dv.run_uuid | ||
LEFT JOIN dataset_facets_view df ON df.dataset_uuid = dv.dataset_uuid AND df.run_uuid = dv.run_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The selected_dataset_version_facets
CTE is expensive to create because we need to join on the runs_input_mapping
table to find all of the runs that read from or wrote to the selected dataset version. Rather than going through this extra hop, can we add the dataset version UUID to the facet table so we can simply query all facets that apply to a specific dataset version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ;-) I had similar thoughts on this but the counterargument in my head was to avoid doing endlessly more & more in a single PR.
I think you're right and there won't be better moment to fix this.
I've included it in a second commit within this PR.
5967b0c
to
298b4e1
Compare
…able Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
298b4e1
to
b4b1903
Compare
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There's a comment on the Dataset facet retrieval, but I think it's an optimization and can be done now or later.
), dataset_runs AS ( | ||
SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, event_time, event | ||
SELECT d.uuid, d.name, d.namespace_name, dv.run_uuid, dv.lifecycle_state, lineage_event_time, facet | ||
FROM selected_datasets d | ||
INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid | ||
INNER JOIN dataset_versions AS dv ON dv.uuid = d.current_version_uuid | ||
LEFT JOIN LATERAL ( | ||
SELECT run_uuid, event_time, event FROM lineage_events | ||
WHERE run_uuid = dv.run_uuid | ||
) e ON e.run_uuid = dv.run_uuid | ||
SELECT run_uuid, lineage_event_time, facet FROM dataset_facets_view | ||
WHERE dataset_uuid = dv.dataset_uuid | ||
) df ON df.run_uuid = dv.run_uuid | ||
UNION | ||
SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, event_time, event | ||
SELECT d.uuid, d.name, d.namespace_name, rim.run_uuid, lifecycle_state, lineage_event_time, facet | ||
FROM selected_datasets d | ||
INNER JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid | ||
LEFT JOIN runs_input_mapping rim ON dv.uuid = rim.dataset_version_uuid | ||
LEFT JOIN LATERAL ( | ||
SELECT run_uuid, event_time, event FROM lineage_events | ||
WHERE run_uuid = rim.run_uuid | ||
) e ON e.run_uuid = rim.run_uuid | ||
SELECT dataset_uuid, run_uuid, lineage_event_time, facet FROM dataset_facets_view | ||
WHERE dataset_uuid = dv.dataset_uuid AND run_uuid = rim.run_uuid | ||
) df ON df.run_uuid = rim.run_uuid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only purpose of this CTE is to determine which runs have input or output facets for the current dataset version. Given that the dataset_facets_view
now has dataset_version_uuid
, I think we can drop this whole subquery and join directly on df.dataset_version_uuid=
d.current_version_uuid` below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #2407
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 💯 🚀 🥇
Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com
Problem
Migration of facets from
lineage_events
to separate facets tables will be asynchronous for some users as it will require manual step after database schema is migrated to newer version.We want to have an option to switch Marquez code to read facets from
lineage_events
before the manual migration and switch to usingjob_facets
,dataset_facets
andrun_facets
tables after the migration. We cannot split it into separate releases, as migration is manual and asynchronous and we never know when the user runs it (please note users may run upgrade multiple versions in one run).Solution
job_facets_vew
,dataset_facets_view
andrun_facets_view
will be created on the top of existinglineage_events
table.lineage_events
table to facets' tables and views will be replaced to point atjob_facets
,dataset_facets
andrun_facets
tables.Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)