Add a utility to drop unknown references (and enforce referential integrity) #1800
Codecov Report
Coverage diff, main vs. #1800:
Coverage: 97.26% -> 97.27% (+0.01%)
Files: 49 -> 50 (+1)
Lines: 4679 -> 4733 (+54)
Hits: 4551 -> 4604 (+53)
Misses: 128 -> 129 (+1)
Force-pushed from 1b2c7f2 to cb8a21b.
sdv/_utils.py (outdated)

    def _get_relationship_idx_for_child(relationships, child_table):
        return [idx for idx, rel in enumerate(relationships) if rel['child_table_name'] == child_table]

Do we actually need the index here or can we pass the relationship itself?
Yes, done in 995a684
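To illustrate the suggestion, here is a minimal sketch comparing the index-based helper with one that returns the relationship dicts directly. The sample `relationships` list and the un-prefixed function names are made up for illustration; only the key names come from the snippet under review.

```python
# Hypothetical sample metadata; only the dict keys mirror the reviewed code.
relationships = [
    {'parent_table_name': 'users', 'child_table_name': 'sessions'},
    {'parent_table_name': 'sessions', 'child_table_name': 'transactions'},
]


def get_relationship_idx_for_child(relationships, child_table):
    """Index-based variant: callers must look each relationship up again."""
    return [idx for idx, rel in enumerate(relationships) if rel['child_table_name'] == child_table]


def get_relationships_for_child(relationships, child_table):
    """Direct variant: callers receive the relationship dicts themselves."""
    return [rel for rel in relationships if rel['child_table_name'] == child_table]


# Index-based call site needs an extra relationships[idx] lookup.
parents_via_idx = [
    relationships[idx]['parent_table_name']
    for idx in get_relationship_idx_for_child(relationships, 'sessions')
]

# Direct call site reads the field straight off each relationship.
parents_direct = [
    rel['parent_table_name']
    for rel in get_relationships_for_child(relationships, 'sessions')
]
```

Both call sites produce the same result; the direct variant simply removes one level of indirection.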
The issue mentions doing input validation on the provided data / metadata. It doesn't look like this is present.
Force-pushed from f3ddf79 to 995a684.
Looks good! Just a couple of small nitpicks
sdv/_utils.py (outdated)

    @@ -200,3 +201,80 @@ def _format_invalid_values_string(invalid_values, num_values):
            return f'{invalid_values[:num_values] + extra_missing_values}'

        return f'{invalid_values}'


    def _find_root_tables(relationships):

Can we rename this to _get_root_tables?
Yes, done in 3c640b1
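The body of this helper is not shown in the thread. As a rough sketch only, assuming a "root" table is one that appears as a parent in some relationship but never as a child, a function with this name might look like the following. The logic and the sample data are assumptions, not the PR's implementation.

```python
def get_root_tables(relationships):
    """Return tables that act as a parent somewhere but are never a child.

    Assumed definition of "root"; the real helper may differ.
    """
    parent_tables = {rel['parent_table_name'] for rel in relationships}
    child_tables = {rel['child_table_name'] for rel in relationships}
    return parent_tables - child_tables


# Hypothetical three-table chain: users -> sessions -> transactions.
relationships = [
    {'parent_table_name': 'users', 'child_table_name': 'sessions'},
    {'parent_table_name': 'sessions', 'child_table_name': 'transactions'},
]
roots = get_root_tables(relationships)
```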
sdv/_utils.py (outdated)

    def _get_relationship_idx_for_parent(relationships, parent_table):
        return [
            idx for idx, rel in enumerate(relationships) if rel['parent_table_name'] == parent_table

Same thing, do we need the index here or can we just pass the relationship?
Yes, done in f110a87
sdv/_utils.py (outdated)

    relationship_idx = _get_relationship_idx_for_parent(relationships, root)
    for idx in relationship_idx:
        relationship = relationships[idx]
        parent_table = relationship['parent_table_name']

Since this is just the root table, can we move this outside of the for loop over the children?
Yes, done in bff9b0a
sdv/_utils.py (outdated)

    relationship = relationships[idx]
    parent_table = relationship['parent_table_name']
    child_table = relationship['child_table_name']
    parent_column = relationship['parent_primary_key']

Same as above, this shouldn't change for any of the children.
Yes, done in bff9b0a
sdv/_utils.py (outdated)

    valid_parent_idx = [
        idx for idx in data[parent_table].index if idx not in table_to_idx_to_drop.get(
            parent_table, set()
        )
    ]
    valid_parent_values = set(data[parent_table].loc[valid_parent_idx, parent_column])

Can we move this to doing it once outside of the for loop as well? Once we're evaluating it as a root table, the valid parent values shouldn't change again, right?
Yes, done in d1777fc
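The hoisting requested in the last few comments can be sketched as follows. This is a simplified illustration, not the PR's code: tables are plain dicts mapping a row index to a row dict rather than pandas DataFrames, and `find_orphan_rows`, `dropped_parent_idx`, and the `child_foreign_key` key are assumed names. The point is that everything depending only on the root table is computed once, before the loop over its children.

```python
def find_orphan_rows(data, relationships, root, dropped_parent_idx=frozenset()):
    """Collect child row indices whose foreign key has no valid parent in `root`."""
    child_relationships = [rel for rel in relationships if rel['parent_table_name'] == root]
    if not child_relationships:
        return {}

    # Loop-invariant work: depends only on the root table, so do it once.
    parent_column = child_relationships[0]['parent_primary_key']
    valid_parent_values = {
        row[parent_column]
        for idx, row in data[root].items()
        if idx not in dropped_parent_idx
    }

    orphans = {}
    for rel in child_relationships:
        # Only the child-specific values change per iteration.
        child_table = rel['child_table_name']
        child_column = rel['child_foreign_key']
        for idx, row in data[child_table].items():
            if row[child_column] not in valid_parent_values:
                orphans.setdefault(child_table, set()).add(idx)
    return orphans


# Hypothetical data: session 1 points at a user id that does not exist.
data = {
    'users': {0: {'id': 1}, 1: {'id': 2}},
    'sessions': {0: {'user_id': 1}, 1: {'user_id': 5}, 2: {'user_id': 2}},
}
relationships = [{
    'parent_table_name': 'users',
    'child_table_name': 'sessions',
    'parent_primary_key': 'id',
    'child_foreign_key': 'user_id',
}]
orphans = find_orphan_rows(data, relationships, 'users')
```

Passing `dropped_parent_idx={1}` treats user row 1 as already dropped, so session 2 becomes an orphan as well.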
sdv/_utils.py (outdated)

    if child_table not in table_to_idx_to_drop:
        table_to_idx_to_drop[child_table] = set()

Instead of setting this here you could make table_to_idx_to_drop a defaultdict(set), then you could also change the table_to_idx_to_drop.get below to just indexing in directly.
Good point, thanks, done in 8b7c528
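A short sketch of the `defaultdict(set)` pattern suggested above, using made-up table names and indices. It removes the need for both the `if child_table not in ...` guard and the `.get(parent_table, set())` fallback.

```python
from collections import defaultdict

# Missing keys are created on first access with an empty set.
table_to_idx_to_drop = defaultdict(set)

# No membership check needed before adding.
table_to_idx_to_drop['sessions'].add(3)
table_to_idx_to_drop['sessions'].update({5, 8})

# Direct indexing replaces .get(table, set()); an unseen table yields an empty set.
dropped_for_users = table_to_idx_to_drop['users']
```

One behavior worth keeping in mind: indexing a missing key on a `defaultdict` also inserts that key, so `'users'` is now present in `table_to_idx_to_drop` with an empty set. That matters if later code iterates over the keys.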
Looks good! Thanks for addressing
    def _get_relationship_for_child(relationships, child_table):
        return [rel for rel in relationships if rel['child_table_name'] == child_table]


    def _get_relationship_for_parent(relationships, parent_table):
        return [rel for rel in relationships if rel['parent_table_name'] == parent_table]

nitpick: these probably don't need to be helper functions anymore.
    metadata.validate_data(data)
    return data

Can we also just add a unit test to make sure that valid data is returned as-is?
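A test along these lines might look like the sketch below. It does not call the PR's utility: `drop_unknown_references_sketch` is a hypothetical stand-in that works on plain lists of row dicts and makes a single pass over the relationships in the order given, so it only demonstrates the shape of the assertion (valid input comes back unchanged).

```python
import copy


def drop_unknown_references_sketch(data, relationships):
    """Hypothetical stand-in: drop child rows whose foreign key has no parent."""
    result = {name: list(rows) for name, rows in data.items()}
    for rel in relationships:
        parent_keys = {
            row[rel['parent_primary_key']] for row in result[rel['parent_table_name']]
        }
        child = rel['child_table_name']
        result[child] = [
            row for row in result[child] if row[rel['child_foreign_key']] in parent_keys
        ]
    return result


RELATIONSHIPS = [{
    'parent_table_name': 'users',
    'child_table_name': 'sessions',
    'parent_primary_key': 'id',
    'child_foreign_key': 'user_id',
}]


def test_valid_data_is_returned_unchanged():
    # Setup: every session references an existing user.
    data = {
        'users': [{'id': 1}, {'id': 2}],
        'sessions': [{'user_id': 1}, {'user_id': 2}],
    }
    expected = copy.deepcopy(data)

    # Run
    result = drop_unknown_references_sketch(data, RELATIONSHIPS)

    # Assert: output matches the input and the input was not mutated.
    assert result == expected
    assert data == expected


test_valid_data_is_returned_unchanged()
```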
CU-86azc5bc2
Resolve #1792