Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a utility to drop unknown references (and enforce referential integrity) #1800

Merged
merged 16 commits into from
Feb 26, 2024

Conversation

R-Palazzo
Copy link
Contributor

CU-86azc5bc2
Resolve #1792

@R-Palazzo R-Palazzo requested a review from a team as a code owner February 20, 2024 16:14
@sdv-team
Copy link
Contributor

@R-Palazzo R-Palazzo removed the request for review from a team February 20, 2024 16:14
@codecov-commenter
Copy link

codecov-commenter commented Feb 20, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (ea4c2cc) 97.26% compared to head (d1777fc) 97.27%.

❗ Current head d1777fc differs from pull request most recent head f110a87. Consider uploading reports for the commit f110a87 to get more accurate results

Files Patch % Lines
sdv/utils.py 95.23% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1800      +/-   ##
==========================================
+ Coverage   97.26%   97.27%   +0.01%     
==========================================
  Files          49       50       +1     
  Lines        4679     4733      +54     
==========================================
+ Hits         4551     4604      +53     
- Misses        128      129       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@R-Palazzo R-Palazzo force-pushed the issue-1792-add-drop_unknown_references branch from 1b2c7f2 to cb8a21b Compare February 21, 2024 13:58
@amontanez24 amontanez24 requested review from rwedge and removed request for amontanez24 February 21, 2024 16:43
sdv/_utils.py Outdated Show resolved Hide resolved
sdv/_utils.py Outdated


def _get_relationship_idx_for_child(relationships, child_table):
return [idx for idx, rel in enumerate(relationships) if rel['child_table_name'] == child_table]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need the index here or can we pass the relationship itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in 995a684

@rwedge
Copy link
Contributor

rwedge commented Feb 21, 2024

The issue mentions doing input validation on the provided data / metadata. It doesn't look like this is present.

Base automatically changed from issue-1793-functions-private to main February 22, 2024 09:00
@R-Palazzo R-Palazzo force-pushed the issue-1792-add-drop_unknown_references branch from f3ddf79 to 995a684 Compare February 22, 2024 10:17
@R-Palazzo
Copy link
Contributor Author

@rwedge thanks for the comment. Done in 462ecf2

Copy link
Contributor

@frances-h frances-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a couple of small nitpicks

sdv/_utils.py Outdated
@@ -200,3 +201,80 @@ def _format_invalid_values_string(invalid_values, num_values):
return f'{invalid_values[:num_values] + extra_missing_values}'

return f'{invalid_values}'


def _find_root_tables(relationships):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to _get_root_tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in 3c640b1

sdv/_utils.py Outdated

def _get_relationship_idx_for_parent(relationships, parent_table):
return [
idx for idx, rel in enumerate(relationships) if rel['parent_table_name'] == parent_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing, do we need the index here or can we just pass the relationship?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in f110a87

sdv/_utils.py Outdated
relationship_idx = _get_relationship_idx_for_parent(relationships, root)
for idx in relationship_idx:
relationship = relationships[idx]
parent_table = relationship['parent_table_name']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just the root table, can we move this outside of the for loop over the children?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in bff9b0a

sdv/_utils.py Outdated
relationship = relationships[idx]
parent_table = relationship['parent_table_name']
child_table = relationship['child_table_name']
parent_column = relationship['parent_primary_key']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this shouldn't change for any of the children.

Copy link
Contributor Author

@R-Palazzo R-Palazzo Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in bff9b0a

sdv/_utils.py Outdated
Comment on lines 257 to 262
valid_parent_idx = [
idx for idx in data[parent_table].index if idx not in table_to_idx_to_drop.get(
parent_table, set()
)
]
valid_parent_values = set(data[parent_table].loc[valid_parent_idx, parent_column])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to doing it once outside of the for loop as well? Once we're evaluating it as a root table, the valid parent values shouldn't change again, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done in d1777fc

sdv/_utils.py Outdated
Comment on lines 253 to 254
if child_table not in table_to_idx_to_drop:
table_to_idx_to_drop[child_table] = set()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of setting this here you could make table_to_idx_to_drop a defaultdict(set), then you could also change the table_to_idx_to_drop.get below to just indexing in directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks, done in 8b7c528

Copy link
Contributor

@frances-h frances-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for addressing

Comment on lines +213 to +218
def _get_relationship_for_child(relationships, child_table):
return [rel for rel in relationships if rel['child_table_name'] == child_table]


def _get_relationship_for_parent(relationships, parent_table):
return [rel for rel in relationships if rel['parent_table_name'] == parent_table]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: these probably don't need to be helper functions anymore.

Comment on lines +26 to +27
metadata.validate_data(data)
return data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also just add a unit test to make sure that valid data is returned as-is?

@R-Palazzo R-Palazzo merged commit c314751 into main Feb 26, 2024
37 checks passed
@R-Palazzo R-Palazzo deleted the issue-1792-add-drop_unknown_references branch February 26, 2024 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a utility to drop unknown references (and enforce referential integrity)
5 participants