-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add utility function to simplify multi-table schemas #1874
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good! I think in a couple of places we can maybe reduce the number of functions and simplify the logic a few steps. It would be helpful to add a docstring for the _get_num_column_to_drop
so we can discuss it more
Also in general can we fix some of the names:
- childs -> children
- grandchilds -> grandchildren
- simplify -> simplified if it is being used as an adjective
sdv/multi_table/utils.py
Outdated
all_descendants_to_keep = all_descendants_per_root[root_with_max_descendants] | ||
|
||
simplify_metadata, childs, grandchilds = _simplify_relationships( | ||
simplify_metadata, root_with_max_descendants, all_descendants_to_keep | ||
) | ||
|
||
simplify_metadata = _simplify_non_descendants_tables( | ||
simplify_metadata, root_with_max_descendants, all_descendants_to_keep | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be simplified by:
- Getting a list of all table names that are within 2 of the root (all of the root's children or grandchildren)
- Looping through all tables and pruning the ones that aren't in that list
tests/integration/utils/test_poc.py
Outdated
def test_simplify_schema_invalid_metadata(): | ||
"""Test ``simplify_schema`` when the metadata is not invalid.""" | ||
# Setup | ||
real_data, metadata = download_demo('multi_table', 'Bupa_v1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're in the process of fixing the demo metadata, we might want to update this to modify the downloaded demo so that it is guaranteed to be invalid
sdv/multi_table/utils.py
Outdated
root_tables = _get_root_tables(relationships) | ||
all_descendants = {} | ||
for root in root_tables: | ||
order_descendances = _get_n_order_descendants(relationships, root, order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: typo
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1874 +/- ##
==========================================
+ Coverage 97.46% 97.55% +0.08%
==========================================
Files 51 52 +1
Lines 4978 5143 +165
==========================================
+ Hits 4852 5017 +165
Misses 126 126 ☔ View full report in Codecov by Sentry. |
sdv/multi_table/utils.py
Outdated
|
||
Returns: | ||
int: | ||
Number of columns to drop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like this returns a tuple
tests/unit/utils/test_poc.py
Outdated
def test_simplify_schema_invalid_data(): | ||
"""Test ``simplify_schema`` when the data is not valid.""" | ||
# Setup | ||
real_data, metadata = download_demo('multi_table', 'fake_hotels') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we shouldn't download data for the unit tests
5d72c6a
to
8ca4158
Compare
Thanks for the review @amontanez24, I addressed latest comments in this commit 8ca4158. |
a7cf99b
to
eab810d
Compare
sdv/multi_table/utils.py
Outdated
def _get_relationship_for_child(relationships, child_table): | ||
return [rel for rel in relationships if rel['child_table_name'] == child_table] | ||
|
||
|
||
def _get_relationship_for_parent(relationships, parent_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can we make plural (ie. get_relationships)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in 209b720
sdv/multi_table/utils.py
Outdated
for root in root_tables: | ||
order_descendants = _get_n_order_descendants(relationships, root, order) | ||
all_descendant_root = set() | ||
for order_key in order_descendants: | ||
all_descendant_root.update(order_descendants[order_key]) | ||
|
||
all_descendants[root] = all_descendant_root | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we prevent having to recompute children and grandchildren later by having this method return a dict in the following format:
root -> {'order_1': ..., 'order_2': ..., 'num_descendants': 6}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done in 209b720
sdv/multi_table/hma.py
Outdated
self._estimate_columns_traversal(table_name, columns_per_table, visited) | ||
|
||
return columns_per_table | ||
"We recommend simplifying your metadata schema using 'sdv.utils.simplify_schema'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: should be "using 'sdv.utils.poc.simplify_schema'"
sdv/multi_table/utils.py
Outdated
elif is_child_to_drop and child_table in metadata.tables: | ||
del metadata.tables[child_table] | ||
|
||
return metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function will actually modify the metadata object we pass in, we should remove the return. The convention is usually to have functions that modify arguments in-place to not return anything.
sdv/multi_table/utils.py
Outdated
for column in columns_to_drop: | ||
del columns[column] | ||
|
||
return metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, we should remove the return since we're directly modifying the passed in metadata object
sdv/multi_table/utils.py
Outdated
- num_cols_to_drop (int): Number of columns to drop from the child table. | ||
- modelable_columns (dict): Dictionary that maps the modelable sdtype to their columns. | ||
""" | ||
num_column_parameter = 4 # for beta distribution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pull this from HMA instead of using a magic number? Maybe update the distribution to number of parameters dict to a constant, and pull it and HMA's default distribution (in case it ever changes) directly from HMA?
sdv/multi_table/utils.py
Outdated
columns_to_sdtypes = { | ||
column: columns[column]['sdtype'] for column in columns | ||
} | ||
sdtypes_to_columns = defaultdict(list) | ||
for col, sdtype in columns_to_sdtypes.items(): | ||
sdtypes_to_columns[sdtype].append(col) | ||
|
||
modelable_columns = { | ||
key: value for key, value in sdtypes_to_columns.items() if key in MODELABLE_SDTYPE | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified to something like:
modelable_columns = defaultdict(list)
for col, col_meta in columns.items():
if col_meta['sdtype'] in MODELABLE_SDTYPE:
modelable_columns[col_meta['sdtype'].append(col)
Thanks @frances-h, @amontanez24 for the review. All comments should have been addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM🍾 🎆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for addressing! 🎉
CU-86azkea74
Resolve #1832
Formula to determine how many columns to drop for a child
Some other info:
class
orstatic methods
so I can use them outside HMA. I preferred keeping them inside HMA, though, because they are very specific to this model (not used for HSA, for instance.)The method
_simplify_data(data, metadata)
makes your data match your metadata.It can be useful when you're updating your metadata (removing columns and tables...) to make it work along with your data. If we want, it can be public.