Add utility function to simplify multi-table schemas #1832

Closed
frances-h opened this issue Mar 5, 2024 · 0 comments · Fixed by #1874
Labels
feature request Request for a new feature

frances-h commented Mar 5, 2024

Problem Description

Currently, the HMASynthesizer (HMA) cannot run on certain multi-table schemas. We already issue a warning when a schema will generate too many columns. We should also provide a utility function that reduces a multi-table schema so that HMA can run on it successfully.

Expected behavior

Add a new utility function utils.simplify_schema:
Parameters:

  • data - the data dictionary
  • metadata - the MultiTableMetadata for this dataset

Returns:

  • A data dictionary mapping table names to simplified tables
  • MultiTableMetadata for the simplified data schema

from sdv.utils import simplify_schema

simple_data, simple_metadata = simplify_schema(
    data=my_data,
    metadata=my_metadata
)

Algorithm overview

For every root table:

  • Drop any table more than 2 levels below the root (i.e. keep only the root's direct children and grandchildren), then count the number of tables still connected to the root

Select the root with the greatest number of descendant tables
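
The two steps above (pruning to depth 2, then picking a root) could be sketched roughly as follows. This is only an illustration, not the proposed implementation: it assumes the relationships are available as plain (parent, child) table-name pairs rather than read from MultiTableMetadata, the name `select_root` is hypothetical, and breaking ties alphabetically is an arbitrary choice the issue does not specify.

```python
from collections import defaultdict


def select_root(relationships, max_depth=2):
    """Pick the root table with the most descendants within max_depth levels.

    relationships: iterable of (parent_table, child_table) name pairs
    (an assumed input format, not the MultiTableMetadata API).
    Returns (root_name, set_of_descendants_to_keep).
    """
    children = defaultdict(set)
    all_tables, has_parent = set(), set()
    for parent, child in relationships:
        children[parent].add(child)
        all_tables.update((parent, child))
        has_parent.add(child)

    # A root is any table that never appears as a child.
    roots = sorted(all_tables - has_parent)

    best_root, best_kept = None, set()
    for root in roots:
        kept, frontier = set(), {root}
        # Walk down one level at a time; anything deeper is dropped.
        for _ in range(max_depth):
            frontier = {c for t in frontier for c in children[t]} - kept - {root}
            kept |= frontier
        # Strict '>' keeps the alphabetically first root on ties.
        if best_root is None or len(kept) > len(best_kept):
            best_root, best_kept = root, kept
    return best_root, best_kept


example_relationships = [
    ("users", "sessions"),
    ("sessions", "transactions"),
    ("transactions", "refunds"),  # depth 3 from "users", so it is dropped
    ("stores", "products"),
]
root, kept = select_root(example_relationships)
# root is "users"; kept is {"sessions", "transactions"}
```

Tables that are neither the selected root nor in the kept set would then be removed from both the data dictionary and the metadata.
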
Calculate the number of extended columns we can add to the root (we can reuse the logic used to generate the warning in HMA)
Allocate a number of extended columns to each child relationship
For each child:

  • Determine the number of modelable columns and add the number of child relationships for that child
  • If the number of modelable columns will generate more than the allowed number of extended columns, drop modelable columns from the child
    • Try to keep a variety of sdtypes
    • If dropping columns alone cannot bring the child under the maximum number of extended columns, drop grandchild tables until it does

For each grandchild:

  • Drop all modelable columns (grandchildren should only generate a num_rows column in their parents)
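
One possible way to "keep a variety of sdtypes" when dropping columns is a round-robin over sdtypes, sketched below. This is an assumption about how the selection could work, not something the issue specifies: `pick_columns` is a hypothetical helper, it expects only the modelable columns of one table as a plain dict of column name to sdtype, and `max_columns` stands in for whatever per-child budget the HMA column estimate allows (that estimate is not reproduced here).

```python
from collections import defaultdict
from itertools import zip_longest


def pick_columns(column_sdtypes, max_columns):
    """Keep at most max_columns columns, cycling through sdtypes for variety.

    column_sdtypes: dict mapping modelable column name -> sdtype string
    (an assumed input format for this sketch).
    """
    by_sdtype = defaultdict(list)
    for name, sdtype in column_sdtypes.items():
        by_sdtype[sdtype].append(name)

    # Round-robin: first column of every sdtype, then the second of each, ...
    ordered = [
        name
        for group in zip_longest(*by_sdtype.values())
        for name in group
        if name is not None
    ]
    return ordered[:max(max_columns, 0)]


child_columns = {
    "age": "numerical",
    "income": "numerical",
    "height": "numerical",
    "city": "categorical",
    "signup": "datetime",
}
kept = pick_columns(child_columns, max_columns=3)
# kept is ["age", "city", "signup"]: one column per sdtype before any repeats
```

For a grandchild table, the same helper with `max_columns=0` returns an empty list, which matches dropping all of its modelable columns.
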

Additional context

We should also change the warning in HMA to point to this utility function:

>>> synthesizer = HMASynthesizer(metadata)
PerformanceAlert: Using the HMASynthesizer on this metadata schema is not recommended because HMA will generate a large number of columns

Table Name   # Columns in Metadata   Est # Columns
users        12                      123123123
transactions   
...

We recommend simplifying your metadata schema using utils.simplify_schema

frances-h added the "feature request" and "new" labels Mar 5, 2024
npatki removed the "new" label Mar 5, 2024
amontanez24 added this to the 1.12.0 milestone Apr 11, 2024