If a parent has 0/1 children, HMASynthesizer may create constant data #1895

Closed
npatki opened this issue Apr 4, 2024 · 0 comments · Fixed by #2068

npatki commented Apr 4, 2024

Environment Details

  • SDV version: 1.11.0

Error Description

If many rows of a parent table have 0 or 1 children, the HMASynthesizer is not well suited to the data. Data from 0 or 1 children cannot be aggregated (e.g. a correlation cannot be computed), so HMA ends up with many NaN values, which are then filled with a mean.
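
To illustrate why parents with 0 or 1 children leave nothing to aggregate, here is a minimal pandas sketch. It is not HMASynthesizer's internal code: the toy table, the choice of standard deviation as the aggregate, and the mean-fill step are assumptions made only to mirror the description above.

```python
import pandas as pd

# Toy child table (hypothetical data): parent 0 has three children,
# parent 1 has one child, parent 2 has none, parent 3 has two.
child = pd.DataFrame({
    'parent_id': [0, 0, 0, 1, 3, 3],
    'col_B': [1.0, 4.0, 7.0, 5.0, 2.0, 6.0],
})

# Per-parent sample standard deviation of col_B (pandas uses ddof=1).
per_parent_std = (
    child.groupby('parent_id')['col_B']
    .std()
    .reindex([0, 1, 2, 3])  # include the childless parent as well
)

# One child has no sample spread and zero children means no data at all,
# so parents 1 and 2 both come out as NaN.
print(per_parent_std)

# Filling the gaps with the mean gives every such parent the same value.
filled = per_parent_std.fillna(per_parent_std.mean())
print(filled)
```

When most parents fall into the 0 or 1 child case, as in the reproduction below, the filled column is dominated by a single repeated value.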

The resulting data is hard to model. As a result, the synthetic data is often constant at a min or max value, which is what clipping out-of-bounds synthetic values produces.
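
The clipping effect can be sketched with NumPy alone. This is an illustration under stated assumptions, not SDV's sampling code: the bounds match the range of col_B in the reproduction below, and the badly centered normal distribution is a hypothetical stand-in for a poorly fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed bounds of the real column (col_B is drawn from [0, 10] below).
low, high = 0.0, 10.0

# Hypothetical output of a poorly fitted model: centered far outside the real range.
raw_samples = rng.normal(loc=25.0, scale=3.0, size=1000)

# Clipping moves every out-of-bounds value onto the nearest bound.
clipped = np.clip(raw_samples, low, high)

# Share of sampled values that ended up exactly at the maximum.
share_at_max = np.mean(clipped == high)
print(f"share of values equal to the max: {share_at_max:.2%}")
```

Because almost every raw sample lies above the upper bound, almost every clipped value equals that bound, which is the constant column described above.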

Steps to reproduce

import pandas as pd
import numpy as np

from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

parent_table = pd.DataFrame(data={
    'id': list(range(1000)),
    'col_A': list(np.random.choice(['A', 'B', 'C', 'D', 'E'], size=1000))
})

child_table_data = {
    'parent_id': [],
    'col_B': [],
    'col_C': []
}

# Give each parent 0, 1, 10, or 15 children; about 90% of parents get 0 or 1.
for parent_id in range(1000):
    num_children = int(np.random.choice([0, 1, 10, 15], p=[0.4, 0.5, 0.05, 0.05]))
    if num_children == 0:
        continue

    child_table_data['parent_id'].extend([parent_id] * num_children)
    child_table_data['col_B'].extend(np.random.uniform(low=0, high=10, size=num_children).round(2).tolist())
    child_table_data['col_C'].extend(np.random.choice(['A', 'B', 'C', 'D', 'E'], size=num_children).tolist())

child_table = pd.DataFrame(data=child_table_data)

data = {
    'parent': parent_table,
    'child': child_table
}

metadata = MultiTableMetadata.load_from_dict({
    'tables': {
        'parent': {
            'primary_key': 'id',
            'columns': {
                'id': { 'sdtype': 'id' },
                'col_A': { 'sdtype': 'categorical' }
            }
        },
        'child': {
            'columns': {
                'parent_id': { 'sdtype': 'id' },
                'col_B': { 'sdtype': 'numerical' },
                'col_C': { 'sdtype': 'categorical' }
            }
        }
    },
    'relationships': [{
        'parent_table_name': 'parent',
        'child_table_name': 'child',
        'parent_primary_key': 'id',
        'child_foreign_key': 'parent_id'
    }]
})

synth = HMASynthesizer(metadata)
synth.fit(data)
synthetic_data = synth.sample()

Observe how many values in the synthetic child table are constant:

[image]
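
One way to quantify this without the screenshot is to measure, per column, how much of the table is taken up by the single most frequent value. The helper below is a hypothetical sketch (`top_value_share` is not part of SDV), and the small table is a stand-in for the sampled child table.

```python
import pandas as pd

def top_value_share(table: pd.DataFrame) -> pd.Series:
    """Return, per column, the share of rows taken by the most frequent value."""
    # value_counts sorts by frequency (descending), so the first entry is the largest.
    return table.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Stand-in for the synthetic child table (made-up values for illustration).
example_child = pd.DataFrame({
    'col_B': [10.0, 10.0, 10.0, 10.0, 3.2],
    'col_C': ['A', 'B', 'A', 'C', 'D'],
})

shares = top_value_share(example_child)
print(shares)
```

With the reproduction above, the same helper could be applied to `synthetic_data['child']`; a share close to 1 for col_B would indicate the constant-value problem.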
npatki added the bug and data:multi-table labels on Apr 4, 2024
gsheni self-assigned this on May 30, 2024
amontanez24 added this to the 1.15.0 milestone on Jul 9, 2024