Inequality constraint cannot be applied to compare datetime to date #2275

Closed
npatki opened this issue Nov 1, 2024 · 1 comment · Fixed by #2293

npatki commented Nov 1, 2024

Environment Details

  • SDV version: 1.17.1 (latest)

Error Description

This bug was first noticed by a Slack user.

In my table, I have a datetime column (submission_timestamp) and a date column (due_date). I want to synthesize data with an Inequality constraint showing that submission_timestamp <= due_date.

However, I am unable to apply an Inequality constraint to this data; SDV complains that the data violates the constraint.

The problem is that the date column carries no time-of-day information:

  • SDV (and Python in general) assumes that a due date such as 2016-10-12 is referring to the beginning of the day (2016-10-12 00:00:00)
  • However, in my dataset, a due date of 2016-10-12 is referring to the end of the day (2016-10-12 23:59:59) because it is OK to make a submission at any time on that day. This is why the inequality submission_timestamp <= due_date should be true.
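
The mismatch can be seen without SDV. The following is a minimal sketch using only the standard library's datetime.strptime; the submission time of 17:04:00 is an illustrative value, not taken from a real row.

```python
from datetime import datetime

# An illustrative submission made during the day of the due date.
submission = datetime.strptime('2016-10-12 17:04:00', '%Y-%m-%d %H:%M:%S')

# A date-only string is parsed as midnight at the start of that day.
due = datetime.strptime('2016-10-12', '%Y-%m-%d')
print(due)  # 2016-10-12 00:00:00

# So a same-day submission compares as later than the due date.
print(submission <= due)  # False
```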

Expected Behavior

I expect that when strict_boundaries=False with a date column, the assumed timestamp allows for the loosest possible interpretation:

  • If a date is a high column, we should assume it is referring to the end of day
  • If a date is a low column, we should assume it is referring to the beginning of day

The opposite should be assumed when strict_boundaries=True.
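
The rule above could be sketched as follows. This is only an illustration of the requested behavior, not SDV's implementation: resolve_date and its role argument are hypothetical names, and "end of day" is assumed to mean datetime.time.max (23:59:59.999999).

```python
from datetime import date, datetime, time


def resolve_date(day, role, strict_boundaries=False):
    """Hypothetical helper: choose the timestamp a date-only value stands for.

    role is 'low' or 'high', matching the constraint's low/high columns.
    """
    if role not in ('low', 'high'):
        raise ValueError("role must be 'low' or 'high'")
    # Loosest reading (strict_boundaries=False): high -> end of day, low -> start of day.
    # With strict_boundaries=True the choice is reversed.
    use_end_of_day = (role == 'high') != strict_boundaries
    return datetime.combine(day, time.max if use_end_of_day else time.min)


submission = datetime(2016, 7, 10, 17, 4, 0)
due_day = date(2016, 7, 10)

# Non-strict: the due date is read as the end of the day, so the row is valid.
print(submission <= resolve_date(due_day, 'high'))  # True

# Strict: the due date is read as the start of the day, so the row is invalid.
print(submission < resolve_date(due_day, 'high', strict_boundaries=True))  # False
```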

Steps to reproduce

import pandas as pd

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'SUBMISSION_TIMESTAMP': [
        '2016-07-10 17:04:00',
        '2016-07-11 13:23:00',
        '2016-07-12 08:45:30',
        '2016-07-11 12:00:00',
        '2016-07-12 10:30:00',
    ],
    'DUE_DATE': ['2016-07-10', '2016-07-11', '2016-07-12', '2016-07-13', '2016-07-14'],
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'SUBMISSION_TIMESTAMP': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d %H:%M:%S' },
                'DUE_DATE': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'}
            }
        }
    }
})
synthesizer = GaussianCopulaSynthesizer(metadata)

constraint = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'SUBMISSION_TIMESTAMP',
        'high_column_name': 'DUE_DATE',
        'strict_boundaries': False
    }
}

synthesizer.add_constraints([constraint])

synthesizer.fit(data)
synthesizer.sample(num_rows=10)

ConstraintsNotMetError: 
Data is not valid for the 'Inequality' constraint:
  SUBMISSION_TIMESTAMP    DUE_DATE
0  2016-07-10 17:04:00  2016-07-10
1  2016-07-11 13:23:00  2016-07-11
2  2016-07-12 08:45:30  2016-07-12
@npatki npatki added bug Something isn't working feature:constraints Related to inputting rules or business logic labels Nov 1, 2024

npatki commented Nov 1, 2024

Workaround

For any users encountering this: one workaround is to add 1 day to the date column for the purposes of SDV modeling. After creating the synthetic data, subtract 1 day from that column to restore the original meaning.

import pandas as pd

data_copy = data.copy()

# add 1 day to each value in the high column and save it back in the original format
data_copy['DUE_DATE'] = pd.to_datetime(data_copy['DUE_DATE']) + pd.DateOffset(1)
data_copy['DUE_DATE'] = data_copy['DUE_DATE'].dt.strftime('%Y-%m-%d')

# now fit and sample as usual
synthesizer = GaussianCopulaSynthesizer(metadata)

constraint = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'SUBMISSION_TIMESTAMP',
        'high_column_name': 'DUE_DATE',
        'strict_boundaries': False
    }
}

synthesizer.add_constraints([constraint])
synthesizer.fit(data_copy)
synthetic_data = synthesizer.sample(num_rows=5)

# finally subtract 1 day from the high column and save it back in the original format
synthetic_data['DUE_DATE'] = pd.to_datetime(synthetic_data['DUE_DATE']) - pd.DateOffset(1)
synthetic_data['DUE_DATE'] = synthetic_data['DUE_DATE'].dt.strftime('%Y-%m-%d')
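
After shifting the dates back, it may be worth checking that every row still means "submitted on or before the due date". The shifted constraint also accepts a submission at exactly 00:00:00 on the day after the due date, which the intended rule does not. The sketch below is one way to run that check; submitted_on_or_before_due_date is a hypothetical helper, and the two-row sample is made up for illustration.

```python
import pandas as pd


def submitted_on_or_before_due_date(df, low='SUBMISSION_TIMESTAMP', high='DUE_DATE'):
    """Hypothetical check: compare calendar days, ignoring the time of day."""
    # Truncate each submission timestamp to midnight of its own day.
    submitted_day = pd.to_datetime(df[low]).dt.normalize()
    due_day = pd.to_datetime(df[high])
    return submitted_day <= due_day


sample = pd.DataFrame({
    'SUBMISSION_TIMESTAMP': ['2016-07-10 17:04:00', '2016-07-12 08:45:30'],
    'DUE_DATE': ['2016-07-10', '2016-07-11'],
})

result = submitted_on_or_before_due_date(sample)
print(result.tolist())  # [True, False]
```

The same function could be applied to synthetic_data from the workaround, for example with submitted_on_or_before_due_date(synthetic_data).all().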

@npatki npatki changed the title from "Inequality constraint cannot be applied to compare datetime to date (end-of-day)" to "Inequality constraint cannot be applied to compare datetime to date" Nov 1, 2024
@pvk-developer pvk-developer self-assigned this Nov 13, 2024
@amontanez24 amontanez24 added this to the 1.17.2 milestone Nov 15, 2024