Add warning when user tries to train a model using datetime values without a datetime_format set #1847

Closed
srinify opened this issue Mar 11, 2024 · 1 comment · Fixed by #1897
Labels
feature request Request for a new feature


srinify commented Mar 11, 2024

Situation

When you read data into a pandas DataFrame, datetime columns are often assigned the object (i.e. string) dtype by default. The user must then take one of the following actions for SDV to synthesize better datetime values:

  • user can manually cast the column to datetime dtype
  • user can set a datetime_format when creating the metadata for SDV to use

It's easy for someone (especially a new user) to skip this step entirely, causing issues later.
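
To illustrate the situation, here is a minimal sketch of the first option (casting manually). The CSV contents and column names are made up for illustration. The second option is only shown as a comment, on the assumption that `update_column` accepts a `datetime_format` argument as the suggested warning text implies.

```python
import io

import pandas as pd

# Hypothetical data: dates arrive as plain text in a CSV file.
csv_text = "user_id,start_date\n1,2024-03-11\n2,2024-03-12\n"
data = pd.read_csv(io.StringIO(csv_text))

# Without parse_dates, pandas does not infer a datetime dtype here;
# the column is stored as strings (typically the 'object' dtype).
was_datetime = pd.api.types.is_datetime64_any_dtype(data["start_date"])

# Option 1: manually cast the column to a datetime dtype.
data["start_date"] = pd.to_datetime(data["start_date"], format="%Y-%m-%d")
is_datetime = pd.api.types.is_datetime64_any_dtype(data["start_date"])

# Option 2 (assumed call shape, not executed here): keep the strings and
# record the format in the metadata instead, e.g.
#   metadata.update_column(column_name="start_date", sdtype="datetime",
#                          datetime_format="%Y-%m-%d")
```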

Suggested Warning

After discussing with Neha, I'm opening this feature request. Ideally, we can add a warning when the user tries to make progress with SDV (e.g. maybe training a synthesizer) that they should add a datetime_format.

npatki commented Mar 11, 2024

Note that metadata auto-detection will generally pick up a datetime_format for most common cases. But there are other ways of creating metadata, for example manually writing a Python dict or JSON file. In such cases, there may not be a datetime_format.

This warning should only appear when:

  • In metadata, the sdtype is 'datetime' AND
  • In metadata, there is no datetime_format specified AND
  • In the data, the dtype (storage type) is 'object'
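
The three conditions above can be sketched as a small check. This is only an illustration, not SDV's implementation: the function name is hypothetical, the metadata is assumed to be a plain dict of column specs, and the dtypes are assumed to be passed as strings (for example from `data.dtypes.astype(str).to_dict()`).

```python
def columns_needing_datetime_format(metadata_columns, data_dtypes):
    """Return the column names that meet all three warning conditions."""
    flagged = []
    for name, spec in metadata_columns.items():
        is_datetime = spec.get("sdtype") == "datetime"
        has_format = spec.get("datetime_format") is not None
        stored_as_object = data_dtypes.get(name) == "object"
        if is_datetime and not has_format and stored_as_object:
            flagged.append(name)
    return flagged


# Hypothetical metadata written by hand, so some formats are missing.
metadata_columns = {
    "user_id": {"sdtype": "id"},
    "start_date": {"sdtype": "datetime"},
    "end_date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
    "timestamp": {"sdtype": "datetime"},
}
# Hypothetical storage types of the same columns in the data.
data_dtypes = {
    "user_id": "int64",
    "start_date": "object",
    "end_date": "object",
    "timestamp": "datetime64[ns]",
}

flagged = columns_needing_datetime_format(metadata_columns, data_dtypes)
```

Here only `start_date` is flagged: `end_date` already has a format, and `timestamp` is already stored as a datetime dtype.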

Suggested API

This warning should be raised by the metadata.validate_data function. Ideally, the warning pretty-prints a list of the affected columns.

>>> metadata.validate_data(data)
Warning: No 'datetime_format' is present in the metadata for the following columns:

Table Name    Column Name    sdtype    datetime_format
users         start_date     datetime  None
users         end_date       datetime  None
sessions      timestamp      datetime  None

Without this specification, SDV may not be able to accurately parse the data. We recommend adding datetime formats using 'update_column'.

Note: For single-table metadata, we can omit the Table Name column.
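
One possible way to build the message is sketched below. The helper name and its `single_table` flag are hypothetical. The wording follows the suggested output above, and the standard `warnings` module is assumed as the delivery mechanism.

```python
import warnings


def format_datetime_warning(rows, single_table=False):
    """Build the warning text from (table_name, column_name) pairs."""
    headers = ["Table Name", "Column Name", "sdtype", "datetime_format"]
    table = [[table_name, column, "datetime", "None"] for table_name, column in rows]
    if single_table:
        # Drop the first column when there is only one table.
        headers = headers[1:]
        table = [row[1:] for row in table]
    # Pad every cell to the widest entry in its column.
    widths = [max(len(cell) for cell in col) for col in zip(headers, *table)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in [headers] + table
    ]
    return (
        "No 'datetime_format' is present in the metadata for the following columns:\n\n"
        + "\n".join(lines)
        + "\n\nWithout this specification, SDV may not be able to accurately parse "
        "the data. We recommend adding datetime formats using 'update_column'."
    )


rows = [("users", "start_date"), ("users", "end_date"), ("sessions", "timestamp")]
message = format_datetime_warning(rows)
single_table_message = format_datetime_warning(rows, single_table=True)

# Emit as a regular Python warning so users can filter or capture it.
warnings.warn(message, UserWarning)
```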
