Sub-100% Data Validity #1899

Closed
prupireddy opened this issue Apr 5, 2024 · 6 comments
Labels
bug Something isn't working resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@prupireddy

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.11.0
  • Python version: 3.11.7
  • Operating System: Windows 10 Enterprise

Error Description

I have a PAR model running on a health dataset. I noticed that my Data Validity score drops below 100% when I include a Days_Supplied feature. The website said to contact you if that occurs.
DataValidityIssue.xlsx

Steps to reproduce

I've attached an Excel file that has the true values on the left and the synthetic values on the right. For privacy reasons, I cannot send the full data and code.

@prupireddy prupireddy added bug Something isn't working new Automatic label applied to new issues labels Apr 5, 2024
@srinify
Contributor

srinify commented Apr 12, 2024

Hi there @prupireddy, it looks like the score isn't 100% because the PARSynthesizer model isn't adhering to the min and max values of the column in your original dataset.
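
As a rough illustration of what a validity score like this measures, here is a minimal sketch, assuming the score is the fraction of synthetic values that stay inside the real column's range (for a numerical column) or that also occur in the real column (for a categorical column). The function names and sample values are hypothetical; this is not the SDV or SDMetrics implementation.

```python
def boundary_adherence(real, synthetic):
    """Fraction of synthetic values inside the real column's [min, max]."""
    lo, hi = min(real), max(real)
    inside = [lo <= v <= hi for v in synthetic]
    return sum(inside) / len(inside)


def category_adherence(real, synthetic):
    """Fraction of synthetic values that also occur in the real column."""
    seen = set(real)
    return sum(v in seen for v in synthetic) / len(synthetic)


# Hypothetical values, not taken from the attached spreadsheet.
real_days = [30.0, 30.0, 60.0, 90.0]
synthetic_days = [30.0, 45.0, 90.0, 120.0]

print(boundary_adherence(real_days, synthetic_days))  # 0.75
print(category_adherence(real_days, synthetic_days))  # 0.5
```

With these made-up values, one synthetic value falls outside the real range and two never appear in the real column, so either check lands below 1.0.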

I have 2 questions:

  1. When you created your SingleTableMetadata object, what sdtype was detected or assigned? You can run print(your_single_table_metadata_object) on your machine to look this up and just tell me the one for this Days_Supplied column. I'm also curious what pandas DataFrame dtype this column has (int or float).

  2. Did you set the enforce_min_max_values parameter when defining your PARSynthesizer object? Or did you skip this and let the model use the default value?
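
For anyone answering the first question, the lookup could be sketched as below. This assumes the metadata has been exported to a plain dictionary (for example with metadata.to_dict()) and that each column entry carries an "sdtype" key. The helper name, the extra column names, and the sample rows are hypothetical. With pandas, the dtype itself can be read from your_dataframe['Days_Supplied'].dtype.

```python
# Assumed shape of an exported SDV 1.x metadata dictionary; the columns
# other than Days_Supplied are made up for illustration.
metadata_dict = {
    "columns": {
        "Patient_ID": {"sdtype": "id"},
        "Fill_Date": {"sdtype": "datetime"},
        "Days_Supplied": {"sdtype": "categorical"},
    }
}


def describe_column(metadata_dict, rows, column):
    """Return the sdtype from the metadata and the Python types seen in the data."""
    sdtype = metadata_dict["columns"][column]["sdtype"]
    value_types = sorted({type(row[column]).__name__ for row in rows})
    return sdtype, value_types


# Hypothetical rows standing in for the real table.
rows = [{"Days_Supplied": 30.0}, {"Days_Supplied": 90.0}]
print(describe_column(metadata_dict, rows, "Days_Supplied"))
# ('categorical', ['float'])
```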

@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Apr 12, 2024
@prupireddy
Author

  1. Categorical was detected (I didn't assign). The pandas dtype is float64.
  2. No I did not; it used the default.

Thank you

@npatki
Contributor

npatki commented Apr 17, 2024

Hi @prupireddy and @srinify, there is currently a known issue specific to PARSynthesizer when a column is categorical but represented in a float format. I wonder if this is the root cause? See #1910.

I would start by confirming whether this column (Days_Supplied) was correctly detected in the metadata, as the detection is not guaranteed to be 100% accurate. Does this column truly represent discrete categories or is it numerical? To help you decide, see this sdtypes reference.

  1. If it's supposed to be numerical, please update your metadata and try with the updated version. There are currently no known bugs in PARSynthesizer for numerical data.
  2. If it's supposed to be categorical, then you can try the workaround I've listed in PAR DiagnosticReport not 1.0 with float categorical columns #1910.
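
For the first option, SDV 1.x metadata objects provide update_column(column_name=..., sdtype=...) to override a detected sdtype. The sketch below only mimics that change on a plain dictionary so that it runs without SDV installed. The helper set_sdtype and the dictionary layout are assumptions for illustration, not part of the library.

```python
import copy


def set_sdtype(metadata_dict, column, sdtype):
    """Return a copy of the metadata dict with one column's sdtype replaced."""
    if column not in metadata_dict["columns"]:
        raise KeyError(f"unknown column: {column}")
    updated = copy.deepcopy(metadata_dict)
    updated["columns"][column] = {"sdtype": sdtype}
    return updated


# Hypothetical detection result for the column discussed in this issue.
detected = {"columns": {"Days_Supplied": {"sdtype": "categorical"}}}
fixed = set_sdtype(detected, "Days_Supplied", "numerical")

print(fixed["columns"]["Days_Supplied"])     # {'sdtype': 'numerical'}
print(detected["columns"]["Days_Supplied"])  # {'sdtype': 'categorical'}
```

Returning a copy keeps the detected metadata available for comparison. The synthesizer would then need to be created and fitted again with the corrected metadata.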

@npatki
Contributor

npatki commented Apr 25, 2024

Hi @prupireddy I noticed you closed the issue. Does that mean you were able to come up with a resolution?

For our knowledge (and perhaps to help others running into the same problem), could you clarify what the issue was?

@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Apr 25, 2024
@prupireddy
Author

prupireddy commented Apr 25, 2024

Since the true data type was supposed to be numerical, I followed your first suggestion and hardcoded the sdtype to numerical instead of relying on automatic detection, which had given categorical. This resolved the issue.

@npatki
Contributor

npatki commented Apr 25, 2024

Great. Appreciate the confirmation!

3 participants