synthesize command: schema-informed synthetic data generator #235

Open
jqnatividad opened this issue Apr 4, 2022 · 9 comments
Labels
enhancement: New feature or request. Once marked with this label, it's in the backlog.
qsv pro: requires backend/cloud services

Comments

@jqnatividad
Collaborator

jqnatividad commented Apr 4, 2022

Using fake, and scanning for faker keywords in the description of a JSON Schema generated by the schema command, create more realistic fake test data.

qsv already has the generate command, but it doesn't produce realistic test data: it generates random data informed by profiling a training CSV, and because each value is generated randomly from the training profile, it's not as performant.

When enums are specified for a field, use the enums instead.
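A minimal sketch of the idea above, in Python using only the standard library. It is not qsv code: the `faker:` description prefix, the `GENERATORS` table, and the `synth_value` / `synth_row` helpers are hypothetical names invented for illustration, and the toy generators stand in for whatever a real faker library would provide.

```python
import random

# Hypothetical keyword-to-generator table. A real tool would delegate to a
# faker library; these toy lists only keep the sketch self-contained.
GENERATORS = {
    "first_name": lambda rng: rng.choice(["Ana", "Ben", "Chloe"]),
    "city": lambda rng: rng.choice(["Manila", "Newark", "Oslo"]),
}

HINT_PREFIX = "faker:"  # assumed convention, not defined in this issue


def synth_value(field_schema, rng):
    """Return one synthetic value for a single JSON Schema property."""
    # An explicit enum wins: draw only from the allowed values.
    if "enum" in field_schema:
        return rng.choice(field_schema["enum"])
    # Otherwise scan the description for an assumed "faker:<keyword>" hint.
    for token in field_schema.get("description", "").split():
        if token.startswith(HINT_PREFIX):
            generator = GENERATORS.get(token[len(HINT_PREFIX):])
            if generator is not None:
                return generator(rng)
    # Fall back on the declared type when no hint matches.
    if field_schema.get("type") == "integer":
        return rng.randint(0, 100)
    return ""


def synth_row(schema, rng):
    """Build one synthetic record from the schema's properties."""
    return {name: synth_value(prop, rng)
            for name, prop in schema["properties"].items()}


schema = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed"]},
        "city": {"type": "string", "description": "Home city faker:city"},
        "age": {"type": "integer"},
    }
}

row = synth_row(schema, random.Random(42))
print(row)
```

Checking `enum` before the description hint mirrors the rule stated above: when a field lists its allowed values, those are used instead of a generated value.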

jqnatividad added the enhancement label Apr 4, 2022
@github-actions

github-actions bot commented Jun 8, 2022

Stale issue message

@jqnatividad
Collaborator Author

jqnatividad commented Apr 7, 2023

#902 sets the stage to revisit this. Once fake is done, we can then remove the generate command, which is just not performant enough and tends to generate gobbledygook test data anyway.

@github-actions

github-actions bot commented Jun 7, 2023

Stale issue message

github-actions bot closed this as not planned Jun 14, 2023
@jqnatividad
Collaborator Author

We should also use frequency when generating "fake" data so that it mirrors the training data more closely.
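One way to read this suggestion, sketched in Python with the standard library only: sample each synthetic value in proportion to how often it appears in the training column. The `frequency_sampler` helper and the sample data are hypothetical, and the sketch assumes the value counts are computed in memory rather than read from any particular qsv output.

```python
import random
from collections import Counter


def frequency_sampler(training_values, rng):
    """Return a function that draws values weighted by training frequency."""
    counts = Counter(training_values)
    values = list(counts)
    weights = [counts[v] for v in values]

    def sample(n):
        # random.choices draws with replacement, honoring relative weights.
        return rng.choices(values, weights=weights, k=n)

    return sample


# Hypothetical training column: "NY" is nine times as common as "NJ".
training = ["NY"] * 9 + ["NJ"]
sample = frequency_sampler(training, random.Random(0))
synthetic = sample(1000)
print(Counter(synthetic))
```

Values that never occur in the training data are never produced, so the synthetic column keeps both the training column's vocabulary and, approximately, its distribution.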

@github-actions

github-actions bot commented Sep 3, 2023

Stale issue message

@jqnatividad
Collaborator Author

scheduling for 0.117.0 release


github-actions bot commented Dec 2, 2023

Stale issue message

github-actions bot closed this as not planned Dec 9, 2023
jqnatividad added the qsv pro label Jan 5, 2024
@jqnatividad
Collaborator Author

Removing the generate command even before fake is done, as generate is unmaintained and has old dependencies weighing down qsv.

jqnatividad reopened this Jan 5, 2024
jqnatividad changed the title from "fake command: schema-informed fake data generator" to "synthesize command: schema-informed synthetic data generator" Jan 17, 2024
@jqnatividad
Collaborator Author

Rather than fake, with all its negative connotations, name the command synthesize...
