infer: new command to infer additional dataset metadata based on summary stats/frequency table #2184

Open
jqnatividad opened this issue Oct 1, 2024 · 2 comments
Labels
datapusher+ (for Datapusher+); enhancement (New feature or request. Once marked with this label, it's in the backlog.); qsv pro (requires backend/cloud services)

Comments

@jqnatividad
Owner

jqnatividad commented Oct 1, 2024

Date/Datetime formats

  • when --infer-date is enabled, set format to the date format that was detected
  • for all Date/Datetime columns, on the last pass right before displaying results, check min, max, median and modes to see if they match one of the 19 date formats recognized by qsv-dateparser
  • if multiple formats are detected, set format to "multi_date", "multi_datetime", or "multi_date_datetime"
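
The check described above could be sketched roughly as follows. This is an illustration only: the `CANDIDATE_FORMATS` table, the `strptime` patterns and the function names are assumptions made for the example, not qsv-dateparser's actual format list or API.

```python
from datetime import datetime

# Hypothetical subset of formats. These strptime patterns are stand-ins,
# not the formats qsv-dateparser actually recognizes.
CANDIDATE_FORMATS = {
    "%Y-%m-%d": "date",
    "%m/%d/%Y": "date",
    "%Y-%m-%d %H:%M:%S": "datetime",
    "%Y-%m-%dT%H:%M:%S": "datetime",
}


def match_format(value):
    """Return the first candidate pattern that parses value, or None."""
    for pattern in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, pattern)
            return pattern
        except ValueError:
            continue
    return None


def infer_date_format(stat_values):
    """Infer a format label from summary-stat values (min, max, median, modes)."""
    matched = {match_format(v) for v in stat_values}
    if not matched or None in matched:
        return None  # no values, or at least one value matched no candidate
    if len(matched) == 1:
        return next(iter(matched))  # a single consistent format
    kinds = {CANDIDATE_FORMATS[p] for p in matched}
    if kinds == {"date"}:
        return "multi_date"
    if kinds == {"datetime"}:
        return "multi_datetime"
    return "multi_date_datetime"
```

Because only a handful of summary values are inspected, this stays cheap even for very large files, at the cost of possibly missing formats that appear only in the middle of the distribution.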

Location

  • add --infer-location flag
  • for all String columns, on the last pass right before displaying results, it will check min, max, median and modes to see if they match common location formats - https://www.maptools.com/tutorials/lat_lon/formats
  • for Float columns:
    • if the range is between -90 and 90, infer latitude format
    • if the range is between -180 and 180, infer longitude format
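
A minimal sketch of the Float range check, using the column's min and max from the summary stats. The function name and return labels are hypothetical. It also assumes the latitude test runs first, since any range inside -90 to 90 also lies inside -180 to 180; the issue does not say how that overlap should be resolved.

```python
def infer_coordinate_format(min_value, max_value):
    """Classify a Float column from its min/max summary stats.

    Assumption: latitude is tested first, because every latitude-sized
    range would also pass the longitude test.
    """
    if -90.0 <= min_value and max_value <= 90.0:
        return "latitude"
    if -180.0 <= min_value and max_value <= 180.0:
        return "longitude"
    return None  # values fall outside any valid coordinate range
```

A longitude column whose values happen to stay within -90 to 90 would be labeled latitude under this ordering, so the column name or a sampled check may be needed to break ties.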

Email

  • add --infer-email flag
  • for all String columns, on the last pass right before displaying results, it will check min, max, median and modes to see if they match common email formats using the email_address crate
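
As a rough illustration of the email check, the sketch below tests every summary-stat value against a pattern. The regular expression is a deliberately loose stand-in chosen for this example; it is not the validation performed by the email_address crate, and the function name is hypothetical.

```python
import re

# Loose stand-in pattern: something@something.something, with no spaces.
# Assumption for illustration; real validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email_column(stat_values):
    """Return True if every non-empty summary-stat value looks like an email."""
    values = [v for v in stat_values if v]
    return bool(values) and all(EMAIL_RE.match(v) for v in values)
```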

Using the same approach above (looking at summary stats min, max, median, modes), also infer:

  • Hostnames with --infer-hostnames option
  • IP addresses with --infer-ipaddress option, for both ipv4 and ipv6 formats
  • Phone numbers with --infer-phoneno option
  • Currency with --infer-currency option, adding currency symbol metadata to the format entry - e.g. "currency - USD ($)", "currency - JPY (¥)", "currency - PHP (₱)", "currency - ? ($)", etc.
    Because some currency symbols, such as $, are used in several countries, it will use "?" in place of the three-letter ISO 4217 code when it cannot infer the code.
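
Two of these checks can be sketched with little more than the standard library. The function names, the "ipv4_ipv6" label and the small symbol table below are assumptions made for illustration; the issue does not specify them.

```python
import ipaddress


def infer_ip_format(stat_values):
    """Return "ipv4", "ipv6", or a mixed label if every value parses as an IP."""
    versions = set()
    for value in stat_values:
        try:
            versions.add(ipaddress.ip_address(value).version)
        except ValueError:
            return None  # one non-IP value rules the column out
    if versions == {4}:
        return "ipv4"
    if versions == {6}:
        return "ipv6"
    if versions == {4, 6}:
        return "ipv4_ipv6"  # hypothetical label for mixed columns
    return None  # no values were supplied


# Hypothetical lookup of symbols treated as unambiguous in this sketch.
UNAMBIGUOUS_SYMBOLS = {"€": "EUR", "₱": "PHP"}


def currency_label(symbol):
    """Build the format entry, using "?" when the ISO 4217 code is unknown."""
    code = UNAMBIGUOUS_SYMBOLS.get(symbol, "?")
    return f"currency - {code} ({symbol})"
```

For example, `currency_label("$")` yields `currency - ? ($)`, matching the fallback described above.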

Also add -F, --infer-all-formats convenience option.

If a CSV is indexed and the --format-sample <sample_size> option is used, randomly sample the CSV to verify that the format inferred from the summary stats is correct.
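
The sampling step might look something like the sketch below. It uses an in-memory list as a stand-in for random access into an indexed CSV, and the function name, parameters and the idea of returning a match ratio are all assumptions rather than anything the issue specifies.

```python
import random


def verify_by_sampling(values, matches_format, sample_size, seed=None):
    """Return the fraction of sampled non-empty values that match the format.

    values: column values (stand-in for random access via the CSV index)
    matches_format: predicate for the format inferred from summary stats
    """
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    sample = rng.sample(values, k)
    non_empty = [v for v in sample if v != ""]
    if not non_empty:
        return 0.0  # nothing to verify against
    hits = sum(1 for v in non_empty if matches_format(v))
    return hits / len(non_empty)
```

A caller could then accept the inferred format only when the ratio clears some threshold, which would be another tunable left open here.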

@jqnatividad jqnatividad added the enhancement label (New feature or request. Once marked with this label, it's in the backlog.) Oct 1, 2024
@jqnatividad
Owner Author

jqnatividad commented Oct 1, 2024

For qsv pro, add the option to infer custom formats using Luau or Python scripts.
These scripts will have the added ability to look up reference data maintained in https://data.dathere.com (e.g. ISO code tables, congressional district, school district, Census GEOID, etc.), other CKAN instances, and internal databases/data sources.

@jqnatividad jqnatividad added the qsv pro requires backend/cloud services label Oct 1, 2024
@jqnatividad jqnatividad changed the title from "stats: add format inferencing" to "infer: new command to infer additional dataset characteristics based on summary stats/frequency table" Oct 3, 2024
@jqnatividad
Owner Author

jqnatividad commented Oct 3, 2024

Make this a new "smart" command instead, so we don't overload stats with options.

Still, the format command will just add a format column to stats's output or the stats cache.

@jqnatividad jqnatividad changed the title from "infer: new command to infer additional dataset characteristics based on summary stats/frequency table" to "infer: new command to infer additional dataset metadata based on summary stats/frequency table" Oct 16, 2024
@jqnatividad jqnatividad pinned this issue Oct 16, 2024
@jqnatividad jqnatividad added the datapusher+ for Datapusher+ label Oct 21, 2024
@jqnatividad jqnatividad unpinned this issue Nov 7, 2024