With `qsv stats` we collect descriptive statistics when we infer each column's data type during the Analysis phase of a DP+ job.

For example, using the benchmark data from qsv based on a 1M row, 512 MB, 41 column sample of NYC's 311 data, the command:

yields the file below in 0.27 seconds:
Adding the `--everything` and `--infer-dates` options...

yields the file below in 103.89 seconds. That is roughly 385 times slower, more than two orders of magnitude!
while the command:

yields the file below in only 3.60 seconds. The only difference is that we didn't use the `--infer-dates` option, so date fields and their min/max values are treated as `String`s.

Clearly, `--infer-dates` is a very expensive operation, and understandably so, since qsv's date parser engine has to parse and recognize 15 different date formats, with each format having several permutations.

Currently, DP+ uses the `--infer-dates` option during its analysis phase, which is something I'd still like to keep, as it's very useful when it does infer that a column is a date field.

Perhaps we should only attempt to infer dates when a quick initial scan of the CSV headers suggests the presence of a date field (i.e., search for "date", "time", "timestamp", or "datetime" anywhere in a column name)?
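A minimal sketch of that header scan, in Python. The function names (`read_header`, `date_like_columns`, `build_stats_args`) and the sample file name are hypothetical and not part of DP+ or qsv; the only qsv flags used are the two named above, and the real DP+ invocation may pass other options.

```python
import csv
import io

# Keywords from the proposal above. "timestamp" and "datetime" already
# contain "time" and "date"; they are listed only to mirror the proposal.
DATE_HINTS = ("date", "time", "timestamp", "datetime")


def read_header(stream):
    """Read only the first CSV row, so the scan cost does not depend on file size."""
    return next(csv.reader(stream), [])


def date_like_columns(header):
    """Return column names containing any hint, matched case-insensitively."""
    return [
        name for name in header
        if any(hint in name.lower() for hint in DATE_HINTS)
    ]


def build_stats_args(header, csv_path):
    """Assemble a hypothetical qsv stats argument list for the analysis phase.

    --infer-dates is appended only when the header suggests a date column.
    """
    args = ["qsv", "stats", "--everything"]
    if date_like_columns(header):
        args.append("--infer-dates")
    args.append(csv_path)
    return args


# Usage with an in-memory sample; the file name is a placeholder.
sample = io.StringIO("Unique Key,Created Date,Closed Date,Agency\n")
header = read_header(sample)
print(date_like_columns(header))
print(build_stats_args(header, "sample.csv"))
```

Substring matching is deliberately loose: a name such as "Candidate" or "Last Updated By" also matches "date". A false positive only means paying the slow path when it was not needed, whereas a date column with an unrelated name (say, "DOB") would be missed and reported as a `String`.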