ENH/PERF: Add cache='infer' to to_datetime #18255
Comments
For me the main question is: what would be the cost of inferring on a typical all-unique-strings case compared to parsing it? So that would give quite a slowdown in that case. Or can this inferring step be done more performantly?
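One way to keep the inference step cheap on large, mostly-unique inputs is to estimate the duplicate share from a small random sample instead of hashing every element. A minimal sketch of that idea (the function name, sample size, and seed are illustrative choices, not pandas API):

```python
import random

def estimate_unique_share(values, sample_size=1000, seed=0):
    """Estimate the fraction of unique values from a random sample.

    A full len(set(values)) / len(values) pass touches every element; on a
    large, mostly-unique input that cost is pure overhead.  Sampling bounds
    the inference cost at O(sample_size) regardless of input length.
    """
    if len(values) <= sample_size:
        sample = values
    else:
        rng = random.Random(seed)
        sample = rng.sample(values, sample_size)
    return len(set(sample)) / len(sample)

# All-unique input: the sample is all unique too, so the share stays at 1.0.
unique_strings = [f"2017-01-{d:02d}T{h:02d}:00"
                  for d in range(1, 29) for h in range(24)]
print(estimate_unique_share(unique_strings))

# Heavily duplicated input: only 2 distinct values, so the share collapses.
duplicated = ["2017-11-13"] * 10_000 + ["2017-11-14"] * 10_000
print(estimate_unique_share(duplicated))
```

The estimate is noisy for inputs with rare duplicates, but for the all-unique worst case discussed above it avoids paying a full hashing pass before deciding not to cache.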
A good heuristic here: N1 ~ 50000. You already did some work on figuring out the experimental data, so you could just set it based on those parameters.
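Combining a size cutoff around that N ~ 50000 figure with a sampled duplicate-share cutoff could look like the sketch below. The function name and both threshold values are hypothetical placeholders standing in for constants that would be tuned against benchmark data:

```python
def should_cache(values, check_count=50_000, unique_share=0.7):
    """Decide whether building a date cache is likely to pay off.

    Heuristic: below `check_count` elements the cache bookkeeping is not
    worth it; above it, cache only when a cheap sample suggests enough
    duplicates (estimated unique share below `unique_share`).
    """
    n = len(values)
    if n < check_count:
        return False
    # Strided sample of roughly 1000 elements keeps the check O(1)-ish.
    sample = values[:: max(n // 1000, 1)]
    return len(set(sample)) / len(sample) < unique_share

print(should_cache(["2017-11-13"] * 100))                   # too small -> False
print(should_cache(["2017-11-13", "2017-11-14"] * 50_000))  # big + duplicated -> True
print(should_cache([str(i) for i in range(60_000)]))        # big + unique -> False
```

A strided sample is deterministic and cache-friendly, but it can be fooled by periodic data; a randomized sample trades that risk for a dependency on a seed.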
@mroeschke
There's a bit more. Currently …
I just updated pandas to 0.25.1, and noticed this new behaviour. I would much prefer an option to override the 'infer', since the current behaviour causes a dramatic slowdown in my specific case, from less than 0.5 seconds to almost 2 minutes.
Looking back on this issue, I am not convinced this feature is a good idea. I think pandas should aim to be "less smart" in general, and developing constant heuristics on when to use the cache may not be realistic given changes in performance in other operations. Going to close this issue out, but happy to reopen if there is a resurgence of interest.
xref PR #17077
Now that a `cache` keyword has been added to `to_datetime`, ideally the default should be set to `cache='infer'`, which would inspect the input data to determine whether caching would be a more efficient conversion.

From some research (here and here), date strings, especially ones with timezone offsets, can benefit from conversion with a cache of dates. The rules of thumb for whether to convert with a cache should be based on a combination of input data type, proportion of duplicate values, and number of dates to convert.

Additionally, it would be nice to resolve existing `to_datetime` performance issues (e.g. #17410) so that the rules of thumb informing the inference step are not misguided by these issues.