ENH/PERF: Add cache='infer' to to_datetime #18255

Closed
mroeschke opened this issue Nov 13, 2017 · 6 comments
Labels: Datetime, Performance

Comments

@mroeschke
Member

xref PR #17077

Now that a cache keyword has been added to to_datetime, ideally the default should be cache='infer', which would inspect the input data to determine whether caching would make the conversion more efficient.

From some research (here and here), date strings, especially ones with timezone offsets, can benefit from conversion with a cache of dates. The rule of thumb for whether to convert with a cache should be based on a combination of the input data type, the proportion of duplicate values, and the number of dates to convert.

Additionally, it'd be nice to resolve existing to_datetime performance issues (e.g. #17410) so that the rules of thumb informing the inference step are not skewed by them.
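
For reference, a minimal sketch of what the existing boolean cache keyword does (the sample data here is made up for illustration):

import pandas as pd

# Illustrative input: heavily duplicated date strings.
strings = pd.Series(["2017-11-13 00:00:00", "2017-11-14 00:00:00"] * 50000)

# cache=False parses every element; cache=True parses only the unique
# values and maps the parsed results back onto the full input.
result = pd.to_datetime(strings, cache=True)
assert result.equals(pd.to_datetime(strings, cache=False))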

@mroeschke changed the title from "ENH: Add cache='infer' to to_datetime" to "ENH/PERF: Add cache='infer' to to_datetime" on Nov 13, 2017
@jorisvandenbossche
Member

For me the main question is: what would the cost of inferring be on a typical all-unique-strings case, compared to just parsing it?
I suppose that to know the number of unique values, we would need to call unique?

In [58]: idx = pd.date_range("1990-01-01", periods=100000, freq='H')

In [60]: idx_string = idx.strftime('%Y-%m-%d %H:%M:%S')

In [62]: %timeit pd.to_datetime(idx_string)
31.8 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: %timeit pd.unique(idx_string)
26.3 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So that would give quite a slowdown in that case. Or can this inferring step be done more performantly?
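
One possibility (a sketch, not anything pandas currently does): hash only a fixed-size prefix, so the cost of inferring is independent of the array length. The sample size of 500 below is an arbitrary illustrative choice:

# Reusing idx_string from above: check duplication on a prefix only,
# making the inference step O(sample) rather than O(n).
sample_ratio = len(pd.unique(idx_string[:500])) / 500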

@jreback added the Difficulty Intermediate, Performance, and Datetime labels on Nov 13, 2017
@jreback added this to the Next Major Release milestone on Nov 13, 2017
@jreback
Contributor

jreback commented Nov 13, 2017

a good heuristic here would be:

  • if < N1, just cache=False
  • take first N2 values and check ratio = nuniques / N2
  • if a good ratio, then go ahead and cache=True

N1 ~ 50000
N2 may depend on the dtype as well.

you already did some work gathering the experimental data, so we could just set these based on those parameters (sketched below).
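
A rough sketch of this heuristic in code; N1, N2, and the ratio cutoff below are illustrative placeholders, not tuned values:

import pandas as pd

def infer_cache(values, n1=50000, n2=500, max_unique_ratio=0.7):
    # Small inputs: the caching overhead likely dominates, so skip it.
    if len(values) < n1:
        return False
    # Take the first N2 values and compute ratio = nuniques / N2.
    sample = values[:n2]
    ratio = len(pd.unique(sample)) / len(sample)
    # A "good" (low) ratio means many duplicates, so caching should pay off.
    return ratio < max_unique_ratio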

@jbrockmendel
Member

@mroeschke to_datetime got a cache kwarg a little while ago. Did that handle this issue, or is there more?

@mroeschke
Member Author

There's a bit more. Currently cache only accepts True/False; the idea would be to add 'infer', which could determine from some light analysis of the incoming data whether cache should be True or False.

@spatbord

spatbord commented Sep 19, 2019

I just updated pandas to 0.25.1 and noticed this new behaviour. I would much prefer an option to override the 'infer', since the current behaviour causes a dramatic slowdown in my specific case: from less than 0.5 seconds to almost 2 minutes.
I retrieve a list of stock prices from a database: about 3 years of data for 20k stocks, sorted first by stock and then by date (stored as a julian day in the database). When converting to datetime, pandas infers whether to use the cache based on the first 500 entries, and in this case it decides not to use the cache, since the first 500 entries are unique dates (that's about 2 years of data for the first stock).

timeit.timeit(stmt='pd.to_datetime(prices["Date"], origin="julian", unit="D", cache=True)', globals=globals(), number=1)
107.68747780000001

prices = prices.sort_values(by='Date')
timeit.timeit(stmt='pd.to_datetime(prices["Date"], origin="julian", unit="D", cache=True)', globals=globals(), number=1)
0.4012885999999867
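
A possible workaround, sketched under the assumption that prices["Date"] holds julian-day numbers as described above: convert the unique values once and map them back, sidestepping the inference heuristic entirely.

# Parse each unique julian day once, then map the results back.
unique_days = prices["Date"].unique()
converted = pd.to_datetime(unique_days, origin="julian", unit="D")
prices["Date"] = prices["Date"].map(dict(zip(unique_days, converted)))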

@mroeschke
Member Author

Looking back on this issue, I am not convinced this feature is a good idea. I think pandas should aim to be "less smart" in general, and maintaining hard-coded heuristics for when to use the cache may not be realistic as the performance of other operations changes.

Going to close this issue out, but happy to reopen if there is a resurgence of interest
