ENH/PERF: Add cache='infer' to to_datetime #18255

Closed
mroeschke opened this issue Nov 13, 2017 · 6 comments
Labels: Datetime, Performance

Comments

@mroeschke
Member

xref PR #17077

Now that a cache keyword has been added to to_datetime, ideally the default should be cache='infer', which would inspect the input data to determine whether caching would make the conversion more efficient.

From some research (here and here), date strings, especially ones with timezone offsets, can benefit from conversion with a cache of dates. The rule of thumb for whether to convert with a cache should be based on a combination of the input data type, the proportion of duplicate values, and the number of dates to convert.

Additionally, it'd be nice to resolve existing to_datetime performance issues (e.g. #17410) so that the rules of thumb informing the inference step are not skewed by them.
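
For reference, a minimal sketch of what the existing boolean cache keyword does (the sample data here is made up for illustration):

import pandas as pd

# Illustrative input: heavily duplicated date strings.
strings = pd.Series(["2017-11-13 00:00:00", "2017-11-14 00:00:00"] * 50000)

# cache=False parses every element; cache=True parses only the unique
# values and maps the parsed results back onto the full input.
result = pd.to_datetime(strings, cache=True)
assert result.equals(pd.to_datetime(strings, cache=False))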

@mroeschke changed the title from "ENH: Add cache='infer' to to_datetime" to "ENH/PERF: Add cache='infer' to to_datetime" on Nov 13, 2017
@jorisvandenbossche
Member

For me the main question is: what would the cost of inferring be on a typical all-unique-strings case, compared to just parsing it?
I suppose that to know the number of unique values, we would need to call unique?

In [58]: idx = pd.date_range("1990-01-01", periods=100000, freq='H')

In [60]: idx_string = idx.strftime('%Y-%m-%d %H:%M:%S')

In [62]: %timeit pd.to_datetime(idx_string)
31.8 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: %timeit pd.unique(idx_string)
26.3 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So that would give quite a slowdown in that case. Or can this inferring step be done more performantly?
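
One possibility (a sketch, not anything pandas currently does): hash only a fixed-size prefix, so the cost of inferring is independent of the array length. The sample size of 500 below is an arbitrary illustrative choice:

# Reusing idx_string from above: check duplication on a prefix only,
# making the inference step O(sample) rather than O(n).
sample_ratio = len(pd.unique(idx_string[:500])) / 500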

@jreback added the Difficulty Intermediate, Performance, and Datetime labels on Nov 13, 2017
@jreback added this to the Next Major Release milestone on Nov 13, 2017
@jreback
Contributor

jreback commented Nov 13, 2017

a good heuristic here would be:

  • if < N1, just cache=False
  • take first N2 values and check ratio = nuniques / N2
  • if a good ratio, then go ahead and cache=True

N1 ~ 50000
N2 may depend on the dtype as well.

you already did some work gathering the experimental data, so we could just set these based on those parameters (sketched below).
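
A rough sketch of this heuristic in code; N1, N2, and the ratio cutoff below are illustrative placeholders, not tuned values:

import pandas as pd

def infer_cache(values, n1=50000, n2=500, max_unique_ratio=0.7):
    # Small inputs: the caching overhead likely dominates, so skip it.
    if len(values) < n1:
        return False
    # Take the first N2 values and compute ratio = nuniques / N2.
    sample = values[:n2]
    ratio = len(pd.unique(sample)) / len(sample)
    # A "good" (low) ratio means many duplicates, so caching should pay off.
    return ratio < max_unique_ratio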

@jbrockmendel
Member

@mroeschke to_datetime got a cache kwarg a little while ago. Did that handle this issue, or is there more?

@mroeschke
Member Author

There's a bit more. Currently cache only accepts True/False; the idea would be to add 'infer', which could determine from some light analysis of the incoming data whether cache should be True or False.

@spatbord

spatbord commented Sep 19, 2019

I just updated pandas to 0.25.1 and noticed this new behaviour. I would much prefer an option to override the 'infer', since the current behaviour causes a dramatic slowdown in my specific case: from less than 0.5 seconds to almost 2 minutes.
I retrieve a list of stock prices from a database: about 3 years of data for 20k stocks, sorted first by stock and then by date (stored as a julian day in the database). When converting to datetime, pandas infers whether to use the cache based on the first 500 entries, and in this case it decides not to use the cache, since the first 500 entries are unique dates (that's about 2 years of data for the first stock).

timeit.timeit(stmt='pd.to_datetime(prices["Date"], origin="julian", unit="D", cache=True)', globals=globals(), number=1)
107.68747780000001

prices = prices.sort_values(by='Date')
timeit.timeit(stmt='pd.to_datetime(prices["Date"], origin="julian", unit="D", cache=True)', globals=globals(), number=1)
0.4012885999999867
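
A possible workaround, sketched under the assumption that prices["Date"] holds julian-day numbers as described above: convert the unique values once and map them back, sidestepping the inference heuristic entirely.

# Parse each unique julian day once, then map the results back.
unique_days = prices["Date"].unique()
converted = pd.to_datetime(unique_days, origin="julian", unit="D")
prices["Date"] = prices["Date"].map(dict(zip(unique_days, converted)))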

@mroeschke
Member Author

Looking back on this issue, I am not convinced this feature is a good idea. I think pandas should aim to be "less smart" in general, and maintaining hard-coded heuristics for when to use the cache may not be realistic as the performance of other operations changes.

Going to close this issue out, but happy to reopen if there is a resurgence of interest
