RFE: a faster way to construct a pandas.Timestamp from an epoch time #14658

Closed
radekholy24 opened this issue Nov 14, 2016 · 10 comments
Labels
Datetime: Datetime data dtype · Performance: Memory or execution speed performance · Timezones: Timezone data dtype

Comments

@radekholy24

radekholy24 commented Nov 14, 2016

pandas.to_datetime called with an int is too slow for my use case. I have a loop that pulls integers one at a time from a generator of about 1,000,000 numbers, converts each one to a pandas.Timestamp and passes it to a function. A profiler says that the calls to pandas.to_datetime take about 40% of the total run time of my program.

Compared to datetime.datetime.fromtimestamp, it's about 70 times slower:

$ python -m timeit -n 1000000 -s 'import datetime' 'datetime.datetime.fromtimestamp(30, tz=datetime.timezone.utc)'
1000000 loops, best of 3: 0.889 usec per loop
$ python -m timeit -n 1000000 -s 'import pandas' 'pandas.to_datetime(30, utc=True, unit="s")'
1000000 loops, best of 3: 62.8 usec per loop
$ python -c 'import pandas;pandas.show_versions()'

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Can you please provide/document a faster way to instantiate a pandas.Timestamp instance from an epoch time?

@radekholy24 radekholy24 changed the title RFE: a faster way to construct pandas.Timestamp from epoch times RFE: a faster way to construct a pandas.Timestamp from an epoch time Nov 14, 2016
@jreback
Contributor

jreback commented Nov 14, 2016

Why would you do this in a loop?
Simply pass the entire list to pd.to_datetime.

@jreback
Contributor

jreback commented Nov 14, 2016

In [10]: r = list(range(100000))

In [11]: %timeit [ datetime.datetime.fromtimestamp(30+v, tz=datetime.timezone.utc) for v in r ]
1 loop, best of 3: 251 ms per loop

In [12]: %timeit pd.to_datetime(r, utc=True, unit='s')
10 loops, best of 3: 84.5 ms per loop
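
The batch approach timed above can be sketched as follows. This is a minimal illustration rather than code from the thread: the `epochs` list is a hypothetical stand-in for the real input, and the values are assumed to be epoch seconds.

```python
import pandas as pd

# Hypothetical epoch seconds standing in for the real input data.
epochs = [30, 31, 32]

# One vectorized call converts the whole list to a tz-aware DatetimeIndex.
index = pd.to_datetime(epochs, unit="s", utc=True)

# Iterating over the index yields individual pandas.Timestamp objects,
# which can then be handed to per-item code.
timestamps = list(index)
```

The per-call overhead of `pd.to_datetime` is paid once for the whole list instead of once per element, which is where the speedup in the timings above comes from.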

@jreback jreback closed this as completed Nov 14, 2016
@jreback jreback added the Performance, Datetime and Timezones labels Nov 14, 2016
@jreback jreback added this to the No action milestone Nov 14, 2016
@radekholy24
Author

Because all the data from the generator does not fit into memory?
But yes, I can do that in my case.
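
When the full input does not fit into memory, one possible middle ground is to convert it in fixed-size chunks. The sketch below is an assumption about how that could look, not something proposed in the thread: `iter_timestamps` and `chunk_size` are hypothetical names, and the input is assumed to be an iterable of epoch seconds.

```python
from itertools import islice

import pandas as pd


def iter_timestamps(epoch_seconds, chunk_size=10000):
    """Yield UTC pandas.Timestamp objects from an iterable of epoch seconds.

    Hypothetical helper: only `chunk_size` integers are held in memory at a
    time, and each chunk is converted with one vectorized to_datetime call.
    """
    iterator = iter(epoch_seconds)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        # Iterating over the resulting DatetimeIndex yields Timestamp objects.
        yield from pd.to_datetime(chunk, unit="s", utc=True)


# Usage: consume a generator lazily, two values per batch.
result = list(iter_timestamps((i for i in range(5)), chunk_size=2))
```

The chunk size trades memory for speed: larger chunks amortize the call overhead better, smaller chunks keep less data resident.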

@TomAugspurger
Contributor

FYI @PyDeQ

In [652]: %timeit pd.Timestamp.utcfromtimestamp(30)
The slowest run took 11.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.76 µs per loop

vs

In [653]: %timeit datetime.datetime.fromtimestamp(30, tz=datetime.timezone.utc)
The slowest run took 14.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.62 µs per loop

But agreed with @jreback, you're much better off using vectorized methods in pandas.

@radekholy24
Author

radekholy24 commented Nov 14, 2016

@TomAugspurger thanks. Unfortunately, pd.Timestamp.utcfromtimestamp is not documented.

@TomAugspurger
Contributor

Mind opening a PR to fix that?

@radekholy24
Author

No promises, but I can consider opening a PR if I find some spare time, sure.

@TomAugspurger
Contributor

#5218 seems to be the reason it's not in the API docs at the moment.

@jorisvandenbossche
Member

Using just the plain Timestamp constructor is actually also fast:

In [51]: %timeit pd.Timestamp.utcfromtimestamp(30)
The slowest run took 39.10 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.77 µs per loop

In [52]: %timeit pd.Timestamp(30, unit='s', tz='UTC')
The slowest run took 14.56 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.97 µs per loop

And this one is documented (so I would prefer this over Timestamp.utcfromtimestamp)

(and it also gives me the impression that the performance of to_datetime can certainly be improved for this case)
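
For a single value, the constructor call timed above should give the same result as the slower `to_datetime` call and the standard-library conversion from the original report. A small check, assuming the integer is epoch seconds (the variable names are illustrative only):

```python
import datetime

import pandas as pd

epoch = 30  # epoch seconds, as in the timings above

# Scalar construction with the documented Timestamp constructor.
ts = pd.Timestamp(epoch, unit="s", tz="UTC")

# The two alternatives from the original report, for comparison.
via_to_datetime = pd.to_datetime(epoch, unit="s", utc=True)
via_stdlib = datetime.datetime.fromtimestamp(epoch, tz=datetime.timezone.utc)

same_instant = (ts == via_to_datetime) and (ts == via_stdlib)
```

All three represent the same instant, so the choice between them for scalar inputs comes down to speed and to which return type the calling code needs.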

@radekholy24
Author

radekholy24 commented Nov 15, 2016

@jorisvandenbossche, which documentation do you mean? So far, I've only found examples with a string as the first argument and no unit or tz arguments.
Anyway, thanks. That is actually what I expected to be the resolution of this request.
