Regression in 0.24: to_timedelta handling of float values #25077

Sup3rGeo · 2019-02-01T17:23:05Z

Code Sample, a copy-pastable example if possible

Creating an TimedeltaIndex with 1 microsecond steps:

import pandas as pd
import scipy as sp
t = sp.arange(0,1,1e-6)
index = pd.to_timedelta(t, unit='s')

Problem description

Output in 0.24:

We have repeated index values (e.g. 99992 in the image) and missing ones (e.g. 99994 in the image).

This looks like some sort of rounding issue.

Expected Output

Output in 0.23.4:

In this version everything is okay.

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: 4.1.0
pip: 18.1
setuptools: 39.0.1
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.2
IPython: 7.2.0
sphinx: 1.8.3
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: 1.7.0
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-02-01T17:29:46Z

Can you fill out the issue?

…

On Feb 1, 2019, at 11:23, Victor Maryama ***@***.***> wrote: Code Sample, a copy-pastable example if possible # Your code here Problem description [this should explain why the current behaviour is a problem and why the expected output is a better solution.] Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates! Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary. For documentation-related issues, you can check the latest versions of the docs on master here: https://pandas-docs.github.io/pandas-docs-travis/ If the issue has not been resolved there, go ahead and file it in the issue tracker. Expected Output Output of pd.show_versions() — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

Sup3rGeo · 2019-02-01T17:31:10Z

Sorry, was in the process of doing it!

TomAugspurger · 2019-02-01T19:27:36Z

Thanks. Likely some rounding issue. Are floating point values allowed in to_timedelta?

gfyoung · 2019-02-07T06:55:11Z

cc @jreback @mroeschke

Feel free to change milestone depending on what your thoughts on this are.

jorisvandenbossche · 2019-02-07T12:09:41Z

Are floating point values allowed in to_timedelta?

It's not really documented very clearly (under unit there is a mention of "float number"), but in any case it works in practice and is also tested behaviour:

pandas/pandas/tests/indexes/timedeltas/test_tools.py

Lines 104 to 108 in 5278cc6

    
           # Tests with fractional seconds as input: 
        
           expected = np.array( 
        
               [0, 500000000, 800000000, 1200000000], dtype='timedelta64[ns]') 
        
           result = to_timedelta([0., 0.5, 0.8, 1.2], unit='s', box=False) 
        
           tm.assert_numpy_array_equal(expected, result)

So clearly a regression I would say.

jreback · 2019-02-07T12:27:30Z

this is just a precision issue.

In [7]: index.asi8
Out[7]: 
array([        0,      1000,      2000, ..., 999996999, 999998000,
       999999000])

In [8]: np.diff(index.asi8)
Out[8]: array([1000, 1000, 1000, ...,  999, 1001, 1000])

try this to see full precision printing

In [17]: [t._repr_base(format='all') for t in index[0:10]]
Out[17]: 
['0 days 00:00:00.000000000',
 '0 days 00:00:00.000001000',
 '0 days 00:00:00.000002000',
 '0 days 00:00:00.000003000',
 '0 days 00:00:00.000004000',
 '0 days 00:00:00.000005000',
 '0 days 00:00:00.000006000',
 '0 days 00:00:00.000007000',
 '0 days 00:00:00.000008000',
 '0 days 00:00:00.000009000']

not a bug / regression at all; I suppose docs could be enhanced

jorisvandenbossche · 2019-02-07T12:32:26Z

It's not just about display. The actual integer values have changed between 0.23 and 0.24. Before 0.24, the floats were converted to ints with a better precison.

0.23.4:

In [16]: arr = np.arange(0, 1, 1e-6)

In [17]: pd.to_timedelta(arr, unit='s').asi8
Out[17]: 
array([        0,      1000,      2000, ..., 999997000, 999998000,
       999999000])

0.24.1:

In [10]: arr = np.arange(0, 1, 1e-6)    

In [11]: pd.to_timedelta(arr, unit='s').asi8   
Out[11]: 
array([        0,      1000,      2000, ..., 999996999, 999998000,
       999999000])

Notice the 999997000 vs 999996999

So in any case, it is a regression in behaviour (whether we think it was supported behaviour or not can be another discussion)

jreback · 2019-02-07T12:46:18Z

are these exactly the same numpy version? this is at the precision limit

jorisvandenbossche · 2019-02-07T12:50:28Z

Yes, the same numpy version apparently. But, it's only the 6th decimal, that's a precision we should be able to handle.

Sup3rGeo · 2019-02-07T13:17:38Z

Just to add to the discussion from a user application perspective:

>>> import pandas as pd
>>> import scipy as sp
>>> t = sp.arange(0,1,1e-6)
>>> index = pd.to_timedelta(t, unit='s')
>>> y = sp.sin(t)
>>> series = pd.Series(y, index=index)

>>> # This works
... series["300ms"]
0.29552020666133955


>>> # This don't
... series["400ms"]

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\indexes\timedeltas.py", line 501, in get_value
    return self.get_value_maybe_box(series, key)
  File "C:\Program Files\Python36\lib\site-packages\pandas\core\indexes\timedeltas.py", line 521, in get_value_maybe_box
    values = self._engine.get_value(com.values_from_object(series), key)
  File "pandas\_libs\index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 430, in pandas._libs.index.DatetimeEngine.get_loc
KeyError: Timedelta('0 days 00:00:00.400000')

>>> series["399.99ms":"400.01ms"]
00:00:00.399991    0.389410
00:00:00.399991    0.389411
00:00:00.399993    0.389412
00:00:00.399993    0.389413
00:00:00.399995    0.389414
00:00:00.399995    0.389415
00:00:00.399997    0.389416
00:00:00.399997    0.389417
00:00:00.399999    0.389417
00:00:00.399999    0.389418
00:00:00.400001    0.389419
00:00:00.400001    0.389420
00:00:00.400003    0.389421
00:00:00.400004    0.389422
00:00:00.400005    0.389423
00:00:00.400006    0.389424
00:00:00.400007    0.389425
00:00:00.400008    0.389426
00:00:00.400009    0.389427
00:00:00.400010    0.389428
dtype: float64

jorisvandenbossche · 2019-02-08T07:42:09Z

I think this is caused by #23539. In that PR, a new code path for floats was added in sequence_to_td64:

pandas/pandas/core/arrays/timedeltas.py

Lines 918 to 926 in 0eddba8

    
           elif is_float_dtype(data.dtype): 
        
               # treat as multiples of the given unit.  If after converting to nanos, 
        
               #  there are fractional components left, these are truncated 
        
               #  (i.e. NOT rounded) 
        
               mask = np.isnan(data) 
        
               coeff = np.timedelta64(1, unit) / np.timedelta64(1, 'ns') 
        
               data = (coeff * data).astype(np.int64).view('timedelta64[ns]') 
        
               data[mask] = iNaT 
        
               copy = False

While before that PR, floats were handled by (the slower) array_to_timedelta64 (which is now exposed in objects_to_td64ns). That one is converting the example data from this issue correctly:

In [12]: data = np.arange(0, 1, 1e-6)  

In [13]: pd.core.arrays.timedeltas.sequence_to_td64ns(data, unit='s')  
Out[13]: 
(array([        0,      1000,      2000, ..., 999996999, 999998000,
        999999000], dtype='timedelta64[ns]'), None)

In [14]: pd.core.arrays.timedeltas.objects_to_td64ns(data, unit='s')
Out[14]: 
array([        0,      1000,      2000, ..., 999997000, 999998000,
       999999000], dtype='timedelta64[ns]')

In the array_to_timedelta64 code, there is some special handling for float precision:

pandas/pandas/_libs/tslibs/timedeltas.pyx

Lines 293 to 299 in 0eddba8

    
           # cast the unit, multiply base/frace separately 
        
           # to avoid precision issues from float -> int 
        
           base = <int64_t>ts 
        
           frac = ts - base 
        
           if p: 
        
               frac = round(frac, p) 
        
           return <int64_t>(base * m) + <int64_t>(frac * m)

cc @jbrockmendel

jbrockmendel · 2019-02-08T17:09:47Z

@Sup3rGeo it looks like @jorisvandenbossche has found the source of the issue. The quoted code in arrays.timedeltas needs to use the quoted code in tslibs.timedeltas instead of numpy's astype. Want to make a PR with a fix?

Sup3rGeo · 2019-02-18T16:05:29Z

@jbrockmendel maybe! Will check if I can sneak this into the weekend.

TomAugspurger · 2019-03-06T15:35:21Z

I think we're planning a 0.24.2 release for sometime next week. @Sup3rGeo do you have time to submit a PR in the next few days?

jorisvandenbossche · 2019-03-06T17:19:49Z

I can also do it tomorrow afternoon, if that might help.

Sup3rGeo changed the title ~~Regression in 0.24: Timedeltaindex frk has missing/repeated values~~ Regression in 0.24: Timedeltaindex from to_timedelta has missing/repeated values Feb 1, 2019

gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Timedelta Timedelta data type Regression Functionality that used to work in a prior pandas version labels Feb 7, 2019

gfyoung added this to the 0.24.2 milestone Feb 7, 2019

jreback removed the Regression Functionality that used to work in a prior pandas version label Feb 7, 2019

jreback modified the milestones: 0.24.2, Contributions Welcome Feb 7, 2019

jreback added the Docs label Feb 7, 2019

jorisvandenbossche modified the milestones: Contributions Welcome, 0.24.2 Feb 7, 2019

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Feb 7, 2019

jorisvandenbossche changed the title ~~Regression in 0.24: Timedeltaindex from to_timedelta has missing/repeated values~~ Regression in 0.24: to_timedelta handling of float values Feb 8, 2019

jorisvandenbossche mentioned this issue Mar 4, 2019

RLS: 0.24.x #24949

Closed

jorisvandenbossche mentioned this issue Mar 11, 2019

REGR: to_timedelta precision issues with floating data #25651

Merged

jreback closed this as completed in #25651 Mar 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in 0.24: to_timedelta handling of float values #25077

Regression in 0.24: to_timedelta handling of float values #25077

Sup3rGeo commented Feb 1, 2019 •

edited

Loading

INSTALLED VERSIONS

TomAugspurger commented Feb 1, 2019 via email

Sup3rGeo commented Feb 1, 2019

TomAugspurger commented Feb 1, 2019

gfyoung commented Feb 7, 2019 •

edited

Loading

jorisvandenbossche commented Feb 7, 2019 •

edited

Loading

jreback commented Feb 7, 2019

jorisvandenbossche commented Feb 7, 2019 •

edited

Loading

jreback commented Feb 7, 2019

jorisvandenbossche commented Feb 7, 2019

Sup3rGeo commented Feb 7, 2019

jorisvandenbossche commented Feb 8, 2019 •

edited

Loading

jbrockmendel commented Feb 8, 2019

Sup3rGeo commented Feb 18, 2019

TomAugspurger commented Mar 6, 2019

jorisvandenbossche commented Mar 6, 2019

Regression in 0.24: to_timedelta handling of float values #25077

Regression in 0.24: to_timedelta handling of float values #25077

Comments

Sup3rGeo commented Feb 1, 2019 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Feb 1, 2019 via email

Sup3rGeo commented Feb 1, 2019

TomAugspurger commented Feb 1, 2019

gfyoung commented Feb 7, 2019 • edited Loading

jorisvandenbossche commented Feb 7, 2019 • edited Loading

jreback commented Feb 7, 2019

jorisvandenbossche commented Feb 7, 2019 • edited Loading

jreback commented Feb 7, 2019

jorisvandenbossche commented Feb 7, 2019

Sup3rGeo commented Feb 7, 2019

jorisvandenbossche commented Feb 8, 2019 • edited Loading

jbrockmendel commented Feb 8, 2019

Sup3rGeo commented Feb 18, 2019

TomAugspurger commented Mar 6, 2019

jorisvandenbossche commented Mar 6, 2019

Sup3rGeo commented Feb 1, 2019 •

edited

Loading

Output of `pd.show_versions()`

gfyoung commented Feb 7, 2019 •

edited

Loading

jorisvandenbossche commented Feb 7, 2019 •

edited

Loading

jorisvandenbossche commented Feb 7, 2019 •

edited

Loading

jorisvandenbossche commented Feb 8, 2019 •

edited

Loading