DataFrame.merge error with empty frame and multiple datetime64[ns, UTC] columns #25014

Closed
josham opened this issue Jan 29, 2019 · 8 comments
Labels
Bug Regression Functionality that used to work in a prior pandas version Timezones Timezone data dtype

@josham
Contributor

josham commented Jan 29, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01', tz='UTC'), 4.0, pd.Timestamp('2019-01-01', tz='UTC')]
], columns=['date', 'value', 'date2'])
y = x[:0]
y.merge(x, on='date')
Traceback (most recent call last):
  File "/scratch.py", line 8, in <module>
    z = y.merge(x, on='date')
  File "/python/lib/python3.6/site-packages/pandas/core/frame.py", line 6877, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/python/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 48, in merge
    return op.get_result()
  File "/python/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 560, in get_result
    concat_axis=0, copy=self.copy)
  File "/python/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 2061, in concatenate_block_managers
    concatenate_join_units(join_units, concat_axis, copy=copy),
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 240, in concatenate_join_units
    for ju in join_units]
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 240, in <listcomp>
    for ju in join_units]
  File "/python/lib/python3.6/site-packages/pandas/core/internals/concat.py", line 223, in get_reindexed_values
    fill_value=fill_value)
  File "/python/lib/python3.6/site-packages/pandas/core/algorithms.py", line 1579, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  File "/python/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 589, in take
    fill_value = self._validate_fill_value(fill_value)
  File "/python/lib/python3.6/site-packages/pandas/core/arrays/datetimes.py", line 656, in _validate_fill_value
    "Got '{got}'.".format(got=fill_value))
ValueError: 'fill_value' should be a Timestamp. Got '-9223372036854775808'.

If there is no timezone specified it works as expected:

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01'), 4.0, pd.Timestamp('2019-01-01')]
], columns=['date', 'value', 'date2'])
y = x[:0]
y.merge(x, on='date')
Empty DataFrame
Columns: [value_x, date2_x, date, value_y, date2_y]
Index: []

It also works if there is only one date column:

x = pd.DataFrame([
    [pd.Timestamp('2018-01-01', tz='UTC'), 4.0]
], columns=['date', 'value'])
y = x[:0]
y.merge(x, on='date')
Empty DataFrame
Columns: [value_x, date, value_y]
Index: []

Problem description

It seems like the issue is that iNaT is being passed as the fill_value rather than NaT.
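
For readers unfamiliar with the name, the number in the error message is the integer sentinel that pandas uses internally for "not a time". The sketch below only illustrates that relationship; the constant name `INAT` is chosen here for illustration and is not imported from pandas.

```python
import numpy as np
import pandas as pd

# The smallest int64 value; pandas uses it internally as the integer
# representation of a missing datetime (referred to as iNaT in this issue).
INAT = np.iinfo(np.int64).min

# This is the number quoted in the ValueError above.
print(INAT)

# NaT is backed by the same integer, so the two describe the same
# "missing" state at different levels of abstraction.
print(pd.NaT.value == INAT)
```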

Expected Output

Empty DataFrame
Columns: [value_x, date2_x, date, value_y, date2_y]
Index: []
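
The expected behaviour could be written as a small check along these lines. This is only a sketch, assuming a pandas release in which the regression is fixed; the column order is deliberately not asserted, since it may differ between versions.

```python
import pandas as pd

x = pd.DataFrame(
    [[pd.Timestamp("2018-01-01", tz="UTC"), 4.0, pd.Timestamp("2019-01-01", tz="UTC")]],
    columns=["date", "value", "date2"],
)
y = x[:0]  # empty frame with the same columns and dtypes

# On an affected version (0.24.0) this line raises the ValueError shown above.
result = y.merge(x, on="date")

# The merge should succeed and yield an empty frame with suffixed columns.
expected_columns = {"value_x", "date2_x", "date", "value_y", "date2_y"}
assert len(result) == 0
assert set(result.columns) == expected_columns
```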

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.0
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@mroeschke
Member

Thanks for the report. Should iNaT be a valid fill_value here, since it's passed internally? @jbrockmendel

If so, this would be the fix:

--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -651,7 +651,7 @@ class DatetimeArray(dtl.DatetimeLikeArrayMixin,
         elif isinstance(fill_value, (datetime, np.datetime64)):
             self._assert_tzawareness_compat(fill_value)
             fill_value = Timestamp(fill_value).value
-        else:
+        elif fill_value != iNaT:
             raise ValueError("'fill_value' should be a Timestamp. "
                              "Got '{got}'.".format(got=fill_value))
         return fill_value

@mroeschke mroeschke added Bug Timezones Timezone data dtype labels Jan 30, 2019
@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jan 30, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.24.1 milestone Jan 30, 2019
@jorisvandenbossche
Member

This is a regression, so it would be nice to include the fix in 0.24.1. @jbrockmendel, is the above the correct fix?

@jbrockmendel
Member

Disallowing iNaT was an intentional choice (I'd have to dig out the PR and preceding discussion). Is passing NaT not an option?

@jorisvandenbossche
Member

"Is passing NaT not an option?"

It's not the user that is doing this, but our algos code. So I think you are best placed to answer your own question :-)

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 30, 2019

Spoke a bit too fast: it is not directly the algos code, but pandas.core.internals.concat.get_empty_dtype_and_na that produces this fill value, which is then passed through concat -> take_nd.

@TomAugspurger
Contributor

I'll take a look at what breaks if upcasted_na is NaT for datetime.

@jorisvandenbossche
Member

@TomAugspurger you cannot fill an empty M8 array with NaT, as is done e.g. here:

else:
    missing_arr = np.empty(self.shape, dtype=empty_dtype)
    missing_arr.fill(fill_value)
    return missing_arr

@TomAugspurger
Contributor

Yes, so the answer to "I'll take a look at what breaks if upcasted_na is NaT for datetime." is quite a lot :)

I think changing it just before calling algos.take_nd is best...
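
A rough sketch of that idea, under stated assumptions: `normalize_fill_value` is a hypothetical helper written for illustration, not a pandas function and not necessarily how the fix was implemented. It swaps the integer sentinel for NaT only when the values are timezone-aware, and uses the public `take` method of the array rather than the internal `algos.take_nd`.

```python
import numpy as np
import pandas as pd

# Integer sentinel for a missing datetime (called iNaT in this thread).
INAT = np.iinfo(np.int64).min


def normalize_fill_value(values, fill_value):
    """Hypothetical helper: replace the integer sentinel with NaT for
    timezone-aware datetime values, and leave everything else untouched."""
    is_tz_aware = isinstance(values.dtype, pd.DatetimeTZDtype)
    is_sentinel = isinstance(fill_value, (int, np.integer)) and fill_value == INAT
    if is_tz_aware and is_sentinel:
        return pd.NaT
    return fill_value


# A tz-aware datetime array like the one stored in the 'date2' column.
tz_values = pd.DatetimeIndex(["2018-01-01"], tz="UTC").array

# Convert just before taking, so the array only ever sees NaT.
fill = normalize_fill_value(tz_values, INAT)
taken = tz_values.take([0, -1], allow_fill=True, fill_value=fill)

# Non-datetime data keeps whatever fill value it was given.
naive_fill = normalize_fill_value(np.array([1.0]), INAT)
```

Doing the conversion at the call site would leave the integer sentinel available for the plain NumPy code path quoted above, which still fills M8 arrays with it.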
