Date type output changes when using df.column.unique()[i] #28625

jwhendy · 2019-09-26T01:35:14Z

Apologies if this is somehow expected. I admit I don't know the intricacies of all date types and what each of these is designed to return, but I found the result surprising.

Code Sample, a copy-pastable example if possible

import pandas as pd
dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)]
df = pd.DataFrame({'date': dates})

>>> df.iloc[2].date
Timestamp('2019-03-01 00:00:00')

>>> df.iloc[2].date.year
2019

>>> df.date.unique()[2]
numpy.datetime64('2019-03-01T00:00:00.000000000')

>>> df.date.unique()[2].year
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.datetime64' object has no attribute 'year'

Problem description

I was subsetting data based on unique dates in a data frame. Part of my code used a string formatter that assumed I could access x.year, which is where I hit the AttributeError above.

Expected Output

Intuitively, I would expect that any incantation of obtaining values from a data frame should give me the type that's there, not a new type I don't expect.

Again, not being familiar with the intricacies of the various types and what's going on behind the curtains, this could easily be a false assumption and somehow pd.Timestamp and np.datetime64 are more closely related than I understand and the call to .unique() is expected to cast one to the other?
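On how the two types relate: pd.Timestamp is the pandas scalar (a datetime.datetime subclass), while np.datetime64 is the NumPy scalar that datetime64[ns] columns are stored as. A minimal sketch of converting a single value in each direction, assuming the raw value is the one shown in the session above:

```python
import numpy as np
import pandas as pd

# The NumPy scalar from the report; it has no .year attribute.
raw = np.datetime64("2019-03-01T00:00:00.000000000")

# pd.Timestamp accepts a np.datetime64 and exposes the usual date fields.
ts = pd.Timestamp(raw)
print(ts.year, ts.month, ts.day)  # 2019 3 1

# Going the other way returns the NumPy scalar again.
back = ts.to_datetime64()
print(back == raw)  # True
```

So the information is the same in both objects; only the attribute access differs.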

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.1-arch1-1-ARCH
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.0.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 2.6.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None


jbrockmendel · 2019-09-26T18:35:21Z

I think the dates = [pd.Timestamp(year=2019, month=m, d=1) for m in range(6)] line in the OP is wrong. Can you update it to something like dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)] please

jbrockmendel · 2019-09-26T18:36:37Z

Yah, it looks like in this case Series.unique is returning a np.ndarray when it should be returning a DatetimeArray. A PR to fix this would be welcome.

jorisvandenbossche · 2019-09-26T18:42:59Z

AFAIK, this is currently the "correct" behaviour (or at least the documented behaviour), although surprising / losing some functionality.

jwhendy · 2019-09-27T02:14:21Z

@jbrockmendel Sorry about the goof, totally my fault for re-typing it vs. copying out of the terminal.

@jorisvandenbossche is right, I believe, and after looking up Series.unique, it turns out this is actually one of the examples in the documentation!

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

While documented, from a practical standpoint, this still seems odd. Maybe a more creative mind will think of a use case for the current behavior, but I'm drawing a blank. I think the "correct" behavior for unique() should send back a non-redundant list of the actual things, not almost the things.
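In the meantime, one possible workaround is to wrap the result of unique() before indexing into it. This is only a sketch, not something proposed in this thread: it assumes wrapping in pd.DatetimeIndex is acceptable for the use case, and the names unique_dates and labels are made up for illustration.

```python
import pandas as pd

dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)]
df = pd.DataFrame({"date": dates})

# Elements of a DatetimeIndex come back as pd.Timestamp, regardless of
# which container unique() itself returned.
unique_dates = pd.DatetimeIndex(df["date"].unique())

third = unique_dates[2]
print(type(third).__name__)  # Timestamp
print(third.year)            # 2019

# The string formatting from the problem description now works.
labels = [f"{d.year}-{d.month:02d}" for d in unique_dates]
print(labels)  # ['2019-01', '2019-02', '2019-03', '2019-04', '2019-05']
```

Calling pd.Timestamp(value) on a single element works just as well when only one value is needed.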

I could dig around if this could be reasonably solved by a novice/moderate programmer. I have not yet explored the internals of pandas at this point :)

simonjayhawkins · 2020-04-23T14:05:28Z

I'll close this since I think it is covered by #22824

Series.unique returns array, Series.drop_duplicates returns Series. Returning a plain np.ndarray is quite unusual for a Series method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
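A small sketch of the difference described there, using made-up sample dates. The container type printed for unique() is deliberately left unannotated: on the pandas 0.25.1 from this report it is a plain np.ndarray, and that may differ in other versions.

```python
import pandas as pd

# Illustrative data with one repeated date.
s = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-01", "2016-02-01"]))

uniques = s.unique()           # an array-like, not a Series
deduped = s.drop_duplicates()  # a Series that keeps the original index labels

print(type(uniques).__name__)
print(type(deduped).__name__)  # Series
print(deduped.index.tolist())  # [0, 2]

# Elements taken from the Series are pd.Timestamp objects.
print(deduped.iloc[1].month)   # 2
```

If the surviving index labels are not wanted, deduped.reset_index(drop=True) renumbers them.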

ping me to reopen if I'm missing something.

simonjayhawkins closed this as completed Apr 23, 2020

simonjayhawkins added the Duplicate Report label Apr 23, 2020
