Date type output changes when using df.column.unique()[i] #28625

Closed
jwhendy opened this issue Sep 26, 2019 · 5 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@jwhendy

jwhendy commented Sep 26, 2019

Apologies if this is somehow expected. I admit I don't know the intricacies of all date types and what each of these is designed to return, but I found the result surprising.

Code Sample, a copy-pastable example if possible

import pandas as pd
dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)]
df = pd.DataFrame({'date': dates})

>>> df.iloc[2].date
Timestamp('2019-03-01 00:00:00')

>>> df.iloc[2].date.year
2019

>>> df.date.unique()[2]
numpy.datetime64('2019-03-01T00:00:00.000000000')

>>> df.date.unique()[2].year
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.datetime64' object has no attribute 'year'

Problem description

I was subsetting data based on unique dates in a data frame. Part of my code used a string formatter, assuming I could access x.year when I got the AttributeError above.

Expected Output

Intuitively, I would expect that any incantation of obtaining values from a data frame should give me the type that's there, not a new type I don't expect.

Again, not being familiar with the intricacies of the various types and what's going on behind the curtains, this could easily be a false assumption and somehow pd.Timestamp and np.datetime64 are more closely related than I understand and the call to .unique() is expected to cast one to the other?
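As a possible workaround for the formatting use case described above, the value taken from `.unique()` can be wrapped in `pd.Timestamp` before accessing `.year`. This is only a minimal sketch: the variable names (`value`, `ts`, `label`) are illustrative, and depending on the pandas version `value` may already be a `Timestamp`, in which case the wrapping does no harm.

```python
import pandas as pd

# Rebuild the frame from the report (using the `day` keyword).
dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)]
df = pd.DataFrame({"date": dates})

# Depending on the pandas version, this element may be a numpy.datetime64
# or already a pandas.Timestamp.
value = df["date"].unique()[2]

# pd.Timestamp accepts both types, so wrapping normalises the value.
ts = pd.Timestamp(value)

# The datetime attributes are now available for string formatting.
label = f"{ts.year}-{ts.month:02d}"
print(label)
```

Wrapping each element this way keeps the rest of the code independent of what `.unique()` happens to return.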

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.1-arch1-1-ARCH
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.0.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 2.6.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@jbrockmendel
Member

I think the dates = [pd.Timestamp(year=2019, month=m, d=1) for m in range(6)] line in the OP is wrong. Can you update it to something like dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)] please

@jbrockmendel
Member

Yah, it looks like in this case Series.unique is returning a np.ndarray when it should be returning a DatetimeArray. A PR to fix this would be welcome.
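Until such a change is available, a user could probably get similar behaviour by wrapping the result themselves. The sketch below assumes that `pd.array` infers a datetime extension array from a `datetime64` NumPy array; the names `raw` and `wrapped` are illustrative, and `np.asarray` is used only so the example starts from a plain ndarray regardless of pandas version.

```python
import numpy as np
import pandas as pd

dates = [pd.Timestamp(year=2019, month=m, day=1) for m in range(1, 6)]
df = pd.DataFrame({"date": dates})

# Force a plain NumPy datetime64 array, as described in the report.
raw = np.asarray(df["date"].unique())

# Wrap the ndarray in a pandas extension array so that scalar
# indexing yields pandas.Timestamp objects again.
wrapped = pd.array(raw)

element = wrapped[2]
print(type(element).__name__, element.year)
```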

@jorisvandenbossche
Member

AFAIK, this is currently the "correct" behaviour (or at least the documented behaviour), although surprising / losing some functionality.

@jwhendy
Author

jwhendy commented Sep 27, 2019

@jbrockmendel Sorry about the goof, totally my fault for re-typing it vs. copying out of the terminal.

@jorisvandenbossche is right, I believe, and after looking up Series.unique, it turns out this is actually one of the documented examples!

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

While documented, from a practical standpoint, this still seems odd. Maybe a more creative mind will think of a use case for the current behavior, but I'm drawing a blank. I think the "correct" behavior for unique() should send back a non-redundant list of the actual things, not almost the things.

I could dig around to see whether this could reasonably be solved by a novice/moderate programmer. I have not yet explored the internals of pandas at this point :)

@simonjayhawkins
Member

I'll close this since I think it is covered by #22824

Series.unique returns array, Series.drop_duplicates returns Series. Returning a plain np.ndarray is quite unusual for a Series method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
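To illustrate the difference described in that quote, here is a small sketch using `Series.drop_duplicates`, which keeps the result as a Series so the `.dt` accessor and `Timestamp` scalars remain available. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Three repeated dates plus one distinct date (illustrative data).
s = pd.Series(
    [pd.Timestamp("2016-01-01")] * 3 + [pd.Timestamp("2017-06-15")]
)

# drop_duplicates returns a Series, so the datetime dtype is preserved.
unique_series = s.drop_duplicates()

# Scalar access on the Series gives pandas.Timestamp objects.
first = unique_series.iloc[0]

# The .dt accessor works on the deduplicated Series as a whole.
years = unique_series.dt.year.tolist()
print(first, years)
```

Note that `drop_duplicates` keeps the original index labels of the first occurrences, so `reset_index(drop=True)` may be needed if positional labels matter.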

ping me to reopen if I'm missing something.

@simonjayhawkins added the Duplicate Report label Apr 23, 2020