BUG: unique() casts type pd.Timestamp to numpy.datetime64 #35448

Closed
2 of 3 tasks
SebastianoX opened this issue Jul 29, 2020 · 10 comments
Labels
API Design Bug Duplicate Report Duplicate issue or pull request

Comments

@SebastianoX

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample:

import pandas as pd


df = pd.DataFrame({"date": ["2019-02-10", "2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

print("Type before the for cycle:")
print(type(df["date"][0]))  # pandas._libs.tslibs.timestamps.Timestamp

for day in df["date"].unique(): 
    print("Type in the loop:")
    print(type(day))  # here is a numpy.datetime64

which returns:

Type before the for cycle:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'numpy.datetime64'>
Type in the loop:
<class 'numpy.datetime64'>

Problem description

The function unique() should not cast the data type.

Expected Output

The element types of df["date"].unique() should be the same as those in set(df["date"].to_list()). For example:

import pandas as pd


df = pd.DataFrame({"date": ["2019-02-10", "2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

print("Type before the for cycle:")
print(type(df["date"][0])) 

for day in set(df["date"].to_list()): 
    print("Type in the loop:")
    print(type(day))

Returning:

Type before the for cycle:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type in the loop:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
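Until the return type is settled, a minimal sketch of two ways to get Timestamp elements back while still de-duplicating. The variable names are illustrative only, and which one fits depends on whether the original index labels are needed:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2019-02-10", "2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

# Option 1: drop_duplicates() stays inside pandas and returns a Series,
# so iterating over it yields pd.Timestamp objects.
unique_series = df["date"].drop_duplicates()
types_from_series = {type(day) for day in unique_series}

# Option 2: wrap the result of unique() in a DatetimeIndex,
# which also yields pd.Timestamp objects when iterated.
unique_index = pd.DatetimeIndex(df["date"].unique())
types_from_index = {type(day) for day in unique_index}

print(types_from_series)
print(types_from_index)
```

Both options leave two distinct dates here. Option 1 keeps the original index labels, while option 2 discards them.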

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.7.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.5.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.5
numpy            : 1.19.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.3.1
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.16.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.1
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
@SebastianoX SebastianoX added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2020
@jreback
Contributor

jreback commented Jul 29, 2020

there have been a number of discussions about this - pls look for duplicate issues before opening a new one

@SebastianoX
Author

Thanks for your answer, @jreback.
Before posting I looked for duplicate issues / stackoverflow questions / google in general and I could not see any.
Please do link the discussions/issues here, so that I and other interested developers can find them.
If it is a duplicate feel free to close it.

@simonjayhawkins
Member

I'll close this since I think it is covered by #22824

Series.unique returns array, Series.drop_duplicates returns Series. Returning a plain np.ndarray is quite unusual for a Series method, and furthermore the differences between these closely-related methods are confusing from a user perspective, IMO
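To make the contrast in that quote concrete, here is a small sketch comparing the two methods on a datetime Series. The exact container returned by unique() depends on the installed pandas version, so the code only checks that it is not a Series:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2019-02-10", "2019-02-10", "2019-02-11"]))

# drop_duplicates() returns a Series and keeps the original index labels.
deduped = s.drop_duplicates()

# unique() returns an array-like container rather than a Series;
# its concrete type varies between pandas versions.
uniques = s.unique()

print(type(deduped).__name__, list(deduped.index))
print(type(uniques).__name__, len(uniques))
```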

@simonjayhawkins simonjayhawkins added API Design Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2020
@simonjayhawkins simonjayhawkins added this to the No action milestone Jul 29, 2020
@jreback
Contributor

jreback commented Jul 29, 2020

though having a dedicated issue for this might be ok (as that catch-all unique issue brings up many topics)

we cannot change this to return a DatetimeArray till 2.0 in any event (nor can we deprecate anything)

@SebastianoX
Author

SebastianoX commented Jul 29, 2020

@simonjayhawkins from what I can understand, #22824 is a different issue.

The problem in this issue is not that unique() returns an array. The problem is that the Timestamp objects in a column are cast to np.datetime64 objects in the numpy array returned when unique() is called on that column.

@SebastianoX
Author

Let me add a clearer example:

import pandas as pd


df = pd.DataFrame({"date": ["2019-02-10", "2019-02-11"]})
df["date"] = pd.to_datetime(df["date"])

print("Date Types in column date:")
for day in df["date"]:
    print(type(day))  # this is pandas._libs.tslibs.timestamps.Timestamp

print("Unique date Types in column date:")
for day in df["date"].unique():
    print(type(day))  # this is np.datetime64

The code returns:

Date Types in column date:
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Unique date Types in column date:
<class 'numpy.datetime64'>
<class 'numpy.datetime64'>

@simonjayhawkins
Member

@simonjayhawkins from what I can understand, #22824 is a different issue.

The problem in this issue is not that unique() returns an array. The problem is that the Timestamp objects in a column are cast to np.datetime64 objects in the numpy array returned when unique() is called on that column.

OK but I don't think that's clear from the OP. Feel free to open a new issue.

@SebastianoX
Author

You do not think it is clear as in "I think it is covered by #22824"?
Anyway, a new issue is on its way.

@simonjayhawkins
Member

The first case is iterating over a Series, the second case is iterating over a numpy array. An MRE doesn't need this comparison, just the output of .unique and the expected output.

as @jreback states in #35448 (comment)

we cannot change this to return a DatetimeArray till 2.0 in any event (nor can we deprecate anything)
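The Series-versus-array point can be illustrated without calling unique() at all. In this sketch, to_numpy() stands in for any conversion to a plain numpy array, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2019-02-10", "2019-02-11"]))

# Iterating over the Series boxes each value as a pd.Timestamp.
from_series = [type(day) for day in s]

# Iterating over the underlying numpy array yields np.datetime64 scalars.
raw = list(s.to_numpy())

# Each numpy scalar can be converted back explicitly if Timestamps are needed.
converted = [pd.Timestamp(day) for day in raw]

print(from_series)
print([type(day) for day in raw])
print(converted)
```

So the type change comes from the numpy container itself, not from any de-duplication step.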

@SebastianoX
Author

@simonjayhawkins please let me know if it is clearer now.
