ENH: Added DataFrame.compare and Series.compare (GH30429) #30852

Merged
merged 64 commits into from
May 28, 2020

Conversation

fujiaxiang
Member

@fujiaxiang fujiaxiang commented Jan 9, 2020

Added DataFrame.differences and Series.differences methods.

Have not yet added whatsnew entries. Will do so after review of API and behavior design.

A few design considerations open for discussion (among other things):

  1. Index/column names: (self, other) vs (left, right) vs (old, new)
    I used the (self, other) pair because it naturally matches the parameter names and the concept of comparing self with other. Although (left, right) or (old, new) may be slightly more intuitive for users in some use cases, these new functions do not necessarily compare two objects that are left and right (the output may be stacked on axis=0) or old and new.

  2. Parameter name: axis
    I feel axis may not be a very clear name, since we are stacking the output along this axis, rather than comparing objects along this axis. I considered using the name stack_axis or orient like DataFrame.to_json does.

  3. I made the methods such that two NaNs are considered equal, hence not different. I think this should be the desired behavior since we are trying to compare two objects. However, under the current implementation, a None will be considered equal to np.nan too. This is probably something we need to be careful about.
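
A minimal sketch of the NaN handling described in point 3, assuming the difference mask combines elementwise equality with isna(). The variable names here are illustrative and are not taken from the PR's code.

```python
import numpy as np
import pandas as pd

# Object dtype keeps None and np.nan as distinct Python objects.
left = pd.Series([1.0, np.nan, None, 4.0], dtype=object)
right = pd.Series([1.0, np.nan, np.nan, 5.0], dtype=object)

# Plain equality flags every missing position as different, because NaN != NaN.
naive_diff = ~(left == right)

# Treat positions where both sides are missing as equal.
# isna() is True for both None and np.nan, so the two are not told apart.
both_missing = left.isna() & right.isna()
nan_aware_diff = ~((left == right) | both_missing)

print(naive_diff.tolist())      # [False, True, True, True]
print(nan_aware_diff.tolist())  # [False, False, False, True]
```

With this kind of mask, position 2 (None vs np.nan) is reported as equal, which is the caveat raised above.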

Let me know what you guys think. I'm ready to be scrutinized and criticized!
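
To illustrate point 2 (the axis along which the output is stacked), here is a small sketch against released pandas (1.1 or later). Note the assumption: released pandas exposes this option as `align_axis`, a name that does not appear in this thread, and the frames below are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df2 = df1.copy()
df2.loc[1, "a"] = 20  # introduce a single difference

# Default: self/other are placed side by side as an inner column level.
wide = df1.compare(df2)

# align_axis=0: self/other are stacked as an inner row level instead.
tall = df1.compare(df2, align_axis=0)

print(wide)
print(tall)
```

Either way the comparison itself is elementwise; the option only controls where the self/other labels end up in the result.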

pandas/core/frame.py (outdated, resolved)
@fujiaxiang
Member Author

ping

pandas/core/frame.py (outdated, resolved)
@jreback jreback added the API - Consistency Internal Consistency of API/Behavior label Jan 18, 2020
Contributor

@jreback jreback left a comment

will have a look soon.

cc @pandas-dev/pandas-core

@fujiaxiang fujiaxiang closed this Jan 30, 2020
@fujiaxiang fujiaxiang reopened this Jan 30, 2020
@fujiaxiang
Member Author

ping

Contributor

@jreback jreback left a comment

some comments, pls merge master

doc/source/whatsnew/v1.1.0.rst (resolved)
pandas/core/frame.py (resolved)
pandas/core/generic.py (outdated, resolved)
pandas/core/generic.py (resolved)
pandas/core/generic.py (resolved)
@jreback
Contributor

jreback commented Feb 9, 2020

@fujiaxiang
Member Author

ping

@fujiaxiang
Member Author

ping

@jreback jreback changed the title ENH: Added DataFrame.differences and Series.differences (GH30429) ENH: Added DataFrame.compare and Series.compare (GH30429) May 12, 2020
Contributor

@jreback jreback left a comment

looks good, a few small comments.

@pandas-dev/pandas-core any comments.

doc/source/whatsnew/v1.1.0.rst (resolved)
pandas/core/generic.py (resolved)
pandas/tests/frame/methods/test_compare.py (resolved)
@mroeschke
Member

Do we have a test where the two objects are not aligned?

e.g.

df1 = pd.DataFrame(np.ones((3,3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

@jreback jreback added this to the 1.1 milestone May 12, 2020
@fujiaxiang
Member Author

Do we have a test where the two objects are not aligned?

e.g.

df1 = pd.DataFrame(np.ones((3,3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

@mroeschke I have just added some. I'm expecting it to raise ValueError, which the tests check using pytest.raises.
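
A short sketch of the unaligned case, using the two frames from the question. It uses a plain try/except so it runs without pytest; this is an illustration, not the test that was added in the PR.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

# In a test suite this would typically be written as:
#     with pytest.raises(ValueError):
#         df1.compare(df2)
try:
    df1.compare(df2)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```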

@fujiaxiang fujiaxiang requested a review from jreback May 15, 2020 04:28

Keep all original rows and columns and also all original values

>>> df.compare(df2, keep_shape=True, keep_equal=True)
Contributor

does keep_equal make sense w/o keep_shape==True? IOW does it stand on its own? can you add an example of just using it

Member Author

Yes I would say so. It helps identify which one could be the "anomaly". I myself have come across such use cases.

For example, say I'm looking at the dates on which a company publishes two types of its quarterly report.

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=['Q1_filing', 'Q2_filing', 'Q3_filing', 'Q4_filing'], index=list(range(2010, 2020)))
>>> df1['Q1_filing'] = 'Mar'
>>> df1['Q2_filing'] = 'Jun'
>>> df1['Q3_filing'] = 'Sep'
>>> df1['Q4_filing'] = 'Dec'
>>> df1
     Q1_filing Q2_filing Q3_filing Q4_filing
2010       Mar       Jun       Sep       Dec
2011       Mar       Jun       Sep       Dec
2012       Mar       Jun       Sep       Dec
2013       Mar       Jun       Sep       Dec
2014       Mar       Jun       Sep       Dec
2015       Mar       Jun       Sep       Dec
2016       Mar       Jun       Sep       Dec
2017       Mar       Jun       Sep       Dec
2018       Mar       Jun       Sep       Dec
2019       Mar       Jun       Sep       Dec

>>> df2 = df1.copy()
>>> df2.loc[2015, 'Q1_filing'] = 'Apr'
>>> df2.loc[2016, 'Q2_filing'] = 'Jul'

By comparing the two, I can see that the discrepancy is in 2015 and 2016, but I don't know which one deviated from the norm.

>>> df1.compare(df2)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       NaN   NaN
2016       NaN   NaN       Jun   Jul

The natural thing for me to do now is look at 2015 Q2_filing and 2016 Q1_filing, where the two agree with each other. (You can of course look at the whole thing, but sometimes the data is too big and I just want to look at the relevant parts first.)

>>> df1.compare(df2, keep_equal=True)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       Jun   Jun
2016       Mar   Mar       Jun   Jul

With this result, I know something is probably off in the second type of report.

Member Author

Have added the example in frame.py. keep_equal does not stand on its own for Series so I did not add anything there.
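
To show why keep_equal has no effect on its own for a Series, here is a small sketch against released pandas (1.1 or later); the sample values are made up. Without keep_shape, only the differing rows are kept, and a Series has no other columns in which equal values could appear.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c"])
s2 = pd.Series(["a", "x", "c"])

default = s1.compare(s2)                                   # differing rows only
with_equal = s1.compare(s2, keep_equal=True)               # same rows, same values
full = s1.compare(s2, keep_shape=True, keep_equal=True)    # all rows, all values

print(default)
print(with_equal)
print(full)
```

The first two results contain the same single row; keep_equal only changes the output once keep_shape=True brings the equal rows back.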

@jreback
Contributor

jreback commented May 25, 2020

question above and can you merge master.

@fujiaxiang fujiaxiang requested a review from jreback May 27, 2020 03:42
@fujiaxiang
Member Author

question above and can you merge master.

answered above

@jreback
Contributor

jreback commented May 27, 2020

k thanks @fujiaxiang

any other comments @pandas-dev/pandas-core will merge in 24 hours otherwise.

Member

@WillAyd WillAyd left a comment

lgtm

@jreback jreback merged commit c9d183d into pandas-dev:master May 28, 2020
@jreback
Contributor

jreback commented May 28, 2020

thanks @fujiaxiang very nice!

Development

Successfully merging this pull request may close these issues.

Suggestion: add feature to show in detail changes in 1 df over time
8 participants