ENH: Added DataFrame.compare and Series.compare (GH30429) #30852

fujiaxiang · 2020-01-09T16:14:08Z

Added DataFrame.differences and Series.differences methods.

Have not yet added whatsnew entries. Will do so after review of API and behavior design.

A few design considerations open for discussion (among other things):

Index/column names: (self, other) vs (left, right) vs (old, new)
I used the (self, other) pair because it matches the parameter names and the idea of comparing self with other. Although (left, right) or (old, new) may be slightly more intuitive in some use cases, these new functions do not necessarily compare two objects that are left and right (the output may be stacked on axis=0) or old and new.
Parameter name: axis
I feel axis may not be a very clear name, since we are stacking the output along this axis rather than comparing objects along it. I considered using the name stack_axis, or orient as DataFrame.to_json does.
NaN handling
I made the methods treat two NaNs as equal, hence not different. I think this is the desired behavior since we are trying to compare two objects. However, under the current implementation a None is considered equal to np.nan too. This is probably something we need to be careful about.
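The NaN rule and the stacking axis described above can be checked against the released method. This is a minimal sketch, assuming a pandas version that ships DataFrame.compare (1.1 or later). The frames left and right are made up for illustration, and align_axis is the name the stacking parameter carries in released pandas, not the axis name discussed in this comment.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: row 1 holds missing values on both sides
# (None on the left of column "b", np.nan on the right).
left = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})
right = pd.DataFrame({"a": [1.0, np.nan, 4.0], "b": ["x", np.nan, "z"]})

# Default layout: self/other sit side by side under each differing column.
side_by_side = left.compare(right)

# align_axis=0 stacks self/other as an inner index level instead.
stacked = left.compare(right, align_axis=0)

print(side_by_side)
print(stacked)
```

Only row 2 of column "a" should be reported: the NaN/NaN pair and the None/np.nan pair in row 1 both count as equal, which is the caveat raised above.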

Let me know what you guys think. I'm ready to be scrutinized and criticized!

closes Suggestion: add feature to show in detail changes in 1 df over time #30429
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

fujiaxiang · 2020-01-16T14:58:55Z

ping

jreback

will have a look soon.

cc @pandas-dev/pandas-core

fujiaxiang · 2020-01-31T02:36:46Z

ping

jreback

some comments, pls merge master

jreback · 2020-02-09T22:23:48Z

cc @TomAugspurger @jorisvandenbossche

fujiaxiang · 2020-05-01T06:49:08Z

ping

fujiaxiang · 2020-05-12T01:41:27Z

ping

jreback

looks good, a few small comments.

@pandas-dev/pandas-core any comments.

mroeschke · 2020-05-12T16:40:08Z

Do we have a test where the two objects are not aligned?

e.g.

df1 = pd.DataFrame(np.ones((3,3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

fujiaxiang · 2020-05-15T02:55:30Z

Do we have a test where the two objects are not aligned?

e.g.
df1 = pd.DataFrame(np.ones((3,3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

@mroeschke I have just added some. They expect a ValueError, checked with pytest.raises.
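The misaligned case from the question can be reproduced outside the test suite. This is a minimal sketch, assuming pandas 1.1 or later; it uses a plain try/except in place of pytest.raises so it runs as a script, and the raised flag is only an illustration aid.

```python
import numpy as np
import pandas as pd

# The two differently shaped frames from the question above.
df1 = pd.DataFrame(np.ones((3, 3)))
df2 = pd.DataFrame(np.zeros((2, 1)))

try:
    df1.compare(df2)
except ValueError as exc:
    # compare refuses objects whose index and columns do not match.
    raised = True
    print(f"ValueError: {exc}")
else:
    raised = False
```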

jreback · 2020-05-17T21:38:38Z

pandas/core/frame.py

+
+Keep all original rows and columns and also all original values
+
+>>> df.compare(df2, keep_shape=True, keep_equal=True)


does keep_equal make sense w/o keep_shape==True? IOW does it stand on its own? can you add an example of just using it

Yes I would say so. It helps identify which one could be the "anomaly". I myself have come across such use cases.

For example, say I'm looking at the date a company publishes 2 types of their quarterly report.

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=['Q1_filing', 'Q2_filing', 'Q3_filing', 'Q4_filing'], index=list(range(2010, 2020)))
>>> df1['Q1_filing'] = 'Mar'
>>> df1['Q2_filing'] = 'Jun'
>>> df1['Q3_filing'] = 'Sep'
>>> df1['Q4_filing'] = 'Dec'
>>> df1
     Q1_filing Q2_filing Q3_filing Q4_filing
2010       Mar       Jun       Sep       Dec
2011       Mar       Jun       Sep       Dec
2012       Mar       Jun       Sep       Dec
2013       Mar       Jun       Sep       Dec
2014       Mar       Jun       Sep       Dec
2015       Mar       Jun       Sep       Dec
2016       Mar       Jun       Sep       Dec
2017       Mar       Jun       Sep       Dec
2018       Mar       Jun       Sep       Dec
2019       Mar       Jun       Sep       Dec
>>> df2 = df1.copy()
>>> df2.loc[2015, 'Q1_filing'] = 'Apr'
>>> df2.loc[2016, 'Q2_filing'] = 'Jul'

By comparing the two, I can see that the discrepancy is in 2015 and 2016, but I don't know which one deviated from the norm.

>>> df1.compare(df2)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       NaN   NaN
2016       NaN   NaN       Jun   Jul

The natural thing for me to do now is look at 2015 Q2_filing and 2016 Q1_filing where they agree with each other. (You can of course look at the whole thing but sometimes data is too big and I just want to take a look at the relevant ones first)

>>> df1.compare(df2, keep_equal=True)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       Jun   Jun
2016       Mar   Mar       Jun   Jul

With this result I know probably something is off for the second type of reports.

Have added the example in frame.py. keep_equal does not stand on its own for Series so I did not add anything there.
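The remark about Series can be illustrated with a short check. This is a minimal sketch, assuming pandas 1.1 or later; the series s1 and s2 are made up, and float values are used so both calls return the same dtype.

```python
import pandas as pd

# Hypothetical series that differ only at position 1.
s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([1.0, 5.0, 3.0])

# Without keep_shape, only differing rows survive, so there are no
# equal values left for keep_equal to reveal.
only_diffs = s1.compare(s2)
with_equal = s1.compare(s2, keep_equal=True)

# keep_equal matters once keep_shape=True retains the equal rows.
full = s1.compare(s2, keep_shape=True, keep_equal=True)

print(only_diffs)
print(with_equal)
print(full)
```

For a DataFrame the situation differs, as in the filing example above: a retained row can still hold equal values in other columns, which keep_equal then shows.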

jreback · 2020-05-25T22:13:19Z

question above and can you merge master.

fujiaxiang · 2020-05-27T04:38:18Z

question above and can you merge master.

answered above

jreback · 2020-05-27T13:19:50Z

k thanks @fujiaxiang

any other comments @pandas-dev/pandas-core will merge in 24 hours otherwise.

WillAyd

lgtm

jreback · 2020-05-28T17:08:52Z

thanks @fujiaxiang very nice!

fujiaxiang added 2 commits January 9, 2020 23:52

ENH: Added DataFrame.differences and Series.differences (GH30429)

c13af19

CLN: reformatted docstring (GH30429)

8f5d0fb

WillAyd reviewed Jan 9, 2020

pandas/core/frame.py (outdated, resolved)

fujiaxiang added 2 commits January 10, 2020 22:16

ENH: Extracted differences() from DataFrame and Series into NDFrame

c5b793a

Merge branch 'master' into dataframe_and_series_differences

5eff415

WillAyd requested changes Jan 16, 2020

pandas/core/frame.py (outdated, resolved)

fujiaxiang added 2 commits January 18, 2020 10:35

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

0bc8529

ENH: organized docstring using _shared_doc and reduced duplicates (GH30429)

d22e21a

fujiaxiang requested a review from WillAyd January 18, 2020 04:09

fujiaxiang added 3 commits January 18, 2020 12:12

ENH: added argument type indication (GH30429)

83f31df

ENH: reordered imports (GH30429)

488c8a8

ENH: removed inconsistent type indication (GH30429)

322ff20

jreback added the "API - Consistency" label (Internal Consistency of API/Behavior) Jan 18, 2020

jreback requested changes Jan 18, 2020

fujiaxiang added 4 commits January 30, 2020 22:54

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

71f5eef

Conflicts: pandas/core/series.py

ENH: Added whatsnew entry (GH30429)

e50172c

ENH: Minor correction in whatsnew entry (GH30429)

4a82bec

ENH: Minor correction in whatsnew entry (GH30429)

b2849ed

fujiaxiang closed this Jan 30, 2020

fujiaxiang reopened this Jan 30, 2020

fujiaxiang added 2 commits January 31, 2020 08:47

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

8e0e441

ENH: Correction in whatsnew entry (GH30429)

ff7a572

fujiaxiang requested a review from jreback January 31, 2020 02:36

jreback requested changes Feb 9, 2020

doc/source/whatsnew/v1.1.0.rst (resolved)

pandas/core/frame.py (resolved)

pandas/core/generic.py (outdated, resolved)

pandas/core/generic.py (resolved)

pandas/core/generic.py (resolved)

fujiaxiang added 3 commits February 10, 2020 11:50

ENH: updated whatsnew (GH31200)

bc969e8

ENH: added doc references (GH31200)

26c6ca6

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

de2195b

Conflicts: pandas/core/series.py

fujiaxiang added 6 commits April 29, 2020 16:32

Merge branch 'master' into dataframe_and_series_differences

c5246d6

resolved a linting issue

91758c8

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

7dde706

updated whatsnew entry

774ff5d

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

eb6d33d

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

5d34fc4

Merge branch 'master' into dataframe_and_series_differences

c358e3d

jreback changed the title from "ENH: Added DataFrame.differences and Series.differences (GH30429)" to "ENH: Added DataFrame.compare and Series.compare (GH30429)" May 12, 2020

jreback reviewed May 12, 2020

doc/source/whatsnew/v1.1.0.rst (resolved)

pandas/core/generic.py (resolved)

pandas/tests/frame/methods/test_compare.py (resolved)

jreback added this to the 1.1 milestone May 12, 2020

added doc in user guide merging.rst and more tests

0189623

removed trailing space in docstring and blackified code

b0b3e24

fujiaxiang requested a review from jreback May 15, 2020 04:28

jreback requested changes May 17, 2020

fujiaxiang added 2 commits May 27, 2020 11:12

Merge remote-tracking branch 'upstream/master' into dataframe_and_series_differences

cdb03b2

Conflicts: pandas/core/series.py

added one more example in docstring of DataFrame.compare

007eeb7

fujiaxiang requested a review from jreback May 27, 2020 03:42

jreback approved these changes May 27, 2020

WillAyd approved these changes May 27, 2020

jreback merged commit c9d183d into pandas-dev:master May 28, 2020

simonjayhawkins mentioned this pull request Sep 17, 2020

BUG: Concat typing #36409

Merged

5 tasks

qmarcou mentioned this pull request Aug 2, 2024

ENH: Add option for DataFrame.compare and Series.compare to overlook differences involving missing values #59390

Closed

3 tasks
