ENH: Added DataFrame.compare and Series.compare (GH30429) #30852
ping
will have a look soon.
cc @pandas-dev/pandas-core
…ies_differences # Conflicts: # pandas/core/series.py
ping
some comments, pls merge master
…ies_differences # Conflicts: # pandas/core/series.py
ping
ping
looks good, a few small comments.
@pandas-dev/pandas-core any comments.
Do we have a test where the two objects are not aligned? e.g.
|
@mroeschke I have just added some. I'm expecting it to raise |
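The reply above is cut off on this page, so which exception the author expected cannot be recovered from it. As a rough check against a released pandas (1.1 or later, where `compare` exists), the sketch below shows that comparing two frames with different index labels raises a `ValueError`. The frames `a` and `b` are made up for illustration and are not the tests added in this PR.

```python
import pandas as pd

# Two hypothetical frames with the same values but shifted index labels.
a = pd.DataFrame({"col": [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({"col": [1, 2, 3]}, index=[1, 2, 3])

try:
    a.compare(b)
    error_message = None
except ValueError as exc:
    # Released pandas refuses to compare objects that are not identically labeled.
    error_message = str(exc)

print(error_message)
```

Reindexing both objects to a common index before calling `compare` is one way to avoid the error when a label mismatch is expected.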
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True)
does keep_equal make sense w/o keep_shape==True? IOW does it stand on its own? can you add an example of just using it
Yes I would say so. It helps identify which one could be the "anomaly". I myself have come across such use cases.
For example, say I'm looking at the date a company publishes 2 types of their quarterly report.
>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=['Q1_filing', 'Q2_filing', 'Q3_filing', 'Q4_filing'], index=list(range(2010, 2020)))
>>> df1['Q1_filing'] = 'Mar'
>>> df1['Q2_filing'] = 'Jun'
>>> df1['Q3_filing'] = 'Sep'
>>> df1['Q4_filing'] = 'Dec'
>>> df1
     Q1_filing Q2_filing Q3_filing Q4_filing
2010       Mar       Jun       Sep       Dec
2011       Mar       Jun       Sep       Dec
2012       Mar       Jun       Sep       Dec
2013       Mar       Jun       Sep       Dec
2014       Mar       Jun       Sep       Dec
2015       Mar       Jun       Sep       Dec
2016       Mar       Jun       Sep       Dec
2017       Mar       Jun       Sep       Dec
2018       Mar       Jun       Sep       Dec
2019       Mar       Jun       Sep       Dec
>>> df2 = df1.copy()
>>> df2.loc[2015, 'Q1_filing'] = 'Apr'
>>> df2.loc[2016, 'Q2_filing'] = 'Jul'
By comparing the two, I can see that the discrepancy is in 2015 and 2016, but I don't know which one deviated from the norm.
>>> df1.compare(df2)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       NaN   NaN
2016       NaN   NaN       Jun   Jul
The natural thing for me to do now is look at 2015 Q2_filing and 2016 Q1_filing where they agree with each other. (You can of course look at the whole thing but sometimes data is too big and I just want to take a look at the relevant ones first)
>>> df1.compare(df2, keep_equal=True)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       Jun   Jun
2016       Mar   Mar       Jun   Jul
With this result I know probably something is off for the second type of reports.
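For anyone who wants to run the example as a script, here is a condensed sketch of the same filing data that also adds the `keep_shape=True` variants for contrast. It assumes pandas 1.1 or later; the variable names (`diff_only`, `full`, and so on) are mine, and the frame is built from a dict rather than column by column.

```python
import pandas as pd

# Same data as the interactive session above, built in one call.
df1 = pd.DataFrame(
    {"Q1_filing": "Mar", "Q2_filing": "Jun", "Q3_filing": "Sep", "Q4_filing": "Dec"},
    index=range(2010, 2020),
)
df2 = df1.copy()
df2.loc[2015, "Q1_filing"] = "Apr"
df2.loc[2016, "Q2_filing"] = "Jul"

# Default: only rows/columns with a difference, equal cells masked as missing.
diff_only = df1.compare(df2)

# keep_equal alone: same rows/columns, but equal cells show their real values.
diff_with_equal = df1.compare(df2, keep_equal=True)

# keep_shape alone: every row and column is kept, equal cells masked as missing.
full = df1.compare(df2, keep_shape=True)

# Both flags: every row and column, every original value.
full_with_equal = df1.compare(df2, keep_shape=True, keep_equal=True)

print(diff_only.shape, diff_with_equal.shape, full.shape, full_with_equal.shape)
```

The middle case (`keep_equal=True` without `keep_shape`) is the one argued for above: the output stays small, but the neighbouring values that agree are still visible.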
Have added the example in frame.py. keep_equal does not stand on its own for Series, so I did not add anything there.
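A small sketch of why `keep_equal` adds nothing for a Series unless `keep_shape` is also set: without `keep_shape`, only the differing rows survive, so there are no equal values left to show. The series `s1` and `s2` are invented for illustration, and this assumes the released `Series.compare` (pandas 1.1 or later).

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# Only rows 1 and 3 differ, so only they are returned.
plain = s1.compare(s2)

# keep_equal has nothing extra to reveal: the result matches `plain`.
with_equal = s1.compare(s2, keep_equal=True)

# With keep_shape as well, the equal rows 0 and 2 are kept with their values.
both = s1.compare(s2, keep_shape=True, keep_equal=True)

print(plain.equals(with_equal))
print(both)
```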
question above and can you merge master.
…ies_differences # Conflicts: # pandas/core/series.py
answered above
k thanks @fujiaxiang any other comments @pandas-dev/pandas-core will merge in 24 hours otherwise.
lgtm
thanks @fujiaxiang very nice!
Added DataFrame.differences and Series.differences methods.
Have not yet added whatsnew entries. Will do so after review of API and behavior design.
A few design considerations open for discussion (among other things):
Index/column names: (self, other) vs (left, right) vs (old, new)
I used the (self, other) pair as I think it naturally matches the parameter names and the concept of comparing self with other. Although (left, right) or (old, new) may be slightly more intuitive for users in some use cases, these new functions do not necessarily compare two objects that are left and right (they may be stacked on axis=0), or old and new.
Parameter name: axis
I feel axis may not be a very clear name, since we are stacking the output along this axis, rather than comparing objects along this axis. I considered using the name stack_axis or orient, like DataFrame.to_json does.
I made the methods such that two NaNs are considered equal, hence not different. I think this should be the desired behavior since we are trying to compare two objects. However, under the current implementation, a None will be considered equal to np.nan too. This is probably something we need to be careful about.
Let me know what you guys think. I'm ready to be scrutinized and criticized!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff