-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add pd.asof_merge #13358
ENH: add pd.asof_merge #13358
Conversation
@sinhrks @jorisvandenbossche @wesm API question. now we have |
raise MergeError("incompatible merge keys, " | ||
"must be the same type") | ||
|
||
# no duplicates on rhs join keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the function could handle duplicates automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its easy, my only question is this too magical. IOW would a user expect to get the dups? (e.g. it will work right now w/dups)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed to auto-deduplicate right if needed (all internal now).
Current coverage is 84.32%@@ master #13358 diff @@
==========================================
Files 138 138
Lines 50929 51058 +129
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 42926 43055 +129
Misses 8003 8003
Partials 0 0
|
b9ff55d
to
5b4f3d8
Compare
I think adding to the name space (with in reason) is better than shoe horning partially overlapping APIs together. Having args/kwargs which are not used / disallowed / have different meaning based on the value of other input should be avoided. |
try: | ||
rhs = right.iloc[rby.indices[key]] | ||
except KeyError: | ||
# key doesn't exist in left |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't the key exist in the left if the failure is from a lookup on the right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing ticker for a quotes (looks like I am bugging on it ATM
In [1]: trades = pd.read_csv('pandas/tools/tests/data/trades.csv')
In [2]: trades.time = pd.to_datetime(trades.time)
In [3]: quotes = pd.read_csv('pandas/tools/tests/data/quotes.csv')
In [4]: quotes.time = pd.to_datetime(quotes.time)
In [5]: pd.asof_merge(trades, quotes, on='time', by='ticker')
Out[5]:
time ticker price quantity marketCenter bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NASDAQ 51.95 51.95
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NASDAQ 51.95 51.95
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NASDAQ 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NASDAQ 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.93 200 NASDAQ 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.93 300 NASDAQ 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.93 600 NASDAQ 720.50 720.93
7 2016-05-25 13:30:00.048 GOOG 720.93 44 NASDAQ 720.50 720.93
8 2016-05-25 13:30:00.074 AAPL 98.67 478343 NASDAQ NaN NaN
9 2016-05-25 13:30:00.075 AAPL 98.67 478343 NASDAQ 98.55 98.56
10 2016-05-25 13:30:00.075 AAPL 98.66 6 NASDAQ 98.55 98.56
11 2016-05-25 13:30:00.075 AAPL 98.65 30 NASDAQ 98.55 98.56
12 2016-05-25 13:30:00.075 AAPL 98.65 75 NASDAQ 98.55 98.56
13 2016-05-25 13:30:00.075 AAPL 98.65 20 NASDAQ 98.55 98.56
14 2016-05-25 13:30:00.075 AAPL 98.65 35 NASDAQ 98.55 98.56
15 2016-05-25 13:30:00.075 AAPL 98.65 10 NASDAQ 98.55 98.56
16 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
17 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
18 2016-05-25 13:30:00.076 AAPL 98.56 1000 ARCA 98.55 98.56
19 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
20 2016-05-25 13:30:00.076 AAPL 98.56 300 ARCA 98.55 98.56
21 2016-05-25 13:30:00.076 AAPL 98.56 400 ARCA 98.55 98.56
22 2016-05-25 13:30:00.076 AAPL 98.56 600 ARCA 98.55 98.56
23 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
24 2016-05-25 13:30:00.078 MSFT 51.95 783 NASDAQ 51.92 51.95
25 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ 51.92 51.95
26 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ 51.92 51.95
In [6]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
IndexError: indices are out-of-bounds
In [7]: quotes[quotes.ticker!='MSFT']
Out[7]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.50 720.93
7 2016-05-25 13:30:00.072 GOOG 720.50 720.88
8 2016-05-25 13:30:00.075 AAPL 98.55 98.56
9 2016-05-25 13:30:00.076 AAPL 98.55 98.56
10 2016-05-25 13:30:00.076 AAPL 98.55 98.56
11 2016-05-25 13:30:00.076 AAPL 98.55 98.56
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have NaN in the resulting DataFrame when the right misses a key, just like in a regular left merge().
(This should be part of the regression tests as well, and I totally forgot about it earlier when I came-up with examples. Sorry about that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [5]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
Out[5]:
time ticker price quantity marketCenter bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NASDAQ NaN NaN
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NASDAQ NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NASDAQ 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NASDAQ 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.93 200 NASDAQ 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.93 300 NASDAQ 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.93 600 NASDAQ 720.50 720.93
7 2016-05-25 13:30:00.048 GOOG 720.93 44 NASDAQ 720.50 720.93
8 2016-05-25 13:30:00.074 AAPL 98.67 478343 NASDAQ NaN NaN
9 2016-05-25 13:30:00.075 AAPL 98.67 478343 NASDAQ 98.55 98.56
10 2016-05-25 13:30:00.075 AAPL 98.66 6 NASDAQ 98.55 98.56
11 2016-05-25 13:30:00.075 AAPL 98.65 30 NASDAQ 98.55 98.56
12 2016-05-25 13:30:00.075 AAPL 98.65 75 NASDAQ 98.55 98.56
13 2016-05-25 13:30:00.075 AAPL 98.65 20 NASDAQ 98.55 98.56
14 2016-05-25 13:30:00.075 AAPL 98.65 35 NASDAQ 98.55 98.56
15 2016-05-25 13:30:00.075 AAPL 98.65 10 NASDAQ 98.55 98.56
16 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
17 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
18 2016-05-25 13:30:00.076 AAPL 98.56 1000 ARCA 98.55 98.56
19 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
20 2016-05-25 13:30:00.076 AAPL 98.56 300 ARCA 98.55 98.56
21 2016-05-25 13:30:00.076 AAPL 98.56 400 ARCA 98.55 98.56
22 2016-05-25 13:30:00.076 AAPL 98.56 600 ARCA 98.55 98.56
23 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
24 2016-05-25 13:30:00.078 MSFT 51.95 783 NASDAQ NaN NaN
25 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ NaN NaN
26 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ NaN NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4588 reared its ugly head, so had to 'fix' it.
6344d19
to
0cf6339
Compare
tolerance: integer or Timedelta, optional, default None | ||
select asof tolerance within this range, must be compatible | ||
to the merge index. | ||
strict: boolean, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, I found this keyword name counter-intuitive. More verbose, but would something like allow_exact_matches=True
be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"strict" means "strictly less-than", as opposed to "less-than-or-equal-to". I agree that the parameter's explanation could be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about one of these:
strict='less_than_or_equal' | 'less_than'
match='less_than_or_equal' | 'less_than'
match_less_that_or_equal=True | False
allow_exact_matches=True | False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anyone have thoughts on strict
-> new name? (from above)
@jreback Just a general comment: for me (as a not financial guy :-)), the current explanation of the method (and also the examples) is not really clarifying on what it exactly does .. (at a first read, without looking at the code, which a user should not have to do). So we will have to do a better job explaining it I think. Some things I was thinking:
Will take a more detailed look myself later. |
f346fa0
to
961ac76
Compare
@jreback I think this needs to be a new API. I think adding |
Ok taking @chrisaycock suggestion, changed Here's the new signature.
|
any other comments? |
What about this for the help statement?
|
updated according to @chrisaycock doc-string |
|
||
Similar to an ``pd.ordered_merge()``, an ``pd.asof_merge()`` allows one to propogate forward | ||
values that are missing. ``pd.asof_merge()`` allows some specific options on this propogation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This propagates forward values that are not missing, right? Also, not sure what the second sentence is saying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote basically using the new doc-string.
To @jorisvandenbossche's point, how about adding a smaller context-free example that makes it a little easier to see exactly what's going on. Something like this: (edit: changed a = pd.DataFrame({'a':[1, 5, 10],
'left_val':['a', 'b', 'c']})
b = pd.DataFrame({'a': [1, 2, 3, 6, 7],
'right_val': [1, 2, 3, 6, 7],})
pd.asof_merge(a, b, on='a') It might also be good to show an |
sure maybe just in the actual docs / or in doc-string as well? |
yeah, I guess was just thinking about the docs |
allow_exact_matches : boolean, default True | ||
|
||
- If True, allow matching the same 'on' value | ||
(i.e. less-than-or-equal-to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you align the (i.e.
with If
(otherwise sphinx will complain about unexpected indentation). Same for the lines below (and for check_duplicates
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sure. that makes sense.
01ff187
to
4e37577
Compare
any final issues? @jorisvandenbossche @TomAugspurger |
|
||
A long-time requested feature has been added thru the ``pd.merge_asof()`` function, to | ||
support asof style joining of time-series. (:issue:`1870`). Full documentaion are | ||
:ref:`here <mergeing.merge_asof>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mergeing -> merging
@jreback I added another bunch of doc comments (they are not critical, I can also do a PR later if you want) But I wanted to come back to the duplicate handling for a moment. In case of only
So in case of an exact match, you get both duplicates (consistent with (stressing again that I do not know the financial application of this method, but just look at it based on my understanding of the general concept of an asof join, so I don't know what the 'logical' way is to treat duplicates. I just notice that it is inconsistent with |
ENH: add DataFrame.asof ENH: add pd.asof_merge closes pandas-dev#1870
your example of the duplicates is correct. This is exactly the purpose of an asof merge, NOT to duplicate (pun!) the point of merge. An asof-merge will get you exactly 1 value for each of the left. That's why duplicates are a bit odd, you normally don't have them.
I added a test with your examples in any event (as well as updated the docs). |
@jorisvandenbossche ready to go. |
(sorry for the late answer) Further, I have another question on the duplicate. There is also a difference in handling of duplicates when using
So when not using |
(BTW, it's OK for me to merge this version while I further ask questions, also easier to test when in master and to further clean-up docs) |
Using a slightly other example (combining both cases - duplicate equal and non-equal keys in the right frame - in one dataframe:
So actually (in this case), the |
@jreback OK, I just see you actually changed this ( |
ok merged, will see about docs and do a followup probably |
closes #1870
xref #2941
Here is a notebook of example usage and timings