asof function implementation #1222
I'm assuming this is the issue for "make it not just convert to Pandas DataFrame." Here are my thoughts on how this might work; I would appreciate feedback. First, a naive "do a map() on each partition" is bad, because it means lots of partitions lower down in the index will do extra work that gets thrown away. Unless there are a lot of NaNs, you only have to check the bit of the index based on the So possibly parallelism isn't really the thing to do; just make it more efficient than converting to a pandas DataFrame first. For a single
This requires an iterate-in-reverse-index-values-over-partitions in ... For multiple
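As a rough illustration of the reverse-iteration idea, here is a minimal sketch for a single `where` value on a Series. It assumes the partitions are plain pandas Series whose sorted indexes, taken in order, form one sorted index; the helper name `asof_reverse_scan` is hypothetical and is not part of Modin.

```python
import numpy as np
import pandas as pd


def asof_reverse_scan(partitions, where):
    """Return the last non-NaN value at or before `where`, or NaN if none.

    `partitions` is a list of pandas Series (a stand-in for real partitions)
    whose sorted indexes, concatenated in order, form one sorted index.
    """
    for part in reversed(partitions):
        # Partitions that start after `where` cannot contribute.
        if len(part) == 0 or part.index[0] > where:
            continue
        # Rows at or before `where` in this partition, with NaN dropped.
        candidates = part.loc[:where].dropna()
        if len(candidates) > 0:
            # Found a valid value: earlier partitions are never touched.
            return candidates.iloc[-1]
    return np.nan


s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
partitions = [s.iloc[:2], s.iloc[2:]]
result = asof_reverse_scan(partitions, 30)
```

Unless a partition is mostly NaN, the loop stops at the first partition it actually inspects, which is the saving over mapping across every partition.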
@devin-petersohn is the above... reasonable? Off the mark? Ideas are welcome.
@itamarst Sorry for the delayed response, I was taking a much needed break. My initial thoughts are that the iterative approach would likely be relatively expensive, because some results must be gathered to the driver program to check if they are valid. As you say, the
new_idx = compute_asof(self.index, where, subset) # compute_asof here represents the logic to get the locs from the index
return self.loc[new_idx].set_axis(where, axis=0)
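For illustration only, one way the index-only lookup that `compute_asof` stands for could look is a binary search over a sorted index. This is a sketch under assumptions, not the code proposed in the thread: the name `compute_asof_positions` is hypothetical, it returns integer positions rather than labels, and it ignores the values entirely, so a row holding NaN can still be selected.

```python
import numpy as np
import pandas as pd


def compute_asof_positions(index, where):
    """Position of the last index entry <= each `where` value, or -1 if none.

    Hypothetical helper: assumes `index` is sorted and looks only at the
    index, never at the data.
    """
    where = np.asarray(where)
    # side="right" gives the insertion point after any equal entries, so
    # subtracting one lands on the last entry that is <= `where`.
    return index.searchsorted(where, side="right") - 1


index = pd.Index([10, 20, 30, 40])
positions = compute_asof_positions(index, [5, 20, 35])
```

A position of -1 marks a `where` value that precedes the whole index; a caller would have to map that to NaN instead of using it as an offset.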
The problem is that
And there's an example:
>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10 1.0
20 2.0
30 NaN
40 4.0
dtype: float64
>>> s.asof(20)
2.0
So it's looking at the actual values too; you can't work purely with the index. Assuming by "driver" you mean the process that the user is talking to, I don't think you can avoid sending data back. But perhaps one can send very little... In particular, assuming
Sending back bools is not strictly necessary for the single-column case, but is good for the multiple-column case, where you'd send back an array of bools, one bool per column.
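To make the point about values concrete, the sketch below reuses the Series from the example and compares an index-only lookup with `Series.asof` at `where=30`, where the stored value is NaN. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])

# Index-only lookup: position of the last label <= 30, ignoring the values.
pos = s.index.searchsorted(30, side="right") - 1
index_only = s.iloc[pos]

# Value-aware lookup: asof skips the NaN at label 30 and falls back to 20.
value_aware = s.asof(30)
```

Here `index_only` is NaN while `value_aware` is 2.0, which is why whatever runs on each partition has to inspect the data and not just the index.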
Yes, by driver I meant the process the user is submitting commands to. What about a
Yeah, if
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
…llback (modin-project#1989) Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>