Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asof function implementation #1222

Closed
aregm opened this issue Apr 15, 2020 · 6 comments · Fixed by #1989
Closed

asof function implementation #1222

aregm opened this issue Apr 15, 2020 · 6 comments · Fixed by #1989
Labels
pandas.dataframe Related to pandas.dataframe module

Comments

@aregm
Copy link
Collaborator

aregm commented Apr 15, 2020

No description provided.

@aregm aregm added the pandas.dataframe Related to pandas.dataframe module label Apr 15, 2020
@itamarst
Copy link
Contributor

I'm assuming this is the issue for "make it not just convert to Pandas DataFrame."

Here are my thoughts on how this might work, would appreciate feedback:

First, naive "do a map() on each partition" is bad, because it means lots of partitions lower down in the index will do extra work that gets thrown away. Unless there are a lot of NaN, you only have to check the bit of the index based on the where clause, not earlier partitions.

So possibly parallelism isn't really the thing to do, just make it more efficient than converting to pandas DataFrame first.

For a single where item:

  1. Running the Pandas algorithm that does this on the partition that is closest to the where.
  2. If that doesn't return anything, do this on partition matching lower part of the index.
  3. Repeat until you hit beginning of the index.

This requires a iterate-in-reverse-index-values-over-partitions in ... BasePandasFrame? Or something.

For multiple where items, one would similarly start at highest point index, iterate in reverse. An optimization would be skipping partitions once you'd found the higher values. So e.g. if where is [10, 20, 1000], once you'd found result for 1000 you might be able to skip a few partition chunks until you reach the one for 20.

@itamarst
Copy link
Contributor

@devin-petersohn is the above... reasonable? Off the mark? Ideas are welcome.

@devin-petersohn
Copy link
Collaborator

@itamarst Sorry for the delayed response, I was taking a much needed break.

My initial thoughts are that the iterative approach would likely be relatively expensive, because some results much be gathered to the driver program to check if they are valid. As you say, the map approach would yield a large amount of unnecessary rows that will need to be thrown out. Then the question becomes how to do that. I agree with your statement "do a map() on each partition is bad" for the reasons you mention.

asof is done on the index, so would it be possible to do something like this without triggering computation?:

new_idx = compute_asof(self.index, where, subset)  # compute_asof here represents the logic to get the locs from the index
return self.loc[new_idx].set_axis(where, axis=0)

@itamarst
Copy link
Contributor

itamarst commented Aug 24, 2020

The problem is that asof isn't just index.

The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN considering only the subset of columns (if not None).

And there's an example:

>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10    1.0
20    2.0
30    NaN
40    4.0
dtype: float64
>>> s.asof(20)
2.0

So it's looking at the actual values too, you can't work purely with the index.

Assuming by "driver" you mean the process that the user is talking to, I don't think you can avoid sending data back. But perhaps one can send very little...

In particular, assuming asof with a single column:

  1. Find lowest value in index that still matches the where (typically a date). This is index-only, and filters out a bunch of the data already.
  2. Then, iteratively starting with highest-in-index chunk, check just if the chunk is completely NaN. So you're only sending back a bool.
  3. Once you find chunk that isn't completely NaN, you can just run normal asof logic on that chunk.

Sending back bools is not strictly necessary for single column, but is good for multiple columns case, where you'd send back an array of bools, one bool per column.

@devin-petersohn
Copy link
Collaborator

Yes, by driver I meant the process the user is submitting commands to.

What about a dropna, which is embarrassingly parallel, followed by asof on the index? That would simplify things from an implementation perspective and should be fast.

@itamarst
Copy link
Contributor

Yeah, if dropna is fast then yeah that works.

itamarst added a commit to itamarst/modin that referenced this issue Aug 31, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Aug 31, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 1, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 1, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 2, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 2, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
devin-petersohn pushed a commit that referenced this issue Sep 4, 2020
Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
…llback (modin-project#1989)

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pandas.dataframe Related to pandas.dataframe module
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants