asof function implementation #1222

aregm · 2020-04-15T03:45:04Z

No description provided.

itamarst · 2020-08-11T17:23:36Z

I'm assuming this is the issue for "make it not just convert to Pandas DataFrame."

Here are my thoughts on how this might work, would appreciate feedback:

First, naive "do a map() on each partition" is bad, because it means lots of partitions lower down in the index will do extra work that gets thrown away. Unless there are a lot of NaN, you only have to check the bit of the index based on the where clause, not earlier partitions.

So possibly parallelism isn't really the thing to do, just make it more efficient than converting to pandas DataFrame first.

For a single where item:

Running the Pandas algorithm that does this on the partition that is closest to the where.
If that doesn't return anything, do this on partition matching lower part of the index.
Repeat until you hit beginning of the index.

This requires a iterate-in-reverse-index-values-over-partitions in ... BasePandasFrame? Or something.

For multiple where items, one would similarly start at highest point index, iterate in reverse. An optimization would be skipping partitions once you'd found the higher values. So e.g. if where is [10, 20, 1000], once you'd found result for 1000 you might be able to skip a few partition chunks until you reach the one for 20.

itamarst · 2020-08-11T17:24:04Z

@devin-petersohn is the above... reasonable? Off the mark? Ideas are welcome.

devin-petersohn · 2020-08-24T12:52:44Z

@itamarst Sorry for the delayed response, I was taking a much needed break.

My initial thoughts are that the iterative approach would likely be relatively expensive, because some results much be gathered to the driver program to check if they are valid. As you say, the map approach would yield a large amount of unnecessary rows that will need to be thrown out. Then the question becomes how to do that. I agree with your statement "do a map() on each partition is bad" for the reasons you mention.

asof is done on the index, so would it be possible to do something like this without triggering computation?:

new_idx = compute_asof(self.index, where, subset)  # compute_asof here represents the logic to get the locs from the index
return self.loc[new_idx].set_axis(where, axis=0)

itamarst · 2020-08-24T15:35:19Z

The problem is that asof isn't just index.

The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN considering only the subset of columns (if not None).

And there's an example:

>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10    1.0
20    2.0
30    NaN
40    4.0
dtype: float64
>>> s.asof(20)
2.0

So it's looking at the actual values too, you can't work purely with the index.

Assuming by "driver" you mean the process that the user is talking to, I don't think you can avoid sending data back. But perhaps one can send very little...

In particular, assuming asof with a single column:

Find lowest value in index that still matches the where (typically a date). This is index-only, and filters out a bunch of the data already.
Then, iteratively starting with highest-in-index chunk, check just if the chunk is completely NaN. So you're only sending back a bool.
Once you find chunk that isn't completely NaN, you can just run normal asof logic on that chunk.

Sending back bools is not strictly necessary for single column, but is good for multiple columns case, where you'd send back an array of bools, one bool per column.

devin-petersohn · 2020-08-27T13:09:35Z

Yes, by driver I meant the process the user is submitting commands to.

What about a dropna, which is embarrassingly parallel, followed by asof on the index? That would simplify things from an implementation perspective and should be fast.

itamarst · 2020-08-28T17:29:59Z

Yeah, if dropna is fast then yeah that works.

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

…llback (modin-project#1989) Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

aregm added the pandas.dataframe Related to pandas.dataframe module label Apr 15, 2020

itamarst added a commit to itamarst/modin that referenced this issue Aug 31, 2020

FEAT-modin-project#1222: Better tests for DataFrame.asof().

ab2ca74

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Aug 31, 2020

FEAT-modin-project#1222: Implement DataFrame.asof() using Modin.

ca779db

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst mentioned this issue Aug 31, 2020

FEAT-#1222: Implement DataFrame.asof() without Pandas fallback #1989

Merged

8 tasks

itamarst added a commit to itamarst/modin that referenced this issue Sep 1, 2020

FEAT-modin-project#1222: Reformat with Black.

fac42a4

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 1, 2020

FEAT-modin-project#1222: No need for transpose() when doing squeeze().

4774748

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 2, 2020

FEAT-modin-project#1222: Merge 'upstream/master' into asof/1222

67115f8

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 2, 2020

FEAT-modin-project#1222: Document that asof is now in Modin.

604398b

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020

FEAT-modin-project#1222: Add more tests for Series.asof().

17a847b

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020

FEAT-modin-project#1222: Series.asof now uses Modin path as well.

3da35dc

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

itamarst added a commit to itamarst/modin that referenced this issue Sep 3, 2020

FEAT-modin-project#1222: More thorough testing of asof().

dafa362

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

devin-petersohn closed this as completed in #1989 Sep 4, 2020

devin-petersohn pushed a commit that referenced this issue Sep 4, 2020

FEAT-#1222: Implement DataFrame.asof() without Pandas fallback (#1989)

0768600

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

FEAT-modin-project#1222: Implement DataFrame.asof() without Pandas fa…

07fbd71

…llback (modin-project#1989) Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

asof function implementation #1222

asof function implementation #1222

aregm commented Apr 15, 2020

itamarst commented Aug 11, 2020

itamarst commented Aug 11, 2020

devin-petersohn commented Aug 24, 2020

itamarst commented Aug 24, 2020 •

edited

Loading

devin-petersohn commented Aug 27, 2020

itamarst commented Aug 28, 2020

asof function implementation #1222

asof function implementation #1222

Comments

aregm commented Apr 15, 2020

itamarst commented Aug 11, 2020

itamarst commented Aug 11, 2020

devin-petersohn commented Aug 24, 2020

itamarst commented Aug 24, 2020 • edited Loading

devin-petersohn commented Aug 27, 2020

itamarst commented Aug 28, 2020

itamarst commented Aug 24, 2020 •

edited

Loading