-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REFACTOR-#3519: align mask
interface with its docstring
#3520
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3520 +/- ##
===========================================
- Coverage 85.34% 46.14% -39.20%
===========================================
Files 195 181 -14
Lines 16196 15560 -636
===========================================
- Hits 13822 7180 -6642
- Misses 2374 8380 +6006
Continue to review full report at Codecov.
|
index = [] if index is None else index | ||
columns = [] if columns is None else columns | ||
index = slice(None) if index is None else index | ||
columns = slice(None) if columns is None else columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bug in the BaseQueryCompiler
. We have to grab the whole axis if the indexer is None, previously it was taking no elements instead.
modin/engines/base/frame/data.py
Outdated
] | ||
new_index = self.index[ | ||
# Pandas Index is more likely to preserve its metadata if the indexer is slice | ||
slice(row_numeric_idx.start, row_numeric_idx.stop, row_numeric_idx.step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example:
>>> index = pandas.date_range("2017/1/4", periods=10)
>>> repr(index[slice(2, 5)].freq)
'<Day>'
>>> repr(index[range(2, 5)].freq)
'None'
@@ -1190,19 +1194,31 @@ def apply_func_to_indices_both_axis( | |||
""" | |||
partition_copy = partitions.copy() | |||
row_position_counter = 0 | |||
|
|||
def compute_offset(internal_idx, remote_part, axis): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_dict_of_block_index
that is used to build arguments for this function may set slices as internal partition's indexers when processing slice-like (and so ranges) objects.
Before this PR, ranges weren't considered as slice-like objects and so this function never met slices as internal indexers, now it does. The following diffs add support of processing slices as internal indexers.
return row_lookup, col_lookup | ||
lookups = [] | ||
for axis, axis_loc in enumerate((row_loc, col_loc)): | ||
if isinstance(axis_loc, slice) and axis_loc.step is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same condition as before but negated in order to make it simpler
modin/engines/base/frame/data.py
Outdated
indexer = pandas.RangeIndex( | ||
indexer.start, indexer.stop, indexer.step | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the modin discuss we agreed on using pure python's range, however, it appeared that modin_frame.mask
apply to indexers operations that are only list/np.array specific (for example this place), pandas.RangeIndex
supports such behavior, when python's range doesn't. So I replaced range
with pandas.RangeIndex
to allow modin_frame.mask
processing range objects in the same way as numpy arrays.
Do we have any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about np.arange
instead? I prefer to avoid depending on pandas here if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.RangeIndex
is much more lightweight than the np.arange
. np.arange
creates an actual array from the passed range when pandas.RangeIndex
just stores the range info (start, stop, step) and can mimic to the regular arrays.
07d9d15
to
f237582
Compare
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@@ -236,9 +237,9 @@ def __getitem__(self, row_lookup, col_lookup, ndim): | |||
|
|||
Parameters | |||
---------- | |||
row_lookup : slice or scalar | |||
row_lookup : slice(None), range or np.ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we limit to slice(None)
from just any slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slice(None)
is a bit of a hacky value, which is used by indexing flow at the front-end as a IsFullAxisGrab
constant. Conceptually, there should be only two types of indexers: range and np.array, user's key is supposed to be converted to one of these types (e.g. slice to a range, python's list to a numpy array).
However, converting slices into something requires knowledge about the length of the indexing axis, retrieving this info from a lazy frame triggers the whole delayed computation tree.
Execution triggering seems inevitable when we're speaking about partially defined slices, we will need axis length anyway to compute its missing part, but doing the same when the slice is slice(None)
seems completely redundant. That's why _compute_lookup
function just returns the slice(None)
itself if it meets it (other slices are being converted to ranges).
So, we end up with that the valid indexers at the front-end are: range, np.array, and slice(None), and actually all of them are listed in this docstring.
Ideally, we probably want to have some class called Indexer
which would store the indexer as well as its meta-information (is_full_axis_grab
, is_range_like
and so on). This also would be able to return valid-for-backend indexers by request (None
for slice(None), range for slices != slice(None), list of ints for a boolean mask, etc). So the idea is to introduce one entity for indexers instead of juggling with a slice(None), ranges, and numpy arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add something like this to a docstring? That slice(None)
is special and its meaning:
slice(None)
is a bit of a hacky value, which is used by indexing flow at the front-end as aIsFullAxisGrab
constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done at fe2b388
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@vnlitvinov @YarShev a reminder to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address #3520 (comment), everything else looks good to me.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
range(len(self.axes[axis])), | ||
[slice(None)] * len(self.axes[axis]), | ||
range(self._partitions.shape[axis]), | ||
[slice(None)] * self._partitions.shape[axis], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expression is supposed to return a dictionary that maps partition's number to its internal indexer. Previously these lines were producing a dictionary of index labels, which is incorrect.
Until this PR these lines weren't reachable, and so the error wasn't visible.
|
||
@pytest.mark.parametrize("has_partitions_shape_cache", [True, False]) | ||
@pytest.mark.parametrize("has_frame_shape_cache", [True, False]) | ||
def test_apply_func_to_both_axis(has_partitions_shape_cache, has_frame_shape_cache): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was added in order to cover cases where we aren't passing partition shapes to the apply_func_to_indices_both_axis
.
row_lengths=None, | ||
col_widths=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added in order to not suffer from the partition's shape cache missing.
Please visit this thread for more info.
@@ -1148,19 +1160,43 @@ def apply_func_to_indices_both_axis( | |||
""" | |||
partition_copy = partitions.copy() | |||
row_position_counter = 0 | |||
|
|||
if row_lengths is None: | |||
row_lengths = [None] * len(row_partitions_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these lines are actually covered by test_apply_func_to_both_axis
, they're marked as uncovered because test-internals
CI job does not report coverage data to Codecov. We probably need to fix this in a separate PR. Created an issue (#3628) for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM! Left a few comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM! Left a few comments.
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@YarShev applied suggestions and CI is green |
Co-authored-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
@YarShev @vnlitvinov all the comments have been addressed and CI has passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@vnlitvinov a reminder to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
@@ -446,15 +447,26 @@ def mask( | |||
If both `row_indices` and `row_numeric_idx` are set, `row_indices` will be used. | |||
The same rule applied to `col_indices` and `col_numeric_idx`. | |||
""" | |||
# Check on all possible ranges |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev the analogous method in cuDFOnRayDataframe still uses the old code here. any reason not to update there too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that I've just missed cuDFOnRayDataframe
when I did this PR, so basically, there's no reason not to propagate the same logic to the cuDFOnRayDataframe.take_2d_labels_or_positional
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
Implementation of the idea discussed at the modin discuss.
This PR aligns masking implementation via
loc/iloc
with the written docstrings. This PR gets rid of passing slices to the backend as indexers, instead of them, it passespandas.RangeIndex
which is list-like (doc-string requires this) and has all the benefits of slices at the same time.flake8 modin
black --check modin
git commit -s
docs/developer/architecture.rst
is up-to-datePerformance changes
With the following ASV benchmarks added:
extra benchmarks
Results are the following:
Overall, performance seems to stay the same.
Extra benchmarks of
apply_func_to_indices_both_axis
with slice indexers, shows that the performance depending on caches hasn't changed:Ray engine
Dask engine
Reproducer
You can find more info about why it is required to test in this thread.