You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
IndexingRelated to indexing on series/frames, not to indexes themselvesInternalsRelated to non-user accessible pandas implementationRefactorInternal refactoring of code
where you can see that we first select the slice for each column (all blocks), and then in a second loop iteration select the columns we need. For this specific case, that means that we are also unnecessarily indexing the columns (blocks) that we don't need + we are creating an intermediate DataFrame.
(for this specific case with a single block in the example above, it won't matter much, but in general you can have many blocks of course. Also, for a slice it won't be that important, but eg boolean filtering or integer indexing all columns is more expensive)
Similarly, consider a second example with a setitem operation:
df=pd.DataFrame(np.random.randn(10, 4), columns=["a", "b", "c", "d"])
df["d"] =1# making it a multi-block df to not take the _setitem_single_block pathdf.loc[1:4, ["b", "c"]] =0.0
This specific case is handled in _setitem_with_indexer_split_path with
So also here the operation is done column by column independently. The _setitem_single_column then accesses the column as a Series, updates that series and sets it again in the DataFrame (self.obj._iset_item(loc, ser)).
Also here this can be less efficient in some cases (eg if you assign into multiple columns of the same block), although I think this will be less prevalent as the getitem case above. But I also ran into this in my Copy-on-Write POC (#41878) where the implementation updating each column by updating a Series is problematic (I want to have the final update logic inside the manager, so I can manage the reference tracking / copy on write there, for this specific PR. For that purpose, I introduced a SingleArrayManager.setitem_column() method).
In general, I think it would be a cleaner interface to leave it up to the Manager to see how it handles the multiple dimensions of the indexer (after all validation, pre-processing of the indexers (up to purely positional indexers) and the value in indexing.py, of course). And that way, try to move some of the "split path" or not complexity into the internals.
We could for example have a DataManager.getitem/setitem method with a signature like
for the case where indexing a DataFrame results in a DataFrame (so not reducing the dimension). It's of course a bit a question how eg value would be passed if it was eg a DataFrame initially (and not a scalar like in my simple example).
So this issue is meant for an initial discussion of the topic. I ran into some of those issues while working on the ArrayManager/Copy-on-Write indexing related code, and so tried to put some thoughts here, to see what other are thinking about it.
Indexing is a complex topic, so in practice there are much more cases to consider than the simple examples above (and most will probably only really become apparent while trying to refactor something). But the general idea is to move more of the logic from indexing.py (after pre-processing/validation/conversion to positional indexers) into the internals.
I'm sympathetic to this idea, particularly the part about _setitem_with_indexer_split_path.
We have a long-running goal of simplifying internals and trimming the non-internals places that use it. How this idea affects these goals will depend on the implementation.
I strongly prefer that anything moved into internals be positional-only.
some of the costs discussed in _getitem_tuple_same_dim would be avoided with either CoW or columns-always-views
All else equal, I'd rather recurse into DataFrame indexing methods than go into a Manager method
Helps keep AM vs BM behavior in sync
But easy to imagine perf improved by using Manager methods
I think an updated version of #40380 might be a helpful step.
IndexingRelated to indexing on series/frames, not to indexes themselvesInternalsRelated to non-user accessible pandas implementationRefactorInternal refactoring of code
Currently, when indexing 2 dimensions at once, we typically handle each dimension independently in the
indexing.py
code.For example, consider the following
getitem
operation where we have both a row and column indexer:This specific case is finally handled in
_getitem_tuple_same_dim
withpandas/pandas/core/indexing.py
Lines 786 to 791 in 226876a
where you can see that we first select the slice for each column (all blocks), and then in a second loop iteration select the columns we need. For this specific case, that means that we are also unnecessarily indexing the columns (blocks) that we don't need + we are creating an intermediate DataFrame.
(for this specific case with a single block in the example above, it won't matter much, but in general you can have many blocks of course. Also, for a slice it won't be that important, but eg boolean filtering or integer indexing all columns is more expensive)
Similarly, consider a second example with a
setitem
operation:This specific case is handled in
_setitem_with_indexer_split_path
withpandas/pandas/core/indexing.py
Lines 1722 to 1723 in 226876a
So also here the operation is done column by column independently. The
_setitem_single_column
then accesses the column as a Series, updates that series and sets it again in the DataFrame (self.obj._iset_item(loc, ser)
).Also here this can be less efficient in some cases (eg if you assign into multiple columns of the same block), although I think this will be less prevalent as the getitem case above. But I also ran into this in my Copy-on-Write POC (#41878) where the implementation updating each column by updating a Series is problematic (I want to have the final update logic inside the manager, so I can manage the reference tracking / copy on write there, for this specific PR. For that purpose, I introduced a
SingleArrayManager.setitem_column()
method).In general, I think it would be a cleaner interface to leave it up to the Manager to see how it handles the multiple dimensions of the indexer (after all validation, pre-processing of the indexers (up to purely positional indexers) and the value in indexing.py, of course). And that way, try to move some of the "split path" or not complexity into the internals.
We could for example have a
DataManager.getitem/setitem
method with a signature likefor the case where indexing a DataFrame results in a DataFrame (so not reducing the dimension). It's of course a bit a question how eg
value
would be passed if it was eg a DataFrame initially (and not a scalar like in my simple example).So this issue is meant for an initial discussion of the topic. I ran into some of those issues while working on the ArrayManager/Copy-on-Write indexing related code, and so tried to put some thoughts here, to see what other are thinking about it.
Indexing is a complex topic, so in practice there are much more cases to consider than the simple examples above (and most will probably only really become apparent while trying to refactor something). But the general idea is to move more of the logic from indexing.py (after pre-processing/validation/conversion to positional indexers) into the internals.
cc @jbrockmendel @jreback
The text was updated successfully, but these errors were encountered: