From 55cff539af8571ec7ab4606c2d7f9faf25084d06 Mon Sep 17 00:00:00 2001 From: Thomas Li <47963215+lithomas1@users.noreply.github.com> Date: Fri, 17 Feb 2023 12:27:51 -0500 Subject: [PATCH 01/13] PDEP 8: Inplace methods in pandas Co-Authored-By: Joris Van den Bossche <1020496+jorisvandenbossche@users.noreply.github.com> Co-Authored-By: Patrick Hoefler <61934744+phofl@users.noreply.github.com> --- .../pdeps/0008-inplace-methods-in-pandas.md | 402 ++++++++++++++++++ 1 file changed, 402 insertions(+) create mode 100644 web/pandas/pdeps/0008-inplace-methods-in-pandas.md diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md new file mode 100644 index 0000000000000..9f8b0ff2d6fe5 --- /dev/null +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -0,0 +1,402 @@ +# PDEP-7: In-place methods in pandas + +- Created: 16 February 2023 +- Status: Under discussion +- Discussion: [PR xxx](https://github.com/pandas-dev/pandas/pull/xxx) +- Authors: [Thomas Li](https://github.com/lithomas1), + [Patrick Hoefler](https://github.com/phofl), + [Joris Van den Bossche](https://github.com/jorisvandenbossche) +- Revision: 1 + +## Abstract + +This PDEP proposes that: + +- The ``inplace`` parameter will be removed from any methods that never can be done inplace +- The ``inplace`` parameter will also be removed from any methods that modify the shape of a pandas object's values or + don't modify the internal values of a pandas object at all. +- In contrast, the ``inplace`` parameter will be kept for any methods that only modify the underlying data of a pandas + object. 
+ - For example, the ``fillna`` method would retain its ``inplace`` keyword, while ``dropna`` (potentially shrinks the + length of a ``DataFrame``/``Series``) and ``rename`` (alters labels not values) would lose their ``inplace`` + keywords + - For those methods, since Copy-on-Write behavior will lazily copy if the result is unchanged, users should reassign + to the same variable to imitate behavior of the ``inplace`` keyword. + e.g. ``df = df.dropna()`` for a DataFrame with no null values. +- The ``copy`` parameter will also be removed, except in constructors and in functions/methods that convert array-likes + to pandas objects (e.g. the ``pandas.array`` function) and functions/methods that export pandas objects to other data + types (e.g. ``DataFrame/Series.to_numpy`` method). +- Open Questions + (These questions are deferred to a later revision, and will not affect the acceptance process of this PDEP.) + - Should ``inplace=True`` return the original pandas object that was operated inplace on? + - What should happen when ``inplace=True`` but the original pandas object cannot be operated inplace on because it + shares its values with another object? + +## Motivation and Scope + +The `inplace=True` keyword has been a controversial topic for many years. It is generally seen (at least by several +pandas maintainers and educators) as bad practice and often unnecessary, but at the same time it is also widely used, +partly because of confusion around the impact of the keyword. + +Generally, we assume that people use the keyword for the following reasons: + +1. Because they think it is more efficient (it is faster and/or can save memory) +2. To save the result to the same variable / update the original variable (avoid the pattern of reassigning to the same + variable name) + +For the first reason: efficiency is an important aspect. However, in practice it is not always the case +that `inplace=True` improves anything. 
Some of the methods with an `inplace` keyword can actually work inplace, but +others still make a copy under the hood anyway. In addition, with the introduction of Copy-on-Write, there are now other +ways to avoid making unnecessary copies by default (without needing to specify a keyword). The next section gives a +detailed overview of those different cases. + +For the second reason: we are convinced that this is not worth it. While it might save some keystrokes (if you have a +long variable name), this code style also has sufficient disadvantages that we think it is not worth providing "two +ways" to achieve the same result: + +- You can't use method chaining with `inplace=True` +- The ``inplace`` keyword complicates type annotations (because the return value depends on the value of `inplace`) +- Using `inplace=True` gives code that mutates the state of an object and thus has side-effects. That can introduce + subtle bugs and is harder to debug. + +Finally, there are also methods that have a `copy` keyword instead of an `inplace` keyword (which also avoids copying +the data in the case of `copy=False`, but still returns a new object referencing the same data instead of updating the +calling object), adding to the inconsistencies. This `copy=False` option also has become redundant with the introduction +of Copy-on-Write. + +Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword (except +for a small subset of methods that can actually update data inplace). Removing those keywords will give a more +consistent and less confusing API. + +Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of operations +consistent, and compatible with Copy-on-Write. + +Note: there are also operations (not methods) that work inplace in pandas, such as indexing ( +e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). 
This is out of scope for this PDEP, as we focus on +the inplace behaviour of DataFrame and Series _methods_. + +## Detailed description + +### Status Quo + +Many methods in pandas currently have the ability to perform an operation inplace. For example, some methods such +as ``DataFrame.insert``, only support inplace operations, while other methods use keywords such as ``copy`` +or ``inplace`` to control whether an operation is done inplace or not. + +Unfortunately, many methods supporting the ``inplace`` keyword either cannot be done inplace, or make a copy as a +consequence of the operations they perform, regardless of whether ``inplace`` is ``True`` or not. This, coupled with the +fact that the ``inplace=True`` changes the return type of a method from a pandas object to ``None``, makes usage of +the ``inplace`` keyword confusing and non-intuitive. + +In addition, some methods, such as ``DataFrame.rename`` and ``DataFrame.rename_axis`` confusingly support both +the ``copy`` and ``inplace`` keywords, with the value of ``inplace`` overwriting the value of ``copy``. + +To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have +an ``inplace``/``copy`` keyword into 4 groups: + +**Group 1: Methods that always operate inplace** + +| Method Name | +|:--------------| +| ``insert`` | +| ``pop`` | +| ``update`` | +| ``isetitem``* | + +These methods always operate inplace and don't have the ``inplace`` or ``copy`` keyword. + +\* Although ``isetitem`` operates on the original pandas object inplace, it will not change any existing values +inplace (it will remove the values of the column being set, and insert new values). 
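The Group 1 behaviour described above can be sketched as follows. This is only an illustration: the frame and its column names are made up here and are not part of the proposal. `insert`, `pop` and `update` all change the calling object directly, so no reassignment is needed.

```python
import pandas as pd

# Illustrative frame; the column names are arbitrary.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# insert() adds the column to df itself and returns None.
ret = df.insert(1, "c", [7, 8, 9])

# pop() removes the column from df and returns it as a Series.
popped = df.pop("b")

# update() overwrites matching entries of df with the non-NA values of the other frame.
df.update(pd.DataFrame({"a": [10.0]}))

print(ret)               # None
print(list(df.columns))  # ['a', 'c']
print(popped.tolist())   # [4.0, 5.0, 6.0]
print(df["a"].tolist())  # [10.0, 2.0, 3.0]
```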
+ +**Group 2: Methods that modify the underlying data of the DataFrame/Series object and can be done inplace** + +| Method Name | +|:----------------| +| ``where`` | +| ``fillna`` | +| ``replace`` | +| ``mask`` | +| ``interpolate`` | +| ``ffill`` | +| ``bfill`` | +| ``clip`` | + +These methods don't operate inplace by default, but have the option to specify `inlace=True`. All those methods leave +the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of +the DataFrame or Series. + +**Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values** + +| Method Name | +|:----------------------------| +| ``drop`` (dropping columns) | +| ``eval`` | +| ``rename`` | +| ``rename_axis`` | +| ``reset_index`` | +| ``set_index`` | +| ``astype`` | +| ``infer_objects`` | +| ``set_axis`` | +| ``set_flags`` | +| ``to_period`` | +| ``to_timestamp`` | +| ``tz_localize`` | +| ``tz_convert`` | +| ``swaplevel`` | +| ``concat`` | + +These methods can change the structure of the DataFrame or Series, such as changing the shape by adding or removing +columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing +underlying data of the object. +All those methods (except for `set_flags`) make a copy of the full data by default, but can be performed inplace with +avoiding copying all data (currently enabled with the `inplace` or `copy` keyword). + +Some of these methods only have a `copy` keyword instead of an `inplace` +keyword: `astype`, `infer_objects`, `set_axis`, `set_flags`, `to_period`, `to_timestamp`, `tz_localize`, `tz_convert`, `swaplevel`, `concat` +and `merge`. +These allow the user to avoid a copy, but don't update the original object inplace and instead return a new object +referencing the same data. + +Two methods also have both keywords: `rename`, `rename_axis`. 
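A small sketch of the Group 3 behaviour described above, using `rename` without any keyword (the frame and the labels are assumptions made for illustration). The call returns a new object with different labels, the original object keeps its labels, and the values are the same in both.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A Group 3 style operation without any keyword: only the labels change.
renamed = df.rename(columns={"a": "x"})

print(renamed is df)          # False: a new object is returned
print(list(df.columns))       # ['a']: the original labels are untouched
print(list(renamed.columns))  # ['x']
print(renamed["x"].tolist())  # [1, 2, 3]: same values as the original

# Reassigning to the same name gives the visible effect of inplace=True.
df = df.rename(columns={"a": "x"})
print(list(df.columns))       # ['x']
```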
+ +**Group 4: Methods that can never operate inplace** + +| Method Name | +|:-------------------------| +| ``drop`` (dropping rows) | +| ``dropna`` | +| ``drop_duplicates`` | +| ``sort_values`` | +| ``sort_index`` | +| ``query`` | +| ``transpose`` | +| ``swapaxes`` | +| ``align`` | +| ``reindex`` | +| ``reindex_like`` | +| ``truncate`` | + +These methods can never operate inplace because the nature of the operation requires copying (such as reordering or +dropping rows). For those methods, `inplace=True` is essentially just synctactic sugar for reassigning the new result +to `self` (the calling DataFrame). + +Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not +need to perform a copy. This currently happens with Copy-on-Write (regardless of ``inplace``), but this is considered an +implementation detail for the purpose of this PDEP. + +### Proposed changes and reasoning + +The methods from group 1 won't change behavior, and will remain always inplace. + +Methods in groups 3 and 4 will lose their ``copy`` and ``inplace`` keywords. Under Copy-on-Write, every operation will +potentially return a shallow copy of the input object, if the performed operation does not require a copy. This is +equivalent to behavior with ``copy=False`` and/or ``inplace=True`` for those methods. If users want to make a hard +copy(``copy=True``), they can do: + + :::python + df = df.func().copy() + +Therefore, there is no benefit of keeping the keywords around for these methods. + +User can emulate behavior of the ``inplace`` keyword by assigning the result of an operation to the same variable: + + :::python + df = pd.DataFrame({"foo": [1, 2, 3]}) + df = df.reset_index() + df.iloc[0, 1] = ... + +All references to the original object will go out of scope when the result of the ``reset_index`` operation is assigned +to ``df``. 
As a consequence, ``iloc`` will continue to operate inplace, and the underlying data will not be copied. + +The methods in group 2 behave different compared to the first three groups. These methods are actually able to operate +inplace because they only modify the underlying data. + + :::python + df = pd.DataFrame({"foo": [1, 2, 3]}) + df = df.replace(to_replace=1, value=100) + +If we follow the rules of Copy-on-Write[^1] where "any subset or returned series/dataframe always behaves as a copy of +the original, and thus never modifies the original", then there is no way of doing this operation inplace by default. +The original object would be modified before the reference goes out of scope. + +To avoid triggering a copy when a value would actually get replaced, we will keep the ``inplace`` argument for those +methods. + +### Open Questions + +#### With `inplace=True`, should we silently copy or raise an error if the data has references? + +For those methods where we would keep the `inplace=True` option, there is a complication that actually operating inplace +is not always possible. + +For example, + + :::python + df = pd.DataFrame({"foo": [1, 2, 3]}) + df.replace(to_replace=1, value=100, inplace=True) + +can be performed inplace. + +This is only true if ``df`` does not share the values it stores with another pandas object. For example, the following +operations + + :::python + df = pd.DataFrame({"foo": [1, 2, 3]}) + view = df[:] + # We can't operate inplace, because view would also be modified! + df.replace(to_replace=1, value=100, inplace=True) + +would be incompatible with the Copy-on-Write rules when actually done inplace. In this case we can either + +- copy the shared values before performing the operation to avoid modifying another object (i.e. follow the standard + Copy-on-Write procedure), +- raise an error to indicate that more than one object would be changed and the inplace operation is not possible. 
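The shared-values situation behind the two options above can be inspected with a short sketch like the one below. It only assumes NumPy's `np.shares_memory` as a way to check whether two objects are backed by the same buffer. What `view` contains after the `replace` call depends on the pandas version and on whether Copy-on-Write is enabled, so the sketch prints that result instead of asserting a value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3]})
view = df[:]

# Before any write, both objects are backed by the same buffer.
shared_before = np.shares_memory(df["foo"].to_numpy(), view["foo"].to_numpy())
print(shared_before)  # True

df.replace(to_replace=1, value=100, inplace=True)
print(df["foo"].tolist())  # [100, 2, 3]

# Version dependent: with Copy-on-Write the shared values are copied first,
# so view keeps the old data; without it, view may be modified as well.
shared_after = np.shares_memory(df["foo"].to_numpy(), view["foo"].to_numpy())
print(view["foo"].tolist(), shared_after)
```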
+ +Raising an error here is problematic since oftentimes users do not have control over whether a method would cause a " +lazy copy" to be triggered under Copy-on-Write. It is also hard to fix, adding a `copy()` before calling a method +with ``inplace=True`` might actually be worse than triggering the copy under the hood. We would only copy columns that +share data with another object, not the whole object like ``.copy()`` would. + +There is another possible variant, which would be to trigger the copy (like the first option), but have an option to +raise a warning whenever this happens. +This would be useful in an IPython shell/Jupyter Notebook setting, where the user would have the opportunity to delete +unused references that are causing the copying to be triggered. + +For example, + + :::ipython + In [1]: import pandas as pd + + In [2]: pd.set_option("mode.copy_on_write", True) + + In [3]: ser = pd.Series([1,2,3]) + + In [4]: ser_vals = ser.values # Save values to check inplace-ness + + In [5]: ser + Out[5]: + 0 1 + 1 2 + 2 3 + dtype: int64 + + In [6]: ser = ser[:] # Original series should go out of scope + + In [7]: ser.iloc[0] = -1 # This should be inplace + + In [8]: ser + Out[8]: + 0 -1 + 1 2 + 2 3 + dtype: int64 + + In [9]: ser_vals + Out[9]: array([1, 2, 3]) # It's not modified! + + In [10]: Out[5] # IPython kept our series alive since we displayed it! + +While there are ways to mitigate this[^5], it may be helpful to let the user know that an operation that they performed +was not inplace, since it is possible to go out of memory because of this. + +#### Return the calling object (`self`) also when using `inplace=True`? + +The downsides of keeping the `inplace=True` option for certain methods, are that the return type of those methods will +now depend on the value of `inplace`, and that method chaining will no longer work. + +One way around this is to have the method return the original object that was operated on inplace when ``inplace=True``. 
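To make the idea concrete, here is a sketch that emulates "return the calling object" with a wrapper function. `fillna_returning_self` is a hypothetical helper written for this illustration, not an existing or proposed pandas API.

```python
import pandas as pd


def fillna_returning_self(obj, value):
    """Hypothetical helper: fill missing values in place and return obj itself."""
    obj.fillna(value, inplace=True)
    return obj


df = pd.DataFrame({"foo": [1.0, None, 3.0]})

# Because the same object comes back, the inplace step can sit in a method chain.
result = fillna_returning_self(df, 0.0).rename(columns={"foo": "bar"})

print(fillna_returning_self(df, 0.0) is df)  # True: the identical object is returned
print(df["foo"].tolist())                    # [1.0, 0.0, 3.0]
print(list(result.columns))                  # ['bar']
```

The `is` check shows the special case listed under the disadvantages: the returned object is identical to the input rather than a new object that behaves as a copy.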
+ +Advantages: + +- It enables to use inplace operations in a method chain +- It simplifies type annotations +- It enables to change the default for ``inplace`` to True under Copy-on-Write + +Disadvantages: + +- In general, when a pandas method returns an object, this is a _new_ object, and thus following the Copy-on-Write rules + of behaving as a copy. This would introduce a special case where an _identical_ object would be + returned (`df2 = df.method(inplace=True); assert df2 is df`) +- It would change the behaviour of the current `inplace=True` + +Given that ``inplace`` is already widely used by the pandas community, we would like to collect feedback about what the +expected return type should be. Therefore, we will defer a decision on this until a later revision of this PDEP. + +## Backward compatibility + +Removing the `inplace` keyword is a breaking change, but since the affected behaviour is `inplace=True`, the default +behaviour when not specifying the keyword (i.e. `inplace=False`) will not change and the keyword itself can first be +deprecated before it is removed. + +Similarly for the `copy` keyword, this can be deprecated before it is removed. + +There are some behaviour changes (for example the current `copy=False` returning a shallow copy will no longer be an " +actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write +proposal[^1]. + +## Alternatives + +### Remove the `inplace` keyword altogether + +In the past, it was considered to remove the ``inplace`` keyword entirely. This was because many operations that had +the ``inplace`` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under +the hood, causing confusion and providing no real benefit to users. 
+ +Because a majority of the methods supporting ``inplace`` did not operate inplace, it was considered at the time to +deprecate and remove inplace from all methods, and add back the keyword as necessary.[^3] + +For the subset of methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace` +keyword for those as well could give a significant performance regression when currently using this keyword with large +DataFrames. Therefore, we decided to keep the `inplace` keyword for this small subset of methods. + +### Standardize on the `copy` keyword instead of `inplace` + +It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the ``copy`` +keyword already returns a new object instead of None (enabling method chaining) when it is set to `True`. + +However, the `copy` keyword is not supported in any of the values-mutating methods listed in Group 2 above +unlike `inplace`, so semantics of future inplace mutation of values align better with the current behavior of +the `inplace` keyword, than with the current behavior of the `copy` keyword. + +Furthermore, with the Copy-on-Write proposal, the `copy` keyword also has become superfluous. With Copy-on-Write +enabled, methods that return a new pandas object will always try to avoid a copy whenever possible, regardless of +a `copy=False` keyword. Thus, the general proposal is to actually remove the `copy` keyword from the methods where it is +currently used. + +Currently, for methods where it is supported, when the `copy` keyword is `False`, a new pandas object (same +as `copy=True`) is returned as the result of a method call, with the values backing the object being shared when +possible. 
With the proposed inplace behavior, current behavior of ``copy=False`` would return a new pandas object with +identical values as the original object(that was modified inplace), which may be confusing for users, and lead to +ambiguity with Copy on Write rules. + +## History + +The future of the ``inplace`` keyword is something that has been debated a lot over the years. + +It may be helpful to review those discussions (see links) [^2] [^3] [^4] to better understand this PDEP. + +## Timeline + +Copy-on-Write is a relatively new feature (added in version 1.5) and some methods are missing the "lazy copy" +optimization (equivalent to ``copy=False``). + +Therefore, we will start showing deprecation warnings for the ``copy`` and ``inplace`` parameters in pandas 2.1, to +allow for bugs with Copy-on-Write to be addressed and for more optimizations to be added. + +Hopefully, users will be able to switch to Copy-on-Write to keep the no-copy behavior and to silence the warnings. + +The full removal of the ``copy`` parameter and ``inplace`` (where necessary) is set for pandas 3.0, which will coincide +with the enablement of Copy-on-Write for pandas by default. 
+ +## PDEP History + +- 16 February 2023: Initial draft + +## References + +[^1]: [Copy on Write Specification](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u) +[^2]: +[^3]: +[^4]: +[^5]: From 57390ada100466dac777e5b66d5a4f2a72700c38 Mon Sep 17 00:00:00 2001 From: Thomas Li <47963215+lithomas1@users.noreply.github.com> Date: Fri, 17 Feb 2023 18:04:30 -0500 Subject: [PATCH 02/13] changes --- web/pandas/pdeps/0008-inplace-methods-in-pandas.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 9f8b0ff2d6fe5..426f5682d61e3 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -1,4 +1,4 @@ -# PDEP-7: In-place methods in pandas +# PDEP-8: In-place methods in pandas - Created: 16 February 2023 - Status: Under discussion From 92c6a0a253858405b7a9552eb2bb2cf9fa1fcaf4 Mon Sep 17 00:00:00 2001 From: Thomas Li <47963215+lithomas1@users.noreply.github.com> Date: Thu, 23 Feb 2023 16:51:25 -0500 Subject: [PATCH 03/13] Update --- .../pdeps/0008-inplace-methods-in-pandas.md | 131 +++++++++--------- 1 file changed, 66 insertions(+), 65 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 426f5682d61e3..6b26d8eac3cdc 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -121,93 +121,94 @@ inplace (it will remove the values of the column being set, and insert new value | ``bfill`` | | ``clip`` | -These methods don't operate inplace by default, but have the option to specify `inlace=True`. All those methods leave +These methods don't operate inplace by default, but can be done inplace with `inplace=True`. 
All those methods leave the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. **Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values** -| Method Name | -|:----------------------------| -| ``drop`` (dropping columns) | -| ``eval`` | -| ``rename`` | -| ``rename_axis`` | -| ``reset_index`` | -| ``set_index`` | -| ``astype`` | -| ``infer_objects`` | -| ``set_axis`` | -| ``set_flags`` | -| ``to_period`` | -| ``to_timestamp`` | -| ``tz_localize`` | -| ``tz_convert`` | -| ``swaplevel`` | -| ``concat`` | +| Method Name | Keyword | +|:----------------------------|-----------------------| +| ``drop`` (dropping columns) | ``inplace`` | +| ``rename`` | ``inplace``, ``copy`` | +| ``rename_axis`` | ``inplace``, ``copy`` | +| ``reset_index`` | ``inplace`` | +| ``set_index`` | ``inplace`` | +| ``astype`` | ``copy`` | +| ``infer_objects`` | ``copy`` | +| ``set_axis`` | ``copy`` | +| ``set_flags`` | ``copy`` | +| ``to_period`` | ``copy`` | +| ``to_timestamp`` | ``copy`` | +| ``tz_localize`` | ``copy`` | +| ``tz_convert`` | ``copy`` | +| ``Series.swaplevel``* | ``copy`` | +| ``concat`` | ``copy`` | + +\* The `copy` keyword is only available for `Series.swaplevel` and not for `DataFrame.swaplevel`. These methods can change the structure of the DataFrame or Series, such as changing the shape by adding or removing columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing underlying data of the object. + All those methods (except for `set_flags`) make a copy of the full data by default, but can be performed inplace with avoiding copying all data (currently enabled with the `inplace` or `copy` keyword). 
Some of these methods only have a `copy` keyword instead of an `inplace` -keyword: `astype`, `infer_objects`, `set_axis`, `set_flags`, `to_period`, `to_timestamp`, `tz_localize`, `tz_convert`, `swaplevel`, `concat` -and `merge`. -These allow the user to avoid a copy, but don't update the original object inplace and instead return a new object -referencing the same data. +keyword. These allow the user to avoid a copy, but don't update the original object inplace and instead return a +new object referencing the same data. -Two methods also have both keywords: `rename`, `rename_axis`. +Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace` keyword overriding `copy`. **Group 4: Methods that can never operate inplace** -| Method Name | -|:-------------------------| -| ``drop`` (dropping rows) | -| ``dropna`` | -| ``drop_duplicates`` | -| ``sort_values`` | -| ``sort_index`` | -| ``query`` | -| ``transpose`` | -| ``swapaxes`` | -| ``align`` | -| ``reindex`` | -| ``reindex_like`` | -| ``truncate`` | - -These methods can never operate inplace because the nature of the operation requires copying (such as reordering or -dropping rows). For those methods, `inplace=True` is essentially just synctactic sugar for reassigning the new result -to `self` (the calling DataFrame). +| Method Name | Keyword | +|:-------------------------|-------------| +| `drop` (dropping rows) | `inplace` | +| `dropna` | `inplace` | +| `drop_duplicates` | `inplace` | +| `sort_values` | `inplace` | +| `sort_index` | `inplace` | +| `eval` | `inplace` | +| `query` | `inplace` | +| `transpose` | `copy` | +| `swapaxes` | `copy` | +| `align` | `copy` | +| `reindex` | `copy` | +| `reindex_like` | `copy` | +| `truncate` | `copy` | + +Although all of these methods have either an `inplace` or a `copy` keyword, they can never operate inplace because the nature of the +operation requires copying (such as reordering or dropping rows).
For those methods, `inplace=True` is essentially just +syntactic sugar for reassigning the new result to `self` (the calling DataFrame). Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not -need to perform a copy. This currently happens with Copy-on-Write (regardless of ``inplace``), but this is considered an +need to perform a copy. This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an implementation detail for the purpose of this PDEP. ### Proposed changes and reasoning The methods from group 1 won't change behavior, and will remain always inplace. -Methods in groups 3 and 4 will lose their ``copy`` and ``inplace`` keywords. Under Copy-on-Write, every operation will +Methods in groups 3 and 4 will lose their `copy` and `inplace` keywords. Under Copy-on-Write, every operation will potentially return a shallow copy of the input object, if the performed operation does not require a copy. This is -equivalent to behavior with ``copy=False`` and/or ``inplace=True`` for those methods. If users want to make a hard -copy(``copy=True``), they can do: +equivalent to behavior with `copy=False` and/or `inplace=True` for those methods. If users want to make a hard +copy (`copy=True`), they can do: :::python df = df.func().copy() Therefore, there is no benefit of keeping the keywords around for these methods. -User can emulate behavior of the ``inplace`` keyword by assigning the result of an operation to the same variable: +Users can emulate the behavior of the `inplace` keyword by assigning the result of an operation to the same variable: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) df = df.reset_index() df.iloc[0, 1] = ... -All references to the original object will go out of scope when the result of the ``reset_index`` operation is assigned -to ``df``. As a consequence, ``iloc`` will continue to operate inplace, and the underlying data will not be copied.
+All references to the original object will go out of scope when the result of the `reset_index` operation is assigned +to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied. The methods in group 2 behave different compared to the first three groups. These methods are actually able to operate inplace because they only modify the underlying data. @@ -220,7 +221,7 @@ If we follow the rules of Copy-on-Write[^1] where "any subset or returned series the original, and thus never modifies the original", then there is no way of doing this operation inplace by default. The original object would be modified before the reference goes out of scope. -To avoid triggering a copy when a value would actually get replaced, we will keep the ``inplace`` argument for those +To avoid triggering a copy when a value would actually get replaced, we will keep the `inplace` argument for those methods. ### Open Questions @@ -238,7 +239,7 @@ For example, can be performed inplace. -This is only true if ``df`` does not share the values it stores with another pandas object. For example, the following +This is only true if `df` does not share the values it stores with another pandas object. For example, the following operations :::python @@ -255,8 +256,8 @@ would be incompatible with the Copy-on-Write rules when actually done inplace. I Raising an error here is problematic since oftentimes users do not have control over whether a method would cause a " lazy copy" to be triggered under Copy-on-Write. It is also hard to fix, adding a `copy()` before calling a method -with ``inplace=True`` might actually be worse than triggering the copy under the hood. We would only copy columns that -share data with another object, not the whole object like ``.copy()`` would. +with `inplace=True` might actually be worse than triggering the copy under the hood. 
We would only copy columns that +share data with another object, not the whole object like `.copy()` would. There is another possible variant, which would be to trigger the copy (like the first option), but have an option to raise a warning whenever this happens. @@ -305,13 +306,13 @@ was not inplace, since it is possible to go out of memory because of this. The downsides of keeping the `inplace=True` option for certain methods, are that the return type of those methods will now depend on the value of `inplace`, and that method chaining will no longer work. -One way around this is to have the method return the original object that was operated on inplace when ``inplace=True``. +One way around this is to have the method return the original object that was operated on inplace when `inplace=True`. Advantages: - It enables to use inplace operations in a method chain - It simplifies type annotations -- It enables to change the default for ``inplace`` to True under Copy-on-Write +- It enables to change the default for `inplace` to True under Copy-on-Write Disadvantages: @@ -320,7 +321,7 @@ Disadvantages: returned (`df2 = df.method(inplace=True); assert df2 is df`) - It would change the behaviour of the current `inplace=True` -Given that ``inplace`` is already widely used by the pandas community, we would like to collect feedback about what the +Given that `inplace` is already widely used by the pandas community, we would like to collect feedback about what the expected return type should be. Therefore, we will defer a decision on this until a later revision of this PDEP. ## Backward compatibility @@ -339,11 +340,11 @@ proposal[^1]. ### Remove the `inplace` keyword altogether -In the past, it was considered to remove the ``inplace`` keyword entirely. 
This was because many operations that had -the ``inplace`` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under +In the past, it was considered to remove the `inplace` keyword entirely. This was because many operations that had +the `inplace` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under the hood, causing confusion and providing no real benefit to users. -Because a majority of the methods supporting ``inplace`` did not operate inplace, it was considered at the time to +Because a majority of the methods supporting `inplace` did not operate inplace, it was considered at the time to deprecate and remove inplace from all methods, and add back the keyword as necessary.[^3] For the subset of methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace` @@ -352,7 +353,7 @@ DataFrames. Therefore, we decided to keep the `inplace` keyword for this small s ### Standardize on the `copy` keyword instead of `inplace` -It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the ``copy`` +It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the `copy` keyword already returns a new object instead of None (enabling method chaining) when it is set to `True`. However, the `copy` keyword is not supported in any of the values-mutating methods listed in Group 2 above @@ -366,27 +367,27 @@ currently used. Currently, for methods where it is supported, when the `copy` keyword is `False`, a new pandas object (same as `copy=True`) is returned as the result of a method call, with the values backing the object being shared when -possible. With the proposed inplace behavior, current behavior of ``copy=False`` would return a new pandas object with +possible. 
With the proposed inplace behavior, current behavior of `copy=False` would return a new pandas object with identical values as the original object(that was modified inplace), which may be confusing for users, and lead to ambiguity with Copy on Write rules. ## History -The future of the ``inplace`` keyword is something that has been debated a lot over the years. +The future of the `inplace` keyword is something that has been debated a lot over the years. It may be helpful to review those discussions (see links) [^2] [^3] [^4] to better understand this PDEP. ## Timeline Copy-on-Write is a relatively new feature (added in version 1.5) and some methods are missing the "lazy copy" -optimization (equivalent to ``copy=False``). +optimization (equivalent to `copy=False`). -Therefore, we will start showing deprecation warnings for the ``copy`` and ``inplace`` parameters in pandas 2.1, to +Therefore, we will start showing deprecation warnings for the `copy` and `inplace` parameters in pandas 2.1, to allow for bugs with Copy-on-Write to be addressed and for more optimizations to be added. Hopefully, users will be able to switch to Copy-on-Write to keep the no-copy behavior and to silence the warnings. -The full removal of the ``copy`` parameter and ``inplace`` (where necessary) is set for pandas 3.0, which will coincide +The full removal of the `copy` parameter and `inplace` (where necessary) is set for pandas 3.0, which will coincide with the enablement of Copy-on-Write for pandas by default. 
## PDEP History From 6b0a91b4c8120b18904250e43559c39515e1bc84 Mon Sep 17 00:00:00 2001 From: Thomas Li <47963215+lithomas1@users.noreply.github.com> Date: Sat, 11 Mar 2023 17:00:11 -0500 Subject: [PATCH 04/13] Apply suggestions from code review --- web/pandas/pdeps/0008-inplace-methods-in-pandas.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 6b26d8eac3cdc..342ff4e8a2f1e 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -2,7 +2,7 @@ - Created: 16 February 2023 - Status: Under discussion -- Discussion: [PR xxx](https://github.com/pandas-dev/pandas/pull/xxx) +- Discussion: [PR 51466](https://github.com/pandas-dev/pandas/pull/51466) - Authors: [Thomas Li](https://github.com/lithomas1), [Patrick Hoefler](https://github.com/phofl), [Joris Van den Bossche](https://github.com/jorisvandenbossche) @@ -30,7 +30,7 @@ This PDEP proposes that: (These questions are deferred to a later revision, and will not affect the acceptance process of this PDEP.) - Should ``inplace=True`` return the original pandas object that was operated inplace on? - What should happen when ``inplace=True`` but the original pandas object cannot be operated inplace on because it - shares its values with another object? + shares its values with another pandas object? ## Motivation and Scope @@ -121,7 +121,7 @@ inplace (it will remove the values of the column being set, and insert new value | ``bfill`` | | ``clip`` | -These methods don't operate inplace by default, but can be done inplace with `inplace=True`. All those methods leave +These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible (e.g. the values replacing the old values can be stored in the original array without an astype). 
All those methods leave the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. From 03ace50ba070fe9e57ac0d43e2a34f13e43f2093 Mon Sep 17 00:00:00 2001 From: Thomas Li <47963215+lithomas1@users.noreply.github.com> Date: Sun, 12 Mar 2023 09:32:47 -0400 Subject: [PATCH 05/13] cleanups + formatting --- .../pdeps/0008-inplace-methods-in-pandas.md | 40 ++++++++----------- 1 file changed, 17 insertions(+), 23 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 342ff4e8a2f1e..ddb5f5df99ded 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -60,9 +60,8 @@ ways" to achieve the same result: subtle bugs and is harder to debug. Finally, there are also methods that have a `copy` keyword instead of an `inplace` keyword (which also avoids copying -the data in the case of `copy=False`, but still returns a new object referencing the same data instead of updating the -calling object), adding to the inconsistencies. This `copy=False` option also has become redundant with the introduction -of Copy-on-Write. +the data when `copy=False`, but returns a new object referencing the same data instead of updating the calling object), +adding to the inconsistencies. This keyword is also redundant now with the introduction of Copy-on-Write. Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword (except for a small subset of methods that can actually update data inplace). 
Removing those keywords will give a more @@ -94,7 +93,7 @@ the ``copy`` and ``inplace`` keywords, with the value of ``inplace`` overwriting To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have an ``inplace``/``copy`` keyword into 4 groups: -**Group 1: Methods that always operate inplace** +**Group 1: Methods that always operate inplace (no user-control with ``inplace``/``copy`` keyword) ** | Method Name | |:--------------| @@ -103,8 +102,6 @@ an ``inplace``/``copy`` keyword into 4 groups: | ``update`` | | ``isetitem``* | -These methods always operate inplace and don't have the ``inplace`` or ``copy`` keyword. - \* Although ``isetitem`` operates on the original pandas object inplace, it will not change any existing values inplace (it will remove the values of the column being set, and insert new values). @@ -121,7 +118,8 @@ inplace (it will remove the values of the column being set, and insert new value | ``bfill`` | | ``clip`` | -These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible (e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave +These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible +(e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. 
@@ -162,8 +160,8 @@ Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace` **Group 4: Methods that can never operate inplace** -| Method Name | Keyword | -|:-------------------------|-------------| +| Method Name | Keyword | +|:-----------------------|-----------| | `drop` (dropping rows) | `inplace` | | `dropna` | `inplace` | | `drop_duplicates` | `inplace` | @@ -178,9 +176,9 @@ Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace` | `reindex_like` | `copy` | | `truncate` | `copy` | -Although all of these methods either `inplace` or `copy`, they can never operate inplace because the nature of the +Although these methods the `inplace`/`copy` keywords, they can never operate inplace because the nature of the operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just -syntactic sugar for reassigning the new result to `self` (the calling DataFrame). +syntactic sugar for reassigning the new result to the calling DataFrame/Series. Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not need to perform a copy. This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an @@ -193,14 +191,11 @@ The methods from group 1 won't change behavior, and will remain always inplace. Methods in groups 3 and 4 will lose their `copy` and `inplace` keywords. Under Copy-on-Write, every operation will potentially return a shallow copy of the input object, if the performed operation does not require a copy. This is equivalent to behavior with `copy=False` and/or `inplace=True` for those methods. If users want to make a hard -copy(`copy=True`), they can do: - - :::python - df = df.func().copy() +copy(`copy=True`), they can call the `copy()` method on the result of the operation. Therefore, there is no benefit of keeping the keywords around for these methods. 
-User can emulate behavior of the `inplace` keyword by assigning the result of an operation to the same variable: +To emulate behavior of the `inplace` keyword, we can reassig the result of an operation to the same variable: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -210,8 +205,7 @@ User can emulate behavior of the `inplace` keyword by assigning the result of an All references to the original object will go out of scope when the result of the `reset_index` operation is assigned to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied. -The methods in group 2 behave different compared to the first three groups. These methods are actually able to operate -inplace because they only modify the underlying data. +Group 2 methods differ, though, since they only modify the underlying data, and therefore can be inplace. :::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -336,19 +330,19 @@ There are some behaviour changes (for example the current `copy=False` returning actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write proposal[^1]. -## Alternatives +## Rejected ideas ### Remove the `inplace` keyword altogether -In the past, it was considered to remove the `inplace` keyword entirely. This was because many operations that had +In the past, it was considered to remove the `inplace` keyword entirely. This was because many methods with the `inplace` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under the hood, causing confusion and providing no real benefit to users. 
Because a majority of the methods supporting `inplace` did not operate inplace, it was considered at the time to deprecate and remove inplace from all methods, and add back the keyword as necessary.[^3] -For the subset of methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace` -keyword for those as well could give a significant performance regression when currently using this keyword with large +For methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace` +keyword could give a significant performance regression when currently using this keyword with large DataFrames. Therefore, we decided to keep the `inplace` keyword for this small subset of methods. ### Standardize on the `copy` keyword instead of `inplace` @@ -382,7 +376,7 @@ It may be helpful to review those discussions (see links) [^2] [^3] [^4] to bett Copy-on-Write is a relatively new feature (added in version 1.5) and some methods are missing the "lazy copy" optimization (equivalent to `copy=False`). -Therefore, we will start showing deprecation warnings for the `copy` and `inplace` parameters in pandas 2.1, to +Therefore, we propose deprecating the `copy` and `inplace` parameters in pandas 2.1, to allow for bugs with Copy-on-Write to be addressed and for more optimizations to be added. Hopefully, users will be able to switch to Copy-on-Write to keep the no-copy behavior and to silence the warnings. 
From 2110b3483d0ebb60bd0699d6df109914164ccc0d Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 20 Mar 2023 14:46:06 +0100 Subject: [PATCH 06/13] simplify by focusing on inplace keyword (removing explicit listing of copy keyword) --- .../pdeps/0008-inplace-methods-in-pandas.md | 117 +++++++----------- 1 file changed, 45 insertions(+), 72 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index ddb5f5df99ded..33f2a091d5d72 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -65,7 +65,8 @@ adding to the inconsistencies. This keyword is also redundant now with the intro Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword (except for a small subset of methods that can actually update data inplace). Removing those keywords will give a more -consistent and less confusing API. +consistent and less confusing API. Removing the `copy` keyword is covered by PDEP-7 about Copy-on-Write, +and this PDEP will focus on the `inplace` keyword. Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of operations consistent, and compatible with Copy-on-Write. @@ -79,21 +80,18 @@ the inplace behaviour of DataFrame and Series _methods_. ### Status Quo Many methods in pandas currently have the ability to perform an operation inplace. For example, some methods such -as ``DataFrame.insert``, only support inplace operations, while other methods use keywords such as ``copy`` -or ``inplace`` to control whether an operation is done inplace or not. +as ``DataFrame.insert`` only support inplace operations, while other methods use the `inplace` keyword to control +whether an operation is done inplace or not. 
Unfortunately, many methods supporting the ``inplace`` keyword either cannot be done inplace, or make a copy as a consequence of the operations they perform, regardless of whether ``inplace`` is ``True`` or not. This, coupled with the fact that the ``inplace=True`` changes the return type of a method from a pandas object to ``None``, makes usage of the ``inplace`` keyword confusing and non-intuitive. -In addition, some methods, such as ``DataFrame.rename`` and ``DataFrame.rename_axis`` confusingly support both -the ``copy`` and ``inplace`` keywords, with the value of ``inplace`` overwriting the value of ``copy``. - To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have -an ``inplace``/``copy`` keyword into 4 groups: +an ``inplace`` keyword into 4 groups: -**Group 1: Methods that always operate inplace (no user-control with ``inplace``/``copy`` keyword) ** +**Group 1: Methods that always operate inplace (no user-control with ``inplace`` keyword)** | Method Name | |:--------------| @@ -125,58 +123,38 @@ the DataFrame or Series. **Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values** -| Method Name | Keyword | -|:----------------------------|-----------------------| -| ``drop`` (dropping columns) | ``inplace`` | -| ``rename`` | ``inplace``, ``copy`` | -| ``rename_axis`` | ``inplace``, ``copy`` | -| ``reset_index`` | ``inplace`` | -| ``set_index`` | ``inplace`` | -| ``astype`` | ``copy`` | -| ``infer_objects`` | ``copy`` | -| ``set_axis`` | ``copy`` | -| ``set_flags`` | ``copy`` | -| ``to_period`` | ``copy`` | -| ``to_timestamp`` | ``copy`` | -| ``tz_localize`` | ``copy`` | -| ``tz_convert`` | ``copy`` | -| ``Series.swaplevel``* | ``copy`` | -| ``concat`` | ``copy`` | - -\* The `copy` keyword is only available for `Series.swaplevel` and not for `DataFrame.swaplevel`. 
+| Method Name | +|:----------------------------| +| ``drop`` (dropping columns) | +| ``rename`` | +| ``rename_axis`` | +| ``reset_index`` | +| ``set_index`` | These methods can change the structure of the DataFrame or Series, such as changing the shape by adding or removing columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing -underlying data of the object. - -All those methods (except for `set_flags`) make a copy of the full data by default, but can be performed inplace with -avoiding copying all data (currently enabled with the `inplace` or `copy` keyword). +underlying column data of the object. -Some of these methods only have a `copy` keyword instead of an `inplace` -keyword. These allow the user to avoid a copy, but don't update the original object inplace and instead return a -new object referencing the same data. +All those methods make a copy of the full data by default, but can be performed inplace with +avoiding copying all data (currently enabled with specifying `inplace=True`). -Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace` keyword overriding `copy`. +Note: there are also methods that have a `copy` keyword instead of an `inplace` keyword (e.g. `set_axis`). This serves +a similar purpose (avoid copying all data), but those methods don't update the original object inplace and instead +return a new object referencing the same data. 
**Group 4: Methods that can never operate inplace** -| Method Name | Keyword | -|:-----------------------|-----------| -| `drop` (dropping rows) | `inplace` | -| `dropna` | `inplace` | -| `drop_duplicates` | `inplace` | -| `sort_values` | `inplace` | -| `sort_index` | `inplace` | -| `eval` | `inplace` | -| `query` | `inplace` | -| `transpose` | `copy` | -| `swapaxes` | `copy` | -| `align` | `copy` | -| `reindex` | `copy` | -| `reindex_like` | `copy` | -| `truncate` | `copy` | - -Although these methods the `inplace`/`copy` keywords, they can never operate inplace because the nature of the +| Method Name | +|:-----------------------| +| `drop` (dropping rows) | +| `dropna` | +| `drop_duplicates` | +| `sort_values` | +| `sort_index` | +| `eval` | +| `query` | + +Although these methods have the `inplace` keyword, they can never operate inplace because the nature of the operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just syntactic sugar for reassigning the new result to the calling DataFrame/Series. @@ -188,14 +166,14 @@ implementation detail for the purpose of this PDEP. The methods from group 1 won't change behavior, and will remain always inplace. -Methods in groups 3 and 4 will lose their `copy` and `inplace` keywords. Under Copy-on-Write, every operation will -potentially return a shallow copy of the input object, if the performed operation does not require a copy. This is -equivalent to behavior with `copy=False` and/or `inplace=True` for those methods. If users want to make a hard -copy(`copy=True`), they can call the `copy()` method on the result of the operation. +Methods in groups 3 and 4 will lose their `inplace` keyword. Under Copy-on-Write, every operation will +potentially return a shallow copy of the input object, if the performed operation does not require a copy of the data. This is +equivalent to the behavior with `inplace=True` for those methods. 
If users want to make a hard +copy, they can call the `copy()` method on the result of the operation. Therefore, there is no benefit of keeping the keywords around for these methods. -To emulate behavior of the `inplace` keyword, we can reassig the result of an operation to the same variable: +To emulate behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -222,7 +200,7 @@ methods. #### With `inplace=True`, should we silently copy or raise an error if the data has references? -For those methods where we would keep the `inplace=True` option, there is a complication that actually operating inplace +For those methods where we would keep the `inplace=True` option (group 2), there is a complication that actually operating inplace is not always possible. For example, @@ -324,12 +302,6 @@ Removing the `inplace` keyword is a breaking change, but since the affected beha behaviour when not specifying the keyword (i.e. `inplace=False`) will not change and the keyword itself can first be deprecated before it is removed. -Similarly for the `copy` keyword, this can be deprecated before it is removed. - -There are some behaviour changes (for example the current `copy=False` returning a shallow copy will no longer be an " -actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write -proposal[^1]. - ## Rejected ideas ### Remove the `inplace` keyword altogether @@ -348,7 +320,7 @@ DataFrames. Therefore, we decided to keep the `inplace` keyword for this small s ### Standardize on the `copy` keyword instead of `inplace` It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the `copy` -keyword already returns a new object instead of None (enabling method chaining) when it is set to `True`. 
+keyword already returns a new object instead of None (enabling method chaining) and avoids a coopy when it is set to `False`. However, the `copy` keyword is not supported in any of the values-mutating methods listed in Group 2 above unlike `inplace`, so semantics of future inplace mutation of values align better with the current behavior of @@ -356,13 +328,14 @@ the `inplace` keyword, than with the current behavior of the `copy` keyword. Furthermore, with the Copy-on-Write proposal, the `copy` keyword also has become superfluous. With Copy-on-Write enabled, methods that return a new pandas object will always try to avoid a copy whenever possible, regardless of -a `copy=False` keyword. Thus, the general proposal is to actually remove the `copy` keyword from the methods where it is -currently used. - -Currently, for methods where it is supported, when the `copy` keyword is `False`, a new pandas object (same -as `copy=True`) is returned as the result of a method call, with the values backing the object being shared when -possible. With the proposed inplace behavior, current behavior of `copy=False` would return a new pandas object with -identical values as the original object(that was modified inplace), which may be confusing for users, and lead to +a `copy=False` keyword. Thus, the Copy-on-Write PDEP proposes to actually remove the `copy` keyword from the methods +where it is currently used (so it would be strange to add this as a new keyword to the Group 2 methods). + +Currently, when using `copy=False` in methods where it is supported, a new pandas object is returned as the result +of a method call (same as with `copy=True`), but with the values backing this object being shared with the calling +object when possible (but the calling object is never modified). 
With the proposed inplace behavior for Group 2 methods, +a potential `copy=False` option would return a new pandas object with identical values as the original object (that +was modified inplace, in contrast to current usage of `copy=False`), which may be confusing for users, and lead to ambiguity with Copy on Write rules. ## History From 2ca875abca4bd4850e9d86c2d6f948ab985cd91b Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 20 Mar 2023 15:08:21 +0100 Subject: [PATCH 07/13] explain values-inplace vs object-inplace --- .../pdeps/0008-inplace-methods-in-pandas.md | 47 +++++++++++++++---- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 33f2a091d5d72..3db785b5e6ff0 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -83,7 +83,32 @@ Many methods in pandas currently have the ability to perform an operation inplac as ``DataFrame.insert`` only support inplace operations, while other methods use the `inplace` keyword to control whether an operation is done inplace or not. -Unfortunately, many methods supporting the ``inplace`` keyword either cannot be done inplace, or make a copy as a +While we generally speak about "inplace" operations, this term is used in various context. Broadly speaking, +for this PDEP, we can distinguish two kinds of "inplace" operations: + +* **"values-inplace"**: an operation that updates the underlying values of a Series or DataFrame columns inplace + (without making a copy of the array). 
+ + As illustration, an example of such a values-inplace operation without using a method: + + :::python + # if the dtype is compatible, this setitem operation updates the underlying array inplace + df.loc[0, "col"] = val + +* **"object-inplace"**: an operation that updates a pandas DataFrame or Series _object_ inplace, but without + updating existing column values inplace. + + As illustration, an example of such an object-inplace operation without using a method: + + :::python + # we replace the Index on `df` inplace, but without actually updating any existing array + df.index = pd.Index(...) + + Object-inplace operations, while not actually modifying existing column values, keep + (a subset of) those columns and thus can avoid copying the data of those existing columns. + +In addition, several methods supporting the ``inplace`` keyword cannot actually be done inplace (in neither meaning) +because they make a copy as a consequence of the operations they perform, regardless of whether ``inplace`` is ``True`` or not. This, coupled with the fact that the ``inplace=True`` changes the return type of a method from a pandas object to ``None``, makes usage of the ``inplace`` keyword confusing and non-intuitive. @@ -98,12 +123,13 @@ an ``inplace`` keyword into 4 groups: | ``insert`` | | ``pop`` | | ``update`` | -| ``isetitem``* | +| ``isetitem`` | -\* Although ``isetitem`` operates on the original pandas object inplace, it will not change any existing values -inplace (it will remove the values of the column being set, and insert new values). +This group encompasses both kinds of inplace: `update` can be values-inplace, while the others are object-inplace +(for example, although ``isetitem`` operates on the original pandas object inplace, +it will not change any existing values inplace; rather it will remove the values of the column being set, and insert new values). 
-**Group 2: Methods that modify the underlying data of the DataFrame/Series object and can be done inplace** +**Group 2: Methods that modify the underlying data of the DataFrame/Series object ("values-inplace")** | Method Name | |:----------------| @@ -121,7 +147,7 @@ These methods don't operate inplace by default, but can be done inplace with `in the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. -**Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values** +**Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values ("object-inplace")** | Method Name | |:----------------------------| @@ -135,7 +161,7 @@ These methods can change the structure of the DataFrame or Series, such as chang columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing underlying column data of the object. -All those methods make a copy of the full data by default, but can be performed inplace with +All those methods make a copy of the full data by default, but can be performed object-inplace with avoiding copying all data (currently enabled with specifying `inplace=True`). Note: there are also methods that have a `copy` keyword instead of an `inplace` keyword (e.g. `set_axis`). This serves @@ -159,7 +185,8 @@ operation requires copying (such as reordering or dropping rows). For those meth syntactic sugar for reassigning the new result to the calling DataFrame/Series. Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not -need to perform a copy. This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an +need to perform a copy and could be considered as "object-inplace" in that case. 
+This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an implementation detail for the purpose of this PDEP. ### Proposed changes and reasoning @@ -171,7 +198,7 @@ potentially return a shallow copy of the input object, if the performed operatio equivalent to the behavior with `inplace=True` for those methods. If users want to make a hard copy, they can call the `copy()` method on the result of the operation. -Therefore, there is no benefit of keeping the keywords around for these methods. +Therefore, there is no benefit of keeping the keyword around for these methods. To emulate behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable: @@ -320,7 +347,7 @@ DataFrames. Therefore, we decided to keep the `inplace` keyword for this small s ### Standardize on the `copy` keyword instead of `inplace` It may seem more natural to standardize on the `copy` keyword instead of the `inplace` keyword, since the `copy` -keyword already returns a new object instead of None (enabling method chaining) and avoids a coopy when it is set to `False`. +keyword already returns a new object instead of None (enabling method chaining) and avoids a copy when it is set to `False`. 
However, the `copy` keyword is not supported in any of the values-mutating methods listed in Group 2 above unlike `inplace`, so semantics of future inplace mutation of values align better with the current behavior of From 762f4cb5d70c09ba1c0e10695ca57cfc651e950c Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 27 Oct 2023 15:01:29 +0200 Subject: [PATCH 08/13] Some textual edits to Proposed changes section --- .../pdeps/0008-inplace-methods-in-pandas.md | 48 +++++++++++-------- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 3db785b5e6ff0..9a549f3a2bcb8 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -46,7 +46,7 @@ Generally, we assume that people use the keyword for the following reasons: For the first reason: efficiency is an important aspect. However, in practice it is not always the case that `inplace=True` improves anything. Some of the methods with an `inplace` keyword can actually work inplace, but -others still make a copy under the hood anyway. In addition, with the introduction of Copy-on-Write, there are now other +others still make a copy under the hood anyway. In addition, with the introduction of Copy-on-Write ([PDEP-7](^1)), there are now other ways to avoid making unnecessary copies by default (without needing to specify a keyword). The next section gives a detailed overview of those different cases. @@ -63,12 +63,12 @@ Finally, there are also methods that have a `copy` keyword instead of an `inplac the data when `copy=False`, but returns a new object referencing the same data instead of updating the calling object), adding to the inconsistencies. This keyword is also redundant now with the introduction of Copy-on-Write. 
-Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword (except -for a small subset of methods that can actually update data inplace). Removing those keywords will give a more +Given the above reasons, we are convinced that there is no need for neither the `inplace` nor the `copy` keyword, except +for a small subset of methods that can actually update data inplace. Removing those keywords will give a more consistent and less confusing API. Removing the `copy` keyword is covered by PDEP-7 about Copy-on-Write, and this PDEP will focus on the `inplace` keyword. -Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of operations +Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of methods consistent, and compatible with Copy-on-Write. Note: there are also operations (not methods) that work inplace in pandas, such as indexing ( @@ -101,8 +101,12 @@ for this PDEP, we can distinguish two kinds of "inplace" operations: As illustration, an example of such an object-inplace operation without using a method: :::python - # we replace the Index on `df` inplace, but without actually updating any existing array + # we replace the Index on `df` inplace, but without actually + # updating any existing array df.index = pd.Index(...) + # we update the DataFrame inplace, but by completely replacing a column, + # not by mutating the existing column's underlying array + df["col"] = new_values Object-inplace operations, while not actually modifying existing column values, keep (a subset of) those columns and thus can avoid copying the data of those existing columns. 
@@ -142,7 +146,7 @@ it will not change any existing values inplace; rather it will remove the values | ``bfill`` | | ``clip`` | -These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible +These methods don't operate inplace by default, but can be done inplace with `inplace=True` _if_ the dtypes are compatible (e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. @@ -180,7 +184,7 @@ return a new object referencing the same data. | `eval` | | `query` | -Although these methods have the `inplace` keyword, they can never operate inplace because the nature of the +Although these methods have the `inplace` keyword, they can never operate inplace, in neither meaning, because the nature of the operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just syntactic sugar for reassigning the new result to the calling DataFrame/Series. @@ -191,16 +195,21 @@ implementation detail for the purpose of this PDEP. ### Proposed changes and reasoning -The methods from group 1 won't change behavior, and will remain always inplace. +The methods from **group 1 (always inplace, no keyword)** won't change behavior, and will remain always inplace. -Methods in groups 3 and 4 will lose their `inplace` keyword. Under Copy-on-Write, every operation will -potentially return a shallow copy of the input object, if the performed operation does not require a copy of the data. This is -equivalent to the behavior with `inplace=True` for those methods. If users want to make a hard -copy, they can call the `copy()` method on the result of the operation. +For methods from **group 3 (object-inplace)**, the `inplace=True` keyword can currently be +used to avoid a copy. 
However, with the introduction of Copy-on-Write, every operation +will potentially return a shallow copy of the input object by default (if the performed +operation does not require a copy of the data). This future default is therefore +equivalent to the behavior with `inplace=True` for those methods (minus the return +value), and therefore we propose to remove the `inplace` keyword for this group of +methods. -Therefore, there is no benefit of keeping the keyword around for these methods. +For methods from **group 4 (never inplace)**, the `inplace` keyword has no actual effect +(except for re-assigning to the calling variable) and is effectively syntactic sugar for +manually re-assigning. For this group, we propose to remove the `inplace` keyword. -To emulate behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable: +For the above reasoning, we think there is no benefit of keeping the keyword around for these methods. To emulate behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -208,9 +217,9 @@ To emulate behavior of the `inplace` keyword, we can reassign the result of an o df.iloc[0, 1] = ... All references to the original object will go out of scope when the result of the `reset_index` operation is assigned -to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied. +to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied (with Copy-on-Write). -Group 2 methods differ, though, since they only modify the underlying data, and therefore can be inplace. +**Group 2 (values-inplace)** methods differ, though, since they only modify the underlying data, and therefore can be inplace. 
:::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -220,8 +229,9 @@ If we follow the rules of Copy-on-Write[^1] where "any subset or returned series the original, and thus never modifies the original", then there is no way of doing this operation inplace by default. The original object would be modified before the reference goes out of scope. -To avoid triggering a copy when a value would actually get replaced, we will keep the `inplace` argument for those -methods. +For this case, an `inplace=True` option can have an actual benefit, i.e. allowing to +avoid triggering a copy when a value would get replaced. Therefore, we propose to keep +the `inplace` argument for those methods. ### Open Questions @@ -329,7 +339,7 @@ Removing the `inplace` keyword is a breaking change, but since the affected beha behaviour when not specifying the keyword (i.e. `inplace=False`) will not change and the keyword itself can first be deprecated before it is removed. -## Rejected ideas +## Rejected alternatives ### Remove the `inplace` keyword altogether From eb4f6f83aae84610acb14248699e663fe71362f2 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 27 Oct 2023 15:31:23 +0200 Subject: [PATCH 09/13] clarify example of replace method and relationship with CoW rules --- .../pdeps/0008-inplace-methods-in-pandas.md | 54 +++++++++++++------ 1 file changed, 38 insertions(+), 16 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 9a549f3a2bcb8..f568b205e173e 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -133,7 +133,7 @@ This group encompasses both kinds of inplace: `update` can be values-inplace, wh (for example, although ``isetitem`` operates on the original pandas object inplace, it will not change any existing values inplace; rather it will remove the values of the column being set, and insert new values). 
-**Group 2: Methods that modify the underlying data of the DataFrame/Series object ("values-inplace")** +**Group 2: Methods that can modify the underlying data of the DataFrame/Series object ("values-inplace")** | Method Name | |:----------------| @@ -151,7 +151,7 @@ These methods don't operate inplace by default, but can be done inplace with `in the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of the DataFrame or Series. -**Group 3: Methods that modify the DataFrame/Series object, but not the pre-existing values ("object-inplace")** +**Group 3: Methods that can modify the DataFrame/Series object, but not the pre-existing values ("object-inplace")** | Method Name | |:----------------------------| @@ -197,41 +197,63 @@ implementation detail for the purpose of this PDEP. The methods from **group 1 (always inplace, no keyword)** won't change behavior, and will remain always inplace. +For methods from **group 4 (never inplace)**, the `inplace` keyword has no actual effect +(except for re-assigning to the calling variable) and is effectively syntactic sugar for +manually re-assigning. For this group, we propose to remove the `inplace` keyword. + For methods from **group 3 (object-inplace)**, the `inplace=True` keyword can currently be used to avoid a copy. However, with the introduction of Copy-on-Write, every operation will potentially return a shallow copy of the input object by default (if the performed operation does not require a copy of the data). This future default is therefore equivalent to the behavior with `inplace=True` for those methods (minus the return -value), and therefore we propose to remove the `inplace` keyword for this group of -methods. - -For methods from **group 4 (never inplace)**, the `inplace` keyword has no actual effect -(except for re-assigning to the calling variable) and is effectively syntactic sugar for -manually re-assigning. 
For this group, we propose to remove the `inplace` keyword. +value). -For the above reasoning, we think there is no benefit of keeping the keyword around for these methods. To emulate behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable: +Given the above reasoning, we see no benefit in keeping the keyword around for +these methods. To emulate the behavior of the `inplace` keyword, we can reassign the result +of an operation to the same variable: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) df = df.reset_index() df.iloc[0, 1] = ... -All references to the original object will go out of scope when the result of the `reset_index` operation is assigned +All references to the original object will go out of scope when the result of the `reset_index` operation is re-assigned to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied (with Copy-on-Write). -**Group 2 (values-inplace)** methods differ, though, since they only modify the underlying data, and therefore can be inplace. +**Group 2 (values-inplace)** methods differ, though, since they modify the underlying +data, and therefore can be actually happen inplace: + + :::python + df = pd.DataFrame({"foo": [1, 2, 3]}) + df.replace(to_replace=1, value=100, inplace=True) + +Currently, the above updates `df` values-inplace, without requiring a copy of the data. +For this type of method, however, we can _not_ emulate the above usage of `inplace` by +re-assigning: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) df = df.replace(to_replace=1, value=100) -If we follow the rules of Copy-on-Write[^1] where "any subset or returned series/dataframe always behaves as a copy of -the original, and thus never modifies the original", then there is no way of doing this operation inplace by default. -The original object would be modified before the reference goes out of scope.
+If we follow the rules of Copy-on-Write[^1] where "any subset or returned +series/dataframe always behaves as a copy of the original, and thus never modifies the +original", then there is no way of doing this operation inplace by default, because the +original object `df` would be modified before the reference goes out of scope (pandas +does not know whether you will re-assign it to `df` or assign it to another variable). +That would violate the Copy-on-Write rules, and therefore the `replace()` method in the +example always needs to make a copy of the underlying data by default. For this case, an `inplace=True` option can have an actual benefit, i.e. allowing to -avoid triggering a copy when a value would get replaced. Therefore, we propose to keep -the `inplace` argument for those methods. +avoid a data copy. Therefore, we propose to keep the `inplace` argument for this +group of methods. + +Summarizing for the `inplace` keyword, we propose to: + +- Keep the `inplace` keyword for this subset of methods (group 2) that can update the + underlying values inplace ("values-inplace") +- Remove the `inplace` keyword from all other methods that either can never work inplace + (group 4) or only update the object (group 3, "object-inplace", which can be emulated + with reassigning).
### Open Questions From 1a4605d23ee6ced84200e117d99afd2d2c0ee45e Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Fri, 27 Oct 2023 16:55:26 +0200 Subject: [PATCH 10/13] Update PDEP8 text around open questions --- .../pdeps/0008-inplace-methods-in-pandas.md | 26 +++++++++++-------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index f568b205e173e..1dca5e9611429 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -255,7 +255,7 @@ Summarizing for the `inplace` keyword, we propose to: (group 4) or only update the object (group 3, "object-inplace", which can be emulated with reassigning). -### Open Questions +### Other design questions #### With `inplace=True`, should we silently copy or raise an error if the data has references? @@ -290,12 +290,13 @@ lazy copy" to be triggered under Copy-on-Write. It is also hard to fix, adding a with `inplace=True` might actually be worse than triggering the copy under the hood. We would only copy columns that share data with another object, not the whole object like `.copy()` would. -There is another possible variant, which would be to trigger the copy (like the first option), but have an option to -raise a warning whenever this happens. +**Therefore, we propose to silently copy when needed.** The `inplace=True` option would thus mean "try inplace whenever possible", and not guarantee it is actually done inplace. + +In the future, if there is demand for it, it could still be possible to add an option to raise a warning whenever this happens. This would be useful in an IPython shell/Jupyter Notebook setting, where the user would have the opportunity to delete unused references that are causing the copying to be triggered.
-For example, +For example: :::ipython In [1]: import pandas as pd @@ -334,16 +335,16 @@ was not inplace, since it is possible to go out of memory because of this. #### Return the calling object (`self`) also when using `inplace=True`? -The downsides of keeping the `inplace=True` option for certain methods, are that the return type of those methods will -now depend on the value of `inplace`, and that method chaining will no longer work. - -One way around this is to have the method return the original object that was operated on inplace when `inplace=True`. +One of the downsides of the `inplace=True` option is that the return type of those methods +depends on the value of `inplace`, and that method chaining does not work. +Those downsides are still relevant for the cases where we keep `inplace=True`. +To address this, we can have those methods return the object that was operated on +inplace when `inplace=True`. Advantages: - It enables to use inplace operations in a method chain - It simplifies type annotations -- It enables to change the default for `inplace` to True under Copy-on-Write Disadvantages: @@ -352,8 +353,11 @@ Disadvantages: returned (`df2 = df.method(inplace=True); assert df2 is df`) - It would change the behaviour of the current `inplace=True` -Given that `inplace` is already widely used by the pandas community, we would like to collect feedback about what the -expected return type should be. Therefore, we will defer a decision on this until a later revision of this PDEP. +We generally assume that changing to return `self` should not cause many problems for +existing usage (typically, the current return value of `None` is not actively used). +Further, we think the advantages of simplifying return types and enabling method chains +outweigh the special case of returning an identical object.
+**Therefore, we propose that for those methods with an `inplace=True` option, the calling object (`self`) gets returned.** ## Backward compatibility From 733e06af309c924bcf5228574dcfa50f070b9e88 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 Nov 2023 13:57:04 +0100 Subject: [PATCH 11/13] update timeline for deprecation/removal --- .../pdeps/0008-inplace-methods-in-pandas.md | 27 ++++++++++++------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 1dca5e9611429..b7bae211ce7e7 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -409,16 +409,23 @@ It may be helpful to review those discussions (see links) [^2] [^3] [^4] to bett ## Timeline -Copy-on-Write is a relatively new feature (added in version 1.5) and some methods are missing the "lazy copy" -optimization (equivalent to `copy=False`). - -Therefore, we propose deprecating the `copy` and `inplace` parameters in pandas 2.1, to -allow for bugs with Copy-on-Write to be addressed and for more optimizations to be added. - -Hopefully, users will be able to switch to Copy-on-Write to keep the no-copy behavior and to silence the warnings. - -The full removal of the `copy` parameter and `inplace` (where necessary) is set for pandas 3.0, which will coincide -with the enablement of Copy-on-Write for pandas by default. +The `inplace` keyword is widely used, and thus we need to take considerable time to +deprecate and remove this feature. 
+ +- For those methods where the `inplace` keyword will be removed, we add a + DeprecationWarning in the first release after acceptance (2.2 if possible, otherwise + 3.0) +- Together with enabling Copy-on-Write in the pandas 3.0 major release, we already + update those methods that will keep the `inplace` keyword with the new behaviour + (returning `self`, working inplace when possible) +- Somewhere during the 3.x release cycle (e.g. in 3.1, depending on when the deprecation + was started), we change the DeprecationWarning to a more visible FutureWarning. +- The deprecated keyword is removed in pandas 4.0. + +When introducing the warning in 2.2 (or 3.0), users will already have the ability to +enable Copy-on-Write so they can rewrite their code in a way that avoids the deprecation +warning (remove the usage of `inplace`) while keeping the no-copy behaviour (which will +be the default with Copy-on-Write). ## PDEP History From 04ad61e3ead7d2b18ee3284c15f2153386079505 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 13 Nov 2023 14:20:01 +0100 Subject: [PATCH 12/13] update abstract --- .../pdeps/0008-inplace-methods-in-pandas.md | 33 ++++++++----------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index b7bae211ce7e7..98a404311fb7c 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -12,25 +12,18 @@ This PDEP proposes that: -- The ``inplace`` parameter will be removed from any methods that never can be done inplace -- The ``inplace`` parameter will also be removed from any methods that modify the shape of a pandas object's values or - don't modify the internal values of a pandas object at all. -- In contrast, the ``inplace`` parameter will be kept for any methods that only modify the underlying data of a pandas - object. 
- - For example, the ``fillna`` method would retain its ``inplace`` keyword, while ``dropna`` (potentially shrinks the - length of a ``DataFrame``/``Series``) and ``rename`` (alters labels not values) would lose their ``inplace`` - keywords - - For those methods, since Copy-on-Write behavior will lazily copy if the result is unchanged, users should reassign - to the same variable to imitate behavior of the ``inplace`` keyword. - e.g. ``df = df.dropna()`` for a DataFrame with no null values. -- The ``copy`` parameter will also be removed, except in constructors and in functions/methods that convert array-likes - to pandas objects (e.g. the ``pandas.array`` function) and functions/methods that export pandas objects to other data - types (e.g. ``DataFrame/Series.to_numpy`` method). -- Open Questions - (These questions are deferred to a later revision, and will not affect the acceptance process of this PDEP.) - - Should ``inplace=True`` return the original pandas object that was operated inplace on? - - What should happen when ``inplace=True`` but the original pandas object cannot be operated inplace on because it - shares its values with another pandas object? +- The ``inplace`` parameter will be removed from any method which can never update the + underlying values of a pandas object inplace or which alters the shape of the object, + and where the `inplace=True` option is only syntactic sugar for reassigning the result + to the calling DataFrame/Series. +- As a consequence, the `inplace` parameter is only kept for those methods that can + modify the underlying values of a pandas object inplace, such as `fillna` or `replace`. +- With the introduction of Copy-on-Write ([PDEP-7](^1)), users don't need the `inplace` + keyword to avoid a copy of the data. +- For those methods that will keep the `inplace=True` option: + - the method will attempt to do the operation inplace but still silently copy + when needed (for Copy-on-Write), i.e.
there is no guarantee it is actually done inplace. + - the method will return the calling object (`self`), instead of the current `None`. ## Motivation and Scope @@ -433,7 +426,7 @@ be the default with Copy-on-Write). ## References -[^1]: [Copy on Write Specification](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u) +[^1]: [Copy on Write Specification](https://pandas.pydata.org/pdeps/0007-copy-on-write.html) [^2]: [^3]: [^4]: From 4bbd02f06bd3d6daead56297d9cffe798351404d Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Mon, 4 Dec 2023 22:14:03 +0100 Subject: [PATCH 13/13] Apply suggestions from code review Co-authored-by: Irv Lustig --- web/pandas/pdeps/0008-inplace-methods-in-pandas.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md index 98a404311fb7c..db37397914d5d 100644 --- a/web/pandas/pdeps/0008-inplace-methods-in-pandas.md +++ b/web/pandas/pdeps/0008-inplace-methods-in-pandas.md @@ -104,7 +104,7 @@ for this PDEP, we can distinguish two kinds of "inplace" operations: Object-inplace operations, while not actually modifying existing column values, keep (a subset of) those columns and thus can avoid copying the data of those existing columns. -In addition, several methods supporting the ``inplace`` keyword cannot actually be done inplace (in neither meaning) +In addition, several methods supporting the ``inplace`` keyword cannot actually be done inplace (in either meaning) because they make a copy as a consequence of the operations they perform, regardless of whether ``inplace`` is ``True`` or not. This, coupled with the fact that the ``inplace=True`` changes the return type of a method from a pandas object to ``None``, makes usage of @@ -177,7 +177,7 @@ return a new object referencing the same data. 
| `eval` | | `query` | -Although these methods have the `inplace` keyword, they can never operate inplace, in neither meaning, because the nature of the +Although these methods have the `inplace` keyword, they can never operate inplace, in either meaning, because the nature of the operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just syntactic sugar for reassigning the new result to the calling DataFrame/Series. @@ -214,7 +214,7 @@ All references to the original object will go out of scope when the result of th to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied (with Copy-on-Write). **Group 2 (values-inplace)** methods differ, though, since they modify the underlying -data, and therefore can be actually happen inplace: +data, and therefore can actually happen inplace: :::python df = pd.DataFrame({"foo": [1, 2, 3]}) @@ -382,7 +382,7 @@ However, the `copy` keyword is not supported in any of the values-mutating metho unlike `inplace`, so semantics of future inplace mutation of values align better with the current behavior of the `inplace` keyword, than with the current behavior of the `copy` keyword. -Furthermore, with the Copy-on-Write proposal, the `copy` keyword also has become superfluous. With Copy-on-Write +Furthermore, with the approved Copy-on-Write proposal, the `copy` keyword also has become superfluous. With Copy-on-Write enabled, methods that return a new pandas object will always try to avoid a copy whenever possible, regardless of a `copy=False` keyword. Thus, the Copy-on-Write PDEP proposes to actually remove the `copy` keyword from the methods where it is currently used (so it would be strange to add this as a new keyword to the Group 2 methods).