Optionally disallow duplicate labels #28394

Merged
merged 49 commits into from
Sep 3, 2020

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Sep 11, 2019

This adds a property to NDFrame to disallow duplicate labels. This fixes a vexing issue with using pandas for ETL pipelines, where accidentally introducing duplicate labels can lead to confusing downstream behavior (e.g. NDFrame.__getitem__ not reducing dimensionality).

When set (via the constructor with allow_duplicate_labels=False or afterward via .allows_duplicate_labels=False), the presence of duplicate labels causes a DuplicateLabelError exception to be raised:

In [3]: df = pd.DataFrame({"A": [1, 2]}, index=['a', 'a'], allow_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-3-1c8833763dfc> in <module>
----> 1 df = pd.DataFrame({"A": [1, 2]}, index=['a', 'a'], allow_duplicate_labels=False)

~/sandbox/pandas/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy, allow_duplicate_labels)
    493
    494         NDFrame.__init__(
--> 495             self, mgr, fastpath=True, allow_duplicate_labels=allow_duplicate_labels
    496         )
    497

~/sandbox/pandas/pandas/core/generic.py in __init__(self, data, axes, copy, dtype, allow_duplicate_labels, fastpath)
    202
    203         if not self.allows_duplicate_labels:
--> 204             self._maybe_check_duplicate_labels()
    205
    206     def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):

~/sandbox/pandas/pandas/core/generic.py in _maybe_check_duplicate_labels(self, force)
    240         if force or not self.allows_duplicate_labels:
    241             for ax in self.axes:
--> 242                 ax._maybe_check_unique()
    243
    244     @property

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    540             # TODO: position, value, not too large.
    541             msg = "Index has duplicates."
--> 542             raise DuplicateLabelError(msg)
    543
    544     # --------------------------------------------------------------------

DuplicateLabelError: Index has duplicates.
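
The check shown in the traceback above can be modelled without pandas. The following is a minimal sketch under stated assumptions: `LabeledSeries` and its `_check_unique` helper are hypothetical names invented for illustration, and the toy `DuplicateLabelError` merely mirrors the exception name used in this PR. It is not the pandas implementation.

```python
from collections import Counter


class DuplicateLabelError(ValueError):
    """Toy stand-in for the exception named in this PR."""


class LabeledSeries:
    """Hypothetical container: values plus one axis of labels."""

    def __init__(self, values, index, allow_duplicate_labels=True):
        self.values = list(values)
        self.index = list(index)
        self.allows_duplicate_labels = allow_duplicate_labels
        # Only validate when the caller asked for unique labels.
        if not self.allows_duplicate_labels:
            self._check_unique()

    def _check_unique(self):
        # Count each label; any count above one is a duplicate.
        dupes = sorted(k for k, n in Counter(self.index).items() if n > 1)
        if dupes:
            raise DuplicateLabelError(f"Index has duplicates: {dupes}")


ok = LabeledSeries([1, 2], index=["a", "b"], allow_duplicate_labels=False)

try:
    LabeledSeries([1, 2], index=["a", "a"], allow_duplicate_labels=False)
    raised = False
except DuplicateLabelError:
    raised = True
```

With the default `allow_duplicate_labels=True` the same duplicated index is accepted, which matches the backward-compatible default described in this proposal.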

This property is preserved through operations (using _metadata and __finalize__).

In [4]: df = pd.DataFrame(index=['a', 'A'], allow_duplicate_labels=False)

In [5]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-5-17c8fb0b7c7f> in <module>
----> 1 df.rename(str.upper)
...

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    540             # TODO: position, value, not too large.
    541             msg = "Index has duplicates."
--> 542             raise DuplicateLabelError(msg)
    543
    544     # --------------------------------------------------------------------

DuplicateLabelError: Index has duplicates.
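
To illustrate how a flag can survive an operation and be re-checked on the result, here is a toy model of the propagation idea. `Frame`, `_finalize` and `_validate` are hypothetical names; the real mechanism in this PR goes through `NDFrame.__finalize__`, whose details are not reproduced here.

```python
class DuplicateLabelError(ValueError):
    """Toy stand-in for the exception named in this PR."""


class Frame:
    """Hypothetical one-axis container used only for illustration."""

    def __init__(self, index, allows_duplicate_labels=True):
        self.index = list(index)
        self.allows_duplicate_labels = allows_duplicate_labels
        self._validate()

    def _validate(self):
        has_dupes = len(set(self.index)) != len(self.index)
        if has_dupes and not self.allows_duplicate_labels:
            raise DuplicateLabelError("Index has duplicates.")

    def _finalize(self, other):
        # Copy the flag from the source object, then re-check the new labels.
        self.allows_duplicate_labels = other.allows_duplicate_labels
        self._validate()
        return self

    def rename(self, func):
        # Build the result permissively, then let _finalize enforce the flag.
        result = Frame([func(label) for label in self.index])
        return result._finalize(self)


strict = Frame(["a", "A"], allows_duplicate_labels=False)
renamed = strict.rename(lambda label: label + "_x")  # labels stay unique

try:
    strict.rename(str.upper)  # "a" and "A" both become "A"
    raised = False
except DuplicateLabelError:
    raised = True
```

The design point is that the operation itself does not need to know about the flag; only the finalization step does.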

API design questions

  • Do we want to be positive or negative?
pd.Series(..., allow_duplicate_labels=True/False)
pd.Series(..., disallow_duplicate_labels=False/True)
pd.Series(..., require_unique_labels=False/True)
  • In my proposal, the argument is different from the property. Do we like that? The rationale is that the argument is making a statement about what to do, while the property is making a statement about what is allowed.
In [7]: s = pd.Series(allow_duplicate_labels=False)

In [8]: s.allows_duplicate_labels
Out[8]: False
  • I'd like a method-chainable way of saying "duplicates aren't allowed." Some options:
# two methods.
s.disallow_duplicate().allow_duplicate()

I dislike that in combination with .allows_duplicates, since we'd have a property and a method that only differ by an s. Perhaps something like

s.update_metadata(allows_duplicates=False)

But do people know that "allows_duplicates" is considered metadata?


TODO:

  • many, many more tests
  • Propagate metadata in more places (groupby, rolling, etc.)
  • Backwards compatibility in __finalize__ (we may be changing when we call it, and what we pass)

Apologies for using the PR for discussion, but we needed a way to compare this to #28334.

pandas/core/generic.py (outdated, resolved)
@TomAugspurger
Contributor Author

I think this is ready for an initial review, to decide if this direction / behavior is OK. I'd recommend starting with the new docs section for the behavior, and NDFrame.__finalize__ for the implementation.

cc @jorisvandenbossche @jreback @jbrockmendel @jschendel @datapythonista and others, I'm sure.

@datapythonista
Member

This looks very useful, thanks for the work on this. I can't think of a better API, so happy with your proposal.

Just wondering if in the future we may want allow_duplicate_values=False by default, but that's surely something to leave for later and keep backward-compatibility here.

@TomAugspurger
Contributor Author

Just wondering if in the future we may want allow_duplicate_values=False by default

Yeah I think that's a possibility. But yes let's leave it for later :)

or column labels. This may be a bit confusing at first. If you're familiar with
SQL, you know that row labels are similar to a primary key on a table, and you
would never want duplicates in a SQL table. But one of pandas' roles is to clean
mess, real-world data before it goes to some downstream system. And real-world
Member

mess --> messy?

@jschendel
Member

I'm +1 on this.

A couple passing comments:

  • I found it a little bit confusing that the constructor parameter is allow_duplicate_labels but the attribute name is slightly different in allows_duplicate_labels. My preference here would be to make these the same.
  • Is there any utility in making this configurable at the axis level, i.e. allow a dataframe to have duplicate index labels but not have duplicate column labels?
    • I don't think this is terribly important but anticipate that we'll get a user request for that feature, so probably best to bring this up now.

One thing I noticed is that this doesn't appear to validate when you directly assign a new index:

In [2]: s = pd.Series(list('abc'), allow_duplicate_labels=False)

In [3]: s.index = [0, 0, 0]

In [4]: s
Out[4]: 
0    a
0    b
0    c
dtype: object

In [5]: s.allows_duplicate_labels
Out[5]: False
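
One way direct assignment could be guarded is to route it through a validating setter. This is a sketch only: `GuardedSeries` is a hypothetical class, and nothing here claims to describe how pandas handles `s.index = ...`.

```python
class DuplicateLabelError(ValueError):
    """Toy stand-in for the exception named in this PR."""


class GuardedSeries:
    """Hypothetical container whose index setter honours the flag."""

    def __init__(self, values, allows_duplicate_labels=True):
        self.values = list(values)
        self.allows_duplicate_labels = allows_duplicate_labels
        self._index = list(range(len(self.values)))

    @property
    def index(self):
        return self._index

    @index.setter
    def index(self, labels):
        labels = list(labels)
        if len(labels) != len(self.values):
            raise ValueError("Length mismatch between labels and values.")
        has_dupes = len(set(labels)) != len(labels)
        if has_dupes and not self.allows_duplicate_labels:
            raise DuplicateLabelError("Index has duplicates.")
        # Only replace the stored labels once every check has passed.
        self._index = labels


s = GuardedSeries("abc", allows_duplicate_labels=False)
try:
    s.index = [0, 0, 0]
    raised = False
except DuplicateLabelError:
    raised = True
```

Because the check runs before the assignment, a rejected update leaves the original labels in place.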

@jbrockmendel
Member

Once a kwarg is in the Series constructor it can be difficult to get it out (see: fastpath). Could we put this in stealth mode for a while with something like:

ser = pd.Series(call_constructor_like_always)
ser._allows_duplicate_labels = False

Not a strong opinion, just thinking out loud.

@WillAyd
Member

WillAyd commented Sep 13, 2019

Just out of curiosity why do you think adding this as a keyword argument to the Series / DataFrame constructor is the right approach? Would that subsequently guard against doing something like ser.index = list("aaa") after construction?

@TomAugspurger TomAugspurger added this to the 1.0 milestone Sep 13, 2019
@TomAugspurger TomAugspurger added metadata _metadata, .attrs API Design labels Sep 13, 2019
@TomAugspurger
Contributor Author

I found it a little bit confusing that the constructor parameter is allow_duplicate_labels but the attribute name is slightly different in allows_duplicate_labels. My preference here would be to make these the same.

I share this concern. Do others? After using it for a few days I'm getting used to it, but still need to stop and think.

Is there any utility in making this configurable at the axis level, i.e. allow a dataframe to have duplicate index labels but not have duplicate column labels?

Yes. My backwards-compatible plan here is to have the setter allow True/False/index/columns.

* True : allows duplicates everywhere
* False : allows duplicates nowhere
* index : allows duplicates in the index, but not in the columns
* columns : allows duplicates in the columns, but not in the index.

Code-wise, it's not much effort to support. It just makes the logic a bit more complicated, so I'm holding off for now.
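
The four proposed values can be read as a small lookup from setting to "axes that must be unique". The sketch below assumes exactly the semantics in the list above (the string names the axis where duplicates remain allowed); `axes_to_check` and `find_violations` are hypothetical helpers, not part of pandas.

```python
def axes_to_check(setting):
    """Return the axis names that must be unique under `setting`.

    `setting` is one of True, False, "index" or "columns". A string
    names the axis where duplicates are still allowed, so the *other*
    axis is the one to check.
    """
    if setting is True:
        return []
    if setting is False:
        return ["index", "columns"]
    if setting == "index":
        return ["columns"]
    if setting == "columns":
        return ["index"]
    raise ValueError(f"Unknown setting: {setting!r}")


def find_violations(index, columns, setting):
    """List the axes whose labels break the rule (hypothetical helper)."""
    labels = {"index": list(index), "columns": list(columns)}
    return [
        name
        for name in axes_to_check(setting)
        if len(set(labels[name])) != len(labels[name])
    ]
```

For example, `find_violations(["a", "a"], ["x", "y"], "index")` reports nothing, because duplicated row labels are permitted under `"index"`, while the same data under `"columns"` or `False` reports the index.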

Once a kwarg is in the Series constructor it can be difficult to get it out (see: fastpath). Could we put this in stealth mode for a while with something like:

Do we anticipate needing to remove this? Certainly, it can be marked as experimental if there's trepidation.

Just out of curiosity why do you think adding this as a keyword argument to the Series / DataFrame constructor is the right approach?

Initially, I was thinking it'd be a property on Index. I think that's more natural, since it's making a statement about the Index. But users interact more with Series / DataFrame, so I think it's more ergonomic to place it there.

After implementing things, putting it on NDFrame is nice because we have NDFrame.__finalize__. If it were on the index, things would be a bit trickier.

@jbrockmendel
Member

Do we anticipate needing to remove this? Certainly, it can be marked as experimental if there's trepidation.

I don't anticipate needing to remove this, but I do anticipate other conceptually similar flags being implemented. In that scenario, I don't like the idea of all of these becoming kwargs in the Series constructor.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Oct 17, 2019
This aids in the implementation of
pandas-dev#28394. Over there, I'm having
issues with using `NDFrame.__finalize__` to copy attributes, in part
because getattribute on NDFrame is so complicated.

This simplifies this because we only need to look in NDFrame.attrs,
which is just a plain dictionary.

Aside from the addition of a public NDFrame.attrs dictionary, there
aren't any user-facing API changes.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Oct 21, 2019
commit 67a3263
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Oct 18 08:05:04 2019 -0500

    fixup name

commit e6183cd
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Oct 18 07:05:33 2019 -0500

    fixup Index.name

commit d1826bb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Oct 17 13:45:30 2019 -0500

    REF: Store metadata in attrs dict

    This aids in the implementation of
    pandas-dev#28394. Over there, I'm having
    issues with using `NDFrame.__finalize__` to copy attributes, in part
    because getattribute on NDFrame is so complicated.

    This simplifies this because we only need to look in NDFrame.attrs,
    which is just a plain dictionary.

    Aside from the addition of a public NDFrame.attrs dictionary, there
    aren't any user-facing API changes.
@TomAugspurger
Contributor Author

I'm seeing some strange mypy issues with these changes. Lots of

pandas/tests/test_strings.py:206: error: List item 0 has incompatible type "Type[Series]"; expected "Type[PandasObject]"
pandas/tests/test_strings.py:240: error: List item 0 has incompatible type "Type[Series]"; expected "Type[PandasObject]"
...

L206 is @pytest.mark.parametrize("box", [Series, Index]).

In fact, I'm seeing these mypy errors with just the following diff from master

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index ef4e3e064d..7445951536 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -407,6 +407,7 @@ class DataFrame(NDFrame):
         columns: Optional[Axes] = None,
         dtype: Optional[Dtype] = None,
         copy: bool = False,
+        allow_duplicate_labels: bool = True,
     ):
         if data is None:
             data = {}
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index d59ce8db9b..64a38ca4eb 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -207,6 +207,7 @@ class NDFrame(PandasObject, SelectionMixin):
         dtype: Optional[Dtype] = None,
         attrs: Optional[Mapping[Optional[Hashable], Any]] = None,
         fastpath: bool = False,
+        allow_duplicate_labels: bool = True,
     ):
 
         if not fastpath:
diff --git a/pandas/core/series.py b/pandas/core/series.py
index 3e9d3d5c04..8d6beb42c4 100644
--- a/pandas/core/series.py
+++ b/pandas/core/series.py
@@ -203,7 +203,7 @@ class Series(base.IndexOpsMixin, generic.NDFrame):
     # Constructors
 
     def __init__(
-        self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False
+        self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False, allow_duplicate_labels=True,
     ):
 
         # we are called internally, so short-circuit

Any ideas what's going wrong (perhaps @simonjayhawkins or @WillAyd have a guess)?

@WillAyd
Member

WillAyd commented Oct 23, 2019

What version of mypy / pytest are you using? Seems strange for that to pop up if that is your only diff

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Oct 24, 2019
Working around a strange typing issue. See
pandas-dev#28394 (comment)
for more, but the types on these were being inferred incorrectly by
mypy with just the addition of the `allows_duplicate_labels` kwarg.
@TomAugspurger
Contributor Author

pytest 5.2.1
mypy 0.720

Indeed, it's quite strange. Opened #29205 to add explicit types to the tests that were failing.
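
As a rough illustration of what "adding explicit types" to a parametrization list can look like, the sketch below annotates a list of classes with a common base. The class names are hypothetical stand-ins, the actual changes in #29205 are not reproduced here, and the mypy behaviour itself has not been re-checked against this toy code.

```python
from typing import List, Type


class PandasObjectLike:
    """Hypothetical base class standing in for PandasObject."""


class SeriesLike(PandasObjectLike):
    def __init__(self, data=None, index=None, allow_duplicate_labels=True):
        self.data = data


class IndexLike(PandasObjectLike):
    def __init__(self, data=None, name=None):
        self.data = data


# Without the annotation a type checker infers the element type itself;
# with it, the list is declared as holding subclasses of the base class.
boxes: List[Type[PandasObjectLike]] = [SeriesLike, IndexLike]

# Each "box" can still be called like a constructor, as in a parametrized test.
built = [box(["a", "b"]) for box in boxes]
```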

@simonjayhawkins
Member

on master

reveal_type([Series, Index]) gives Revealed type is 'builtins.list[builtins.type*]'

with the changes in this PR..

Revealed type is 'builtins.list[def (data: Any =, Any =, Any =, name: Any =, Any =, Any =, **Any) -> pandas.core.base.PandasObject]'

@simonjayhawkins
Member

same inferred type with mypy 0.730 and 0.740

@jreback jreback removed this from the 1.0 milestone Jan 1, 2020
@jreback
Contributor

jreback commented Jan 1, 2020

@TomAugspurger status of this

@TomAugspurger
Contributor Author

TomAugspurger commented Jan 1, 2020 via email

@TomAugspurger
Contributor Author

Just the Windows s3 failures, which I think #35856 is tracking.

@TomAugspurger
Contributor Author

CI is green now.

Contributor

@jreback jreback left a comment

nice updates

a few inline questions

  • can u ensure that this can survive a round trip pickle (flags)
  • worth separating out index and columns flags for duplicates? eg i almost always care about column duplicates but may want to allow row dupes

doc/source/user_guide/duplicates.rst (outdated, resolved)
pandas/core/flags.py (resolved)
pandas/core/frame.py (resolved)
Contributor

@jreback jreback left a comment

a couple of inline comments

can u ensure this survives a round trip pickle (flags)

  • worth having separate flags for index and columns ? eg disallow column but allow for rows i would say is common

the reason to do this now is that it would be very hard to change later

@jreback
Contributor

jreback commented Aug 26, 2020

i think my comments got duped as internet not great

@TomAugspurger
Contributor Author

worth separating out index and columns flags for duplicates? eg i almost always care about column duplicates but may want to allow row dupes

Possibly. The current system is future-compatible with that though, allows_duplicate_labels="index" / "columns"

@TomAugspurger
Contributor Author

Added the pickle test.

@TomAugspurger
Contributor Author

TomAugspurger commented Aug 31, 2020

All green with the pickle test now.

I've also added tests that assert_eq correctly compares .flags. I think we would (by default) want tm.assert_frame_equal(a, b) to fail when the flags don't match.
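
A flag-aware equality assertion can be sketched as follows. This is a toy model under stated assumptions: `Flagged` and `assert_flagged_equal` are hypothetical, and the `check_flags` switch is an illustrative parameter of this sketch rather than a statement about the pandas testing API.

```python
class Flagged:
    """Hypothetical container: data plus a dictionary of flags."""

    def __init__(self, data, allows_duplicate_labels=True):
        self.data = list(data)
        self.flags = {"allows_duplicate_labels": allows_duplicate_labels}


def assert_flagged_equal(left, right, check_flags=True):
    """Raise AssertionError if data (and, by default, flags) differ."""
    if left.data != right.data:
        raise AssertionError("data differ")
    if check_flags and left.flags != right.flags:
        raise AssertionError(f"flags differ: {left.flags} != {right.flags}")


a = Flagged([1, 2], allows_duplicate_labels=False)
b = Flagged([1, 2], allows_duplicate_labels=True)

assert_flagged_equal(a, b, check_flags=False)  # passes: only data compared

try:
    assert_flagged_equal(a, b)  # default: flags are part of equality
    raised = False
except AssertionError:
    raised = True
```

Making the strict comparison the default means two objects with identical data but different duplicate-label policies are not silently treated as interchangeable in tests.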

Contributor

@jreback jreback left a comment

couple of minor comments and questions. looks good

doc/source/user_guide/duplicates.rst (outdated, resolved)
doc/source/user_guide/duplicates.rst (resolved)
doc/source/whatsnew/v1.2.0.rst (resolved)
pandas/core/generic.py (resolved)
pandas/core/generic.py (resolved)
pandas/core/indexes/base.py (resolved)
pandas/core/indexes/base.py (resolved)
@@ -483,6 +483,26 @@ def _simple_new(cls, values, name: Label = None):
    def _constructor(self):
        return type(self)

    def _maybe_check_unique(self):
        if not self.is_unique:
            # TODO: position, value, not too large.
Contributor

yeah maybe an arg to show the first 10 duplicates

Contributor Author

It's truncated, following our usual repr's settings. I'll remove the todo

In [3]: s = pd.Series(1, index=[0] * 200).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-3-ece63ed32110> in <module>
----> 1 s = pd.Series(1, index=[0] * 200).set_flags(allows_duplicate_labels=False)

~/sandbox/pandas/pandas/core/generic.py in set_flags(self, copy, allows_duplicate_labels)
    349         df = self.copy(deep=copy)
    350         if allows_duplicate_labels is not None:
--> 351             df.flags["allows_duplicate_labels"] = allows_duplicate_labels
    352         return df
    353

~/sandbox/pandas/pandas/core/flags.py in __setitem__(self, key, value)
    103         if key not in self._keys:
    104             raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 105         setattr(self, key, value)
    106
    107     def __repr__(self):

~/sandbox/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
     90         if not value:
     91             for ax in obj.axes:
---> 92                 ax._maybe_check_unique()
     93
     94         self._allows_duplicate_labels = value

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    491             msg += "\n{}".format(duplicates)
    492
--> 493             raise DuplicateLabelError(msg)
    494
    495     def _format_duplicate_message(self):

DuplicateLabelError: Index has duplicates.
                                               positions
label
0      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

In [10]: s = pd.Series(1, index=np.arange(100).repeat(10)).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-10-452fca37a730> in <module>
----> 1 s = pd.Series(1, index=np.arange(100).repeat(10)).set_flags(allows_duplicate_labels=False)

~/sandbox/pandas/pandas/core/generic.py in set_flags(self, copy, allows_duplicate_labels)
    349         df = self.copy(deep=copy)
    350         if allows_duplicate_labels is not None:
--> 351             df.flags["allows_duplicate_labels"] = allows_duplicate_labels
    352         return df
    353

~/sandbox/pandas/pandas/core/flags.py in __setitem__(self, key, value)
    103         if key not in self._keys:
    104             raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 105         setattr(self, key, value)
    106
    107     def __repr__(self):

~/sandbox/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
     90         if not value:
     91             for ax in obj.axes:
---> 92                 ax._maybe_check_unique()
     93
     94         self._allows_duplicate_labels = value

~/sandbox/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    491             msg += "\n{}".format(duplicates)
    492
--> 493             raise DuplicateLabelError(msg)
    494
    495     def _format_duplicate_message(self):

DuplicateLabelError: Index has duplicates.
                                               positions
label
0                         [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1               [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2               [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3               [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4               [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
...                                                  ...
95     [950, 951, 952, 953, 954, 955, 956, 957, 958, ...
96     [960, 961, 962, 963, 964, 965, 966, 967, 968, ...
97     [970, 971, 972, 973, 974, 975, 976, 977, 978, ...
98     [980, 981, 982, 983, 984, 985, 986, 987, 988, ...
99     [990, 991, 992, 993, 994, 995, 996, 997, 998, ...

[100 rows x 1 columns]
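
The label-to-positions report shown in the two tracebacks above can be approximated in plain Python. This is a sketch of the idea only: `duplicate_positions` and `format_duplicates` are hypothetical helpers, and the output layout is an invented simplification rather than pandas' truncated repr.

```python
from collections import defaultdict


def duplicate_positions(labels):
    """Map each duplicated label to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, label in enumerate(labels):
        positions[label].append(pos)
    # Keep only labels seen more than once.
    return {label: where for label, where in positions.items() if len(where) > 1}


def format_duplicates(labels, max_rows=10):
    """Render a truncated report (hypothetical layout, not pandas' repr)."""
    dupes = duplicate_positions(labels)
    lines = ["Index has duplicates."]
    for shown, (label, where) in enumerate(dupes.items()):
        if shown == max_rows:
            # Stop after max_rows entries and mark the truncation.
            lines.append("...")
            break
        lines.append(f"{label!r}: {where}")
    return "\n".join(lines)
```

Calling `format_duplicates([0, 0, 1])` yields a two-line message naming label `0` and positions `[0, 1]`; capping the number of rows keeps the error readable when many labels repeat.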

Contributor

@jreback jreback left a comment

lgtm. thanks @TomAugspurger

@jreback jreback merged commit 76eb314 into pandas-dev:master Sep 3, 2020
@TomAugspurger
Contributor Author

TomAugspurger commented Sep 3, 2020 via email

Labels
API Design metadata _metadata, .attrs
8 participants