Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce usage of private pandas attributes #576

Closed
jsignell opened this issue Mar 23, 2021 · 6 comments
Closed

Reduce usage of private pandas attributes #576

jsignell opened this issue Mar 23, 2021 · 6 comments

Comments

@jsignell
Copy link
Member

When running the dask upstream tests in dask/dask#7441 for example you can see that fastparquet sets _code. This is not supported in the dev version of pandas. There is a ticket (pandas-dev/pandas#40580) to add a FutureWarning in pandas and deprecate this more intentinonally, but fastparquet should find a way to not use ._code to construct Categoricals.

@martindurant
Copy link
Member

I will need help from someone in Pandas to appease the requirement. @jbrockmendel , this happens within empty, which you recently looked at.

            if str(t) == 'category':
                c = Categorical([], categories=cat(col), fastpath=True)
                vals = np.zeros(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
>               index._data._codes = vals
E               AttributeError: can't set attribute

The intent is to produce a categorical index where the set of category codes is a mutable array and the categories can be set without having to first parse a set of values.

@jbrockmendel
Copy link

a couple options:

  1. as mentioned in the OP, pandas#40580 will at least turn this into a warning instead of an exception
  2. set ._ndarray instead of ._codes (please dont)
  3. let's go implement Categorical.empty right now

@jbrockmendel
Copy link

tentative Categorical.empty

    @classmethod
    def empty(cls, shape: Shape, dtype: CategoricalDtype) -> Categorical:
        """
        Analogous to np.empty(shape, dtype=dtype)

        Parameters
        ----------
        shape : tuple[int]
        dtype : CategoricalDtype
        """
        arr = cls._from_sequence([], dtype=dtype)

        # We have to use np.zeros instead of np.empty otherwise the resulting
        #  ndarray may contain codes not supported by this dtype, in which
        #  case repr(result) could segfault.
        backing = np.zeros(shape, dtype=arr._ndarray.dtype)

        return arr._from_backing_data(backing)

@jorisvandenbossche
Copy link
Member

About this specific code snippet:

                c = Categorical([], categories=cat(col), fastpath=True)
                vals = np.zeros(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
                index._data._codes = vals

that can also be written as:

temp = Categorical([], categories=cat(col), fastpath=True)
vals = np.zeros(size, dtype=temp.codes.dtype)
c = Categorical(vals, dtype=temp.dtype, fastpath=True)
index = CategoricalIndex(c)

without the use of private APIs. And I think the only reason you need the temp is to get the correct integer bit size for the codes . As long as that dtype is correct, pandas shouldn't copy the values, and so afterwards setting into vals will still update the CategoricalIndex.

@martindurant
Copy link
Member

Thanks @jorisvandenbossche ! So this involves no copies, right? I think for fastparquet's use, empty would be fine too, since we are guaranteed to fill in the codes with appropriate values; but I'm not sure it's particularly faster to be worth bothering.

@jbrockmendel
Copy link

So this involves no copies, right?

Correct. @jorisvandenbossche 's snippet should have better compatibility with older pandas versions than the one I posted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants