Reduce usage of private pandas attributes #576

jsignell · 2021-03-23T20:13:24Z

When running the dask upstream tests (in dask/dask#7441, for example), you can see that fastparquet sets ._codes. This is not supported in the dev version of pandas. There is a ticket (pandas-dev/pandas#40580) to add a FutureWarning in pandas and deprecate this more intentionally, but fastparquet should find a way to construct Categoricals without using ._codes.

martindurant · 2021-03-23T20:21:01Z

I will need help from someone in pandas to satisfy this requirement. @jbrockmendel, this happens within empty, which you recently looked at.

            if str(t) == 'category':
                c = Categorical([], categories=cat(col), fastpath=True)
                vals = np.zeros(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
>               index._data._codes = vals
E               AttributeError: can't set attribute

The intent is to produce a categorical index where the set of category codes is a mutable array and the categories can be set without having to first parse a set of values.
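A minimal sketch of that intent using only public pandas API, under one loud assumption: the codes are filled in before the index is built, whereas the fastparquet code above fills them afterwards. The category labels, the size, and the code values here are hypothetical and only illustrate that the categories are supplied directly, without parsing a set of values.

```python
import numpy as np
import pandas as pd

# Hypothetical categories and row count, chosen only for illustration.
categories = ["low", "mid", "high"]
size = 5

# Preallocate a mutable array of integer codes (0 refers to categories[0]).
codes = np.zeros(size, dtype=np.int8)

# Fill in codes as they become known; no category values are parsed here.
codes[1] = 2  # "high"
codes[3] = 1  # "mid"

# Build the categorical from codes plus categories through the public API.
cat = pd.Categorical.from_codes(codes, categories=categories)
index = pd.CategoricalIndex(cat)

print(list(index))
```

Whether filling the codes first fits fastparquet's read path is not something this sketch settles; it only shows the public construction route.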

jbrockmendel · 2021-03-23T20:32:27Z

A couple of options:

- As mentioned in the OP, pandas#40580 will at least turn this into a warning instead of an exception.
- Set ._ndarray instead of ._codes (please don't).
- Let's go implement Categorical.empty right now.

jbrockmendel · 2021-03-23T21:24:03Z

Tentative Categorical.empty:

    @classmethod
    def empty(cls, shape: Shape, dtype: CategoricalDtype) -> Categorical:
        """
        Analogous to np.empty(shape, dtype=dtype)

        Parameters
        ----------
        shape : tuple[int]
        dtype : CategoricalDtype
        """
        arr = cls._from_sequence([], dtype=dtype)

        # We have to use np.zeros instead of np.empty otherwise the resulting
        #  ndarray may contain codes not supported by this dtype, in which
        #  case repr(result) could segfault.
        backing = np.zeros(shape, dtype=arr._ndarray.dtype)

        return arr._from_backing_data(backing)

jorisvandenbossche · 2021-03-24T07:43:40Z

About this specific code snippet:

                c = Categorical([], categories=cat(col), fastpath=True)
                vals = np.zeros(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
                index._data._codes = vals

that can also be written as:

    temp = Categorical([], categories=cat(col), fastpath=True)
    vals = np.zeros(size, dtype=temp.codes.dtype)
    c = Categorical(vals, dtype=temp.dtype, fastpath=True)
    index = CategoricalIndex(c)

without the use of private APIs. I think the only reason you need temp is to get the correct integer bit size for the codes. As long as that dtype is correct, pandas shouldn't copy the values, so setting into vals afterwards will still update the CategoricalIndex.
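The point about the integer bit size can be checked with a short sketch. It assumes `Categorical.from_codes` as the public constructor in place of the `fastpath` keyword used in the snippet above (that substitution is this sketch's choice, not something stated in the thread), and `codes_dtype` is a hypothetical helper. Whether the final index shares memory with `vals` is printed rather than assumed, since it may depend on the pandas version.

```python
import numpy as np
import pandas as pd


def codes_dtype(categories):
    """Hypothetical helper: the integer dtype pandas picks for these categories."""
    return pd.Categorical([], categories=categories).codes.dtype


# The width of the codes grows with the number of categories.
small = codes_dtype(list(range(3)))
large = codes_dtype(list(range(1000)))
print(small, large)

# Allocate codes with the matching dtype, then build the index from them.
size = 4
cats = ["a", "b", "c"]
vals = np.zeros(size, dtype=codes_dtype(cats))
c = pd.Categorical.from_codes(vals, categories=cats)
index = pd.CategoricalIndex(c)

# Report, rather than assume, whether pandas kept the same buffer.
shared = np.shares_memory(vals, c.codes)
print("codes share memory with vals:", shared)
```

If the reported value is False on a given pandas version, later writes into `vals` would not reach the index, so it seems worth checking before relying on in-place filling.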

martindurant · 2021-03-24T14:17:09Z

Thanks @jorisvandenbossche! So this involves no copies, right? I think for fastparquet's use, empty would be fine too, since we are guaranteed to fill in the codes with appropriate values; but I'm not sure it's enough faster to be worth bothering.

jbrockmendel · 2021-03-24T14:40:52Z

> So this involves no copies, right?

Correct. @jorisvandenbossche's snippet should have better compatibility with older pandas versions than the one I posted.

jbrockmendel mentioned this issue Mar 23, 2021

ENH: Categorical.empty pandas-dev/pandas#40602

Merged

martindurant closed this as completed Apr 19, 2021

jorisvandenbossche mentioned this issue Oct 21, 2021

How to handle upstream warnings in CI dask/dask#8278

Closed

jrbourbeau mentioned this issue Dec 19, 2022

Can't set Categorical._codes in pandas=2.0 #832

Closed
