New release failing on pandas master #297

jorisvandenbossche · 2018-01-28T14:58:32Z

The new release breaks some of our tests. It seems that we adapted the signature of make_block_same_class (in pandas-dev/pandas#19265). Maybe we should also fix this temporarily in pandas by putting it back

=================================== FAILURES ===================================
___________________ TestParquetFastParquet.test_datetime_tz ____________________
[gw0] linux -- Python 3.6.4 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fce4e4d7940>
fp = 'fastparquet'
    def test_datetime_tz(self, fp):
        # doesn't preserve tz
        df = pd.DataFrame({'a': pd.date_range('20130101', periods=3,
                                              tz='US/Eastern')})
    
        # warns on the coercion
        with catch_warnings(record=True):
>           check_round_trip(df, fp, expected=df.astype('datetime64[ns]'))
pandas/tests/io/test_parquet.py:478: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:158: in check_round_trip
    compare(repeat)
pandas/tests/io/test_parquet.py:152: in compare
    actual = read_parquet(path, **read_kwargs)
pandas/io/parquet.py:278: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
pandas/io/parquet.py:226: in read
    return parquet_file.to_pandas(columns=columns, **kwargs)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:388: in to_pandas
    df, views = self.pre_allocate(size, columns, categories, index)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:418: in pre_allocate
    self._dtypes(categories), tz)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:509: in _pre_allocate
    index_type=index_type, cats=cats, timezones=tz)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
types = [dtype('<M8[ns]')], size = 3, cats = {}, cols = ['a'], index_type = None
index_name = None, timezones = {'a': 'US/Eastern'}
    def empty(types, size, cats=None, cols=None, index_type=None, index_name=None,
              timezones=None):
        """
        Create empty DataFrame to assign into
    
        Parameters
        ----------
        types: like np record structure, 'i4,u2,f4,f2,f4,M8,m8', or using tuples
            applies to non-categorical columns. If there are only categorical
            columns, an empty string of None will do.
        size: int
            Number of rows to allocate
        cats: dict {col: labels}
            Location and labels for categorical columns, e.g., {1: ['mary', 'mo]}
            will create column index 1 (inserted amongst the numerical columns)
            with two possible values. If labels is an integers, `{'col': 5}`,
            will generate temporary labels using range. If None, or column name
            is missing, will assume 16-bit integers (a reasonable default).
        cols: list of labels
            assigned column names, including categorical ones.
        timezones: dict {col: timezone_str}
            for timestamp type columns, apply this timezone to the pandas series;
            the numpy view will be UTC.
    
        Returns
        -------
        - dataframe with correct shape and data-types
        - list of numpy views, in order, of the columns of the dataframe. Assign
            to this.
        """
        df = DataFrame()
        views = {}
        timezones = timezones or {}
    
        cols = cols if cols is not None else range(cols)
        if isinstance(types, STR_TYPE):
            types = types.split(',')
        for t, col in zip(types, cols):
            if str(t) == 'category':
                if cats is None or col not in cats:
                    df[str(col)] = Categorical(
                            [], categories=RangeIndex(0, 2**14),
                            fastpath=True)
                elif isinstance(cats[col], int):
                    df[str(col)] = Categorical(
                            [], categories=RangeIndex(0, cats[col]),
                            fastpath=True)
                else:  # explicit labels list
                    df[str(col)] = Categorical([], categories=cats[col],
                                               fastpath=True)
            else:
                df[str(col)] = np.empty(0, dtype=t)
                if df[str(col)].dtype.kind == "M" and str(col) in timezones:
                    df[str(col)] = df[str(col)].dt.tz_localize(timezones[str(col)])
    
        if index_type is not None and index_type is not False:
            if index_name is None:
                raise ValueError('If using an index, must give an index name')
            if str(index_type) == 'category':
                if cats is None or index_name not in cats:
                    c = Categorical(
                            [], categories=RangeIndex(0, 2**14),
                            fastpath=True)
                elif isinstance(cats[index_name], int):
                    c = Categorical(
                            [], categories=RangeIndex(0, cats[index_name]),
                            fastpath=True)
                else:  # explicit labels list
                    c = Categorical([], categories=cats[index_name],
                                    fastpath=True)
                print(cats, index_name, c)
                vals = np.empty(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
                index._data._codes = vals
                views[index_name] = vals
            else:
                index = Index(np.empty(size, dtype=index_type))
                views[index_name] = index.values
    
            axes = [df._data.axes[0], index]
        else:
            axes = [df._data.axes[0], RangeIndex(size)]
    
        # allocate and create blocks
        blocks = []
        for block in df._data.blocks:
            if block.is_categorical:
                categories = block.values.categories
                code = np.zeros(shape=size, dtype=block.values.codes.dtype)
                values = Categorical(values=code, categories=categories,
                                     fastpath=True)
                new_block = block.make_block_same_class(values=values)
            elif getattr(block.dtype, 'tz', None):
                new_shape = (size, )
                values = np.empty(shape=new_shape, dtype=block.values.values.dtype)
                new_block = block.make_block_same_class(
>                       values=values, dtype=block.values.dtype)
E               TypeError: make_block_same_class() got an unexpected keyword argument 'dtype'
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/dataframe.py:105: TypeError

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-01-28T15:02:29Z

cc @jbrockmendel

martindurant · 2018-01-28T15:05:59Z

Is it reasonable to have DataFrame.empty created in pandas, it belongs better there than in a parquet-specific location and your people probably know how to do this better than I do.

jorisvandenbossche · 2018-01-28T15:13:09Z

Yes, I think that's a good idea anyway. Do you want to open an issue about that?

But for the actual issue, I am trying to understand what the code does. As it might be that the change in pandas is kind of a regression for this usecase. What is the reason you need to re-create the datetimetz block? As in the first loop when creating the columns you already do a tz_localize ?

martindurant · 2018-01-28T15:14:10Z

The first dataframe contains blocks of zero size, and here they are recreated with the required size. I don't know of a different way to do this :|

jorisvandenbossche · 2018-01-28T15:23:44Z

Ah, yes, I see. The make_block machinery is a bit complicated, but it seems the passed dtype is indeed necessary in this case.

jorisvandenbossche · 2018-01-28T15:25:19Z

As far as I can see, this is a regression in make_block_same_class, so we should just fix this in pandas master.

martindurant · 2018-01-28T16:43:37Z

Thanks for looking into it, @jorisvandenbossche
I'll raise an issue on pandas suggesting an empty function based on mine, but I don't think I know the code-base well enough to produce a rigorous PR.

jorisvandenbossche · 2018-01-29T10:20:54Z

See comment of @jreback what should be used instead: pandas-dev/pandas#19431 (comment)

jreback · 2018-01-29T10:23:06Z

this is going to be changed very soon
relying on internal APIs is not great

jorisvandenbossche mentioned this issue Jan 28, 2018

Regression in make_block_same_class (tests failing for new fastparquet release) pandas-dev/pandas#19431

Closed

jorisvandenbossche closed this as completed Jan 28, 2018

jorisvandenbossche reopened this Jan 29, 2018

jorisvandenbossche mentioned this issue Jan 29, 2018

[#19431] Regression in make_block_same_class (tests failing for new fastparquet release) pandas-dev/pandas#19434

Merged

4 tasks

martindurant closed this as completed Feb 9, 2018

jorisvandenbossche mentioned this issue Jul 2, 2019

Deprecated dtype argument in Block.make_block_same_class #442

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New release failing on pandas master #297

New release failing on pandas master #297

jorisvandenbossche commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 29, 2018

jreback commented Jan 29, 2018

New release failing on pandas master #297

New release failing on pandas master #297

Comments

jorisvandenbossche commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

martindurant commented Jan 28, 2018

jorisvandenbossche commented Jan 29, 2018

jreback commented Jan 29, 2018