Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New release failing on pandas master #297

Closed
jorisvandenbossche opened this issue Jan 28, 2018 · 9 comments
Closed

New release failing on pandas master #297

jorisvandenbossche opened this issue Jan 28, 2018 · 9 comments

Comments

@jorisvandenbossche
Copy link
Member

The new release breaks some of our tests. It seems that we adapted the signature of make_block_same_class (in pandas-dev/pandas#19265). Maybe we should also fix this temporarily in pandas by putting it back

=================================== FAILURES ===================================
___________________ TestParquetFastParquet.test_datetime_tz ____________________
[gw0] linux -- Python 3.6.4 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fce4e4d7940>
fp = 'fastparquet'
    def test_datetime_tz(self, fp):
        # doesn't preserve tz
        df = pd.DataFrame({'a': pd.date_range('20130101', periods=3,
                                              tz='US/Eastern')})
    
        # warns on the coercion
        with catch_warnings(record=True):
>           check_round_trip(df, fp, expected=df.astype('datetime64[ns]'))
pandas/tests/io/test_parquet.py:478: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:158: in check_round_trip
    compare(repeat)
pandas/tests/io/test_parquet.py:152: in compare
    actual = read_parquet(path, **read_kwargs)
pandas/io/parquet.py:278: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
pandas/io/parquet.py:226: in read
    return parquet_file.to_pandas(columns=columns, **kwargs)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:388: in to_pandas
    df, views = self.pre_allocate(size, columns, categories, index)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:418: in pre_allocate
    self._dtypes(categories), tz)
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/api.py:509: in _pre_allocate
    index_type=index_type, cats=cats, timezones=tz)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
types = [dtype('<M8[ns]')], size = 3, cats = {}, cols = ['a'], index_type = None
index_name = None, timezones = {'a': 'US/Eastern'}
    def empty(types, size, cats=None, cols=None, index_type=None, index_name=None,
              timezones=None):
        """
        Create empty DataFrame to assign into
    
        Parameters
        ----------
        types: like np record structure, 'i4,u2,f4,f2,f4,M8,m8', or using tuples
            applies to non-categorical columns. If there are only categorical
            columns, an empty string of None will do.
        size: int
            Number of rows to allocate
        cats: dict {col: labels}
            Location and labels for categorical columns, e.g., {1: ['mary', 'mo]}
            will create column index 1 (inserted amongst the numerical columns)
            with two possible values. If labels is an integers, `{'col': 5}`,
            will generate temporary labels using range. If None, or column name
            is missing, will assume 16-bit integers (a reasonable default).
        cols: list of labels
            assigned column names, including categorical ones.
        timezones: dict {col: timezone_str}
            for timestamp type columns, apply this timezone to the pandas series;
            the numpy view will be UTC.
    
        Returns
        -------
        - dataframe with correct shape and data-types
        - list of numpy views, in order, of the columns of the dataframe. Assign
            to this.
        """
        df = DataFrame()
        views = {}
        timezones = timezones or {}
    
        cols = cols if cols is not None else range(cols)
        if isinstance(types, STR_TYPE):
            types = types.split(',')
        for t, col in zip(types, cols):
            if str(t) == 'category':
                if cats is None or col not in cats:
                    df[str(col)] = Categorical(
                            [], categories=RangeIndex(0, 2**14),
                            fastpath=True)
                elif isinstance(cats[col], int):
                    df[str(col)] = Categorical(
                            [], categories=RangeIndex(0, cats[col]),
                            fastpath=True)
                else:  # explicit labels list
                    df[str(col)] = Categorical([], categories=cats[col],
                                               fastpath=True)
            else:
                df[str(col)] = np.empty(0, dtype=t)
                if df[str(col)].dtype.kind == "M" and str(col) in timezones:
                    df[str(col)] = df[str(col)].dt.tz_localize(timezones[str(col)])
    
        if index_type is not None and index_type is not False:
            if index_name is None:
                raise ValueError('If using an index, must give an index name')
            if str(index_type) == 'category':
                if cats is None or index_name not in cats:
                    c = Categorical(
                            [], categories=RangeIndex(0, 2**14),
                            fastpath=True)
                elif isinstance(cats[index_name], int):
                    c = Categorical(
                            [], categories=RangeIndex(0, cats[index_name]),
                            fastpath=True)
                else:  # explicit labels list
                    c = Categorical([], categories=cats[index_name],
                                    fastpath=True)
                print(cats, index_name, c)
                vals = np.empty(size, dtype=c.codes.dtype)
                index = CategoricalIndex(c)
                index._data._codes = vals
                views[index_name] = vals
            else:
                index = Index(np.empty(size, dtype=index_type))
                views[index_name] = index.values
    
            axes = [df._data.axes[0], index]
        else:
            axes = [df._data.axes[0], RangeIndex(size)]
    
        # allocate and create blocks
        blocks = []
        for block in df._data.blocks:
            if block.is_categorical:
                categories = block.values.categories
                code = np.zeros(shape=size, dtype=block.values.codes.dtype)
                values = Categorical(values=code, categories=categories,
                                     fastpath=True)
                new_block = block.make_block_same_class(values=values)
            elif getattr(block.dtype, 'tz', None):
                new_shape = (size, )
                values = np.empty(shape=new_shape, dtype=block.values.values.dtype)
                new_block = block.make_block_same_class(
>                       values=values, dtype=block.values.dtype)
E               TypeError: make_block_same_class() got an unexpected keyword argument 'dtype'
../../../miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/dataframe.py:105: TypeError
@jorisvandenbossche
Copy link
Member Author

cc @jbrockmendel

@martindurant
Copy link
Member

Is it reasonable to have DataFrame.empty created in pandas, it belongs better there than in a parquet-specific location and your people probably know how to do this better than I do.

@jorisvandenbossche
Copy link
Member Author

Yes, I think that's a good idea anyway. Do you want to open an issue about that?

But for the actual issue, I am trying to understand what the code does. As it might be that the change in pandas is kind of a regression for this usecase. What is the reason you need to re-create the datetimetz block? As in the first loop when creating the columns you already do a tz_localize ?

@martindurant
Copy link
Member

The first dataframe contains blocks of zero size, and here they are recreated with the required size. I don't know of a different way to do this :|

@jorisvandenbossche
Copy link
Member Author

Ah, yes, I see. The make_block machinery is a bit complicated, but it seems the passed dtype is indeed necessary in this case.

@jorisvandenbossche
Copy link
Member Author

As far as I can see, this is a regression in make_block_same_class, so we should just fix this in pandas master.

@martindurant
Copy link
Member

Thanks for looking into it, @jorisvandenbossche
I'll raise an issue on pandas suggesting an empty function based on mine, but I don't think I know the code-base well enough to produce a rigorous PR.

@jorisvandenbossche
Copy link
Member Author

See comment of @jreback what should be used instead: pandas-dev/pandas#19431 (comment)

@jreback
Copy link

jreback commented Jan 29, 2018

this is going to be changed very soon
relying on internal APIs is not great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants