
Make PUDL compatible with Pandas 1.5 #1901

Closed · 9 tasks done
zaneselvans opened this issue Sep 3, 2022 · 3 comments · Fixed by #1902
zaneselvans (Member) commented Sep 3, 2022

Pandas has put out a release candidate for v1.5.0. I tried running our tests with the RC installed, and there are some failures we need to address before we can move to the new version, which definitely has some features we'll appreciate!

Unit Test Failures

import pandas as pd
from pudl.metadata.classes import Resource

fields = [{'name': 'report_year', 'type': 'year'}]

resource = Resource(**{
    'name': 'table',
    'harvest': {'harvest': True},
    'schema': {'fields': fields, 'primary_key': ['report_year']},
})

df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']})
actual = resource.format_df(df)
expected = pd.DataFrame({"report_year": ["2000-01-01", "2000-01-01"]}, dtype="datetime64[ns]")

pd.testing.assert_frame_equal(actual, expected)

actual:

report_year
0 2000-02-02
1 2000-03-03

expected:

report_year
0 2000-01-01
1 2000-01-01
  • test/unit/harvest_test.py::test_eia_example[resource2] fails because the dataframes are not equal. Maybe the same issue as above? Here they have different numbers of columns, though, so it's not just a cosmetic difference (see the diagnostic sketch below).
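
A quick way to check whether the column mismatch is the same year-truncation problem or something separate (a diagnostic sketch using made-up frames standing in for the test output, not the real harvest data):

import pandas as pd

# Stand-ins for the frames the failing test compares:
actual = pd.DataFrame({"report_year": ["2000-02-02"], "extra_col": [1]})
expected = pd.DataFrame({"report_year": ["2000-01-01"]})

# Columns present in one frame but not the other:
print(actual.columns.symmetric_difference(expected.columns))

# Cell-level diffs in the shared columns. compare() requires identically
# labeled frames, so restrict both to the common columns first:
common = actual.columns.intersection(expected.columns)
print(actual[common].compare(expected[common]))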

Integration Test Failures

  • test/integration/output_test.py::test_ferc714_etl fails with RecursionError: maximum recursion depth exceeded while calling a Python object. Stack trace in the details here:
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:594: in georef_counties
    self.fipsify(update=update), census_gdf=census_counties
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:543: in fipsify
    categorized = self.categorize(update=update)
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:433: in categorize
    rids_ferc714 = self.pudl_out.respondent_id_ferc714()
.env_tox/lib/python3.10/site-packages/pudl/output/pudltabl.py:489: in respondent_id_ferc714
    self.etl_ferc714(update=update)
.env_tox/lib/python3.10/site-packages/pudl/output/pudltabl.py:482: in etl_ferc714
    ferc714_tfr_dfs = pudl.transform.ferc714.transform(
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:604: in transform
    tfr_dfs = tfr_funcs[table](tfr_dfs)
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:361: in respondent_id
    tfr_dfs["respondent_id_ferc714"].assign(
.env_tox/lib/python3.10/site-packages/pandas/core/frame.py:4879: in assign
    data[k] = com.apply_if_callable(v, data)
.env_tox/lib/python3.10/site-packages/pandas/core/common.py:364: in apply_if_callable
    return maybe_callable(obj, **kwargs)
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:363: in <lambda>
    eia_code=lambda x: x.eia_code.replace(to_replace=0, value=pd.NA),
.env_tox/lib/python3.10/site-packages/pandas/util/_decorators.py:317: in wrapper
    return func(*args, **kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/series.py:5380: in replace
    return super().replace(
.env_tox/lib/python3.10/site-packages/pandas/util/_decorators.py:317: in wrapper
    return func(*args, **kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/generic.py:7251: in replace
    new_data = self._mgr.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/managers.py:468: in replace
    return self.apply(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/managers.py:348: in apply
    applied = getattr(b, f)(**kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
    return blk.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
    return blk.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
...
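
The trace shows Block.replace() calling itself until the recursion limit is hit, triggered by replace(to_replace=0, value=pd.NA). One way to sidestep it entirely (a sketch assuming eia_code is a nullable integer column; not necessarily the fix that lands in #1902) is to mask the zeros rather than replace them:

import pandas as pd

# Stand-in for the eia_code column:
s = pd.Series([0, 12345, 0, 67890], dtype="Int64")

# mask() sets the matching values to <NA> without going through
# Block.replace(), so the recursion never starts:
fixed = s.mask(s == 0)
print(fixed)

In the assign() call above, that would look like eia_code=lambda x: x.eia_code.mask(x.eia_code == 0).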

Integration Test Warnings

  • pudl/transform/eia.py:1009: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead. for row in too_many_codes.iteritems():
  • pudl/analysis/allocate_net_gen.py:448: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  • pudl/helpers.py:1389: FutureWarning: In a future version, df.iloc[:, i] = newvals will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either df[df.columns[i]] = newvals or, if columns are non-unique, df.isetitem(i, newvals). (@cmgosnell this is in dedupe_on_category() and I have no idea what the function is supposed to do based on the docstring, but it looks like something related to plant_parts_eia.)
  • pudl/transform/ferc714.py:332: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
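
For reference, the drop-in replacements the warnings themselves suggest look roughly like this (a sketch on a toy frame, not the actual PUDL code):

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3], "note": ["x", "y", "z"]})

# 1. iteritems() is going away; items() is the direct replacement:
for idx, val in df["value"].items():
    print(idx, val)

# 2. Pass numeric_only explicitly so groupby sums keep behaving the same
#    way once the default flips to False:
sums = df.groupby("group").sum(numeric_only=True)

# 3. Assign by column label instead of df.iloc[:, i] = newvals:
df[df.columns[1]] = [10, 20, 30]

# 4. Pass group_keys explicitly to pin down the index behavior of
#    transform-like applies:
out = df.groupby("group", group_keys=False)["value"].apply(lambda s: s * 2)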
zaneselvans self-assigned this Sep 3, 2022
zaneselvans added the dependencies label Sep 3, 2022
zaneselvans linked a pull request Sep 3, 2022 that will close this issue
zaneselvans (Member Author) commented:

The date column formatting issue seems to stem from changes in how pandas handles type casting with the NumPy datetime types, and it might not be intentional. I opened an issue: pandas-dev/pandas#48574

A minimal reproduction of the behavior:

import pandas as pd
from pandas.testing import assert_series_equal
print(f"{pd.__version__=}")

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# Passes on pandas 1.4.4
# Fails on pandas 1.5.0rc0
# Fails on pandas 1.6.0.dev0+136.gbbf17ea692
assert_series_equal(
    ser.astype("datetime64[Y]"),
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]")
)
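
If we want to stop depending on the astype("datetime64[Y]") behavior altogether, round-tripping through a yearly Period gets the same truncation (a sketch of a possible workaround, not necessarily what we end up doing):

import pandas as pd

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# to_period("Y") drops everything below the year; to_timestamp() then
# lands every value on January 1 of its year:
truncated = ser.dt.to_period("Y").dt.to_timestamp()
print(truncated)  # both rows become 2000-01-01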

zaneselvans (Member Author) commented:

What on earth was I thinking here? I didn't even realize you could groupby the values in another series. And how does the result of a groupby on a single column of a dataframe result in the whole dataframe being handed back?

def _standardize_offset_codes(df, offset_fixes):
    """Convert to standardized UTC offset abbreviations.

    This function ensures that all of the 3-4 letter abbreviations used to indicate a
    timestamp's localized offset from UTC are standardized, so that they can be used to
    make the timestamps timezone aware. The standard abbreviations we're using are:

    "HST": Hawaii Standard Time
    "AKST": Alaska Standard Time
    "AKDT": Alaska Daylight Time
    "PST": Pacific Standard Time
    "PDT": Pacific Daylight Time
    "MST": Mountain Standard Time
    "MDT": Mountain Daylight Time
    "CST": Central Standard Time
    "CDT": Central Daylight Time
    "EST": Eastern Standard Time
    "EDT": Eastern Daylight Time

    In some cases different respondents use the same non-standard abbreviations to
    indicate different offsets, and so the fixes are applied on a per-respondent basis,
    as defined by offset_fixes.

    Args:
        df (pandas.DataFrame): A DataFrame containing a utc_offset_code column
            that needs to be standardized.
        offset_fixes (dict): A dictionary with respondent_id_ferc714 values as the
            keys, and a dictionary mapping non-standard UTC offset codes to
            the standardized UTC offset codes as the value.

    Returns:
        pandas.Series: The standardized UTC offset codes.
    """
    logger.debug("Standardizing UTC offset codes.")
    # Treat empty string as missing
    is_blank = df["utc_offset_code"] == ""
    code = df["utc_offset_code"].mask(is_blank)
    # Apply specific fixes on a per-respondent basis:
    df = code.groupby(df["respondent_id_ferc714"]).apply(
        lambda x: x.replace(offset_fixes[x.name]) if x.name in offset_fixes else x
    )
    return df

Oh wait, it doesn't really. Here's how it's being used:

    # Clean UTC offset codes
    df["utc_offset_code"] = df["utc_offset_code"].str.strip().str.upper()
    df["utc_offset_code"] = df.pipe(_standardize_offset_codes, OFFSET_CODE_FIXES)
    # NOTE: Assumes constant timezone for entire year

Where OFFSET_CODE_FIXES is something that looks like...

import numpy as np

OFFSET_CODE_FIXES = {
    264: {"CDS": "CDT"},
    271: {"EDS": "EDT"},
    275: {"CPT": "CST"},
    277: {
        "CPT": "CST",
        np.nan: "CST",
    },
    281: {"CEN": "CST"},
    288: {np.nan: "EST"},
    293: {np.nan: "MST"},
    294: {np.nan: "EST"},
    296: {"CPT": "CST"},
    297: {"CPT": "CST"},
}
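
For anyone else puzzled by the pattern, here's a toy demonstration of grouping one Series by the values of another (made-up data; the two Series only need to share an index):

import pandas as pd

codes = pd.Series(["CDS", "CPT", "EDS", "PST"])
respondents = pd.Series([264, 275, 271, 999])
fixes = {264: {"CDS": "CDT"}, 271: {"EDS": "EDT"}, 275: {"CPT": "CST"}}

# Each group is the slice of `codes` whose aligned respondent id equals
# x.name, so the per-respondent replace() only touches that slice:
fixed = codes.groupby(respondents).apply(
    lambda x: x.replace(fixes[x.name]) if x.name in fixes else x
)
print(fixed)  # the per-respondent fixes are applied; PST is left alone

Because each group returns a like-indexed Series, this apply() is transform-like, which is presumably what triggers the group-keys FutureWarning above; passing group_keys=False to groupby() pins down the old behavior.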

zaneselvans (Member Author) commented:

I think all of these have been addressed. The CI is still failing because of the geopandas issue, but I think it's been fixed on the pandas main branch. It's just not in 1.5.0rc0.
