
Make PUDL compatible with Pandas 1.5 #1901

Closed · 9 tasks done
zaneselvans opened this issue Sep 3, 2022 · 3 comments · Fixed by #1902
zaneselvans (Member) commented Sep 3, 2022

Pandas has put out a release candidate for v1.5.0. I tried running our tests with the RC installed, and there are some failures we need to address before we can move to the new version, which definitely has some features we'll appreciate!

Unit Test Failures

import pandas as pd
from pudl.metadata.classes import Resource

fields = [{'name': 'report_year', 'type': 'year'}]

resource = Resource(**{
    'name': 'table',
    'harvest': {'harvest': True},
    'schema': {'fields': fields, 'primary_key': ['report_year']},
})

df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']})
actual = resource.format_df(df)
expected = pd.DataFrame({"report_year": ["2000-01-01", "2000-01-01"]}, dtype="datetime64[ns]")

pd.testing.assert_frame_equal(actual, expected)

actual:

report_year
0 2000-02-02
1 2000-03-03

expected:

report_year
0 2000-01-01
1 2000-01-01
  • test/unit/harvest_test.py::test_eia_example[resource2] fails because the dataframes are not equal. Maybe the same issue as above? Here they have different numbers of columns, though, so it's not just a cosmetic difference (see the diagnostic sketch below).
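
A quick way to check whether the column mismatch is the same year-truncation problem or something separate (a diagnostic sketch using made-up frames standing in for the test output, not the real harvest data):

import pandas as pd

# Stand-ins for the frames the failing test compares:
actual = pd.DataFrame({"report_year": ["2000-02-02"], "extra_col": [1]})
expected = pd.DataFrame({"report_year": ["2000-01-01"]})

# Columns present in one frame but not the other:
print(actual.columns.symmetric_difference(expected.columns))

# Cell-level diffs in the shared columns. compare() requires identically
# labeled frames, so restrict both to the common columns first:
common = actual.columns.intersection(expected.columns)
print(actual[common].compare(expected[common]))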

Integration Test Failures

  • test/integration/output_test.py::test_ferc714_etl fails with RecursionError: maximum recursion depth exceeded while calling a Python object. Stack trace in the details here:
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:594: in georef_counties
    self.fipsify(update=update), census_gdf=census_counties
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:543: in fipsify
    categorized = self.categorize(update=update)
.env_tox/lib/python3.10/site-packages/pudl/output/ferc714.py:433: in categorize
    rids_ferc714 = self.pudl_out.respondent_id_ferc714()
.env_tox/lib/python3.10/site-packages/pudl/output/pudltabl.py:489: in respondent_id_ferc714
    self.etl_ferc714(update=update)
.env_tox/lib/python3.10/site-packages/pudl/output/pudltabl.py:482: in etl_ferc714
    ferc714_tfr_dfs = pudl.transform.ferc714.transform(
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:604: in transform
    tfr_dfs = tfr_funcs[table](tfr_dfs)
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:361: in respondent_id
    tfr_dfs["respondent_id_ferc714"].assign(
.env_tox/lib/python3.10/site-packages/pandas/core/frame.py:4879: in assign
    data[k] = com.apply_if_callable(v, data)
.env_tox/lib/python3.10/site-packages/pandas/core/common.py:364: in apply_if_callable
    return maybe_callable(obj, **kwargs)
.env_tox/lib/python3.10/site-packages/pudl/transform/ferc714.py:363: in <lambda>
    eia_code=lambda x: x.eia_code.replace(to_replace=0, value=pd.NA),
.env_tox/lib/python3.10/site-packages/pandas/util/_decorators.py:317: in wrapper
    return func(*args, **kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/series.py:5380: in replace
    return super().replace(
.env_tox/lib/python3.10/site-packages/pandas/util/_decorators.py:317: in wrapper
    return func(*args, **kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/generic.py:7251: in replace
    new_data = self._mgr.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/managers.py:468: in replace
    return self.apply(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/managers.py:348: in apply
    applied = getattr(b, f)(**kwargs)
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
    return blk.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
    return blk.replace(
.env_tox/lib/python3.10/site-packages/pandas/core/internals/blocks.py:613: in replace
...
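
The trace shows Block.replace() calling itself until the recursion limit is hit, triggered by replace(to_replace=0, value=pd.NA). One way to sidestep it entirely (a sketch assuming eia_code is a nullable integer column; not necessarily the fix that lands in #1902) is to mask the zeros rather than replace them:

import pandas as pd

# Stand-in for the eia_code column:
s = pd.Series([0, 12345, 0, 67890], dtype="Int64")

# mask() sets the matching values to <NA> without going through
# Block.replace(), so the recursion never starts:
fixed = s.mask(s == 0)
print(fixed)

In the assign() call above, that would look like eia_code=lambda x: x.eia_code.mask(x.eia_code == 0).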

Integration Test Warnings

  • pudl/transform/eia.py:1009: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead. for row in too_many_codes.iteritems():
  • pudl/analysis/allocate_net_gen.py:448: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  • pudl/helpers.py:1389: FutureWarning: In a future version, df.iloc[:, i] = newvals will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either df[df.columns[i]] = newvals or, if columns are non-unique, df.isetitem(i, newvals). (@cmgosnell this is in dedupe_on_category() and I have no idea what the function is supposed to do based on the docstring, but it looks like something related to plant_parts_eia.)
  • pudl/transform/ferc714.py:332: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
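
For reference, the drop-in replacements the warnings themselves suggest look roughly like this (a sketch on a toy frame, not the actual PUDL code):

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3], "note": ["x", "y", "z"]})

# 1. iteritems() is going away; items() is the direct replacement:
for idx, val in df["value"].items():
    print(idx, val)

# 2. Pass numeric_only explicitly so groupby sums keep behaving the same
#    way once the default flips to False:
sums = df.groupby("group").sum(numeric_only=True)

# 3. Assign by column label instead of df.iloc[:, i] = newvals:
df[df.columns[1]] = [10, 20, 30]

# 4. Pass group_keys explicitly to pin down the index behavior of
#    transform-like applies:
out = df.groupby("group", group_keys=False)["value"].apply(lambda s: s * 2)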
zaneselvans self-assigned this Sep 3, 2022
zaneselvans added the dependencies label Sep 3, 2022
zaneselvans linked a pull request Sep 3, 2022 that will close this issue
zaneselvans (Member Author) commented:

The date column formatting issue seems to stem from changes in how pandas handles type casting with the NumPy datetime types, and it might not be intentional. I opened an issue: pandas-dev/pandas#48574

A minimal reproduction of the behavior:

import pandas as pd
from pandas.testing import assert_series_equal
print(f"{pd.__version__=}")

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# Passes on pandas 1.4.4
# Fails on pandas 1.5.0rc0
# Fails on pandas 1.6.0.dev0+136.gbbf17ea692
assert_series_equal(
    ser.astype("datetime64[Y]"),
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]")
)
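
If we want to stop depending on the astype("datetime64[Y]") behavior altogether, round-tripping through a yearly Period gets the same truncation (a sketch of a possible workaround, not necessarily what we end up doing):

import pandas as pd

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# to_period("Y") drops everything below the year; to_timestamp() then
# lands every value on January 1 of its year:
truncated = ser.dt.to_period("Y").dt.to_timestamp()
print(truncated)  # both rows become 2000-01-01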

zaneselvans (Member Author) commented:

What on earth was I thinking here? I didn't even realize you could groupby the values in another series. And how does the result of a groupby on a single column of a dataframe result in the whole dataframe being handed back?

def _standardize_offset_codes(df, offset_fixes):
    """Convert to standardized UTC offset abbreviations.

    This function ensures that all of the 3-4 letter abbreviations used to indicate a
    timestamp's localized offset from UTC are standardized, so that they can be used to
    make the timestamps timezone aware. The standard abbreviations we're using are:

    "HST": Hawaii Standard Time
    "AKST": Alaska Standard Time
    "AKDT": Alaska Daylight Time
    "PST": Pacific Standard Time
    "PDT": Pacific Daylight Time
    "MST": Mountain Standard Time
    "MDT": Mountain Daylight Time
    "CST": Central Standard Time
    "CDT": Central Daylight Time
    "EST": Eastern Standard Time
    "EDT": Eastern Daylight Time

    In some cases different respondents use the same non-standard abbreviations to
    indicate different offsets, and so the fixes are applied on a per-respondent basis,
    as defined by offset_fixes.

    Args:
        df (pandas.DataFrame): A DataFrame containing a utc_offset_code column
            that needs to be standardized.
        offset_fixes (dict): A dictionary with respondent_id_ferc714 values as the
            keys, and a dictionary mapping non-standard UTC offset codes to
            the standardized UTC offset codes as the value.

    Returns:
        pandas.Series: The standardized UTC offset codes.
    """
    logger.debug("Standardizing UTC offset codes.")
    # Treat empty string as missing
    is_blank = df["utc_offset_code"] == ""
    code = df["utc_offset_code"].mask(is_blank)
    # Apply specific fixes on a per-respondent basis:
    df = code.groupby(df["respondent_id_ferc714"]).apply(
        lambda x: x.replace(offset_fixes[x.name]) if x.name in offset_fixes else x
    )
    return df

Oh wait, it doesn't really. Here's how it's being used:

    # Clean UTC offset codes
    df["utc_offset_code"] = df["utc_offset_code"].str.strip().str.upper()
    df["utc_offset_code"] = df.pipe(_standardize_offset_codes, OFFSET_CODE_FIXES)
    # NOTE: Assumes constant timezone for entire year

Where OFFSET_CODE_FIXES is something that looks like...

import numpy as np

OFFSET_CODE_FIXES = {
    264: {"CDS": "CDT"},
    271: {"EDS": "EDT"},
    275: {"CPT": "CST"},
    277: {
        "CPT": "CST",
        np.nan: "CST",
    },
    281: {"CEN": "CST"},
    288: {np.nan: "EST"},
    293: {np.nan: "MST"},
    294: {np.nan: "EST"},
    296: {"CPT": "CST"},
    297: {"CPT": "CST"},
}
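
For anyone else puzzled by the pattern, here's a toy demonstration of grouping one Series by the values of another (made-up data; the two Series only need to share an index):

import pandas as pd

codes = pd.Series(["CDS", "CPT", "EDS", "PST"])
respondents = pd.Series([264, 275, 271, 999])
fixes = {264: {"CDS": "CDT"}, 271: {"EDS": "EDT"}, 275: {"CPT": "CST"}}

# Each group is the slice of `codes` whose aligned respondent id equals
# x.name, so the per-respondent replace() only touches that slice:
fixed = codes.groupby(respondents).apply(
    lambda x: x.replace(fixes[x.name]) if x.name in fixes else x
)
print(fixed)  # the per-respondent fixes are applied; PST is left alone

Because each group returns a like-indexed Series, this apply() is transform-like, which is presumably what triggers the group-keys FutureWarning above; passing group_keys=False to groupby() pins down the old behavior.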

zaneselvans (Member Author) commented:

I think all of these have been addressed. The CI is still failing because of the geopandas issue, but I think it's been fixed on the pandas main branch. It's just not in 1.5.0rc0.
