Make PUDL compatible with Pandas 1.5 #1901
The date column formatting issue seems to stem from a change in how pandas handles casting Series to NumPy `datetime64` types. A minimal reproduction of the behavior:

```python
import pandas as pd
from pandas.testing import assert_series_equal

print(f"{pd.__version__=}")
ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")
# Passes on pandas 1.4.4
# Fails on pandas 1.5.0rc0
# Fails on pandas 1.6.0.dev0+136.gbbf17ea692
assert_series_equal(
    ser.astype("datetime64[Y]"),
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]"),
)
```
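One version-agnostic way to get the same year-truncation without relying on `astype("datetime64[Y]")` (a sketch of a possible workaround, not necessarily the one PUDL adopted) is to round-trip through a yearly `Period`:

```python
import pandas as pd
from pandas.testing import assert_series_equal

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# Truncate each timestamp to the start of its year. to_period("Y") keeps
# only the year, and to_timestamp() converts back to the period's start,
# avoiding the astype("datetime64[Y]") behavior that changed in 1.5.
year_start = ser.dt.to_period("Y").dt.to_timestamp()

assert_series_equal(
    year_start,
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]"),
)
```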
What on earth was I thinking here? I didn't even realize you could groupby the values in another series. And how does the result of a groupby on a single column of a dataframe result in the whole dataframe being handed back?

```python
def _standardize_offset_codes(df, offset_fixes):
    """Convert to standardized UTC offset abbreviations.

    This function ensures that all of the 3-4 letter abbreviations used to indicate a
    timestamp's localized offset from UTC are standardized, so that they can be used to
    make the timestamps timezone aware. The standard abbreviations we're using are:

    "HST": Hawaii Standard Time
    "AKST": Alaska Standard Time
    "AKDT": Alaska Daylight Time
    "PST": Pacific Standard Time
    "PDT": Pacific Daylight Time
    "MST": Mountain Standard Time
    "MDT": Mountain Daylight Time
    "CST": Central Standard Time
    "CDT": Central Daylight Time
    "EST": Eastern Standard Time
    "EDT": Eastern Daylight Time

    In some cases different respondents use the same non-standard abbreviations to
    indicate different offsets, and so the fixes are applied on a per-respondent basis,
    as defined by offset_fixes.

    Args:
        df (pandas.DataFrame): A DataFrame containing a utc_offset_code column
            that needs to be standardized.
        offset_fixes (dict): A dictionary with respondent_id_ferc714 values as the
            keys, and a dictionary mapping non-standard UTC offset codes to
            the standardized UTC offset codes as the value.

    Returns:
        Standardized UTC offset codes.
    """
    logger.debug("Standardizing UTC offset codes.")
    # Treat empty string as missing
    is_blank = df["utc_offset_code"] == ""
    code = df["utc_offset_code"].mask(is_blank)
    # Apply specific fixes on a per-respondent basis:
    df = code.groupby(df["respondent_id_ferc714"]).apply(
        lambda x: x.replace(offset_fixes[x.name]) if x.name in offset_fixes else x
    )
    return df
```

Oh wait, it doesn't really. Here's how it's being used:

```python
# Clean UTC offset codes
df["utc_offset_code"] = df["utc_offset_code"].str.strip().str.upper()
df["utc_offset_code"] = df.pipe(_standardize_offset_codes, OFFSET_CODE_FIXES)
# NOTE: Assumes constant timezone for entire year
```

Where `offset_fixes` is:

```python
offset_fixes = {
    264: {"CDS": "CDT"},
    271: {"EDS": "EDT"},
    275: {"CPT": "CST"},
    277: {
        "CPT": "CST",
        np.nan: "CST",
    },
    281: {"CEN": "CST"},
    288: {np.nan: "EST"},
    293: {np.nan: "MST"},
    294: {np.nan: "EST"},
    296: {"CPT": "CST"},
    297: {"CPT": "CST"},
}
```
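To answer the earlier question directly: grouping one Series by another Series returns a `SeriesGroupBy`, so `.apply` hands back a Series, not the whole dataframe. A minimal, self-contained sketch of the pattern (toy data, not PUDL's real inputs):

```python
import numpy as np
import pandas as pd

# Toy data mimicking the shape of the FERC 714 table.
df = pd.DataFrame({
    "respondent_id_ferc714": [264, 264, 288, 999],
    "utc_offset_code": ["CDS", "CST", np.nan, "EST"],
})
offset_fixes = {264: {"CDS": "CDT"}, 288: {np.nan: "EST"}}

# Group the code column by the id column: each group x is a Series whose
# .name attribute is the group key, so per-respondent fixes can be applied.
fixed = df["utc_offset_code"].groupby(df["respondent_id_ferc714"]).apply(
    lambda x: x.replace(offset_fixes[x.name]) if x.name in offset_fixes else x
)

# The result is a Series of cleaned codes, not a DataFrame.
assert isinstance(fixed, pd.Series)
```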
I think all of these have been addressed. The CI is still failing because of the geopandas issue, but I think it's been fixed on the pandas side.
Pandas has put out a release candidate for v1.5.0. I tried running our tests with the RC installed and there are some failures we need to address before we can move to the new version, which definitely has some features we'll appreciate!

Unit Test Failures

- `test/unit/extract/excel_test.py::TestGenericExtractor` fails because columns cannot be a set.
- `test/unit/analysis/spatial_test.py::test_overlay` fails because more than one column is named `geometry`. This doesn't seem to be attributable to our code. I created a geopandas issue. Looks like the underlying issue had already been identified: REF: avoid internals in merge code pandas-dev/pandas#48082
- Casting of the `report_year`/`report_date` columns is broken. This is due to a change in the way pandas handles casting Series to `datetime64` types with units larger than seconds. I created an issue and found a workaround in this commit.
- `test/unit/harvest_test.py::test_eia_example[resource2]` fails because the `actual` and `expected` dataframes are not equal. Maybe same issue as above? Here they have different numbers of columns, so it's not just a cosmetic difference.

Integration Test Failures

- `test/integration/output_test.py::test_ferc714_etl` fails with `RecursionError: maximum recursion depth exceeded while calling a Python object`. Stack trace in the details here.

Integration Test Warnings

- `pudl/transform/eia.py:1009`: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead. Triggered by `for row in too_many_codes.iteritems():`
- `pudl/analysis/allocate_net_gen.py:448`: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
- `pudl/helpers.py:1389`: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`. @cmgosnell this is in `dedupe_on_category()` and I have no idea what the function is supposed to do based on the docstring, but it looks like something related to `plant_parts_eia`.
- `pudl/transform/ferc714.py:332`: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
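The first two warnings have mechanical fixes. A sketch on toy data (the column names here are illustrative, not PUDL's real ones):

```python
import pandas as pd

df = pd.DataFrame({
    "grp": ["a", "a", "b"],
    "num": [1, 2, 3],
    "txt": ["x", "y", "z"],
})

# Fix 1: Series.iteritems() is deprecated; .items() is a drop-in
# replacement yielding the same (index, value) pairs.
pairs = list(df["num"].items())

# Fix 2: pass numeric_only explicitly instead of relying on the old
# default, which silently dropped non-numeric columns like "txt".
sums = df.groupby("grp").sum(numeric_only=True)
```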