Ensure PUDL works with Pandas 1.5.0 #1902

zaneselvans · 2022-09-03T18:46:49Z

Pandas v1.5.0rc0 breaks PUDL, so we need to chase down some problems.

Closes #1901

This is a (small) hack that works around changes in how pandas deals with casting Series to `datetime64` types have units larger than seconds. We depended on the previous behavior. Not sure if it's something that will get fixed in pandas, but I made this issue and it's a breaking change, so hopefully: pandas-dev/pandas#48574

codecov · 2022-09-16T18:45:08Z

Codecov Report

Base: 82.7% // Head: 82.7% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (4017d3a) compared to base (851a98c).
Patch coverage: 97.6% of modified lines in pull request are covered.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #1902   +/-   ##
=====================================
  Coverage   82.7%   82.7%           
=====================================
  Files         65      65           
  Lines       7398    7424   +26     
=====================================
+ Hits        6123    6147   +24     
- Misses      1275    1277    +2

Impacted Files	Coverage Δ
src/pudl/analysis/allocate_net_gen.py	`97.2% <ø> (ø)`
src/pudl/metadata/constants.py	`100.0% <ø> (ø)`
src/pudl/transform/ferc714.py	`88.2% <94.1%> (-0.1%)`	⬇️
src/pudl/extract/excel.py	`95.4% <100.0%> (ø)`
src/pudl/helpers.py	`87.4% <100.0%> (-0.3%)`	⬇️
src/pudl/output/ferc714.py	`93.5% <100.0%> (+0.6%)`	⬆️
src/pudl/transform/eia.py	`95.2% <100.0%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/pudl/extract/excel.py

src/pudl/helpers.py

src/pudl/metadata/constants.py

zaneselvans · 2022-09-20T04:55:52Z

For some reason the ferc714 transform and output routines with pandas 1.5 are using much more memory than they did with pandas 1.4.4 -- enough that the GitHub runner gets shut down (more than 7GB).

Running the tests locally it seems like they max out at ~17GB. Clearly something has changed, but it's not clear to me how the minor changes I made to the pudl.transform.ferc714 module could be responsible for this.

@TrentonBush @bendnorman is there anything in there that seems like a big memory disaster to you? The resulting dataframes get PUDL dtypes enforced in the transform() method at the end of the module, but I don't think that converts the timezones to categoricals, and messing around with it now, as strings they take up 2/3 of the memory of the demand_hourly_pa_ferc714 table (which is by far the largest table with like 15 million rows). I could try switching to using pudl.metadata.classes.Resource.format_df() instead which will convert to categorical values where there's an ENUM constraint in the resource definition.

With that change in place it looks like the transformation maybe stays under 7GB of memory usage, but the pudl.output.ferc714 routines are still using much more memory than they were before -- that's where it maxed out at 17GB before, in the Respondents.summarize_demand() method. Maybe I should really be using dask there anyway since it's extremely serializable. Oh but dask doesn't like multi-indices.

zaneselvans · 2022-09-20T15:31:37Z

I re-ran this code changing nothing but the version of pandas that was installed and it seems like the increase in memory usage is due to pandas, not pudl. ferc714_out.summarize_demand():

pandas v1.4.4: ~3GB; 0m35s
pandas v1.5.0: ~17GB; 8m23s

zaneselvans · 2022-09-20T18:19:00Z

It turns out this memory blowup was due the the same issue I reported in pandas-dev/pandas#48574

In this case we were trying to downsample the hourly FERC-714 timestamps to annual resolution, so we ended up with 8760x more records than expected. Which were then merged with all of the respondent IDs, leading to a dataframe with 28M records when it should really only have had about 3400.

I searched the whole codebase for any other instances of datetime64\[[Y,M,D,H,m]\] and didn't find any other instances so... hopefully the CI passes now and we can merge this in.

zaneselvans added 2 commits September 3, 2022 13:37

Bump pandas & setuptools versions. Fix bad geopandas version in testenv.

1efa3f6

Convert set of dataframe columns into a list.

745d649

zaneselvans added the dependencies Pull requests that update a dependency file label Sep 3, 2022

zaneselvans linked an issue Sep 3, 2022 that may be closed by this pull request

Make PUDL compatible with Pandas 1.5 #1901

Closed

9 tasks

zaneselvans added 10 commits September 3, 2022 15:35

Avoid infinite df.replace() recursion.

26483e9

Replace df.iteritems() with df.items() in BGA compilation.

e717230

Update setuptools and geopandas in environment.yml

538f299

Merge branch 'dev' into pandas-1.5

fe6fb2f

Merge branch 'dev' into pandas-1.5

1f3ba5d

Merge branch 'dev' into pandas-1.5

d88a827

Remove unnecessary df.loc to avoid FutureWarning

7a0a12d

Use df.transform instead of a weird apply() to avoid FutureWarning.

ac43e8d

Revert to v1.4 of pandas to get CI passing...

bdd703d

zaneselvans marked this pull request as ready for review September 16, 2022 17:16

zaneselvans requested a review from TrentonBush September 16, 2022 17:17

Revert overzealous method chaining that made a bug...

d30cdba

TrentonBush approved these changes Sep 20, 2022

View reviewed changes

src/pudl/extract/excel.py Outdated Show resolved Hide resolved

src/pudl/helpers.py Show resolved Hide resolved

src/pudl/metadata/constants.py Outdated Show resolved Hide resolved

zaneselvans added 2 commits September 19, 2022 22:01

Respond to Trenton's PR comments and allow use of pandas 1.5.0

93f6f48

Merge branch 'dev' into pandas-1.5

cea6f1e

Use format_df() at the end of FERC-714 transform to save memory.

8f62893

Add logging to FERC-714 output and fix another astype(datetime64[Y]).

4017d3a

zaneselvans merged commit 321a339 into dev Sep 20, 2022

zaneselvans deleted the pandas-1.5 branch September 20, 2022 19:28

bendnorman mentioned this pull request Sep 23, 2022

Rollback pandas-1.5 #1945

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure PUDL works with Pandas 1.5.0 #1902

Ensure PUDL works with Pandas 1.5.0 #1902

zaneselvans commented Sep 3, 2022 •

edited

Loading

codecov bot commented Sep 16, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

Ensure PUDL works with Pandas 1.5.0 #1902

Ensure PUDL works with Pandas 1.5.0 #1902

Conversation

zaneselvans commented Sep 3, 2022 • edited Loading

codecov bot commented Sep 16, 2022 • edited Loading

Codecov Report

zaneselvans commented Sep 20, 2022 • edited Loading

zaneselvans commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

zaneselvans commented Sep 3, 2022 •

edited

Loading

codecov bot commented Sep 16, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022 •

edited

Loading