Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure PUDL works with Pandas 1.5.0 #1902

Merged
merged 17 commits into from
Sep 20, 2022
Merged

Ensure PUDL works with Pandas 1.5.0 #1902

merged 17 commits into from
Sep 20, 2022

Conversation

zaneselvans
Copy link
Member

@zaneselvans zaneselvans commented Sep 3, 2022

Pandas v1.5.0rc0 breaks PUDL, so we need to chase down some problems.

Closes #1901

@zaneselvans zaneselvans added the dependencies Pull requests that update a dependency file label Sep 3, 2022
@zaneselvans zaneselvans linked an issue Sep 3, 2022 that may be closed by this pull request
9 tasks
@zaneselvans zaneselvans marked this pull request as ready for review September 16, 2022 17:16
@codecov
Copy link

codecov bot commented Sep 16, 2022

Codecov Report

Base: 82.7% // Head: 82.7% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (4017d3a) compared to base (851a98c).
Patch coverage: 97.6% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #1902   +/-   ##
=====================================
  Coverage   82.7%   82.7%           
=====================================
  Files         65      65           
  Lines       7398    7424   +26     
=====================================
+ Hits        6123    6147   +24     
- Misses      1275    1277    +2     
Impacted Files Coverage Δ
src/pudl/analysis/allocate_net_gen.py 97.2% <ø> (ø)
src/pudl/metadata/constants.py 100.0% <ø> (ø)
src/pudl/transform/ferc714.py 88.2% <94.1%> (-0.1%) ⬇️
src/pudl/extract/excel.py 95.4% <100.0%> (ø)
src/pudl/helpers.py 87.4% <100.0%> (-0.3%) ⬇️
src/pudl/output/ferc714.py 93.5% <100.0%> (+0.6%) ⬆️
src/pudl/transform/eia.py 95.2% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/pudl/extract/excel.py Outdated Show resolved Hide resolved
src/pudl/helpers.py Show resolved Hide resolved
src/pudl/metadata/constants.py Outdated Show resolved Hide resolved
@zaneselvans
Copy link
Member Author

zaneselvans commented Sep 20, 2022

For some reason the ferc714 transform and output routines with pandas 1.5 are using much more memory than they did with pandas 1.4.4 -- enough that the GitHub runner gets shut down (more than 7GB).

Running the tests locally it seems like they max out at ~17GB. Clearly something has changed, but it's not clear to me how the minor changes I made to the pudl.transform.ferc714 module could be responsible for this.

@TrentonBush @bendnorman is there anything in there that seems like a big memory disaster to you? The resulting dataframes get PUDL dtypes enforced in the transform() method at the end of the module, but I don't think that converts the timezones to categoricals, and messing around with it now, as strings they take up 2/3 of the memory of the demand_hourly_pa_ferc714 table (which is by far the largest table with like 15 million rows). I could try switching to using pudl.metadata.classes.Resource.format_df() instead which will convert to categorical values where there's an ENUM constraint in the resource definition.

With that change in place it looks like the transformation maybe stays under 7GB of memory usage, but the pudl.output.ferc714 routines are still using much more memory than they were before -- that's where it maxed out at 17GB before, in the Respondents.summarize_demand() method. Maybe I should really be using dask there anyway since it's extremely serializable. Oh but dask doesn't like multi-indices.

@zaneselvans
Copy link
Member Author

I re-ran this code changing nothing but the version of pandas that was installed and it seems like the increase in memory usage is due to pandas, not pudl. ferc714_out.summarize_demand():

  • pandas v1.4.4: ~3GB; 0m35s
  • pandas v1.5.0: ~17GB; 8m23s

@zaneselvans
Copy link
Member Author

It turns out this memory blowup was due the the same issue I reported in pandas-dev/pandas#48574

In this case we were trying to downsample the hourly FERC-714 timestamps to annual resolution, so we ended up with 8760x more records than expected. Which were then merged with all of the respondent IDs, leading to a dataframe with 28M records when it should really only have had about 3400.

I searched the whole codebase for any other instances of datetime64\[[Y,M,D,H,m]\] and didn't find any other instances so... hopefully the CI passes now and we can merge this in.

@zaneselvans zaneselvans merged commit 321a339 into dev Sep 20, 2022
@zaneselvans zaneselvans deleted the pandas-1.5 branch September 20, 2022 19:28
@bendnorman bendnorman mentioned this pull request Sep 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make PUDL compatible with Pandas 1.5
2 participants