Rollback pandas-1.5 #1945

Merged
merged 1 commit into from
Sep 26, 2022
bendnorman
Member

Since updating to pandas-1.5, our nightly builds have doubled in memory use. I'm opening this PR to test the memory use with pandas 1.4.x.

@bendnorman bendnorman changed the base branch from main to dev September 22, 2022 17:42
@codecov
codecov bot commented Sep 22, 2022

Codecov Report

Base: 82.8% // Head: 82.8% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (6397698) compared to base (32caab0).
Patch has no changes to coverable lines.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #1945   +/-   ##
=====================================
  Coverage   82.8%   82.8%           
=====================================
  Files         65      65           
  Lines       7436    7436           
=====================================
+ Hits        6158    6159    +1     
+ Misses      1278    1277    -1     
Impacted Files Coverage Δ
src/pudl/helpers.py 87.7% <0.0%> (+0.2%) ⬆️

@bendnorman
Member Author

bendnorman commented Sep 23, 2022

The nightly build for this branch passed on a 32 GB machine and has a memory profile similar to the builds before upgrading to pandas 1.5. Memory profiles:

[Image: nightly_build_memory_issues]

Some potential pandas-1.5 bugs:

@zaneselvans
Member

I'd be surprised if the downcasting solution wasn't reducing memory usage in the places where it's been applied. The blowup was happening because e.g. running astype("datetime64[Y]") on an hourly dataframe in pandas 1.5 results in... an hourly dataframe. So there were still a huge number of records, and then they got merged with other dataframes, leading to tens of millions of rows where there should only have been thousands. The workaround that turns a Series into a NumPy array, downcasts, and then rebuilds a new Series definitely seemed to result in the right number of records.
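
A minimal sketch of the NumPy round-trip workaround described above, assuming a Series of hourly timestamps. The function name `truncate_to_year` and the sample data are hypothetical and are not taken from the PUDL codebase; the point is only that NumPy does the truncation, so the result does not depend on how pandas handles astype("datetime64[Y]") on a Series.

```python
import numpy as np
import pandas as pd


def truncate_to_year(timestamps: pd.Series) -> pd.Series:
    """Truncate datetimes to the start of their year via a NumPy round trip.

    Hypothetical helper for illustration; not the function used in PUDL.
    """
    # NumPy's astype truncates each value to year resolution.
    yearly: np.ndarray = timestamps.to_numpy().astype("datetime64[Y]")
    # Rebuild a Series with the original index and name. pandas stores the
    # values at a resolution it supports, each one falling on January 1.
    return pd.Series(yearly, index=timestamps.index, name=timestamps.name)


# Hypothetical sample data: 400 days of hourly timestamps starting in 2020.
hourly = pd.Series(
    pd.date_range("2020-01-01", periods=24 * 400, freq=pd.Timedelta(hours=1)),
    name="report_date",
)
yearly = truncate_to_year(hourly)

# The row count is unchanged, but the number of distinct merge keys collapses.
print(len(yearly), hourly.nunique(), yearly.nunique())
```

Checking the number of distinct key values before a merge is a cheap way to confirm that the truncation actually happened, since a merge on keys that are still hourly is what multiplies the row count.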

I searched the codebase for any instance of datetime64 with day/month/year resolutions and the only one I found outside the report_date formatting function was in the FERC-714 hourly demand table transform, which I fixed.

Maybe there are other ways that one might try to invoke the same behavior without explicitly mentioning the data type? Should we run a memory profiler and see where it explodes?
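
One possible way to start, sketched with the standard library's tracemalloc module rather than any particular profiler used in this project. The helper `peak_memory_bytes` and the toy workload `build_list` are hypothetical stand-ins; in practice the wrapped callable would be an individual ETL step, run once under each pandas version.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced bytes while it ran).

    Hypothetical helper for illustration only.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_list(n):
    """Toy workload standing in for a transform step."""
    return [float(i) for i in range(n)]


small, small_peak = peak_memory_bytes(build_list, 1_000)
large, large_peak = peak_memory_bytes(build_list, 100_000)
print(f"small: {small_peak} bytes, large: {large_peak} bytes")
```

Comparing the peak for the same step across the two pandas versions should narrow down which transform is responsible for the extra memory.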

@bendnorman
Copy link
Member Author

I'll profile the EIA ETL because it's using 2x the memory with pandas-1.5.

Member

@zaneselvans zaneselvans left a comment

Welp, it's annoying but I guess we should merge this into dev to get the nightly builds working again and re-open #1901 until we've hunted down the memory issue, or it's been fixed upstream.

@bendnorman bendnorman merged commit 81d10c1 into dev Sep 26, 2022
@bendnorman bendnorman deleted the rollback-pandas-1.5 branch September 26, 2022 17:29