Ensure PUDL works with Pandas 1.5.0 #1902
Conversation
This is a (small) hack that works around changes in how pandas handles casting Series to `datetime64` types with units larger than seconds. We depended on the previous behavior. I'm not sure whether it's something that will get fixed in pandas, but since it's a breaking change I made an issue for it, so hopefully: pandas-dev/pandas#48574
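To illustrate the behavior PUDL relied on: before pandas 1.5, casting a datetime Series to a coarser `datetime64` unit truncated each timestamp to that resolution. A minimal, version-stable sketch of the same truncation (using `to_period`, and hypothetical sample data rather than PUDL's actual FERC-714 frames):

```python
import pandas as pd

# Hypothetical stand-in for hourly FERC-714 timestamps: one year of hours.
hourly = pd.Series(pd.date_range("2020-01-01", periods=8760, freq="h"))

# Pre-1.5 trick (no longer reliable): hourly.astype("datetime64[Y]")
# truncated every timestamp to the start of its year. An explicit,
# version-stable equivalent truncates via periods instead:
annual = hourly.dt.to_period("Y").dt.to_timestamp()

# All 8760 hourly records now collapse to a single annual timestamp.
print(annual.nunique())
```

The period round-trip makes the intended resolution explicit instead of depending on an implicit side effect of `astype`.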
Codecov Report: Base 82.7% // Head 82.7%
Additional details and impacted files

@@ Coverage Diff @@
##             dev   #1902    +/-   ##
=====================================
  Coverage   82.7%   82.7%
=====================================
  Files         65      65
  Lines       7398    7424    +26
=====================================
+ Hits        6123    6147    +24
- Misses      1275    1277     +2
☔ View full report at Codecov.
For some reason the ferc714 transform and output routines are using much more memory with pandas 1.5 than they did with pandas 1.4.4 -- enough that the GitHub runner (which has 7GB available) gets shut down. Running the tests locally, they seem to max out at ~17GB. Clearly something has changed, but it's not clear to me how the minor changes I made could be responsible. @TrentonBush @bendnorman is there anything in there that seems like a big memory disaster to you? The resulting dataframes get PUDL dtypes enforced. With that change in place it looks like the transformation maybe stays under 7GB of memory usage.
I re-ran this code changing nothing but the version of pandas that was installed and it seems like the increase in memory usage is due to pandas, not pudl.
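One way to confirm a pandas-version regression like this is to diff per-column memory under each version. A rough sketch (column names here are illustrative, not PUDL's actual schema):

```python
import pandas as pd

# Hypothetical frame: run this identical snippet under pandas 1.4.4 and
# 1.5.0, then compare the printed per-column byte counts.
df = pd.DataFrame({
    "respondent_id": list(range(1000)),
    "report_date": pd.to_datetime("2020-01-01"),
})

# deep=True counts the actual bytes held by object columns, not just
# pointer sizes, so it reflects real memory consumption.
per_column = df.memory_usage(deep=True)
print(per_column)
print(f"total: {per_column.sum()} bytes")
```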
It turns out this memory blowup was due to the same issue I reported in pandas-dev/pandas#48574. In this case we were trying to downsample the hourly FERC-714 timestamps to annual resolution, so we ended up with 8760x more records than expected, which were then merged with all of the respondent IDs, leading to a dataframe with 28M records when it should really only have had about 3400. I searched the whole codebase for any other instances of
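A cheap guard against this kind of silent row explosion is pandas' `validate=` argument to `merge`, which raises instead of fanning out when key uniqueness assumptions break. A hedged sketch with made-up frame and column names (not PUDL's actual tables):

```python
import pandas as pd

# Hypothetical annual records and respondent lookup table.
annual = pd.DataFrame({"respondent_id": [1, 2, 3], "year": [2020] * 3})
respondents = pd.DataFrame(
    {"respondent_id": [1, 2, 3], "name": ["a", "b", "c"]}
)

# "many_to_one" asserts the right-hand keys are unique. If un-truncated
# hourly timestamps had left duplicate keys on one side, pandas would
# raise MergeError up front instead of silently producing 8760x the rows.
out = annual.merge(respondents, on="respondent_id", validate="many_to_one")
print(len(out))
```

Adding `validate=` to merges that assume uniqueness turns an hours-long memory hunt into an immediate, descriptive error.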
Pandas v1.5.0rc0 breaks PUDL, so we need to chase down some problems.
Closes #1901