-
-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reading in pandas/materialized views #602
Conversation
Wait so like...how did you get creds to the prod db? They only exist in 1 place(to my knowledge) |
Request.nc.in_(ncList), | ||
] | ||
|
||
requestTypes = (', ').join([f"'{rt}'" for rt in requestTypes]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deja vu 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know lol. I was all about the ORM for a minute (sheep emoji), but it's actually kind of annoying. Documentation isn't great, plus it's just an extra code layer that we really don't need for this app.
@@ -28,7 +28,7 @@ async def get_heatmap(self, filters): | |||
filters['requestTypes'], | |||
filters['ncList']) | |||
|
|||
pins = dataAccess.query(fields, filters) | |||
pins = dataAccess.query(fields, filters, table='map') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably put in a table existence safety check somewhere in the dataAcess layer, I'm cool with it going in after this PR but it should prob go in at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah good call, I'll make a ticket
This PR does two things: (1) it changes the DataService to read from the database using pandas instead of the ORM, and (2) adds two materialized views to the database.
Reading with pandas instead of the ORM
Using pandas instead of the ORM to read from the DB makes the code simpler without hurting performance. If we continue to use the ORM, we'll have to define new models for every different table or view that we're reading from. Plus, with the ORM, reading requires two steps: first the DataService uses the ORM to generate a list of dictionaries, and second, the services that use the DataService put that data into a pandas dataframe. It's cleaner to just read the data into a dataframe (using
pd.read_sql
), since that's where it ends up anyway.As for performance, I tested pandas reads vs. ORM reads pretty extensively. The read times were almost identical, with maybe a slight advantage to pandas reads. The code for those tests, and some handy charts, is here.
Materialized views
The PR creates two materialized views at the end of the ingest script. There's a view called 'map', which supports the
/pin-clusters
and/heatmap
endpoints, and another one called 'vis', which supports all of the visualization and comparison endpoints. Both views have six columns, and both have indexes on all of the columns that we use for filtering.As discussed on slack, reading from materialized views is way faster than reading from the full ingest staging table. (Code for those tests is here). On my local, reading a one-year period with all request types and all NCs takes about 14 seconds when I used the full table, but only around 4 seconds to read from the 'map' view. (4 seconds is still too slow IMO, and I'm gonna continue working on getting the time down.)
As for the 10M row limit on heroku -- I created a materialized view in the production database for testing, and it didn't have any effect on the row count in the heroku console. So it looks like materialized views don't count towards that limit.
dev
branchAny questions? See the getting started guide