reading in pandas/materialized views #602

jmensch1 · 2020-05-11T17:35:25Z

This PR does two things: (1) it changes the DataService to read from the database using pandas instead of the ORM, and (2) it adds two materialized views to the database.

Reading with pandas instead of the ORM

Using pandas instead of the ORM to read from the DB makes the code simpler without hurting performance. If we continue to use the ORM, we'll have to define new models for every different table or view that we're reading from. Plus, with the ORM, reading requires two steps: first the DataService uses the ORM to generate a list of dictionaries, and second, the services that use the DataService put that data into a pandas dataframe. It's cleaner to just read the data into a dataframe (using pd.read_sql), since that's where it ends up anyway.
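
To make the two-step vs. one-step difference concrete, here is a minimal sketch. It is not the DataService code from this PR: the in-memory SQLite database, the `requests` table, and its columns and values are all assumptions chosen so the snippet runs on its own.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the project's database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (requesttype TEXT, nc INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("Graffiti Removal", 4), ("Bulky Items", 4), ("Graffiti Removal", 7)],
)

query = "SELECT requesttype, nc FROM requests WHERE nc = ?"

# Two-step pattern: build a list of dicts first, then turn it into a dataframe.
cursor = conn.execute(query, (4,))
columns = [col[0] for col in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
df_two_step = pd.DataFrame(rows)

# One-step pattern: read the query result straight into a dataframe.
df_one_step = pd.read_sql(query, conn, params=(4,))
```

Both dataframes hold the same rows; the second version just skips the intermediate list of dictionaries.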

As for performance, I tested pandas reads vs. ORM reads pretty extensively. The read times were almost identical, with maybe a slight advantage to pandas reads. The code for those tests, and some handy charts, is here.

Materialized views

The PR creates two materialized views at the end of the ingest script. There's a view called 'map', which supports the /pin-clusters and /heatmap endpoints, and another one called 'vis', which supports all of the visualization and comparison endpoints. Both views have six columns, and both have indexes on all of the columns that we use for filtering.
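
For readers who have not used materialized views, a rough sketch of the kind of SQL involved follows. This is not the ingest script from this PR: the helper `build_view_statements`, the source table name `ingest_staging`, and every column name are hypothetical, and the statements assume PostgreSQL syntax.

```python
def build_view_statements(view, source_table, columns, indexed_columns):
    """Return the SQL to create a materialized view plus one index per filter column."""
    unknown = set(indexed_columns) - set(columns)
    if unknown:
        raise ValueError(f"cannot index columns missing from the view: {sorted(unknown)}")

    select_list = ", ".join(columns)
    statements = [
        f"CREATE MATERIALIZED VIEW {view} AS SELECT {select_list} FROM {source_table}"
    ]
    # PostgreSQL picks an index name automatically when none is given.
    statements += [f"CREATE INDEX ON {view} ({col})" for col in indexed_columns]
    return statements


# Hypothetical column names; the real views in this PR may use different ones.
map_statements = build_view_statements(
    view="map",
    source_table="ingest_staging",
    columns=["srnumber", "requesttype", "nc", "createddate", "latitude", "longitude"],
    indexed_columns=["requesttype", "nc", "createddate"],
)

for statement in map_statements:
    print(statement + ";")
```

A materialized view stores a snapshot of its query, so it goes stale when the underlying table changes until it is refreshed with `REFRESH MATERIALIZED VIEW` or re-created. Building the views at the end of the ingest script, as this PR does, keeps them in step with each ingest.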

As discussed on Slack, reading from materialized views is way faster than reading from the full ingest staging table. (Code for those tests is here.) On my local machine, reading a one-year period with all request types and all NCs takes about 14 seconds from the full table, but only around 4 seconds from the 'map' view. (4 seconds is still too slow IMO, and I'm gonna continue working on getting the time down.)

As for the 10M row limit on Heroku -- I created a materialized view in the production database for testing, and it didn't have any effect on the row count in the Heroku console. So it looks like materialized views don't count towards that limit.

Up to date with dev branch
Branch name follows guidelines
All PR Status checks are successful
Peer reviewed and approved

Any questions? See the getting started guide

sellnat77 · 2020-05-12T04:37:58Z

Wait so like... how did you get creds to the prod db? They only exist in 1 place (to my knowledge)

sellnat77 · 2020-05-12T04:39:41Z

server/src/services/dataService.py

-            Request.nc.in_(ncList),
-        ]
+
+        requestTypes = (', ').join([f"'{rt}'" for rt in requestTypes])


Deja vu 😂

I know lol. I was all about the ORM for a minute (sheep emoji), but it's actually kind of annoying. Documentation isn't great, plus it's just an extra code layer that we really don't need for this app.
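
For anyone skimming the diff, this sketch shows what the added `requestTypes` line produces, next to one alternative. The column name `requesttype`, the sample values, and the `%s` placeholder style (used by drivers such as psycopg2) are assumptions, not taken from the PR.

```python
request_types = ["Graffiti Removal", "Bulky Items"]  # illustrative values

# Same expression as the added line in the diff: a quoted, comma-separated list.
quoted = (', ').join([f"'{rt}'" for rt in request_types])
where_inline = f"requesttype IN ({quoted})"

# Alternative: emit placeholders and pass the values separately,
# leaving the quoting to the database driver.
placeholders = ', '.join(['%s'] * len(request_types))
where_params = f"requesttype IN ({placeholders})"
params = tuple(request_types)

print(where_inline)
print(where_params, params)
```

The inline version is fine while the values come from a fixed set of request types; the placeholder version is the safer option if they ever come straight from user input.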

sellnat77 · 2020-05-12T04:43:27Z

server/src/services/heatmapService.py

@@ -28,7 +28,7 @@ async def get_heatmap(self, filters):
                filters['requestTypes'],
                filters['ncList'])

-            pins = dataAccess.query(fields, filters)
+            pins = dataAccess.query(fields, filters, table='map')


We should probably put a table existence safety check somewhere in the dataAccess layer. I'm cool with it going in after this PR, but it should prob go in at some point.

yeah good call, I'll make a ticket
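
One possible shape for that check, as a sketch only: the names `KNOWN_TABLES`, `validate_table`, and `build_query` are hypothetical and not part of the DataService in this PR. It assumes the set of readable tables is just the two views added here.

```python
# Hypothetical allow-list: the two materialized views this PR adds.
KNOWN_TABLES = frozenset({"map", "vis"})


def validate_table(table, known_tables=KNOWN_TABLES):
    """Return the table name if it is known, otherwise raise ValueError."""
    if table not in known_tables:
        raise ValueError(
            f"unknown table {table!r}; expected one of {sorted(known_tables)}"
        )
    return table


def build_query(fields, table):
    """Build a SELECT statement, refusing table names that are not allow-listed."""
    table = validate_table(table)
    return f"SELECT {', '.join(fields)} FROM {table}"


print(build_query(["latitude", "longitude"], "map"))

try:
    build_query(["latitude"], "does_not_exist")
except ValueError as err:
    print(f"rejected: {err}")
```

An allow-list like this fails fast on typos in the `table` argument and keeps arbitrary strings out of the SQL. A stricter variant could also ask the database whether the view actually exists before querying it.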

jmensch1 added 4 commits May 11, 2020 08:41

removed arg to DataService constructor in /apistatus

db065e3

converted DataService to pandas

6bdc330

map and vis views

ad2a7fd

handling null fields or filters in dataservice.query

a9511e2

jmensch1 added this to the 311-Data - Beta milestone May 11, 2020

jmensch1 requested a review from sellnat77 May 11, 2020 17:50

sellnat77 reviewed May 12, 2020

sellnat77 approved these changes May 12, 2020

jmensch1 merged commit d89c848 into dev May 12, 2020

jmensch1 deleted the BACK-MaterializedViews branch May 12, 2020 12:51
