Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reading in pandas/materialized views #602

Merged
merged 4 commits into from
May 12, 2020
Merged

reading in pandas/materialized views #602

merged 4 commits into from
May 12, 2020

Conversation

jmensch1
Copy link
Contributor

@jmensch1 jmensch1 commented May 11, 2020

This PR does two things: (1) it changes the DataService to read from the database using pandas instead of the ORM, and (2) adds two materialized views to the database.

Reading with pandas instead of the ORM

Using pandas instead of the ORM to read from the DB makes the code simpler without hurting performance. If we continue to use the ORM, we'll have to define new models for every different table or view that we're reading from. Plus, with the ORM, reading requires two steps: first the DataService uses the ORM to generate a list of dictionaries, and second, the services that use the DataService put that data into a pandas dataframe. It's cleaner to just read the data into a dataframe (using pd.read_sql), since that's where it ends up anyway.

As for performance, I tested pandas reads vs. ORM reads pretty extensively. The read times were almost identical, with maybe a slight advantage to pandas reads. The code for those tests, and some handy charts, is here.

Materialized views

The PR creates two materialized views at the end of the ingest script. There's a view called 'map', which supports the /pin-clusters and /heatmap endpoints, and another one called 'vis', which supports all of the visualization and comparison endpoints. Both views have six columns, and both have indexes on all of the columns that we use for filtering.

As discussed on slack, reading from materialized views is way faster than reading from the full ingest staging table. (Code for those tests is here). On my local, reading a one-year period with all request types and all NCs takes about 14 seconds when I used the full table, but only around 4 seconds to read from the 'map' view. (4 seconds is still too slow IMO, and I'm gonna continue working on getting the time down.)

As for the 10M row limit on heroku -- I created a materialized view in the production database for testing, and it didn't have any effect on the row count in the heroku console. So it looks like materialized views don't count towards that limit.

  • Up to date with dev branch
  • Branch name follows guidelines
  • All PR Status checks are successful
  • Peer reviewed and approved

Any questions? See the getting started guide

@jmensch1 jmensch1 added this to the 311-Data - Beta milestone May 11, 2020
@jmensch1 jmensch1 requested a review from sellnat77 May 11, 2020 17:50
@sellnat77
Copy link
Member

Wait so like...how did you get creds to the prod db? They only exist in 1 place(to my knowledge)

Request.nc.in_(ncList),
]

requestTypes = (', ').join([f"'{rt}'" for rt in requestTypes])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deja vu 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know lol. I was all about the ORM for a minute (sheep emoji), but it's actually kind of annoying. Documentation isn't great, plus it's just an extra code layer that we really don't need for this app.

@@ -28,7 +28,7 @@ async def get_heatmap(self, filters):
filters['requestTypes'],
filters['ncList'])

pins = dataAccess.query(fields, filters)
pins = dataAccess.query(fields, filters, table='map')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably put in a table existence safety check somewhere in the dataAcess layer, I'm cool with it going in after this PR but it should prob go in at some point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good call, I'll make a ticket

@jmensch1 jmensch1 merged commit d89c848 into dev May 12, 2020
@jmensch1 jmensch1 deleted the BACK-MaterializedViews branch May 12, 2020 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants