Efficient queries for connection list #17360

Merged on Oct 10, 2022 (18 commits)

@pmossman pmossman commented Sep 28, 2022

What

The web_backend connections list handler finds all connections within a workspace and loops over each one, calling other handlers to build source reads and destination reads and to fetch the latest job info. Doing this once per connection is very slow and can require hundreds of queries for workspaces with many connections.

Instead, this PR performs all necessary queries up front and stores the information in maps that can then be referenced when building the response model for each connection. This should drastically reduce the number of database queries.

How

  • Write new repository methods that fetch information given lists of connections/sources/etc.
  • Write helpers that convert lists to maps for easy lookups
  • Pass these maps into the methods that construct response items to eliminate queries inside the connection loop
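The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `Source`, `Connection`, `ConnectionListItem`, `FakeRepository`, `findSourcesByIds`, `indexById`, and `buildList` are all hypothetical names, and the "repository" is an in-memory stand-in that only counts how many times it is queried.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical, simplified stand-ins; none of these names come from the PR. */
final class ConnectionListSketch {

  record Source(UUID id, String name) {}

  record Connection(UUID id, UUID sourceId) {}

  record ConnectionListItem(UUID connectionId, String sourceName) {}

  /** In-memory stand-in for a repository; it counts the "queries" it serves. */
  static final class FakeRepository {
    private final Map<UUID, Source> table;
    int queryCount = 0;

    FakeRepository(List<Source> sources) {
      this.table = indexById(sources);
    }

    /** Step 1: one batched lookup for a whole list of ids. */
    List<Source> findSourcesByIds(Collection<UUID> ids) {
      queryCount++;
      return ids.stream()
          .distinct()
          .map(table::get)
          .filter(Objects::nonNull)
          .collect(Collectors.toList());
    }
  }

  /** Step 2: convert a list into a map keyed by id for constant-time lookups. */
  static Map<UUID, Source> indexById(List<Source> sources) {
    return sources.stream().collect(Collectors.toMap(Source::id, Function.identity()));
  }

  /** Step 3: build every response item from the map, with no query inside the loop. */
  static List<ConnectionListItem> buildList(List<Connection> connections, FakeRepository repo) {
    List<UUID> sourceIds =
        connections.stream().map(Connection::sourceId).collect(Collectors.toList());
    Map<UUID, Source> sourcesById = indexById(repo.findSourcesByIds(sourceIds));

    List<ConnectionListItem> items = new ArrayList<>();
    for (Connection connection : connections) {
      Source source = sourcesById.get(connection.sourceId());
      items.add(new ConnectionListItem(connection.id(), source == null ? null : source.name()));
    }
    return items;
  }

  public static void main(String[] args) {
    List<Source> sources = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      sources.add(new Source(UUID.randomUUID(), "source-" + i));
    }
    List<Connection> connections = new ArrayList<>();
    for (int i = 0; i < 250; i++) {
      connections.add(new Connection(UUID.randomUUID(), sources.get(i % 5).id()));
    }
    FakeRepository repo = new FakeRepository(sources);
    List<ConnectionListItem> items = buildList(connections, repo);
    // The query count does not grow with the number of connections.
    System.out.println("items=" + items.size() + " queries=" + repo.queryCount);
  }
}
```

The same shape would presumably repeat for destinations, definitions, and latest jobs: one batched fetch per kind of data, one map per fetch, and only map lookups inside the loop.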

Performance Testing

I compared master and this branch in dev, using a workspace with ~250 active connections:

  • The code in master generates ~3.3k individual queries, and takes ~15-20 seconds. Here's an example span
  • The code in this branch generates 11 queries, and this number is constant no matter how many connections are in the workspace. Here's an example span. The workspace loads in ~1-3 seconds instead of ~15-20 seconds, which is obviously a massive improvement.

Tagging @malikdiarra and @davinchia for review as they both have context on our repository layer and experience with making similar optimizations. Definitely looking for thoughts and feedback on the new queries and in-memory grouping code.

@pmossman pmossman changed the title from "Parker/optimize connection list" to "Efficient queries for connection list" Sep 28, 2022
@github-actions github-actions bot added the area/platform (issues related to the platform) and area/server labels Sep 28, 2022
@pmossman pmossman temporarily deployed to more-secrets September 28, 2022 22:01 Inactive
@pmossman pmossman temporarily deployed to more-secrets September 29, 2022 00:55 Inactive
@pmossman pmossman force-pushed the parker/optimize-connection-list branch from 7da0acd to af043f7 Compare September 30, 2022 17:00
@pmossman pmossman temporarily deployed to more-secrets September 30, 2022 17:02 Inactive
@pmossman pmossman temporarily deployed to more-secrets September 30, 2022 20:51 Inactive
@pmossman pmossman force-pushed the parker/optimize-connection-list branch from 518f06a to 15fc834 Compare October 3, 2022 16:17
@pmossman pmossman temporarily deployed to more-secrets October 3, 2022 16:20 Inactive
@pmossman pmossman temporarily deployed to more-secrets October 3, 2022 16:34 Inactive
@pmossman pmossman force-pushed the parker/optimize-connection-list branch from f5b39bb to 058c5df Compare October 3, 2022 17:54
@pmossman pmossman marked this pull request as ready for review October 3, 2022 17:54
@pmossman pmossman temporarily deployed to more-secrets October 3, 2022 17:56 Inactive
@pmossman pmossman temporarily deployed to more-secrets October 4, 2022 18:41 Inactive
@davinchia
Contributor

Good points:

  • we removed another few queries by combining the joins.
  • this is so SO much more readable!

I'm curious what kind of performance improvement we have here.

@pmossman pmossman force-pushed the parker/optimize-connection-list branch from 84409c8 to 67d1441 Compare October 10, 2022 15:38
@pmossman pmossman temporarily deployed to more-secrets October 10, 2022 15:40 Inactive
@pmossman pmossman temporarily deployed to more-secrets October 10, 2022 16:17 Inactive
@pmossman pmossman merged commit 39a14b7 into master Oct 10, 2022
@pmossman pmossman deleted the parker/optimize-connection-list branch October 10, 2022 16:55
jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
* query once for all needed models, instead of querying within connections loop

* cleanup and fix failing tests

* pmd fix

* fix query and add test

* return empty if input list is empty

* undo aggressive autoformatting

* don't query for connection operations in a loop, instead query once and group-by connectionID in memory

* try handling operationIds in a single query instead of two

* remove optional

* fix operationIds query

* very annoying, test was failing because operationIds can be listed in a different order. verify operationIds separately from rest of object

* combined queries/functions instead of separate queries for actor and definition

* remove leftover lines that aren't doing anything

* format

* add javadoc

* format

* use leftjoin so that connections that lack operations aren't left out

* clean up comments and format
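
Several of the commit messages above (grouping operations by connection ID in memory, returning empty for an empty input list, using a left join, and verifying operation IDs independently of order) can be illustrated with a small sketch. It is an assumption-laden illustration, not the merged code: `JoinedRow`, `operationIdsByConnection`, `groupOperationIds`, and `sameIdsIgnoringOrder` are hypothetical names, and the database query is replaced by a caller-supplied function.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

/** Hypothetical sketch; names and shapes are illustrative assumptions. */
final class OperationGroupingSketch {

  /**
   * One row of an assumed "connection LEFT JOIN connection_operation" result.
   * operationId is null when the connection has no operations.
   */
  record JoinedRow(UUID connectionId, UUID operationId) {}

  /** Skips the query entirely when there is nothing to look up. */
  static Map<UUID, List<UUID>> operationIdsByConnection(
      List<UUID> connectionIds, Function<List<UUID>, List<JoinedRow>> query) {
    if (connectionIds.isEmpty()) {
      return Map.of();
    }
    return groupOperationIds(query.apply(connectionIds));
  }

  /** Groups rows in memory; a connection with no operations maps to an empty list. */
  static Map<UUID, List<UUID>> groupOperationIds(List<JoinedRow> rows) {
    Map<UUID, List<UUID>> byConnection = new LinkedHashMap<>();
    for (JoinedRow row : rows) {
      List<UUID> operationIds =
          byConnection.computeIfAbsent(row.connectionId(), id -> new ArrayList<>());
      if (row.operationId() != null) {
        operationIds.add(row.operationId());
      }
    }
    return byConnection;
  }

  /** Order-insensitive comparison, useful when a query gives no ordering guarantee. */
  static boolean sameIdsIgnoringOrder(Collection<UUID> a, Collection<UUID> b) {
    List<UUID> left = new ArrayList<>(a);
    List<UUID> right = new ArrayList<>(b);
    Collections.sort(left);
    Collections.sort(right);
    return left.equals(right);
  }

  public static void main(String[] args) {
    UUID withOps = UUID.randomUUID();
    UUID withoutOps = UUID.randomUUID();
    List<JoinedRow> rows = List.of(
        new JoinedRow(withOps, UUID.randomUUID()),
        new JoinedRow(withoutOps, null),
        new JoinedRow(withOps, UUID.randomUUID()));
    Map<UUID, List<UUID>> grouped = operationIdsByConnection(List.of(withOps, withoutOps), ids -> rows);
    System.out.println(grouped.get(withOps).size() + " operations, "
        + grouped.get(withoutOps).size() + " operations");
  }
}
```

With an inner join, the connection that has no operations would produce no row at all and would be missing from the map; the left join yields one row with a null operation ID, which the grouping step turns into an empty list.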
Labels
area/platform issues related to the platform area/server

2 participants