
fix/feat: use joins instead of n+1/opening multiple connections #1297

Merged 8 commits into main on Feb 1, 2023

Conversation

@markphelps (Collaborator) commented Jan 28, 2023

Overview

This is kind of a doozy so bear with me. 🐻

This came about while I was looking at our StackHawk scans in reference to FLI-177.

Basically, StackHawk issues a bunch of concurrent queries to probe our app for security concerns, SQL injections, etc.

When I was running this locally (using the SQLite DB for Flipt), I kept running into these 500 errors:

ERROR	finished unary call with code Internal	{"server": "grpc", "grpc.start_time": "2023-01-27T15:40:34-05:00", "system": "grpc", "span.kind": "server", "grpc.service": "flipt.Flipt", "grpc.method": "GetRule", "peer.address": "127.0.0.1:63802", "error": "rpc error: code = Internal desc = database is locked", "grpc.code": "Internal", "grpc.time_ms": 5191.929}
 database is locked

Being the relevant bit.

A trip to the DB

Looking into this further and doing some googling, I realized that this is mainly a problem with SQLite and opening multiple connections simultaneously. SQLite is really just a file, so it makes sense that only one connection should hold the lock on the file while querying.

This also led me to mattn/go-sqlite3#274, where they basically say the same thing and recommend:

  1. Always calling rows.Close() to return the connection back to the pool
  2. Setting db.SetMaxOpenConns(1) to enforce this constraint of a single connection
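As a compile-only sketch, the two recommendations together look roughly like this (the table, query, and function names are illustrative, not Flipt's actual storage code):

```go
package storage

import (
	"context"
	"database/sql"
)

// configureSQLite caps the pool at one open connection, per the
// go-sqlite3 advice above; db is an already-opened *sql.DB.
func configureSQLite(db *sql.DB) {
	db.SetMaxOpenConns(1)
}

// listFlagKeys shows the other half of the advice: always close the
// rows so the single connection is returned to the pool.
func listFlagKeys(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, "SELECT key FROM flags")
	if err != nil {
		return nil, err
	}
	defer rows.Close() // returns the connection to the pool

	var keys []string
	for rows.Next() {
		var k string
		if err := rows.Scan(&k); err != nil {
			return nil, err
		}
		keys = append(keys, k)
	}
	return keys, rows.Err()
}
```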

You can see the deadlock/timeout happening now in the unit tests since I followed this advice to set max open connections to 1:

 ?   	go.flipt.io/flipt/internal/storage/oplock/testing	[no test files]
coverage: 67.0% of statements
panic: test timed out after 10m0s

Going through our storage code

This led me to look at our storage (SQL) layer to ensure we were doing (1), which we are, but then I realized there are several cases where we are issuing multiple queries at once ourselves. Mainly when we try to populate a 1-many relationship such as flag HAS MANY variants, like here and here (GetFlag and ListFlags respectively).

Here and in several other places, we are issuing 1 query to get the parent (flag), then calling this method (variants) to issue a query to get all the children (variants) for that flag.

The key part is that we don't actually close the flag Row before issuing the second query to get all the variants, so we are basically creating a deadlock in the case of SQLite.

While we could work around this for GetFlag, by first getting the flag row, then closing the row, then issuing the query to get all variants, this approach won't work as easily for ListFlags as we are required to loop through each row to hydrate the flags.

Proposal(s)

1. Joins The Old Fashioned Way

The correct way to do this kind of thing in SQL is to just use JOINs and get all the flags and their variants at once. This has the benefit of not locking the db in our case and also results in fewer queries overall.

This is what I have prototyped in this PR, starting with the ListFlags because it is the most 'difficult'.

Unfortunately, Go does not make it easy to work with JOINs using the stdlib, or even using Masterminds/squirrel, the SQL builder library we are using here. This is why I had to add the map in this PR to check whether we have 'seen' a flag before adding it to the results.

Proposal 1 is basically to continue this pattern throughout the entire storage layer for all parent->child relationships, so that we can do things the 'right way' and stop creating unnecessary queries and potential performance degradations/errors in highly concurrent scenarios.
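The row-folding described above can be sketched as a pure function over the joined rows; the flagRow, Flag, and Variant types below are made up for illustration and are not the actual Flipt/proto types:

```go
package main

import "fmt"

// Illustrative shapes only; the real code scans into the
// proto-generated Flag and Variant types.
type Variant struct{ Key string }

type Flag struct {
	Key      string
	Name     string
	Variants []Variant
}

// flagRow is one row of the flags LEFT JOIN variants result set:
// the flag columns repeat once per variant.
type flagRow struct {
	FlagKey, FlagName string
	VariantKey        string // empty when the flag has no variants
}

// groupFlags folds the joined rows back into flags, keeping a map of
// flag keys we have already 'seen' (the approach taken in this PR).
func groupFlags(rows []flagRow) []Flag {
	var (
		flags       []Flag
		uniqueFlags = make(map[string]int) // flag key -> index into flags
	)
	for _, r := range rows {
		i, seen := uniqueFlags[r.FlagKey]
		if !seen {
			flags = append(flags, Flag{Key: r.FlagKey, Name: r.FlagName})
			i = len(flags) - 1
			uniqueFlags[r.FlagKey] = i
		}
		if r.VariantKey != "" {
			flags[i].Variants = append(flags[i].Variants, Variant{Key: r.VariantKey})
		}
	}
	return flags
}

func main() {
	rows := []flagRow{
		{"beta", "Beta", "on"},
		{"beta", "Beta", "off"},
		{"dark-mode", "Dark Mode", ""}, // flag without variants
	}
	for _, f := range groupFlags(rows) {
		fmt.Println(f.Key, len(f.Variants)) // beta 2, dark-mode 0
	}
}
```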

2. Introduce a Data Model layer + use SQLx or SQLC

This option is basically introducing data models that map to our database instead of passing the proto models all the way down + introducing a library to help with this parent <-> child relationship, removing the need to do this mapping ourselves.

SQLx

SQLx supports parent-child relationships via its StructScan helper; it would just require us to define each of these data structs and add the appropriate db: struct tags to each field.

The benefits of this approach seem to me to be:

  1. Get us closer to where we want to be in FLI-40 (using proper domain models)
  2. Prevent the work required to manually handle these joins/relationships
  3. Potentially reduce bugs by depending on StructScan instead of scanning the row results into the data types ourselves

SQLC

SQLC is similar to the approach that SQLx would result in; the main difference is that it generates the data models and the query/mutation methods themselves! Ex: https://docs.sqlc.dev/en/latest/howto/select.html

The benefits of this approach seem to me to be:

  1. All of the benefits in the SQLx approach above
  2. Less code to write/maintain ourselves

The downside of SQLC however is that it does not have the same database compatibility that the Go SQL package/SQLx has.

Currently, it only supports:

  • MySQL
  • Postgres
  • SQLite (beta)

While these are the only DBs we need now, it could potentially limit us if we wanted to support another SQL db down the road (although I'm not sure if we'll ever want to actually do this).

Tl;dr

  • We should probably use JOINs instead of multiple queries to populate parent child data
  • There are a few options, so we should figure out which one we actually want to move to

Comment on lines 166 to 168
// append flag to output results if we haven't seen it yet
if _, ok := uniqueFlags[f.Key]; !ok {
flags = append(flags, f)
I believe you could actually do all of this without the map, given we ensure the returned rows are ordered by flag. This might be the natural order of the join; however, we could ensure it in the ORDER BY clause, e.g. ORDER BY f.created_at, f.key DESC.

Once we have that invariant, we can simply do the following steps:

  1. Compare the newly scanned flag key to the last seen flag key in the results so far.
  2. If the flag keys differ, append the newly scanned flag to the results.
  3. Append the currently scanned variant to the last flag seen in the results so far.
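The map-free version suggested above could look something like this sketch (types are illustrative, not the actual Flipt types):

```go
package main

import "fmt"

type Variant struct{ Key string }

type Flag struct {
	Key      string
	Variants []Variant
}

// flagRow is one joined row; rows are assumed ordered so that all
// variants of a flag are contiguous (e.g. ORDER BY f.created_at, f.key).
type flagRow struct {
	FlagKey    string
	VariantKey string // empty when the flag has no variants
}

// groupFlagsOrdered compares each scanned flag key against the last
// flag appended, instead of keeping a map of seen keys.
func groupFlagsOrdered(rows []flagRow) []Flag {
	var flags []Flag
	for _, r := range rows {
		if len(flags) == 0 || flags[len(flags)-1].Key != r.FlagKey {
			flags = append(flags, Flag{Key: r.FlagKey})
		}
		if r.VariantKey != "" {
			last := len(flags) - 1
			flags[last].Variants = append(flags[last].Variants, Variant{Key: r.VariantKey})
		}
	}
	return flags
}

func main() {
	rows := []flagRow{{"beta", "on"}, {"beta", "off"}, {"dark-mode", ""}}
	for _, f := range groupFlagsOrdered(rows) {
		fmt.Println(f.Key, len(f.Variants)) // beta 2, dark-mode 0
	}
}
```

Note this only works while the ordering invariant holds, which is why the PR's map approach is the more defensive default.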

@GeorgeMac (Contributor) commented Jan 30, 2023

There is another approach to consider. We're doing the classic N+1 but we could also do the Rails style preload.

Rails usually lets you pick between JOIN (eager_load) and the two queries style (preload).
Or includes which decides for you based on some heuristic I have forgotten.

The preload method just makes two queries.
The first is for the parent and the second uses WHERE fk IN (id1, id2, ...) to load the relation.

An ORM would likely just hide the implementation detail of this. Which might be favourable to be fair.
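A minimal sketch of building that second preload query (the table and column names are made up for illustration; this is not Flipt's actual schema):

```go
package main

import (
	"fmt"
	"strings"
)

// preloadQuery builds the second query of the Rails-style preload:
// one query for the parents has already run, and this one fetches all
// children whose foreign key is in the set of parent IDs.
func preloadQuery(flagKeys []string) (string, []any) {
	placeholders := make([]string, len(flagKeys))
	args := make([]any, len(flagKeys))
	for i, k := range flagKeys {
		placeholders[i] = "?"
		args[i] = k
	}
	query := "SELECT flag_key, key FROM variants WHERE flag_key IN (" +
		strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	q, args := preloadQuery([]string{"beta", "dark-mode"})
	fmt.Println(q)         // SELECT flag_key, key FROM variants WHERE flag_key IN (?, ?)
	fmt.Println(len(args)) // 2
}
```

Two round trips instead of one join, but each query stays simple and the rows never repeat parent columns.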

@codecov-commenter commented Jan 31, 2023

Codecov Report

Merging #1297 (56cf0fa) into main (ef32b3e) will increase coverage by 0.28%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
+ Coverage   80.09%   80.37%   +0.28%     
==========================================
  Files          43       43              
  Lines        3300     3307       +7     
==========================================
+ Hits         2643     2658      +15     
+ Misses        527      518       -9     
- Partials      130      131       +1     
Impacted Files Coverage Δ
internal/storage/sql/db.go 91.36% <100.00%> (+0.45%) ⬆️
internal/storage/sql/fields.go 39.47% <0.00%> (+21.05%) ⬆️


@markphelps markphelps changed the title proposal: use joins instead of n+1/opening multiple connections fix/feat: use joins instead of n+1/opening multiple connections Jan 31, 2023
@markphelps (Collaborator, Author) commented Jan 31, 2023

Update: we chose to go with option 1 (Joins The Old Fashioned Way) and will just continue doing the queries/result scanning by hand.

I extended the approach to the ListSegments and ListRules storage methods.

We are also now able to compare benchmarks added in #1307 (now on main) to those from this PR to see how using JOINS instead of multiple queries affects performance/memory allocations.

Here are the results:

Main Branch (no JOINS)

goos: linux
goarch: amd64
pkg: go.flipt.io/flipt/internal/storage/sql
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkListFlags/no-pagination-2         	      20	 289475937 ns/op	19884141 B/op	  677228 allocs/op
BenchmarkListFlags/pagination-limit-10-2   	    1681	   3557599 ns/op	  228084 B/op	    7661 allocs/op
BenchmarkListFlags/pagination-limit-25-2   	     763	   7836253 ns/op	  526125 B/op	   17818 allocs/op
BenchmarkListFlags/pagination-limit-100-2  	     204	  28974698 ns/op	 2016770 B/op	   68599 allocs/op
BenchmarkListFlags/pagination-limit-500-2  	      40	 145750363 ns/op	 9963442 B/op	  339415 allocs/op
BenchmarkListFlags/pagination-2            	      40	 147484753 ns/op	 9963565 B/op	  339424 allocs/op
BenchmarkListSegments/no-pagination-2      	       3	1891547016 ns/op	19284261 B/op	  697224 allocs/op
BenchmarkListSegments/pagination-limit-10-2         	     261	  23129548 ns/op	  221543 B/op	    7881 allocs/op
BenchmarkListSegments/pagination-limit-25-2         	     100	  53584245 ns/op	  510681 B/op	   18338 allocs/op
BenchmarkListSegments/pagination-limit-100-2        	      27	 199298428 ns/op	 1956430 B/op	   70620 allocs/op
BenchmarkListSegments/pagination-limit-500-2        	       6	 973445186 ns/op	 9662901 B/op	  349432 allocs/op
BenchmarkListSegments/pagination-2                  	       6	 974635037 ns/op	 9663716 B/op	  349444 allocs/op

This Branch (Joins)

goos: linux
goarch: amd64
pkg: go.flipt.io/flipt/internal/storage/sql
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkListFlags/no-pagination-2         	      51	 115100076 ns/op	17045878 B/op	  695298 allocs/op
BenchmarkListFlags/pagination-limit-10-2   	    2778	   2183352 ns/op	   29532 B/op	    1031 allocs/op
BenchmarkListFlags/pagination-limit-25-2   	    2511	   2382021 ns/op	   54766 B/op	    2075 allocs/op
BenchmarkListFlags/pagination-limit-100-2  	    1878	   3163501 ns/op	  181777 B/op	    7290 allocs/op
BenchmarkListFlags/pagination-limit-500-2  	     802	   7504184 ns/op	  859892 B/op	   35098 allocs/op
BenchmarkListFlags/pagination-2            	     190	  31245636 ns/op	  860305 B/op	   35108 allocs/op
BenchmarkListSegments/no-pagination-2      	      44	 132385529 ns/op	16565384 B/op	  725297 allocs/op
BenchmarkListSegments/pagination-limit-10-2         	     319	  18577732 ns/op	   29003 B/op	    1064 allocs/op
BenchmarkListSegments/pagination-limit-25-2         	     322	  18671660 ns/op	   53517 B/op	    2153 allocs/op
BenchmarkListSegments/pagination-limit-100-2        	     306	  19551521 ns/op	  176936 B/op	    7593 allocs/op
BenchmarkListSegments/pagination-limit-500-2        	     246	  24178912 ns/op	  835893 B/op	   36601 allocs/op
BenchmarkListSegments/pagination-2                  	     134	  44454415 ns/op	  836265 B/op	   36611 allocs/op

If I'm reading it correctly, this approach results in anywhere from a 2-10x speed improvement AND uses anywhere from half to a tenth of the memory allocations!!

It also fixes the SQLite contention issue, as we now set MaxOpenConns to 1 when running with SQLite.

Seems like a win/win..win.

@markphelps markphelps marked this pull request as ready for review January 31, 2023 19:25
@markphelps markphelps requested a review from a team as a code owner January 31, 2023 19:25
@markphelps (Collaborator, Author) commented:
Note: we also opted to continue using the map approach; we can potentially lessen the number of allocations in the future by going with the approach laid out by @GeorgeMac here: #1297 (comment).

Now that we have these benchmarks in place, it should be easy to see if we can eke out any more performance here.

We will also create an internal issue to look into SQLx or alternative approaches in the future.

@GeorgeMac (Contributor) left a review:

This is awesome. There is one suggested quality of life change to config.

internal/storage/sql/db.go (suggestion outdated, resolved)
@markphelps (Collaborator, Author) replied:
> This is awesome. There is one suggested quality of life change to config.

I moved the log warning to internal/cmd/grpc since we already have access to the logger there

@markphelps markphelps merged commit 6ded9f8 into main Feb 1, 2023
@markphelps markphelps deleted the sqlite-concurrency-fix branch February 1, 2023 15:57