
fix/feat: use joins instead of n+1/opening multiple connections #1297

Merged 8 commits into main on Feb 1, 2023

Conversation

@markphelps (Collaborator) commented Jan 28, 2023

Overview

This is kind of a doozy so bear with me. 🐻

This came about while I was looking at our StackHawk scans in reference to FLI-177.

Basically, StackHawk issues a bunch of concurrent queries to probe our app for security concerns, SQL injections, etc.

When I was running this locally (using the SQLite DB for Flipt), I kept running into these 500 errors:

ERROR	finished unary call with code Internal	{"server": "grpc", "grpc.start_time": "2023-01-27T15:40:34-05:00", "system": "grpc", "span.kind": "server", "grpc.service": "flipt.Flipt", "grpc.method": "GetRule", "peer.address": "127.0.0.1:63802", "error": "rpc error: code = Internal desc = database is locked", "grpc.code": "Internal", "grpc.time_ms": 5191.929}
 database is locked

Being the relevant bit.

A trip to the DB

Looking into this further and doing some googling, I realized that this is mainly a problem with SQLite and opening multiple connections simultaneously. SQLite is really just a file, so it makes sense that only one connection should hold the lock on the file while querying.

This also led me to mattn/go-sqlite3#274, where they basically say the same thing and recommend:

  1. Always calling rows.Close() to return the connection back to the pool
  2. Setting db.SetMaxOpenConns(1) to enforce this constraint of a single connection
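As a compile-only sketch, the two recommendations together look roughly like this (the table, query, and function names are illustrative, not Flipt's actual storage code):

```go
package storage

import (
	"context"
	"database/sql"
)

// configureSQLite caps the pool at one open connection, per the
// go-sqlite3 advice above; db is an already-opened *sql.DB.
func configureSQLite(db *sql.DB) {
	db.SetMaxOpenConns(1)
}

// listFlagKeys shows the other half of the advice: always close the
// rows so the single connection is returned to the pool.
func listFlagKeys(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, "SELECT key FROM flags")
	if err != nil {
		return nil, err
	}
	defer rows.Close() // returns the connection to the pool

	var keys []string
	for rows.Next() {
		var k string
		if err := rows.Scan(&k); err != nil {
			return nil, err
		}
		keys = append(keys, k)
	}
	return keys, rows.Err()
}
```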

You can see the deadlock/timeout happening now in the unit tests since I followed this advice to set max open connections to 1:

 ?   	go.flipt.io/flipt/internal/storage/oplock/testing	[no test files]
coverage: 67.0% of statements
panic: test timed out after 10m0s

Going through our storage code

This led me to look at our storage (SQL) layer to ensure we were doing (1), which we are, but then I realized there are several cases where we are issuing multiple queries at once ourselves. Mainly when we try to populate a 1-many relationship such as flag HAS MANY variants, like here and here (GetFlag and ListFlags respectively).

Here and in several other places, we are issuing 1 query to get the parent (flag), then calling this method (variants) to issue a query to get all the children (variants) for that flag.

The key part is that we don't actually close the flag Row before issuing the second query to get all the variants, so we are basically creating a deadlock in the case of SQLite.

While we could work around this for GetFlag, by first getting the flag row, then closing the row, then issuing the query to get all variants, this approach won't work as easily for ListFlags as we are required to loop through each row to hydrate the flags.

Proposal(s)

1. Joins The Old Fashioned Way

The correct way to do this kind of thing in SQL is to just use JOINs and get all the flags and their variants at once. This has the benefit of not locking the db in our case and also results in fewer queries overall.

This is what I have prototyped in this PR, starting with the ListFlags because it is the most 'difficult'.

Unfortunately, Go does not make it easy to work with JOINs using the stdlib, or even using Masterminds/squirrel, the SQL builder library we are using here. This is why I had to add the map in this PR to check whether we have 'seen' a flag before adding it to the results.

Proposal 1 is basically to continue this pattern throughout the entire storage layer for all parent->child relationships, so that we can do things the 'right way' and stop creating unnecessary queries and potential performance degradations/errors in highly concurrent scenarios.
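The row-folding described above can be sketched as a pure function over the joined rows; the flagRow, Flag, and Variant types below are made up for illustration and are not the actual Flipt/proto types:

```go
package main

import "fmt"

// Illustrative shapes only; the real code scans into the
// proto-generated Flag and Variant types.
type Variant struct{ Key string }

type Flag struct {
	Key      string
	Name     string
	Variants []Variant
}

// flagRow is one row of the flags LEFT JOIN variants result set:
// the flag columns repeat once per variant.
type flagRow struct {
	FlagKey, FlagName string
	VariantKey        string // empty when the flag has no variants
}

// groupFlags folds the joined rows back into flags, keeping a map of
// flag keys we have already 'seen' (the approach taken in this PR).
func groupFlags(rows []flagRow) []Flag {
	var (
		flags       []Flag
		uniqueFlags = make(map[string]int) // flag key -> index into flags
	)
	for _, r := range rows {
		i, seen := uniqueFlags[r.FlagKey]
		if !seen {
			flags = append(flags, Flag{Key: r.FlagKey, Name: r.FlagName})
			i = len(flags) - 1
			uniqueFlags[r.FlagKey] = i
		}
		if r.VariantKey != "" {
			flags[i].Variants = append(flags[i].Variants, Variant{Key: r.VariantKey})
		}
	}
	return flags
}

func main() {
	rows := []flagRow{
		{"beta", "Beta", "on"},
		{"beta", "Beta", "off"},
		{"dark-mode", "Dark Mode", ""}, // flag without variants
	}
	for _, f := range groupFlags(rows) {
		fmt.Println(f.Key, len(f.Variants)) // beta 2, dark-mode 0
	}
}
```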

2. Introduce a Data Model layer + use SQLx or SQLC

This option is basically introducing data models that map to our database instead of passing the proto models all the way down + introducing a library to help with this parent <-> child relationship, removing the need to do this mapping ourselves.

SQLx

SQLx supports parent-child relationships via its StructScan helper; it would just require us to define each of these data structs and add the appropriate db: struct tags to each field.

The benefits of this approach seem to me to be:

  1. Get us closer to where we want to be in FLI-40 (using proper domain models)
  2. Prevent the work required to manually handle these joins/relationships
  3. Potentially reduce bugs by depending on StructScan instead of scanning the row results into the data types ourselves

SQLC

SQLC is similar to the approach that SQLx would result in; the main difference is that it generates the data models and the query/mutation methods themselves! Ex: https://docs.sqlc.dev/en/latest/howto/select.html

The benefits of this approach seem to me to be:

  1. All of the benefits in the SQLx approach above
  2. Less code to write/maintain ourselves

The downside of SQLC however is that it does not have the same database compatibility that the Go SQL package/SQLx has.

Currently, it only supports:

  • MySQL
  • Postgres
  • SQLite (beta)

While these are the only DBs we need now, it could potentially limit us if we wanted to support another SQL db down the road (although I'm not sure if we'll ever want to actually do this).

Tl;dr

  • We should probably use JOINs instead of multiple queries to populate parent child data
  • There are a few options, so we should figure out which one we actually want to move to

Comment on lines 166 to 168
// append flag to output results if we haven't seen it yet
if _, ok := uniqueFlags[f.Key]; !ok {
flags = append(flags, f)
I believe you could actually do all of this without the map, given we ensure the returned rows are ordered by flag. This might be the natural order of the join; however, we could ensure it in the ORDER BY clause, e.g. ORDER BY f.created_at, f.key DESC.

Once we have that invariant, we can simply do the following steps:

  1. Compare the newly scanned flag key to the last seen flag key in the results so far.
  2. If the flag keys differ, append the newly scanned flag to the results.
  3. Append the currently scanned variant to the last flag seen in the results so far.
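The map-free version suggested above could look something like this sketch (types are illustrative, not the actual Flipt types):

```go
package main

import "fmt"

type Variant struct{ Key string }

type Flag struct {
	Key      string
	Variants []Variant
}

// flagRow is one joined row; rows are assumed ordered so that all
// variants of a flag are contiguous (e.g. ORDER BY f.created_at, f.key).
type flagRow struct {
	FlagKey    string
	VariantKey string // empty when the flag has no variants
}

// groupFlagsOrdered compares each scanned flag key against the last
// flag appended, instead of keeping a map of seen keys.
func groupFlagsOrdered(rows []flagRow) []Flag {
	var flags []Flag
	for _, r := range rows {
		if len(flags) == 0 || flags[len(flags)-1].Key != r.FlagKey {
			flags = append(flags, Flag{Key: r.FlagKey})
		}
		if r.VariantKey != "" {
			last := len(flags) - 1
			flags[last].Variants = append(flags[last].Variants, Variant{Key: r.VariantKey})
		}
	}
	return flags
}

func main() {
	rows := []flagRow{{"beta", "on"}, {"beta", "off"}, {"dark-mode", ""}}
	for _, f := range groupFlagsOrdered(rows) {
		fmt.Println(f.Key, len(f.Variants)) // beta 2, dark-mode 0
	}
}
```

Note this only works while the ordering invariant holds, which is why the PR's map approach is the more defensive default.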

@GeorgeMac (Contributor) commented Jan 30, 2023

There is another approach to consider. We're doing the classic N+1 but we could also do the Rails style preload.

Rails usually lets you pick between JOIN (eager_load) and the two queries style (preload).
Or includes which decides for you based on some heuristic I have forgotten.

The preload method just makes two queries.
The first is for the parent and the second uses WHERE fk IN (id1, id2, ...) to load the relation.

An ORM would likely just hide the implementation detail of this. Which might be favourable to be fair.
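A minimal sketch of building that second preload query (the table and column names are made up for illustration; this is not Flipt's actual schema):

```go
package main

import (
	"fmt"
	"strings"
)

// preloadQuery builds the second query of the Rails-style preload:
// one query for the parents has already run, and this one fetches all
// children whose foreign key is in the set of parent IDs.
func preloadQuery(flagKeys []string) (string, []any) {
	placeholders := make([]string, len(flagKeys))
	args := make([]any, len(flagKeys))
	for i, k := range flagKeys {
		placeholders[i] = "?"
		args[i] = k
	}
	query := "SELECT flag_key, key FROM variants WHERE flag_key IN (" +
		strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	q, args := preloadQuery([]string{"beta", "dark-mode"})
	fmt.Println(q)         // SELECT flag_key, key FROM variants WHERE flag_key IN (?, ?)
	fmt.Println(len(args)) // 2
}
```

Two round trips instead of one join, but each query stays simple and the rows never repeat parent columns.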

@codecov-commenter commented Jan 31, 2023

Codecov Report

Merging #1297 (56cf0fa) into main (ef32b3e) will increase coverage by 0.28%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
+ Coverage   80.09%   80.37%   +0.28%     
==========================================
  Files          43       43              
  Lines        3300     3307       +7     
==========================================
+ Hits         2643     2658      +15     
+ Misses        527      518       -9     
- Partials      130      131       +1     
Impacted Files Coverage Δ
internal/storage/sql/db.go 91.36% <100.00%> (+0.45%) ⬆️
internal/storage/sql/fields.go 39.47% <0.00%> (+21.05%) ⬆️


@markphelps markphelps changed the title proposal: use joins instead of n+1/opening multiple connections fix/feat: use joins instead of n+1/opening multiple connections Jan 31, 2023
@markphelps (Collaborator, Author) commented Jan 31, 2023

Update: we chose to go with option 1 (Joins The Old Fashioned Way) and will just continue doing the queries/result scanning by hand.

I extended the approach to the ListSegments and ListRules storage methods.

We are also now able to compare benchmarks added in #1307 (now on main) to those from this PR to see how using JOINS instead of multiple queries affects performance/memory allocations.

Here are the results:

Main Branch (no JOINS)

goos: linux
goarch: amd64
pkg: go.flipt.io/flipt/internal/storage/sql
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkListFlags/no-pagination-2         	      20	 289475937 ns/op	19884141 B/op	  677228 allocs/op
BenchmarkListFlags/pagination-limit-10-2   	    1681	   3557599 ns/op	  228084 B/op	    7661 allocs/op
BenchmarkListFlags/pagination-limit-25-2   	     763	   7836253 ns/op	  526125 B/op	   17818 allocs/op
BenchmarkListFlags/pagination-limit-100-2  	     204	  28974698 ns/op	 2016770 B/op	   68599 allocs/op
BenchmarkListFlags/pagination-limit-500-2  	      40	 145750363 ns/op	 9963442 B/op	  339415 allocs/op
BenchmarkListFlags/pagination-2            	      40	 147484753 ns/op	 9963565 B/op	  339424 allocs/op
BenchmarkListSegments/no-pagination-2      	       3	1891547016 ns/op	19284261 B/op	  697224 allocs/op
BenchmarkListSegments/pagination-limit-10-2         	     261	  23129548 ns/op	  221543 B/op	    7881 allocs/op
BenchmarkListSegments/pagination-limit-25-2         	     100	  53584245 ns/op	  510681 B/op	   18338 allocs/op
BenchmarkListSegments/pagination-limit-100-2        	      27	 199298428 ns/op	 1956430 B/op	   70620 allocs/op
BenchmarkListSegments/pagination-limit-500-2        	       6	 973445186 ns/op	 9662901 B/op	  349432 allocs/op
BenchmarkListSegments/pagination-2                  	       6	 974635037 ns/op	 9663716 B/op	  349444 allocs/op

This Branch (Joins)

goos: linux
goarch: amd64
pkg: go.flipt.io/flipt/internal/storage/sql
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkListFlags/no-pagination-2         	      51	 115100076 ns/op	17045878 B/op	  695298 allocs/op
BenchmarkListFlags/pagination-limit-10-2   	    2778	   2183352 ns/op	   29532 B/op	    1031 allocs/op
BenchmarkListFlags/pagination-limit-25-2   	    2511	   2382021 ns/op	   54766 B/op	    2075 allocs/op
BenchmarkListFlags/pagination-limit-100-2  	    1878	   3163501 ns/op	  181777 B/op	    7290 allocs/op
BenchmarkListFlags/pagination-limit-500-2  	     802	   7504184 ns/op	  859892 B/op	   35098 allocs/op
BenchmarkListFlags/pagination-2            	     190	  31245636 ns/op	  860305 B/op	   35108 allocs/op
BenchmarkListSegments/no-pagination-2      	      44	 132385529 ns/op	16565384 B/op	  725297 allocs/op
BenchmarkListSegments/pagination-limit-10-2         	     319	  18577732 ns/op	   29003 B/op	    1064 allocs/op
BenchmarkListSegments/pagination-limit-25-2         	     322	  18671660 ns/op	   53517 B/op	    2153 allocs/op
BenchmarkListSegments/pagination-limit-100-2        	     306	  19551521 ns/op	  176936 B/op	    7593 allocs/op
BenchmarkListSegments/pagination-limit-500-2        	     246	  24178912 ns/op	  835893 B/op	   36601 allocs/op
BenchmarkListSegments/pagination-2                  	     134	  44454415 ns/op	  836265 B/op	   36611 allocs/op

If I'm reading it correctly, this approach results in anywhere from a 2-10x speed improvement AND uses anywhere from half to a tenth of the memory allocations!!

It also fixes the SQLite contention issue, as we now set MaxOpenConns to 1 when running with SQLite.

Seems like a win/win..win.

@markphelps markphelps marked this pull request as ready for review January 31, 2023 19:25
@markphelps markphelps requested a review from a team as a code owner January 31, 2023 19:25
@markphelps (Collaborator, Author) commented:
Note: we also opted to continue using the map approach; we can potentially lessen the number of allocations in the future by going with the approach laid out by @GeorgeMac here: #1297 (comment).

Now that we have these benchmarks in place, it should be easy to see if we can eke out any more performance here.

We will also create an internal issue to look into SQLx or alternative approaches in the future.

@GeorgeMac (Contributor) left a review:

This is awesome. There is one suggested quality of life change to config.

internal/storage/sql/db.go (suggestion outdated, resolved)
@markphelps (Collaborator, Author) replied:
> This is awesome. There is one suggested quality of life change to config.

I moved the log warning to internal/cmd/grpc since we already have access to the logger there

@markphelps markphelps merged commit 6ded9f8 into main Feb 1, 2023
@markphelps markphelps deleted the sqlite-concurrency-fix branch February 1, 2023 15:57