
hacky support for Postgres schemas w/ runtime interpolation #3370

Draft · wants to merge 1 commit into main

Conversation

@bgentry commented May 9, 2024

Hi @kyleconroy and @andrewmbenton 👋 @brandur and I have recently been looking more into making River fully support isolation at the Postgres schema level (that is, allowing different River clients to be configured to use different Postgres schemas). Our initial hope was to be able to rely on the search_path connection param as recommended by pgx and #2635 for this purpose.

Unfortunately, one of our users quickly discovered that this approach is basically incompatible with pgbouncer. As far as I know, the only ways to make this work with pgbouncer and sqlc would be:

  1. Explicitly SET search_path prior to any query, within a transaction (see the sketch after this list). This is error-prone and carries a perf penalty in the form of added round trips.
  2. Configure pgbouncer to isolate client pools for specific schemas at the user level. This is a pain to configure and precludes mixing and matching multiple schemas within a single conn pool. Not great.
  3. Hypothetically, configure the database queries to explicitly specify the schema prefix on every query where it's required. I say hypothetically because this is of course not supported by sqlc today: the schema can't be parameterized, so it must be explicitly inserted into the SQL string.

I decided to spend a chunk of time last night seeing if I could hack a way to achieve option (3) in sqlc. It actually wasn't too difficult to achieve in a hacky way, though doing so did cement my dislike of Go's text/template 😆 This draft PR shows one potential path for achieving this using a new use_schema_placeholder option, which when enabled causes all queries to be passed through a runtime string replacement on the known prefix placeholder sqlc_schema_placeholder. (the trailing dot is part of the placeholder). It's definitely ugly and probably shouldn't be shipped in any form that resembles this PR, but I did want to at least raise it as a possible proof of concept.

Here's a corresponding River commit showing the queries that I updated to make use of the placeholder prefix. This sqlc PR does actually build, and it generates fully functional code for using arbitrary schemas specified at runtime, as shown in that commit.

I have lots of ideas for how this could be improved or done differently from the way I've done it here (caching, maybe making this customizable per-query à la Ecto instead of statically wired into Queries, etc.), but rather than spin my wheels on them, I thought I'd use this draft PR to start a discussion around whether and how this might be achieved in sqlc, given that the previously recommended approach is not actually viable for those using pgbouncer in the most common way.

I also want to highlight that this is fairly distinct from allowing arbitrary runtime SQL substitution, because a schema prefix can be strictly validated to rule out concerns like SQL injection.

Related issues:

@kyleconroy (Collaborator) commented:

@bgentry Thanks for kicking off the conversation. I agree with you that this is a hacky solution, but I think we can figure out a better one. To confirm: when writing your queries, the SQL looks like the following:

-- name: GetAuthor :one
SELECT * FROM sqlc_schema_placeholder.authors
WHERE id = $1 LIMIT 1;

Instead of this, I was thinking we could add support for dynamic identifiers. The queries would look something like this:

-- name: GetAuthor :one
SELECT * FROM "{schema}.authors"
WHERE id = $1 LIMIT 1;

This would add another parameter to the queries. The generated code would look like:

func GetAuthor(ctx context.Context, schema string, id int) {
    // ...
}

@bgentry (Author) commented May 9, 2024

@kyleconroy I think that would work perfectly well for what we need, and it would be more flexible too (able to support fully qualified my_db.my_schema.table_name identifiers as well). Given my limited knowledge of sqlc's internals, I'm not really sure how it could be implemented, but maybe you do? 😆

Do you have any thoughts on the query cache side of it? Obviously you wouldn't want to perform the same string substitution repeatedly hundreds of times per second. You could imagine a swappable implementation via an interface like:

type QueryCache interface {
    Substitute(query, placeholder, value string) string
}

A naive implementation could just replace it every single time, whereas a more advanced one could cache the results with a tunable size or an LRU setup.

@elee1766 commented May 9, 2024

> Do you have any thoughts on the query cache side of it?

I'm not him, but I do. I don't think a cache here is as good as you think.

> Obviously you wouldn't want to perform the same string substitution repeatedly hundreds of times per second.

I don't think this is obvious either. A string replacement is a simple O(n) copy operation that can be optimized in advance, since we know how many occurrences to replace and, in fact, already know where they are.

On the other hand, a cache/LRU lookup is itself an O(n) operation, since it needs to validate key equality; on top of that it requires hashing and extra time to shuffle buckets, and it won't necessarily save you any string copies either.

A simple benchmark with a naive strings.ReplaceAll shows this; I bet that some simple preprocessing would make replacement faster and more memory-efficient than the map. The speed is the same for small queries, and the hash table is slower than replacement for a large query, presumably because of the hash plus the compound key. Making it a nested map might reduce the cost of the compound key, but that would increase the runtime cost.

code:

package subbench_test

import (
	"strings"
	"testing"
)

func genQuery(replacement string, prefix, suffix int) string {
	return strings.Repeat("a", prefix) + replacement + strings.Repeat("b", suffix)
}

func BenchmarkReplaceSmallString(b *testing.B) {
	q := genQuery("{schema}", 30, 250)
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		strings.ReplaceAll(q, "{schema}", "public")
	}
}

func BenchmarkReplaceLargeString(b *testing.B) {
	q := genQuery("{schema}", 1000000, 1000000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		strings.ReplaceAll(q, "{schema}", "public")
	}
}

func BenchmarkCacheSmallString(b *testing.B) {
	q := genQuery("{schema}", 30, 250)
	ans := strings.ReplaceAll(q, "{schema}", "public")
	// compound key: query text + substitution value
	cache := map[string]string{q + "public": ans}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = cache[q+"public"]
	}
}

func BenchmarkCacheLargeString(b *testing.B) {
	q := genQuery("{schema}", 1000000, 1000000)
	ans := strings.ReplaceAll(q, "{schema}", "public")
	cache := map[string]string{q + "public": ans}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = cache[q+"public"]
	}
}

results:

goos: linux
goarch: amd64
pkg: benchmarks
cpu: 13th Gen Intel(R) Core(TM) i9-13900K
BenchmarkReplaceSmallString-32           9861722               194.3 ns/op           288 B/op          1 allocs/op
BenchmarkReplaceLargeString-32              3223            497422 ns/op         2008308 B/op          1 allocs/op
BenchmarkCacheSmallString-32             9496440               183.6 ns/op           320 B/op          1 allocs/op
BenchmarkCacheLargeString-32                3402            588169 ns/op         2009413 B/op          1 allocs/op
PASS

edit: upon some more thinking, it could perhaps be possible to create an efficient cache if one were initialized per query. That would remove the large cost of hashing the query, since the query is compiled into the code, and you would only hash your substitutions. The performance would still be debatable depending on query size and substitution size, though: the query for pgx has to be a string anyway, so for large queries that allocation will probably always be your bottleneck, while for small strings substitution is rather fast.

@bgentry (Author) commented Jun 21, 2024

Hey @kyleconroy, wanted to see if you've thought any more about how this might be implemented, or where it fits on your roadmap. As of now it's the only thing blocking River from having full support for multiple Postgres schemas, and while we really don't want to abandon sqlc, this is a pretty important feature that we'll need to support soon one way or another.

@elee1766 commented:
> Instead of this, I was thinking we could add support for dynamic identifiers. The queries would look something like this:
>
> -- name: GetAuthor :one
> SELECT * FROM "{schema}.authors"
> WHERE id = $1 LIMIT 1;
>
> This would add another parameter to the queries. The generated code would look like:
>
> func GetAuthor(ctx context.Context, schema string, id int) {
>     // ...
> }

@kyleconroy at this point, freeing up the thousands of sessions that River requires because of this limitation of sqlc would be a big difference for us. I'm glad to do all the work to implement this feature, but I want to confirm that this is still something you're interested in and would be okay with merging.
