Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single binary server (SQLite) and concurrency bugfixes for in memory storages #2114

Merged
merged 18 commits into from
Feb 1, 2022

Conversation

wolfy-j
Copy link
Contributor

@wolfy-j wolfy-j commented Oct 29, 2021

  • Fixed the issues causing SQLite to deadlock when multiple writers perform an operation
  • Updated SQL tools and server build to include SQLite database
  • Added connection option to automatically setup in-memory/file database schema
  • Removed the dependency on SQL shared cache in order to make the system work (you can use private caches now)
  • Added the reference counter for GC (not used at the moment due to non-overlapping DB connections)

In short, you can use temporal now as a single binary file with in-memory or file SQLite database.

These changes properly enabled the support for SQLite in the temporal server. The initial implementation did not address the fact that internal Temporal services are highly isolated, each will create at least a single connection to the database violating the SQLite concept of safety only within a single thread. While the base configuration works on any minor load it might and it will lock the database due to concurrent writers from different parts of the system.

The clean solution will be to refactor the Temporal container in order to provide a single database instance to all the services and limit concurrency on the SQL driver level. However, such change will cause a significant impact and will require very careful testing of the server. Instead, the change has been addressed on the plugin level only, since it's directly responsible for providing the valid database instance. Ideally, we also have to separate the database from the matching service and use the most appropriate solution for each service.

The following configuration is valid to start Temporal without any external dependencies or databases:

log:
  stdout: true
  level: error

persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 1
  datastores:
    default:
      sql:
        user: ""
        password: ""
        pluginName: "sqlite"
        databaseName: "default"
        connectAddr: "localhost"
        connectProtocol: "tcp"
        connectAttributes:
          mode: "memory"
          cache: "private"
          setup: true
        maxConns: 1
        maxIdleConns: 1
        maxConnLifetime: "1h"
        tls:
          enabled: false
          сaFile: ""
          certFile: ""
          keyFile: ""
          enableHostVerification: false
          serverName: ""

    visibility:
      sql:
        user: ""
        password: ""
        pluginName: "sqlite"
        databaseName: "default"
        connectAddr: "localhost"
        connectProtocol: "tcp"
        connectAttributes:
          mode: "memory"
          cache: "private"
          setup: true
        maxConns: 1
        maxIdleConns: 1
        maxConnLifetime: "1h"
        tls:
          enabled: false
          сaFile: ""
          certFile: ""
          keyFile: ""
          enableHostVerification: false
          serverName: ""

services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnLocalHost: true
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "temporal"
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnLocalHost: true
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "temporal"
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnLocalHost: true
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "temporal"
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnLocalHost: true
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "temporal"

publicClient:
  hostPort: "localhost:7233"

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: "127.0.0.1"

clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "localhost:7233"

The temporal database will be fully configured in memory (see connectionAttribute setup) and ready to use within a few seconds. You still have to init the namespace.

This setup can not be used on production for obvious reasons but can be used for demos, tests, and micro applications which do not experience high loads.

Tested locally with multiple concurrent workflows (1000). Observed performance ~30 workflows per second (hello world with a single activity, on PHP SDK).

Does not affect production usage.

P.S. Please let me know what you want me to change in here.

@wolfy-j wolfy-j requested a review from a team October 29, 2021 13:57
Copy link
Member

@paulnpdev paulnpdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of nits

common/persistence/sql/sqlplugin/sqlite/conn_pool.go Outdated Show resolved Hide resolved
common/persistence/sql/sqlplugin/sqlite/conn_pool.go Outdated Show resolved Hide resolved
@wolfy-j wolfy-j requested a review from paulnpdev October 29, 2021 16:37
@wxing1292 wxing1292 self-assigned this Oct 30, 2021
Copy link
Member

@yiminc yiminc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @wolfy-j , thank you for your contribution.

refCount int
}

func newConnPool() *connPool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connection pool exists here due to frontend / matching / history / worker (4 services) each will create its own DB connection & causing DB locking issue?

can you plz put a comment here?
we can possibly improve the DI to initialize this DB connection only once (for single binary), then this conn pool management can possibly be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added. Yes, refactoring DI would be the ideal approach, but I'm not sure about the size of the impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also try to turn off cluster, gossip and other services irrelevant for single binary server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, some services open and close connections sequentially, getting DB connection stopped and destroying the database in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also try to turn off cluster, gossip and other services irrelevant for single binary server?

we definitely should (assuming we are talking about the ideal version), plus inter-service communication can by pass network layer.

Also, some services open and close connections sequentially, getting DB connection stopped and destroying the database in memory.

sorry what do you mean? if using DI, DI should only initialize the DB conn once (assuming 1 and only 1 conn)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use case won't exist with DI, but it exists now. We can probably take a look at it once it will be time to clean up all unnecessary services.

Copy link
Contributor

@wxing1292 wxing1292 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also create a new development_sqlite.yaml file with the static configs?

cmd/tools/sql/main.go Outdated Show resolved Hide resolved
@wxing1292
Copy link
Contributor

also run make goimports

@wolfy-j wolfy-j requested a review from wxing1292 October 31, 2021 09:32
@wolfy-j
Copy link
Contributor Author

wolfy-j commented Nov 2, 2021

I guess it's the question about turning CGO off or on now.

@@ -41,6 +41,7 @@ import (
"go.temporal.io/server/common/log/tag"
_ "go.temporal.io/server/common/persistence/sql/sqlplugin/mysql" // needed to load mysql plugin
_ "go.temporal.io/server/common/persistence/sql/sqlplugin/postgresql" // needed to load postgresql plugin
_ "go.temporal.io/server/common/persistence/sql/sqlplugin/sqlite" // needed to load sqlite plugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what breaks the build. Apparently w/o this line, sqlite package is not even compiled. With this line it gets compiled but compilation fails because we don't use CGO for our production Linux builds and

Important: because this is a CGO enabled package you are required to set the environment variable CGO_ENABLED=1 and have a gcc compile present within your path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to enable CGO and don't know what is the best way to solve this. The only one place where sqlite3 is used. Maybe we can avoid importing this package and use some other approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you already have a driver which utilizes the SQLite CGO package. The alternative is to use a different type of embedded database... or something like this - https://www.reddit.com/r/golang/comments/n1u2b1/new_advanced_cgofree_sqlite_package/

I agree about avoiding CGO, maybe create a different build pipeline? Or it's too complicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also turn off the plugin in server and SQL so DataDog can patch their TemporalLite server.

Copy link
Contributor

@jlegrone jlegrone Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI there was lots of discussion around CGO on the original PR. I did have a non-CGO version working at one point using modernc.org/sqlite, but we decided to just skip importing the driver by default: #1968 (comment)

I'd suggest doing the same here and follow up on supporting non-CGO builds or dealing with cross-compilation issues introduced by CGO separately.

Copy link
Contributor

@jlegrone jlegrone Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the mean time it would be pretty straightforward to add a file to the main package which uses a build tag to conditionally import the sqlite driver when CGO is enabled. That way anyone wanting to try out the sqlite driver just needs to compile from source rather than having to write their own Temporal server wrapper or pull in Temporalite.

@@ -112,9 +123,38 @@ func (p *plugin) createDBConnection(
// Maps struct names in CamelCase to snake without need for db struct tags.
db.MapperFunc(strcase.ToSnake)

// only suitable for in memory databases!
if _, ok := cfg.ConnectAttributes["setup"]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up we pass connection attributes more or less transparently into the sqlite driver here:

for k, v := range cfg.ConnectAttributes {
key := strings.TrimSpace(k)
value := strings.TrimSpace(v)
if parameters.Get(key) != "" {
return nil, fmt.Errorf("duplicate connection attr: %v:%v, %v:%v",
key,
parameters.Get(key),
key, value,
)
}
parameters.Set(key, value)
}

The current driver may not be picky about it but I think we should avoid passing in non-standard sqlite config options. How about checking if the driver is using memory mode instead? When persisting to a file we can use the same migration patterns we do for postgres/mysql/cassandra backends.

Suggested change
if _, ok := cfg.ConnectAttributes["setup"]; ok {
if mode := cfg.ConnectAttributes["mode"]; mode == "memory" {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen a notification about this question. Ugh. I want to leave the dedicated flag since it makes possible to run bootstrap on sqlite file database (which is handy). But now this flag will be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlegrone Also, I tried https://github.com/mattn/go-sqlite3 and I do not see a quick path to integrate it here without adding a few DB layer abstractions for sqlx compatibility. It might be easier and more effective to build a new DB type for in-memory use.

@alexshtin alexshtin force-pushed the feature/single-binary-server branch 2 times, most recently from 9bd25fc to 244df69 Compare January 21, 2022 01:20
Copy link
Member

@alexshtin alexshtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enabled cgo by default in #2396, rebased this PR and tried to run it locally:

$ make bins         
Delete old binaries...
Build temporal-server with CGO_ENABLED=1 for linux/amd64...
go build -o temporal-server ./cmd/server
Build tctl with CGO_ENABLED=1 for linux/amd64...
go build -o tctl ./cmd/tools/cli
Build tctl-authorization-plugin with CGO_ENABLED=1 for linux/amd64...
go build -o tctl-authorization-plugin ./cmd/tools/cli/plugins/authorization
Build temporal-cassandra-tool with CGO_ENABLED=1 for linux/amd64...
go build -o temporal-cassandra-tool ./cmd/tools/cassandra
Build temporal-sql-tool with CGO_ENABLED=1 for linux/amd64...
go build -o temporal-sql-tool ./cmd/tools/sql

$ ./temporal-server --zone sqlite start
2022/01/20 17:21:41 Loading config; env=development,zone=sqlite,configDir=config
2022/01/20 17:21:41 Loading config files=[config/base.yaml config/development.yaml config/development_sqlite.yaml]

and it just hangs. I didn't debug it further. Does it work for you locally?

// only suitable for in memory databases!
if mode := cfg.ConnectAttributes["mode"]; mode == "memory" {
// creates temporary DB overlay in order to configure database and schemas
err := p.setupSQLiteDatabase(cfg, db)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call now conflicts with explicit schema creation in sqlite_test.go. Where it should be removed: here or in tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why unit tests are failing on Buildkite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to handle this inline here to simplify usage of the plugin; I don't know of any cases where it wouldn't make sense to initialize the schema with in-memory mode enabled.

In Temporalite we initialize schema when in-memory mode is being used OR when the db file doesn't already exist. We'd be able to skip the setup for in-memory mode here: https://github.com/DataDog/temporalite/blob/de413afe117f4a5117d1148198622f0f54d5edc2/server.go#L55-L64

As an aside, for file persistence mode I think we should aim for consistency with how migrations work with other sql storage drivers, ie. support sqlite via https://github.com/temporalio/temporal/tree/master/tools/sql.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, removing it from tests.

connectAttributes:
mode: "memory"
cache: "private"
setup: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this setup parameter anymore, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's no longer needed.

@alexshtin alexshtin force-pushed the feature/single-binary-server branch from 2d8c7c0 to 00a0426 Compare January 30, 2022 07:31
@alexshtin
Copy link
Member

My bad, server actually was working. I am so get used to the log console output that we usually get, and I didn't notice that sqlite config has less verbose log level. I am changing the config to match to other SQL implementations. It might be noisy but I would like to keep in sync.

@alexshtin alexshtin merged commit 736fabf into temporalio:master Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants