feat(wal): reenables wal mode for litefs after resolution of issue #621

Merged · 6 commits · Feb 27, 2024

Conversation

@NGimbal (Contributor) commented Feb 20, 2024

It looks like the issue that was preventing the Epic Stack from using WAL mode has been resolved. This PR updates the version of LiteFS used to 0.5.11 and un-comments the lines that enable WAL mode.

Enables WAL mode for the local db per this comment. Not sure if this is correct. 👀

I started looking into this because the call to prisma.verification.upsert in the prepareVerification function started failing for me locally with the following error: "Invalid 'prisma.verification.upsert()' invocation: internal error: entered unreachable code". I started investigating WAL mode because of this GH issue.

Test Plan

I believe the existing tests cover this change.

Checklist

  • Tests updated
  • Docs updated

Screenshots

No visual changes

@kentcdodds (Member):

Have you verified that this issue no longer exists with the latest version of LiteFS? #598 (comment)

@kentcdodds (Member):

Also, thanks for working on this!

@NGimbal (Contributor, Author) commented Feb 20, 2024

> Have you verified that this issue no longer exists with the latest version of LiteFS? #598 (comment)

I actually hadn't looked at this issue yet 😅 I just read through it. I will go through the steps to reproduce the error with this branch and report back. Thank you for this amazing resource!

@NGimbal (Contributor, Author) commented Feb 20, 2024

@kentcdodds - I was able to deploy the template app, make a change (similar to the one you made in your tests), and redeploy without an issue.

Here is the successful GH action run: https://github.com/NGimbal/epic-stack/actions/runs/7981017813/job/21791913762
Here is the change: NGimbal@913a434
The deployment also worked in production: https://github.com/NGimbal/epic-stack/actions/runs/7981106705

Let me know if this confirms that this issue has been resolved or if further testing is needed.

@kentcdodds (Member) left a review:

Thanks for working on this! I just would like to check on a few things before merging.

@@ -16,6 +16,10 @@ export const prisma = remember('prisma', () => {
{ level: 'warn', emit: 'stdout' },
],
})

client.$queryRaw`PRAGMA journal_mode = WAL;`
Review comment (Member):

Can you help me understand why we're doing this? @benbjohnson, this will run as the first thing the prisma client does when it connects to the sqlite database. Do you see this as problematic or a reasonable workaround? Should we prevent this from running on read-only replicas?
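
For context, one way a statement like this could be limited to the primary node is to check LiteFS's .primary file, which exists only on replica nodes. Below is a minimal sketch in TypeScript; the LITEFS_DIR environment variable, the fallback path, and the helper name are assumptions for illustration, not necessarily how the Epic Stack is wired up.

import fs from 'node:fs'
import path from 'node:path'

// Sketch only. LiteFS writes a ".primary" file (containing the primary's
// hostname) into its mount directory on replica nodes, so its absence
// suggests this node currently holds the primary lease.
// LITEFS_DIR and the fallback path are assumptions.
export function isPrimaryInstance(
	litefsDir: string = process.env.LITEFS_DIR ?? '/litefs/data',
): boolean {
	return !fs.existsSync(path.join(litefsDir, '.primary'))
}

// Hypothetical usage: only run write statements such as PRAGMAs on the primary.
// if (isPrimaryInstance()) {
// 	await client.$queryRaw`PRAGMA journal_mode = WAL;`
// }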

Reply (Contributor Author):

Yeah, I'm not sure if this is the right thing to be doing. I added it because I figured we had to set the journal_mode in order to use WAL mode - I don't think it's on by default in LiteFS. The comment that I linked above is part of a thread where Prisma SQLite users are discussing WAL mode, and this seemed like a sensible option. I would definitely defer to @benbjohnson.

@NGimbal (Contributor, Author), Feb 21, 2024:

Ok, I realize that this may be dumb because the command in the Dockerfile already does this. I guess my question is really about the local database and the test database: how can we make sure those are configured like the production database?

I may be overthinking this. I will remove this line for now.

@NGimbal (Contributor, Author), Feb 21, 2024:

This makes more sense I believe: 4ef05f5

@@ -1,3 +1,5 @@
PRAGMA journal_mode = WAL;
Review comment (Member):

This does make more sense. But based on this, it looks like Prisma will acquire the exclusive lock whenever it connects: superfly/litefs#425 (comment)

I'm just not 100% certain this will work in the long term. And I'm not 100% certain I know how to check for this.

Calling my friends from Prisma and LiteFS!

@rtbenfield and @benbjohnson 🙏

Could you tell me whether this should work around this issue: superfly/litefs#425 (comment)

Reply (Contributor Author):

Yeah, this actually breaks the tests >.< It seems like for some reason the test db is not created properly after running prisma migrate reset --force --skip-seed --skip-generate here.

I'll wait for help from these folks before noodling on this further.

@NGimbal (Contributor, Author) commented Feb 21, 2024

Ok I know I said I would stop noodling... but I didn't.

Per the most recent comments, I realized that we really only need to set journal_mode for the local db and the test db, which led me to seed.ts and global-setup.ts as the most appropriate locations.

I can confirm that tests pass, it deploys in CI, and that the test database and local database do in fact have journal_mode=WAL. I will (for real 😅) leave this for now. Thanks for your patience 🙏.
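
As an illustration of that approach, here is a minimal sketch of how journal_mode could be set from a TypeScript setup file such as global-setup.ts. The helper name, the datasource wiring, and the use of PrismaClient's $queryRaw are assumptions for illustration, not the exact code in this PR.

import { PrismaClient } from '@prisma/client'

// Sketch only - SQLite persists journal_mode in the database file itself,
// so running this once after the file is created (e.g. right after
// `prisma migrate reset`) is enough for the local/test databases to use
// WAL like the production database does.
export async function enableWalMode(databaseUrl: string) {
	const prisma = new PrismaClient({
		datasources: { db: { url: databaseUrl } },
	})
	try {
		await prisma.$queryRaw`PRAGMA journal_mode = WAL;`
	} finally {
		await prisma.$disconnect()
	}
}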

@NGimbal (Contributor, Author) commented Feb 21, 2024

Ok, yeah, this might be a bad idea - I just tried creating a migration and re-deploying my app with these changes applied, and the deployment failed.

The logs showed this message:
2024-02-21T21:28:14.793 app[784e734a231618] ewr [info] level=INFO msg="fuse: write(): wal error: wal header: cannot write to WAL header without WRITE lock, exclusive locking not allowed"

Looks like the underlying issue might not be fixed yet.

Full logs:

2024-02-21T21:28:06.074 app[784e734a231618] ewr [info] level=INFO msg="wal-sync: short wal file exists on \"cache.db\", skipping sync with ltx"
2024-02-21T21:28:06.076 app[784e734a231618] ewr [info] level=INFO msg="wal-sync: short wal file exists on \"sqlite.db\", skipping sync with ltx"
2024-02-21T21:28:06.079 app[784e734a231618] ewr [info] level=INFO msg="using existing cluster id: \"LFSC628AAEC93495D628\""
2024-02-21T21:28:06.085 app[784e734a231618] ewr [info] level=INFO msg="LiteFS mounted to: /litefs/data"
2024-02-21T21:28:06.085 app[784e734a231618] ewr [info] level=INFO msg="http server listening on: http://localhost:20202"
2024-02-21T21:28:06.085 app[784e734a231618] ewr [info] level=INFO msg="waiting to connect to cluster"
2024-02-21T21:28:06.150 app[784e734a231618] ewr [info] level=INFO msg="923D5B0B11EC9704: primary lease acquired, advertising as http://784e734a231618.vm.build-plus-26bf-staging.internal:20202"
2024-02-21T21:28:06.161 app[784e734a231618] ewr [info] level=INFO msg="connected to cluster, ready"
2024-02-21T21:28:06.161 app[784e734a231618] ewr [info] level=INFO msg="node is a candidate, automatically promoting to primary"
2024-02-21T21:28:06.162 app[784e734a231618] ewr [info] level=INFO msg="node is already primary, skipping promotion"
2024-02-21T21:28:06.162 app[784e734a231618] ewr [info] level=INFO msg="proxy server listening on: http://localhost:8080"
2024-02-21T21:28:06.162 app[784e734a231618] ewr [info] level=INFO msg="executing command: node [./other/setup-swap.js]"
2024-02-21T21:28:06.418 app[784e734a231618] ewr [info] setting up swapfile...
2024-02-21T21:28:06.459 app[784e734a231618] ewr [info] swapfile setup complete
2024-02-21T21:28:06.463 app[784e734a231618] ewr [info] level=INFO msg="executing command: npx [prisma migrate deploy]"
2024-02-21T21:28:08.989 app[784e734a231618] ewr [info] npm WARN exec The following package was not found and will be installed: prisma@5.10.2
2024-02-21T21:28:11.205 app[784e734a231618] ewr [info] level=INFO msg="cannot set environment metadata" reason="retries exceeded" err="cannot set machine metadata: code=412"
2024-02-21T21:28:14.617 app[784e734a231618] ewr [info] Prisma schema loaded from prisma/schema.prisma
2024-02-21T21:28:14.640 app[784e734a231618] ewr [info] Datasource "db": SQLite database "sqlite.db" at "file:/litefs/data/sqlite.db"
2024-02-21T21:28:14.772 app[784e734a231618] ewr [info] 4 migrations found in prisma/migrations
2024-02-21T21:28:14.793 app[784e734a231618] ewr [info] level=INFO msg="fuse: write(): wal error: wal header: cannot write to WAL header without WRITE lock, exclusive locking not allowed"
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] Error: SQLite database error
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] disk I/O error
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] 0: schema_core::commands::apply_migrations::Applying migration
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] with migration_name="20240221211329_project_org_invites"
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] at schema-engine/core/src/commands/apply_migrations.rs:91
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] 1: schema_core::state::ApplyMigrations
2024-02-21T21:28:14.796 app[784e734a231618] ewr [info] at schema-engine/core/src/state.rs:201
2024-02-21T21:28:14.817 app[784e734a231618] ewr [info] npm notice
2024-02-21T21:28:14.818 app[784e734a231618] ewr [info] npm notice New minor version of npm available! 10.2.4 -> 10.4.0
2024-02-21T21:28:14.818 app[784e734a231618] ewr [info] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.4.0>
2024-02-21T21:28:14.818 app[784e734a231618] ewr [info] npm notice Run `npm install -g npm@10.4.0` to update!
2024-02-21T21:28:14.819 app[784e734a231618] ewr [info] npm notice
2024-02-21T21:28:14.852 app[784e734a231618] ewr [info] ERROR: cannot exec: sync cmd: cannot run command: exit status 1
2024-02-21T21:28:14.857 app[784e734a231618] ewr [info] level=INFO msg="923D5B0B11EC9704: exiting primary, destroying lease"
2024-02-21T21:28:14.963 app[784e734a231618] ewr [info] INFO Main child exited normally with code: 1
2024-02-21T21:28:14.965 app[784e734a231618] ewr [info] WARN Reaped child process with pid: 390 and signal: SIGTERM, core dumped? false
2024-02-21T21:28:14.966 app[784e734a231618] ewr [info] INFO Starting clean up.
2024-02-21T21:28:14.967 app[784e734a231618] ewr [info] INFO Umounting /dev/vdb from /data
2024-02-21T21:28:14.971 app[784e734a231618] ewr [info] WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
2024-02-21T21:28:14.983 app[784e734a231618] ewr [info] 2024/02/21 21:28:14 listening on [fdaa:5:80ea:a7b:15e:1c74:74ee:2]:22 (DNS: [fdaa::3]:53)
2024-02-21T21:28:15.971 app[784e734a231618] ewr [info] [ 10.329453] reboot: Restarting system
2024-02-21T21:28:17.040 runner[784e734a231618] ewr [info] machine did not have a restart policy, defaulting to restart
2024-02-21T21:28:52.233 app[784e734a231618] ewr [info] [ 0.037409] PCI: Fatal: No config space access function found
2024-02-21T21:28:52.453 app[784e734a231618] ewr [info] INFO Starting init (commit: 93e3fc8d)...
2024-02-21T21:28:52.469 app[784e734a231618] ewr [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-02-21T21:28:52.472 app[784e734a231618] ewr [info] INFO Resized /data to 1069547520 bytes
2024-02-21T21:28:52.473 app[784e734a231618] ewr [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2024-02-21T21:28:52.484 app[784e734a231618] ewr [info] INFO [fly api proxy] listening at /.fly/api
2024-02-21T21:28:52.493 app[784e734a231618] ewr [info] 2024/02/21 21:28:52 listening on [fdaa:5:80ea:a7b:15e:1c74:74ee:2]:22 (DNS: [fdaa::3]:53)
2024-02-21T21:28:52.536 runner[784e734a231618] ewr [info] Machine started in 504ms
2024-02-21T21:28:52.537 app[784e734a231618] ewr [info] config file read from /etc/litefs.yml
2024-02-21T21:28:52.537 app[784e734a231618] ewr [info] LiteFS v0.5.11, commit=63eab529dc3353e8d159e097ffc4caa7badb8cb3
2024-02-21T21:28:52.537 app[784e734a231618] ewr [info] level=INFO msg="host environment detected" type=fly.io
2024-02-21T21:28:52.537 app[784e734a231618] ewr [info] level=INFO msg="no backup client configured, skipping"
2024-02-21T21:28:52.537 app[784e734a231618] ewr [info] level=INFO msg="Using Consul to determine primary"
2024-02-21T21:28:52.623 app[784e734a231618] ewr [info] level=INFO msg="initializing consul: key=epic-stack-litefs/build-plus-26bf-staging url=https://:4746719b-ed0d-e6cc-ad29-6ca3c436ccea@consul-iad-11.fly-shared.net/build-plus-26bf-staging-yexkqwr5xve1m38d/ hostname=784e734a231618 advertise-url=http://784e734a231618.vm.build-plus-26bf-staging.internal:20202"
2024-02-21T21:28:52.626 app[784e734a231618] ewr [info] level=INFO msg="wal-sync: short wal file exists on \"cache.db\", skipping sync with ltx"
2024-02-21T21:28:52.629 app[784e734a231618] ewr [info] level=INFO msg="wal-sync: short wal file exists on \"sqlite.db\", skipping sync with ltx"
2024-02-21T21:28:52.633 app[784e734a231618] ewr [info] level=INFO msg="using existing cluster id: \"LFSC628AAEC93495D628\""
2024-02-21T21:28:52.639 app[784e734a231618] ewr [info] level=INFO msg="LiteFS mounted to: /litefs/data"
2024-02-21T21:28:52.639 app[784e734a231618] ewr [info] level=INFO msg="http server listening on: http://localhost:20202"
2024-02-21T21:28:52.639 app[784e734a231618] ewr [info] level=INFO msg="waiting to connect to cluster"
2024-02-21T21:28:52.691 app[784e734a231618] ewr [info] level=INFO msg="708EA8C184699DCC: primary lease acquired, advertising as http://784e734a231618.vm.build-plus-26bf-staging.internal:20202"
2024-02-21T21:28:52.702 app[784e734a231618] ewr [info] level=INFO msg="connected to cluster, ready"
2024-02-21T21:28:52.702 app[784e734a231618] ewr [info] level=INFO msg="node is a candidate, automatically promoting to primary"
2024-02-21T21:28:52.702 app[784e734a231618] ewr [info] level=INFO msg="node is already primary, skipping promotion"
2024-02-21T21:28:52.702 app[784e734a231618] ewr [info] level=INFO msg="proxy server listening on: http://localhost:8080"
2024-02-21T21:28:52.702 app[784e734a231618] ewr [info] level=INFO msg="executing command: node [./other/setup-swap.js]"
2024-02-21T21:28:52.979 app[784e734a231618] ewr [info] setting up swapfile...
2024-02-21T21:28:53.024 app[784e734a231618] ewr [info] swapfile setup complete
2024-02-21T21:28:53.030 app[784e734a231618] ewr [info] level=INFO msg="executing command: npx [prisma migrate deploy]"
2024-02-21T21:28:55.667 app[784e734a231618] ewr [info] npm WARN exec The following package was not found and will be installed: prisma@5.10.2
2024-02-21T21:28:57.748 app[784e734a231618] ewr [info] level=INFO msg="cannot set environment metadata" reason="retries exceeded" err="cannot set machine metadata: code=412"
2024-02-21T21:29:01.783 app[784e734a231618] ewr [info] Prisma schema loaded from prisma/schema.prisma
2024-02-21T21:29:01.811 app[784e734a231618] ewr [info] Datasource "db": SQLite database "sqlite.db" at "file:/litefs/data/sqlite.db"
2024-02-21T21:29:01.934 app[784e734a231618] ewr [info] 4 migrations found in prisma/migrations
2024-02-21T21:29:01.951 app[784e734a231618] ewr [info] level=INFO msg="fuse: write(): wal error: wal header: cannot write to WAL header without WRITE lock, exclusive locking not allowed"
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] Error: SQLite database error
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] disk I/O error
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] 0: schema_core::commands::apply_migrations::Applying migration
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] with migration_name="20240221211329_project_org_invites"
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] at schema-engine/core/src/commands/apply_migrations.rs:91
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] 1: schema_core::state::ApplyMigrations
2024-02-21T21:29:01.956 app[784e734a231618] ewr [info] at schema-engine/core/src/state.rs:201
2024-02-21T21:29:01.980 app[784e734a231618] ewr [info] npm notice
2024-02-21T21:29:01.980 app[784e734a231618] ewr [info] npm notice New minor version of npm available! 10.2.4 -> 10.4.0
2024-02-21T21:29:01.981 app[784e734a231618] ewr [info] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.4.0>
2024-02-21T21:29:01.981 app[784e734a231618] ewr [info] npm notice Run `npm install -g npm@10.4.0` to update!
2024-02-21T21:29:01.981 app[784e734a231618] ewr [info] npm notice
2024-02-21T21:29:02.007 app[784e734a231618] ewr [info] ERROR: cannot exec: sync cmd: cannot run command: exit status 1
2024-02-21T21:29:02.014 app[784e734a231618] ewr [info] level=INFO msg="708EA8C184699DCC: exiting primary, destroying lease"
2024-02-21T21:29:02.500 app[784e734a231618] ewr [info] INFO Main child exited normally with code: 1
2024-02-21T21:29:02.502 app[784e734a231618] ewr [info] WARN Reaped child process with pid: 389 and signal: SIGTERM, core dumped? false
2024-02-21T21:29:02.504 app[784e734a231618] ewr [info] INFO Starting clean up.
2024-02-21T21:29:02.504 app[784e734a231618] ewr [info] INFO Umounting /dev/vdb from /data
2024-02-21T21:29:02.509 app[784e734a231618] ewr [info] WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
2024-02-21T21:29:02.520 app[784e734a231618] ewr [info] 2024/02/21 21:29:02 listening on [fdaa:5:80ea:a7b:15e:1c74:74ee:2]:22 (DNS: [fdaa::3]:53)
2024-02-21T21:29:03.509 app[784e734a231618] ewr [info] [ 11.309680] reboot: Restarting system
2024-02-21T21:29:03.797 runner[784e734a231618] ewr [info] machine has reached its max restart count (10)

@benbjohnson commented Feb 22, 2024

Sorry for the slow reply. I think all you need to do to fix the WAL issue with Prisma is to disable advisory locking by setting an environment variable. It probably makes sense to set it in the [env] section of fly.toml:

[env]
PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK = "1"

SQLite is a single-writer database, so having the advisory lock put the database into exclusive locking mode seems like overkill. There shouldn't really be a downside to setting this.

@kentcdodds (Member) commented Feb 22, 2024

Very interesting! Thanks @benbjohnson! @NGimbal let me know when you've tried this out and then we can look into re-enabling WAL in the Epic Stack 🎉

@kentcdodds (Member) left a review:

I don't think we need to worry about WAL mode locally, so we really only need a single-line change in fly.toml.

@NGimbal force-pushed the pr/reenable-wal branch 2 times, most recently from 0e6486f to 1a2ffb0 on February 23, 2024 02:05
@NGimbal (Contributor, Author) commented Feb 23, 2024

Ok, so it looks like this works 🥳 Thanks @benbjohnson for the direction here!

Successful deploy after taking Ben's suggestion:
https://github.com/NGimbal/epic-stack/actions/runs/8013571034

Successful deploy after making a subsequent change:
https://github.com/NGimbal/epic-stack/actions/runs/8013627275

Successful deploy after a migration:
https://github.com/NGimbal/epic-stack/actions/runs/8013660785

Successful deploy after making another change... for good measure:
https://github.com/NGimbal/epic-stack/actions/runs/8013694713

Looks like it's working!

@kentcdodds (Member) left a review:

Super! Thank you so much!

Resolved review threads on fly.toml and other/litefs.yml (outdated).
@kentcdodds (Member) left a review:

Just one more thing

fly.toml (outdated)
@@ -15,6 +15,10 @@ destination = "/data"
[deploy]
release_command = "node ./other/sentry-create-release"

[env]
# For WAL support: https://github.com/prisma/prisma-engines/issues/4675#issuecomment-1914383246
PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK = "1"
Review comment (Member):

Actually, the more I think about this, the more I think this belongs in the Dockerfile, because that's where we set up LiteFS. It's less a Fly issue and more a LiteFS issue. Do you mind moving this (along with the comment)?

Reply (Contributor Author):

Just moved the env variable to the Dockerfile. One question: do we need to upgrade the LiteFS version from 0.5.8 to 0.5.11 as well?

Reply:

It's probably a good idea to upgrade to v0.5.11. It avoids issues with exclusive locking mode if it ends up being re-enabled for some reason.
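
Tying the last two points together, a hedged sketch of what the Dockerfile change might look like (the flyio/litefs image reference and the surrounding structure are assumptions, not necessarily the Epic Stack's actual Dockerfile):

# Sketch only - image tag and paths are assumptions.
# Pull the LiteFS binary from the upgraded release.
COPY --from=flyio/litefs:0.5.11 /usr/local/bin/litefs /usr/local/bin/litefs

# Disable Prisma's advisory lock so migrations work with LiteFS in WAL mode.
# https://github.com/prisma/prisma-engines/issues/4675#issuecomment-1914383246
ENV PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK="1"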

@NGimbal NGimbal requested a review from kentcdodds February 27, 2024 18:40
kentcdodds added a commit to kentcdodds/kentcdodds.com that referenced this pull request Feb 27, 2024
@kentcdodds (Member) left a review:

Just deployed to my own site. Let's do this thing!
