
swingstore schema migration framework #8089

Open
Tracked by #8318 ...
warner opened this issue Jul 25, 2023 · 39 comments
Labels: enhancement (New feature or request), swing-store (SwingSet package: SwingSet)

Comments

warner (Member) commented Jul 25, 2023

What is the Problem Being Solved?

The swing-store SQL database will change over time, and we need each new version to be capable of using data created by the previous version. We've only had one release to our mainnet chain so far (the "bulldozer" release, which initialized a brand new DB, using the @agoric/swing-store package at version 0.9.1), so we haven't yet had to perform any kind of upgrade.

PR #8075 needs a schema change to accommodate the new "import record, then populate data" approach used by the rewritten state-sync importer. The change removes a CHECK constraint from the snapshots table:

       PRIMARY KEY (vatID, snapPos),
-      UNIQUE (vatID, inUse),
-      CHECK(compressedSnapshot is not null or inUse is null)
+      UNIQUE (vatID, inUse)

Conveniently, we don't actually need this schema change for existing databases, only for ones created by a state-sync import, so we don't need to build the schema migration tooling quite yet. When we roll out this change, we will wind up with two different schemas on mainnet nodes: some with the CHECK constraint, some without.

But, if we didn't get so lucky, we'd need a process to upgrade the schemas of existing databases. This ticket is about a design to accomplish that, for future updates.

Description of the Design

The core will be a new table named version, holding a single row whose version column (if present) contains an integer, starting from 1 and incremented for each new schema version. If this row is present, the integer will exactly define the schema in use.

Each time openSwingStore is told to open a pre-existing DB, our upgrade mechanism will look at this table to decide what changes need to be made (if any). We do not intend to handle downgrades, so openSwingStore will abort if it sees a version that is higher than what it can accommodate.

We must also handle the creation of a brand new DB (initSwingStore or importSwingStore), and the upgrade of a DB that was created before we established this mechanism.

Each given release of @agoric/swing-store will have a "current version" integer. We pretend that the 0.9.1 release used currentVersion = 1. If release N uses currentVersion = N, then databases created by release N will have a version table with N. Databases which start with an older version indicator will be upgraded to N upon first open, and after their first commit(), their version table will also have N, just as if they were created by the current release.
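A minimal sketch of that version-table bookkeeping (in Python with the stdlib sqlite3 module so the example is self-contained; the real swing-store is JavaScript, and these helper names are invented):

```python
import sqlite3

CURRENT_VERSION = 2  # hypothetical: pretend this release is version 2

def read_version(db):
    """Return the recorded schema version, or None if no row exists yet."""
    db.execute('CREATE TABLE IF NOT EXISTS version (version INTEGER)')
    row = db.execute('SELECT version FROM version').fetchone()
    return row[0] if row else None

def write_version(db, version):
    """Record the current version, replacing any previous row."""
    db.execute('DELETE FROM version')
    db.execute('INSERT INTO version VALUES (?)', (version,))

db = sqlite3.connect(':memory:')
assert read_version(db) is None  # empty table: pre-versioning (0.9.1-era) DB
write_version(db, CURRENT_VERSION)
assert read_version(db) == CURRENT_VERSION
```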

My plan is:

  • split out initialization functions for each component: bundleStore exports a new initBundleStore, in addition to the previous makeBundleStore, etc
    • the initialization functions take a db handle and an oldVersion argument
    • they are expected to create and/or modify any tables or indexes necessary for their component
    • the makeBundleStore function is then limited to building the API around a db handle, and does not modify the DB when first called
  • initSwingStore and openSwingStore continue to share a makeSwingStore function as before (I'd prefer to split them up, such that openSwingStore refuses to create a DB, but that would change the interface enough to be annoying)
  • makeSwingStore starts by determining whether the DB already exists or not, tracked in a variable named didExist
  • it then creates the version table if it did not already exist
    • if didExist === false, this table will be brand new and empty
    • if didExist === true but the DB was created by @agoric/swing-store 0.9.1, the table will also be brand new and empty
    • if didExist === true and the DB was created by the upcoming version of swing-store, the table will not be empty
  • then it extracts the version from the table and stores it in origVersion
  • now we set oldVersion to:
    • undefined if didExist === false, to mean "there are no tables yet: create them"
    • 1 if didExist === true and the version table was empty, to mean "we started with the implicitly-labelled version 1", as created by 0.9.1
    • origVersion otherwise, to mean "we started with this origVersion"
  • next, we call each of initBundleStore, initSnapStore, initKVStore, initTranscriptStore with db and oldVersion
    • each function should react to oldVersion === undefined by creating their tables, using the brand new schema
    • if instead oldVersion === currentVersion, the init function should do nothing
    • otherwise, the init code must perform schema updates to transform the given oldVersion into the schema for currentVersion
  • once all components are initialized, makeSwingStore should write currentVersion into the version table's single row
  • this change will not be committed until hostStorage.commit() is called, so all changes (including the upgrade) will land atomically, and only when the host is ready to commit
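The open-time version deduction described above might look like this sketch (Python stdlib sqlite3 for a self-contained illustration; deduce_old_version is an invented name):

```python
import os
import sqlite3
import tempfile

def deduce_old_version(db_path):
    """Return (db, oldVersion): None for a brand new DB, 1 for a
    pre-versioning (0.9.1-era) DB, else the recorded version."""
    did_exist = os.path.exists(db_path)
    db = sqlite3.connect(db_path)
    db.execute('CREATE TABLE IF NOT EXISTS version (version INTEGER)')
    row = db.execute('SELECT version FROM version').fetchone()
    if not did_exist:
        return db, None   # no tables yet: create them from scratch
    if row is None:
        return db, 1      # implicitly-labelled version 1
    return db, row[0]     # explicitly recorded version

path = os.path.join(tempfile.mkdtemp(), 'swingstore.sqlite')

db, old = deduce_old_version(path)
assert old is None        # didExist === false
db.commit(); db.close()

db, old = deduce_old_version(path)
assert old == 1           # existing DB, empty version table
db.execute('INSERT INTO version VALUES (2)')
db.commit(); db.close()

db, old = deduce_old_version(path)
assert old == 2           # existing DB with a recorded version
```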

The component initialization functions can look something like:

const CURRENT_VERSION = 3;
export function initBundleStore(db, oldVersion) {
  if (oldVersion === CURRENT_VERSION) {
    return;
  }
  // v1 schema: CREATE TABLE ...
  if (oldVersion === 1) {
    db.run('ALTER TABLE ..'); // delta v1->v2
    oldVersion = 2;
  }
  // v2 schema: CREATE TABLE ..
  if (oldVersion === 2) {
    db.run('ALTER TABLE ..'); // delta v2->v3
    oldVersion = 3;
  }

  if (!oldVersion) {
    db.run('CREATE TABLE ..'); // v3 schema
    oldVersion = 3;
  }
  assert.equal(oldVersion, CURRENT_VERSION);
}

Each time we want to make a change, we must increment CURRENT_VERSION, add a new delta, and modify the final if (!oldVersion) clause to make new DBs with the current schema. By adding a copy of the intermediate schemas in the comments, it's easier to verify that the deltas will do the right thing.

If a version does not modify a particular component (e.g. the #8075 change modifies the snapshots table but none of the others), that component's delta clauses should be empty except for the oldVersion increments.

cc @mhofman @FUDCo

Other Considerations

SQLite has a user_version for doing things like this (PRAGMA user_version = 2; sets it, PRAGMA user_version; reads it). Should we consider using it instead? Conclusion: no. Although setting it is controlled by transactions (changes do not land until COMMIT), the version does not appear in a sqlite3 foo.sqlite .dump command (file foo.sqlite knows how to report it, but otherwise you need to execute a PRAGMA to see it). This would hamper debuggability.
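For illustration, user_version behaves like this (Python stdlib sqlite3; iterdump() approximates the CLI's .dump output):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('PRAGMA user_version = 2')                # set the version
(v,) = db.execute('PRAGMA user_version').fetchone()  # read it back
assert v == 2

# The pragma value never appears in a dump of the database:
dump = '\n'.join(db.iterdump())
assert 'user_version' not in dump
```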

We could pass CURRENT_VERSION into each init function, instead of maintaining multiple copies of it. However, for each new version we would still need to add a migration delta clause to every component, I think.

Note that SQLite supports only limited alterations in the ALTER TABLE statement (table renames, plus column additions, deletions, and renames). To remove the CHECK constraint, we must actually copy the whole table to a temporary one, delete the original, and then rename the temporary into place. https://www.sqlite.org/lang_altertable.html has details.
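A sketch of that copy/rename dance for the snapshots table (Python stdlib sqlite3 for a self-contained illustration; the column list is abbreviated relative to the real table):

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Old schema, with the CHECK constraint we want to drop:
db.execute('''CREATE TABLE snapshots (
  vatID TEXT, snapPos INTEGER, inUse INTEGER, compressedSnapshot BLOB,
  PRIMARY KEY (vatID, snapPos),
  UNIQUE (vatID, inUse),
  CHECK (compressedSnapshot IS NOT NULL OR inUse IS NULL))''')
db.execute("INSERT INTO snapshots VALUES ('v1', 0, 1, x'00')")

# ALTER TABLE cannot drop a CHECK constraint, so: create, copy, drop, rename
db.executescript('''
CREATE TABLE snapshots_new (
  vatID TEXT, snapPos INTEGER, inUse INTEGER, compressedSnapshot BLOB,
  PRIMARY KEY (vatID, snapPos),
  UNIQUE (vatID, inUse));
INSERT INTO snapshots_new SELECT * FROM snapshots;
DROP TABLE snapshots;
ALTER TABLE snapshots_new RENAME TO snapshots;
''')

# An unpopulated inUse=1 record is now allowed:
db.execute("INSERT INTO snapshots VALUES ('v2', 0, 1, NULL)")
assert db.execute('SELECT COUNT(*) FROM snapshots').fetchone() == (2,)
```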

The new init functions will use CREATE TABLE commands, not CREATE TABLE IF NOT EXISTS, because that condition will already be handled by the oldVersion checking logic.

Security Considerations

None: schema migration can only happen when the swingstore is opened, which requires the same level of authority as creating the DB in the first place.

Scaling Considerations

Some upgrade operations might have to copy an entire table, which could take a non-trivial amount of time. However, it's absolutely necessary that these changes land in an atomic commit, so we can't defer them or do them in the background.

We should run tests to estimate how long this will take on the mainnet data set and advise validators about how long they should expect to wait.

If makeSwingStore() notices a difference between the oldVersion and CURRENT_VERSION, it should print a message about the DB upgrade starting, and another when it completes.

Test Plan

Unit tests which create deliberately old schemas and populate them with old data, then open the DB and check the subsequent contents and schemas to make sure they've been updated.
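Such a test might look like the following sketch (Python stdlib sqlite3 for a self-contained illustration; the bundles table, migrate_bundles_v1_to_v2, and the refCount column are all invented stand-ins for a real migration delta):

```python
import sqlite3

def migrate_bundles_v1_to_v2(db):
    """Hypothetical delta: add a column, standing in for a real change."""
    db.execute('ALTER TABLE bundles ADD COLUMN refCount INTEGER DEFAULT 0')

db = sqlite3.connect(':memory:')
# deliberately create the old (v1) schema and populate it with old data
db.execute('CREATE TABLE bundles (bundleID TEXT PRIMARY KEY, bundle BLOB)')
db.execute("INSERT INTO bundles VALUES ('b1-abc', x'01')")

migrate_bundles_v1_to_v2(db)

# the old data must survive, and the new column must have its default
row = db.execute('SELECT bundleID, refCount FROM bundles').fetchone()
assert row == ('b1-abc', 0)
```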

The docker-based chain upgrade test will cause a schema migration as a side-effect: we should add checks to that test to examine the new DB and make sure it looks right.

Upgrade Considerations

see above

Note that the #8075 hack means that the implicit "version 1" may or may not have the CHECK constraint, so this particular version of the schema has indeterminate contents. However the first "real" version (2) will have fixed contents, because the v1-to-v2 snapStore migration will copy the table to a new temporary one without the constraint, then delete/rename it back into place, in addition to whatever other v1-to-v2 changes we might add by the time we deploy this.

Note that we don't intend to support downgrades. We might want to back up the DB during chain-software upgrades, just in case, perhaps using the sqlite3 ~/.agoric/data/agoric/swingstore.sqlite ".backup ~/swingstore-backup.sqlite" CLI command, but we must then also advise validators how to delete the 6GB-ish backup file when it is no longer needed.
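For reference, the same backup can also be driven programmatically; this sketch uses Python's stdlib sqlite3.Connection.backup, which has the same effect as the CLI's ".backup" command (the table name here is invented):

```python
import sqlite3

src = sqlite3.connect(':memory:')
src.execute('CREATE TABLE kvStore (key TEXT PRIMARY KEY, value TEXT)')
src.execute("INSERT INTO kvStore VALUES ('k', 'v')")

dst = sqlite3.connect(':memory:')  # in practice, a file next to the original
src.backup(dst)                    # online backup, safe while src is open

assert dst.execute('SELECT value FROM kvStore').fetchone() == ('v',)
```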

If we make multiple swingstore changes between released versions, an upgrade might take several steps, not just one. As long as we consistently increment CURRENT_VERSION in each PR that changes schemas (including adding brand new tables), this should work fine.

warner (Member, Author) commented Jul 25, 2023

In talking with @toliaqat today, we settled on a safer pattern:

export function initBundleStore(db, version) {
  if (!version) {
    db.run('CREATE TABLE ..'); // v1 schema
    version = 1;
  }
  if (version === 1) {
    db.run('ALTER TABLE ..'); // delta v1->v2
    version = 2;
  }
  // v2 schema: CREATE TABLE ..
  if (version === 2) {
    db.run('ALTER TABLE ..'); // delta v2->v3
    version = 3;
  }
  // v3 schema: CREATE TABLE ..
  return version;
}

The swingStore.js code which calls this will look like:

  const currentVersion = deduceVersion(); // use 'didExist' and contents of 'version' table
  const targetVersion = 3;
  const bundleVersion = initBundleStore(db, currentVersion);
  assert.equal(bundleVersion, targetVersion);
  const kvVersion = initKVStore(db, currentVersion);
  assert.equal(kvVersion, targetVersion);
  // repeat for all components

By always starting with the original schema, we know it's always possible to migrate it to the current one. This keeps us honest, and removes the potential for a bug where newly-created DBs get a slightly different schema than old-and-migrated DBs. If targetVersion is incremented and the author forgets to provide a migration delta for some component, they'll get an immediate error. (We'll probably only be modifying one component at a time, so each new targetVersion will have empty deltas for all but one component).

On the downside, it adds overhead in the common case, where we're creating a brand new DB and want the current version. Every DB creation must replay the historical trail. Ontogeny recapitulates phylogeny.

The // v3 schema: comment we add as the penultimate line of each initBundleStore/etc will be the primary documentation of the modern schema. Another downside of this approach is that this comment is not as authoritative as an explicit create-from-scratch CREATE TABLE statement would be.

It would be great if there were a way to assert that the final DB, after all the alterations, matches a more declarative statement of the schema. SQLite has a meta-table (sqlite_master) which lists every table and index along with the SQL text used to create it. It might be interesting to query that table and do a text comparison, but 1: there are probably lots of trivial differences (whitespace), and 2: we'd have to remember to compare the rows for things like CREATE INDEX and any other objects touched by the migration deltas, not just the tables. So it's probably not worth doing.
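For the record, the whitespace problem is mechanically fixable; a sketch of what such a text comparison would look like (Python stdlib sqlite3; table name invented — sqlite_master stores the CREATE statement verbatim, so normalization is needed):

```python
import re
import sqlite3

def normalized_schema(db, name):
    """Fetch the stored CREATE statement and collapse whitespace."""
    (sql,) = db.execute(
        'SELECT sql FROM sqlite_master WHERE name = ?', (name,)).fetchone()
    return re.sub(r'\s+', ' ', sql).strip()

# a "migrated" DB whose CREATE statement has incidental formatting
migrated = sqlite3.connect(':memory:')
migrated.execute('CREATE TABLE bundles (bundleID TEXT,\n    bundle BLOB)')

# the declarative reference schema
declared = sqlite3.connect(':memory:')
declared.execute('CREATE TABLE bundles (bundleID TEXT, bundle BLOB)')

assert normalized_schema(migrated, 'bundles') == \
       normalized_schema(declared, 'bundles')
```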

warner assigned toliaqat and unassigned warner on Jul 25, 2023
mhofman (Member) commented Jul 25, 2023

The core will be a new table named version, with a single row named version, which (if present) will contain an integer starting from 1 and incremented for each new version. If this row is present, the integer will exactly define the schema in use.

Wondering if we may want to have a different version for the snapStore / transcriptStore and bundleStore.

Each time openSwingStore is told to open a pre-existing DB, our upgrade mechanism will look at this table to decide what changes need to be made (if any).

I am uncomfortable with implicit migrations on updates. Can't we assert the right version on open and require an explicit migration function to be run?

Separately, I have mixed feelings about requiring the migration steps to be run for a brand new DB. It requires us to keep historical migration functions in code forever, and potentially some logic associated with each version. On the other hand, I do agree it gives us more confidence that a new DB has the same shape as an upgraded DB.

warner (Member, Author) commented Jul 26, 2023

The core will be a new table named version, with a single row named version, which (if present) will contain an integer starting from 1 and incremented for each new version. If this row is present, the integer will exactly define the schema in use.

Wondering if we may want to have a different version for the snapStore / transcriptStore and bundleStore.

I guess that'd look like CREATE TABLE version (storeName STRING, version INTEGER);? With one version per store?

I'm dubious. On one hand, it introduces the question of what we should do if some stores have a version in there and some don't, or if some stores are out of date and others are not. I'm sure we'd never want to migrate one store and not the others.

It would allow us to avoid writing a bunch of empty upgrader clauses (if the 3-to-4 delta changes only bundleStore, then all the other stores would need dummy clauses, which might feel annoying).

I think it'd be the most clear to have fewer options: use a single version across the whole DB, so any code which ever needs to touch multiple tables at once (none yet, but maybe some day) can be confident that it has the latest of everything. And it lets us talk about the state of the DB much more simply (imagine the log message we emit during upgrade: is it just "upgrading swingstore DB from version 2 to 3", or "upgrading swingstore DB bundleStore component from version 2 to 3; snapStore is up-to-date at version 4; kvStore is up-to-date at version 1; transcriptStore is up-to-date at version 6").

Each time openSwingStore is told to open a pre-existing DB, our upgrade mechanism will look at this table to decide what changes need to be made (if any).

I am uncomfortable with implicit migrations on updates. Can't we assert the right version on open and require an explicit migration function to be run?

I'm not against that... is there a good place within the cosmic-swingset upgrade handler to run it? I'm not sure how to square that approach with the current "openSwingStore creates the DB if necessary" behavior. If we were strictly using initSwingStore to create DBs, then having openSwingStore strictly refuse to create or upgrade would make sense, but that's not how it works right now.

Separately, I have mixed feelings about requiring to run the migration steps for a brand new DB. It requires us to keep historical migration functions in code forever, and potentially some logic associated to that version. On the other hand, I do agree it give us more confidence a new DB has the same shape as an upgraded DB.

Yeah, we figured it would feel a bit weird, but it removes the path-dependency concerns pretty directly. No chances that some DBs get one schema and others get a different one, just because of where they started.

mhofman (Member) commented Jul 26, 2023

Can't we assert the right version on open and require an explicit migration function to be run?

I'm not against that.. is there a good place within the cosmic-swingset upgrade handler to run it? I'm not sure how to square that approach within the current "openSwingStore creates the DB if necessary" behavior. If we were strictly using initSwingStore to create DBs, then having openSwingStore strictly refuse to create or upgrade would make sense, but that's not how it works right now.

#8060 added an isBootstrap field to the AG_COSMOS_INIT messages, and #7994 previously added an upgradePlan field, which we can use to determine whether an init comes from an upgrade or not.

Combined we can:

  • create the swingstore only on bootstrap
  • call swing-store migrations only on upgrade

The main problem in this setup is if we attempt to do a data migration during a cosmos upgrade, like we're planning on doing now. The AG_COSMOS_INIT message right now is only sent after the snapshot operations for data migration are performed. In the future we may need a similar "data migration" to repopulate missing artifacts. It's possible that we may need to do a schema migration in the same upgrade, which should really happen before we repopulate missing artifacts. I think we might be able to force the AG_COSMOS_INIT message (and thus schema migration) before repopulation, but that's not wired today.

warner added a commit that referenced this issue Aug 8, 2023
Previously, the swingstore importer would ignore "historical
metadata": records (with hashes) for transcripts and heap snapshots
that are not strictly necessary to rebuild workers. This was a
mistake: our intention was to always preserve these hashes, so that we
might safely (with integrity) repopulate the corresponding data in the
future, using artifacts from untrusted sources.

This commit rewrites the importer to record *all* metadata records in
the first pass, regardless of whether we want historical data or
not. All of these records will be stubs: they contain hashes, but are
missing the actual bundle or snapshot or transcript items, as if they
had been pruned. Then, in the second pass, we populate those stubs
using the matching artifacts (or ignore the historical ones, as
configured by the `includeHistorical` option). A final
`assertComplete` pass insists that all the important (non-historical)
records are fully populated.

The exporter was updated to omit empty artifacts.

New tests were added to assert that metadata is preserved regardless
of import mode, and that the `assertComplete` pass really catches
everything. Also, we check that an import throws if given a mis-sized
artifact, like a transcript span that is missing a few items.

A new `docs/swingstore.md` was added to describe the data model,
including what it means for records to be pruned, and
`docs/data-export.md` was updated.

Note: this commit changes the schema of the `snapshots` table (to
support temporarily-unpopulated `inUse = 1` snapshot data). To be
precise, any swing-store created by this version (either via
`initSwingStore` or `importSwingStore`) will get the new schema:
pre-existing DBs opened with `openSwingStore` will continue to use the
old/strict schema. This is fine for now, but as the comments in
snapStore.js explain, we'll need to implement DB schema versioning and
upgrade (#8089) before we
can safely change any non-`importSwingStore` code to create
unpopulated `inUse=1` records.

fixes #8025
warner (Member, Author) commented Aug 14, 2023

Oh, also the schema upgrade might create new export-data records to be sent to a callback. So whatever drives the upgrade needs to provide options.exportCallback and react to it being called just like it would during normal operation.

I'm ok with having a separate upgradeSwingStore function that tolerates an out-of-date DB, and then making openSwingStore intolerant of an out-of-date DB.

So, I think this is driving us towards:

import { upgradeSwingStore, openSwingStore } from '@agoric/swing-store';

// might upgrade, if so it calls exportCallback() zero or more times and then does a commit
await upgradeSwingStore(dbDir, { exportCallback });

// maybe commit the export-data to host-app DB now

// now open the swingstore for real
const { hostStorage, kernelStorage } = openSwingStore(dbDir, { exportCallback, keepTranscripts: true });

mhofman pushed a commit that referenced this issue Aug 15, 2023
mhofman pushed a commit that referenced this issue Aug 15, 2023
anilhelvaci pushed a commit to anilhelvaci/agoric-sdk that referenced this issue Aug 16, 2023
mhofman pushed a commit that referenced this issue Aug 16, 2023
mhofman pushed a commit that referenced this issue Aug 16, 2023
warner (Member, Author) commented Sep 29, 2023

@mhofman hm, do we have enough cosmos/cosmic-swingset side knowledge to let me get away with properly splitting initSwingStore from openSwingStore now? That is, does the cosmic-swingset side code reliably know (without looking at the data directory) whether it's being asked to create the DB for the very first time, vs we're sure that the DB already exists?

If so, I think I'd want a swing-store API that can be used like this in the "create for the first time" case:

import { initSwingStore } from '@agoric/swing-store';

// creates up-to-date empty DB, commits initial state, returns nothing
await initSwingStore(dbDir);

// now open the swingstore, for initial use
const { hostStorage, kernelStorage } = openSwingStore(dbDir, { exportCallback, keepTranscripts: true });

and this for the it-already-exists case:

import { upgradeSwingStore, openSwingStore } from '@agoric/swing-store';

// might upgrade, if so it calls exportCallback() zero or more times and then does a commit
await upgradeSwingStore(dbDir, { exportCallback });

// now the host should maybe commit the export-data to host-app DB

// now open the swingstore for real
const { hostStorage, kernelStorage } = openSwingStore(dbDir, { exportCallback, keepTranscripts: true });

and ideally the create-for-first-time case doesn't even have the openSwingStore call, because it's called from a function that really only does initialization, and no actual runtime code (which comes later in that same process).

@mhofman
Member

mhofman commented Sep 30, 2023

do we have enough cosmos/cosmic-swingset side knowledge to let me get away with properly splitting initSwingStore from openSwingStore now? That is, does the cosmic-swingset side code reliably know (without looking at the data directory) whether it's being asked to create the DB for the very first time, vs we're sure that the DB already exists?

Yes! The cosmos side now informs cosmic-swingset during init whether the start is a bootstrap or not. The sim-chain seems to always start by bootstrapping.

I've been wanting to re-write cosmic-swingset's launchAndInitializeSwingSet to take advantage of this. Then we could plumb that into launch / buildSwingset so that it can stop relying on swingsetIsInitialized.

Regarding migrations, if necessary we might also be able to rely on the upgrade info that is now plumbed into cosmic-swingset as well.

@warner
Member Author

warner commented Oct 4, 2023

Ok, so the host app needs to call both upgradeSwingStore() and openSwingStore() on every non-initial startup. And because of the export-data, it also needs to commit the host-side shadow copy after upgradeSwingStore finishes.

That introduces a second critical window: if the application gets interrupted after upgradeSwingStore() has finished, but before the host commits the shadow copy, then the export-data records created during the upgrade will be lost. Swingstore won't remember them next time.

I don't really want to do this, but one fix would be for upgradeSwingStore to not commit, and return a { hostStorage, kernelStorage } pair, to give the host app control over the SQLite commit that records the upgrade changes. That doesn't really close the window: the upgrade-changes window gets merged with the normal-operations window. We currently deal with the latter through the hangover-prevention code, but I don't think the schema upgrade changes can be handled the same way.

Another not-great solution would be to record the upgrade-time export data in a table, retrieve it later, and delete it even later (after the host DB commits). Basically making the outputs of the process look more like outgoing comms messages, which get recorded/embargoed/retired in our usual pattern. That feels like more complexity than the problem deserves.

The export-data I'm thinking of would be like:

  • we introduce a separate table for c-lists, with a better schema, and the migration converts kvStore rows into clistStore rows
    • clistStore rows need integrity checking during import, so they must be shadowed into export-data just like kvStore rows
    • so each migration means deleting two kvStore rows, adding one clistStore row, and sending two DELETEs and an ADD to the exportCallback
  • actually, just the version table needs to be recorded in the export-data, so even if the schema upgrade only adds an INDEX (and doesn't change any data rows at all), we'd still need to emit a version record into the exportCallback
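The two-DELETEs-plus-one-ADD shape described above can be sketched as follows. This is a hedged illustration, not the real swing-store code: the key shapes (`${vatID}.c.${kref}` forward, `${vatID}.c.${vref}` reverse) follow the usual swingset c-list convention, but the `clist.` export-data prefix and the Map-based stores are stand-ins for the actual SQL tables.

```javascript
// Hypothetical sketch of a kvStore -> clistStore migration. Maps stand in
// for SQL tables; exportCallback receives [key, value] deltas, with null
// meaning DELETE, mirroring the swing-store export-data convention.
function migrateClists(kvStore, clistStore, exportCallback) {
  // snapshot the entries first, since we delete rows while iterating
  for (const [key, vref] of [...kvStore.entries()]) {
    const m = /^(v\d+)\.c\.(k[op]\d+)$/.exec(key); // forward rows only
    if (!m) continue;
    const [, vatID, kref] = m;
    const reverseKey = `${vatID}.c.${vref}`;
    // one new clistStore row replaces two kvStore rows
    clistStore.set(`${vatID}.${kref}`, vref);
    kvStore.delete(key);
    kvStore.delete(reverseKey);
    // shadow the change into export-data: two DELETEs and one ADD
    exportCallback([
      [`kv.${key}`, null],
      [`kv.${reverseKey}`, null],
      [`clist.${vatID}.${kref}`, vref],
    ]);
  }
}
```

The point of the sketch is that every data-moving migration step has a matching export-data delta, which is exactly why the exportCallback plumbing matters during upgrade.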

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

The more I ponder this, the less comfortable I am with the very idea of upgradeSwingStore. I'm thinking this is a blind alley.

@warner
Member Author

warner commented Oct 4, 2023

The more I ponder this, the less comfortable I am with the very idea of upgradeSwingStore. I'm thinking this is a blind alley.

Is the alternative to automatically perform the upgrades in openSwingStore? That would give us a channel to deliver the exportCallback records, and would defer committing the upgrades until the host called hostStorage.commit() later, so it'd meet the main requirements.

But I think it has the same issue with whether the exportCallback calls are subject to hangover inconsistency or not. In our current scheme, where cosmic-swingset is handling things itself, we need the host to record those export-data records into swingStore.hostStorage as they are emitted, and to replay them if/when we encounter the hangover (i.e. if app restart observes a swingstore which has advanced exactly one block more than the host's IAVL DB).

We need one recovery path for each case that we can handle. If we only have one swingstore commit (i.e. either upgradeSwingStore exists and does not commit, or upgrades happen automatically inside openSwingStore), then we have two cases:

  • host DB matches swingstore: no actions to take
  • host DB is one block behind swingstore: don't re-execute block, do play back saved messages
  • (all other mismatches cause a panic, we aren't a multi-version DB)
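The single-commit recovery cases above amount to a small decision function, sketched here under the assumption that both sides record a comparable block height (names are illustrative, not the actual cosmic-swingset code):

```javascript
// Hedged sketch of the single-commit recovery decision: host DB either
// matches swingstore, trails it by exactly one block (hangover), or we
// panic, since neither side is a multi-version DB.
function recoveryAction(hostHeight, storeHeight) {
  if (hostHeight === storeHeight) return 'none';
  if (hostHeight === storeHeight - 1) return 'replay-saved-messages';
  throw new Error(
    `panic: host at ${hostHeight}, swingstore at ${storeHeight}`,
  );
}
```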

If we have both a commit inside upgradeSwingStore and a hostStorage.commit() at end-of-block, then we have three handled cases:

  • host DB matches swingstore: no actions
  • swingstore has committed the upgrade changes, but host DB has not
  • swingstore has committed upgrade changes and block-execution changes, host DB has not (same as above)

The middle case is the hard one: we'd need a flag (more than just a single hostStorage.kvStore.set('blockHeight', NN)) to help discover that we're in it. And the export-data records have to go somewhere.

@mhofman mentioned today that all exportCallback records are expressed as chain sends, so they'll get recorded by the same pathway as e.g. chain-storage writes. That's good, I was worried we wouldn't replay them.

@warner
Member Author

warner commented Oct 4, 2023

And let's see, if upgradeSwingStore is a separate function and doesn't do a commit, then it needs to return something with a SQLite DB Connection, where the uncommitted changes will live. I think that means it needs to build and return { hostStorage, kernelStorage }, and that callers must not perform a separate openSwingStore (because either it would return an independent Connection, or it would have to accept a connection as an input, which just seems wrong). And that means it needs to take the same runtime options as openSwingStore, like keepTranscripts, because the caller is going to keep using that instance forever.

That would make for two invocation recipes. "first-time" is same as above:

import { initSwingStore } from '@agoric/swing-store';

// creates up-to-date empty DB, commits initial state, returns nothing
await initSwingStore(dbDir);

// now open the swingstore, for initial use
const { hostStorage, kernelStorage } = openSwingStore(dbDir, { exportCallback, keepTranscripts: true });

but "already-exists" changes:

import { upgradeSwingStore, openSwingStore } from '@agoric/swing-store';

// might upgrade, if so it calls exportCallback() zero or more times. Does not commit.
const { hostStorage, kernelStorage } = upgradeSwingStore(dbDir, { exportCallback, keepTranscripts: true });

// now the host can do block execution as usual

// when done, the host should commit everything

.. which means that upgradeSwingStore is more like "open but do upgrades first", which gets us back to using openSwingStore and having it do automatic upgrades when necessary.

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

That pair of recipes looks like a real tar baby. I don't think we should be orchestrating swingstore internals from outside the swingStore. Schema changes to the swingstore seem to me to fall into two buckets:

  • internal changes in how the swingstore represents stuff, done for reasons of efficiency or the swingstore's own convenience. These should have no visible outwards consequences at all from the perspective of the chain. They can be executed by an upgrade preamble in openSwingStore that opens the database, makes the necessary changes, and then immediately commits those changes, after which the swingstore can proceed with its normal operation under the assumption that the database matches the most up to date schema.
  • changes motivated by changes to or addition of consensus-visible kernel features, which should have a clear before/after boundary, likely corresponding to some kind of software upgrade, where prior to the boundary nothing is different and after it database operations may follow an altered pattern. In this case, the boundary itself necessarily coincides with some event where cosmic-swingset is shutdown and restarted, and then as part of the restart openSwingStore does its thing in a way that should be entirely in consensus, so no special magic should be required beyond what we already have to do to maintain consistency.

What am I missing here?

@mhofman
Member

mhofman commented Oct 4, 2023

My recommendation:

  • Keep a single explicit commit() at the end of the block, driven by the host. There should never be implicit commits made by SwingSet or swing-store (I would say the implicit swing-store creation also runs afoul of this principle).
  • Either have openSwingStore() implicitly perform migrations (and reflect any necessary "export data" changes into the existing callback), or have an explicit performMigrations function on the object returned by openSwingStore(). Given that upgrades shouldn't be optional and are always performed at restart boundaries, I don't think the latter is justified.
  • Consider having the swing-store expose a new feature that makes it easier for the host to efficiently implement the hangover inconsistency logic. This would be an ordered list of entries, each representing a command that was sent to cosmos, and the response that was received. The host can insert new entries (one at a time), get all entries, and clear all entries. We may take the opportunity to support multiple sets of entries, supporting replay for multiple blocks (to solve issues like fsync between cosmos and Swingset must be compatible #6736). Currently this is implemented as an in-memory array that is then JSON-serialized into a host key of the kv store before commit.
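The replay-entry feature in the last bullet has a small surface: append one entry at a time, read them all back after a restart, clear once the host DB has committed. A hedged in-memory sketch (the real thing would live in a swing-store SQL table, and all names here are illustrative):

```javascript
// Minimal sketch of the proposed replay log: an ordered list of
// host-to-cosmos sends, persisted by swing-store so the host can replay
// them if it wakes up one block behind.
function makeReplayLog() {
  const entries = []; // a durable, ordered SQL table in the real design
  return {
    append: entry => entries.push(entry),
    getAll: () => [...entries], // copy, so callers can't mutate the log
    clear: () => { entries.length = 0; },
  };
}
```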

@warner
Member Author

warner commented Oct 4, 2023

@FUDCo writes:

  • internal changes in how the swingstore represents stuff, done for reasons of efficiency or the swingstore's own convenience. These should have no visible outwards consequences at all from the perspective of the chain. They can be executed by an upgrade preamble in openSwingStore that opens the database, makes the necessary changes, and then immediately commits those changes, after which the swingstore can proceed with its normal operation under the assumption that the database matches the most up to date schema.

What am I missing here?

The issue is that swingstore export requires some integrity-protecting data (the "export-data") to be handed to the host for shadowing into authentic storage (e.g. hashes of transcript spans, but also every single kvStore entry). And some of these "internal changes" may modify the export-data, like adding/updating a version row. Kernel-visible API changes (like adding a new clistStore or queueStore component) would require more substantial changes, like moving a bunch of kvStore data into an entirely different table.

Which means while we're doing the upgrade, we're also emitting export-data rows (by invoking the exportCallback). The host needs to safely record the export-data, but the application might get killed after swingstore does its thing, but before the host gets a chance to commit.

If the internal changes get committed immediately, an interruption during that window might lose the export data, which would lead to invalid cosmos state-sync exports.

This is the same sort of hangover-inconsistency problem that could happen with normal changes, like vatstore writes during execution, it's just that we don't usually think of openSwingStore as a mutating API. There's already code in cosmic-swingset to detect when we wake up and swingstore remembers executing a block that cosmos (the host app) does not: cosmic-swingset records the outgoing messages during the previous execution, and "replays" them to the host app if/when this overhang happens. And export-data already goes through this pathway. So we just need to make sure the DB upgrades happen at a time when cosmic-swingset is ready for it.

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

My position is that internal swingstore schema changes by definition do not modify export data. In any case, we should rule out changes that modify historical export data, such that any externally visible consequences of a schema update are limited to things that happen after the update. This constrains our future options somewhat, but takes a large cohort of headaches off the table.

@mhofman
Member

mhofman commented Oct 4, 2023

In this case there's a universally understood switchover point that can be treated as a chain event in its own right and the commit associated with any related change-driven data migration would be the commit for that event.

Agreed, there is such a point. An upgrade is a consensus operation requiring using a new version of software. However it is still part of the block advancement process, constrained by our requirement to only have a single host controlled commit point per block, and that the software upgrade may require other work than performing the swing-store migration, work that may cause further changes to the state before the commit.

The remaining questions are:

  • Should such a switch over require an explicit "migration" to be triggered by the host, or should swing-store perform the migration automatically on open
  • Is the migration allowed to emit export data change events

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

  • Should such a switch over require an explicit "migration" to be triggered by the host, or should swing-store perform the migration automatically on open

No, this is none of the host's business.

  • Is the migration allowed to emit export data change events

I'm inclined to say no to this also. It's a constraint, but one that would simplify a lot.

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

  • Should such a switch over require an explicit "migration" to be triggered by the host, or should swing-store perform the migration automatically on open

No, this is none of the host's business.

I just realized I gave an either/or question a yes/no answer. I was saying no to the first alternative, with an implicit yes to the second.

@mhofman
Member

mhofman commented Oct 4, 2023

  • Is the migration allowed to emit export data change events

I'm inclined to say no to this also. It's a constraint, but one that would simplify a lot.

What would it simplify exactly? Given the other commit requirements, I don't believe it simplifies anything.

@FUDCo
Contributor

FUDCo commented Oct 4, 2023

  • Is the migration allowed to emit export data change events

I'm inclined to say no to this also. It's a constraint, but one that would simplify a lot.

What would it simplify exactly? Given the other commit requirements, I don't believe it simplifies anything.

It means you could go ahead and commit the schema changes with the rest of whatever happened in the block, as you wouldn't be exposing export events in some weird intermediate phase of execution. This would address @warner's concern about what happens if there's a failure after the schema update but before the block is committed.

@mhofman
Member

mhofman commented Oct 5, 2023

As long as the migration generates export events, there is no problem. I don't think there is any weird intermediate state. Either swing-store got committed at the end of the block, using a new schema, and containing a replay log of the cosmos sends (export events + other device related sends), or it didn't get committed and is still in the state it was previous to the migration.

The only requirement is that the migration does not cause an implicit commit. That is it.

@FUDCo
Contributor

FUDCo commented Oct 5, 2023

Ah, I see your point. I was conflating the concern about implicit intermediate commits with concern about endogenous export events, but you're right that those are entirely separate.

@warner
Member Author

warner commented Oct 5, 2023

Let's see if I can enumerate the requirements we impose upon consensus-hopeful host authors:

  • all instances of your application shall proceed in an orderly sequence of operations, we'll call them blocks, but you can call them whatever you like
    • each block will use a specific version of swingstore
    • each version of swingstore has a particular API, and obviously your kernel must match
    • no block will use an older version of swingstore than the previous
    • you (feed the kernel with some inputs which cause the kernel to) feed some set of inputs into swingstore in each block
    • all instances must use the same version and supply the same inputs for any given block
  • you can create a swingstore export at any block

And in turn, swingstore promises that:

  • the semantic contents of the swingstore will be the same in all instances (which are at the same block)
    • ("semantic contents" excludes local-only things like stats, so some SQL tables might be different)
  • the swingstore DB version will be the same in all instances
    • this is because all instances are required to use the same swingstore code versions in the same blocks, and because we'll automatically upgrade the DB at open()
  • the export contents will be a deterministic function of the inputs you've fed and swingstore version doing the export
  • a swingstore populated by importing from an export will have the same semantic contents as the original
  • exports produced by yesterday's version will be importable by tomorrow's version

That last point is open to debate. I think it's helpful, it means you can use a state-sync generated before the upgrade on software from after the upgrade, but.. maybe cosmos can't handle that, so there's no point in making swingstore handle it? If we offer that feature, then the importer needs to recognize a version in part of the export-data and basically create an old-format DB, then it can run the normal upgrade process immediately afterwards.

If we choose to not offer that feature, then we still need a version in the export-data, but the importer can just throw an error if it sees the wrong one. And our importer code can be simpler: it doesn't need to handle old versions. And we don't need to decide how far back to support. But, if a state-sync export is produced every morning at 9am, and at noon on Monday everybody upgrades to the new version, then a client who comes online Monday afternoon won't be able to use that state-sync. Either they'll need to 1: boot the old software, build the node from state-sync, let it run to the upgrade point, let it halt, switch to the new code, restart it, let it run to the current block, or 2: wait until Tuesday morning so a new-version state-sync snapshot is available.

@mhofman
Member

mhofman commented Oct 5, 2023

I think it's helpful, it means you can use a state-sync generated before the upgrade on software from after the upgrade, but.. maybe cosmos can't handle that, so there's no point in making swingstore handle it

Correct, no point in handling. You must use the right software.

I think we can put a stronger requirement:

  • swing-store implementation N+1 only has to be able to migrate swing-store version N. It does not have to offer a functional instance for that swing-store until the migration is complete.

In particular we don't need to be able to import, export or otherwise read / modify mismatched versions besides what is necessary for the migration.

@warner
Member Author

warner commented Oct 5, 2023

The implementation I want to pursue will enable upgrades all the way from version 0, since that will keep us honest about upgradability, and will avoid bugs where the upgraded DB could have a different schema than the created-current-version DB. So:

swing-store implementation N+1 only has to be able to migrate swing-store version N

We'll be able to do more than that, and we can safely perform multiple schema changes between subsequent chain-upgrade events (which would otherwise be a scheduling interdependency).

It does not have to offer a functional instance for that swing-store until the migration is complete.

Yep, openSwingStore(dbDir, { upgrade: true }) won't return a { hostStorage, kernelStorage } pair until the (uncommitted) DB state has been brought up to the current version.

And good, by not requiring the ability to import older versions, the importer code is simpler. When we land a swingstore change to implement e.g. version 3, then the changes will be to:

  • add a 2->3 upgrader for each store (all but one will probably be empty)
  • change the exporter to produce v3 exports
  • change the importer to demand a "v3" in the incoming data, and consume v3 exports
  • change the top-level swingStore.js code to use "3" as the upgrade-target and init-target version
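The walk-one-version-at-a-time design implied by those steps can be sketched as an upgrader registry. This is a hypothetical shape, not the actual swing-store internals; `CURRENT_VERSION`, `upgraders`, and the toy `db` object are all illustrative:

```javascript
// Hedged sketch: each registered upgrader moves the DB exactly one
// version forward, so a version-1 DB reaches version 3 through every
// intermediate step, exercising the same code a freshly-created DB would.
const CURRENT_VERSION = 3;
const upgraders = {
  1: db => { db.log.push('1->2'); /* e.g. add an INDEX */ },
  2: db => { db.log.push('2->3'); /* e.g. move c-list rows */ },
};
function upgradeToCurrent(db) {
  // nothing here commits: the host's later commit() records all steps
  while (db.version < CURRENT_VERSION) {
    upgraders[db.version](db);
    db.version += 1;
  }
}
```

This is the property warner argues for: multiple schema versions can land between chain upgrades, and whatever version a node wakes up with gets walked to the current one in a single open.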

@warner
Member Author

warner commented Oct 5, 2023

One other note: I'd originally expected our new, limited, once-only initSwingStore() function to create tables but not populate any rows. While discussing things with @toliaqat I realized that we need to populate a single version row during init, and the version row needs to be published as export-data, and that means we'll have some export-data created in a function that doesn't expose a way to deliver it (no exportCallback argument to this new initSwingStore()). But, looking at the implementation, we were smart and made the pendingExports into a DB table. So initSwingStore() can use noteExport as usual, and when we do the commit() at the end of init, we'll durably stash that version = N entry in the DB (the only INSERT visible in a sqlite3 .dump command). Then, when the host comes back around to openSwingStore(dbDir, { exportCallback }), they'll get the version = N export-data when flushPendingExports() runs during their first kernelStorage.endCrank().
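The noteExport/flushPendingExports pattern described above can be sketched like this. It is a hedged illustration with a Map standing in for the durable pendingExports table; the real swing-store functions have the same names but different plumbing:

```javascript
// Sketch of the pendingExports pattern: deltas are stashed in a durable
// table, so init can record the version row with no exportCallback in
// sight, and the first flush after openSwingStore delivers it.
function makeExportTracker() {
  const pendingExports = new Map(); // a real SQL table in swing-store
  return {
    noteExport: (key, value) => pendingExports.set(key, value),
    flushPendingExports: exportCallback => {
      const deltas = [...pendingExports.entries()];
      pendingExports.clear();
      if (deltas.length) exportCallback(deltas);
    },
  };
}
```

Usage follows the comment above: initSwingStore() calls noteExport('version', 'N') and commits; the entry survives until the host's openSwingStore + first endCrank flushes it to the exportCallback.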

@FUDCo
Contributor

FUDCo commented Oct 5, 2023

I still don't understand why you think the version row needs to be published as export data. Not only do I think it's unnecessary, I don't even see how it's possible. The export data describes the various stores that the swingstore encompasses, but the version table isn't part of any of them; it's purely internal.

@mhofman
Member

mhofman commented Oct 5, 2023

swing-store implementation N+1 only has to be able to migrate swing-store version N

We'll be able to do more than that, and we can safely perform multiple schema changes between subsequent chain-upgrade events (which would otherwise be a scheduling interdependency).

My concern about this is maintenance of that code. It requires us to keep a lot more logic around than is strictly necessary for operations. A chain software upgrade already requires new software, and chain upgrade handlers are not able to handle multiple versions anyway, so why should swing-store have that feature?

I don't even see how it's possible. The export data describes the various stores that the swingstore encompasses, but the version table isn't part of any of them; it's purely internal.

I think @warner explains how it's possible. "export data" content does not need to strictly align with the sub-stores, it can be anything we want. As such we could have a synthetic version entry.

That said, I'm wondering if it might not be better to expose a schema version on the import/export interface instead. It would simplify the logic handling the export data. If the importer is not able to handle that version, just throw right away. Btw, the import/export version does not have to match internal swing-store schema versioning. As @FUDCo mentions, not all internal schema changes would result in a different shape of export data (and we should avoid those export data changes as much as possible given the performance implications on the IAVL DB).

@warner
Member Author

warner commented Oct 5, 2023

We'll be able to do more than that, and we can safely perform multiple schema changes between subsequent chain-upgrade events (which would otherwise be a scheduling interdependency).

My concern about this is maintenance of that code. It requires us to keep a lot more logic around than is strictly necessary for operations. A chain software upgrade already requires new software, and chain upgrade handlers are not able to handle multiple versions anyway, so why should swing-store have that feature?

Imagine the following timeline:

  • December: we deploy upgrade-12 with an @agoric/swing-store that uses schema version v2.
  • December: we schedule upgrade-13 for February
  • January: we publish @agoric/swing-store with schema v3
  • January: we start work on v4, scheduled to be complete in March
  • February: upgrade-13 is delayed for other reasons
  • March: we finish v4
  • April: upgrade-13 is ready to go

If we can only upgrade one swingstore schema version at a time, then we must withhold the v4-capable version until after upgrade-13 is deployed, then land the dependency on v4, then schedule an upgrade-14, and hope that things don't slip a second time. That means useful features/bugfixes would be artificially delayed by external upgrade schedules, and landing the v4 dependency bump is a leap of faith that v3 will be deployed on time.

If swingstore can upgrade itself all the way, at each step, then we don't need to partition the swingstore upgrades across separate chain upgrades. Whatever version of swingstore/SwingSet is ready can be shipped at any time, and we can land the dependency bump as soon as its ready, entirely decoupled from any particular chain's deployment cadence.

It requires us to keep a lot more logic around than is strictly necessary for operations.

The "DB creation recapitulates DB upgrade" approach described in the initial design (which I know you've got reservations about) means we already have that logic available, and it gets exercised and (lightly) tested with every new DB creation. We might reduce the code size by deleting the older upgraders, but:

  • we'd have to add new code to create a new DB in the non-empty state
  • resulting in two separate code paths in the current repo
  • plus the code path from the historical repo, before the deletion
  • so any given deployment's DB might have been created by one of three paths, only two of which are still visible
  • and we impose constraints on users, who must insert a production deployment between each @agoric/swing-store version that requires a schema upgrade
    • which means we need to tell users which ones do an upgrade
    • which could+should be hidden from them

chain upgrade handlers are not able to handle multiple versions anyway

I don't think the chain upgrade handler will see swing-store schema upgrades, or at least it doesn't need to. A chain upgrade handler might be doing things like controller.validateAndInstallBundle(), or injecting some message to tell the kernel to do something unusual. But I don't think one would ever talk directly to the swing-store.

The kernel code must always be compatible with the swing-store package version that it uses (@agoric/swingset-vat makes kernelStorage API calls to @agoric/swing-store), and we don't currently have a good way to manage that, because the kernel is given a swingstore, created by the host. (Maybe we should add a peerDependency to SwingSet that declares the range of swingstore versions it can use?). But the swing-store package can be bumped without also incurring a schema upgrade.

I don't even see how it's possible. The export data describes the various stores that the swingstore encompasses, but the version table isn't part of any of them; it's purely internal.

I think @warner explains how it's possible. "export data" content does not need to strictly align with the sub-stores, it can be anything we want. As such we could have a synthetic version entry.

Yeah, that was my plan, to annotate the export with the schema version it was sourced from, with an export-data key named version or something. The importer ingests the export-data as usual, but when it reaches the version key, it might throw an incompatibility error.

(and we'd probably define the extra key as an "export version", to decouple it from the schema version: I can imagine allowing multiple closely-related schema versions to all get exported into the same export version)

It'd be nicer if the importer could somehow read version first, before seeing anything else, because we can imagine an export-format difference where v3 produces export-data keys that aren't recognized by a v2 importer, and the user experience would be an "I don't recognize this key, bailing" error, instead of a "this dataset is newer than I can handle" error. But we can always insert the latter text into the unrecognized-key message.
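The in-band check described above can be sketched as follows. This is a hypothetical shape: the `version` key, the `SUPPORTED_VERSION` constant, and the error text are illustrative, not the real swing-store importer:

```javascript
// Hedged sketch: the importer ingests export-data as usual, and when it
// reaches the synthetic `version` key it either accepts it or produces
// the friendlier "newer than I can handle" error in place of a generic
// unrecognized-key failure.
const SUPPORTED_VERSION = 2; // illustrative
function ingestExportData(entries) {
  const data = new Map();
  for (const [key, value] of entries) {
    if (key === 'version') {
      if (Number(value) > SUPPORTED_VERSION) {
        throw new Error(
          `this dataset is version ${value}, newer than this importer ` +
          `can handle (max ${SUPPORTED_VERSION})`,
        );
      }
      continue; // compatible: nothing else to do with it
    }
    data.set(key, value);
  }
  return data;
}
```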

That said, I'm wondering if it might not be better to expose a schema version on the import/export interface instead. It would simplify the logic handling the export data.

Yeah, but it feels like a lot of churn. We defined a "SwingStore Export" to mean a data set with two parts: export-data and artifacts, with one sequencing constraint (import ingests export-data first, then fetches artifacts).

This would require three parts (version, export-data, artifacts), and have two sequencing constraints (import ingests version, then export-data, then artifacts). Let's see, the SwingSetExporter type would acquire a getExportVersion() method, the cosmic-swingset host would be obligated to grab it on every block and store it somewhere in IAVL (in a key that won't be confused with export-data). The cosmic-swingset-side state-sync importer would be obligated to parse it out of IAVL and build a SwingSetExporter that supplies it to the swing-store importer. The swing-store importer queries getExportVersion() before doing anything else, and bails if it doesn't match.

Ok, the SwingSetExporter part of that doesn't sound too bad. I don't know how the IAVL keys are currently encoded, is there room in the namespace for this? @mhofman maybe you could sketch out where the new not-export-data key would be stored.

If we do it this way, it will require matching changes on the cosmic-swingset side, whereas if we encode it into export-data, cosmic-swingset doesn't need to change. That's the main cost. The benefit is a cleaner design.
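The out-of-band alternative sketched above would look roughly like this. `getExportVersion()` is the hypothetical new method under discussion (getExportData and getArtifactNames are real exporter methods, shown synchronously here for brevity; the real interface is async):

```javascript
// Hedged sketch: query the hypothetical getExportVersion() before
// touching any data, so a mismatch fails with a clean error before the
// importer has consumed a single export-data entry or artifact.
function importFromExporter(exporter, supportedVersion) {
  const version = exporter.getExportVersion(); // hypothetical method
  if (version !== supportedVersion) {
    throw new Error(`unsupported export version ${version}`);
  }
  const data = new Map(exporter.getExportData());
  const artifacts = [...exporter.getArtifactNames()];
  return { version, data, artifacts };
}
```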

@mhofman
Member

mhofman commented Oct 5, 2023

chain upgrade handlers are not able to handle multiple versions anyway

I don't think the chain upgrade handler will see swing-store schema upgrades, or at least it doesn't need to.

If swingstore can upgrade itself all the way, at each step, then we don't need to partition the swingstore upgrades across separate chain upgrades. Whatever version of swingstore/SwingSet is ready can be shipped at any time, and we can land the dependency bump as soon as its ready, entirely decoupled from any particular chain's deployment cadence.

Correct, that would put a requirement that there is at most a single migration per chain upgrade, or at least for those cases, that the migration logic from swing-store is able to handle a multi-version jump. I agree it does put an awkward dependency from "main chain upgrade schedule" onto the swing-store implementation, which is not optimal.

The importer ingests the export-data as usual, but when it reaches the version key, it might throw an incompatibility error.

The problem is that it may not be able to understand what to do with this data until it reaches the version info, putting undue complications onto the implementation.

Let's see, the SwingSetExporter type would acquire a getExportVersion() method, the cosmic-swingset host would be obligated to grab it on every block and store it somewhere in IAVL (in a key that won't be confused with export-data).

No we can just store the version in a special state-sync payload, no need to store it in the IAVL tree.

The swing-store importer queries getExportVersion() before doing anything else, and bails if it doesn't match.

Correct!

If we do it this way, it will require matching changes on the cosmic-swingset side, whereas if we encode it into export-data, cosmic-swingset doesn't need to change. That's the main cost.

I don't mind that cost. It will complicate a little how state-sync payloads are handled, but we knew we'd get there some time. Cosmos already has versioning in place to handle these situations.

@FUDCo
Contributor

FUDCo commented Oct 5, 2023

I'm not following the discussion of the "per" relationship between swingstore migrations and chain upgrades here. If we have multiple changes to the swingstore that are landed in between two chain upgrades, then from the chain's perspective they collectively just look like one big migration. Internal to swingstore this might be accomplished by multiple version steps, each with its own logic, but I don't think that changes the view as seen by the chain. (As before, I still don't see what business it is of the chain's what the swingstore schema version number is, but I don't think that's actually material to this particular question.)

@warner
Member Author

warner commented Oct 5, 2023

The problem is that it may not be able to understand what to do with this data until it reaches the version info, putting undue complications onto the implementation.

Hm.. I agree that it might not understand what to do, but I don't think that matters, or would require us to put more work/complication into the importer implementation.

By deciding to reject old versions, we've allowed the importer to omit code that handles old versions correctly. So each export-data entry shows up, and the importer assumes it is current-format and interprets it as such. That will put the correct data into the DB if it was indeed current-format, and it will either throw an error or do something wrong if it was not. When the version entry appears, the importer either throws or is satisfied. At the end of the process, we assert that a valid version entry did appear at some point.

The only way to make it successfully to the end of the export-data stream is for the version entry to be correct. So it doesn't matter how badly the importer misinterprets wrong-version data, because we'll throw eventually, and that will abort the entire commit. So we don't need any complications in the implementation, except maybe to wrap a catch around the whole import and add "... and this error might be due to an old version" text to the message, for ergonomics.
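A sketch of that control flow (the 'exportVersion' key name and the version string are hypothetical placeholders, not a settled encoding):

```typescript
// Sketch of the scheme above: every entry is interpreted as
// current-format, and a synthetic version entry must appear somewhere
// in the stream with the right value, or the import throws and the
// whole commit is aborted.
const SUPPORTED_VERSION = '2'; // hypothetical current version

function ingestExportData(
  entries: Iterable<[key: string, value: string]>,
): Map<string, string> {
  const db = new Map<string, string>();
  let sawVersion = false;
  for (const [key, value] of entries) {
    if (key === 'exportVersion') {
      if (value !== SUPPORTED_VERSION) {
        throw new Error(`incompatible export version ${value}`);
      }
      sawVersion = true;
      continue;
    }
    // Assume current format; wrong-version data either throws here or
    // writes garbage that the end-of-stream assertion keeps from committing.
    db.set(key, value);
  }
  if (!sawVersion) {
    throw new Error('no export-version entry appeared; aborting import');
  }
  return db;
}
```

Note that the version entry may appear at any position in the stream; correctness only depends on the final assertion, not on ordering.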

Let's see, the SwingSetExporter type would acquire a getExportVersion() method, the cosmic-swingset host would be obligated to grab it on every block and store it somewhere in IAVL (in a key that won't be confused with export-data).

No we can just store the version in a special state-sync payload, no need to store it in the IAVL tree.

What happens if the provider of your state-sync payload lies about the version in use? How does the importer validate the claimed version string?

I think an attack would be shaped like:

  • an export happens at block 10 (using export-version v1)
  • the chain upgrades at block 11 (switching to v2)
  • someone doctors the b10 payload to claim it offers v2
  • the importer starts at block 12

The importer would process the payload according to v2 rules even though it wasn't. Any difference in the interpretation of export data would be available for the attacker to exploit.

Hm, but the payload itself is untrusted. The cosmos-side importer populates the IAVL tree, then hashes it, then compares the root against the chain. If that succeeds, it feeds the export-data (all of which is now known-good) into a swingstore importer. The importer then receives artifacts and compares them against the export-data. The cosmos side doesn't know how to interpret the export-data, it just writes into IAVL and then reads a subset out to the swingstore importer.

The "version in a special state-sync payload" would mean some field in the payload provides the claimed swing-store export-version. The cosmos-side importer would have to parse this from the payload, remember it outside of IAVL, and then provide it (unverified) to the swingstore importer, via swingStoreExporter.getExportVersion(). The swingstore has nothing to verify this against, so it would trust getExportVersion() just as deeply as it trusts getExportData(), despite the cosmos side not being able to do any verification.

Nope, that's not ok. Version confusions are security bugs. We need a way to make it verified.

@warner
Member Author

warner commented Oct 5, 2023

@FUDCo writes:

I'm not following the discussion of the "per" relationship between swingstore migrations and chain upgrades here. If we have multiple changes to the swingstore that are landed in between two chain upgrades, then from the chain's perspective they collectively just look like one big migration. Internal to swingstore this might be accomplished by multiple version steps, each with its own logic, but I don't think that changes the view as seen by the chain. (As before, I still don't see what business it is of the chain's what the swingstore schema version number is, but I don't think that's actually material to this particular question.)

Agreed on all counts. And yeah, the chain (or host application in general) should not know what the swingstore schema version number is. An upgrade process that could only make one step at a time would effectively reveal the schema version number to the host application: there would be this hidden number that is sometimes incremented by a new version of @agoric/swing-store and sometimes not, you wouldn't be allowed to jump more than one of them at a time, and so you would have to deploy your host-app upgrades faster than the swing-store schema version grows.

@mhofman
Member

mhofman commented Oct 5, 2023

What happens if the provider of your state-sync payload lies about the version in use? How does the importer validate the claimed version string?

Nope, that's not ok. Version confusions are security bugs. We need a way to make it verified.

Right, only the export data is trusted in our model. However, I'm doubtful there is much attack potential in having this version unverified.

So each export-data entry shows up, the importer assumes that it's current-format and interprets as such.

Fair. Ok, I'm fine with a synthetic export-data entry for the "export version".
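On the exporter side, the synthetic entry agreed above could be as simple as prepending one pair to the export-data stream (the 'exportVersion' key name is a placeholder, not a settled format):

```typescript
// The version entry travels inside export-data, so it is covered by the
// same IAVL root hash that verifies every other export-data entry.
function* exportDataWithVersion(
  realEntries: Iterable<[string, string]>,
  version: string,
): Generator<[string, string]> {
  yield ['exportVersion', version];
  yield* realEntries;
}
```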
