# chain-sends might get committed late: vulnerable crash window in cosmic-swingset #9655
## Refining the analysis

The comms protocol must accommodate each peer falling arbitrarily far behind. When we look at a single direction (e.g. Alice->Bob), the recipient cannot be ahead of the sender (the message hasn't been created yet), but it might have acked the most recent message, or only the previous one, or the one before that, etc., or perhaps it hasn't acked any messages at all. The possible states form a table whose x axis is which message Alice has emitted, and whose y axis is which message Bob has acked.
In contrast, the cosmos/kernel protocol cannot fall more than one step behind: the kernel will not be given work for block 2 until cosmos has committed block 1. There are only two databases, and cosmos strictly commits a given block to IAVL after swingstore has committed the same to SQL. So either both have committed the same thing, or SQL has committed and IAVL has not.
This relaxes the "embargoed durable sends" rule of the Waterken protocol. Our sends must still be durable, but we don't need them to be embargoed, because the recipient won't remember them (i.e. commit IAVL) unless the kernel reaches its own commit point. We're allowed to reveal the bridge-device actions as soon as they happen, just as long as we remember them (for the benefit of cosmos, not the kernel) at least until we start the next block. The "recipient won't commit early" property is what allows our bridge device messages to have answers. The comms protocol doesn't have this luxury, so all comms messages are async (send, no response).

## Types of Chain Sends

We basically have three kinds of messages:

- writes into IAVL
- reads from IAVL
- export-data writes
The first two are coming from the bridge or mailbox device, and either write to or read from some portion of IAVL. These are the real device IO pathway for the kernel. The third, export-data writes, is not driven by the kernel; it's driven by swingstore. While the export-data needs to be written into IAVL, and the changes need to be durable (against the "SQL committed but IAVL didn't" case), we could afford to use a different pathway than the device IO messages. For example, one option would be to accumulate all the export-data callbacks into RAM, and then store them directly into their own portion of IAVL.

## Integrating export-data deltas

From swing-store's point of view, export-data is a straightforward part of exports, and export-data callbacks are clearly necessary to allow host applications to incrementally commit to the export contents. When we combine this with the Waterken durable-output IO model, it becomes clear that any host app which uses a second DB and export-data callbacks will need to write the export-data deltas to that second DB. So, what if we baked this into swing-store? Instead of having the host app build its own mechanism, swing-store could buffer the deltas itself and let the host pull them right after commit:

```js
hostStorage.commit();
const deltas = hostStorage.getExportDeltas();
```

Assuming we can tolerate getting the export-data deltas late in the block cycle, then instead of using an immediate `exportCallback`, the host app would pull the accumulated deltas after `hostStorage.commit()` and send them to the chain alongside the replayable chain-sends. The host app would still need something like the following to recover after a crash. If we decided that the immediate callback provided performance benefits, we could do both.

```js
function recover() {
  for (let msg of JSON.parse(hostStorage.kvStore.get('chainSends'))) {
    sendToChain(msg);
  }
  for (let [key, valueOrNull] of hostStorage.getExportDeltas()) {
    sendToChain({ type: 'swingsetExportData', key, value: valueOrNull });
  }
}
```
I reviewed the kernel code, and every `controller.run()` or `controller.step()` does end with an `endCrank()`. So the swingstore's semi-accidental "we'll buffer export-data callbacks until `endCrank` or `commit`" behavior means the callbacks always fire while cosmic-swingset is prepared to handle them. So I think we don't currently have a vulnerable crash window. To guard against changes in either swingstore or the kernel, we could enhance cosmic-swingset to assert that the export callback is only invoked during the window in which its output can be recorded as a chain-send.
I'd like to allow swingstore to flush its callbacks any time, but the current situation will work safely, and the guards above would catch behavioral changes which would violate the assumptions we're currently relying upon.
Possibly related: #8652
The swingstore export-data callback gives us export-data records, which must be written into IAVL by sending them over to the golang side with `swingStoreExportCallback`. However, that callback isn't ready right away, so if e.g. `openSwingStore()` were to invoke it, we might lose those records. Likewise `saveOutsideState()` gathers the chainSends just before calling commit, so if the callback were invoked during `commit()`, those records would be left for a subsequent block, which would break consensus if the node crashed before the next commit. This commit adds an `allowExportCallback` flag, to catch these two cases. It is enabled at the start of AG_COSMOS_INIT and BEGIN_BLOCK, and then disabled just before we flush the chain sends in `saveOutsideState()` (called during COMMIT_BLOCK). Note that swingstore is within its rights to call `exportCallback` during `openSwingStore()` or `commit()`; it just happens to not do so right now. If that changes under maintenance, this guard should turn a corruption bug into a crash bug. Refs #9655
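For concreteness, a rough sketch of the guard that commit message describes (illustrative shapes, not the actual PR code; `chainSend` stands in for the real bridge call):

```js
// Illustrative sketch only (not the actual PR code): a flag that turns an
// out-of-window export callback from silent data loss into a loud crash.
let allowExportCallback = false;

function swingStoreExportCallback(updates) {
  if (!allowExportCallback) {
    throw Error('exportCallback invoked outside the allowed window');
  }
  for (const [key, value] of updates) {
    // forward each export-data delta to the golang side as a chain-send
    chainSend({ type: 'swingsetExportData', key, value });
  }
}

// enabled at the start of AG_COSMOS_INIT and BEGIN_BLOCK:
//   allowExportCallback = true;
// disabled just before the chain-sends are flushed in saveOutsideState():
//   allowExportCallback = false;
```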
@mhofman and I were analyzing the way cosmic-swingset handles the swingstore export-data, and think we identified a small window of vulnerability, during which an application crash would leave the two databases (swingstore's SQLite, and the cosmos IAVL/LevelDB) in an inconsistent state.

I've been looking for the right place to perform the `upgradeSwingset()` call introduced by #9169. This needs to happen after the swingstore is created with `openSwingStore()`, but before `makeSwingsetController()` is called to use it. And the upgrade process will cause changes to the export-data, which might cause the `exportCallback` to be invoked (or might not, depending upon how swingstore buffers and flushes those changes, see below), which means the host must be ready for those callbacks.

Really, the swingstore design assumes that the host is ready for those callbacks from the moment swingstore is given the callback function: it should be legal for swingstore to invoke `options.exportCallback` even before `openSwingStore()` has returned, so that internal upgrades (e.g. #8089) can get their changes covered by the export-data hashes.

But, looking at launch-chain.js, it seems like we can't actually accommodate that. It's working now because swingstore actually buffers the export-data until a `kernelStorage.endCrank()` (which doesn't happen during startup, only during a `controller.run()`), or a `hostStorage.commit()`, both of which happen late enough that the callbacks are prepared to work.

## Waterken-style Output Embargo
In general, we base our execution model upon the (closely related and cross-inspirational) E and Waterken systems. We start with the E vat model: isolated vats which interact only through asynchronous message passing, with each delivery processed as a single crank.
To make this model robust against crashes, each "atomicity domain" has a reliable database, and we define a commit/checkpoint model which ensures that no matter where a given domain crashes, when it is restarted from the most recent DB checkpoint, it will proceed forwards in a way that is consistent with what happened before the crash.
We try to follow the Waterken system's "hangover inconsistency prevention" logic in our IO design. The basic idea is that output messages are embargoed until after the end-of-crank commit. During a crank, when a component (vat, or kernel overall) wants to send a message, we instead store it in a table of "pending outbound messages". At the commit point, this table is committed along with all the internal state changes (as well as the beginning-of-crank step that removed a delivery from the input queue). Then, only after commit, the surrounding code can release the outbound messages to some remote system (with a different DB and atomicity domain).
(if we were to release the outbound messages earlier, our host might crash before the commit point, and then the new post-reboot execution might do something different, which is the failure mode we call "hangover inconsistency")
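A toy sketch of that embargo pattern (hypothetical names; in the real kernel the pending messages live in the database, not in RAM):

```js
// Toy model of the Waterken-style output embargo (hypothetical names).
const pendingOutbound = []; // the "pending outbound messages" table

function queueOutbound(message) {
  // during a crank, components only record the intent to send; nothing leaves
  pendingOutbound.push(message);
}

function commitCrank(db) {
  // internal state changes, the consumed delivery, and the pending sends
  // are all committed in the same transaction
  db.commit({ outbox: [...pendingOutbound] });
}

function releaseOutputs(transmit) {
  // only after the commit succeeds may messages leave the atomicity domain
  for (const message of pendingOutbound.splice(0)) {
    transmit(message);
  }
}
```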
To achieve at-least-once delivery reliability of messages, the IO system needs:

- durable storage of each outbound message, in the same atomicity domain as the state changes that produced it
- retransmission of stored messages until the recipient acknowledges them
- deduplication on the receiving side, so replayed (duplicate) inbound messages are ignored
The Waterken server recorded all of its outbound messages as files in a directory, and attempted to re-send all of them each time a TCP (HTTP) connection was established to that peer. Old messages were retired (deleted) upon receipt of an ack, which was always a "high-water mark" (so `ACK: 24` would retire all messages from 0 to 24).

The swingset comms protocol (and device-mailbox) is built around this principle, and provides reliable cross-kernel messaging (which we aren't currently using in cosmic-swingset, but which might get more attention as we look at running multiple communication chains).
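The retirement rule is simple enough to sketch (a toy in-memory model; the real mailbox device keys messages by sequence number):

```js
// Toy in-memory model of high-water-mark retirement (hypothetical shapes).
const outbox = new Map(); // seqNum -> message, kept until acked

function recordOutbound(seqNum, message) {
  outbox.set(seqNum, message);
}

function resendAll(transmit) {
  // on every new connection, re-send everything not yet retired
  for (const [seqNum, message] of outbox) {
    transmit(seqNum, message);
  }
}

function onAck(highWaterMark) {
  // "ACK: 24" retires every message numbered 0..24
  for (const seqNum of [...outbox.keys()]) {
    if (seqNum <= highWaterMark) {
      outbox.delete(seqNum);
    }
  }
}
```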
## Cosmos vs Swingset Atomicity Domains
Unfortunately, Cosmos and Swingset use separate databases. Very early on, we explored keeping all Swingset state in the Cosmos IAVL tree, but concluded that it would be too slow, and would prevent us from taking advantage of SQL features like `CREATE INDEX` that would provide a better data model (we'd have to encode everything into the IAVL key space, with a single lexicographic iteration primitive).

That means we have two atomicity domains, one per database. At end-of-block, we use `hostStorage.commit()` to perform a SQL `COMMIT` of the swingstore, and then a moment later, cosmos does a commit of the IAVL tree (i.e. a LevelDB `commit()`) to record all the cosmos-side state changes.

Assuming both systems are reliable on their own, this is safe against crashes that occur before the swingstore commit, or after the IAVL commit. When the application is rebooted, it will resume from a mutually-consistent state.
But, of course, we must also survive a crash which occurs between the two commit points. To achieve this, we apply a form of the Waterken protocol to the connection between Cosmos and Swingset. It's not quite the same, but it has similar attributes.
On each block, the cosmos/golang side sends directives into the kernel side, like `AG_COSMOS_INIT` / `BEGIN_BLOCK` / `END_BLOCK` / `COMMIT_BLOCK` (these are processed in launch-chain.js). Each one contains the block height that the cosmos side is currently executing. Upon receipt of `END_BLOCK`, the kernel side does a bunch of execution, and then commits its state, including a note about the block height that it just executed. As that execution runs, the kernel will perform device IO calls (mostly device-bridge messages like `VBANK_GIVE` and `VBANK_GET_BALANCE`), which are sent over to the cosmos side for processing.

We capture those device IO calls (and their results) in a list named `savedChainSends`. The list is stored durably.

Then, when the kernel is given a directive (`BEGIN_BLOCK` in particular), it compares the incoming block height against the last block it remembers executing and committing. In the normal case, the incoming height will be one higher than the one we remember, e.g. "please `BEGIN_BLOCK` of block 12" plus "I remember executing block 11".

But it might be equal (e.g. both 12), which means the kernel is being asked to execute a block that it already executed. This is the "ignore duplicate inbound messages" portion of the Waterken protocol. It means that we previously crashed after the swingstore commit (which captured the state changes of block 12), but before Cosmos managed to commit the block-12 IAVL changes. There are device IO calls that cosmos does not remember getting, and it needs them to achieve the correct block-12 state.
So in this case, we replay the `savedChainSends` back to cosmos instead of performing the swingset execution. This is very much like the way that vat workers are brought back up-to-date (from a heap snapshot) by replaying a transcript. In this case, we don't really care about the return values of the `VBANK_GET_BALANCE` calls: the kernel doesn't need them, although we might assert that cosmos returns the same values it did the first time around.

We need to save these chainSends durably so they'll survive the host crash, and we need to save them in the same atomicity domain (e.g. same DB) as the kernel state. But they're outside of the kernel's data model, so they must not interfere with `kernelStorage`. This is what `swingStore.hostStorage` was created for: a separate chunk of the `kvStore` table that can be used by the host app to track whatever it needs, but which gets committed at exactly the same time as everything in `kernelStorage`.
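Putting those pieces together, the replay-vs-execute decision at `BEGIN_BLOCK` looks roughly like this (a sketch: `hostStorage`, `sendToChain`, and `controller` stand in for the real pieces described above, and the `'blockHeight'` key is illustrative):

```js
// Sketch of the replay-vs-execute decision; not the actual launch-chain.js code.
async function handleBeginBlock(blockHeight) {
  const savedHeight = Number(hostStorage.kvStore.get('blockHeight') || '0');
  if (blockHeight === savedHeight) {
    // We committed swingstore for this block, but cosmos never committed IAVL:
    // replay the recorded device IO instead of re-executing the block.
    const savedChainSends = JSON.parse(hostStorage.kvStore.get('chainSends') || '[]');
    for (const msg of savedChainSends) {
      sendToChain(msg); // cosmos re-applies the writes; read results are ignored
    }
    return;
  }
  // Normal case: blockHeight === savedHeight + 1, so execute the block.
  // Device IO during execution is sent immediately AND appended to the
  // in-RAM chainSends list for durable storage at COMMIT_BLOCK.
  await controller.run();
}
```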
A function named `clearChainSends` empties the in-RAM list of chainSends and returns an Array, so the data can be stored durably into `hostStorage`. The `saveOutsideState` function does this, and stores `blockHeight`, and then calls `hostStorage.commit()`. This is called when processing the `COMMIT_BLOCK` directive.
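A minimal sketch of that sequence (it follows the description above, but is not the verbatim cosmic-swingset code; the `'blockHeight'` key is illustrative):

```js
// Sketch of COMMIT_BLOCK handling: persist the chain-sends and block height
// into hostStorage, then commit swingstore.
function saveOutsideState(blockHeight) {
  const chainSends = clearChainSends(); // empty the in-RAM list
  hostStorage.kvStore.set('chainSends', JSON.stringify(chainSends));
  hostStorage.kvStore.set('blockHeight', String(blockHeight));
  hostStorage.commit(); // SQL COMMIT; the IAVL commit follows a moment later
}
```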
## Swingstore Export-Data

Swingstore's `export-data` is a list of key-value pairs which provide integrity validation for the swingstore export artifacts: when you want to export the state of a swingstore, you get the `export-data` list and you get a list of named artifacts; when you want to import it, you feed it the `export-data` list and then you feed it the artifacts. The importer relies upon the accuracy of the `export-data`, but it does not rely upon the accuracy of the artifacts.

This drastically improves the performance of validated exports: for each block, we update a little bit of export-data and store it into the cosmos IAVL state, but we do not need to emit all the artifacts. Artifacts are not generated until a cosmos state-sync export is requested, which might only be once per day (depending on how the node is configured). The cosmos/CometBFT state-sync import process will reliably populate IAVL completely, and verify the IAVL root against the chain's consensus state. We leverage that to provide validation of the swingstore artifacts.
So, as the kernel processes a block, it makes changes to the swingstore state, which causes changes to the export-data records. The swingstore API allows the host application to configure an `exportCallback`, which is called with these export-data deltas. The host is expected to apply these deltas to its own durable state (e.g. IAVL), so that a future state-sync export can include them, so that a future state-sync import can provide them to `importSwingStore`, so they can be used to validate the otherwise-untrusted swingstore artifact blobs.

Because the swingstore `kvStore` is not itself a Merkle tree, we shadow each kvStore entry into its own export-data entry (transcripts and snapshots and bundles are handled more efficiently). So the stream of export-data deltas will include a lot of overlap: a given `kvStore` key might be changed a dozen times over the course of a block. For performance, we batch the export-data deltas, in the hopes of removing these redundant changes, and only delivering one delta per key.
Internally, the swingstore records the export-data deltas in a SQL table named `pendingExports`. An internal function named `flushPendingExports()` removes the deltas from the table and feeds them to the `exportCallback`. This function is currently called in two places. The first is `kernelStorage.endCrank()`, which is called at the end of each crank. The second is `hostStorage.commit()`, which first does `flushPendingExports()`, and then does the SQL `COMMIT` statement that finally checkpoints the swingstore DB.
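Roughly, that internal flush behaves like this sketch (not the actual swing-store source; the `pendingExports` column layout is assumed):

```js
// Sketch of the internal flush: drain the pendingExports rows and hand them
// to the host's exportCallback in one batch. Because the table is keyed,
// repeated changes within a block collapse to one delta per key.
function flushPendingExports(db, exportCallback) {
  const updates = db
    .prepare('SELECT key, value FROM pendingExports ORDER BY key')
    .all()
    .map(({ key, value }) => [key, value]); // value is null for deletions
  if (updates.length > 0) {
    db.prepare('DELETE FROM pendingExports').run();
    exportCallback(updates);
  }
}
```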
Note that `hostStorage` does not feed export-data (swingStore.js: `hostKVStore.set` lacks the `noteExport` calls which `kernelKVStore.set` has). The host application is responsible for populating `hostStorage` during state-sync import to the correct values. This probably also means we can't (safely) produce a state-sync export from a DB that is in the split-commit scenario, but that's ok because we only ever try to make an export after the IAVL commit finishes.
## launch-chain.js tries to do async export-data callbacks

In the name of performance, launch-chain.js takes the synchronous `exportCallback()` invocations and attempts to process them in parallel with kernel execution. The export-data deltas need to be persisted in the IAVL tree, which means they need to be sent over the wall to the golang/cosmos side of the application, which means they travel the `chainSend()` pathway (in a message named `swingStoreUpdateExportData`). The cross-language context switch makes that relatively slow, so we didn't want the kernel to block waiting for it to complete.

So, the swingstore is given `swingStoreExportSyncCallback`, which pushes the changes into a queue, and does some work to initiate a `chainSend`, but does not wait for it to complete. Some sort of async callback is arranged to react to its completion by pulling the next item from the queue and `chainSend`ing that. And then there's a flush operation that waits for everything to complete.

This gives the golang side more time to process most of the export-data deltas. To support this, we'd like as many of the deltas as possible to be delivered early: if the only `flushPendingExports()` happened during `hostStorage.commit()`, then there wouldn't be much opportunity for parallelization. That's why we added a flush to `endCrank()`: it's a reasonable tradeoff between parallelizability and delta-merging.
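The shape of that machinery is roughly the following (heavily simplified, with hypothetical names and message shape; the real launch-chain.js code is more involved, and `chainSend` stands in for the real bridge call):

```js
// Simplified sketch of the async export-data pathway: the synchronous callback
// only enqueues, a single in-flight drain pushes updates across with chainSend,
// and a flush waits for the queue to empty before commit.
const exportQueue = [];
let draining = null;

function swingStoreExportSyncCallback(updates) {
  exportQueue.push(updates);
  if (!draining) {
    draining = drainQueue().finally(() => {
      draining = null;
    });
  }
}

async function drainQueue() {
  while (exportQueue.length > 0) {
    const updates = exportQueue.shift();
    await chainSend({ type: 'swingStoreUpdateExportData', updates });
  }
}

async function flushExportQueue() {
  // called before hostStorage.commit(), so no delta is left behind
  while (draining) {
    await draining;
  }
}
```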
@mhofman and I agreed that this code is really hard to follow and is in dire need of refactoring. We think the async aspect of it increases the potential for bugs, doesn't match the intention of the swingstore `exportCallback` API, and might kick us into that "two systems, reliable on their own, but now interact in surprising ways" category.

## Problem 1: Export-Data is a chainSend, chainSend is stored in hostStorage
This finally brings us to the problems. The first is about how export-data deltas are committed.
Each `controller.step()` or `controller.run()` ends with a call to `kernelKeeper.endCrank()`. This is fortunate, because it means we do a `flushPendingExports` before returning control to the host application.

If any other swingstore changes were made, they would still be stuck in the `pendingExports` table when the `COMMIT_BLOCK` directive was processed and `saveOutsideState()` reached the call to `commit()`, at which point they'd be flushed, which would put them in `chainSends`, which would be hanging out in RAM, and which might or might not make it into IAVL before the cosmos-side IAVL commit (I'm not sure how the interlock works there). If the application crashed, they would be lost. If the application survived long enough to get to the next block, they'd be committed in that next block. Overall this would probably cause divergence between nodes that did the commit early and nodes that did it late, and recovering from a split-commit scenario would be dubious.
The swingstore design, and the documented API contract it has with the caller, should make it perfectly acceptable for `commit()` to write out any lingering export-data deltas. It invokes the `exportCallback` before the SQL commit, so it has satisfied its end of the contract. In fact the presence of a `flushPendingExports` inside `endCrank` should be merely an optimization, to benefit parallelization. So I feel it's a bug that we're actually dependent upon that `endCrank` flush for correctness.
To protect against future changes, we need a different design. One idea we discussed was to expose the currently-private `flushPendingExports` as `hostStorage.flushPendingExports()`, and require that the host call it manually, before calling `hostStorage.commit()`. With this approach, we'd make `saveOutsideState` look like the sketch after this paragraph. We'd probably also make `commit()` assert that the callbacks table is empty. This is a drag: it increases the host's API burden (exposing a previously-hidden implementation detail, and requiring them to engage in a more complex protocol). But it would ensure that the deltas were really flushed by the time the host uses them, and ensure that no additional deltas will appear before the `hostStorage.commit()` call is made.

That could work, but I'd prefer to find a way that doesn't expose the internal details like that.
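Under that idea, `saveOutsideState` might look something like this sketch (`hostStorage.flushPendingExports()` is the proposed API, not something swing-store exposes today; the other names follow the earlier sketch):

```js
// Sketch of the proposed protocol: the host explicitly drains the deltas
// (which arrive via exportCallback and become chainSends) before persisting.
function saveOutsideState(blockHeight) {
  hostStorage.flushPendingExports(); // lingering deltas become chainSends now
  const chainSends = clearChainSends();
  hostStorage.kvStore.set('chainSends', JSON.stringify(chainSends));
  hostStorage.kvStore.set('blockHeight', String(blockHeight));
  hostStorage.commit(); // could also assert that pendingExports is empty
}
```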
## (interlude: chainSends as Waterken output embargo)
If we look at this in terms of a Waterken IO design, we're using hostStorage `chainSends` as the embargoed-output table. We aren't actually embargoing the outputs (`chainSend()` sends them immediately to the golang side, in addition to recording them). We don't need to embargo because the cosmos side does not have a completely independent lifetime (unlike remote kernels): the IAVL tree is only committed after the SQL DB commits. So in the two-by-two table of (IAVL, swingstore) times (committed, not), we can never be in the corner where IAVL has committed but SQL has not. So we must record the chainSends, but we don't need to hide them.

This is good, because otherwise we couldn't do `VBANK_GET_BALANCE`, which requires a return value. A deep design decision in our platform is whether components can do blocking reads or not. For now, the kernel can do a blocking read of the cosmos side (`VBANK_GET_BALANCE`), and vats can do a blocking read of a device node (`result = syscall.callNow(device_vref, methargs)`). But we need to remove those from vats before we can make parallel execution work (#6447 (comment)).
So `chainSends` are more like a vat's transcript, where we perform the syscalls immediately, but record them just in case we need to play them back in a future restart (bringing a vat worker online).

And the trick is that we need this kernel-level transcript-like data to be committed in the same atomicity domain as the kernel's own state, which requires `hostStorage`.

`chainSends` currently have a somewhat tortuous path from `exportCallback`, to the `chainSends` array, to `clearChainSends()`, to the `hostStorage.kvStore.set()`. And the async/parallel nature of that path does not give me confidence that every delta has made it to the end by the time we commit.
It might be more direct if `exportCallback` immediately (synchronously) stored its delta in hostStorage, so that `commit()` could be called at any moment, and could do a `flushExportCallbacks` internally, and we'd still always be up-to-date. However the volume of the deltas means that we probably can't afford to update the one hostStorage `chainSends` key on every callback.
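For comparison, the synchronous variant being described would look something like this sketch (illustrative only; it mostly serves to make the cost concern concrete):

```js
// Sketch of the synchronous alternative: each delta is immediately appended to
// the durable chainSends record, so commit() is safe to call at any moment.
// Re-reading and re-serializing the whole list on every callback is the cost.
function syncExportCallback(updates) {
  const chainSends = JSON.parse(hostStorage.kvStore.get('chainSends') || '[]');
  for (const [key, value] of updates) {
    chainSends.push({ type: 'swingsetExportData', key, value });
  }
  hostStorage.kvStore.set('chainSends', JSON.stringify(chainSends));
}
```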
So, I'm still looking for an approach that is more satisfying than exposing `flushExportCallbacks` and obligating the caller to use it.

Note that our `mailboxStorage` approach has similar issues, but pre-dates `chainSends` and export-data, and we haven't looked at it for a long time.

## Problem 2: When is upgradeSwingset safe?
This sets up the more immediate problem that I set out to address: when is it safe for us to call `upgradeSwingset`?

The new `upgradeSwingset` call must happen before we build the kernel, because the kernel asserts that the internal kvStore schema version is up-to-date. And if `upgradeSwingset` actually upgrades anything, it will change kvStore values, which will produce export-data deltas.

This needs to happen before we build the kernel, so before `buildVatController()`, which makes it pretty early, well before we're processing any blocks. The swingstore API assumes that `exportCallback` can be called instantly, even during `openSwingStore` (for #8089 upgrade purposes). But the whole "`chainSends` getting looped into `hostStorage`" thing means that our host app is not actually ready that early (modulo buffering). At the very least, we must get `hostStorage` back from `openSwingStore` before a "cleaner" (blocking/synchronous) `exportCallback` could possibly write anything into it.

We have two levels of buffering which help out (and maybe a third). The first is that swingstore internally buffers the deltas in the `pendingExports` table (so in the pending/uncommitted SQL transaction, on disk, but not durable). They remain there until the first `endCrank` or `commit`.

The second is that cosmic-swingset holds `chainSends` in RAM. The maybe-third is the async queue thing, which (maybe?) allows some `exportCallback` calls to not actually run until later, with a flush mechanism to ensure they actually do run before we call `commit()`.

So I think we can call `upgradeSwingset` early, just after `openSwingStore`. The export-data changes will hang out in the `pendingExports` table until the first `controller.run()` hits an `endCrank`, at which point the `flushExportCallbacks` will start them on their journey to `chainSends`. They should finish that journey by the time we call `commit`, and they'll get committed in the first block after the chain upgrade.
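So the startup sequence would be roughly this sketch (import paths and exact signatures are indicative; `stateDBDir`, `swingStoreExportSyncCallback`, and `deviceEndowments` come from the surrounding host application):

```js
// Sketch of the proposed startup ordering; upgradeSwingset is the call
// introduced by #9169.
import { openSwingStore } from '@agoric/swing-store';
import { upgradeSwingset, makeSwingsetController } from '@agoric/swingset-vat';

const { hostStorage, kernelStorage } = openSwingStore(stateDBDir, {
  exportCallback: swingStoreExportSyncCallback, // must tolerate buffered delivery
});
upgradeSwingset(kernelStorage); // kvStore changes here become export-data deltas
const controller = await makeSwingsetController(kernelStorage, deviceEndowments);
// The upgrade's deltas wait in pendingExports until the first endCrank()/commit(),
// and land in IAVL with the first block after the chain upgrade.
```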
But... what if we don't happen to hit an `endCrank`? I don't think this can happen: if `controller.run()` finds no work to be done, the `finally` block in kernel.js still invokes `endCrank`. But if somehow we avoided an `endCrank`, we'd wind up in the Problem 1 case above, which would spell trouble.