price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

warner · 2023-09-28T00:46:35Z

Describe the bug

I spent a few days investigating storage usage on mainnet, and learned that about 60% of our kvstore entries are the result of a bug in the price feed contracts, and this count grows by about 10 rows for every new price quote created.

External oracles submit messages (signed txns) into the chain with claims about the e.g. ATOM/USD price at a given time. These get executed as contract operations (create an invitation with args that include the price claim, make an offer, get a seat, etc). The price feed contract then publishes a price quote to subscribers, such as an aggregator that averages/etc quotes from multiple sources to drive Vault creation or liquidation.

While the authority to produce a quote is limited by virtue of the wallets which have access to the price feed's objects, the quote objects themselves might be delivered to arbitrary subscribers (over untrusted paths), and those subscribers want to verify that the quote object they received is genuine. The current implementation uses ERTP Payment objects to represent the price quotes, and subscribers can call E(priceQuoteIssuer).getAmountOf(allegedQuote) to confirm their authenticity.

generating a quote object:

agoric-sdk/packages/zoe/src/contracts/priceAggregator.js

Lines 103 to 109 in 92b6cd7

    
           const authenticateQuote = async quote => { 
        
             /** @type {Amount<'set'>} */ 
        
             // xxx type should be inferred from brand and value 
        
             const quoteAmount = AmountMath.make(quoteKit.brand, harden(quote)); 
        
             const quotePayment = await E(quoteKit.mint).mintPayment(quoteAmount); 
        
             return harden({ quoteAmount, quotePayment }); 
        
           };

verifying one:

agoric-sdk/packages/zoe/src/contractSupport/priceAuthorityTransform.js

Lines 90 to 92 in 92b6cd7

    
           const { value: sourceQuoteValue } = await E(sourceQuoteIssuer).getAmountOf( 
        
             sourceQuotePayment, 
        
           );

However, subscribers like the scaledPriceAuthority stop there: the quote payment is dropped, without being deposited or burned.

This interacts badly with a safety feature of ERTP: the Recovery Set. To guard against assets being lost, each time you call payment = await E(purse).withdraw(), the payment is added to the Purse's "recovery set": a strong SetStore. If you send the Payment to a correspondent, but they fail to deposit it, you can call getRecoverySet() to access all these Payments, and then deposit them back into a Purse. Or, you can call recoverAll() to sweep them back into the original Purse. Brand new tokens, created by E(mint).mintPayment(), are held in a special mintRecoveryPurse, available from the IssuerKit. All Payments are removed from their Recovery Set as soon as they are deposited in some other Purse, or burned.

Recovery normally isn't necessary, because well-written code will deposit a Payment as soon as it receives it, and only some sort of error would prevent that from happening. However, quotes-as-payments aren't assets to be hoarded: the code only uses ERTP Payments for their authenticity-verification properties, not their exclusive-transfer properties. So it is not surprising that the price contracts fail to deposit() or burn() them.

The consequence is that the Payment remains in the recovery set forever.

The contract and ERTP code uses a variety of SetStores, WeakMapStores, and Exos to express this logic. Swingset (specifically liveslots) supports those with "virtual collections" and "virtual objects", all of which bottoms out in a per-vat key-value table named the "vatstore". This holds rows to represent collection entries, ordinals to enable consistent iteration through collections, reference counts, weak-key reference-graph edges, export statuses, and finally the virtual-object state itself.

The particular data structures used by ERTP means that each Payment costs about 10 vatstore entries. In the following diagram, each circle is a vatstore entry, of various types.

And this is an example of one particular (still-extant) payment, where each gray box is a separate vatstore entry:

Since each vatstore entry is really stored in the kernel's kvStore (a table inside the "SwingStore" SQLite DB), they cost 10 kvStore entries each. And because state-sync integrity requires us to mirror all kvStore items into the IAVL database, they also cost 10 IAVL entries each. We've been creating quotes once every five minutes since the Vaults feature was released last June, and by now we have about 120k quote payments kept alive in v29 (the ATOM-USD price-feed vat). A similar problem is keeping 90k payments alive in v46 (scaledPriceAuthority). As we add new denominations in the future, their price feed vats will have the same issue. I estimate this currently costs 440MB of space (combining both swingstore and IAVL), growing at 5MB/day.

Additional Costs

v46 (scaledPriceAuthority) has 90k quote payments in this state, however it has also exported about 30k of them to v7-board. That vat is still holding weak references (c-list entries in the "recognizable" state, as opposed to "reachable"). This causes additional kvstore entries for each one: two for the export-side c-list, two in the kernel object table, two in the import-side clist, and some amount of RAM in v7-board.

We believe that, at some point, v46 received the payment and sent it to E(board.readOnlyMarshaller).toCapData(). The serialization process submits the received quote-payment Presence object to passStyleOf, which would have concluded that the Presence was passable, and added the object to it's WeakMap cache. This WeakMap is virtualized by liveslots (to prevent userspace from sensing GC, as usual), and VirtualObjectAwareWeakMap remembers a real Map keyed by vref strings. So v7-board will spend some (small) amount of RAM on each such object. These WeakMap references cause the c-list to remain in the "recognizable" state: they do not represent a strong ("reachable") reference.

Both the c-list/kvstore entries and the RAM costs will go away if/when the quote payment objects are released.

We've been considering allowing the Endo/passStyleOf code to use a real WeakMap instead of the virtualized one. This requires the permission of liveslots (otherwise arbitrary user code could do it, violating our GC-sensing rules), so coordination between Swingset and Endo is ongoing. If implemented, this would remove the separate passStyleOf cache and would probably allow v9-board to avoid spending either RAM or c-list imports on the object.

Potential Fixes

To flatten this growth rate, at today's Zoe meeting we discussed one short-term fix named "burn after reading": clients should burn their quote payment objects after verifying them. For example, the v29 price-feed quotes are received by v46-scaledPriceAuthority, where priceAuthorityTransform.js, could do:

    const { value: sourceQuoteValue } = await E(sourceQuoteIssuer).getAmountOf(
      sourceQuotePayment,
    );
   await E(sourceQuoteIssuer).burn(sourceQuotePayment); // burn after reading

We would need similar changes in all other places that receive quote payments. In particular, the scaled/aggregated quotes produced by v46-scaledPriceAuthority are delivered to v48-vaultFactory, where something needs its own burn() call.

Another approach would be to have the producing code rotate through recovery purses every hour or so:

mint Payments as before, but immediately deposit that first payment into a separate recovery Purse, then .withdraw() a new one from that purse and send the second payment to the client
then, once an hour, drop the old purse and start using a new one, relying upon GC to delete the recovery set, and thus forget about all the old payments (if these represented real assets, this would lose funds, but quotes aren't worth anything)
note that this doesn't help remediate the existing payments, because those are held by the recovery set associated with the Issuer's single mintRecoveryPurse, which is closed over by mintPayment() and cannot be changed or destroyed short of a vat upgrade which first modifies baggage to remove the reference

In the longer term, we should find a lighter-weight means to verify the authenticity of price quotes. ERTP Payments offer both verification (through issuer.getAmountOf()) and exclusive transfer (through purse.deposit(payment)), but this use case does not require exclusive transfer. The ocap community provides other tools for mere verification, like the Branding pattern. The "issuer" replacement can retain a WeakMap of quote objects, and offer an API to perform validQuotes.has(allegedQuote).

Remediation

In addition to fixing new price quotes, we also need to dispose of the hundreds of thousands of existing ones. One idea from today's meeting:

periodically (perhaps once per price quote), the producer could access mintRecoveryPurse, fetch its recovery set, iterate through a few dozen payments, and burn them all
- to avoid invalidating a quote before a consumer gets a chance to verify it, check the timestamp on the quote, and only burn it if it's at least an hour older than the new quote being created
- this would spread the work out over time: if we burn ten old payments upon each new 5-minute quote, we'd remediate all 120k from v29 in about 41 days, without causing an undue amount of work for any new quote

Another approach would be to have an upgraded producer vat immediately perform mintRecoveryPurse.recoverAll() (perhaps assuming that no quote payments are outstanding at the moment of upgrade, or accepting one quote getting invalidated before reading). However we need to be cognizant of the size of the recovery set: this would cause v29 to immediately delete about a million items from its vatstore, requiring maybe ten million syscalls, all in a single crank. Who knows how long that might take. In addition, that would remove a million rows from the IAVL tree, and we know that IAVL pruning takes a long time when there has been a lot of churn, so each validator might experience a significant (vote-missing) delay some arbitrary number of blocks after the upgrade (depending upon their individual pruning configuration).

Note that different approaches require upgrades to different vats, and other bugs make some vats are more practical to upgrade than others. In particular, we'd like to fix the price-feed code before making a lot of new deployments, to avoid adding more load to the system, but eventually we need to fix (and also remediate) the existing vats.

cc @Chris-Hibbert @erights

The text was updated successfully, but these errors were encountered:

Chris-Hibbert · 2023-12-11T18:03:30Z

@warner wrote:

I did some estimates on the 10-at-a-time thing, I think that's a good way to approach it.
As of 01-dec-2023, for bug #8400 (Quote Payment), we've got 218k payment objects to delete in v46-scaledPriceAuthorityATOM 291k in v29-ATOM-USD-price_feed, 98k in v68-stATOM-USD_price_feed, 74k in v69-caledPriceAuthority. At the current PushPrice rate (one every 60 seconds), we'll clear the v46 backlog in 15 days, causing a 1.2-second BringOutYourDead every 30 minutes. The v29 backlog will have a similar load but take proportionally longer (maybe 20 days)

Chris-Hibbert · 2023-12-19T00:57:36Z

#8498 is an update to #8418 that deletes the vestigial payments incrementally.

warner · 2024-02-08T22:49:22Z

Note to self: to measure the size of this problem, use SwingSet/misc-tools/categorize-kvstore.js:

% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json ingest run-17-swingstore.sqlite
% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json many-voms |grep "quote payment"

vatID  kindID tag                  defined referenced recognized exported sample
v29        22 quote payment         325384     325372     650744        0 {}
v46        22 quote payment         244029     244029     488058    81361 {}
v68        22 quote payment         144240     144240     288480        0 {}
v69        22 quote payment         108180     108180     216360    36080 {}

refs: #8740 closes: #8918 refs: #8400 ## Description Add a new Auction instance in A3P, so #8757 can make use of it. Also provides upgrade proposals which can be be applied to MainNet and other chains. ### Security Considerations N/A ### Scaling Considerations This is largely in service of #8400, which reports that priceFeed vats are accumulating garbage. This PR gives a new auction which can rely on new priceFeeds. The existing auction is not upgradeable and its pricefeeds can't be updated. ### Documentation Considerations No user-visible changes to behavior. ### Testing Considerations Tested in A3P ### Upgrade Considerations Auctions are not upgradeable, so we have to replace them and update their clients.

closes: #8049 closes: #8740 closes: #8868 closes: #8918 closes: #8981 closes: #8079 refs: #8400 closes: #8735 closes: #7873 closes: #8726 closes: #7954 closes: #8757 closes: #8728 closes: #8789 ## Description Upgrade **VaultFactory** in A3P, relying on the new PriceFeeds, and auctions. The actual upgrade waits for the priceFeeds to start supplying before doing the upgrade, so there won't be any gap in priceUpdates. When the upgrade is finished, we also update the auctioneerKit and Auction instance in the bootstrap environment. This PR demonstrates that VaultFactory can be upgraded even though governance is not persistent (#8123). ### Security Considerations N/A ### Scaling Considerations This is largely in service of #8400, which reports that priceFeed vats are accumulating garbage. This PR switches to new priceFeeds, which won't have that problem, though cleaning up the existing vats is a task for the future. ### Documentation Considerations No changes to user-visible behavior. ### Testing Considerations A3P tests that verify that vaultFactory has been upgraded, that a new Auctioneer is running and is receiving prices. Verify that when prices drop, assets are sold via the auction, the bidder gets the proceeds, and the vaults are liquidated or reconstituted appropriately. ### Upgrade Considerations Upgrade all the vats!

Chris-Hibbert · 2024-05-21T00:39:45Z

After #9283 and #9382, we'll no longer have an active leak. The remaining issue is how to deal with the accumulated storage, and that's addressed in #8928. I'm reassigning this to @warner.

refs: #8400 refs: #8498 closes: #9371 ## Description #8400 reported that the priceFeed vats hold onto old quotes in their recoverySet, preventing them from being collected. #9283 applied the fixes to master. These fixes address the growth in priceFeed vats and Zoe, but scaledPriceAuthorities were still growing. This resolves that problem by upgrading scaledPriceAuthority contracts to not use their recoverySets. <details> <summary><b>Expand</b> for performance graphs</summary> <img width="809" alt="image" src="https://github.com/Agoric/agoric-sdk/assets/13372636/889bb283-89c8-434f-8a67-efa56d0382ad"> Kernel allocation is in blue, and the scale is on the left. It varies from 48862 to 49743, with a small amount of long-term growth.The other active vats (v9=Zoe, v29=ATOM-USD_priceFeed, v43=wallet, 72=ATOM-USD_priceFeed, 74=auctioneer) use the scale on the right, with Zoe varying tightly around 3600, and the others low and stable. scaledPriceAuthority-ATOM doesn't have enough variation to be worth graphing. </details> ### Security Considerations Upgrade existing contracts. No new vulnerabilities. ### Scaling Considerations This addresses the largest known category of growth on the chain. ### Documentation Considerations Add some documentation on creating proposals. ### Testing Considerations Tested in A3P. We should exercise all the clients of priceFeeds in our test environments. ### Upgrade Considerations This PR includes a proposal that will upgrade all vats with `scaledPriceAuthority` in their label. That should work on or test chains as well as MainNet. These changes should be included in the next release.

aj-agoric · 2024-05-28T17:22:05Z

@warner to make another ticket describing the planned remediation. We will close this ticket once the sub tickets are landed on 16.

warner · 2024-06-14T00:29:31Z

Ok #9483 will track the creation+deployment of a chain-upgrade step that terminates (and initiates the slow deletion of) the old price-feed vats. That, plus/preceded-by the deployment of the #9283 / #9382 fixes to mainnet, will complete the remediation of this problem, so I'm closing this ticket.

refs: #9886 ## Description Some tests in DevNet indicate that the Oracles need the roundID in vstorage to be reset so they can sync. ### Security Considerations Not a security issue. ### Scaling Considerations Not directly a scaling concern. (see #8400). ### Documentation Considerations not user visible. It would be nice if the protocol between oracles and fluxAggregators were documented somewhere ### Testing Considerations updated the a3p tests to increment the roundId before the upgrade and show that this upgrade resets it. ### Upgrade Considerations This change is to the fluxAggregator which is already being started afresh in this proposal.

closes: #9370 ## Description The priceFeed proposal has been tested as an independent coreEval, but the current plan calls for it to be included in Upgrade 18. This adds it to `upgrade.go` and moves the tests to `n:upgrade-next`. ### Security Considerations No additional security implications. ### Scaling Considerations Addresses pre-existing scaling concerns (see #8400), and adds no new scaling issues. ### Documentation Considerations None ### Testing Considerations An existing test is moved to a new location. The current version seems to tickle #10292, so tests may not pass at this point. ### Upgrade Considerations Targeted for upgrade 18.

warner added bug Something isn't working performance Performance related issues Zoe Contract Contracts within Zoe Inter-protocol Overarching Inter Protocol labels Sep 28, 2023

This was referenced Sep 28, 2023

ExitObject/SeatHandle cross-vat reference cycle retains old objects #8401

Closed

test kernel GC code for large-cardinality operations #8402

Closed

rate-limited GC in BringOutYourDead #8417

Open

erights mentioned this issue Oct 3, 2023

feat(ertp): Upgrade quotes to drop newly optional recovery sets #8418

Closed

9 tasks

warner mentioned this issue Dec 6, 2023

add controller.reapAllVats() #8626

Closed

warner mentioned this issue Dec 11, 2023

use mainfork to measure one-fell-swoop remediation cost #8644

Closed

This was referenced Jan 5, 2024

Vaults detects failing priceAuthority and requests new one #8696

Merged

Plan for replacing priceFeeds, redirecting oracles #8740

Closed

This was referenced Feb 9, 2024

idea: remediation vat, dirty-vat-flag -driven BOYD #8877

Open

terminate vats slowly #8928

Open

This was referenced Feb 29, 2024

Make Recovery sets optional #9013

Merged

a3p proposal to create new priceFeeds #9044

Merged

This was referenced Mar 27, 2024

Upgrade VaultFactory in A3P, and use new auctions and price feeds #8757

Closed

Start a new auction in a3p-integration #9158

Merged

warner mentioned this issue Apr 15, 2024

delete historical transcript spans (unless config switch says to retain) #9174

Open

Chris-Hibbert mentioned this issue Apr 24, 2024

8981 upgrade vaults #9283

Merged

aj-agoric assigned Chris-Hibbert May 2, 2024

This was referenced May 15, 2024

Vats can be shut down to reap their objects using the adminNode. #9367

Open

create proposal for upgrade16 to replace all priceFeed vats #9370

Closed

upgrade scaledPriceAuthorities; measure impact #9371

Closed

upgrade scaledPriceAuthorities #9382

Merged

Chris-Hibbert assigned warner and unassigned Chris-Hibbert May 21, 2024

warner mentioned this issue Jun 11, 2024

terminate old mainnet price-feed vats #9483

Open

warner closed this as completed Jun 14, 2024

dckc mentioned this issue Jun 24, 2024

feat(vats): support ORACLE_ADDRESSES from environment #9570

Closed

1 task

This was referenced Aug 14, 2024

Upgrade vaults and auctions with a coreEval #9887

Closed

Replace priceFeeds and scaledPriceAuthority #9928

Closed

This was referenced Aug 29, 2024

start replacement scaled price authorities #9998

Closed

coreEval for replacing price feeds #9999

Closed

Chris-Hibbert mentioned this issue Oct 17, 2024

set roundId to 1 & write to vstorage at start #10291

Merged

Chris-Hibbert mentioned this issue Oct 22, 2024

Turn priceFeed proposal into a SoftwareUpgrade #10317

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

warner commented Sep 28, 2023

Chris-Hibbert commented Dec 11, 2023

Chris-Hibbert commented Dec 19, 2023

warner commented Feb 8, 2024

Chris-Hibbert commented May 21, 2024

aj-agoric commented May 28, 2024

warner commented Jun 14, 2024

price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

Comments

warner commented Sep 28, 2023

Describe the bug

Additional Costs

Potential Fixes

Remediation

Chris-Hibbert commented Dec 11, 2023

Chris-Hibbert commented Dec 19, 2023

warner commented Feb 8, 2024

Chris-Hibbert commented May 21, 2024

aj-agoric commented May 28, 2024

warner commented Jun 14, 2024