
Monitor updating persister #2359

Conversation

domZippilli
Contributor

@domZippilli domZippilli commented Jun 15, 2023

Monitor-updating KV store persister

Adds a new MonitorUpdatingPersister struct that customers may choose to use, so that new events on a channel do not require serializing the entire monitor to storage.

How it works

  • Stores monitors in monitors/
  • Stores updates in monitor_updates/
  • When the "update persisted monitor" call comes in, either:
    • Rewrites the monitor, cleaning up old updates at this time, or
    • Adds an update file
  • At startup, when it deserializes monitors, it:
    • Reads each monitor and deserializes it
    • Reads each update for that monitor, deserializes it, and applies it if its update_id is greater than the monitor's, or if it is the closed-channel update constant (a sketch of this read path follows the list)
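
For illustration, here is a minimal, self-contained sketch of that startup read path. The types and helper below are hypothetical stand-ins invented for this example, not LDK's actual ChannelMonitor / ChannelMonitorUpdate API; the real logic lives in this PR's persist.rs.

// Hypothetical stand-ins for ChannelMonitor / ChannelMonitorUpdate.
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX; // stand-in for the closed-channel update constant

struct StoredMonitor { latest_update_id: u64 }
struct StoredUpdate { update_id: u64 }

impl StoredMonitor {
    fn apply_update(&mut self, update: &StoredUpdate) {
        // The real code replays the update's effects; here we only track the id.
        self.latest_update_id = update.update_id;
    }
}

// Rebuild a monitor from its last full write plus any newer incremental updates.
fn load_monitor(mut monitor: StoredMonitor, mut updates: Vec<StoredUpdate>) -> StoredMonitor {
    // Zero-padded keys make natural (lexicographic) order match ascending update_id order.
    updates.sort_by_key(|u| u.update_id);
    for update in updates {
        if update.update_id > monitor.latest_update_id
            || update.update_id == CLOSED_CHANNEL_UPDATE_ID
        {
            monitor.apply_update(&update);
        }
    }
    monitor
}

fn main() {
    let monitor = StoredMonitor { latest_update_id: 5 };
    let updates = vec![StoredUpdate { update_id: 6 }, StoredUpdate { update_id: 7 }];
    assert_eq!(load_monitor(monitor, updates).latest_update_id, 7);
}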

Migration Path

No migration should be necessary! It should "just work" since the full monitors are stored the same way. When this persister runs the first time, it'll deserialize the existing full monitors, and then start writing updates.

@tnull
Contributor

tnull commented Jun 15, 2023

The easiest way to do this might be to extend the KVStorePersister trait to include get/delete/list, as shown. I think it might also be possible with some kind of extension trait, say "KVStoreReadListDelete", but the challenge is avoiding conflicting implementations of Persist for a KVStorePersister vs KVStoreReadListDelete, especially since they'd need to differ.

Mh, if we want to go this way, may I suggest that we upstream LDK Node's KVStore trait?

This might make a slight migration necessary as namespaces are not handled as implicit path prefixes anymore, but are dedicated variables. However, it also has generally been designed to be persistence-backend agnostic, i.e., it doesn't make assumptions about FilesystemPersister-specific things such as Paths etc. Plus, we'd already have two backends ready to go (FilesystemStore / SqliteStore).
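
For context, a namespace-aware store trait along the lines of LDK Node's KVStore looks roughly like the following. This is paraphrased from memory, not the exact definition (the real trait's signatures, return types, and namespace arguments may differ); it is only meant to illustrate the namespace-as-argument design.

use std::io;

// Paraphrased shape of a namespace-aware key-value store trait; namespaces are
// explicit arguments rather than implicit path prefixes baked into the key.
pub trait KVStore {
    fn read(&self, namespace: &str, key: &str) -> io::Result<Vec<u8>>;
    fn write(&self, namespace: &str, key: &str, value: &[u8]) -> io::Result<()>;
    fn remove(&self, namespace: &str, key: &str) -> io::Result<()>;
    fn list(&self, namespace: &str) -> io::Result<Vec<String>>;
}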

@benthecarman
Contributor

We should still keep the ability to persist only the monitors. IIRC, persisting all the updates takes more storage.

@domZippilli
Contributor Author

Sounds like upstreaming the KVStore would be the way to go. Pretty sure I can implement the updating persister on that. Then KVStorePersister can retain its old behavior of persisting whole monitors, for folks who would prefer the balance of less space/more IO.

The PathBuf for keys was motivated by an opinion that key names should be hierarchical in this application, rather than an assumption about using the filesystem. Mostly a holdover from the implementation I did for my own node (where we plan to move to !filesystem), and I wasn't looking as closely at the separation of concerns as I should for a library. The storage trait doesn't need that opinion. The updating persister probably does, as I have to organize the updates to correlate with the monitor, and natural order helps. I should be able to keep that opinion in that implementation though.

Thanks for this feedback, I'll report back with some updates when I get some more time to work on it.

@TheBlueMatt
Collaborator

Sounds like upstreaming the KVStore would be the way to go. Pretty sure I can implement the updating persister on that. Then KVStorePersister can retain its old behavior of persisting whole monitors, for folks who would prefer the balance of less space/more IO.

I don't feel like this needs a separate trait - we can have an option to turn the update storage off, but it's nice for the general trait to have reads so that we can provide utilities to load things at startup rather than everyone having to write it by hand.

The PathBuf for keys was motivated by an opinion that key names should be hierarchical in this application, rather than an assumption about using the filesystem. Mostly a holdover from the implementation I did for my own node (where we plan to move to !filesystem), and I wasn't looking as closely at the separation of concerns as I should for a library. The storage trait doesn't need that opinion. The updating persister probably does, as I have to organize the updates to correlate with the monitor, and natural order helps. I should be able to keep that opinion in that implementation though.

Sadly PathBuf in Rust is this weird platform-dependent type - the character set of it is different from that of String so it behaves differently. I feel a bit strange using a platform-dependent type like that in an API but maybe it doesn't matter. In any case, I admit I like the ldk-node trait with the "namespace" argument which we can use (we could make it multi-layer too if we want separate folders for each channel's pending monitor updates?).

@domZippilli
Contributor Author

we can have an option to turn the update storage off

Can you suggest a good way to do that? This occurred to me, but I wasn't seeing a way to get that configuration into the scope of the KVStorePersister's Persist implementation.

Sadly PathBuf in Rust is this weird platform-dependent type

That's a good point. It's cumbersome to get strings out of it, too.

The namespace part of the KVStore trait was actually a little bit of cognitive load for me. I ended up treating it like the prefix of a key in object storage, though it was unnatural for me to have that be a separate arg all the time, so I ended up working around it.

FWIW, the implementation seen here does store updates in a directory (a sketch of how this layout maps onto a namespace-plus-key store follows the list):

  • Update key: "monitor_updates/deadbeef_1/0000000000000000001"
  • Update path: "monitor_updates/deadbeef_1"
  • Update name: "0000000000000000001"
  • Monitor key: "monitors/deadbeef_1"
  • Monitor path: "monitors"
  • Monitor name: "deadbeef_1"
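
As an illustration only, a hypothetical helper pair (mine, not the PR's code) mapping that hierarchy onto (namespace, key) pairs could look like this; the zero-padding width shown is illustrative, not the PR's exact format.

// Hypothetical helpers showing how the hierarchical layout above could map onto
// a (namespace, key) store. Names and padding width are illustrative only.
fn monitor_location(monitor_name: &str) -> (String, String) {
    // e.g. ("monitors", "deadbeef_1")
    ("monitors".to_string(), monitor_name.to_string())
}

fn update_location(monitor_name: &str, update_id: u64) -> (String, String) {
    // e.g. ("monitor_updates/deadbeef_1", "00000000000000000001")
    // Zero-padding keeps lexicographic order equal to numeric order.
    (format!("monitor_updates/{}", monitor_name), format!("{:020}", update_id))
}

fn main() {
    assert_eq!(monitor_location("deadbeef_1").0, "monitors");
    assert_eq!(update_location("deadbeef_1", 1).1, "00000000000000000001");
}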

@TheBlueMatt
Collaborator

Can you suggest a good way to do that? This occurred to me, but I wasn't seeing a way to get that configuration into the scope of the KVStorePersister's Persist implementation.

Hmm, lol, yea, that's actually not trivial... We'd have to replace the generic impl with a wrapper struct that takes a KVStorePersister and a bool/enum describing the storage style. I think it'd be fine to do that; it wouldn't really have a big impact on users.
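
Purely to illustrate the suggestion, such a wrapper might look like the sketch below; the trait stub, struct, and enum names are hypothetical, not an actual LDK API.

// Minimal stand-in for the existing persistence trait so the sketch is self-contained.
pub trait KVStorePersister {
    fn persist(&self, key: &str, value: &[u8]) -> std::io::Result<()>;
}

/// How monitors should be persisted.
pub enum StorageStyle {
    /// Always rewrite the full monitor (current behavior).
    FullMonitor,
    /// Write individual updates and consolidate periodically.
    MonitorAndUpdates,
}

/// Wrapper that carries the configuration into the Persist implementation.
pub struct ConfiguredPersister<K: KVStorePersister> {
    pub kv_store: K,
    pub style: StorageStyle,
}

impl<K: KVStorePersister> ConfiguredPersister<K> {
    pub fn new(kv_store: K, style: StorageStyle) -> Self {
        Self { kv_store, style }
    }
    // A Persist implementation on this struct would branch on `self.style`
    // when asked to persist a new monitor or an update.
}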

@domZippilli
Contributor Author

replace the generic impl with a wrapper struct that takes a KVStorePersister and a bool/enum describing the storage style

Yeah, so then it becomes opt-in by way of using that wrapper, instead of the trait directly. Which seems easy enough, but also very similar to what I was imagining by using the KVStore, where you'd opt-in by using a sort of ChannelKVStore and going on the KVStore track, which differs in that it supports the get/list/delete required for updates. I imagine we could impl KVStore for FilesystemPersister (that's in ldk-node now). I actually tried adapting my code to the KVStore trait this afternoon, and got stuck on some kind of serialization bug, but at least the types are solvable.

I think what I will do is wait to see a PR for upstreaming KVStore before doing any more hacking on this, and revisit it when a get/list/delete-capable KV trait in LDK is settled.

@TheBlueMatt TheBlueMatt self-assigned this Jul 24, 2023
@TheBlueMatt
Collaborator

Sorry for the delay here. Not 100% sure how we want to go about adding a get/list/delete-capable KV trait in LDK, but a goal for 0.0.117 on my end is to add an async-capable KV trait in LDK, so I may be able to think on it when I get there. Tentatively assigning myself for that, but please ping me in a month if I forget and don't have any followup here.

@tnull
Contributor

tnull commented Jul 24, 2023

Sorry for the delay here. Not 100% sure how we want to go about adding a get/list/delete-capable KV trait in LDK, but a goal for 0.0.117 on my end is to add an async-capable KV trait in LDK, so I may be able to think on it when I get there. Tentatively assigning myself for that, but please ping me in a month if I forget and don't have any followup here.

Yeah, unfortunately this is still blocked on me upstreaming the KVStore. I hope to get to it this week or so and then hopefully both PRs can land for 0.0.117.

@domZippilli excuse the delay here!

@tnull
Contributor

tnull commented Aug 4, 2023

Excuse the delay once again. I now finally came around to upstreaming the KVStore interface: #2472

Feel free to rebase this PR on top to make use of the updated interface!

(Note I now also included LDK Node's variant of read_channel_monitors as a utility method which uses any KVStore implementation)

@tnull tnull self-requested a review August 4, 2023 12:29
@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from dac1604 to a1f6c67 Compare August 17, 2023 20:23
@domZippilli
Contributor Author

domZippilli commented Aug 17, 2023

Hi @tnull and @TheBlueMatt,

I went ahead and revised this to use the trait under review in #2472.

I have not tested it yet. I'm sure there are bugs. In particular, the addition of namespace to the KV API means I need to handle key paths differently, and that's good for tripping me up. But, that will all shake out when I can set up a test environment.

Here's where I think this stands:

  • The diff includes changes from Replace KVStorePersister with KVStore #2472. To make it easier to look at, my changes (so far) are only in lightning/src/util/persist.rs.
  • The requirement to have this be opt-in is still not solved. As you can see, I commented out the full-monitor-writing implementation of Persist. Now that I see how this all took effect, I think Matt's suggestion of having a wrapper that takes a config for persistence style is the easiest way to make it work. I think it will be a minor API break for the next LDK version this gets into.

So, PTAL, advise on things you'd like to see changed, and in particular advise if you have any direction about how to do the opt-in. In the meantime I will work on tests.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from a1f6c67 to 8658c7d Compare August 17, 2023 20:35
@domZippilli
Contributor Author

domZippilli commented Aug 18, 2023

Hi @tnull and @TheBlueMatt,

This isn't done yet, but it's significantly updated and probably at a good stopping point to get more feedback, and let final decisions on #2472 get made. My changes are in persist.rs, starting about line 203.

Some key points to review:

  • I reworked the path logic to use the 2-tier namespace design in KVStore. It's not ideal, but it works fine. I think I could redo it with 1 (no namespace), 3 (namespace + subnamespace?), or N tiers if you end up going that way.
  • How are we feeling about the structure? The previous iteration was a trait implementation. Now this is a new struct that people can choose to use as a sort of "default" or "reference" implementation of storing updates individually with KVStore. Will that work well with the rest of LDK? It seems like it will, and it handles the opt-in nicely.
  • The last test is broken. I'm not sure it's broken as far as what it's supposed to test (returning PermanentFailure), but it could be a sign of some other issue?
  • What else should I be testing? Are there existing tests in the LDK repo I should be referencing and adapting to this PR?

Contributor

@tnull tnull left a comment

Did a first high-level pass; didn't look too closely at the logic itself or the tests yet. You might need to rebase as the KVStore PR now maintains the code in the lightning-persister crate, and I now also included the TestStore upstreamed from LDK Node so you don't have to roll your own.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch 2 times, most recently from a7962e3 to 8de5176 Compare August 22, 2023 00:30
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Basically LGTM, I think.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch 2 times, most recently from c5d8a36 to de73d99 Compare August 22, 2023 23:56
@TheBlueMatt
Collaborator

Fixes #1426

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from d7ebca1 to 8f409c4 Compare September 26, 2023 00:38
@G8XSU
Contributor

G8XSU commented Sep 26, 2023

Commits need to be squashed into a single commit whose description contains the motivation, removing everything else from the other commit messages.

@domZippilli
Contributor Author

To update the group generally:

  • There are some weird indentation and formatting things going on. I plan to make a final formatting pass with manual adjustments and then call everyone's attention to that.
  • Before I do that there are a few test and functional tweaks to make. I will consolidate to a single commit with the clean commit message as well.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch 2 times, most recently from ab0fc9d to b3b92a6 Compare September 26, 2023 17:39
@domZippilli
Contributor Author

To update the group generally:

  • There are some weird indentation and formatting things going on. I plan to make a final formatting pass with manual adjustments and then call everyone's attention to that.
  • Before I do that there are a few test and functional tweaks to make. I will consolidate to a single commit with the clean commit message as well.

I made the significant tweaks that I knew I had to, and then also formatted one last time with rustfmt and then did a manual pass to get things aesthetically right. Also trimmed my in-line comments to the 100 column mark.

Should be good to go for more detailed formatting review (and of course other stuff).

@domZippilli
Contributor Author

domZippilli commented Sep 26, 2023

I pushed an update to the changelog that tells more of the upgrade story (attn: @G8XSU who suggested this).

It does occur to me -- is there a dangerous situation we could get into if one immediately wants to downgrade? Imagine this:

  1. You "point" MUP at your existing monitors.
  2. It reads them into memory.
  3. It writes some updates, but does not consolidate.
  4. You shut down and decide to downgrade.

At that point you would load stale ChannelMonitors. Some possible cures/mitigations:

  1. Does LDK already rewrite all monitors at init? I think that's sometimes true, like when there's a new block, but not always true.
  2. We could add a manual consolidation function, that must be called on shutdown or as part of a downgrade. This is different than the cleanup_stale_updates which does not consolidate. It would have to be run when the node is not "up" though, since otherwise I don't think I can lock the channels from getting updated. I could also make this function not write the sentinel bytes, since those would be known-updated monitors (and they'd still be compatible with the MUP since it tolerates not seeing the sentinel).
  3. I suppose if the MUP immediately rewrote any monitor it reads without sentinel bytes, that would also "mark" them and require you to go through the steps to consolidate and convert it back.

Thoughts?

Contributor

@tnull tnull left a comment

Looks pretty much good to me, only a few minor things.

I think we should land this soon to make sure it makes the release. And if any further minor feedback (docs/formatting) comes up, address it as part of a follow-up.

@jkczyz
Contributor

jkczyz commented Sep 26, 2023

I pushed an update to the changelog that tells more of the upgrade story (attn: @G8XSU who suggested this).

It does occur to me -- is there a dangerous situation we could get into if one immediately wants to downgrade? Imagine this:

  1. You "point" MUP at your existing monitors.
  2. It reads them into memory.
  3. It writes some updates, but does not consolidate.
  4. You shut down and decide to downgrade.

At that point you would load stale ChannelMonitors. Some possible cures/mitigations:

  1. Does LDK already rewrite all monitors at init? I think that's sometimes true, like when there's a new block, but not always true.
  2. We could add a manual consolidation function, that must be called on shutdown or as part of a downgrade. This is different than the cleanup_stale_updates which does not consolidate. It would have to be run when the node is not "up" though, since otherwise I don't think I can lock the channels from getting updated. I could also make this function not write the sentinel bytes, since those would be known-updated monitors (and they'd still be compatible with the MUP since it tolerates not seeing the sentinel).
  3. I suppose if the MUP immediately rewrote any monitor it reads without sentinel bytes, that would also "mark" them and require you to go through the steps to consolidate and convert it back.

Thoughts?

The documented start-up procedure is to write all ChannelMonitors, so I believe that (1) should suffice:

/// 6) Re-persist the [`ChannelMonitor`]s to ensure the latest state is on disk.
/// Note that if you're using a [`ChainMonitor`] for your [`chain::Watch`] implementation, you
/// will likely accomplish this as a side-effect of calling [`chain::Watch::watch_channel`] in
/// the next step.

We probably can't rely on (2) in case shutdown doesn't happen gracefully. And I'd shy away from (3) as it seems like we shouldn't have a side-effect when reading, IMO.

@domZippilli
Contributor Author

The documented start-up procedure is to write all ChannelMonitors, so I believe that (1) should suffice

Sounds good. I think anybody who is building from one of the LDK exemplars will be doing this even without reading the manual.

@G8XSU
Contributor

G8XSU commented Sep 26, 2023

Yes, I think we should be covered since we mention it in the last point of the backward-compatibility section.

@domZippilli
Contributor Author

Another issue I wanted to bring up:

Should we be giving guidance for maximum_pending_updates? Any data we could work from to form a recommendation?

I don't have such data, but I can characterize the trade-off. As the number (N) increases:

  • You will tend to write more CMUs than CMs, approaching 1 CM write for every N channel updates (it'll never be exactly that since you'll write CMs for new blocks and stuff).
  • You will have different "waves" of lazy deletes to deal with. The KVStore API doesn't actually batch, but the lazy deletes will come in ~square waves for each CM write, and the number of deletes in the wave increases with N, while the frequency decreases. So bigger N means bigger, less frequent "waves."
  • You will potentially have more listing to do if you need to run cleanup_stale_updates, though the worst case is not really affected by this setting, since it is "all updates on the channel" if your deletes were totally broken for a while. But for a more typical "crashed and dropped the pending deletes" case, ~2*N will be the outer bound of the listing size.

Given all that, 10 or 100 is probably right for most people. You'll cut your full-monitor write churn to roughly 1/N, have smoother throughput on deletes, and lower the typical worst-case bounds for cleanup listing (see the back-of-the-envelope sketch below).
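
To make that concrete, here is a back-of-the-envelope sketch of my own. It is simplified: it ignores full-monitor writes triggered by block connections and assumes every lazy delete eventually gets issued.

// Rough I/O counts for a channel that sees `updates` ChannelMonitorUpdates,
// with `n` = maximum_pending_updates. Simplified: block-connection-driven full
// writes are ignored and all lazy deletes are assumed to be issued.
fn approx_io(updates: u64, n: u64) -> (u64, u64, u64) {
    let update_writes = updates;           // one small write per channel update
    let full_monitor_writes = updates / n; // consolidate roughly every n updates
    let deletes = full_monitor_writes * n; // each consolidation deletes ~n update files
    (update_writes, full_monitor_writes, deletes)
}

fn main() {
    for n in [1u64, 10, 100] {
        let (u, m, d) = approx_io(1_000, n);
        println!("N={:>3}: {} update writes, {} full monitor writes, ~{} deletes in waves of ~{}",
            n, u, m, d, n);
    }
}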

@G8XSU
Contributor

G8XSU commented Sep 26, 2023

Should we be giving guidance for maximum_pending_updates?

I think it will very much depend on the scale of the number of payments.

We can leave this up to the client, given they have to understand these trade-offs and make a design choice regarding their storage.
We can keep it as a follow-up once we have thought more about this and maybe have more data?

Out of the things you mentioned, the main thing would be the "waves" of deletes of updates. For that, a smaller number probably makes sense. But if a node is dealing with a huge volume of payments, they might want to just use a higher N and offload deletes for later.

@domZippilli
Contributor Author

We can leave this up to the client, given they have to understand these trade-offs and make a design choice regarding their storage.
We can keep it as a follow-up once we have thought more about this and maybe have more data?

That's fine. I veered into recommending a number (for most people), and that's not necessarily what I want to do at this stage; what I think would benefit the customer is explaining the trade-offs. Maybe add a short paragraph that summarizes two or three of those considerations?

I think you're probably right that at very high volumes, deletes may be something you don't do at all, at least not in-band with monitor writes. Though at that scale you're probably looking at this struct as a reference and not using it unmodified anyway.

@jkczyz
Contributor

jkczyz commented Sep 26, 2023

I think it will very much depend on the scale of the number of payments.

Agreed. It's fine to lay out the tradeoffs and let them experiment with what works for them.

Contributor

@jkczyz jkczyz left a comment

LGTM modulo remaining comments and documenting tradeoffs for maximum_pending_updates.

Please wrap commit messages at 72 characters as per: https://cbea.ms/git-commit/

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from cddc8e7 to faa0f56 Compare September 26, 2023 21:23
@domZippilli
Contributor Author

domZippilli commented Sep 26, 2023

Based on instruction from LDK Discord, I'm going to do one final squash of the 11 fixup commits I have at this point.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from 4ee435e to ce1e88b Compare September 26, 2023 21:44
if let Ok((_, old_monitor)) = maybe_old_monitor {
// Updates must be > the old monitor's update_id. If we can't add, the channel
// is already stored closed so we can skip this.
if let Some(start) = old_monitor.get_latest_update_id().checked_add(1) {
Contributor

We need to start cleanup from old_monitor.get_latest_update_id() (not add +1).

There is a case where a monitor update is persisted
and immediately afterwards a block is connected.

In the block_connected case, we will end up persisting the full monitor with the same latest_update_id,
so we might not delete the monitor_update persisted earlier if we use start + 1.

(After this we might issue some extra deletes in other cases, but that should be fine.)
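
A tiny sketch of the point, with hypothetical names rather than the PR's actual code:

// After rewriting the full monitor, delete stale update files starting at the
// old monitor's own latest_update_id, not old_latest + 1: a block-connected
// full persist reuses that id, so an update file with id == old_latest may
// still exist. Deleting a key that is already gone is harmless.
fn stale_update_ids(old_latest: u64, new_latest: u64) -> std::ops::RangeInclusive<u64> {
    old_latest..=new_latest
}

fn main() {
    // The update persisted just before a block-connected full write is included.
    assert!(stale_update_ids(5, 7).contains(&5));
}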

Contributor Author

Thanks for helping me get to the bottom of this (looked scary for a minute!). I've FP'd a change that removes the checked add, bypasses the assertion that we're not deleting non-existent things at the start of this loop, and removes the hacky conditions on the test that exercised this. Should all be working now.

@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from ce1e88b to 9298de0 Compare September 26, 2023 23:14
@domZippilli
Contributor Author

One more FP: I turned off some local feedback on my build and missed a mistake when refactoring the trait bounds.

MonitorUpdatingPersister is an implementation of Persister that stores
ChannelMonitorUpdates separately from ChannelMonitors. Its RFC is
in lightningdevkit#2545, at https://github.com/orgs/lightningdevkit/discussions/2545.

Co-Authored-By: Elias Rohrer <dev@tnull.de>
@domZippilli domZippilli force-pushed the 2023-06-monitor-updating-persister branch from 9298de0 to 0430d33 Compare September 26, 2023 23:30
Contributor

@G8XSU G8XSU left a comment

Let's go!

More or less waiting for CI now.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

LGTM. Followups can come later.

signer_provider: SP,
) -> Self
where
ES::Target: EntropySource + Sized,
Collaborator

These bounds are redundant, I think we can just drop the full where clause here.

Comment on lines +418 to +419
ES::Target: EntropySource + Sized,
SP::Target: SignerProvider + Sized,
Collaborator

These two bounds look redundant and can likely be removed (same in following function).

@TheBlueMatt TheBlueMatt merged commit bf91440 into lightningdevkit:main Sep 27, 2023
15 checks passed