Store alarm value in sqlite database, for sqlite-backed DOs #2648

jclee · 2024-09-04T00:45:31Z

Implements the setting of the alarm value in the sqlite database, when setAlarm()/getAlarm() methods are called.

Also implements armAlarmHandler() and cancelDeferredDelete() internally, in a manner similar to ActorCache, rather than calling out to the hook functions. Currently, only the setAlarm() hook function is used, and just to request notification from the AlarmScheduler.

Still need to implement proper sequencing of alarm update vs. commitCallback, but wanted to see if the general approach is OK.

src/workerd/util/sqlite-metadata.c++

src/workerd/io/actor-sqlite.h

kentonv · 2024-09-12T18:29:36Z

src/workerd/io/actor-sqlite.c++

+    }
+  }
+
+  // TODO(soon): ActorCache implements a 4x retry of failed flushes... Do we need anything similar


Probably, yes. It's possible that the update will throw DISCONNECTED if it was sent to the server over a connection that was lost during the call. We need to retry in that case.

However, arguably, it could be the hook's responsibility to perform the retry, rather than do it here. In the workerd case, where the hook is implemented entirely locally, there'd be no reason to retry.

Seconded. We need retries somewhere.

src/workerd/io/actor-sqlite.c++

src/workerd/io/actor-sqlite.h

a-robinson · 2024-09-12T21:00:16Z

src/workerd/io/actor-sqlite.c++

+    }
+  }
+
+  // TODO(soon): ActorCache implements a 4x retry of failed flushes... Do we need anything similar


Seconded. We need retries somewhere.

src/workerd/io/actor-sqlite.c++

a-robinson · 2024-09-12T21:22:44Z

src/workerd/io/actor-sqlite.c++

+  // TODO(now): Here, similar to the ActorCache code, we're assuming that if the alarm time
+  // matches but is not in a clean state (i.e. is dirty or flushing), that it's OK to schedule the
+  // delete, and that it's OK to use the current alarm time to initialize the alarm state back to
+  // "clean" if the alarm is cancelled.  Is that OK?


It's apparently not disastrous given that we've been relying on it for quite some time now. I'd allow it, since it seems pretty likely that if we're getting invoked at the given time that the time was probably successfully flushed.

I suppose that the sqlite storage does add a second source of truth that makes this a little more confusing though. But even so, if the alarm time matched, and we successfully ran an alarm invocation at that time, then it seems correct to delete it?

So I think this is fine, but let me know if you see an issue that I'm missing.

Re: the sqlite storage adding a second source of truth... My original intent was for the sqlite storage to be the source of truth, at least for the local, possibly-not-yet-persisted alarm state, but I wasn't sure if it would be expensive to query sqlite each time we need to check the alarm value, so I'm essentially caching it in the currentAlarmTime member. If it's actually not that expensive to run the query, I think I could make the code a bit simpler.

I'd think it's fine to just hit sqlite every time. I don't know what @kentonv thinks.

It could have a very minor affect on billing since it would mean getAlarm() counts as one row read, but (A) I think we're going technically going to start causing a row to be read by the first getAlarm even with this caching, and (B) rows read are so preposterously dirt cheap that it's hard to imagine it becoming a big issue. That'll also depend on how #2707 and the related work plays out. I believe the goal was to not start charging for getAlarm, which may require some extra care to achieve :/

With the current implementation of counting rowsRead/rowsWritten, alarms will also be counted. Let me open a ticket to track this under https://jira.cfdata.org/browse/STOR-3551

https://jira.cfdata.org/browse/STOR-3744 - will need to be fixed before we enable usage based billing

I think it's fine to query sqlite again every time and I think it's fine that this is billed as a row read.

Hi 👋,

About half a year ago on Discord it was confirmed by @a-robinson that getAlarm would not be billed: https://discord.com/channels/595317990191398933/773219443911819284/1232360354210648136 - so (without access to the tickets and without knowing your plans) to make sure that people are aware of the change, please include this as a note somewhere in the docs or changelog.

Not knowing pricing (just a guess based on D1) right now, I'm not worried about sudden increases in bills, I'm just asking for this to be documented, please 😁

src/workerd/io/actor-sqlite.c++

src/workerd/api/actor-alarms-test.js

src/workerd/io/actor-sqlite-test.c++

jclee · 2024-09-17T07:13:26Z

I've reworked the state tracking to rely on alarm db queries for the local state, which is probably less bug-prone. I've also implemented the sequencing of alarm scheduling updates vs. database commits.

The things that I know still need to be implemented:

need to fix/test deleteAll interaction.
can remove unused hook methods, simplify workerd alarm manager
possibly need tighter check in armAlarmHandler()
probably should populate lastConfirmedScheduledAlarm from alarm manager on startup
need to support alarm rescheduling when alarm is early
eventually, need to handle alarm cleanup when deleting alarm that is the only thing in the db?
on the edgeworker side, hook up alarm scheduling, implement retries, add test coverage.

a-robinson · 2024-09-17T16:58:38Z

Sorry for needing to ask this, but do you know what I can click on to only see what's changed since the first round of review?

In the options I see, the top "Compare" link appears to be a rebase on master, the bottom "Compare" link appears empty, and 8b607ac appears to be a mixture of old changes with new changes.

kentonv · 2024-09-17T17:40:28Z

@a-robinson If you click on the "from" commit of the last force-push, it looks like it points to a fixup commit, and its parent is also a fixup commit, and the parent of that is the "to" commit of the previous force-push. So it seems these are the changes:

@jclee On GitHub it's important to actually post links to the fixup commits as they disappear from the UI as soon as you squash them. FWIW the fixup.sh script in the internal codebase gives you a list of these at the end to copy/paste.

kentonv · 2024-09-17T17:41:47Z

Hmm... these two fixups couldn't be all the changes, though.

kentonv · 2024-09-17T17:45:54Z

Ahh, there were more fixup-ish commits before the rebase-on-master... I think this is the whole set:

https://github.com/cloudflare/workerd/compare/17c964ac6340f6d877c5aebc79a723c2794195db..e7d6809862baa17f3a6b36ac99593da62d5fa19b

kentonv · 2024-09-17T17:48:36Z

TBH it seems like things changed enough in actor-sqlite.c++ to re-review it from scratch. sqlite-metadata didn't change AFAICT.

jclee · 2024-09-17T17:58:59Z

Ugh... Sorry, I'd pushed a version of my branch with WIP changes and thought it was OK because the github UI was still showing the commits post-rebase, but I guess it removes them after a little while. Here's a diff between the state the branch was in when last reviewed (branch jlee/srs-alarms-wip1) and the state after adding my WIP changes but before rebasing (branch jlee/srs-alarms-wip2):

jlee/srs-alarms-wip1...jlee/srs-alarms-wip2

The current state of the PR is the rebased result plus the couple minor fixes that Kenton pointed out (5a8ac6b and
d2417f9).

(I'm not sure how helpful the interim commits are, as quite a lot changed, particularly in 3d8640b and 9d6d2dc, but I tried to isolate the more mechanical changes.)

src/workerd/io/actor-sqlite.c++

src/workerd/io/actor-sqlite.h

src/workerd/io/actor-sqlite.c++

src/workerd/util/sqlite-metadata.c++

a-robinson · 2024-09-17T18:54:50Z

Thanks for the context on what's changed! I don't know how github hasn't made iterative reviews just work by now, it's clearly not impossible given that third party services like reviewable.io are able to manage it.

kentonv · 2024-09-17T19:45:44Z

Speaking of GitHub code review being bad, it appears the UI has decided to hide three comments from my recent review, because I mean, you don't really need to see all the comments, right? /s

I doubt you'd miss it @jclee but I've definitely had cases in the past where people have totally skimmed over this box and never saw the comments within, so heads up!

src/workerd/io/actor-sqlite.c++

src/workerd/util/sqlite-metadata-test.c++

src/workerd/io/actor-sqlite.h

a-robinson · 2024-09-18T20:42:24Z

src/workerd/io/actor-sqlite.c++

      commitTasks(*this) {
  db->onWrite(KJ_BIND_METHOD(*this, onWrite));
+  lastConfirmedAlarmDbState = metadata.getAlarm();


If we're reading the alarm every time a DO starts, then I don't think we can justify charging for it. Which isn't a big deal if we're already 100% going to make sure these don't get counted as row reads, I'm just calling it out since I don't remember the prior discussion coming to a super clear conclusion.

I'm not sure myself how we're accounting for sqlite reads that are required as part of DO operation, but I guess one additional thing to note is that both SqliteMetadata and SqliteKv perform queries on startup to read whether their tables have already been created... Not sure if those queries increment the same counters and need to be similarly accounted for.

Fair point. cc @ramyabygari , who (based on the internal ticket) I believe is going to be the person working on ensuring alarm reads aren't billed

a-robinson · 2024-09-19T02:41:31Z

src/workerd/io/actor-sqlite.c++

+
+  KJ_IF_SOME(pending, pendingCommit) {
+    co_await pending.addBranch();
+    co_return;


The additional comments wouldn't hurt. I have to say I'm needing to think pretty hard to really understand what this is all doing :/

src/workerd/io/actor-sqlite.c++

jclee · 2024-09-19T20:54:25Z

src/workerd/io/actor-sqlite.h

-    virtual void cancelDeferredAlarmDeletion();
+    // Makes a request to the alarm manager to run the alarm handler at the given time, returning
+    // a promise that resolves when the scheduling has succeeded.
+    virtual kj::Promise<void> scheduleRun(kj::Maybe<kj::Date> newAlarmTime);


So, one additional issue I've found while implementing the edgeworker side of scheduling is that ActorCache explicitly uses separate setAlarm() and deleteAlarm() RPC calls. I thought the single scheduleRun() method might be sufficient, but I see that the deleteAlarm() method has an optional parameter to allow specifying an explicit alarm time, which ActorCache passes in when doing the deferred deletion at the end of a handler.

I'm not sure of the exact situation this might be protecting against, but it seems likely that constraining the deferred deletion is important for the at-least-once delivery guarantee, so I think I should revise the Hooks API to match.

I don't have a specific example scenario in mind -- it looks like the motivation for adding that was due to the internal AlarmManager -- but it is a good idea to be paranoid about deleting alarms, since an alarm failing to run is potentially a major bug, while an alarm being invoked when it shouldn't is just a minor annoyance.

FWIW, I took an attempt at passing the specific alarm time on alarm deletion, but plumbing the deletion time from the end of the handler to the point of scheduling adds a fair bit of complexity to what is already pretty complex code... Maybe there's a better way to do it, than I'm doing, but I'm thinking it may be better to omit for now if it isn't strictly needed for correctness.

Example code (on another branch, not in this PR): 62936f6

kentonv · 2024-09-20T15:22:25Z

src/workerd/api/global-scope.c++

-    return WorkerInterface::AlarmResult{.retry = false, .outcome = EventOutcome::CANCELED};
+    }
+    KJ_CASE_ONEOF(armResult, ActorCacheInterface::CancelAlarmHandler) {
+      // TODO(cleanup): might be possible to use kj::READY_NOW promises instead of kj::none to


Yes I think that would be fine.

src/workerd/io/actor-sqlite.c++

jclee · 2024-09-20T19:18:06Z

Thanks for the feedback so far! I think I've addressed the known issues on the workerd side -- please let me know if further changes would help.

I'm still working on adding alarm scheduling retries and some more testing on the edgeworker side.

I still need to rebase this PR to resolve a conflict in #includes, but can wait if it's going to complicate review on github. I'm also not sure if the current granularity of commits will be useful post-merge, so I was thinking of mostly merging the ActorSqlite alarms implementation together in one commit. Is that OK?

a-robinson · 2024-09-20T19:21:08Z

I still need to rebase this PR to resolve a conflict in #includes, but can wait if it's going to complicate review on github

Mind waiting until I get through the latest batch of commits? I'm reading through them now. (Thanks for separating them out, by the way)

I was thinking of mostly merging the ActorSqlite alarms implementation together in one commit. Is that OK?

SGTM

a-robinson · 2024-09-20T20:15:12Z

src/workerd/io/actor-sqlite.c++

-    // After the first write in the handler occurs, the logic here is correct again as the current
-    // alarm time would match what we are reporting to the user from getAlarm().
-    //
-    // TODO(now): optimization no longer useful in sqlite?


Is the idea that this is no longer useful because sqlite won't actually rewrite the page if it's unchanged? Or if it's something else, why?

I'd added an optimization in the SqliteMetadata caching layer to skip writes if the write already matched the cached value, so yeah, writing the same value probably wouldn't currently even reach sqlite, here.

Also, I think the motivation for the analogous comment in ActorCache is because we don't want to skip the alarm dirtiness logic when the current alarm time matches the requested alarm time but we still have a deferred delete. But in the case of the below code, we're running the dirtiness logic regardless (even though I guess we'd similarly only need to run it when we have a deferred delete, assuming handlers and transactions nest sanely).

a-robinson · 2024-09-20T20:15:53Z

src/workerd/io/actor-sqlite.c++

@@ -546,7 +563,6 @@ void ActorSqlite::shutdown(kj::Maybe<const kj::Exception&> maybeException) {

 kj::OneOf<ActorSqlite::CancelAlarmHandler, ActorSqlite::RunAlarmHandler> ActorSqlite::
    armAlarmHandler(kj::Date scheduledTime, bool noCache) {
-  KJ_ASSERT(!haveDeferredDelete);


Why did this need to be removed? I'd still think it should always be false when starting a new alarm handler?

I wasn't sure that's the case... I think we only currently set it to false when:

setAlarm() is called on the root level transaction

committing a transaction with a dirty alarm value

starting a new alarm handler when the handler time is later than the local database state

We're not currently setting it to false when a DeferredAlarmDeleter is destroyed and starts the deferred alarm deletion, and maybe we should be? But I'm also not totally sure if it's possible for the DeferredAlarmDeleter to be attached to something that takes a few turns of the event loop to fully resolve, such that it's possible for other work in the event queue to start a new handler before DeferredAlarmDeleter's destructor runs.

But thinking about it more, yeah, this seems unlikely, and I think should already be covered by the log-once warnings, so I'll try reintroducing the assert and fix the DeferredAlarmDeleter to clear the flag.

Readded the check as a logged-once warning and added code to clear the flag, in cb67b25 .

a-robinson

Nice work on this @jclee , it's very tricky stuff.

jclee · 2024-09-23T16:51:22Z

While iterating on the PR, I noticed that the comparison I was doing for the "if scheduled alarm is earlier, reschedule later" logic was backward in both the implementation and tests :( ... fixed in 6111915 .

I'm now going to rebase to resolve the #include conflict and squash the ActorSqlite changes together. Pre-rebase diff: jlee/srs-alarms-wip3...jlee/srs-alarms-wip4

Exceptions that escape from the alarm handler don't cause the test to fail, so a failing assertion in the alarm handler only caused the test to hang. Calling the reject function of the waiting promise on exception should avoid this. Also tightening up the nullness checks.

…cellation

Analogous to SqliteKv helper class; not used yet.

Also, enforce similar alarm properties in ActorSqlite as we do in ActorCache.

jclee force-pushed the jlee/srs-alarms branch 2 times, most recently from 1bd8f75 to 9f101d3 Compare September 11, 2024 06:26

jclee marked this pull request as ready for review September 11, 2024 06:58

jclee requested review from a team as code owners September 11, 2024 06:58

jclee requested review from harrishancock, Frederik-Baetens, kentonv and a-robinson September 11, 2024 06:58

jclee force-pushed the jlee/srs-alarms branch from 9f101d3 to 17c964a Compare September 11, 2024 07:18

kentonv reviewed Sep 12, 2024

View reviewed changes

a-robinson reviewed Sep 12, 2024

View reviewed changes

jasnell reviewed Sep 13, 2024

View reviewed changes

src/workerd/api/actor-alarms-test.js Outdated Show resolved Hide resolved

jasnell reviewed Sep 13, 2024

View reviewed changes

src/workerd/io/actor-sqlite-test.c++ Outdated Show resolved Hide resolved

jclee force-pushed the jlee/srs-alarms branch from e7d6809 to a7edf79 Compare September 17, 2024 06:22

jclee force-pushed the jlee/srs-alarms branch from d2417f9 to 8b607ac Compare September 17, 2024 07:24

kentonv reviewed Sep 17, 2024

View reviewed changes

jclee force-pushed the jlee/srs-alarms branch from 1937b23 to 615deee Compare September 17, 2024 20:08

kentonv reviewed Sep 17, 2024

View reviewed changes

src/workerd/io/actor-sqlite.c++ Outdated Show resolved Hide resolved

src/workerd/io/actor-sqlite.c++ Outdated Show resolved Hide resolved

a-robinson reviewed Sep 19, 2024

View reviewed changes

jclee commented Sep 19, 2024

View reviewed changes

kentonv reviewed Sep 20, 2024

View reviewed changes

src/workerd/io/actor-sqlite.c++ Outdated Show resolved Hide resolved

a-robinson reviewed Sep 20, 2024

View reviewed changes

a-robinson approved these changes Sep 20, 2024

View reviewed changes

kentonv approved these changes Sep 20, 2024

View reviewed changes

jclee added 2 commits September 23, 2024 10:00

ActorSqlite: Revise armAlarmHandler() to allow for waiting before can…

49f408c

…cellation

jclee force-pushed the jlee/srs-alarms branch from cb67b25 to e13e274 Compare September 23, 2024 18:37

jclee added 2 commits September 23, 2024 12:08

Add SqliteMetadata helper class for reading/writing sqlite alarm time

66d1f2f

Analogous to SqliteKv helper class; not used yet.

sqlite actor alarms: store alarm time in sqlite

4f9e6aa

Also, enforce similar alarm properties in ActorSqlite as we do in ActorCache.

jclee force-pushed the jlee/srs-alarms branch from e13e274 to 4f9e6aa Compare September 23, 2024 19:09

jclee merged commit e1eb9b6 into main Sep 23, 2024
14 checks passed

jclee deleted the jlee/srs-alarms branch September 23, 2024 21:36

Store alarm value in sqlite database, for sqlite-backed DOs #2648

Store alarm value in sqlite database, for sqlite-backed DOs #2648

Conversation

jclee commented Sep 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shrima-cf Sep 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee commented Sep 17, 2024

a-robinson commented Sep 17, 2024

kentonv commented Sep 17, 2024

kentonv commented Sep 17, 2024

kentonv commented Sep 17, 2024

kentonv commented Sep 17, 2024

jclee commented Sep 17, 2024 • edited Loading

a-robinson commented Sep 17, 2024

kentonv commented Sep 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jclee commented Sep 20, 2024

a-robinson commented Sep 20, 2024

Choose a reason for hiding this comment

jclee Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a-robinson left a comment

Choose a reason for hiding this comment

jclee commented Sep 23, 2024 • edited Loading

jclee commented Sep 4, 2024 •

edited

Loading

shrima-cf Sep 13, 2024 •

edited

Loading

jclee commented Sep 17, 2024 •

edited

Loading

jclee Sep 20, 2024 •

edited

Loading

jclee Sep 20, 2024 •

edited

Loading

jclee commented Sep 23, 2024 •

edited

Loading