Remove persistence of RSS plans #7185

andrewjstone · 2024-11-27T21:37:34Z

Implements #7174

Obviates the need for #724

This still needs some manual testing.

Implements #7174 Obviates the need for #724

smklein · 2024-11-27T21:43:26Z

sled-agent/src/rack_setup/service.rs

+}
+
+const RSS_STARTED_FILENAME: &str = "rss-started.marker";
+
 #[derive(Clone, Serialize, Deserialize, Default)]
 struct RssCompleteMarker {}


Noting this for myself and others: The "RSS completed" marker is not changing with this PR, so systems which have fully deployed with RSS already will be unimpacted by this change. (This is good)

Yep, done on purpose. I thought about adding a comment in code but didn't end up doing it. Happy to if you think it will help though.

smklein · 2024-11-27T21:47:58Z

sled-agent/src/rack_setup/service.rs

            &service_plan,
            ExternalPortDiscovery::Auto(switch_mgmt_addrs),
            nexus_address,
        )
        .await?;

+        // Finally, mark that we've completed executing the plans and handed off to nexus.
+        let mut ledger = Ledger::<RssCompleteMarker>::new_with(


If we handoff to nexus, but crash before writing the ledger, wouldn't this system risk re-running RSS concurrently with Nexus changes (e.g., the reconfigurator)?

I know we can't make the "write ledger" and "handoff to nexus" steps atomic, but in the "old version" of this code, we did the following:

Write down the RssCompleteMarker first

Then handoff to Nexus

If the RSS service crashes and reboots, it can handoff to Nexus again

Nexus should know how to handle a handoff idempotently, even well after initialization, so this operation is safe and doesn't risk allocating services outside Nexus' control

If we don't want to do the handoff (e.g., maybe the serialization of this data isn't something we want to care about?) we could have a Nexus internal endpoint that's much simpler -- just asking "did handoff happen".

Then, this PR could:

Write that RSS completed (RssCompletedMarker)

Perform the handoff to Nexus

Carry on

And in the case that we crash after writing RssCompletedMarker:

RSS would read the marker on reboot

RSS could query the "did handoff happen" internal Nexus endpoint (no parameters, just a boolean response)

If yes: Cool, carry on

If no: Throw the same error about needing a clean slate that we would throw if RSS had started, but not completed

(this could also help with the bootstore proposal you wrote up - we wouldn't need to store the entire RSS config, just a boolean indicating that we should be done with RSS)

Great feedback @smklein. Indeed, I was somewhat conflicted here. I like the existing logic of being able to idempotently handoff to nexus if the RssCompleteMarker was written already. But that also did indeed require persisting the plans, which adds more opportunity for things to go wrong and cognitive overhead for developers. I think we could do away with the versioning entirely and just persist the latest version of the plans and that would achieve the primary goal of #7174. I still would prefer not to persist the plans though, as you have surmised.

I like the design of your solution, but it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error. I'm also not sure the problem the new nexus API solves is really a problem though. I added the write of a start marker for this case and so we'd just clean slate if this happens. Basically, all the plan persistence and early completed marker write prevented was handling a crash after plan execution but before nexus handoff.

This PR says "if we ever crash in RSS, we can't restart". We may have handed off, we may not have. In either case we accept what we have and clean slate. The gist of this PR is essentially a value judgement that restarting in case of this crash is cheaper and less problematic than persisting plan files. That is certainly open to interpretation.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

This would allow us to handle this crash in a similar manner to what we have now.

This isn't really true. But it does at least resolve the handoff completed, but marker not written case - given time for propagation.

it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error.

Yeah, I don't view this as a problem for the case where RSS is initially doing handoff (it should be blocked on talking to Nexus!) but agreed that it could be a pain for the case where we just want to check "has RSS already run", which happens on cold boot.

As long as we say "the process of clean slating may wrench away control from a fully running Nexus, and we're okay with that", I agree that this new format is fine.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

Yeah, and that's also nice because it propagates to the other scrimlet, right? Having the bootstore is handy, and seems like a nice addition in this part of the stack

it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error.

Yeah, I don't view this as a problem for the case where RSS is initially doing handoff (it should be blocked on talking to Nexus!) but agreed that it could be a pain for the case where we just want to check "has RSS already run", which happens on cold boot.

Agree.

As long as we say "the process of clean slating may wrench away control from a fully running Nexus, and we're okay with that", I agree that this new format is fine.

We already have to park the rack before clean-slating, which stops all the nexuses, so this shouldn't introduce any new behavior here.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

Yeah, and that's also nice because it propagates to the other scrimlet, right? Having the bootstore is handy, and seems like a nice addition in this part of the stack

Agreed. I will be back in the bootstore /trust quorum area very shortly. Interestingly, once we have full trust quorum, the gossip can actually be deterministic per key. This is because with full trust quorum the entire membership is known at each node. We can therefore store a bitmap (just a u32 - yay for 32 sleds) that indicates which nodes have seen the latest version. We refresh the bitmaps and gossip again when membership changes.

Remove persistence of RSS plans

1682fd4

Implements #7174 Obviates the need for #724

andrewjstone requested review from smklein, jgallagher and davepacheco November 27, 2024 21:37

smklein reviewed Nov 27, 2024

View reviewed changes

fix cargo doc

0bfd891

andrewjstone mentioned this pull request Nov 27, 2024

Persist RSS completed marker on all sleds #7186

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove persistence of RSS plans #7185

Remove persistence of RSS plans #7185

andrewjstone commented Nov 27, 2024

smklein Nov 27, 2024

andrewjstone Nov 27, 2024

smklein Nov 27, 2024

smklein Nov 27, 2024 •

edited

Loading

smklein Nov 27, 2024

andrewjstone Nov 27, 2024

andrewjstone Nov 27, 2024

smklein Nov 27, 2024

andrewjstone Nov 27, 2024

Remove persistence of RSS plans #7185

Are you sure you want to change the base?

Remove persistence of RSS plans #7185

Conversation

andrewjstone commented Nov 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Nov 27, 2024 •

edited

Loading