Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove persistence of RSS plans #7185

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

andrewjstone
Copy link
Contributor

Implements #7174

Obviates the need for #724

This still needs some manual testing.

Implements #7174

Obviates the need for #724
}

const RSS_STARTED_FILENAME: &str = "rss-started.marker";

#[derive(Clone, Serialize, Deserialize, Default)]
struct RssCompleteMarker {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this for myself and others: The "RSS completed" marker is not changing with this PR, so systems which have fully deployed with RSS already will be unimpacted by this change. (This is good)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done on purpose. I thought about adding a comment in code but didn't end up doing it. Happy to if you think it will help though.

&service_plan,
ExternalPortDiscovery::Auto(switch_mgmt_addrs),
nexus_address,
)
.await?;

// Finally, mark that we've completed executing the plans and handed off to nexus.
let mut ledger = Ledger::<RssCompleteMarker>::new_with(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we handoff to nexus, but crash before writing the ledger, wouldn't this system risk re-running RSS concurrently with Nexus changes (e.g., the reconfigurator)?

I know we can't make the "write ledger" and "handoff to nexus" steps atomic, but in the "old version" of this code, we did the following:

  • Write down the RssCompleteMarker first
  • Then handoff to Nexus
  • If the RSS service crashes and reboots, it can handoff to Nexus again
  • Nexus should know how to handle a handoff idempotently, even well after initialization, so this operation is safe and doesn't risk allocating services outside Nexus' control

Copy link
Collaborator

@smklein smklein Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want to do the handoff (e.g., maybe the serialization of this data isn't something we want to care about?) we could have a Nexus internal endpoint that's much simpler -- just asking "did handoff happen".

Then, this PR could:

  • Write that RSS completed (RssCompletedMarker)
  • Perform the handoff to Nexus
  • Carry on

And in the case that we crash after writing RssCompletedMarker:

  • RSS would read the marker on reboot
  • RSS could query the "did handoff happen" internal Nexus endpoint (no parameters, just a boolean response)
  • If yes: Cool, carry on
  • If no: Throw the same error about needing a clean slate that we would throw if RSS had started, but not completed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this could also help with the bootstore proposal you wrote up - we wouldn't need to store the entire RSS config, just a boolean indicating that we should be done with RSS)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great feedback @smklein. Indeed, I was somewhat conflicted here. I like the existing logic of being able to idempotently handoff to nexus if the RssCompleteMarker was written already. But that also did indeed require persisting the plans, which adds more opportunity for things to go wrong and cognitive overhead for developers. I think we could do away with the versioning entirely and just persist the latest version of the plans and that would achieve the primary goal of #7174. I still would prefer not to persist the plans though, as you have surmised.

I like the design of your solution, but it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error. I'm also not sure the problem the new nexus API solves is really a problem though. I added the write of a start marker for this case and so we'd just clean slate if this happens. Basically, all the plan persistence and early completed marker write prevented was handling a crash after plan execution but before nexus handoff.

This PR says "if we ever crash in RSS, we can't restart". We may have handed off, we may not have. In either case we accept what we have and clean slate. The gist of this PR is essentially a value judgement that restarting in case of this crash is cheaper and less problematic than persisting plan files. That is certainly open to interpretation.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would allow us to handle this crash in a similar manner to what we have now.

This isn't really true. But it does at least resolve the handoff completed, but marker not written case - given time for propagation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error.

Yeah, I don't view this as a problem for the case where RSS is initially doing handoff (it should be blocked on talking to Nexus!) but agreed that it could be a pain for the case where we just want to check "has RSS already run", which happens on cold boot.

As long as we say "the process of clean slating may wrench away control from a fully running Nexus, and we're okay with that", I agree that this new format is fine.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

Yeah, and that's also nice because it propagates to the other scrimlet, right? Having the bootstore is handy, and seems like a nice addition in this part of the stack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it suffers from the problem of having to indefinitely wait for an answer from Nexus or timeout and return an error.

Yeah, I don't view this as a problem for the case where RSS is initially doing handoff (it should be blocked on talking to Nexus!) but agreed that it could be a pain for the case where we just want to check "has RSS already run", which happens on cold boot.

Agree.

As long as we say "the process of clean slating may wrench away control from a fully running Nexus, and we're okay with that", I agree that this new format is fine.

We already have to park the rack before clean-slating, which stops all the nexuses, so this shouldn't introduce any new behavior here.

Now, after saying that, I think there is another alternative, utilizing the bootstore. If nexus handoff actually did complete, but the marker didn't get written (if we keep the order as in this PR), Nexus could write to the bootstore to mark it complete. This would allow us to handle this crash in a similar manner to what we have now.

Yeah, and that's also nice because it propagates to the other scrimlet, right? Having the bootstore is handy, and seems like a nice addition in this part of the stack

Agreed. I will be back in the bootstore /trust quorum area very shortly. Interestingly, once we have full trust quorum, the gossip can actually be deterministic per key. This is because with full trust quorum the entire membership is known at each node. We can therefore store a bitmap (just a u32 - yay for 32 sleds) that indicates which nodes have seen the latest version. We refresh the bitmaps and gossip again when membership changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants