
Offline: Provide incoming batch's starting CSN to the PendingStateManager #21714

Merged
markfields merged 12 commits into microsoft:main from o3/csn-on-process on Jul 1, 2024

Conversation

Member

@markfields markfields commented Jun 28, 2024

Description

Fixes AB#8544

On submit, the PendingStateManager stores the batch's starting CSN along with each message. We need the corresponding information for incoming messages too, for two reasons:

  1. We can assert that they match as we process incoming acks for local messages.
  2. We need it to compute the batch ID when checking whether the same batch arrived from a forked container (see the sketch below).
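
For reason 2, here's a minimal sketch of the idea, assuming a hypothetical batch-ID format built from the submitting client's id plus the batch's starting CSN (the runtime's real scheme may differ):

```ts
// Hypothetical illustration only -- not the framework's actual batch ID format.
function computeBatchIdSketch(clientId: string, batchStartCsn: number): string {
	return `${clientId}_[${batchStartCsn}]`;
}

// Two forks of the same container resubmitting the same batch would derive the same ID,
// which is what lets the PendingStateManager detect the duplicate batch.
console.log(computeBatchIdSketch("clientA", 17)); // "clientA_[17]"
```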

Today, two behaviors in RemoteMessageProcessor keep the correct info from flowing to the PSM:

  1. Grouped Batching overwrites clientSequenceNumber with the index into the batch.
  2. Without Grouped Batching, batch messages are processed one-by-one, so the context of the start of the batch is not available when later messages are handed to the PSM.

Here's the fix: the RemoteMessageProcessor has the full context and can return the batch's starting CSN with each message, irrespective of Grouped Batching, compression, chunking, etc.
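
Roughly, the shape this implies for the RemoteMessageProcessor's output is something like the sketch below (names simplified; the PR's actual types differ):

```ts
// Simplified sketch -- InboundMessage and ProcessResult are illustrative names, not the PR's API.
interface InboundMessage {
	clientSequenceNumber: number;
	contents: unknown;
}

interface ProcessResult {
	// Messages ready for processing (already ungrouped / decompressed / reassembled).
	messages: InboundMessage[];
	// clientSequenceNumber of the start of the batch as originally submitted,
	// regardless of Grouped Batching, compression, or chunking.
	batchStartCsn: number;
}

// process() yields undefined while still accumulating (e.g. a non-final chunk),
// and a ProcessResult once it has messages to hand back.
declare function process(remoteMessage: InboundMessage): ProcessResult | undefined;
```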

Reviewer Guidance

These changes are mostly internal to RemoteMessageProcessor, plus the signature changes to plumb the CSN down to the PSM. To help prove correctness, we add an error log in the PSM if the batch start CSNs don't match (sketched below).
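
The check amounts to something like this sketch (the real PSM uses the runtime's structured logger and its own pending-message bookkeeping, so the names here are illustrative):

```ts
// Illustrative only -- the actual PendingStateManager logs through the runtime's logger.
function verifyBatchStartCsn(
	pendingBatchStartCsn: number, // recorded by the PSM when the batch was submitted
	incomingBatchStartCsn: number, // supplied by the RemoteMessageProcessor for the incoming batch
): void {
	if (pendingBatchStartCsn !== incomingBatchStartCsn) {
		console.error("Batch start CSN mismatch", { pendingBatchStartCsn, incomingBatchStartCsn });
	}
}
```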

Future consideration

A more comprehensive approach, which I like but is more disruptive, would be to have the RemoteMessageProcessor hold onto the batch messages until the last one arrives, and only then release the batch to be returned for processing (similar to how OpSplitter works today). This would make it more obvious that batches are processed all at once, and while processing the batch, other parts of the system - e.g. PendingStateManager - could trivially get at the start of the batch (for the CSN and, in the future, the batch ID).
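
As a sketch of that alternative (hypothetical batchEnd marker and class name; not the real implementation):

```ts
// Hypothetical sketch of the "hold the whole batch" approach described above.
interface InboundOp {
	clientSequenceNumber: number;
	contents: unknown;
	batchEnd?: boolean; // assumption: some metadata marks the final message of a batch
}

class BatchAccumulatorSketch {
	private pending: InboundOp[] = [];

	/** Returns the complete batch only once its final message arrives; otherwise undefined. */
	public process(op: InboundOp): { messages: InboundOp[]; batchStartCsn: number } | undefined {
		this.pending.push(op);
		if (op.batchEnd !== true) {
			return undefined; // still accumulating, much like OpSplitter holds non-final chunks today
		}
		const messages = this.pending;
		this.pending = [];
		return { messages, batchStartCsn: messages[0].clientSequenceNumber };
	}
}
```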

@github-actions github-actions bot added the "area: runtime" (Runtime related issues) and "base: main" (PRs targeted against main branch) labels on Jun 28, 2024
@markfields markfields requested review from a team, pragya91, agarwal-navin, jatgarg, kian-thompson, and rajatch-ff, and removed the review request for a team, on June 28, 2024 23:47
@@ -2619,7 +2623,11 @@ export class ContainerRuntime
 		// but will not modify the contents object (likely it will replace it on the message).
 		const messageCopy = { ...messageArg };
 		const savedOp = (messageCopy.metadata as ISavedOpMetadata)?.savedOp;
-		for (const message of this.remoteMessageProcessor.process(messageCopy)) {
+		const processResult = this.remoteMessageProcessor.process(messageCopy);
+		if (processResult === undefined) {
Contributor

It confuses me that process result can ever be undefined.

Member Author

If it's any chunk but the last one, the OpSplitter just holds onto it and doesn't return anything.

Contributor

Maybe we should assert that a non-final chunk is the only case that hits this condition.

Member Author

I'm not sure that's necessary, but maybe I'm not seeing it. What/where exactly would you assert?

Contributor
@dannimad dannimad Jul 1, 2024

I would prefer a more explicit approach. The code is making assumptions about remoteMessageProcessor that could change at any moment.

Contributor

Probably best to add a comment here explaining the undefined result, its meaning, and the early return.

Member Author

I can add a comment. This whole section of code has massive assumptions undergirding it which I don't like; the biggest one is that it's secretly running in the equivalent of a synchronous for loop over all messages in a single batch. So the early return here is more like a continue.

As for assumptions about what RemoteMessageProcessor does, that's a good call-out. Keep in mind too that this is a very specific helper class used only in this one place, so there is some expectation that the two sides of this call understand how the other side works.
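
For instance, the comment plus early return could read roughly like this self-contained sketch (placeholder types; not the PR's verbatim code):

```ts
// Placeholder types; the real code lives inside ContainerRuntime.process().
type ProcessResult = { messages: unknown[]; batchStartCsn: number };
interface MessageProcessor {
	process(msg: unknown): ProcessResult | undefined;
}

function processRemoteMessage(processor: MessageProcessor, messageCopy: unknown): void {
	const processResult = processor.process(messageCopy);
	if (processResult === undefined) {
		// The processor is holding on to this op (e.g. a non-final chunk of a chunked op)
		// and has nothing to hand back yet. This method runs once per message of the batch,
		// so returning here behaves like "continue" in the implied loop.
		return;
	}
	// ...process processResult.messages using processResult.batchStartCsn...
}
```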

Member Author

I'm strongly considering a follow-up to make this the contract between the two: RemoteMessageProcessor only returns something when it reaches the end of a batch; otherwise it returns undefined. That'll be a lot simpler for this code to reason over IMO.

Contributor

I would assert that messageCopy is of type ChunkedOp, and maybe that it is not the last chunk. Just making sure that we're falling into the expected case and being more explicit.
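
For illustration, the suggested assert could look like this sketch, using a hypothetical non-final-chunk check rather than the runtime's real message types:

```ts
// Hypothetical sketch -- the runtime's actual chunked-op type and fields differ.
interface RemoteMessageSketch {
	type: string;
	contents: { chunkId?: number; totalChunks?: number };
}

function isNonFinalChunk(msg: RemoteMessageSketch): boolean {
	return (
		msg.type === "chunkedOp" &&
		msg.contents.chunkId !== undefined &&
		msg.contents.totalChunks !== undefined &&
		msg.contents.chunkId < msg.contents.totalChunks
	);
}

// Assert that an undefined process() result only ever happens for a non-final chunk.
function assertExpectedUndefined(msg: RemoteMessageSketch, resultWasUndefined: boolean): void {
	if (resultWasUndefined && !isNonFinalChunk(msg)) {
		throw new Error("process() returned undefined for something other than a non-final chunk");
	}
}
```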

Member Author
@markfields markfields Jul 1, 2024

Gotcha, I can see the appeal of that. Tactically, it's kind of difficult*, and since I have it in mind to make further clarifications as mentioned, I'll pass for now.

* "Difficult", more like it breaks the abstraction. This bit of code doesn't otherwise deal with digging into the inner type or inspecting that payload (although I do believe all the info you're talking about is in there). Even the comment I put there doesn't mention chunked ops specifically, because it's not directly relevant: the contract of this function is just that RMP will return 0 or more messages to be processed next, as it ingests the virtualized ops one-by-one.

			return {
				messages: this.opGroupingManager.ungroupOp(message).map(unpack),
				batchStartCsn,
			};
		}

		// Do a final unpack of runtime messages in case the message was not grouped, compressed, or chunked
		unpackRuntimeMessage(message);
Contributor
@tyler-cai-microsoft tyler-cai-microsoft Jun 28, 2024

I wish this were typesafe; not important for this PR.

type MessageWithContext = {
	local: boolean;
	savedOp?: boolean;
	batchStartCsn: number;
	// ...
};
Contributor

Nice

@markfields markfields changed the title from "Offline: Provide the batch's starting CSN to the PendingStateManager" to "Offline: Provide incoming batch's starting CSN to the PendingStateManager" on Jun 30, 2024
Collaborator

msfluid-bot commented Jul 1, 2024

@fluid-example/bundle-size-tests: +3.9 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 454.67 KB | 455.62 KB | +973 Bytes |
| azureClient.js | 552.27 KB | 553.23 KB | +987 Bytes |
| connectionState.js | 680 Bytes | 680 Bytes | No change |
| containerRuntime.js | 255.97 KB | 256.9 KB | +953 Bytes |
| fluidFramework.js | 385.11 KB | 385.13 KB | +14 Bytes |
| loader.js | 134.01 KB | 134.03 KB | +14 Bytes |
| map.js | 42.17 KB | 42.17 KB | +7 Bytes |
| matrix.js | 145.44 KB | 145.45 KB | +7 Bytes |
| odspClient.js | 520.06 KB | 521.02 KB | +987 Bytes |
| odspDriver.js | 96.94 KB | 96.96 KB | +21 Bytes |
| odspPrefetchSnapshot.js | 42.23 KB | 42.24 KB | +14 Bytes |
| sharedString.js | 162.52 KB | 162.53 KB | +7 Bytes |
| sharedTree.js | 375.58 KB | 375.58 KB | +7 Bytes |
| Total Size | 3.24 MB | 3.25 MB | +3.9 KB |

Baseline commit: 775079f

Generated by 🚫 dangerJS against 074f059

Contributor
@dannimad dannimad left a comment

LGTM

@markfields markfields merged commit 5f69bf6 into microsoft:main Jul 1, 2024
30 checks passed
@markfields markfields deleted the o3/csn-on-process branch July 1, 2024 22:22