Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9922 Make summaries more accurately account user ops #9957

Merged

Conversation

kian-thompson
Copy link
Contributor

@kian-thompson kian-thompson commented Apr 19, 2022

#9922
AB#126

Currently, we are running heuristic summaries when we observe a certain number of ops come in. The problem is that we consider system and non-system ops to have the same weight, when we really need to be treating non-system ops as more important for creating summaries.
The basic idea is to give non-system ops more weight compared to system ops in terms of determining when to summarize. We don't want to entirely get rid of handling system ops for summaries because this could adversely affect boot times.

@kian-thompson kian-thompson requested a review from vladsud April 19, 2022 00:34
@github-actions github-actions bot added area: runtime Runtime related issues base: next PRs targeted against next branch labels Apr 19, 2022
@vladsud
Copy link
Contributor

vladsud commented Apr 19, 2022

Linking to issue #9922
Hm, it looks more complicated than I anticipated :) Let's chat about it

if (error !== undefined) {
return;
}
this.heuristicData.lastOpSequenceNumber = sequenceNumber;

// Check for enqueued on-demand summaries; Intentionally do nothing otherwise
if (!this.tryRunEnqueuedSummary()) {
this.heuristicRunner?.run();
this.heuristicRunner?.run(this.numSystemOps, this.numNonSystemOps);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling that it's better to move all the logic around tracking ops (including op handler) into this.heuristicRunner.

Copy link
Contributor Author

@kian-thompson kian-thompson Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I'm a huge fan of this idea. I see the heuristic runner as determining, based on the current data, whether a summary should be run or not. In my opinion, it shouldn't also be the one in charge of collecting/manipulating this data.
The only thing it should be "running" is very specific heuristic runners that will self-actuate (ex: idle timer).

If we also want this SummarizeHeuristicRunner class to be tracking the data, then we should move all the SummarizeHeuristicData data manipulation to this class as well. It wouldn't make sense to have it split. But, in doing this we need the SummaryGenerator to know about the ISummarizeHeuristicRunner, which isn't ideal or a good practice.

- Implement new strategy design for more configurability
- We shouldn't rely on "run" to start the idle timer. In my opinion, it's better to explicity define a start method instead of expect this side functionality
this.opListener = (error: any, op: ISequencedDocumentMessage) => runningSummarizer.handleOp(error, op);
this.runtime.on("batchEnd", this.opListener);
this.opListener = (op: ISequencedDocumentMessage) => runningSummarizer.handleOp(op);
this.runtime.deltaManager.on("op", this.opListener);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I gave you somewhat wrong advice.
What you did with ophandler is good for counting ops.

But I missed the fact that we should not run summaries in the middle of a batch. Basically, ops can be grouped into "batches". it's a mechanism to ensure that all related changes are reflected in the model before render happens (or any other async activity). All ops in a batch are always processed in one go, synchronously.
So earlier code had "batchEnd" handler and summaries were considered only at the end of batch, plus some system ops.
We can summarize after any system op. But, we should not count "summarize" and "summaryAck" ops, as that will result in positive feedback loop / infinitely summarizing summary ops :)

I'd still use more general flow as you have for counting (with exception of excluding explicitly "summarize" and "summaryAck" ops). As for potential trigger to run a summary - I'd think that using micro-task (i.e. Promose.resolve().then) after any non-summary op is a better way to deal with batches. It ensures that we will not break batches (it's a strong guarantee), but it also will find occasionally better point in time to run a summary. I.e. if we are getting 3 batches each 100 ops long, the best time to summarize after processing all 3 batches, not after just the first one. It's probably noop change today as maxOps = 1000 (so it's very unlikely we would accumulate that many), but we are going to change it to 100, and chances of hitting this condition will increse.

Or we can keep old batchEnd flow + some system ops.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to take a look if we have a test that validates we are not going into infinite summarization loop due to summary ops

Copy link
Contributor

@vladsud vladsud Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it would be great to inspect code RE possibility of summary summarizing summary ops. When I look at telemetry, I see that minimum number of ops we summarize is 2. That suggests that we somehow can summarize only summary + summary ack ops.
I believe (based on experience in other places) that timer callback can run after timer is canceled, if it was scheduled before cancelation. You sort of closing this gap with a check in runSummarize(), I believe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to add: It's acceptable not to filter summary ops, but simply disregard 2 system ops in our weighted formula (a though based on suggestion to use OpTracker in other comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promose.resolve().then

I'm not familiar with this pattern, but I got the gist of what it's used for reading through some usages. However, I don't understand how using this pattern would help prevent against running the summary multiple times?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not. It only helps to get on a clean stack (this pattern might be useful to avoid reentrancy as well).
Usually, we add a boolean to track it. See this.pendingReconnect for an example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say without anything else in consideration, running summary on clean stack is better than running it in the middle of op processing pipeline. You never know where it will backfire.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand these concepts correctly, but even if we're running on a clean stack wouldn't we potentially be running runningSummarizer.handleOp a large number of times that eventually end up doing nothing? I suppose the cost of executing all that code wouldn't be too high, but could potentially be a factor?

Copy link
Contributor Author

@kian-thompson kian-thompson Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it might be difficult to track non-system ops nicely, would it make sense to hook onto the batchEnd event again and just track the number of system ops? Then we could do some simple math to determine how many non-system ops there are per summary.
We could also try to pass along the OpTracker object for this information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your concert about perf? There is no way to track just system ops, so if we want to count them, we have to listen to all ops. It not a big perf hit from these handlers, but we should definitely measure. We already were listening for all ops, so you are not adding more cost here.
It feels orthogonal to how / when we decide to run a summary.


// We shouldn't attempt a summary if there are no new processed ops
const opsSinceLastAck = this.opsSinceLastAck;
if (opsSinceLastAck > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check mostly to protect against timer firing when we have no ops?
If so, might be worth to move this check into timer callback, and possible replacing check here with assert - it would more clearly communicate intention and invariants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this check is just to prevent ever triggering a summary from this class when we know there are no new ops to summarize.
Moving this check assumes that we only call ISummarizeHeuristicRunner.run() after we have processed an op. It's a trade off in my eyes. Do we want to ensure the intention of what happens before this "run" method is called or do we want to always prevent the unnecessary summary?

@vladsud vladsud requested a review from NicholasCouri April 21, 2022 14:50
}

this.idleTimer?.restart();
Copy link
Contributor

@vladsud vladsud Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this object is created on demand. That means we already have some trailing ops when we get here, and we already lost ability to separate them into user ops & system ops. We should looking into doing one of the following:

  1. Assume some mix (i.e. 100% of user ops, or 70% - 30%) and initiate state appropriately
  2. Figure out how to move op counting out of this class, i.e. have some part of this logic be running all the time.
    • That's essentially the same as leveraging OpTracker class. It is created when we load from snapshot, before we process any ops, and thus it does not have same problem. Though we would need to subtract 2 from number of system ops to ignore summarize ops - I think that's fine approach.

on(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this;
/** @deprecated 1.0, please remove all implementations and usage */
removeListener(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this interface is used only internally in this package and should not be exported (if it is exported, your build will fail with back/forward compat tests). I'd rather make sure it's not exported and not care about deprecated markup - we can delete them right away (though see my comment on integrity of batches first).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're exporting the ContainerRuntime class don't we also need to export the interfaces it implements? I believe we'd have a visibility mismatch.

- Give both system and non-system ops the same weight so there's no guessing on what type of ops will be used
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label May 4, 2022
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented May 4, 2022

@fluid-example/bundle-size-tests: +2.62 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 391.99 KB 393.23 KB +1.24 KB
connectionState.js 711 Bytes 711 Bytes No change
containerRuntime.js 211.07 KB 212.31 KB +1.24 KB
loader.js 146.19 KB 146.24 KB +45 Bytes
map.js 38.29 KB 38.29 KB No change
matrix.js 121.7 KB 121.7 KB No change
odspDriver.js 146.02 KB 146.06 KB +44 Bytes
odspPrefetchSnapshot.js 36.82 KB 36.87 KB +44 Bytes
sharedString.js 138.31 KB 138.31 KB No change
Total Size 1.23 MB 1.24 MB +2.62 KB

Baseline commit: 4747eea

Generated by 🚫 dangerJS against 06eedb2

@kian-thompson kian-thompson marked this pull request as ready for review May 4, 2022 23:06
@kian-thompson kian-thompson requested review from a team as code owners May 4, 2022 23:06
Copy link
Contributor

@NicholasCouri NicholasCouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

public handleOp(op: ISequencedDocumentMessage) {
this.heuristicData.lastOpSequenceNumber = op.sequenceNumber;

if (isSystemMessage(op)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use here isRuntimeMessage, or even better - just op.type === "op" (i.e. not count summarize op as runtime op).
Not sure if we want to have a funciton for that.

Copy link
Contributor

@vladsud vladsud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - please take a look at old & new comments.
Please make sure to test it manually (across all interesting combinations) and also see if we have any gaps in UT coverage, especially around things like not summarizing summary ops, triggering summary on all triggers (idle, maxtime, max ops).

@kian-thompson kian-thompson merged commit ccad1b2 into microsoft:next Jun 6, 2022
@kian-thompson kian-thompson deleted the 9922-weight-system-ops-differently branch June 6, 2022 23:28
kian-thompson added a commit that referenced this pull request Oct 27, 2022
[How contribute to this
repo](https://github.com/microsoft/FluidFramework/blob/main/CONTRIBUTING.md).

[Guidelines for Pull
Requests](https://github.com/microsoft/FluidFramework/wiki/PR-Guidelines#guidelines).

## Description

These APIs were deprecated in
[#9957](#9957) but
weren't mentioned in BREAKING.md
ghost pushed a commit that referenced this pull request Oct 27, 2022
[How contribute to this repo](https://github.com/microsoft/FluidFramework/blob/main/CONTRIBUTING.md).

[Guidelines for Pull Requests](https://github.com/microsoft/FluidFramework/wiki/PR-Guidelines#guidelines).

## Description

These APIs were deprecated in [#9957](#9957) and mentioned as such in #12702

## Breaking Changes

The `"batchEnd"` listener in `ISummarizerRuntime` has been removed. Please remove all usage and implementations of `ISummarizerRuntime.on("batchEnd", ...)` and `ISummarizerRuntime.removeListener("batchEnd", ...)`.

If these methods are needed, please refer to the `IContainerRuntimeBase` interface.
alexvy86 pushed a commit to alexvy86/FluidFramework that referenced this pull request Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: next PRs targeted against next branch public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants