9922 Make summaries more accurately account user ops #9957

kian-thompson · 2022-04-19T00:34:44Z

Currently, we are running heuristic summaries when we observe a certain number of ops come in. The problem is that we consider system and non-system ops to have the same weight, when we really need to be treating non-system ops as more important for creating summaries.
The basic idea is to give non-system ops more weight compared to system ops in terms of determining when to summarize. We don't want to entirely get rid of handling system ops for summaries because this could adversely affect boot times.
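
A minimal sketch of the weighting idea described above, not the code in this PR. The names `OpCounts`, `WeightConfig`, `weightedOpCount` and `shouldSummarize`, as well as the weight values, are hypothetical and chosen only for illustration.

```typescript
// Hypothetical types: these are illustrative and not part of the container-runtime API.
interface OpCounts {
    numSystemOps: number;
    numNonSystemOps: number;
}

interface WeightConfig {
    systemOpWeight: number; // assumed value; system ops count for less
    nonSystemOpWeight: number; // assumed value; non-system (user) ops count fully
    maxOps: number; // threshold applied to the weighted total
}

// Combine both counters into one weighted total.
function weightedOpCount(counts: OpCounts, config: WeightConfig): number {
    return counts.numSystemOps * config.systemOpWeight
        + counts.numNonSystemOps * config.nonSystemOpWeight;
}

// System ops still contribute, so a long run of them eventually triggers a summary
// (which keeps boot times in check), but non-system ops reach the threshold much sooner.
function shouldSummarize(counts: OpCounts, config: WeightConfig): boolean {
    return weightedOpCount(counts, config) > config.maxOps;
}
```

With assumed weights of 0.1 and 1.0 and `maxOps` of 100, 500 system ops plus 40 non-system ops stay under the threshold, while 101 non-system ops alone exceed it.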

vladsud · 2022-04-19T05:55:19Z

Linking to issue #9922
Hm, it looks more complicated than I anticipated :) Let's chat about it

packages/runtime/container-runtime/src/containerRuntime.ts

packages/runtime/container-runtime/src/runningSummarizer.ts

vladsud · 2022-04-19T20:09:08Z

packages/runtime/container-runtime/src/runningSummarizer.ts

        if (error !== undefined) {
            return;
        }
        this.heuristicData.lastOpSequenceNumber = sequenceNumber;

        // Check for enqueued on-demand summaries; Intentionally do nothing otherwise
        if (!this.tryRunEnqueuedSummary()) {
-            this.heuristicRunner?.run();
+            this.heuristicRunner?.run(this.numSystemOps, this.numNonSystemOps);


I have a feeling that it's better to move all the logic around tracking ops (including op handler) into this.heuristicRunner.

I don't know if I'm a huge fan of this idea. I see the heuristic runner as determining, based on the current data, whether a summary should be run or not. In my opinion, it shouldn't also be the one in charge of collecting/manipulating this data.
The only thing it should be "running" is very specific heuristic runners that will self-actuate (ex: idle timer).

If we also want this SummarizeHeuristicRunner class to be tracking the data, then we should move all the SummarizeHeuristicData data manipulation to this class as well. It wouldn't make sense to have it split. But, in doing this we need the SummaryGenerator to know about the ISummarizeHeuristicRunner, which isn't ideal or a good practice.

packages/runtime/container-runtime/src/runningSummarizer.ts

packages/runtime/container-runtime/src/summarizerHeuristics.ts

- Implement new strategy design for more configurability

- We shouldn't rely on "run" to start the idle timer. In my opinion, it's better to explicitly define a start method instead of expecting this side effect

vladsud · 2022-04-21T02:17:56Z

packages/runtime/container-runtime/src/summarizer.ts

-        this.opListener = (error: any, op: ISequencedDocumentMessage) => runningSummarizer.handleOp(error, op);
-        this.runtime.on("batchEnd", this.opListener);
+        this.opListener = (op: ISequencedDocumentMessage) => runningSummarizer.handleOp(op);
+        this.runtime.deltaManager.on("op", this.opListener);


Sorry, I gave you somewhat wrong advice.
What you did with ophandler is good for counting ops.

But I missed the fact that we should not run summaries in the middle of a batch. Basically, ops can be grouped into "batches". It's a mechanism to ensure that all related changes are reflected in the model before render happens (or any other async activity). All ops in a batch are always processed in one go, synchronously.
So the earlier code had a "batchEnd" handler, and summaries were considered only at the end of a batch, plus some system ops.
We can summarize after any system op. But we should not count "summarize" and "summaryAck" ops, as that will result in a positive feedback loop / infinitely summarizing summary ops :)

I'd still use the more general flow you have for counting (with the exception of explicitly excluding "summarize" and "summaryAck" ops). As for the trigger to run a summary, I think using a micro-task (i.e. Promise.resolve().then) after any non-summary op is a better way to deal with batches. It ensures that we will not break batches (it's a strong guarantee), and it will occasionally find a better point in time to run a summary. I.e. if we are getting 3 batches, each 100 ops long, the best time to summarize is after processing all 3 batches, not after just the first one. It's probably a no-op change today as maxOps = 1000 (so it's very unlikely we would accumulate that many), but we are going to change it to 100, and the chances of hitting this condition will increase.

Or we can keep old batchEnd flow + some system ops.

It would be nice to take a look if we have a test that validates we are not going into infinite summarization loop due to summary ops

BTW, it would be great to inspect code RE possibility of summary summarizing summary ops. When I look at telemetry, I see that minimum number of ops we summarize is 2. That suggests that we somehow can summarize only summary + summary ack ops.
I believe (based on experience in other places) that a timer callback can run after the timer is canceled, if it was scheduled before cancelation. You are sort of closing this gap with a check in runSummarize(), I believe.

One thing to add: It's acceptable not to filter summary ops, but simply disregard 2 system ops in our weighted formula (a thought based on the suggestion to use OpTracker in the other comment).

Promise.resolve().then

I'm not familiar with this pattern, but I got the gist of what it's used for reading through some usages. However, I don't understand how using this pattern would help prevent against running the summary multiple times?

It will not. It only helps to get on a clean stack (this pattern might be useful to avoid reentrancy as well).
Usually, we add a boolean to track it. See this.pendingReconnect for an example.

I'd say without anything else in consideration, running summary on clean stack is better than running it in the middle of op processing pipeline. You never know where it will backfire.
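
To make the micro-task pattern and the boolean guard discussed above concrete, here is a rough sketch. `DeferredSummaryTrigger` and its members are hypothetical names, and the sketch assumes (as stated earlier in this thread) that all ops in a batch are processed synchronously in one go.

```typescript
// Hypothetical helper: defers the "should we summarize?" check to a micro-task so it
// runs on a clean stack, after the current synchronous op processing has finished.
class DeferredSummaryTrigger {
    // Guard flag so that many ops in one synchronous run queue only one check.
    private pending = false;

    constructor(private readonly tryRunSummary: () => void) {}

    // Call this from the op handler for every op that may trigger a summary.
    public onOp(): void {
        if (this.pending) {
            return; // a check is already queued
        }
        this.pending = true;
        // The callback cannot run until the current synchronous code completes,
        // so it never fires in the middle of a synchronously processed batch.
        void Promise.resolve().then(() => {
            this.pending = false;
            this.tryRunSummary();
        });
    }
}
```

With this shape, 100 ops handled in one synchronous run produce a single deferred check. The micro-task alone does not prevent repeated runs; the `pending` flag does.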

Not sure if I understand these concepts correctly, but even if we're running on a clean stack wouldn't we potentially be running runningSummarizer.handleOp a large number of times that eventually end up doing nothing? I suppose the cost of executing all that code wouldn't be too high, but could potentially be a factor?

Since it might be difficult to track non-system ops nicely, would it make sense to hook onto the batchEnd event again and just track the number of system ops? Then we could do some simple math to determine how many non-system ops there are per summary.
We could also try to pass along the OpTracker object for this information.

Is your concern about perf? There is no way to track just system ops, so if we want to count them, we have to listen to all ops. It's not a big perf hit from these handlers, but we should definitely measure. We were already listening for all ops, so you are not adding more cost here.
It feels orthogonal to how / when we decide to run a summary.

packages/runtime/container-runtime/src/summarizerHeuristics.ts

vladsud · 2022-04-21T14:43:04Z

packages/runtime/container-runtime/src/summarizerHeuristics.ts

+
+            // We shouldn't attempt a summary if there are no new processed ops
+            const opsSinceLastAck = this.opsSinceLastAck;
+            if (opsSinceLastAck > 0) {


Is this check mostly to protect against timer firing when we have no ops?
If so, it might be worth moving this check into the timer callback, and possibly replacing the check here with an assert - it would more clearly communicate intention and invariants

Yes, this check is just to prevent ever triggering a summary from this class when we know there are no new ops to summarize.
Moving this check assumes that we only call ISummarizeHeuristicRunner.run() after we have processed an op. It's a trade off in my eyes. Do we want to ensure the intention of what happens before this "run" method is called or do we want to always prevent the unnecessary summary?
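
One possible shape for the suggestion above, as a sketch only. `IdleHeuristicSketch`, `onIdleTimerFired` and the `maxOps` parameter are hypothetical names; the sketch assumes `run()` is only ever called after an op has been processed, which is exactly the trade-off being discussed.

```typescript
// Hypothetical sketch: the "no new ops" guard lives in the timer callback,
// while run() asserts the invariant instead of silently returning.
class IdleHeuristicSketch {
    public opsSinceLastAck = 0;

    constructor(private readonly trySummarize: (reason: string) => void) {}

    // Invoked by the idle timer. The timer may fire when nothing new has arrived,
    // so this is where an unnecessary summary is prevented.
    public onIdleTimerFired(): void {
        if (this.opsSinceLastAck > 0) {
            this.trySummarize("idle");
        }
    }

    // Assumed to be called only after an op was processed, so zero ops indicates a bug.
    public run(maxOps: number): void {
        if (this.opsSinceLastAck <= 0) {
            throw new Error("run() called with no new ops to summarize");
        }
        if (this.opsSinceLastAck > maxOps) {
            this.trySummarize("maxOps");
        }
    }
}
```

The assert documents the invariant, at the cost of failing loudly if a future caller invokes `run()` without a preceding op.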

vladsud · 2022-04-21T14:53:28Z

packages/runtime/container-runtime/src/summarizerHeuristics.ts

        }
+
+        this.idleTimer?.restart();


I think this object is created on demand. That means we already have some trailing ops when we get here, and we have already lost the ability to separate them into user ops & system ops. We should look into doing one of the following:

Assume some mix (i.e. 100% of user ops, or 70% - 30%) and initiate state appropriately

Figure out how to move op counting out of this class, i.e. have some part of this logic be running all the time.

That's essentially the same as leveraging OpTracker class. It is created when we load from snapshot, before we process any ops, and thus it does not have same problem. Though we would need to subtract 2 from number of system ops to ignore summarize ops - I think that's fine approach.
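
A small sketch of the "subtract 2" idea, under the assumption that some tracker created at load time exposes separate counts of trailing ops. `TrailingOpCounts` and `initialHeuristicCounts` are hypothetical names and do not reflect the actual OpTracker API.

```typescript
// Hypothetical shape of the counts available when the heuristics object is created.
interface TrailingOpCounts {
    systemOps: number;
    nonSystemOps: number;
}

// Seed the heuristic counters, ignoring the "summarize" + "summaryAck" pair
// so that summary ops alone never look like work that needs summarizing.
function initialHeuristicCounts(
    tracked: TrailingOpCounts,
    summaryOpsToIgnore: number = 2,
): { numSystemOps: number; numNonSystemOps: number } {
    return {
        // Clamp at zero in case fewer system ops were seen than we try to ignore.
        numSystemOps: Math.max(0, tracked.systemOps - summaryOpsToIgnore),
        numNonSystemOps: tracked.nonSystemOps,
    };
}
```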

packages/runtime/container-runtime/src/summarizerHeuristics.ts

vladsud · 2022-04-21T15:17:47Z

packages/runtime/container-runtime/src/summarizerTypes.ts

    on(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this;
+    /** @deprecated 1.0, please remove all implementations and usage */
    removeListener(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this;


I believe this interface is used only internally in this package and should not be exported (if it is exported, your build will fail with back/forward compat tests). I'd rather make sure it's not exported and not care about deprecated markup - we can delete them right away (though see my comment on integrity of batches first).

If we're exporting the ContainerRuntime class don't we also need to export the interfaces it implements? I believe we'd have a visibility mismatch.

- Give both system and non-system ops the same weight so there's no guessing on what type of ops will be used

msfluid-bot · 2022-05-04T22:32:23Z

⯅ @fluid-example/bundle-size-tests: +2.62 KB

Metric Name	Baseline Size	Compare Size	Size Diff
aqueduct.js	391.99 KB	393.23 KB	⯅ +1.24 KB
connectionState.js	711 Bytes	711 Bytes	■ No change
containerRuntime.js	211.07 KB	212.31 KB	⯅ +1.24 KB
loader.js	146.19 KB	146.24 KB	⯅ +45 Bytes
map.js	38.29 KB	38.29 KB	■ No change
matrix.js	121.7 KB	121.7 KB	■ No change
odspDriver.js	146.02 KB	146.06 KB	⯅ +44 Bytes
odspPrefetchSnapshot.js	36.82 KB	36.87 KB	⯅ +44 Bytes
sharedString.js	138.31 KB	138.31 KB	■ No change
Total Size	1.23 MB	1.24 MB	⯅ +2.62 KB

Baseline commit: 4747eea

Generated by 🚫 dangerJS against 06eedb2

packages/runtime/container-runtime/src/containerRuntime.ts

NicholasCouri

packages/runtime/container-runtime/src/containerRuntime.ts

vladsud · 2022-05-30T06:04:02Z

packages/runtime/container-runtime/src/runningSummarizer.ts

+    public handleOp(op: ISequencedDocumentMessage) {
+        this.heuristicData.lastOpSequenceNumber = op.sequenceNumber;
+
+        if (isSystemMessage(op)) {


Let's use isRuntimeMessage here, or even better - just op.type === "op" (i.e. do not count the summarize op as a runtime op).
Not sure if we want to have a function for that.
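
For illustration, a sketch of counting with `op.type === "op"` while leaving summary ops out entirely, as suggested above. `OpLike` is a hypothetical stand-in for `ISequencedDocumentMessage`, and `OpCounterSketch` is not a class in this PR.

```typescript
// Hypothetical stand-in for the two fields of ISequencedDocumentMessage used here.
interface OpLike {
    type: string;
    sequenceNumber: number;
}

class OpCounterSketch {
    public numSystemOps = 0;
    public numNonSystemOps = 0;
    public lastOpSequenceNumber = 0;

    public handleOp(op: OpLike): void {
        this.lastOpSequenceNumber = op.sequenceNumber;

        // Never count summary ops, otherwise summaries would trigger more summaries.
        if (op.type === "summarize" || op.type === "summaryAck") {
            return;
        }

        if (op.type === "op") {
            this.numNonSystemOps++;
        } else {
            this.numSystemOps++;
        }
    }
}
```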

vladsud

looks good - please take a look at old & new comments.
Please make sure to test it manually (across all interesting combinations) and also see if we have any gaps in UT coverage, especially around things like not summarizing summary ops, triggering summary on all triggers (idle, maxtime, max ops).

## Description

These APIs were deprecated in [#9957](#9957) but weren't mentioned in BREAKING.md

## Description

These APIs were deprecated in [#9957](#9957) and mentioned as such in #12702

## Breaking Changes

The `"batchEnd"` listener in `ISummarizerRuntime` has been removed. Please remove all usage and implementations of `ISummarizerRuntime.on("batchEnd", ...)` and `ISummarizerRuntime.removeListener("batchEnd", ...)`. If these methods are needed, please refer to the `IContainerRuntimeBase` interface.

kian-thompson added 6 commits April 18, 2022 14:48

Track count of ops in RunningSummarizer

ab9b590

Pass op counts to SummarizeHeuristicRunner

84ddbfb

Add logic for weighting different op types

c437bc6

Reduce default maxOps to 100

926dfdb

Add comment with concerns

da8893e

Pass op counters to heuristicRunner

886b8b7

kian-thompson requested a review from vladsud April 19, 2022 00:34

github-actions bot added area: runtime Runtime related issues base: next PRs targeted against next branch labels Apr 19, 2022