-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
9922 Make summaries more accurately account user ops #9957
9922 Make summaries more accurately account user ops #9957
Conversation
Linking to issue #9922 |
if (error !== undefined) { | ||
return; | ||
} | ||
this.heuristicData.lastOpSequenceNumber = sequenceNumber; | ||
|
||
// Check for enqueued on-demand summaries; Intentionally do nothing otherwise | ||
if (!this.tryRunEnqueuedSummary()) { | ||
this.heuristicRunner?.run(); | ||
this.heuristicRunner?.run(this.numSystemOps, this.numNonSystemOps); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a feeling that it's better to move all the logic around tracking ops (including op handler) into this.heuristicRunner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if I'm a huge fan of this idea. I see the heuristic runner as determining, based on the current data, whether a summary should be run or not. In my opinion, it shouldn't also be the one in charge of collecting/manipulating this data.
The only thing it should be "running" is very specific heuristic runners that will self-actuate (ex: idle timer).
If we also want this SummarizeHeuristicRunner
class to be tracking the data, then we should move all the SummarizeHeuristicData
data manipulation to this class as well. It wouldn't make sense to have it split. But, in doing this we need the SummaryGenerator
to know about the ISummarizeHeuristicRunner
, which isn't ideal or a good practice.
- Implement new strategy design for more configurability
- We shouldn't rely on "run" to start the idle timer. In my opinion, it's better to explicity define a start method instead of expect this side functionality
this.opListener = (error: any, op: ISequencedDocumentMessage) => runningSummarizer.handleOp(error, op); | ||
this.runtime.on("batchEnd", this.opListener); | ||
this.opListener = (op: ISequencedDocumentMessage) => runningSummarizer.handleOp(op); | ||
this.runtime.deltaManager.on("op", this.opListener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I gave you somewhat wrong advice.
What you did with ophandler is good for counting ops.
But I missed the fact that we should not run summaries in the middle of a batch. Basically, ops can be grouped into "batches". it's a mechanism to ensure that all related changes are reflected in the model before render happens (or any other async activity). All ops in a batch are always processed in one go, synchronously.
So earlier code had "batchEnd" handler and summaries were considered only at the end of batch, plus some system ops.
We can summarize after any system op. But, we should not count "summarize" and "summaryAck" ops, as that will result in positive feedback loop / infinitely summarizing summary ops :)
I'd still use more general flow as you have for counting (with exception of excluding explicitly "summarize" and "summaryAck" ops). As for potential trigger to run a summary - I'd think that using micro-task (i.e. Promose.resolve().then) after any non-summary op is a better way to deal with batches. It ensures that we will not break batches (it's a strong guarantee), but it also will find occasionally better point in time to run a summary. I.e. if we are getting 3 batches each 100 ops long, the best time to summarize after processing all 3 batches, not after just the first one. It's probably noop change today as maxOps = 1000 (so it's very unlikely we would accumulate that many), but we are going to change it to 100, and chances of hitting this condition will increse.
Or we can keep old batchEnd flow + some system ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to take a look if we have a test that validates we are not going into infinite summarization loop due to summary ops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it would be great to inspect code RE possibility of summary summarizing summary ops. When I look at telemetry, I see that minimum number of ops we summarize is 2. That suggests that we somehow can summarize only summary + summary ack ops.
I believe (based on experience in other places) that timer callback can run after timer is canceled, if it was scheduled before cancelation. You sort of closing this gap with a check in runSummarize(), I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to add: It's acceptable not to filter summary ops, but simply disregard 2 system ops in our weighted formula (a though based on suggestion to use OpTracker in other comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promose.resolve().then
I'm not familiar with this pattern, but I got the gist of what it's used for reading through some usages. However, I don't understand how using this pattern would help prevent against running the summary multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not. It only helps to get on a clean stack (this pattern might be useful to avoid reentrancy as well).
Usually, we add a boolean to track it. See this.pendingReconnect for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say without anything else in consideration, running summary on clean stack is better than running it in the middle of op processing pipeline. You never know where it will backfire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand these concepts correctly, but even if we're running on a clean stack wouldn't we potentially be running runningSummarizer.handleOp
a large number of times that eventually end up doing nothing? I suppose the cost of executing all that code wouldn't be too high, but could potentially be a factor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it might be difficult to track non-system ops nicely, would it make sense to hook onto the batchEnd
event again and just track the number of system ops? Then we could do some simple math to determine how many non-system ops there are per summary.
We could also try to pass along the OpTracker object for this information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your concert about perf? There is no way to track just system ops, so if we want to count them, we have to listen to all ops. It not a big perf hit from these handlers, but we should definitely measure. We already were listening for all ops, so you are not adding more cost here.
It feels orthogonal to how / when we decide to run a summary.
|
||
// We shouldn't attempt a summary if there are no new processed ops | ||
const opsSinceLastAck = this.opsSinceLastAck; | ||
if (opsSinceLastAck > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check mostly to protect against timer firing when we have no ops?
If so, might be worth to move this check into timer callback, and possible replacing check here with assert - it would more clearly communicate intention and invariants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this check is just to prevent ever triggering a summary from this class when we know there are no new ops to summarize.
Moving this check assumes that we only call ISummarizeHeuristicRunner.run()
after we have processed an op. It's a trade off in my eyes. Do we want to ensure the intention of what happens before this "run" method is called or do we want to always prevent the unnecessary summary?
} | ||
|
||
this.idleTimer?.restart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this object is created on demand. That means we already have some trailing ops when we get here, and we already lost ability to separate them into user ops & system ops. We should looking into doing one of the following:
- Assume some mix (i.e. 100% of user ops, or 70% - 30%) and initiate state appropriately
- Figure out how to move op counting out of this class, i.e. have some part of this logic be running all the time.
- That's essentially the same as leveraging OpTracker class. It is created when we load from snapshot, before we process any ops, and thus it does not have same problem. Though we would need to subtract 2 from number of system ops to ignore summarize ops - I think that's fine approach.
on(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this; | ||
/** @deprecated 1.0, please remove all implementations and usage */ | ||
removeListener(event: "batchEnd", listener: (error: any, op: ISequencedDocumentMessage) => void): this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this interface is used only internally in this package and should not be exported (if it is exported, your build will fail with back/forward compat tests). I'd rather make sure it's not exported and not care about deprecated markup - we can delete them right away (though see my comment on integrity of batches first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're exporting the ContainerRuntime
class don't we also need to export the interfaces it implements? I believe we'd have a visibility mismatch.
- Give both system and non-system ops the same weight so there's no guessing on what type of ops will be used
⯅ @fluid-example/bundle-size-tests: +2.62 KB
Baseline commit: 4747eea |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public handleOp(op: ISequencedDocumentMessage) { | ||
this.heuristicData.lastOpSequenceNumber = op.sequenceNumber; | ||
|
||
if (isSystemMessage(op)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use here isRuntimeMessage, or even better - just op.type === "op" (i.e. not count summarize op as runtime op).
Not sure if we want to have a funciton for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good - please take a look at old & new comments.
Please make sure to test it manually (across all interesting combinations) and also see if we have any gaps in UT coverage, especially around things like not summarizing summary ops, triggering summary on all triggers (idle, maxtime, max ops).
[How contribute to this repo](https://github.com/microsoft/FluidFramework/blob/main/CONTRIBUTING.md). [Guidelines for Pull Requests](https://github.com/microsoft/FluidFramework/wiki/PR-Guidelines#guidelines). ## Description These APIs were deprecated in [#9957](#9957) but weren't mentioned in BREAKING.md
[How contribute to this repo](https://github.com/microsoft/FluidFramework/blob/main/CONTRIBUTING.md). [Guidelines for Pull Requests](https://github.com/microsoft/FluidFramework/wiki/PR-Guidelines#guidelines). ## Description These APIs were deprecated in [#9957](#9957) and mentioned as such in #12702 ## Breaking Changes The `"batchEnd"` listener in `ISummarizerRuntime` has been removed. Please remove all usage and implementations of `ISummarizerRuntime.on("batchEnd", ...)` and `ISummarizerRuntime.removeListener("batchEnd", ...)`. If these methods are needed, please refer to the `IContainerRuntimeBase` interface.
[How contribute to this repo](https://github.com/microsoft/FluidFramework/blob/main/CONTRIBUTING.md). [Guidelines for Pull Requests](https://github.com/microsoft/FluidFramework/wiki/PR-Guidelines#guidelines). ## Description These APIs were deprecated in [microsoft#9957](microsoft#9957) but weren't mentioned in BREAKING.md
#9922
AB#126
Currently, we are running heuristic summaries when we observe a certain number of ops come in. The problem is that we consider system and non-system ops to have the same weight, when we really need to be treating non-system ops as more important for creating summaries.
The basic idea is to give non-system ops more weight compared to system ops in terms of determining when to summarize. We don't want to entirely get rid of handling system ops for summaries because this could adversely affect boot times.