Perf Testing: Inconsistent methodology for testing typing performance #51383
Comments
I've tried so many strategies over time. I think waiting and not waiting are both valuable, but they are different metrics. Waiting impacts the performance a bit because some "async work" triggers while waiting, whereas typing continuously might not have to finish all the "async work" (we're only interested in the "sync" time so far, though). I think we basically should just pick one. I'd opt to wait a bit because in my (native) testing, it gives slightly more stable results.
Unfortunately, the wait is still best. I'm only 80% confident in this, but we still have no way to measure the time between some initial event and the moment some particular DOM is rendered in response. Typing is ten times harder because there's a chain of indirection between the keypress and the updated visuals. We would need even more than what things like … offer 😢
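As a rough, Chromium-only sketch, the Event Timing API gets part of the way toward "time from the input event to the next paint"; the `window.__keyLatencies` name is made up:

```js
// A sketch, not a drop-in solution: use the Event Timing API (Chromium) to
// approximate the time from a key event's hardware timestamp to the next
// paint after its handlers ran. `window.__keyLatencies` is a made-up name.
await page.evaluate( () => {
	window.__keyLatencies = [];
	new PerformanceObserver( ( list ) => {
		for ( const entry of list.getEntries() ) {
			if ( entry.name === 'keydown' || entry.name === 'keyup' ) {
				window.__keyLatencies.push( {
					inputDelay: entry.processingStart - entry.startTime,
					handlers: entry.processingEnd - entry.processingStart,
					// `duration` spans to the next paint, rounded to 8ms.
					toNextPaint: entry.duration,
				} );
			}
		}
	} ).observe( { type: 'event', durationThreshold: 16, buffered: true } );
} );
```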
Have we tried using the `requestIdleCallback` API between keystrokes? Something like:

```js
// start tracing
let i = 20;
while ( i-- ) {
	await page.keyboard.type( 'x' );
	// Wait until the browser reports an idle period (or 1s passes) before
	// typing the next character.
	await page.evaluate(
		() =>
			new Promise( ( resolve ) =>
				requestIdleCallback( resolve, { timeout: 1000 } )
			)
	);
}
// end tracing
```
@WunderBart I don't think we did. That said, we use that same API internally to register the "async tasks", so you might have competing callbacks and not achieve the desired result (waiting for the end of all the "async tasks").
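One alternative, as a rough sketch, is to wait for two consecutive animation frames instead; it avoids competing with the editor's idle callbacks, though it only guarantees a paint happened, not that the "async tasks" finished:

```js
// A sketch: wait for two consecutive animation frames instead of an idle
// callback. This avoids competing with the editor's own requestIdleCallback
// usage, but it only guarantees that a frame was painted, not that all
// pending "async tasks" finished.
await page.keyboard.type( 'x' );
await page.evaluate(
	() =>
		new Promise( ( resolve ) =>
			requestAnimationFrame( () => requestAnimationFrame( resolve ) )
		)
);
```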
We're using the tracing API, so in theory we have most of the metrics we need: event handling, layout, painting, etc. We probably just need to figure out the best combination of these metrics and produce something useful for us to track. The FID (First Input Delay) guide might inspire us in some way.
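For example, a rough sketch (not the actual helper the tests use) of pulling per-keystroke durations out of a `devtools.timeline` trace:

```js
// A sketch of extracting per-keystroke handler durations from a
// `devtools.timeline` trace file (not the actual helper used by the tests).
// In the Chrome trace format, `dur` is in microseconds.
const fs = require( 'fs' );

function getKeyEventDurations( traceFilePath ) {
	const { traceEvents } = JSON.parse(
		fs.readFileSync( traceFilePath, 'utf8' )
	);
	const durations = { keydown: [], keypress: [], keyup: [] };

	for ( const event of traceEvents ) {
		const type = event.args?.data?.type;
		if ( event.name === 'EventDispatch' && type in durations && event.dur ) {
			durations[ type ].push( event.dur / 1000 ); // milliseconds
		}
	}

	return durations;
}
```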
Right, if we're typing quickly, the browser is likely queueing up these keypress events and processing them as quickly as possible, leading to shorter times because there's a consistent flow of tasks and the JS engine is kept busy. Measuring without pauses between keystrokes would be the closest to measuring actual typing performance, which is what we're currently reporting via our custom reporter (min, average, and max time to type):

gutenberg/test/performance/config/performance-reporter.ts, lines 71 to 78 in 3c32470
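Roughly, that summary boils down to something like this (a sketch, not the actual reporter code):

```js
// A sketch of the kind of summary the reporter prints for a series of
// per-keystroke times (in milliseconds); not the actual reporter code.
const round = ( number ) => Math.round( number * 100 ) / 100;

function summarize( samples ) {
	return {
		min: round( Math.min( ...samples ) ),
		average: round(
			samples.reduce( ( sum, n ) => sum + n, 0 ) / samples.length
		),
		max: round( Math.max( ...samples ) ),
	};
}
```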
On the other hand, if we're doing large pauses between the keystrokes (which we are), I don't think we're measuring typing performance (as in "how fast can the app respond to typing") but rather the overall responsiveness of the application. IIUC, for this metric the browser/engine has to start up the task-processing sequence each time, which can introduce a slight delay. During the pauses, the engine likely processes other tasks (like rendering updates or other running scripts), which could cause the keypress events to take longer to process. It also makes sense that the metric is more stable this way. I guess what I'm trying to say is that (to me) the typing metrics are confusing as they are right now. For example, we're reporting the …
I did some input latency testing with different delays between keystrokes. Here's the snippet used:

```js
const delays = [
	0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 5000,
];

for ( const delay of delays ) {
	await browser.startTracing( page, {
		path: traceFilePath,
		screenshots: false,
		categories: [ 'devtools.timeline' ],
	} );

	const samples = 10;
	const throwaway = 1;
	await page.keyboard.type( 'x'.repeat( samples + throwaway ), { delay } );

	await browser.stopTracing();

	// `keyDownEvents` would come from parsing the trace file written above.
	for ( let i = throwaway; i < keyDownEvents.length; i++ ) {
		// get the average of keydown + keypress + keyup events
	}
}
```

Here are the visuals:

- Testing inside a paragraph within a small post consisting of 20 blocks
- Testing inside a paragraph within a large post consisting of 1.4K blocks
- Testing inside a container within a small post consisting of 23 blocks
@WunderBart this is a really great exploration. I've been stewing on this post for a while because I needed to find some time to focus on it and consider it.

first of all I think it'd be worth re-examining the graphs with waaaaaay more than 10 samples. I'm worried that with so few we may be seeing bias in the results due to things that are scheduled within the browser. for example, on the order of 10 seconds I can easily imagine that some tasks are deferred for idle, whereas if they haven't run by 50 seconds then I'd be surprised to find a repeat. that is, maybe we find a way to run the test for a minute or two of wall time and then divide the results by how many samples fit into that window.

the one-second latency dip seems like an incredible data point to examine more closely, which is also one of the reasons I'd like to see this in a longer sustained test. is the latency low because at one point within that ten seconds the garbage collector ran and made everything else speedy?

similarly, why does spacing out the keypresses to five seconds cause them to take so much longer to process? could we be bumping into background processes within the editor that are skewing the tests? the 200ms delay test, for example, is finished by the time the second 5000ms delay test runs. if the 200ms delay test also ran for 50 seconds, would we see a spike in the distribution of typing latency there as well?

now apart from this discussion, it raises an interesting idea I'd like to explore. I've seen a kind of test setup for monitoring input-to-display latency before where someone takes a high-speed camera (the 240 fps cameras on mobile phones are good enough) that can see both the keyboard and the monitor, and they time how long it takes from input to a visible change on the display. this isn't feasible today for the CI pipeline, but I would enjoy exploring it on a bench setup. with an appropriate control we could measure the "minimum" keyboard-to-screen latency using something like a plain … 🤔
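as a concrete (and purely illustrative) sketch of the wall-time idea, where the window and delay values are arbitrary:

```js
// a sketch: type for a fixed window of wall-clock time rather than for a
// fixed number of keystrokes, then report however many samples fit in the
// window. `windowMs` and `delay` are arbitrary illustrative values.
const windowMs = 60_000;
const delay = 1000;
const startedAt = Date.now();
let samples = 0;

while ( Date.now() - startedAt < windowMs ) {
	await page.keyboard.press( 'x' );
	await page.waitForTimeout( delay );
	samples++;
}
// per-keystroke latency would then be averaged over `samples`, which varies
// with the delay, instead of over a fixed count.
```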
in some profiling I was doing today, where I saw 112s typing lag, I noticed that a page refresh eliminated the lag. while I'm not sure why this is, it makes me think again about our unrealistic tests. whether time-based or edit-based, I wonder if we could set up the performance tests to run through something like 10,000 edits before measuring the results. maybe it's undo/redo, maybe it's the store duplicating listeners for the same components, maybe it's some other memory leak. something gets worse the longer one edits a document.
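a crude sketch of what that warm-up could look like (the count and the kind of edit are arbitrary):

```js
// a crude sketch: churn through a large number of edits before starting the
// measured run, to surface slowdowns that only appear after prolonged
// editing. the count and the kind of edit are arbitrary placeholders, and
// this would add noticeable wall time to the test.
const warmupEdits = 10000;
for ( let i = 0; i < warmupEdits; i++ ) {
	await page.keyboard.type( 'x' );
	await page.keyboard.press( 'Backspace' );
}
// ...then start tracing and run the usual typing measurement.
```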
Thanks for all the insight, @dmsnell. Interesting stuff!
I think this would be a good starting point, yeah. How many samples are we talking about, though? 100? 1000? 10000? 😅 From what you're saying (the longer we type, the laggier it gets), I'd also tweak the test so that a fresh page is used for each delay value.
I'd love to see how that relates to the numbers we're reading from the perf tools. It could potentially be a good way of calibrating the Playwright tests.
If you can reproduce that, a good way of creating such a test could be to use Playwright's test recorder feature, which generates the code from the performed actions.
I wasn't suggesting we run for a specific number of samples, but for a specific number of seconds.
this would mean that we have fewer samples for keypresses separated by longer intervals, but that may not matter. it's not like anyone types in any of these characteristic ways. one thing we could do for interest is simply record keypresses as someone types actual documents. make sure they type for a couple of minutes straight. in the test we could replay that. it wouldn't give as stable a metric, but it would be more representative of how someone types and of the impact of the performance on that. either way, running for X seconds or replaying real typing, I think these could give better data because both circumvent some bias that might be introduced by periodic processes.
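to sketch the replay idea (the recording file and its shape are hypothetical):

```js
// a sketch of replaying a recorded session: `typing-recording.json` is a
// hypothetical capture of { key, delta } pairs from someone typing a real
// document, where `delta` is the milliseconds since the previous keypress.
const fs = require( 'fs' );
const recording = JSON.parse(
	fs.readFileSync( 'typing-recording.json', 'utf8' )
);

for ( const { key, delta } of recording ) {
	await page.waitForTimeout( delta );
	await page.keyboard.press( key );
}
```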
I ran a test for a week in Safari and Chrome, and it doesn't seem like this is present in Core; it's in Automattic's P2 theme, or some block in there. although I know it's not present in a vanilla install, and although I don't know where it's coming from, I'm not quite ready to give Core a pass either, as I could imagine this appearing only with some co-factor. that is, maybe nothing in Core directly triggers the sequence of events that leads to this, but plugins using official APIs might, and so the defect is only exposed when people are implementing the block editor outside of the Core blocks and setup.
thanks for the link. I didn't know about it. could be a great way to capture "real typing and usage," including things like clicking on the undo button or inserting a block.
Reopening since the discussion here is still ongoing.
Why are there 1s+ delays between typing characters? That seems unrealistic. IMO we should simulate how a normal person types.
@ellatrix, the 1s delay value was based on this data. We're doing this because, by design, we care about metric stability, not the actual numbers. The measurements depend heavily on the machine running the tests, so the actual numbers would not be meaningful. The more stable the metric, the more likely we'll catch regressions. Does that make sense? cc: @dmsnell
Makes sense, but probably both are valuable.
I've been working on migrating the performance tests to Playwright and noticed that the way we test typing performance is different for the Post Editor vs. the Site Editor. For example, for the Site Editor, we simply hit `x` 200 times and record the metrics:

gutenberg/packages/e2e-tests/specs/performance/site-editor.test.js, lines 166 to 167 in 524d6ef
...while for the Post Editor, we make an arbitrary pause before each keystroke (depending on the test):

Typing:
gutenberg/packages/e2e-tests/specs/performance/post-editor.test.js, lines 158 to 165 in 524d6ef

Typing within containers:
gutenberg/packages/e2e-tests/specs/performance/post-editor.test.js, lines 208 to 213 in 524d6ef
I'd like to verify whether that pause is necessary, and if it is, whether we can replace it with waiting for some UI element (a rough sketch of what I mean is at the end of this post). This would simulate the real-user experience better and produce more consistent results, as `wait` timers with arbitrary values are known to be unstable.

/cc @youknowriad @dmsnell @kevin940726
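A sketch of what that replacement could look like, with a purely hypothetical check (the real canvas may live inside an iframe, so the condition would likely need adjusting):

```js
// Instead of pausing for an arbitrary amount of time after each keystroke,
// wait until the typed character actually shows up in the focused element.
// This check is illustrative only; the real editor canvas is more complex.
await page.keyboard.type( 'x' );
await page.waitForFunction( () =>
	document.activeElement?.textContent?.endsWith( 'x' )
);
```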