
Perf Testing: Inconsistent methodology for testing typing performance #51383

Open
WunderBart opened this issue Jun 9, 2023 · 16 comments · Fixed by #52022

Comments

@WunderBart
Member

WunderBart commented Jun 9, 2023

I've been working on migrating the performance tests to Playwright and noticed that the way we test typing performance differs between the Post Editor and the Site Editor. For example, for the Site Editor, we simply type "x" 200 times and record the metrics:

// Type "x" 200 times.
await page.keyboard.type( new Array( 200 ).fill( 'x' ).join( '' ) );

...while for the Post Editor, we add an arbitrary pause before each keystroke (its length depends on the test):

Typing

while ( i-- ) {
// Wait for the browser to be idle before starting the monitoring.
// The timeout should be big enough to allow all async tasks to run.
// And also to allow Rich Text to mark the change as persistent.
// eslint-disable-next-line no-restricted-syntax
await page.waitForTimeout( 2000 );
await page.keyboard.type( 'x' );
}

Typing within containers

while ( i-- ) {
// Wait for the browser to be idle before starting the monitoring.
// eslint-disable-next-line no-restricted-syntax
await page.waitForTimeout( 500 );
await page.keyboard.type( 'x' );
}

I'd like to verify whether that pause is necessary and, if so, whether we can replace it with waiting for some UI element. This would better simulate a real-user experience and produce more consistent results, as wait timers with arbitrary values are known to be unstable.
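
For illustration only, a sketch of what a non-timer wait could look like, assuming we settle on "the next frame has painted" as the signal (the 200 keystrokes and the double requestAnimationFrame are arbitrary choices, not a concrete proposal):

let i = 200;
while ( i-- ) {
	await page.keyboard.type( 'x' );
	// Instead of a fixed timeout, wait for the browser to paint: two nested
	// requestAnimationFrame callbacks are a common proxy for "the previous
	// frame has been rendered".
	await page.evaluate(
		() =>
			new Promise( ( resolve ) =>
				requestAnimationFrame( () => requestAnimationFrame( resolve ) )
			)
	);
}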

/cc @youknowriad @dmsnell @kevin940726

@WunderBart WunderBart self-assigned this Jun 9, 2023
@WunderBart WunderBart added the [Type] Performance Related to performance efforts label Jun 9, 2023
@youknowriad
Contributor

I've tried so many strategies over time. I think waiting or not waiting are both valuable but are different metrics. The waiting impacts the performance a bit because some "async work" triggers while waiting, while typing continuously might not have to finish all the "async work". (we're only interested in the "sync" time so far though).

I think we basically should just pick one. I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

@dmsnell
Member

dmsnell commented Jun 10, 2023

Unfortunately the wait is still best.

I'm only 80% confident on this, but we still have no way to measure the time between some initial event and the moment some particular DOM is rendered in response. Typing is ten times harder because there's a chain of indirection between the keypress and the updated visuals.

We would need even more than what things like MutationObserver give us, because we not only care about how long it takes to modify the DOM, but also how long it takes to run layout and render of that DOM.

😢

@WunderBart
Member Author

WunderBart commented Jun 12, 2023

I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

Unfortunately the wait is still best.

Have we tried using the requestIdleCallback? We could use the callback to resolve when the page becomes idle, e.g.

// start tracing
let i = 20;
while ( i-- ) {
	await page.keyboard.type( 'x' );
	await page.evaluate(
		() =>
			new Promise( ( resolve ) =>
				requestIdleCallback( resolve, { timeout: 1000 } )
			)
	);
}
// end tracing

@youknowriad
Contributor

youknowriad commented Jun 12, 2023

@WunderBart I don't think we did. That said, we use that same API internally to register the async tasks, so you might have competing callbacks and not achieve the desired result (waiting for the end of all the "async tasks").

@youknowriad

This comment was marked as resolved.

@kevin940726
Member

We're using the tracing API, so in theory we have most of the metrics we need: event handling, layout, painting, etc. We probably just need a way to figure out the best combination of these metrics and produce something useful for us to track. The FID (First Input Delay) guide might inspire us in some way.
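
As a rough sketch of what aggregating those trace metrics could look like (a sketch only, not the existing reporter code; it assumes the trace file has the usual { traceEvents: [...] } shape with durations in microseconds, and 'trace.json' is a placeholder path):

const fs = require( 'fs' );

const { traceEvents } = JSON.parse( fs.readFileSync( 'trace.json', 'utf8' ) );

// Collect the durations (converted to ms) of all complete events with a given name.
const durationsOf = ( name ) =>
	traceEvents
		.filter( ( event ) => event.name === name && event.dur )
		.map( ( event ) => event.dur / 1000 );

const eventHandling = durationsOf( 'EventDispatch' ); // DOM event handlers (could be narrowed to keydown/keypress/keyup)
const layout = durationsOf( 'Layout' );
const paint = durationsOf( 'Paint' );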

@WunderBart
Member Author

I've tried so many strategies over time. I think waiting or not waiting are both valuable but are different metrics. The waiting impacts the performance a bit because some "async work" triggers while waiting, while typing continuously might not have to finish all the "async work". (we're only interested in the "sync" time so far though).

I think we basically should just pick one. I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

Right, if we're typing quickly, the browser is likely queueing up these keypress events and processing them as quickly as possible, leading to shorter times because there's a consistent flow of tasks, and the JS engine is kept busy. Measuring without pauses between keystrokes would be the closest to measuring actual typing performance, which is what we're currently reporting via our custom reporter (min, average, and max time to type):

${ title( 'Typing:' ) }
Average time to type character: ${ success( round( average( type ) ) + 'ms' ) }
Slowest time to type character: ${ success(
round( Math.max( ...type ) ) + 'ms'
) }
Fastest time to type character: ${ success(
round( Math.min( ...type ) ) + 'ms'
) }` );

On the other hand, if we're doing large pauses between the keystrokes (which we are), I don't think we're measuring typing performance (as in "how fast can the app respond to typing") but the overall responsiveness of the application. IIUC, for this metric, the browser/engine has to start up the task-processing sequence each time, which can introduce a slight delay. During the pauses, the engine likely processes other tasks (like rendering updates or other running scripts), which could cause the keypress events to take longer to process. It also makes sense that the metric is more stable this way.

I guess what I'm trying to say is that (to me) the typing metrics are confusing as they are right now. For example, we're reporting the fastest/slowest time to type a character, but we're not putting any stress on the typing. It's also not actual "typing" if we're only pressing a key every 2 seconds. I agree that both ways make sense and that the one with waiting for the response would be more valuable right now, but I'd at least call the metric differently: inputLatency maybe? It would make more sense also because we seem to be adjusting the delay between the keystrokes depending on the size of the currently edited post (large post: 2s delay, small post: 0.5s delay). Does that make sense?

@WunderBart
Member Author

WunderBart commented Jun 29, 2023

I did some input latency (keydown + keypress + keyup) testing to nail down the best pause/delay value for metric stability. It looks like things are pretty stable between 500 and 900 ms, but oddly enough, the 1000 ms sample comes out on top in consecutive rounds, despite a weird dip in latency. Also, I'm scratching my head about the latency bump for the 2000 and 5000 ms samples. 😅 Any ideas?

Here's the snippet used:

const delays = [
    0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 5000,
];

for ( const delay of delays ) {
    await browser.startTracing( page, {
        path: traceFilePath,
        screenshots: false,
        categories: [ 'devtools.timeline' ],
    } );

    const samples = 10;
    const throwaway = 1;
    await page.keyboard.type( 'x'.repeat( samples + throwaway ), { delay } );

    await browser.stopTracing();

    for ( let i = throwaway; i < keyDownEvents.length; i++ ) {
        // get the average of keydown + keypress + keyup events
    }
}
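
For reference, the averaging placeholder above could be filled in roughly like this (keyDownEvents, keyPressEvents, and keyUpEvents are assumed to be per-keystroke handler durations in milliseconds, parsed from the trace file for the current delay):

    const latencies = [];
    for ( let i = throwaway; i < keyDownEvents.length; i++ ) {
        // Input latency for one keystroke = keydown + keypress + keyup handler time.
        latencies.push( keyDownEvents[ i ] + keyPressEvents[ i ] + keyUpEvents[ i ] );
    }
    const average =
        latencies.reduce( ( sum, value ) => sum + value, 0 ) / latencies.length;
    console.log( `Delay ${ delay }ms: average input latency ${ average.toFixed( 2 ) }ms` );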

Here are the visuals:

Testing inside a paragraph within a small post consisting of 20 blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

Testing inside a paragraph within a large post consisting of 1.4K blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

Testing inside a container within a small post consisting of 23 blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

@dmsnell
Member

dmsnell commented Jul 27, 2023

@WunderBart this is a really great exploration. I've been stewing on this post for a while because I needed to find some time to focus on it and consider it.

first of all I think it'd be worth re-examining the graphs with waaaaaay more than 10 samples. I'm worried that with so few we may be seeing bias in the results due to tasks that the browser schedules periodically. for example, on the order of 10 seconds I can easily imagine that some tasks are deferred for idle, whereas if they haven't run by 50 seconds then I'd be surprised to find a repeat. that is, maybe we find a way to run the test for a minute or two of wall time and then divide the results by how many samples fit into that window.

the one-second latency dip seems like an incredible data point to examine more closely, and it's also one of the reasons I'd like to see this in a longer sustained test. is the latency low because, at one point within those ten seconds, the garbage collector ran and made everything else speedy?

similarly, why does spacing out the keypresses to five seconds cause them to take so much longer to process? could we be bumping into background processes within the editor that are skewing the tests? the 200ms delay, for example, is finished by the time the second 5000ms delay test runs. if the 200ms delay test also runs for 50 seconds, would we see a spike in the distribution of type latency there as well?


now apart from this discussion, it raises an interesting idea I'd like to explore. I've seen a kind of test setup for monitoring input-to-display latency before, where someone takes a high-speed camera (the 240 fps cameras on mobile phones are good enough) that can see both the keyboard and the monitor, and they time how long it takes from input to a visible display. this is not feasible today for the CI pipeline, but I would enjoy exploring this on a bench setup.

with an appropriate control we could measure the "minimum" keyboard-to-screen latency using something like a plain TEXTAREA or a native app like Terminal.app or xterm. with that baseline data we could compare typing into the editor and report on how much additional latency we measure through React and the browser.

🤔

@dmsnell
Member

dmsnell commented Jul 28, 2023

in some profiling I was doing today, where I saw 112s typing lag, I noticed that a page refresh eliminated the lag. while I'm not sure why this is, it makes me think again about how unrealistic our tests are.

whether time-based or edit-based, I wonder if we could set up the performance tests to run through something like 10,000 edits before measuring the results.
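
as a rough sketch (arbitrary numbers, not a concrete proposal), that warm-up phase could look something like this:

// Warm-up: perform a large number of edits before any measurement starts.
const WARMUP_EDITS = 10000;
for ( let i = 0; i < WARMUP_EDITS; i++ ) {
	await page.keyboard.type( 'x' );
	// Occasionally split into a new block so the edits aren't confined to one paragraph.
	if ( i % 100 === 99 ) {
		await page.keyboard.press( 'Enter' );
	}
}
// ...only then start tracing and run the usual typing measurement.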

maybe it's undo/redo, maybe it's the store duplicating listeners for the same components, maybe it's some other memory leak. something gets worse the longer one edits a document.

@WunderBart
Member Author

Thanks for all the insight, @dmsnell. Interesting stuff!

first of all I think it'd be worth re-examining the graphs with waaaaaay more than 10 samples.

I think this would be a good starting point, yeah. How many samples are we talking about, though? 100? 1000? 10000? 😅 From what you're saying (the longer we type, the laggier it gets), I'd also tweak the test so that a fresh page is used for each delay value.
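
As a sketch of that tweak (setupEditorWithTestPost is a hypothetical helper, not an existing utility):

for ( const delay of delays ) {
	// Start each delay value on a fresh page to avoid accumulated editor state.
	const page = await browser.newPage();
	await setupEditorWithTestPost( page ); // hypothetical: open the editor and insert the test content
	// ...run the tracing + typing loop from the earlier snippet for this delay...
	await page.close();
}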

now apart from this discussion, it raises an interesting idea I'd like to explore. I've seen a kind of test setup for monitoring input-to-display latency before, where someone takes a high-speed camera (the 240 fps cameras on mobile phones are good enough) that can see both the keyboard and the monitor, and they time how long it takes from input to a visible display. this is not feasible today for the CI pipeline, but I would enjoy exploring this on a bench setup.

I'd love to see how that relates to the numbers we're reading from the perf tools. It could potentially be a good way of calibrating the Playwright tests.

in some profiling I was doing today, where I saw 112s typing lag, I noticed that a page refresh eliminated the lag. while I'm not sure why this is, it makes me think again about how unrealistic our tests are.

whether time-based or edit-based, I wonder if we could set up the performance tests to run through something like 10,000 edits before measuring the results.

maybe it's undo/redo, maybe it's the store duplicating listeners for the same components, maybe it's some other memory leak. something gets worse the longer one edits a document.

If you can reproduce that, a good way of creating such a test could be Playwright's test recorder feature, which generates code from the actions you perform.

@dmsnell
Member

dmsnell commented Aug 11, 2023

How many samples are we talking about, though?

I wasn't suggesting we run for a specific number of samples, but for a specific number of seconds

find a way to run the test for a minute or two of wall time and then divide the results by how many samples fit into that window.

this would mean that we have fewer samples for keypresses separated by longer intervals, but that may not matter. it's not like anyone types in any of these characteristic ways.
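
roughly, that could look like this (a sketch; the window length is arbitrary, and delay is whatever pause is under test):

// Type for a fixed wall-clock window, then normalize by the number of keystrokes.
const WINDOW_MS = 60 * 1000;
const start = Date.now();
let samples = 0;
while ( Date.now() - start < WINDOW_MS ) {
	// eslint-disable-next-line no-restricted-syntax
	await page.waitForTimeout( delay );
	await page.keyboard.type( 'x' );
	samples++;
}
// per-keystroke figure = total typing cost from the trace / samples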

one thing we could do for interest is simply record keypresses as someone types actual documents. make sure they type for a couple of minutes straight. in the test we could replay that. it wouldn't give as stable a metric, but it would be more representative of how someone types and of the impact of the performance on that.
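
replaying could then be as simple as the following (recording is a hypothetical array of { key, pauseMs } entries captured from a real session):

for ( const { key, pauseMs } of recording ) {
	// Reproduce the original pause before each keystroke.
	// eslint-disable-next-line no-restricted-syntax
	await page.waitForTimeout( pauseMs );
	await page.keyboard.press( key );
}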

either way: running for X seconds or replaying real typing, I think these could give better data because both circumvent some bias that might be introduced by periodic processes.

If you can reproduce that, a good way of creating such a test could be using Playwright's test recorder feature that generates the code from the performed actions.

I ran a test for a week in Safari and Chrome and it doesn't seem like this is present in Core - it's in Automattic's P2 theme, or some block in there. Though I know it's not present in a vanilla install and though I don't know where it's coming from, I'm not quite ready to give Core a pass either, as I could imagine this appearing only with some co-factor. That is, maybe nothing in Core directly triggers the sequence of events that leads to this, but plugins using official APIs might do so, and so the defect is only exposed when people are implementing the block editor outside of the Core blocks and setup.

test recorder

thanks for the link. I didn't know about it. could be a great way to capture "real typing and usage," including things like clicking on the undo button or inserting a block

@WunderBart
Member Author

Reopening since the discussion here is still ongoing.

@ellatrix
Member

ellatrix commented Dec 6, 2023

Why are there 1s+ pauses between typed characters? That seems unrealistic? IMO we should simulate how a normal person types?

@WunderBart
Member Author

WunderBart commented Dec 7, 2023

Why are there 1s+ pauses between typed characters? That seems unrealistic? IMO we should simulate how a normal person types?

@ellatrix, the 1s delay value was based on this data. We're doing this because, by design, we care about metric stability, not the actual numbers. The measurements depend heavily on the machine running the tests, so the absolute numbers would not be meaningful. The more stable the metric, the more likely we'll catch regressions. Does that make sense?

cc: @dmsnell

@ellatrix
Member

ellatrix commented Dec 8, 2023

Makes sense, but probably both are valuable.
