
Perf Testing: Inconsistent methodology for testing typing performance #51383

Open
WunderBart opened this issue Jun 9, 2023 · 16 comments · Fixed by #52022

Comments

@WunderBart
Member

WunderBart commented Jun 9, 2023

I've been working on migrating the performance tests to Playwright and noticed that the way we test typing performance differs between the Post Editor and the Site Editor. For example, for the Site Editor, we simply type "x" 200 times and record the metrics:

// Type "x" 200 times.
await page.keyboard.type( new Array( 200 ).fill( 'x' ).join( '' ) );

...while for the Post Editor, we add an arbitrary pause before each keystroke (its length depends on the test):

Typing

while ( i-- ) {
// Wait for the browser to be idle before starting the monitoring.
// The timeout should be big enough to allow all async tasks to run.
// And also to allow Rich Text to mark the change as persistent.
// eslint-disable-next-line no-restricted-syntax
await page.waitForTimeout( 2000 );
await page.keyboard.type( 'x' );
}

Typing within containers

while ( i-- ) {
// Wait for the browser to be idle before starting the monitoring.
// eslint-disable-next-line no-restricted-syntax
await page.waitForTimeout( 500 );
await page.keyboard.type( 'x' );
}

I'd like to verify whether that pause is necessary and, if so, whether we can replace it with waiting for some UI element. This would better simulate a real-user experience and produce more consistent results, as wait timers with arbitrary values are known to be unstable.
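
For illustration only, a sketch of what a non-timer wait could look like, assuming we settle on "the next frame has painted" as the signal (the 200 keystrokes and the double requestAnimationFrame are arbitrary choices, not a concrete proposal):

let i = 200;
while ( i-- ) {
	await page.keyboard.type( 'x' );
	// Instead of a fixed timeout, wait for the browser to paint: two nested
	// requestAnimationFrame callbacks are a common proxy for "the previous
	// frame has been rendered".
	await page.evaluate(
		() =>
			new Promise( ( resolve ) =>
				requestAnimationFrame( () => requestAnimationFrame( resolve ) )
			)
	);
}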

/cc @youknowriad @dmsnell @kevin940726

@WunderBart WunderBart self-assigned this Jun 9, 2023
@WunderBart WunderBart added the [Type] Performance Related to performance efforts label Jun 9, 2023
@youknowriad
Contributor

I've tried so many strategies over time. I think waiting or not waiting are both valuable but are different metrics. The waiting impacts the performance a bit because some "async work" triggers while waiting, while typing continuously might not have to finish all the "async work". (we're only interested in the "sync" time so far though).

I think we basically should just pick one. I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

@dmsnell
Member

dmsnell commented Jun 10, 2023

Unfortunately the wait is still best.

I'm only 80% confident on this, but we still have no way to measure the time between some initial event and the moment some particular DOM is rendered in response. Typing is ten times harder because there's a chain of indirection between the keypress and the updated visuals.

We would need even more than what things like MutationObserver give us, because we not only care about how long it takes to modify the DOM, but also how long it takes to run layout and render of that DOM.

😢

@WunderBart
Member Author

WunderBart commented Jun 12, 2023

I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

Unfortunately the wait is still best.

Have we tried using the requestIdleCallback? We could use the callback to resolve when the page becomes idle, e.g.

// start tracing
let i = 20;
while ( i-- ) {
	await page.keyboard.type( 'x' );
	await page.evaluate(
		() =>
			new Promise( ( resolve ) =>
				requestIdleCallback( resolve, { timeout: 1000 } )
			)
	);
}
// end tracing

@youknowriad
Contributor

youknowriad commented Jun 12, 2023

@WunderBart I don't think we did. That said, we use that same API internally to register the async tasks, so you might have competing callbacks and not achieve the desired result (waiting for the end of all the "async tasks").

@youknowriad

This comment was marked as resolved.

@kevin940726
Member

We're using the tracing API, so in theory we have most of the metrics we need: event handling, layout, painting, etc. We probably just need a way to figure out the best combination of these metrics and produce something useful for us to track. The FID (First Input Delay) guide might inspire us in some way.
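
As a rough sketch of what aggregating those trace metrics could look like (a sketch only, not the existing reporter code; it assumes the trace file has the usual { traceEvents: [...] } shape with durations in microseconds, and 'trace.json' is a placeholder path):

const fs = require( 'fs' );

const { traceEvents } = JSON.parse( fs.readFileSync( 'trace.json', 'utf8' ) );

// Collect the durations (converted to ms) of all complete events with a given name.
const durationsOf = ( name ) =>
	traceEvents
		.filter( ( event ) => event.name === name && event.dur )
		.map( ( event ) => event.dur / 1000 );

const eventHandling = durationsOf( 'EventDispatch' ); // DOM event handlers (could be narrowed to keydown/keypress/keyup)
const layout = durationsOf( 'Layout' );
const paint = durationsOf( 'Paint' );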

@WunderBart
Member Author

I've tried so many strategies over time. I think waiting or not waiting are both valuable but are different metrics. The waiting impacts the performance a bit because some "async work" triggers while waiting, while typing continuously might not have to finish all the "async work". (we're only interested in the "sync" time so far though).

I think we basically should just pick one. I'd opt to wait a bit because in my testing (native testing), it gives a bit more stable results.

Right, if we're typing quickly, the browser is likely queueing up these keypress events and processing them as quickly as possible, leading to shorter times because there's a consistent flow of tasks, and the JS engine is kept busy. Measuring without pauses between keystrokes would be the closest to measuring actual typing performance, which is what we're currently reporting via our custom reporter (min, average, and max time to type):

${ title( 'Typing:' ) }
Average time to type character: ${ success( round( average( type ) ) + 'ms' ) }
Slowest time to type character: ${ success(
round( Math.max( ...type ) ) + 'ms'
) }
Fastest time to type character: ${ success(
round( Math.min( ...type ) ) + 'ms'
) }` );

On the other hand, if we're doing large pauses between the keystrokes (which we are), I don't think we're measuring typing performance (as in "how fast can the app respond to typing") but the overall responsiveness of the application. IIUC, for this metric, the browser/engine has to start up the task-processing sequence each time, which can introduce a slight delay. During the pauses, the engine likely processes other tasks (like rendering updates or other running scripts), which could cause the keypress events to take longer to process. It also makes sense that the metric is more stable this way.

I guess what I'm trying to say is that (to me) the typing metrics are confusing as they are right now. For example, we're reporting the fastest/slowest time to type a character, but we're not putting any stress on the typing. It's also not actual "typing" if we're only pressing a key every 2 seconds. I agree that both ways make sense and that the one with waiting for the response would be more valuable right now, but I'd at least call the metric differently: inputLatency maybe? It would make more sense also because we seem to be adjusting the delay between the keystrokes depending on the size of the currently edited post (large post: 2s delay, small post: 0.5s delay). Does that make sense?

@WunderBart
Member Author

WunderBart commented Jun 29, 2023

I did some input latency (keydown + keypress + keyup) testing to nail down the best pause/delay value for metric stability. It looks like things are pretty stable between 500 and 900 ms, but oddly enough, the 1000 ms sample comes out on top in consecutive rounds, despite a weird dip in latency. Also, I'm scratching my head about the latency bump for the 2000 and 5000 ms samples. 😅 Any ideas?

Here's the snippet used:

const delays = [
    0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 5000,
];

for ( const delay of delays ) {
    await browser.startTracing( page, {
        path: traceFilePath,
        screenshots: false,
        categories: [ 'devtools.timeline' ],
    } );

    const samples = 10;
    const throwaway = 1;
    await page.keyboard.type( 'x'.repeat( samples + throwaway ), { delay } );

    await browser.stopTracing();

    for ( let i = throwaway; i < keyDownEvents.length; i++ ) {
        // get the average of keydown + keypress + keyup events
    }
}
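
For reference, the averaging placeholder above could be filled in roughly like this (keyDownEvents, keyPressEvents, and keyUpEvents are assumed to be per-keystroke handler durations in milliseconds, parsed from the trace file for the current delay):

    const latencies = [];
    for ( let i = throwaway; i < keyDownEvents.length; i++ ) {
        // Input latency for one keystroke = keydown + keypress + keyup handler time.
        latencies.push( keyDownEvents[ i ] + keyPressEvents[ i ] + keyUpEvents[ i ] );
    }
    const average =
        latencies.reduce( ( sum, value ) => sum + value, 0 ) / latencies.length;
    console.log( `Delay ${ delay }ms: average input latency ${ average.toFixed( 2 ) }ms` );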

Here are the visuals:

Testing inside a paragraph within a small post consisting of 20 blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

Testing inside a paragraph within a large post consisting of 1.4K blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

Testing inside a container within a small post consisting of 23 blocks:

[Chart: average input latency (ms) vs. keypress delay (ms)]

@dmsnell
Member

dmsnell commented Jul 27, 2023

@WunderBart this is a really great exploration. I've been stewing on this post for a while because I needed to find some time to focus on it and consider it.

first of all I think it'd be worth re-examining the graphs with waaaaaay more than 10 samples. I'm worried that with so few we may be seeing bias in the results due to tasks that the browser schedules periodically. for example, on the order of 10 seconds I can easily imagine that some tasks are deferred for idle, whereas if they haven't run by 50 seconds then I'd be surprised to find a repeat. that is, maybe we find a way to run the test for a minute or two of wall time and then divide the results by how many samples fit into that window.

the one-second latency dip seems like an incredible data point to examine more closely, and it's also one of the reasons I'd like to see this in a longer sustained test. is the latency low because, at one point within those ten seconds, the garbage collector ran and made everything else speedy?

similarly, why does spacing out the keypresses to five seconds cause them to take so much longer to process? could we be bumping into background processes within the editor that are skewing the tests? the 200ms delay, for example, is finished by the time the second 5000ms delay test runs. if the 200ms delay test also runs for 50 seconds, would we see a spike in the distribution of type latency there as well?


now apart from this discussion, it raises an interesting idea I'd like to explore. I've seen a kind of test setup for monitoring input-to-display latency before, where someone takes a high-speed camera (the 240 fps cameras on mobile phones are good enough) that can see both the keyboard and the monitor, and they time how long it takes from input to a visible display. this is not feasible today for the CI pipeline, but I would enjoy exploring this on a bench setup.

with an appropriate control we could measure the "minimum" keyboard-to-screen latency using something like a plain TEXTAREA or a native app like Terminal.app or xterm. with that baseline data we could compare typing into the editor and report on how much additional latency we measure through React and the browser.

🤔

@dmsnell
Member

dmsnell commented Jul 28, 2023

in some profiling I was doing today, where I saw 112s typing lag, I noticed that a page refresh eliminated the lag. while I'm not sure why this is, it makes me think again about how unrealistic our tests are.

whether time-based or edit-based, I wonder if we could set up the performance tests to run through something like 10,000 edits before measuring the results.
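
as a rough sketch (arbitrary numbers, not a concrete proposal), that warm-up phase could look something like this:

// Warm-up: perform a large number of edits before any measurement starts.
const WARMUP_EDITS = 10000;
for ( let i = 0; i < WARMUP_EDITS; i++ ) {
	await page.keyboard.type( 'x' );
	// Occasionally split into a new block so the edits aren't confined to one paragraph.
	if ( i % 100 === 99 ) {
		await page.keyboard.press( 'Enter' );
	}
}
// ...only then start tracing and run the usual typing measurement.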

maybe it's undo/redo, maybe it's the store duplicating listeners for the same components, maybe it's some other memory leak. something gets worse the longer one edits a document.

@WunderBart
Member Author

Thanks for all the insight, @dmsnell. Interesting stuff!

first of all I think it'd be worth re-examining the graphs with waaaaaay more than 10 samples.

I think this would be a good starting point, yeah. How many samples are we talking about, though? 100? 1000? 10000? 😅 From what you're saying (the longer we type, the laggier it gets), I'd also tweak the test so that a fresh page is used for each delay value.
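
As a sketch of that tweak (setupEditorWithTestPost is a hypothetical helper, not an existing utility):

for ( const delay of delays ) {
	// Start each delay value on a fresh page to avoid accumulated editor state.
	const page = await browser.newPage();
	await setupEditorWithTestPost( page ); // hypothetical: open the editor and insert the test content
	// ...run the tracing + typing loop from the earlier snippet for this delay...
	await page.close();
}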

now apart from this discussion, it raises an interesting idea I'd like to explore. I've seen a kind of test setup for monitoring input-to-display latency before, where someone takes a high-speed camera (the 240 fps cameras on mobile phones are good enough) that can see both the keyboard and the monitor, and they time how long it takes from input to a visible display. this is not feasible today for the CI pipeline, but I would enjoy exploring this on a bench setup.

I'd love to see how that relates to the numbers we're reading from the perf tools. It could potentially be a good way of calibrating the Playwright tests.

in some profiling I was doing today, where I saw 112s typing lag, I noticed that a page refresh eliminated the lag. while I'm not sure why this is, it makes me think again about how unrealistic our tests are.

whether time-based or edit-based, I wonder if we could set up the performance tests to run through something like 10,000 edits before measuring the results.

maybe it's undo/redo, maybe it's the store duplicating listeners for the same components, maybe it's some other memory leak. something gets worse the longer one edits a document.

If you can reproduce that, a good way of creating such a test could be Playwright's test recorder feature, which generates code from the actions you perform.

@dmsnell
Member

dmsnell commented Aug 11, 2023

How many samples are we talking about, though?

I wasn't suggesting we run for a specific number of samples, but for a specific number of seconds

find a way to run the test for a minute or two of wall time and then divide the results by how many samples fit into that window.

this would mean that we have fewer samples for keypresses separated by longer intervals, but that may not matter. it's not like anyone types in any of these characteristic ways.
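
roughly, that could look like this (a sketch; the window length is arbitrary, and delay is whatever pause is under test):

// Type for a fixed wall-clock window, then normalize by the number of keystrokes.
const WINDOW_MS = 60 * 1000;
const start = Date.now();
let samples = 0;
while ( Date.now() - start < WINDOW_MS ) {
	// eslint-disable-next-line no-restricted-syntax
	await page.waitForTimeout( delay );
	await page.keyboard.type( 'x' );
	samples++;
}
// per-keystroke figure = total typing cost from the trace / samples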

one thing we could do for interest is simply record keypresses as someone types actual documents. make sure they type for a couple of minutes straight. in the test we could replay that. it wouldn't give as stable a metric, but it would be more representative of how someone types and of the impact of the performance on that.
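
replaying could then be as simple as the following (recording is a hypothetical array of { key, pauseMs } entries captured from a real session):

for ( const { key, pauseMs } of recording ) {
	// Reproduce the original pause before each keystroke.
	// eslint-disable-next-line no-restricted-syntax
	await page.waitForTimeout( pauseMs );
	await page.keyboard.press( key );
}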

either way: running for X seconds or replaying real typing, I think these could give better data because both circumvent some bias that might be introduced by periodic processes.

If you can reproduce that, a good way of creating such a test could be using Playwright's test recorder feature that generates the code from the performed actions.

I ran a test for a week in Safari and Chrome and it doesn't seem like this is present in Core - it's in Automattic's P2 theme, or some block in there. Though I know it's not present in a vanilla install and though I don't know where it's coming from, I'm not quite ready to give Core a pass either, as I could imagine this appearing only with some co-factor. That is, maybe nothing in Core directly triggers the sequence of events that leads to this, but plugins using official APIs might do so, and so the defect is only exposed when people are implementing the block editor outside of the Core blocks and setup.

test recorder

thanks for the link. I didn't know about it. could be a great way to capture "real typing and usage," including things like clicking on the undo button or inserting a block

@WunderBart
Member Author

Reopening since the discussion here is still ongoing.

@ellatrix
Member

ellatrix commented Dec 6, 2023

Why are there 1s+ pauses between typed characters? That seems unrealistic? IMO we should simulate how a normal person types?

@WunderBart
Member Author

WunderBart commented Dec 7, 2023

Why are there 1s+ pauses between typed characters? That seems unrealistic? IMO we should simulate how a normal person types?

@ellatrix, the 1s delay value was based on this data. We're doing this because, by design, we care about metric stability, not the actual numbers. The measurements depend heavily on the machine running the tests, so the absolute numbers would not be meaningful. The more stable the metric, the more likely we'll catch regressions. Does that make sense?

cc: @dmsnell

@ellatrix
Member

ellatrix commented Dec 8, 2023

Makes sense, but probably both are valuable.
