
[HOLD for payment 2024-06-28] LOW: [Tracking] Metric - Send a message TTI (Performance audit by Callstack) #35225

Closed
kacper-mikolajczak opened this issue Jan 26, 2024 · 22 comments

@kacper-mikolajczak
Contributor

kacper-mikolajczak commented Jan 26, 2024

Intro

As part of the Callstack performance audit, this issue tracks and summarises the phases of the Send a message metric.

The motivation behind measuring the TTI of this particular behaviour was the many reported issues of poor UX whenever a user tried to send a message. Thanks to the offline-first approach, we can discard network latency from the measurements while still keeping an eye on optimistic/actual updates during the analysis.

Current Issue Owner: @muttmuure
@kacper-mikolajczak
Contributor Author

kacper-mikolajczak commented Jan 26, 2024

Measurements

Details

Metric: Send a message TTI (ms)
Phase: Measurements
Commit hash: 76e5b1c

Workflow

  1. Open app
  2. Go to report screen
  3. Type a message with emoji
  4. Send the message
  5. Wait until message is displayed

Markers

Below is the list of defined markers that were used during testing:

  • messageProcessing: time between the Send the message (4) and Wait until message is displayed (5) steps of the workflow

Their implementation can be found in e8682e8.

The messageProcessing marker needed extra metadata that is stored in the PERF module and can later be used as part of analytics. If that can be avoided, please let me know so we can improve/replace it 🔧 👍
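For illustration, here is a minimal sketch of how such a marker pair could be wired up with the User Timing API from react-native-performance; the real marker lives in the app's Performance/PERF module and the function names below are placeholders, not the actual implementation:

```ts
// Hedged sketch: a messageProcessing marker pair built on the User Timing API
// exposed by react-native-performance. Function names are illustrative only.
import performance from 'react-native-performance';

// Step 4 of the workflow: the user taps "Send".
function onSendMessagePressed() {
    performance.mark('messageProcessing_start');
}

// Step 5 of the workflow: the sent message is rendered in the report.
function onMessageDisplayed() {
    performance.mark('messageProcessing_end');
    // Produces the messageProcessing duration reported in the tables below.
    performance.measure('messageProcessing', 'messageProcessing_start', 'messageProcessing_end');
}
```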

Measurements

Measurements were gathered by running the aforementioned workflow 10 times.

messageProcessing [ms]
14588.59
11782.90
20523.72
15050.67
14571.76
17800.07
13568.82
16305.16
10060.16
10869.97

The average time of messageProcessing is: 14512.18 ms

Auxiliary data

Additionally, for a broader perspective, resource consumption during one of the test runs is included.

CPU usage
JS thread usage
RAM usage

@mountiny
Contributor

@kacper-mikolajczak Thanks! Sorry if I am misunderstanding this but did it take 14 seconds from clicking Send button to show the message in the report in your testing? 😱

I have never experienced delay remotely close to this

@adhorodyski
Contributor

adhorodyski commented Jan 26, 2024

@kacper-mikolajczak Thanks! Sorry if I am misunderstanding this but did it take 14 seconds from clicking Send button to show the message in the report in your testing? 😱

I have never experienced delay remotely close to this

It may be the heavy account we're using (eduardo.graciano+newdotperf@callstack.com) - I saw this myself; the app was barely usable at that point (sending a message was hardly doable). The measurements provided by @kacper-mikolajczak unfortunately match my own experience in a chat on this account.

This was all on a physical device, a production build and this exact account. @rinej will post a recording soon so we can see what the experience looked like.

@kacper-mikolajczak
Contributor Author

kacper-mikolajczak commented Jan 26, 2024

Hi @mountiny 👋 No misunderstanding whatsoever - I did not believe it at first either :D

As @adhorodyski already pointed out, the measurements were taken according to the guidelines, but the app with this specific setup is not performing as it should. I will post the analysis soon to unwrap what is going on under the hood 👍

@rinej
Contributor

rinej commented Jan 26, 2024

@mountiny I am sending a video example showing around 15s to send a message.
It was recorded on a real device on eduardo.graciano+newdotperf@callstack.com

short-flow.mp4

Additionally, on that account sending a message is sometimes buggy: it sends just part of the message and keeps the whole message in the input.

Example videos:

bug1.mp4
bug2.mp4

@mountiny
Contributor

Holy Moly

@adhorodyski
Contributor

Holy Moly

Give us some time we're already cookin' something!! 😊

@melvin-bot melvin-bot bot added the Monthly KSv2 label Jan 29, 2024
@kacper-mikolajczak
Contributor Author

kacper-mikolajczak commented Feb 2, 2024

Analysis

In order to analyse the underlying issues on the Android platform, the app was run in the release build variant and profiled using both native and JS tools (Android Studio Profiler, Chrome Profiler and React Profiler, respectively).

As a brief overview, here is a high-level representation of analysis flow:

High-level picture of the analysis flow

Finding the dark side

Android Studio Profiler is used to determine whether performance issues stem from the native side, the JS side, or both.

The results of native profiling when plugged into Perfetto visualiser look as follows:

General overview of CPU usage across threads, Perfetto Visualiser

The analysis focuses on the 4 main threads from a React Native standpoint:

  • mqt_js (JS Thread)
  • UI Thread
  • mqt_native_modules
  • Render Thread

Looking at the CPU usage summary, there are two areas of high-density calculations. The biggest concern is the overload of the mqt_js thread in those two timeframes, which the analysis will focus on the most.

JS Profiling

The detailed JS profiling trace is obtained thanks to the react-native-release-profiler library, which makes it possible to profile the Hermes (JS) thread in a release build. The trace is presented via Chrome DevTools, which in turn helps determine the exact root cause(s) of JavaScript-land issues.
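For reference, a hedged sketch of how such a trace can be captured, assuming the library's startProfiling/stopProfiling pair; the exact call sites used during the audit are not shown in this issue:

```ts
// Sketch only: capture a Hermes trace around the send-message flow in a release build.
import {startProfiling, stopProfiling} from 'react-native-release-profiler';

async function profileSendMessageFlow(runFlow: () => Promise<void>) {
    startProfiling();
    await runFlow(); // type and send the message, wait until it is displayed
    // Passing `true` saves the trace to the device so it can be pulled and
    // loaded into Chrome DevTools (Performance tab).
    const tracePath = await stopProfiling(true);
    console.log(`Hermes trace saved at: ${tracePath}`);
}
```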

Baseline hermes profiling trace, Chrome Devtools

From the trace above, the key take-aways are:

  • The process of sending a message requires calculations that block the JS thread for more than 27 seconds
  • This means the app is unresponsive for that entire time from the user's perspective.

The main goal of the analysis is to reduce the amount of calculation as much as possible.

Each section below describes a specific finding and its effect on overall performance. The results are summed up between sections, with the final outcome presented in the very last part.

Each section with meaningful changes in timings ends with a table like the one below. The first (baseline) and last (final) tables consist of 10 samples, while the intermediate ones have 5 samples for brevity.

# Duration [ms]
1 14588.59
2 11782.90
3 20523.72
4 15050.67
5 14571.76
6 17800.07
7 13568.82
8 16305.16
9 10060.16
10 10869.97
Avg 14512.18

Hermes Intl localeCompare

The highest amount of work is performed by the native string method localeCompare which, summed over all calls, takes about 23 seconds to complete. The issue stems from Hermes engine Intl problems, which are described in more detail in this proposal by Hur, who found the original fix.
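For illustration, a minimal sketch of the class of fix involved - reusing a single Intl.Collator instead of repeated localeCompare calls - under the assumption that this mirrors the direction of the linked proposal (the actual patch may differ):

```ts
// Illustrative sketch, not the exact patch: construct one Intl.Collator and reuse it,
// instead of calling localeCompare, which can pay the collator-construction cost on
// every comparison in Hermes.
const collator = new Intl.Collator('en', {usage: 'sort', sensitivity: 'base'});

function compareReportNames(a: string, b: string): number {
    return collator.compare(a, b);
}

// e.g. reportNames.sort(compareReportNames) instead of
//      reportNames.sort((a, b) => a.localeCompare(b, 'en', {sensitivity: 'base'}));
```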

The result of fixing the issue is a drastic reduction in the time required to complete all the calculations: the workload went from 27 seconds down to about 5 seconds.

Stack trace after Hermes Intl changes shows significant improvement
# Duration [ms]
1 3572.07
2 3512.99
3 3542.68
4 3710.00
5 3672.54
Avg 3602.06

The 3 bumps

While the previous step brought enormous gains, there is still quite a lot of work going on under the hood. Even with such a reduction, a pattern is visible in the JS profiling trace - there are three significant overheads when sending a message. After investigating, those “bumps” turn out to be:

  • the update of the hasDraft property in the report object when the user starts typing a message
  • the optimistic update when sending the message
  • the actual Onyx update after the message was successfully sent

Even though those bumps result from different application flows, they share the same root cause, which is tied to the SidebarLinksData component and its internal calculations.

SidebarLinksData subscribes to the collection of all chatReports from Onyx (about 14k reports in the testing account). The collection is invalidated every time one of the reports changes, which is exactly what happens in the aforementioned bumps (updating the hasDraft property and sending the actual message).

Whenever the chatReports collection is invalidated, it breaks the memoization of the heavy optionListItems object. Keep in mind that the memoization might also break because of other dependencies; that was not the case during the analysis of sending a message, but they appear just as fragile as chatReports.

Brief explanation of SidebarLinksData ordering mechanism and its pitfalls
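To make the dependency chain concrete, here is a simplified, hypothetical sketch of how such a memo looks; the real SidebarLinksData props and the getOrderedReportIDs signature differ in detail:

```ts
// Hypothetical shape of the memo described above.
import {useMemo} from 'react';

type Report = {reportID: string; hasDraft?: boolean; displayName?: string};

declare function getOrderedReportIDs(
    currentReportId: string,
    chatReports: Record<string, Report>,
    betas: string[],
    policies: Record<string, unknown>,
    priorityMode: string,
): string[];

function useOptionListItems(
    currentReportId: string,
    chatReports: Record<string, Report>, // ~14k reports on the test account
    betas: string[],
    policies: Record<string, unknown>,
    priorityMode: string,
) {
    // Any change to any single report yields a new chatReports object from Onyx,
    // so this memo breaks and the full ordering pass re-runs.
    return useMemo(
        () => getOrderedReportIDs(currentReportId, chatReports, betas, policies, priorityMode),
        [currentReportId, chatReports, betas, policies, priorityMode],
    );
}
```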

New Order

The major suspect behind optionListItems' bad performance is the getOrderedReportIDs method from SidebarUtils. It is called to calculate the new order of reports for SidebarLinksData whenever the dependencies change.

Looking at the stack trace, the getOrderedReportIDs function call is responsible for the biggest time overhead, about 2.5-3s per call.

Stack trace presenting getOrderedReportIDs overhead inside SidebarLinksData

Each call consists of a number of operations performed on the set of reports (and other arguments) to determine the correct order of reports:

  • caching reports (JSON.stringify)
  • filtering out reports (arrayPrototypeFilter)
  • modifying reports’ display names (arrayPrototypeForEach)
  • sorting reports (arrayPrototypeSort)

These calculations are done for all of the thousands of reports every time the collection is invalidated, resulting in high overhead. Let's look at them closely in the sections below.

The Cache

As the old joke goes, the two hardest things in computer science are cache invalidation, naming things and off-by-one errors. Based on the previous findings, the audit will try to address them in this section.

The getOrderedReportIDs function tries to avoid calculations by introducing a cache. While this is generally good practice, in this specific example it might add more overhead than it tries to mitigate.

The app spends about 150ms performing JSON.stringify on a heavy list of data in order to compute the cachedReportsKey.

Then the generated key is checked against the existing cache. If a cache entry is hit, the cached ordered reportIDs are returned. The issue is that a cache entry is unlikely to be hit, especially given the maximum cache size of 5 elements and the fairly large set of input variables (a sketch of this pattern follows the list):

  • currentReportId
  • allReports
  • betas
  • policies
  • priorityMode
  • allReportActions
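Here is a hedged sketch of the cache-key pattern described above; the names and shapes are illustrative, not the actual SidebarUtils code:

```ts
// Illustrative sketch of a JSON.stringify-keyed cache with a small eviction limit.
const reportIDsCache = new Map<string, string[]>();
const MAX_CACHE_SIZE = 5;

declare function computeOrderedReportIDs(allReports: Record<string, unknown>, currentReportId: string): string[];

function getOrderedReportIDsCached(allReports: Record<string, unknown>, currentReportId: string): string[] {
    // ~150ms is spent here stringifying the whole input just to build a lookup key.
    const cachedReportsKey = JSON.stringify([allReports, currentReportId]);

    const cached = reportIDsCache.get(cachedReportsKey);
    if (cached) {
        return cached;
    }

    // Expensive path; with this many volatile inputs it is taken almost every time.
    const orderedIDs = computeOrderedReportIDs(allReports, currentReportId);
    if (reportIDsCache.size >= MAX_CACHE_SIZE) {
        const oldestKey = reportIDsCache.keys().next().value;
        if (oldestKey !== undefined) {
            reportIDsCache.delete(oldestKey);
        }
    }
    reportIDsCache.set(cachedReportsKey, orderedIDs);
    return orderedIDs;
}
```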

By removing the caching mechanism, the time required to run getOrderedReportIDs decreases by about 0.3s (~15%).

By removing the caching mechanism getOrderedReportIDs gets 15% faster

Additionally, allReportActions is used solely to compute the cache key, so when the cache implementation is removed, the entire allReportActions retrieval process can be omitted as well (including the request to Onyx), which is another potential gain for the entire SidebarLinksData. Comparing the stack traces shows a ~40ms gain in performance:

Before removing allReportAction retrieval
After removing allReportAction retrieval
🚧 While conducting the analysis, not a single cache hit was observed. That said, we are not sure about the cache's performance benefits in the wild - please treat this finding as an entry point for discussion rather than a definitive statement.
# Message Duration [ms]
1 testing1 3032.67
2 testing2 3216.57
3 testing3 3232.38
4 testing4 3295.99
5 testing5 3195.70
Avg - 3194.86

The naming

Another heavy lifter inside getOrderedReportIDs is the computation of a report's displayName property via getReportName.

Overhead of forEach that is mostly related to getReportName calculations

In order to calculate the display name, getReportName internally calls many different helpers, but the ones that stand out the most are getDisplayNameForParticipant and getPolicyExpenseChatName.

getDisplayNameForParticipant and getPolicyExpenseChatName overhead of ~200ms

While one could try to squeeze out gains by looking for a more performant library or doing micro-optimisations, the best way to save time is to calculate static properties once (preferably server-side), since they are not expected to change, and derive dynamic values from them.

For example, in the getDisplayNameForParticipant implementation we can see that the value returned from the formatPhoneNumber call can be re-used.

The formattedLogin field could be part of personalDetails itself or live in a separate entity. As an experiment, the formatted login was stored in a makeshift map, which reduced the overhead by about 500ms.
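A minimal sketch of that makeshift map is shown below; formatPhoneNumber here stands in for the real helper, and the cache placement is an experiment rather than a final design:

```ts
// Sketch: memoise formatPhoneNumber per login so getDisplayNameForParticipant
// doesn't re-format the same value on every ordering pass.
const formattedLoginCache = new Map<string, string>();

function getFormattedLogin(login: string, formatPhoneNumber: (value: string) => string): string {
    let formatted = formattedLoginCache.get(login);
    if (formatted === undefined) {
        formatted = formatPhoneNumber(login);
        formattedLoginCache.set(login, formatted);
    }
    return formatted;
}
```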

The forEach is still costly, but its runtime is reduced by about 500ms

The example brought up here is meant to illustrate what can be done on a broader scale in the implementation phase with the app's data flow in general. The higher in the tree of data sources we can reduce the number of calculations, the better - and this is something we will do in the future.

# Message Duration [ms]
1 testing1 2482.31
2 testing2 2512.94
3 testing3 2474.32
4 testing4 2461.08
5 testing5 2493.86
Avg - 2484.12

Report’s hasDraft property

An example of such a high-level improvement is the process of determining whether a report has a draft message. It fires whenever the user starts typing a new message and is responsible for the first of the three bumps described above.

Updating the draft state is mainly used to show the correct order and UI in the LHN. While it is an important part of the UX, we don't want to perform 1.5s of calculations just to update one property.

Unfortunately, the hasDraft property is strictly bound to the report object. Changing (part of) the report results in the expensive recalculation of the aforementioned getOrderedReportIDs.

In order to fix this, the hasDraft property could be extracted from the core report structure and served as a separate dependency.

Here are the results of mitigating the hasDraft impact - the first calculation bump that was present in the first measurements is now almost completely gone. The overhead drops from about 1.5s to under 100ms. Compared to the other bumps, it now looks negligible.

Results of avoiding the unnecessary getOrderedReportIDs updates on draft message

This was achieved by replacing the hasDraft property with a separate Onyx key, REPORT_HAS_DRAFT_, which is then used in the components of interest (e.g. SidebarLinksData, OptionRowLHN). The most significant improvement comes from skipping the costly call to getOrderedReportIDs whenever the draft status changes - instead of calculating the entire list from scratch, the newly created draft is injected into the previous list.

Here is the partial implementation as seen from the SidebarLinksData component:

SidebarLinksData improvements
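Since the screenshot cannot be reproduced here, below is a rough, hypothetical sketch of the same pattern as it might look in OptionRowLHN: the draft flag is read from its own Onyx key, so toggling it never touches the report objects that feed getOrderedReportIDs. The ONYXKEYS path, prop shapes and generics are assumptions, not the merged code.

```tsx
// Rough sketch of subscribing to a per-report draft flag via a separate Onyx key.
import React from 'react';
import {Text} from 'react-native';
import {withOnyx} from 'react-native-onyx';
import ONYXKEYS from '@src/ONYXKEYS';

type OptionRowLHNProps = {
    reportID: string;
    hasDraft?: boolean;
};

function OptionRowLHN({reportID, hasDraft = false}: OptionRowLHNProps) {
    // Only this row re-renders when the draft flag changes; the report collection is untouched.
    return <Text>{`${reportID}${hasDraft ? ' ✏️' : ''}`}</Text>;
}

export default withOnyx<OptionRowLHNProps, Pick<OptionRowLHNProps, 'hasDraft'>>({
    hasDraft: {
        key: ({reportID}) => `${ONYXKEYS.COLLECTION.REPORT_HAS_DRAFT}${reportID}`,
    },
})(OptionRowLHN);
```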

A similar pattern was applied to other draft properties, like REPORT_DRAFT_COMMENT, so it might be worthwhile to spark a discussion about handling such structures in the system.

# Message Duration [ms]
1 testing5 2485.97
2 testing 😄 2314.00
3 testing6 2237.64
4 testing7 2247.78
5 testing8 2390.47
6 testing9 2271.90
7 testing1 2255.88
8 testing3 2309.42
9 testing5:smile: 2319.85
10 testing6 2280.91
Avg - 2310.77

Summary

While there is still room for improvement - the goal we are striving for is 100ms of execution time - with the proposed changes in place, the time to send a message went down from ~15s to ~2.5s.

These changes are meant as a general direction or example implementations, which will be continued in the next phase of the audit.

The main goal is to reduce costly calculations that freeze the thread, making the app unresponsive for the user - which is even worse UX than inaccessible inputs! Beyond that, as shown by the smaller improvements, we aim to improve both performance and developer experience by writing more reliable code.

Hope you like the analysis - see you soon in the implementation phase! :)

CC @mountiny @roryabraham

@mountiny
Contributor

mountiny commented Feb 2, 2024

These analyses are really impressive and a great opportunity for everyone to learn from too. It's also great to see that this is in the open source channel, as new engineers/anyone can get inspired by these.

I am very excited to bring the proposals from this into the App!

@kacper-mikolajczak
Contributor Author

Love to hear that @mountiny!

We are also eager to implement the proposals as soon as possible to get the benefits out of them. Before that, would you like to break down the analysis with me into some Actionable Points we can treat as entries into the Implementation Phase?

For convenience, here is a quick summary of the main points we can try to turn into APs and prioritize:

  • Hermes Intl improvements (order of magnitude improvement, ~12s)
  • getOrderedReportIDs tweaks (~1.2s)
    • cache removal
    • displayName formatting overhead
    • sorting improvements
  • Draft message optimisations (depends on previous optimisations, ~200ms-~800ms)
    • moving draft-related properties from Report object into separate Metadata structure
    • improving dependency management

Please keep in mind the actual time gains might differ - their purpose here is to give a general overview of how much each presented change might impact performance 👍

@mountiny
Contributor

mountiny commented Feb 6, 2024

Thank you Kacper, I think my reasoning for ordering would be: make the biggest impact changes first and also prioritize the lowest hanging fruit. Seems like your list achieves that so let's do that

@kacper-mikolajczak
Contributor Author

Great, I find that ordering best as well. To sum up, here are the ordered Action Items:

  1. Hermes Intl improvements
  2. getOrderedReportIDs tweaks
    2.1. cache removal
    2.2. displayName formatting overhead
    2.3. sorting improvements
  3. Draft message optimisations
    3.1. moving draft-related properties from Report object into separate Metadata structure
    3.2. improving dependency management (implementation of useMemo)

We are going to use these items as input for the Implementation Phase.

CC: @mountiny

@mountiny
Contributor

mountiny commented Feb 6, 2024

Thanks!

@muttmuure
Contributor

We're working on it

@melvin-bot melvin-bot bot removed the Overdue label Apr 18, 2024
@muttmuure
Contributor

Working through it

@quinthar quinthar changed the title [Tracking] Metric - Send a message TTI (Performance audit by Callstack) LOW: [Tracking] Metric - Send a message TTI (Performance audit by Callstack) May 12, 2024
@roryabraham
Contributor

Most recent slack convo: https://expensify.slack.com/archives/C05LX9D6E07/p1713358946087599

TL;DR currently we're talking about standardizing on moize as our go-to solution for memoization outside of React.
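For context, a small hedged sketch of what standardizing on moize could look like for a non-React helper; the wrapped function and option values are illustrative only, not a concrete proposal:

```ts
// Sketch: wrap an expensive non-React helper with moize instead of a hand-rolled
// JSON.stringify cache.
import moize from 'moize';

function orderReportIDs(priorityMode: string, currentReportId: string): string[] {
    // ...expensive ordering work over the report collection would go here...
    return [];
}

// Cache the last few results keyed by the arguments themselves (no stringify pass).
const memoizedOrderReportIDs = moize(orderReportIDs, {maxSize: 5});

memoizedOrderReportIDs('default', '123'); // computed
memoizedOrderReportIDs('default', '123'); // served from cache
```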

@kirillzyusko
Contributor

This issue was mentioned in https://expensify.slack.com/archives/C05LX9D6E07/p1711455442814489?thread_ts=1711010712.155509&cid=C05LX9D6E07

I'm going to write an e2e test for that (or extend a current one) to ensure the performance of such critical flows does not degrade over time 👍

@melvin-bot melvin-bot bot added the Reviewing Has a PR in review label Jun 7, 2024
@melvin-bot melvin-bot bot added Weekly KSv2 and removed Monthly KSv2 labels Jun 7, 2024
@muttmuure
Contributor

@kacper-mikolajczak are you still on this one?

@kacper-mikolajczak
Contributor Author

Hi @muttmuure, this was part of the last performance audit and is complete for now.

@muttmuure
Contributor

muttmuure commented Jun 12, 2024

Closing

@mountiny mountiny self-assigned this Jun 17, 2024
@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Jun 17, 2024
@melvin-bot melvin-bot bot changed the title LOW: [Tracking] Metric - Send a message TTI (Performance audit by Callstack) [HOLD for payment 2024-06-28] LOW: [Tracking] Metric - Send a message TTI (Performance audit by Callstack) Jun 21, 2024
@melvin-bot melvin-bot bot removed the Reviewing Has a PR in review label Jun 21, 2024

melvin-bot bot commented Jun 21, 2024

Reviewing label has been removed, please complete the "BugZero Checklist".


melvin-bot bot commented Jun 21, 2024

The solution for this issue has been 🚀 deployed to production 🚀 in version 1.4.85-7 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-06-28. 🎊

For reference, here are some details about the assignees on this issue:
