[Performance] Grafana logs don't seem to make sense anymore #3889

marcaaron · 2021-07-06T17:33:09Z

If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!

Action Performed:

Checked Grafana and key metrics appear to be wildly inaccurate

Expected Result:

That logs would more or less be the same as before with perhaps some small changes in performance

Actual Result:

The timings have deviated by 10 minutes or more

Workaround:

Doesn't affect app usage. May or may not be indicative of a performance regression.

Platform:

N/A

Version Number:
Logs: https://stackoverflow.com/c/expensify/questions/4856
Notes/Photos/Videos: Any additional supporting documentation
Expensify/Expensify Issue URL:

View all open jobs on Upwork

marcaaron · 2021-07-06T18:18:56Z

Took a quick look here and I'm not too sure what's happening. The way Graphite timings work in the front end we store a local map of timing values and then clear them on sign out.

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Timing.js#L4-L13
https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Session.js#L69-L70

If for some reason these values were not cleared and we "signed in" again I'd expect the values in the map to be replaced by new "start" event before being paired with an "end" event. So it's unclear how these values can be so large 🤔

e.g. for the TIMING.HOMEPAGE_REPORTS_LOADED "start" event we only record this when constructing AuthScreens component here:

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/Navigation/AppNavigator/AuthScreens.js#L118-L119

Then when the component mounts we make the API request

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/Navigation/AppNavigator/AuthScreens.js#L138-L139

And finally record the "end" event when it returns successfully

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Report.js#L920-L922

So we are surely reaching the above code otherwise we would not see these logs in Grafana at all. But why are they suddenly so much larger? @Julesssss do you have any ideas?

jsamr · 2021-07-09T19:00:40Z

@marcaaron Just preparing a ticket for a performance audit and I stumbled on this one.

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Timing.js#L4-L13

Note that the Date API is pretty inaccurate for performance metrics, and React Native has shipped a minimal Performance API in version 0.63.0 facebook/react-native@232517a (see changelog https://github.com/react-native-community/releases/blob/master/CHANGELOG.md#v0630). performance.now should be preferred over new Date() for accuracy ;-)

marcaaron · 2021-07-09T19:13:19Z

Nice one! We should get this fixed up (maybe in a new issue) and it's good to note.

But I think the issue was working for a while at least with the same code. There is a clear chasm with ms/seconds on one side and then suddenly minutes on the other. So probably we made a serious mistake somewhere else or there is an API request somehow taking 10 minutes for some of us.

Julesssss · 2021-07-13T11:37:17Z

Oh wow, nice spot.

It's odd that the increase seems to occur on different days for the 'Switch report' & 'Homepage reports loaded' events. The most recent PR touching the timing code was this one, which seems to be aligned date-wise. But I can't see any obvious issues at first glance. Happy to look further into this once I've cleared higher priority tasks.

rdjuric · 2021-07-13T11:40:17Z

Just to add some data:

I had some Timing logs with negative time (-392ms) for the Switch Report. Tried to reproduce it when I noticed but couldn't find out how.

marcaaron · 2021-07-13T21:37:07Z

I had some Timing logs with negative time (-392ms) for the Switch Report. Tried to reproduce it when I noticed but couldn't find out how.

That's interesting. Maybe someone can look into what happens if we call Grafana with a negative value? Perhaps that can somehow explain the very large values.

iwiznia · 2021-07-15T12:54:14Z

Without reading any of this (sorry) I think I fixed the graphs. What I did:

Added aliases to all graphs to make the labels make sense
Hid any metric with no data
Changed the summarize aggregation method from sum to avg, you were adding all the times from all the 24hs you were summarizing

Do they make sense now?

For some reason, there are some weird metrics in the homepage reports loaded, not sure where they are coming from...

Julesssss · 2021-07-15T14:13:55Z

That's looking better 👍

While you're assigned, do you know if it's possible to remove these junk platforms from the 'Homepage Reports Loaded' table?

iwiznia · 2021-07-15T14:15:54Z

I can remove them manually, I am a bit more interested to know why they are there.

iwiznia · 2021-07-15T14:17:53Z

Seems all the crap data happened only once at 2021-07-12 00:00:00 so maybe it was someone doing something weird in dev, I bet it won't happen again so the labels will be gone on their own.

iwiznia · 2021-07-15T14:25:29Z

This adding 1500ms extra to the timing of the action... right? I think that if we want the debounce we need to subtract the debounce time from the measured time.
But I am not really sure I get why we need the debounce?

iwiznia · 2021-07-15T14:40:43Z

Oh, it already is subtracting it 😄
I wonder why the graphs show such a high time.

iwiznia · 2021-07-15T14:44:00Z

O, I think I know why, we are subtracting the debounce time, but if the event fires many times, the debounce time gets reset, so we are not subtracting the correct amount of time.
Now before coming up with a fix, I'd like to understand why we use the debounce at all... we want to track when all things are rendered or when the first one is?

marcaaron · 2021-07-15T17:57:14Z

Now before coming up with a fix, I'd like to understand why we use the debounce at all... we want to track when all things are rendered or when the first one is?

I'm realizing now that maybe this was not a perfect solution. It's confusing for certain and could have a bug, but let's see. I'll try to explain the reasoning...

Previously, we would record the "end" timing when the report mounts. This didn't give us an accurate picture of whether all visible chat items are rendered and on the screen. So we added this onLayout callback here:

https://github.com/Expensify/Expensify.cash/blob/a4424c65ae89030e97f6f36e6911cfe5295da5f2/src/pages/home/report/ReportActionsView.js#L377

Which is debounced so that when all the visible items have finished laying out we will record that moment and measure the total time to layout all visible items.

if the event fires many times, the debounce time gets reset, so we are not subtracting the correct amount of time

Maybe I have this wrong, but my intention was that the debounce should wait 1.5 seconds after the last time it is called to execute the callback.

From the docs:

Creates and returns a new debounced version of the passed function which will postpone its execution until after wait milliseconds have elapsed since the last time it was invoked.

So, we are subtracting 1.5 seconds from the last time an item layout occurred because the actual function will be executed 1.5 seconds after the last time the debounced function was called.

So e.g. if there is only one item... it would layout then the debounce function would get called 1.5 seconds later then we can subtract 1.5 seconds from the Date.now() to get the true time. But maybe we are wrong about this?

There might be a better way to do this. Or we could just stop tracking this metric entirely if we can't come up with a good solution and this one seems broken.

iwiznia · 2021-07-15T18:04:14Z

ok, I see. First, is good to know that the debounce exists only to track the time appropriately, I was fearing it was needed for some other reason 😄

So, I think what you say is correct, but the thing is, I was looking at the screen and the event to end the timer seemed to be fired a lot later than 1.5s of when I saw the screen rendered... I'll do a bit more tests. Maybe we just care about the rendering of visible items of the chat.

marcaaron · 2021-07-15T18:28:35Z

seemed to be fired a lot later than 1.5s of when I saw the screen rendered

It could be that many more items are rendering offscreen than we'd expect.

There might be a better way to do this as well. I stumbled on this ->

https://stackoverflow.com/questions/60210693/react-native-flatlist-how-to-know-when-the-rendering-of-visible-items-has-fin

Still seems a little fiddly, but maybe not much more than what we have now. Perhaps we can stop after 10 items have rendered and treat that as "done" instead of the debounce method.

There is also this suggested by @kidroca at some point in the past, but I wasn't too sure how to implement it

https://reactnative.dev/docs/flatlist#viewabilityconfig

iwiznia · 2021-07-15T18:30:55Z

Confirmed that it is firing several seconds after what I see in screen is rendered, will take a look at those options on monday.

kidroca · 2021-07-15T21:34:33Z

The SO approach is not any better than the onLayout one
Instead of implementing it inside the item's didMount, and then having to relay that to the parent
you can just do something like

renderItem({
        item,
        index,
    }) {
  this.renderedItemsCount++;

  return <ReportActionItem 
    onLayout={this.renderedItemsCount < 10 ? this.recordTimeToMeasureItemLayout : undefined}
  />
}

Count 10 invokations of the renderItem and stop subscribing for the onLayout event

This way you get the bonus of stop invoking a cb function after your done with timing

marcaaron · 2021-07-15T21:46:03Z

Sure that seems like a clever way to do the same thing but only pass the callback to the items necessary.

marcaaron · 2021-07-19T20:41:35Z

Did some testing on web and it looks like a single onLayout callback attached to the FlatList may also be good enough. I just used my eyeballs and waited for a console log to appear when the callback fires. The messages appear on screen at the same time the callback fires so it's good enough for me.

marcaaron · 2021-07-19T20:41:56Z

Gonna send a PR to clean this up + simplify.

MelvinBot · 2021-07-23T19:02:46Z

@iwiznia Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

iwiznia · 2021-07-26T11:34:14Z

Waiting on a deploy really, making this weekly.

marcaaron added Engineering Daily KSv2 Internal Requires API changes or must be handled by Expensify staff labels Jul 6, 2021

marcaaron assigned marcaaron and unassigned marcaaron Jul 6, 2021

marcaaron changed the title ~~Performance logs don't seem to make sense anymore~~ [Performance] Grafana logs don't seem to make sense anymore Jul 14, 2021

iwiznia self-assigned this Jul 15, 2021

marcaaron mentioned this issue Jul 16, 2021

Update Onyx.printMetrics format Expensify/react-native-onyx#89

Merged

marcaaron mentioned this issue Jul 19, 2021

[No QA] Use onLayout to record chat switching time #4140

Merged

5 tasks

MelvinBot added the Overdue label Jul 23, 2021

MelvinBot removed the Overdue label Jul 26, 2021

iwiznia added Weekly KSv2 and removed Daily KSv2 labels Jul 26, 2021

MelvinBot added the Overdue label Aug 3, 2021

iwiznia closed this as completed Aug 4, 2021

MelvinBot removed the Overdue label Aug 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] Grafana logs don't seem to make sense anymore #3889

[Performance] Grafana logs don't seem to make sense anymore #3889

marcaaron commented Jul 6, 2021

marcaaron commented Jul 6, 2021

jsamr commented Jul 9, 2021

marcaaron commented Jul 9, 2021

Julesssss commented Jul 13, 2021

rdjuric commented Jul 13, 2021

marcaaron commented Jul 13, 2021

iwiznia commented Jul 15, 2021 •

edited

Loading

Julesssss commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021 •

edited

Loading

marcaaron commented Jul 15, 2021

iwiznia commented Jul 15, 2021

marcaaron commented Jul 15, 2021

iwiznia commented Jul 15, 2021

kidroca commented Jul 15, 2021

marcaaron commented Jul 15, 2021

marcaaron commented Jul 19, 2021

marcaaron commented Jul 19, 2021

MelvinBot commented Jul 23, 2021

iwiznia commented Jul 26, 2021

[Performance] Grafana logs don't seem to make sense anymore #3889

[Performance] Grafana logs don't seem to make sense anymore #3889

Comments

marcaaron commented Jul 6, 2021

Action Performed:

Expected Result:

Actual Result:

Workaround:

Platform:

marcaaron commented Jul 6, 2021

jsamr commented Jul 9, 2021

marcaaron commented Jul 9, 2021

Julesssss commented Jul 13, 2021

rdjuric commented Jul 13, 2021

marcaaron commented Jul 13, 2021

iwiznia commented Jul 15, 2021 • edited Loading

Julesssss commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021

iwiznia commented Jul 15, 2021 • edited Loading

marcaaron commented Jul 15, 2021

iwiznia commented Jul 15, 2021

marcaaron commented Jul 15, 2021

iwiznia commented Jul 15, 2021

kidroca commented Jul 15, 2021

marcaaron commented Jul 15, 2021

marcaaron commented Jul 19, 2021

marcaaron commented Jul 19, 2021

MelvinBot commented Jul 23, 2021

iwiznia commented Jul 26, 2021

iwiznia commented Jul 15, 2021 •

edited

Loading

iwiznia commented Jul 15, 2021 •

edited

Loading