Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance] Grafana logs don't seem to make sense anymore #3889

Closed
marcaaron opened this issue Jul 6, 2021 · 23 comments
Closed

[Performance] Grafana logs don't seem to make sense anymore #3889

marcaaron opened this issue Jul 6, 2021 · 23 comments
Assignees
Labels
Engineering Internal Requires API changes or must be handled by Expensify staff Weekly KSv2

Comments

@marcaaron
Copy link
Contributor

If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!


Action Performed:

Checked Grafana and key metrics appear to be wildly inaccurate

Expected Result:

That logs would more or less be the same as before with perhaps some small changes in performance

Actual Result:

The timings have deviated by 10 minutes or more

image (6)

Workaround:

Doesn't affect app usage. May or may not be indicative of a performance regression.

Platform:

N/A

Version Number:
Logs: https://stackoverflow.com/c/expensify/questions/4856
Notes/Photos/Videos: Any additional supporting documentation
Expensify/Expensify Issue URL:

View all open jobs on Upwork

@marcaaron marcaaron added Engineering Daily KSv2 Internal Requires API changes or must be handled by Expensify staff labels Jul 6, 2021
@marcaaron marcaaron assigned marcaaron and unassigned marcaaron Jul 6, 2021
@marcaaron
Copy link
Contributor Author

Took a quick look here and I'm not too sure what's happening. The way Graphite timings work in the front end we store a local map of timing values and then clear them on sign out.

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Timing.js#L4-L13
https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Session.js#L69-L70

If for some reason these values were not cleared and we "signed in" again I'd expect the values in the map to be replaced by new "start" event before being paired with an "end" event. So it's unclear how these values can be so large 🤔

e.g. for the TIMING.HOMEPAGE_REPORTS_LOADED "start" event we only record this when constructing AuthScreens component here:

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/Navigation/AppNavigator/AuthScreens.js#L118-L119

Then when the component mounts we make the API request

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/Navigation/AppNavigator/AuthScreens.js#L138-L139

And finally record the "end" event when it returns successfully

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Report.js#L920-L922

So we are surely reaching the above code otherwise we would not see these logs in Grafana at all. But why are they suddenly so much larger? @Julesssss do you have any ideas?

@jsamr
Copy link
Collaborator

jsamr commented Jul 9, 2021

@marcaaron Just preparing a ticket for a performance audit and I stumbled on this one.

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Timing.js#L4-L13

Note that the Date API is pretty inaccurate for performance metrics, and React Native has shipped a minimal Performance API in version 0.63.0 facebook/react-native@232517a (see changelog https://github.com/react-native-community/releases/blob/master/CHANGELOG.md#v0630). performance.now should be preferred over new Date() for accuracy ;-)

@marcaaron
Copy link
Contributor Author

Nice one! We should get this fixed up (maybe in a new issue) and it's good to note.

But I think the issue was working for a while at least with the same code. There is a clear chasm with ms/seconds on one side and then suddenly minutes on the other. So probably we made a serious mistake somewhere else or there is an API request somehow taking 10 minutes for some of us.

@Julesssss
Copy link
Contributor

Oh wow, nice spot.

It's odd that the increase seems to occur on different days for the 'Switch report' & 'Homepage reports loaded' events. The most recent PR touching the timing code was this one, which seems to be aligned date-wise. But I can't see any obvious issues at first glance. Happy to look further into this once I've cleared higher priority tasks.

@rdjuric
Copy link
Contributor

rdjuric commented Jul 13, 2021

Just to add some data:

I had some Timing logs with negative time (-392ms) for the Switch Report. Tried to reproduce it when I noticed but couldn't find out how.

@marcaaron
Copy link
Contributor Author

I had some Timing logs with negative time (-392ms) for the Switch Report. Tried to reproduce it when I noticed but couldn't find out how.

That's interesting. Maybe someone can look into what happens if we call Grafana with a negative value? Perhaps that can somehow explain the very large values.

@marcaaron marcaaron changed the title Performance logs don't seem to make sense anymore [Performance] Grafana logs don't seem to make sense anymore Jul 14, 2021
@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

Without reading any of this (sorry) I think I fixed the graphs. What I did:

  • Added aliases to all graphs to make the labels make sense
  • Hid any metric with no data
  • Changed the summarize aggregation method from sum to avg, you were adding all the times from all the 24hs you were summarizing

Do they make sense now?

For some reason, there are some weird metrics in the homepage reports loaded, not sure where they are coming from...

@iwiznia iwiznia self-assigned this Jul 15, 2021
@Julesssss
Copy link
Contributor

That's looking better 👍

While you're assigned, do you know if it's possible to remove these junk platforms from the 'Homepage Reports Loaded' table?

Screenshot 2021-07-15 at 14 27 40

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

I can remove them manually, I am a bit more interested to know why they are there.

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

Seems all the crap data happened only once at 2021-07-12 00:00:00 so maybe it was someone doing something weird in dev, I bet it won't happen again so the labels will be gone on their own.

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

This adding 1500ms extra to the timing of the action... right? I think that if we want the debounce we need to subtract the debounce time from the measured time.
But I am not really sure I get why we need the debounce?

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

Oh, it already is subtracting it 😄
I wonder why the graphs show such a high time.

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

O, I think I know why, we are subtracting the debounce time, but if the event fires many times, the debounce time gets reset, so we are not subtracting the correct amount of time.
Now before coming up with a fix, I'd like to understand why we use the debounce at all... we want to track when all things are rendered or when the first one is?

@marcaaron
Copy link
Contributor Author

Now before coming up with a fix, I'd like to understand why we use the debounce at all... we want to track when all things are rendered or when the first one is?

I'm realizing now that maybe this was not a perfect solution. It's confusing for certain and could have a bug, but let's see. I'll try to explain the reasoning...

Previously, we would record the "end" timing when the report mounts. This didn't give us an accurate picture of whether all visible chat items are rendered and on the screen. So we added this onLayout callback here:

https://github.com/Expensify/Expensify.cash/blob/a4424c65ae89030e97f6f36e6911cfe5295da5f2/src/pages/home/report/ReportActionsView.js#L377

Which is debounced so that when all the visible items have finished laying out we will record that moment and measure the total time to layout all visible items.

if the event fires many times, the debounce time gets reset, so we are not subtracting the correct amount of time

Maybe I have this wrong, but my intention was that the debounce should wait 1.5 seconds after the last time it is called to execute the callback.

From the docs:

Creates and returns a new debounced version of the passed function which will postpone its execution until after wait milliseconds have elapsed since the last time it was invoked.

So, we are subtracting 1.5 seconds from the last time an item layout occurred because the actual function will be executed 1.5 seconds after the last time the debounced function was called.

So e.g. if there is only one item... it would layout then the debounce function would get called 1.5 seconds later then we can subtract 1.5 seconds from the Date.now() to get the true time. But maybe we are wrong about this?

There might be a better way to do this. Or we could just stop tracking this metric entirely if we can't come up with a good solution and this one seems broken.

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

ok, I see. First, is good to know that the debounce exists only to track the time appropriately, I was fearing it was needed for some other reason 😄

So, I think what you say is correct, but the thing is, I was looking at the screen and the event to end the timer seemed to be fired a lot later than 1.5s of when I saw the screen rendered... I'll do a bit more tests. Maybe we just care about the rendering of visible items of the chat.

@marcaaron
Copy link
Contributor Author

seemed to be fired a lot later than 1.5s of when I saw the screen rendered

It could be that many more items are rendering offscreen than we'd expect.

There might be a better way to do this as well. I stumbled on this ->

https://stackoverflow.com/questions/60210693/react-native-flatlist-how-to-know-when-the-rendering-of-visible-items-has-fin

Still seems a little fiddly, but maybe not much more than what we have now. Perhaps we can stop after 10 items have rendered and treat that as "done" instead of the debounce method.

There is also this suggested by @kidroca at some point in the past, but I wasn't too sure how to implement it

https://reactnative.dev/docs/flatlist#viewabilityconfig

@iwiznia
Copy link
Contributor

iwiznia commented Jul 15, 2021

Confirmed that it is firing several seconds after what I see in screen is rendered, will take a look at those options on monday.

@kidroca
Copy link
Contributor

kidroca commented Jul 15, 2021

The SO approach is not any better than the onLayout one
Instead of implementing it inside the item's didMount, and then having to relay that to the parent
you can just do something like

renderItem({
        item,
        index,
    }) {
  this.renderedItemsCount++;

  return <ReportActionItem 
    onLayout={this.renderedItemsCount < 10 ? this.recordTimeToMeasureItemLayout : undefined}
  />
}

Count 10 invokations of the renderItem and stop subscribing for the onLayout event

This way you get the bonus of stop invoking a cb function after your done with timing

@marcaaron
Copy link
Contributor Author

Sure that seems like a clever way to do the same thing but only pass the callback to the items necessary.

@marcaaron
Copy link
Contributor Author

Did some testing on web and it looks like a single onLayout callback attached to the FlatList may also be good enough. I just used my eyeballs and waited for a console log to appear when the callback fires. The messages appear on screen at the same time the callback fires so it's good enough for me.

@marcaaron
Copy link
Contributor Author

Gonna send a PR to clean this up + simplify.

@MelvinBot
Copy link

@iwiznia Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

@iwiznia
Copy link
Contributor

iwiznia commented Jul 26, 2021

Waiting on a deploy really, making this weekly.

@MelvinBot MelvinBot removed the Overdue label Jul 26, 2021
@iwiznia iwiznia added Weekly KSv2 and removed Daily KSv2 labels Jul 26, 2021
@iwiznia iwiznia closed this as completed Aug 4, 2021
@MelvinBot MelvinBot removed the Overdue label Aug 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Engineering Internal Requires API changes or must be handled by Expensify staff Weekly KSv2
Projects
None yet
Development

No branches or pull requests

7 participants