[Performance] Grafana logs don't seem to make sense anymore #3889
Comments
Took a quick look here and I'm not too sure what's happening. The way Graphite timings work in the front end, we store a local map of timing values and then clear them on sign out.

https://github.com/Expensify/Expensify.cash/blob/36a336400efc060df2ebc21d74ca1ee19ee5afff/src/libs/actions/Timing.js#L4-L13

If for some reason these values were not cleared and we "signed in" again, I'd expect the values in the map to be replaced by a new "start" event before being paired with an "end" event. So it's unclear how these values can be so large 🤔

e.g. for the
Then when the component mounts we make the API request
And finally record the "end" event when it returns successfully

So we are surely reaching the above code, otherwise we would not see these logs in Grafana at all. But why are they suddenly so much larger? @Julesssss do you have any ideas?
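To make the start/end pairing concrete, here is a minimal sketch of a timing map like the one described above. It is not the code in `Timing.js`: `createTiming` and the injectable `now` clock are hypothetical names used only for illustration.

```javascript
// Hypothetical sketch of a start/end timing map (not the project's Timing.js).
// `now` is injectable so the behaviour can be checked with a fake clock.
function createTiming(now = Date.now) {
    let timestamps = {};

    return {
        // Record (or overwrite) the start time for an event.
        start(eventName) {
            timestamps[eventName] = now();
        },

        // Return elapsed ms since the matching start, or null if there is none.
        end(eventName) {
            if (!(eventName in timestamps)) {
                return null;
            }
            const elapsed = now() - timestamps[eventName];
            delete timestamps[eventName];
            return elapsed;
        },

        // Drop all pending start times, e.g. on sign out.
        clear() {
            timestamps = {};
        },
    };
}
```

With this shape, a stale "start" is overwritten by the next `start` call for the same event, so a leftover entry alone should not inflate the value. An "end" would only report minutes if its matching "start" was recorded minutes earlier and never replaced.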
@marcaaron Just preparing a ticket for a performance audit and I stumbled on this one. Note that the
Nice one! We should get this fixed up (maybe in a new issue) and it's good to note. But I think the timing was working for a while, at least, with the same code. There is a clear chasm with ms/seconds on one side and then suddenly minutes on the other. So probably we made a serious mistake somewhere else, or there is an API request somehow taking 10 minutes for some of us.
Oh wow, nice spot. It's odd that the increase seems to occur on different days for the 'Switch report' & 'Homepage reports loaded' events. The most recent PR touching the timing code was this one, which seems to be aligned date-wise. But I can't see any obvious issues at first glance. Happy to look further into this once I've cleared higher priority tasks.
Just to add some data: I had some Timing logs with negative time (-392ms) for the
That's interesting. Maybe someone can look into what happens if we call Grafana with a negative value? Perhaps that can somehow explain the very large values.
Without reading any of this (sorry) I think I fixed the graphs. What I did:
Do they make sense now? For some reason, there are some weird metrics in the homepage reports loaded graph, not sure where they are coming from...
I can remove them manually, but I am more interested in knowing why they are there.
Seems all the crap data happened only once, at 2021-07-12 00:00:00, so maybe it was someone doing something weird in dev. I bet it won't happen again, so the labels will be gone on their own.
This is adding 1500ms extra to the timing of the action... right? I think that if we want the debounce, we need to subtract the debounce time from the measured time.
Oh, it already is subtracting it 😄
Oh, I think I know why: we are subtracting the debounce time, but if the event fires many times, the debounce timer gets reset, so we are not subtracting the correct amount of time.
I'm realizing now that maybe this was not a perfect solution. It's confusing for certain and could have a bug, but let's see. I'll try to explain the reasoning... Previously, we would record the "end" timing when the report mounts. This didn't give us an accurate picture of whether all visible chat items are rendered and on the screen. So we added this Which is debounced so that when all the visible items have finished laying out we will record that moment and measure the total time to layout all visible items.
Maybe I have this wrong, but my intention was that the debounce should wait 1.5 seconds after the last time it is called to execute the callback. From the docs:
So, we are subtracting 1.5 seconds from the last time an item layout occurred, because the actual function will be executed 1.5 seconds after the last time the debounced function was called. So e.g. if there is only one item, it would lay out, then the debounced function would get called 1.5 seconds later, and then we can subtract 1.5 seconds from the

There might be a better way to do this. Or we could just stop tracking this metric entirely if we can't come up with a good solution and this one seems broken.
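The arithmetic behind that reasoning can be sketched as a small model. This is an assumption-laden illustration, not the app's code: `measureWithTrailingDebounce` is a hypothetical helper that models a trailing debounce as "fire `wait` ms after the last call", with all times in milliseconds relative to the "start" event.

```javascript
// Hypothetical model of a trailing debounce (not the app's implementation).
// Assumes layoutTimes is a non-empty array of timestamps in ms.
function measureWithTrailingDebounce(startTime, layoutTimes, wait) {
    // A trailing debounce fires `wait` ms after the most recent call.
    const lastLayout = Math.max(...layoutTimes);
    const firedAt = lastLayout + wait;

    return {
        firedAt,
        // Subtracting the wait recovers the time of the last layout event.
        measured: firedAt - startTime - wait,
    };
}

// Three layout events, the last one at 400ms.
const visibleOnly = measureWithTrailingDebounce(0, [100, 250, 400], 1500);
// → { firedAt: 1900, measured: 400 }

// Same events plus one late layout at 3000ms.
const withLateItem = measureWithTrailingDebounce(0, [100, 250, 400, 3000], 1500);
// → { firedAt: 4500, measured: 3000 }
```

Under this model the subtraction is correct no matter how many times the debounced function is called, because only the last call matters. What it cannot do is tell a late layout event apart from an early one: any layout that arrives late moves the measured end time with it.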
OK, I see. First, it's good to know that the debounce exists only to track the time appropriately, I was fearing it was needed for some other reason 😄 So, I think what you say is correct, but the thing is, I was looking at the screen and the event to end the timer seemed to be fired a lot later than 1.5s after I saw the screen rendered... I'll do a few more tests. Maybe we just care about the rendering of visible items of the chat.
It could be that many more items are rendering offscreen than we'd expect. There might be a better way to do this as well. I stumbled on this -> Still seems a little fiddly, but maybe not much more than what we have now. Perhaps we can stop after 10 items have rendered and treat that as "done" instead of the debounce method. There is also this suggested by @kidroca at some point in the past, but I wasn't too sure how to implement it
Confirmed that it is firing several seconds after what I see on screen has rendered. Will take a look at those options on Monday.
The SO approach is not any better than the

```jsx
renderItem({item, index}) {
    this.renderedItemsCount++;

    // Only pass the timing callback to the first few items.
    return (
        <ReportActionItem
            onLayout={this.renderedItemsCount < 10 ? this.recordTimeToMeasureItemLayout : undefined}
        />
    );
}
```

Count 10 invocations of the

This way you get the bonus of no longer invoking a callback function once you're done with timing
Sure, that seems like a clever way to do the same thing, but only pass the callback to the items necessary.
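For comparison, the "stop after 10 items and treat that as done" idea mentioned earlier in the thread could look roughly like the sketch below. It is a variant of the counting approach, not anyone's actual implementation: `createLayoutTimer`, `onDone` and the injectable `now` clock are hypothetical names.

```javascript
// Hypothetical sketch: treat the Nth item layout as "done" instead of debouncing.
function createLayoutTimer(targetCount, onDone, now = Date.now) {
    let laidOut = 0;
    let done = false;

    // Intended to be passed as the onLayout handler of each rendered item.
    return function handleItemLayout() {
        if (done) {
            return;
        }
        laidOut += 1;
        if (laidOut >= targetCount) {
            done = true;
            // Report the moment the Nth item finished laying out.
            onDone(now());
        }
    };
}
```

One limitation of this sketch: if a chat has fewer than `targetCount` items, `onDone` never fires, so a real version would need a fallback for short chats.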
Did some testing on web and it looks like a single
Gonna send a PR to clean this up + simplify.
@iwiznia Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!
Waiting on a deploy really, making this weekly.
If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!
Action Performed:
Checked Grafana and key metrics appear to be wildly inaccurate
Expected Result:
Logs would be more or less the same as before, with perhaps some small changes in performance
Actual Result:
The timings have deviated by 10 minutes or more
Workaround:
Doesn't affect app usage. May or may not be indicative of a performance regression.
Platform:
N/A
Version Number:
Logs: https://stackoverflow.com/c/expensify/questions/4856
Notes/Photos/Videos: Any additional supporting documentation
Expensify/Expensify Issue URL: