[Infra Monitoring UI] Uptime for Pod/Container in Metrics table is wrong #136047
Comments
Pinging @elastic/apm-ui (Team:apm) |
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI) |
I kind of expected this. We did discuss it in the original ticket, but no one had a clear answer and we had so many other questions going on, so we just went with
The second issue seems easy enough to fix, rather than using
The first issue is more tricky, for several reasons, but let's see what options we might have.
Quick "fix":
Real fix: What does that algorithm look like, to actually determine the uptime? It is not clear to me which documents should be involved to make this as easy as possible.
Long fix: For performance reasons, but also to gain flexibility in the data we show, we might want to build a dedicated API for this component. Going further, thinking about a dashboard-driven future, how would a dashboard display such information? |
@alex-fedotyev let's sync on this. Do you have a meta issue to help me understand the reasoning behind this one? I'd like more context on who we are solving for and which problems, so that I can form a more informed opinion. |
As a stop gap we're going to hide this column for now and follow up on this issue with either a real fix or some other metrics. For the follow up, it would be good if we can use metrics that are easy to consume and if we want to display uptime we should look to extend the Kubernetes module to collect that in a different way (similar to how the system module does). |
* [Infra] Hide Uptime column in Pod/Container metrics table (#136047)
* Remove code instead of commenting it out
* Fully remove uptime related code
@miltonhultgren @smith moved it to "in progress" judging by the latest commit reference |
@pmeresanu85 do you know who has this in progress at the moment? I'm having trouble finding anything that looks like active work on this issue. |
@matschaffer I was referring to the previous pull request from the 19th of July, when I set this issue to "in progress". Is there a reason why this has been put on hold? Is it complete? |
The previous PR just hid the column while we figure out what to actually show in the column and how to calculate the actual uptime. |
@miltonhultgren I assume this means we are actively working on this, based on your statement. |
we = product people, yes. No engineer is working on this but this ties into the larger discussion around what we're gonna do with our multiple tables. |
Moving back to “ready” |
I'll put this back into refining since there isn't a clear choice of what to do. |
@miltonhultgren I assume that when you say it is unclear what to do, you mean the solution approach still needs some reflection. |
Yes, I see 3 options:
From my view, this is a decision to be made by PMs within APM. |
@miltonhultgren thank you for sharing your thoughts! Let me take it into the 1:1 with @alex-fedotyev and see how to proceed on this topic. I will update this issue after the 1:1. |
@miltonhultgren after my 1:1 with @alex-fedotyev for the Infra view in APM we agreed that:
As for the uptime fix, I believe we should invest into actually calculating and showing the real uptime. |
Sounds good to me 👍 const apmEventClient = createApmEventClient({ request, context }); More info about
Will infra make React components available, or APIs? I prefer React components. |
@sqren I prefer react components too, but I leave it up to the team to decide what is best. |
At this time putting the uptime back on the table in APM is not a huge priority for APM. Removed from "Refining" on our board and added We probably want to see if this can be handled closer to ingest time, since calculating these on the client could be problematic. The next best times to revisit this might be when we're adding container-oriented views to host observability, or when APM is making updates to the metrics tab and changes to how infrastructure data is displayed. |
@miltonhultgren / @neptunian when you have time (no rush), can one or both of you walk me through what's missing in current documents to calculate the uptime we want to see? Based on @alex-fedotyev's diagram in the issue description, it looks like we need "stop time" for both scenarios. I see that @smith has de-prioritized this so don't take time away from other things to consider this, but I want to wrap my head around whether this is something we expect to be able to answer from signal documents or if it's something we should be considering from an inventory perspective, or both. Anyway, the idea: is there a way to, based on the query parameters, look at the bucket count for documents that are returned for each node and compare it to what we expect vs. how far the start time is from the beginning of the time window? For example:
We'd currently calculate that uptime as 73 min, as I understand it (from start time to now, disregarding the time range end). Could we do something like:
The main thing I'm not sure of is whether the current queries we are doing already return the number of buckets found within the given range, and if we can easily compare that to the number of buckets we expect in that range based on the metricset.period -- if those aren't possible/simple, then this is likely too complicated to attempt. |
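To make the bucket-count idea above concrete, here is a rough sketch, not anything from the Kibana codebase. It assumes the query can be made to return one bucket per `metricset.period` and that a bucket containing at least one document means the node was up for that period. The function name `estimateUptimeFromBuckets` and all field names are hypothetical.

```typescript
// Hypothetical input: none of these names exist in Kibana.
interface BucketCoverageInput {
  rangeStartMs: number; // start of the selected time range
  rangeEndMs: number; // end of the selected time range
  periodMs: number; // metricset.period in milliseconds
  bucketsFound: number; // buckets containing at least one document for the node
}

interface BucketCoverageResult {
  expectedBuckets: number; // buckets we would see if the node reported for the whole range
  uptimeMs: number; // estimated time the node was up inside the range
}

function estimateUptimeFromBuckets({
  rangeStartMs,
  rangeEndMs,
  periodMs,
  bucketsFound,
}: BucketCoverageInput): BucketCoverageResult {
  const rangeMs = rangeEndMs - rangeStartMs;
  if (rangeMs <= 0 || periodMs <= 0) {
    return { expectedBuckets: 0, uptimeMs: 0 };
  }
  const expectedBuckets = Math.floor(rangeMs / periodMs);
  // Assumption: each populated bucket stands for one full reporting period.
  // Cap at the range length so late or duplicate documents cannot exceed it.
  const uptimeMs = Math.min(Math.max(bucketsFound, 0) * periodMs, rangeMs);
  return { expectedBuckets, uptimeMs };
}
```

Comparing `bucketsFound` with `expectedBuckets` is what would tell us whether a node stopped reporting partway through the range. This only estimates uptime inside the selected window; it says nothing about how long the node ran before the window started.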
Closing: uptime is being deprecated, I believe, @drewpost? |
@roshan-elastic this isn't talking about the Uptime app but about how long a container or pod has been "up". We removed that column from the table and I don't think the underlying issue has been solved. Since we're not showing these values in the UI at this time I don't think we need to solve this so it's ok to remain closed. |
Ah thanks for the clarification @smith - I didn't realise. |
Stack version: 8.4.0-SNAPSHOT
The infrastructure view in the APM UI lists container metrics for a service like this:
The first issue with uptime here is that containers that were started within the timepicker window seem to be assumed to still be running. For example, container ebc4507ff1bd2ff800029c347f07d2b18d7f54179f2203d75b8e1cc7708c4078 was indeed started 20 hours before loading this page, but it terminated 10 seconds later, entered crash loop backoff for 5 min, and then was never heard from again. We know this because the metrics captured it. Query and results:
returns
The second issue is also illustrated here for containers that are still running. The uptime displayed is relative to now, rather than to the times selected in the date picker. @alex-fedotyev inspired this nice visual to demonstrate the issue:
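As a rough illustration of the behaviour both issues point towards, the sketch below measures uptime up to the end of the selected time range rather than up to now, and treats a container as stopped once it has gone quiet. It is not the existing implementation: the function `uptimeAtRangeEndMs`, the field names, and the `staleAfterMs` threshold are all hypothetical, and it assumes a start time and the timestamp of the latest metrics document are available for each container.

```typescript
// Hypothetical shapes: these names are illustrative, not Kibana APIs.
interface ContainerObservation {
  startedAtMs: number; // container start time
  lastSeenMs: number; // timestamp of the latest metrics document for the container
}

interface SelectedTimeRange {
  startMs: number;
  endMs: number;
}

function uptimeAtRangeEndMs(
  obs: ContainerObservation,
  range: SelectedTimeRange,
  staleAfterMs: number // assumed threshold after which silence means "stopped"
): number {
  // Ignore anything reported after the selected range.
  const lastSeenInRange = Math.min(obs.lastSeenMs, range.endMs);
  // First issue: a container that went quiet well before the range end
  // is treated as stopped at its last report, not as still running.
  const looksStopped = range.endMs - lastSeenInRange > staleAfterMs;
  // Second issue: measure up to the range end, never up to the current time.
  const effectiveEndMs = looksStopped ? lastSeenInRange : range.endMs;
  return Math.max(0, effectiveEndMs - obs.startedAtMs);
}
```

Using the last report as the stop time is only an approximation. As noted in the comments, a real fix probably needs an explicit stop time collected at ingest.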