Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics: basic start unit job reporting #265

Merged
merged 1 commit into from
May 30, 2023
Merged

Conversation

mkemel
Copy link
Member

@mkemel mkemel commented May 2, 2023

This change allows the user to enable/disable metrics, and then if metrics are enabled - on every StartUnit job hirte manager will measure gross method execution time, and also will fetch unit start net time from properties - and report them both via StartUnitJobMetrics signal.
In addition, the hirte agent will measure systemd job net time and report it via AgentJobMetrics signal to the hirte manager, which would relay it to the user.

Sample hirtectl output

[mkemel@lptp ~]$ sudo hirtectl metrics enable
Done
[mkemel@lptp ~]$ sudo hirtectl metrics listen
Waiting for metrics signals...
[laptop] Agent systemd StopUnit job on httpd.service net measured time: 1.6ms
[laptop] Agent systemd StartUnit job on httpd.service net measured time: 76.1ms
[laptop] Start job /org/containers/hirte/job/2 on unit httpd.service:
	Hirte job gross measured time: 76.4ms
	Unit net start time (from properties): 55.5ms

Sample gdbus output

[mkemel@lptp ~]$ sudo gdbus monitor --system --dest org.containers.hirte --object-path /org/containers/hirte/metrics
Monitoring signals on object /org/containers/hirte/metrics owned by org.containers.hirte
The name org.containers.hirte is owned by :1.1417
/org/containers/hirte/metrics: org.containers.hirte.Metrics.StartUnitJobMetrics ('/org/containers/hirte/job/14', uint64 55, uint64 46, uint64 9)

src/libhirte/common/util.c Outdated Show resolved Hide resolved
src/manager/metrics.c Outdated Show resolved Hide resolved
src/libhirte/common/util.c Outdated Show resolved Hide resolved
src/libhirte/common/util.c Outdated Show resolved Hide resolved
src/manager/manager.c Outdated Show resolved Hide resolved
src/client/client.c Show resolved Hide resolved
src/client/client.h Outdated Show resolved Hide resolved
src/manager/node.c Outdated Show resolved Hide resolved
src/client/client.c Outdated Show resolved Hide resolved
src/client/client.c Outdated Show resolved Hide resolved
@mkemel mkemel force-pushed the metrics branch 4 times, most recently from b1bd76d to a4360f6 Compare May 24, 2023 08:44
@dougsland
Copy link
Contributor

/cc @dougsland

Copy link
Member

@engelmi engelmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Only got a few minor comments.

src/libhirte/common/time-util.c Outdated Show resolved Hide resolved
src/manager/metrics.c Outdated Show resolved Hide resolved
src/manager/metrics.c Outdated Show resolved Hide resolved
src/manager/metrics.c Outdated Show resolved Hide resolved
src/client/client.c Show resolved Hide resolved
src/agent/agent.c Show resolved Hide resolved
src/agent/agent.c Show resolved Hide resolved
data/org.containers.hirte.Metrics.xml Outdated Show resolved Hide resolved
src/libhirte/common/time-util.c Outdated Show resolved Hide resolved
src/client/client.c Outdated Show resolved Hide resolved
@engelmi
Copy link
Member

engelmi commented May 26, 2023

After manually testing a bit I found two interesting aspects:

  1. On two subsequent starts of a unit, the times for both - hirte total and hirte agent - are much lower than the time from the systemd properties on the second call. Which makes sense since on the second call the unit is already running and we only extract the systemd properties for the net time (and those don't change). Example:
[laptop] Agent systemd StartUnit job on cow.service net measured time: 87.3ms
[laptop] Start job /org/containers/hirte/job/8 on unit cow.service:
        Hirte job gross measured time: 89.2ms
        Unit net start time (from properties): 46.4ms
[laptop] Agent systemd StartUnit job on cow.service net measured time: 1.2ms
[laptop] Start job /org/containers/hirte/job/9 on unit cow.service:
        Hirte job gross measured time: 3.6ms
        Unit net start time (from properties): 46.4ms
  1. Currently, we only track and listen for metrics of units that have explicitly been started/stopped/etc. by hirte, e.g. via hirtectl. This means for the proxy service we only get an overall measurement which includes the time it took to start the proxy services. In the following example, requesting.service depended on the cow.service:
[pi] Agent systemd StartUnit job on requesting.service net measured time: 153.4ms
[pi] Start job /org/containers/hirte/job/1 on unit requesting.service:
        Hirte job gross measured time: 178.7ms
        Unit net start time (from properties): 0.0ms

I think it would be great to have a more detailed view - which can be done in a separate PR, of course. What do you think? @pypingou @mkemel

@mkemel
Copy link
Member Author

mkemel commented May 29, 2023

@engelmi Thanks :)

  1. Yeah, this is known and expected behavior. There are two ways to improve it: (a) check unit status and do not report if it is already started (extra action) (b) do not report if props time is larger than measured time (basically do not report inconsistent data only because it's inconsistent). We don't want to do neither this nor that. That's why I added "from properties". I think that user that knows what he's doing can live with that :)
  2. Regarding the 0.0ms that you're getting - this is because you are using a simple unit that does sleep. These do not update the state entry/exit time in properties. If you run a container or any other real service - it shows correctly.
    As for more detailed data on time to start unit with dependencies - yeah, this is what I want to do next.

This change allows the user to enable/disable metrics, and then if
metrics are enabled - on every StartUnit job hirte manager will fetch
unit start net time from properties, and subtract it from measured
job time and report all the data via signal

Signed-off-by: Mark Kemel <mkemel@redhat.com>
@engelmi
Copy link
Member

engelmi commented May 30, 2023

1. Yeah, this is known and expected behavior. There are two ways to improve it: (a) check unit status and do not report if it is already started (extra action) (b) do not report if props time is larger than measured time (basically do not report inconsistent data only because it's inconsistent). We don't want to do neither this nor that. That's why I added "from properties". I think that user that knows what he's doing can live with that :)

Yes, it is definitely fine like it is now :) I just wanted to mention it here so we are aware and know the cause - since others that are (not yet) familiar with it will likely ask.

2. Regarding the 0.0ms that you're getting - this is because you are using a simple unit that does `sleep`. These do not update the state entry/exit time in properties. If you run a container or any other real service - it shows correctly.
   As for more detailed data on time to start unit with dependencies - yeah, this is what I want to do next.

Sounds good! But before starting, we should probably talk about it in our sync meeting.
And yes, the requesting.service had just a simple sleep and called for the cow.service which then started an apache... but the 0.0ms wasn't part of my question, it was just by chance like this 😁

Copy link
Member

@engelmi engelmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants